A core assumption of DITTO lies in sample efficiency coming from demonstrations. In theory, a user could achieve similar performance by labeling many pairwise preferences with an ideal set of demonstrations in mind. As a preliminary approximation, one author provided demonstrations for the user study and also annotated 500 preference pairs using outputs sampled from the instruction-following Mistral 7B (demonstrations in Appendix E.4). Altogether, we constructed a pairwise preferences dataset D_pref = {(x, y_i, y_j)}, where y_i ≻ y_j. We then computed win rates between 20 pairs sampled from Mistral trained on (a) 4 demonstrations with DITTO, and (b) on {0...500} preference pairs with just DPO. When we sample pairwise preferences from π_ref alone, we observe that generated pairs are out-of-distribution relative to the demonstrations: pairwise preferences do not reach a user's demonstrated behavior (results in Fig. 3: "Base policy," in blue). Even when we fine-tune π_ref on the user's demonstrations, we still need > 500 preferences to match DITTO performance (Fig. 3: "Demo-finetuned policy," in orange).
In short, DITTO is sample-efficient because it learns from a handful of high-reward expert demonstrations. Pairwise preference labeling can in principle reach the same behavior, but in this experiment it needed more than 500 annotated pairs to match what DITTO achieved with 4 demonstrations.
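To make the comparison concrete, the following is a minimal sketch of how a preference dataset of the form (x, y_i, y_j) and a win rate over paired samples might be represented. The names `PreferencePair`, `win_rate`, and `toy_judge` are hypothetical and introduced only for illustration; the judge here is a placeholder (it prefers the longer string), not the evaluation procedure used in the study.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One element (x, y_i, y_j) of a preference dataset, with y_i preferred over y_j."""
    prompt: str    # x
    chosen: str    # y_i (preferred output)
    rejected: str  # y_j (dispreferred output)


def win_rate(
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    prefers_a: Callable[[str, str], bool],
) -> float:
    """Fraction of paired comparisons in which system A's output is preferred."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both systems must provide the same number of outputs")
    if not outputs_a:
        raise ValueError("at least one pair is required")
    wins = sum(1 for a, b in zip(outputs_a, outputs_b) if prefers_a(a, b))
    return wins / len(outputs_a)


def toy_judge(a: str, b: str) -> bool:
    """Placeholder judge (an assumption for this sketch): prefer the longer output."""
    return len(a) > len(b)


# Hypothetical outputs from two trained policies on the same two prompts.
outputs_ditto = ["a longer reply", "short"]
outputs_dpo = ["brief", "a much longer reply"]
rate = win_rate(outputs_ditto, outputs_dpo, toy_judge)
```

In the setting described above, `outputs_a` and `outputs_b` would each hold 20 samples, one list from the DITTO-trained policy and one from a DPO policy trained on a given number of preference pairs, and the judge would be replaced by whatever preference evaluation is actually used.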