15 Matching Annotations
  1. Jun 2024
    1. A core assumption of DITTO lies in sample efficiency coming from demonstrations. In theory, a user could achieve similar performance by labeling many pairwise preferences with an ideal set of demonstrations in mind. As a preliminary approximation, one author provided demonstrations for the user study and also annotated 500 preference pairs using outputs sampled from the instruction-following Mistral 7B (demonstrations in Appendix E.4). Altogether, we constructed a pairwise preference dataset Dpref = {(x, yi, yj)}, where yi ≻ yj. We then computed win rates between 20 pairs sampled from Mistral trained on (a) 4 demonstrations with DITTO, and (b) on {0...500} preference pairs with just DPO. When we sample pairwise preferences from πref alone, we observe that generated pairs are out-of-distribution relative to the demonstrations—pairwise preferences do not reach a user’s demonstrated behavior (results in Fig. 3: “Base policy,” in blue). Even when we fine-tune πref on the user’s demonstrations, we still need > 500 preferences to match DITTO performance (Fig. 3: “Demo-finetuned policy,” in orange).

      DITTO works well because it uses just a few examples from (high reward) experts to train the model efficiently. You could also train the model by comparing many pairs of outputs and choosing the better ones, but this takes much more time and effort.

    2. Averaged across all authors, DITTO outperforms all baselines, with an average 77.09% win rate across both CMCC (71.67%) and CCAT50 (82.50%). On CCAT50, DITTO outperforms all baselines across authors but one. On CMCC, DITTO outperforms all other baselines for 5/10 authors, followed by few-shot prompting for 3/10. While SFT serves as a strong baseline (56.78% on CMCC, 73.89% on CCAT), DITTO provides an average ↑11.7% pt. win-rate improvement compared to SFT alone. Prompted baselines also lag far behind DITTO, especially zero-shot (including closed-source) models (avg. ↓54.4% pt. decrease on Mistral, ↓51.5% pt. on GPT-4). While zero-shot GPT-4 is already finetuned using RLHF, we suspect that this training feedback differs significantly from that of authors in both CMCC and CCAT50. Adding few-shot examples to the prompt does help: win rates for few-shot prompting increase compared to zero-shot for both Mistral (↑20.94% pt.) and GPT-4 (↑22.95% pt.) based LLMs. However, including few-shot examples still falls behind applying DITTO (avg. ↓37.35% pt. decrease for Mistral; ↓26.99% pt. for GPT-4). We suspect the underlying RLHF priors for out-of-the-box LLMs are fairly strong. Qualitatively, few-shot generations still sound GPT-generated relative to DITTO (Table 6 in Appendix).
      • DITTO wins 77% of the time on average on email- and article-writing tasks, improves on supervised fine-tuning (SFT) by roughly 12 percentage points, and surpasses few-shot and zero-shot prompting on both Mistral and GPT-4.

      • Few-shot prompting improves results but still sounds more like standard GPT outputs, while DITTO’s outputs are better aligned with users' style.

    3. This means that unlike synthetic data generation paradigms [26], DITTO does not require a model that performs well at the given task a priori.

      Tuning with synthetic data generation requires a model that can already produce usable synthetic data; this alignment method requires no such pre-existing model.

    4. We also find that ablating components of DITTO results in reduced performance (Table 3). If we sample all negatives at the start—instead of iteratively resampling in an online fashion—we observe that win rates compared to using DITTO drop from 70.1% to 57.3%. While iteratively re-sampling improves performance, continuously updating πref during this online process can significantly degrade performance: win rates drop from 70.1% to 45.8%. We suspect updating πref results in potential overfitting. Finally, both replay and inter-policy comparisons help DITTO. Removing replay and inter-policy comparisons reduces win rates from DITTO by 6.5 and 2 points respectively.
      • Sampling all negatives at once drops win rates from 70.1% to 57.3%.
      • Continuously updating the reference policy (πref) reduces win rates further to 45.8%, likely due to overfitting.
      • Removing replay and inter-policy comparisons decreases performance by 6.5 and 2 points, respectively.
    5. Another limitation involves DITTO speed: DITTO is slower than training-free approaches (prompting) and SFT (15 minutes with DITTO vs. 2 minutes with SFT on 7 demonstrations). A bottleneck lies in sampling, though we suspect a mix of prior (e.g., vLLM [25]) and future work in LLM inference optimization can improve DITTO’s speed. Finally, DITTO is uninterpretable. It is unclear exactly what a model learns after several iterations: do values shift too, or is it just style? We also suspect that forgetting may affect DITTO. Even with LoRA, models DITTO-ed on writing sometimes refuse to generate code. Related work on overgeneralization might mitigate these effects [40].

      DITTO's limitations include potential biases in GPT-based evaluations, slower training than prompting or SFT, limited interpretability of what the model actually learns across iterations, and possible forgetting of prior capabilities (e.g., refusing to generate code after being tuned on writing).

    6. At the first iteration, let the initial policy be π0. We can sample from this policy to assemble a dataset D0 = {(x, yπ0)}. Then, we can generate comparison data for RLHF as yE ⪰ yπ0, which we denote as DE ⪰ D0 for brevity. Using these induced comparisons, we update π0 to obtain a new policy π1. By definition, EπE[r(x, y)] ≥ Eπ1[r(x, y)] as well. It follows that we can also generate comparisons using π1 as DE ⪰ D1. Continuing this procedure, we generate a progressively more diverse comparison dataset using all prior policies. We refer to these as “replay” comparisons. While this approach is theoretically consistent, it decreases the likelihood of the LM everywhere except at expert demonstrations. Though permissible in data-rich scenarios, this may also lead to overfitting with a small DE. However, if we assume that the policy improves at each iteration, i.e. Eπt+1[r(x, y)] ≥ Eπt[r(x, y)], then we can also consider comparisons between policies during the course of learning. Unlike comparisons with the expert, we do not guarantee that this holds; in practice, however, we found that models tended to improve with each iteration, perhaps owing to the convexity of both reward modeling and Eq. (1). This lets us sample comparisons between the complete ranking of policies DE ⪰ Dt ⪰ Dt−1 ⪰ ... ⪰ D1 ⪰ D0 (Eq. 2). The effect of adding these “intermodel” and “replay” comparisons is that the likelihoods of earlier samples (e.g., those in D1) are pushed down more than those of later samples (e.g., those in Dt), smoothing the implicit reward landscape.

      New comparisons are made not only between the user examples and the latest model outputs but also between outputs from different stages of the model's training. This helps the model learn progressively and avoid overfitting.
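      As a concrete illustration, here is a minimal sketch of how these replay and inter-policy comparisons could be assembled from stored checkpoint samples. The data layout, alignment-by-prompt assumption, and function name are illustrative, not the paper's implementation.

      ```python
      # Sketch: build DITTO-style comparison pairs from expert demos and from
      # samples saved at each policy checkpoint (D_0, D_1, ..., D_t).
      # Assumes demos and checkpoint samples are aligned by prompt index.
      from itertools import combinations

      def build_comparisons(expert_demos, checkpoint_samples):
          """
          expert_demos:       list of (prompt, completion) pairs from the user (D_E)
          checkpoint_samples: list indexed by iteration t; element t is a list of
                              (prompt, completion) pairs sampled from policy pi_t (D_t)
          Returns (prompt, preferred, rejected) triples realizing the ranking
          D_E >= D_t >= ... >= D_1 >= D_0 of Eq. (2).
          """
          pairs = []

          # "Replay" comparisons: the expert demonstration beats samples from
          # every checkpoint so far, not just the most recent one.
          for samples in checkpoint_samples:
              for (prompt, demo), (_, sample) in zip(expert_demos, samples):
                  pairs.append((prompt, demo, sample))

          # "Inter-policy" comparisons: assuming the policy improves each
          # iteration, a later checkpoint's sample beats an earlier checkpoint's.
          for earlier, later in combinations(range(len(checkpoint_samples)), 2):
              for (_, old), (prompt, new) in zip(checkpoint_samples[earlier],
                                                 checkpoint_samples[later]):
                  pairs.append((prompt, new, old))

          return pairs
      ```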

    7. Though such comparisons are derived from policies instead of individual examples, they have proven effective in prior work [6]. A naïve approach for DITTO would then optimize Eq. (1) using this dataset and an off-the-shelf RLHF algorithm. Doing so would increase the probability of the expert responses while decreasing the probability of the current model samples, unlike standard finetuning which only does the former.

      Demonstration-based alignment actively pushes down the probability of the model's own non-expert responses, something standard supervised finetuning does not do.
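      A hedged sketch of what optimizing one such induced comparison with an off-the-shelf DPO-style objective might look like; the logprob helper, function name, and beta value are illustrative assumptions rather than the paper's code.

      ```python
      # Sketch: DPO-style loss on one induced comparison, where the expert
      # demonstration is preferred over a sample from the current policy.
      # `logprob(model, prompt, completion)` is an assumed helper returning the
      # summed token log-probability of `completion` given `prompt`.
      import torch.nn.functional as F

      def ditto_pair_loss(policy, reference, prompt, expert_y, sampled_y, beta=0.1):
          # Log-probabilities under the trainable policy and the frozen reference.
          pi_expert = logprob(policy, prompt, expert_y)
          pi_sample = logprob(policy, prompt, sampled_y)
          ref_expert = logprob(reference, prompt, expert_y)
          ref_sample = logprob(reference, prompt, sampled_y)

          # Implicit reward margin between the preferred (expert) completion and
          # the rejected (model-sampled) one.
          margin = beta * ((pi_expert - ref_expert) - (pi_sample - ref_sample))

          # Minimizing this raises the expert response's likelihood and lowers
          # the model's own sample, unlike SFT which only does the former.
          return -F.logsigmoid(margin)
      ```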

    8. While this objective is ubiquitous in prior work [32, 34], it is typically applied in the context of population-based reward functions learned from large comparison datasets collected via a multitude of annotators. In contrast, we consider r(x, y) to be the objective of a single individual. In this regime, collecting thousands of comparisons from one user is infeasible. Instead, we assume access to a small dataset of expert demonstrations, denoted DE.

      Instead of learning a reward from a large population of annotators' comparisons, comparison data is created from the differences between LLM outputs and demonstrations by a single expert, whose behavior is assumed to have high reward.

    9. We find that win rates for DITTO outperform methods like SFT (avg. 11% pt. increase), self-play methods like SPIN (20.2% pt.), and few-shot prompting (33.4% pt.) on Mistral 7B—even when few-shot prompts are provided to a more powerful LLM (GPT-4, 18% pt.).

      Alignment from demonstrations outperforms the prevalent alternatives (SFT, self-play with SPIN, and few-shot prompting), even when prompting a stronger model like GPT-4.

    10. DITTO can be interpreted as an online imitation learning algorithm, where data sampled from the LLM is used to distinguish expert behavior.

      It is online in the sense that the LLM's own intermediate outputs are repeatedly sampled and compared against the expert demonstrations.

    11. we can achieve strong alignment with individuals by leveraging a small number of user-provided examples of desired behavior.

      As opposed to the large number of examples typically required for fine-tuning.

    12. How might we efficiently communicate preferences and align a language model to a new individual or task?

      Tuning for even a small task can require hundreds or thousands of examples.

    13. LLM outputs feel unopinionated and generic because of this mismatch.

      Before any tuning, they are trained to handle anything for anyone, so their outputs default to the generic.

    14. Across our benchmarks and user study, we find that win rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19% points.

      This alignment method produces better results than prior approaches, whether they fine-tune the model (SFT, self-play) or use no tuning at all (few-shot prompting).

    15. DITTO cheaply generates online comparison data by treating users’ demonstrations as preferred over output from the LLM and its intermediate checkpoints.

      DITTO takes user examples and treats them as better than what the model generates. It uses these comparisons to help the model learn and improve.
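      Putting the pieces together, here is a compact sketch of the overall loop this describes, reusing the build_comparisons and pairwise loss functions sketched above; the sample and dpo_update helpers and the iteration count are assumptions for illustration, not the authors' released code.

      ```python
      # Sketch of the overall DITTO-style loop: sample from the current policy,
      # treat the user's demonstrations as preferred over those samples (and over
      # samples from earlier checkpoints), then run a preference-optimization step.
      # `sample(policy, prompt)` and `dpo_update(policy, reference, pairs)` are
      # assumed helpers; the iteration count is illustrative.

      def ditto_loop(policy, reference, expert_demos, iterations=10):
          checkpoint_samples = []  # D_0, D_1, ..., D_t

          for _ in range(iterations):
              # Sample the current policy on the demonstration prompts (D_t).
              d_t = [(prompt, sample(policy, prompt)) for prompt, _ in expert_demos]
              checkpoint_samples.append(d_t)

              # Demonstrations beat every checkpoint's samples ("replay"), and
              # later checkpoints beat earlier ones ("inter-policy"), per Eq. (2).
              pairs = build_comparisons(expert_demos, checkpoint_samples)

              # One round of DPO-style updates against the frozen reference policy.
              policy = dpo_update(policy, reference, pairs)

          return policy
      ```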