11 Matching Annotations
  1. May 2022
    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
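
      The annotated text refers to Keras-style `TransformerEncoder` / `TransformerDecoder` layers; as a minimal sketch of the same wiring, here is the equivalent flow using PyTorch's layers of the same names (the layer sizes and random inputs are illustrative assumptions, not the source's setup):

      ```python
      import torch
      import torch.nn as nn

      d_model, nhead = 256, 8
      encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead), num_layers=2)
      decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead), num_layers=2)

      src = torch.randn(10, 1, d_model)  # embedded source sequence: (seq_len, batch, d_model)
      tgt = torch.randn(5, 1, d_model)   # embedded target words 0..N produced so far

      memory = encoder(src)        # new representation of the source
      out = decoder(tgt, memory)   # decoder combines memory with the target so far
      # a linear projection of `out` would give vocabulary logits for word N+1
      ```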
  2. Dec 2021
    1. The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.
  3. Nov 2021
    1. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.

      Finally
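
      A small numpy sketch of that interpretation (the dimensions and random vectors are illustrative): each Key is scored against the Query to measure that word's relevance, and the scores become softmax weights over the Values.

      ```python
      import numpy as np

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      d = 4
      q = np.random.randn(d)       # Query: the word we compute attention for
      K = np.random.randn(3, d)    # Keys: one per word we might attend to
      V = np.random.randn(3, d)    # Values: the content of those same words

      scores = K @ q / np.sqrt(d)  # relevance of each word to the Query word
      weights = softmax(scores)    # attention weights, summing to 1
      attended = weights @ V       # relevance-weighted mix of the Values
      ```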

    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
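
      A minimal sketch of the logit-lens idea using the HuggingFace transformers GPT-2 model (the prompt is an arbitrary example): apply the model's final layer norm and unembedding to each layer's hidden state to read off what it "believes" the next token is at that depth.

      ```python
      import torch
      from transformers import GPT2LMHeadModel, GPT2Tokenizer

      tok = GPT2Tokenizer.from_pretrained("gpt2")
      model = GPT2LMHeadModel.from_pretrained("gpt2")

      ids = tok("The capital of France is", return_tensors="pt").input_ids
      with torch.no_grad():
          out = model(ids, output_hidden_states=True)

      # Decode the top next-token prediction after each layer
      for layer, h in enumerate(out.hidden_states):
          logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
          print(layer, repr(tok.decode(logits.argmax().item())))
      ```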
    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
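
      A toy numpy illustration of that definition (the weights are hand-picked rather than computed): elements with larger attention weights contribute more to the resulting hidden state.

      ```python
      import numpy as np

      inputs = np.array([[1.0, 0.0],   # element A
                         [0.0, 1.0],   # element B
                         [1.0, 1.0]])  # element C

      weights = np.array([0.1, 0.2, 0.7])  # mostly attend to element C

      hidden = weights @ inputs  # downstream hidden state is dominated by C
      print(hidden)              # [0.8 0.9]
      ```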
  4. Aug 2021
    1. So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that are learned during training.
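
      A minimal numpy sketch of those projections (the weight matrices are random stand-ins for trained parameters, and the sizes are illustrative assumptions):

      ```python
      import numpy as np

      d_embed, d_head = 8, 4
      rng = np.random.default_rng(0)

      # Stand-ins for the three matrices learned during training
      W_Q = rng.standard_normal((d_embed, d_head))
      W_K = rng.standard_normal((d_embed, d_head))
      W_V = rng.standard_normal((d_embed, d_head))

      x = rng.standard_normal(d_embed)     # one word's embedding

      q, k, v = x @ W_Q, x @ W_K, x @ W_V  # that word's Query, Key, Value
      ```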
  5. Jan 2021
  6. May 2020