7 Matching Annotations
  1. Last 7 days
    1. we lose an essential piece of information – the tokens’ relative positions

      Hence we add positional encodings!
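
      A minimal numpy sketch of the sinusoidal scheme from the original Transformer paper (one common choice of encoding; the function name and toy sizes here are mine):

      ```python
      import numpy as np

      def sinusoidal_positional_encoding(max_len, d_model):
          # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
          # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
          pos = np.arange(max_len)[:, None]            # (max_len, 1)
          i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
          angles = pos / np.power(10000.0, 2 * i / d_model)
          pe = np.zeros((max_len, d_model))
          pe[:, 0::2] = np.sin(angles)                 # even dimensions
          pe[:, 1::2] = np.cos(angles)                 # odd dimensions
          return pe

      pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
      # input to the first layer: token_embeddings + pe[:sequence_length]
      ```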

    2. meaning of words depends on the context they appear in.

      This is the key difference between word2vec/skip-gram models and transformers: static vs. dynamic embeddings. The former generate one embedding for a word, regardless of the context it was used in. The latter generate a dynamic embedding for a word, since attention is applied while training the encoder that produces the embeddings.

      In more detail: with a static embedding there is only one vector for 'bank', effectively a weighted average of bank the financial institution and bank the land beside a river.

      Q/A: The input embedding matrix of a transformer holds static embeddings, just like word2vec. How can I then get the contextual embedding of a specific word given a specific sentence?

      Static embeddings use context for training and a lookup table for inference. Contextual embeddings use context in both training and inference.

      The initial layer still uses static embeddings, but the self-attention mechanism then generates a context-aware embedding for the word by looking at all the other words in the sentence. So when a different sentence uses the same word, the dynamic embedding changes, since the attention values for that word differ across sentences. (Toy sketch below.)
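
      A toy numpy sketch of that point (random weights and made-up token vectors, purely illustrative, not the article's implementation): the static vector for 'bank' is identical in both sentences, but its vector after one self-attention layer is not.

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      def self_attention(X, Wq, Wk, Wv):
          Q, K, V = X @ Wq, X @ Wk, X @ Wv
          weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
          return weights @ V

      rng = np.random.default_rng(0)
      d = 8
      Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
      # one static vector per word, as in the lookup table
      emb = {w: rng.normal(size=d) for w in ["river", "bank", "robbed"]}

      sent1 = np.stack([emb["river"], emb["bank"]])    # "river bank"
      sent2 = np.stack([emb["bank"], emb["robbed"]])   # "bank robbed"
      ctx1 = self_attention(sent1, Wq, Wk, Wv)[1]      # contextual 'bank' #1
      ctx2 = self_attention(sent2, Wq, Wk, Wv)[0]      # contextual 'bank' #2
      print(np.allclose(ctx1, ctx2))  # False: same static row, different context
      ```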

    1. basically it would be the length of the longest sentence in our training dataset

      Why? Because the size of this list is the hyperparameter, not the size of the embedding.

      The complexity of self-attention is O(n^2), because we multiply n queries (at n time steps) with n keys (K_i, where i <= n). You don't multiply Q with every K_i in the corpus, just the Ks in the context window, which is precisely the hyperparameter the author mentions: the size of this list, i.e. the length of the longest sentence in the training dataset.
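
      The n^2 comes from the (n, n) score matrix; a two-line sketch with assumed toy sizes:

      ```python
      import numpy as np

      n, d = 6, 8                 # n tokens in the window, d-dim queries/keys
      Q, K = np.random.randn(n, d), np.random.randn(n, d)
      scores = Q @ K.T            # (n, n): every query scored against every key
      print(scores.shape)         # (6, 6) -> n^2 pairwise scores, hence O(n^2)
      ```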

  2. Aug 2024
    1. one attention head is focusing most on "the animal", while another is focusing on "tired"

      The 8 colored boxes are the 8 attention heads.
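
      A sketch of why heads can specialize (toy numpy, dimensions assumed): each head gets its own projections and therefore its own (n, n) attention pattern, so one head can weight "the animal" highly while another weights "tired".

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      n, d_model, n_heads = 5, 16, 8
      d_head = d_model // n_heads            # each head works in a smaller subspace
      X = np.random.randn(n, d_model)
      Wq = np.random.randn(d_model, d_model)
      Wk = np.random.randn(d_model, d_model)

      def split_heads(M):                    # (n, d_model) -> (heads, n, d_head)
          return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

      Q, K = split_heads(X @ Wq), split_heads(X @ Wk)
      attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
      # attn[h] is head h's own attention pattern; with 8 heads you get the
      # 8 differently-colored score maps in the figure
      ```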

    2. multiply each value vector by the softmax score (in preparation to sum them up)

      This makes me question what exactly the K vector for a word is. How different is it from the V vector? It looks like K is used to compute a weight for V. Can K and V have the same values? I am trying to relate K, V to the database analogy mentioned here: https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html
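
      One way to answer my own question (toy numbers, not from the article): keys decide how much each item contributes, values decide what it contributes, and K = V is perfectly legal, it just couples the two roles.

      ```python
      import numpy as np

      q = np.array([1.0, 0.0])                 # query
      K = np.array([[0.9, 0.1],                # keys: matched against q by dot product
                    [0.0, 1.0]])
      V = np.array([[10.0, 0.0],               # values: the content that gets mixed
                    [0.0, 10.0]])

      w = np.exp(K @ q)
      w /= w.sum()                             # softmax over the q . k_i similarities
      out = w @ V                              # weighted average of the values
      print(w, out)
      # Setting V = K works too; learning separate W_k and W_v projections just
      # lets the model decouple "how findable a word is" from "what it contributes".
      ```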

    3. with the key vector of the respective word we’re scoring.

      Read up to equation 11.1.3: https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html

      Now that we have a database consisting of (k, v) pairs, one way of calculating this score is to compute the similarity between a given query and all the keys. The dot product does exactly that.
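
      For reference, the attention pooling around that equation can be written as (my paraphrase of the d2l notation, not a verbatim copy):

      ```latex
      % database D = {(k_1, v_1), ..., (k_m, v_m)}, query q
      \mathrm{Attention}(q, D) = \sum_{i=1}^{m} \alpha(q, k_i)\, v_i,
      \qquad
      \alpha(q, k_i) = \frac{\exp(q \cdot k_i)}{\sum_{j=1}^{m} \exp(q \cdot k_j)}
      ```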