-
sthalles.github.io
-
we lose an essential piece of information – the tokens’ relative positions
Hence we add positional encodings!
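As a sketch of what gets added: the sinusoidal scheme from "Attention Is All You Need" (the dimensions here are illustrative, not from the article):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added element-wise to the static token embeddings, so two occurrences of
# the same token at different positions get different inputs:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
pe = sinusoidal_positional_encoding(10, 16)
```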
-
meaning of words depends on the context in which they appear.
This is what distinguishes word2vec/skip-gram models from transformers: static vs. dynamic embeddings. The former generate one embedding per word, regardless of the context it was used in. The latter generate a dynamic embedding for each word, since attention is applied while training the encoder that produces the embeddings.
In more detail: in the static case there is only one vector for 'bank': effectively a weighted average of bank the financial institution and bank the land next to a river.
Q/A The input embedding matrix of a transformer holds static embeddings, just like word2vec. How can I then get the contextual embedding of a specific word in a specific sentence?
Static embeddings use context for training and a lookup table for inference. Contextual embeddings use context in training and inference.
The initial layers still use static embeddings, but then the self-attention mechanism generates a context-aware embedding for the word by looking at all other words in the sentence. So when the same word appears in a different sentence, its dynamic embedding changes, since the attention values for the word differ across sentences.
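A toy sketch of that answer (random made-up embeddings and identity projections, just to show the mechanism): the same static vector for 'bank' goes in, but its attention-weighted output differs per sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "money", "near", "holds"]
d = 8
# Static lookup table (word2vec-style): one fixed vector per word.
E = {w: rng.normal(size=d) for w in vocab}

def self_attention_layer(sentence):
    """One self-attention pass: each word's output is an
    attention-weighted average of all words in the sentence."""
    X = np.stack([E[w] for w in sentence])   # (n, d) static embeddings in
    scores = X @ X.T / np.sqrt(d)            # (n, n) pairwise similarities
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X                       # (n, d) contextual embeddings out

out1 = self_attention_layer(["the", "bank", "holds", "money"])
out2 = self_attention_layer(["the", "bank", "near", "river"])
# 'bank' (index 1) comes out different in each sentence, even though the
# same static vector E["bank"] went in both times.
```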
-
-
nlp.seas.harvard.edu
-
class Generator(nn.Module):
Generates output tokens
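In the Annotated Transformer, the Generator is the final linear projection plus log-softmax that turns the decoder's d_model-sized output into log-probabilities over the vocabulary. A minimal NumPy sketch of that step (W, b are hypothetical stand-ins for the learned nn.Linear parameters):

```python
import numpy as np

def generator(x, W, b):
    """Final generation step: project the decoder output (d_model,) to
    vocab-sized logits, then log-softmax into log-probabilities."""
    logits = x @ W + b                        # (vocab_size,)
    logits -= logits.max()                    # for numerical stability
    return logits - np.log(np.exp(logits).sum())

rng = np.random.default_rng(1)
d_model, vocab_size = 4, 6                    # toy sizes, not from the post
W = rng.normal(size=(d_model, vocab_size))
b = np.zeros(vocab_size)
log_probs = generator(rng.normal(size=d_model), W, b)
# Exponentiating the log-probabilities gives a distribution over tokens.
```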
-
-
jalammar.github.io
-
basically it would be the length of the longest sentence in our training dataset
Why? Because the size of this list is the hyperparameter, not the size of the embedding.
Complexity of self-attention is O(n^2), because we multiply n queries (one per time step) with n keys (K_i, i <= n). We don't multiply Q with every K_i, only with the keys in the context window, which is precisely the hyperparameter the author mentions: the size of this list, i.e., the length of the longest sentence in the training dataset.
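The n^2 shows up directly in the score matrix: every query is scored against every key (toy sizes below are assumptions):

```python
import numpy as np

n, d_k = 5, 8                       # n tokens, key dimension d_k
rng = np.random.default_rng(2)
Q = rng.normal(size=(n, d_k))       # one query per token
K = rng.normal(size=(n, d_k))       # one key per token
scores = Q @ K.T                    # every query against every key
# scores has n * n entries -> O(n^2) time and memory in sequence length
```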
-
- Aug 2024
-
jalammar.github.io
-
one attention head is focusing most on "the animal", while another is focusing on "tired"
The 8 colored boxes are the 8 attention heads.
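A sketch of why heads can focus on different words: the representation is split into h independent slices, and each slice computes its own attention pattern (sizes are made up):

```python
import numpy as np

d_model, h = 64, 8                 # 8 heads, each of width d_k = 64 / 8
d_k = d_model // h
n = 4                              # tokens in the sentence
rng = np.random.default_rng(3)
X = rng.normal(size=(n, d_model))

# Split into h heads: (n, d_model) -> (h, n, d_k)
heads = X.reshape(n, h, d_k).transpose(1, 0, 2)

# Each head scores the same tokens independently, so one head can weight
# "the animal" heavily while another weights "tired".
per_head_scores = heads @ heads.transpose(0, 2, 1)   # (h, n, n)
```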
-
multiply each value vector by the softmax score (in preparation to sum them up)
This makes me question what exactly the K vector for a word is. How different is it from the V vector? It looks like K is used to compute a strength (weight) for V. Can K and V have the same values? I am trying to relate K and V to the database analogy mentioned here: https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html
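The "multiply each value vector by the softmax score, then sum" step, sketched for a single query position (the numbers are arbitrary):

```python
import numpy as np

def attention_output(scores, V):
    """Given raw scores for one query position (n,) and value vectors
    V (n, d_v): softmax the scores, scale each value, sum them up."""
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax
    return (weights[:, None] * V).sum(axis=0)         # weighted sum of values

V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = attention_output(np.array([2.0, 0.0, 0.0]), V)
# z leans toward the first value vector, whose score dominates the softmax
```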
-
with the key vector of the respective word we’re scoring.
Read until equation 11.1.3 https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html
Now that we have a database consisting of (k, v) pairs, one way of calculating this score is by computing the similarity between a given query and all the keys; the dot product simulates that.
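The d2l database analogy as a soft lookup, showing that keys and values need not be equal (K indexes the content, V is what gets returned; the vectors here are invented):

```python
import numpy as np

K = np.array([[1.0, 0.0], [0.0, 1.0]])        # two keys
V = np.array([[10.0, 0.0], [0.0, 10.0]])      # their associated values

def soft_lookup(q):
    """Score the query against all keys by dot product, softmax the
    scores, and return the score-weighted average of the values."""
    scores = K @ q
    w = np.exp(scores) / np.exp(scores).sum()
    return w @ V

result = soft_lookup(np.array([5.0, 0.0]))    # query resembles the first key
# result is dominated by the first key's value vector
```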
-