A Transformer is a neural network architecture introduced by Google researchers in the 2017 paper “Attention Is All You Need.” It revolutionized how machines understand and generate sequences.
Instead of processing words one at a time, transformers look at all words in a sequence simultaneously and use a mechanism called self-attention to understand how each word relates to every other word.
Self-Attention Mechanism
Each token “looks” at other tokens in the sentence and assigns attention weights — numbers that represent how important each word is to understanding the current one.
Example:
In the sentence “The patient who had pneumonia was discharged,”
the word “was” should pay more attention to “patient” than to “pneumonia,” because “patient” is its grammatical subject.
The self-attention mechanism captures this context automatically.
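To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the computation inside each attention layer. The projection matrices W_q, W_k, and W_v, the 8-dimensional embeddings, and the random toy input are illustrative assumptions, not the parameters of any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings.
    # W_q, W_k, W_v: projection matrices (learned in a real model, random here).
    Q = X @ W_q   # queries: what each token is looking for
    K = X @ W_k   # keys: what each token offers to others
    V = X @ W_v   # values: the content that gets mixed together
    d_k = Q.shape[-1]
    # Row i of `weights` holds the attention token i pays to every token.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

# Toy run: 7 tokens (e.g. "The patient who had pneumonia was discharged"),
# each embedded as an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)  # (7, 7): one row of attention weights per token
```

In a trained model the projections are learned, so the row of weights for “was” would concentrate on “patient” rather than on “pneumonia.”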
Stacked Layers
Many layers of self-attention and feed-forward networks are stacked.
Each layer learns increasingly abstract relationships — syntax, semantics, and even reasoning patterns.
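As a rough sketch of what stacking looks like in practice, the snippet below builds a small encoder from PyTorch's built-in layers; the sizes chosen here (64-dimensional embeddings, 4 attention heads, 6 layers) are arbitrary values for illustration, not a recommended configuration.

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + a feed-forward network.
layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, batch_first=True
)
# Stacking six identical layers lets later layers build on the
# relationships the earlier ones have already extracted.
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(1, 7, 64)    # batch of 1 sentence, 7 tokens, 64-dim embeddings
contextualized = encoder(tokens)  # same shape, but each vector now reflects its context
print(contextualized.shape)       # torch.Size([1, 7, 64])
```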