Linear attention mechanism
Softmax function - converts a vector of raw scores (logits) into a probability distribution over the classes (non-negative values that sum to 1). https://www.geeksforgeeks.org/deep-learning/the-role-of-softmax-in-neural-networks-detailed-explanation-and-applications/
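A minimal sketch of softmax in NumPy (function name and example values are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```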
Linear attention = an approximation of softmax attention: the exponential similarity is replaced with a linear dot product of kernel feature maps, so the sums over keys and values can be maintained incrementally, turning each step into a simple additive update of a running state. https://linear-transformers.com/ https://haileyschoelkopf.github.io/blog/2024/linear-attn/
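A minimal non-causal sketch in NumPy, assuming the elu(x) + 1 feature map used in the linear-transformers paper; shapes and names here are illustrative:

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1, which is strictly positive;
    # other positive kernel feature maps work in principle too.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    # Summaries over keys/values, computed once in O(n * d_k * d_v):
    S = phi_k.T @ V        # (d_k, d_v): sum of phi(k_j) v_j^T
    z = phi_k.sum(axis=0)  # (d_k,):     sum of phi(k_j)
    # Each query attends via a linear dot product; no n x n attention matrix.
    num = phi_q @ S            # (seq_len, d_v)
    den = phi_q @ z            # (seq_len,)
    return num / den[:, None]

Q, K, V = (np.random.randn(8, 4) for _ in range(3))
out = linear_attention(Q, K, V)  # same shape as V
```

In the causal (autoregressive) case, S and z become running sums updated one token at a time, which is exactly the additive per-step update the note above refers to.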
Also note that FlashAttention (a newer approach that is better in terms of memory) is briefly mentioned at the end of this paper and discussed more in the sequel paper.