Linear attention mechanism
Softmax function - converts a vector of raw scores (logits) into a probability distribution over the classes (non-negative values that sum to 1). https://www.geeksforgeeks.org/deep-learning/the-role-of-softmax-in-neural-networks-detailed-explanation-and-applications/
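A minimal sketch of softmax in NumPy (function name and example values are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```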
Linear attention = an approximation of softmax attention: the exponential similarity is replaced with a linear dot product of kernel feature maps, so the sums over keys and values can be maintained incrementally, turning each step into a simple additive update of a running state. https://linear-transformers.com/ https://haileyschoelkopf.github.io/blog/2024/linear-attn/
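A minimal non-causal sketch in NumPy, assuming the elu(x) + 1 feature map used in the linear-transformers paper; shapes and names here are illustrative:

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1, which is strictly positive;
    # other positive kernel feature maps work in principle too.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    # Summaries over keys/values, computed once in O(n * d_k * d_v):
    S = phi_k.T @ V        # (d_k, d_v): sum of phi(k_j) v_j^T
    z = phi_k.sum(axis=0)  # (d_k,):     sum of phi(k_j)
    # Each query attends via a linear dot product; no n x n attention matrix.
    num = phi_q @ S            # (seq_len, d_v)
    den = phi_q @ z            # (seq_len,)
    return num / den[:, None]

Q, K, V = (np.random.randn(8, 4) for _ in range(3))
out = linear_attention(Q, K, V)  # same shape as V
```

In the causal (autoregressive) case, S and z become running sums updated one token at a time, which is exactly the additive per-step update the note above refers to.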
Also note that FlashAttention (a newer approach that is better in terms of memory) is briefly mentioned at the end of this paper and discussed more in the sequel paper.