1 Matching Annotations
  1. Sep 2025
    1. linear attention mechanism

      Softmax Function - converts a vector of raw scores (logits) into a probability distribution over the classes; a minimal sketch follows below. https://www.geeksforgeeks.org/deep-learning/the-role-of-softmax-in-neural-networks-detailed-explanation-and-applications/
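
      A minimal NumPy sketch of softmax (my own illustration, not taken from the paper); subtracting the max is the standard numerical-stability trick and does not change the result:

      ```python
      import numpy as np

      def softmax(logits: np.ndarray) -> np.ndarray:
          # Subtract the max for numerical stability; softmax is
          # invariant to adding a constant to every logit.
          exp = np.exp(logits - np.max(logits))
          return exp / np.sum(exp)

      # Three class scores -> probabilities that sum to 1
      print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
      ```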

      Linear Attention = an approximation of softmax attention: the softmax kernel is replaced by a linear dot product of kernel feature maps, which turns each step of the update equations into a simple addition, so attention can be computed recurrently in linear time (see the sketch below). https://linear-transformers.com/ https://haileyschoelkopf.github.io/blog/2024/linear-attn/
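
      A sketch of the causal linear-attention recurrence using the elu(x) + 1 feature map from the linear-transformers paper; the variable names and the small epsilon in the denominator are my own choices for illustration:

      ```python
      import numpy as np

      def elu_plus_one(x):
          # Positive feature map phi(x) = elu(x) + 1
          return np.where(x > 0, x + 1.0, np.exp(x))

      def causal_linear_attention(Q, K, V):
          # Instead of softmax(Q K^T) V, keep two running sums per step t:
          #   S_t = S_{t-1} + phi(k_t) v_t^T   (d_k x d_v matrix)
          #   z_t = z_{t-1} + phi(k_t)         (d_k vector)
          # and output phi(q_t) S_t / (phi(q_t) . z_t), giving O(N) cost.
          phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
          n, d_k = Q.shape
          d_v = V.shape[1]
          S = np.zeros((d_k, d_v))
          z = np.zeros(d_k)
          out = np.zeros((n, d_v))
          for t in range(n):
              S += np.outer(phi_k[t], V[t])   # additive state update
              z += phi_k[t]
              out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-6)
          return out

      # Toy check: 5 tokens, head dim 4
      rng = np.random.default_rng(0)
      Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
      print(causal_linear_attention(Q, K, V).shape)  # (5, 4)
      ```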

      Also note that FlashAttention (a newer, more memory-efficient approach) is briefly mentioned at the end of this paper and discussed in more depth in the sequel paper.