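Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V scores each query against every key with a dot product, scales the scores by 1/√dₖ (where dₖ is the key dimension), normalizes each query's scores with a softmax, and uses the resulting weights to take a weighted average of the value vectors.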
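The formula above can be sketched in plain Python to make the steps concrete. This is a minimal illustration, not a reference implementation: it assumes a single attention head, no masking, and matrices stored as lists of rows. The function names `softmax` and `attention` and the toy inputs are hypothetical, chosen only for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: n_q rows of length d_k
    K: n_k rows of length d_k
    V: n_k rows of length d_v
    Returns n_q rows of length d_v.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Softmax turns the scores into non-negative weights that sum to 1.
        weights = softmax(scores)
        # Mix the value vectors using those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        output.append(mixed)
    return output

# Toy inputs (hypothetical values, chosen for illustration only).
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
```

In this toy case the query matches the first key more closely, so the output leans toward the first value vector. A query that scores all keys equally, such as an all-zero query, gets the plain average of the value vectors.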
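Vaswani et al.'s 2017 paper, "Attention Is All You Need", replaced recurrence with attention, so all positions in a sequence can be processed in parallel during training rather than one step at a time.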