Skip to content
Information Sciences
Deep Learning
2017
Advanced

Transformer Attention

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Each token attends to all others—weighted by query-key similarity, scaled by dimension.

By Ashish Vaswani et al.

Information Sciences
Transformer Attention
2017 · Ashish Vaswani et al.
Source Verified
89%

Rabbit Hole Mode

Five doors into the universe behind this equation. Choose your path.

Launch collection — full story, visual, sound & share card
Why it matters: Powering ChatGPT, Gemini, Claude, and the entire generative AI revolution.

Discoverers: Ashish Vaswani et al. (2017)

What does it mean?

Each token attends to all others—weighted by query-key similarity, scaled by dimension.

Why should I care?

Powering ChatGPT, Gemini, Claude, and the entire generative AI revolution.

Equation Compass

Variables & Units

SymbolNameUnitMeaning
QQQueryQuery matrix
KKKeyKey matrix
VVValueValue matrix
dkd_kDimensionKey dimension for scaling

Worked Example

In 'The cat sat', 'sat' attends strongly to 'cat' as subject.

AI Guide (Pro)

Ask questions about equations and get answers grounded in the Equation Universe catalog.

Share this equation

Equation Universe

Transformer Attention

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Real-world impact

ChatGPT & modern AI

Scaled dot-product attention is the engine behind large language models.

Photo: Unsplash — AI neural concept

Each token attends to all others—weighted by query-key similarity, scaled by dimension.

equation-universe.vercel.app

Post