Transformer Attention
Each token attends to all others—weighted by query-key similarity, scaled by dimension.
By Ashish Vaswani et al.
Discoverers: Ashish Vaswani et al. (2017)
What does it mean?
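Each token attends to every token in the sequence, itself included. The attention weights come from a softmax over query-key dot products, each divided by the square root of the key dimension.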
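In symbols, this is the scaled dot-product attention of Vaswani et al. (2017), written here in its standard form, with $Q$, $K$, and $V$ as the query, key, and value matrices and $d_k$ as the key dimension:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied row by row, so each token's weights over the sequence sum to 1. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension.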
Why should I care?
It powers ChatGPT, Gemini, Claude, and the rest of the generative AI revolution.
Variables & Units
| Symbol | Name | Unit | Meaning |
|---|---|---|---|
| $Q$ | Query | — | Query matrix |
| $K$ | Key | — | Key matrix |
| $V$ | Value | — | Value matrix |
| $d_k$ | Dimension | — | Key dimension for scaling |
Worked Example
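A minimal sketch of how the computation might look in practice, assuming NumPy. The function name `scaled_dot_product_attention` and the toy matrices (three tokens, key dimension 4, value dimension 2) are made up for illustration and do not come from the source.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                      # key dimension used for scaling
    scores = Q @ K.T / np.sqrt(d_k)        # query-key similarity, scaled
    # Subtract each row's max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


# Hypothetical toy inputs: 3 tokens, d_k = 4, value dimension 2.
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
K = Q.copy()  # assumption: keys equal queries, purely to keep the example small
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

output, weights = scaled_dot_product_attention(Q, K, V)
print("attention weights:\n", weights.round(3))
print("output:\n", output.round(3))
```

Each row of `weights` is one token's distribution over the sequence, and each row of `output` is the corresponding weighted average of the rows of `V`. With these inputs, the second token's scaled scores are 0, 4, and 2, so it puts most of its weight on itself.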
Real-world impact
ChatGPT & modern AI
Scaled dot-product attention is the engine behind large language models.