Information Sciences

Deep Learning

2017

Advanced

Transformer Attention

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Each token attends to all others—weighted by query-key similarity, scaled by dimension.

By Ashish Vaswani et al.

Information Sciences

2017 · Ashish Vaswani et al.

Source Verified

89%

Rabbit Hole Mode

Five doors into the universe behind this equation. Choose your path.

Story PortalWho discovered this?Discover how Vaswani et al. and others shaped this equation.Visual PortalWhat does it look like?See the equation come alive in the Visual Studio.Machine PortalWhere is it used?Explore machines powered by this equation — large language models.Math PortalWhat does it derive from?Trace the mathematical lineage from matrix multiplication.Future PortalWhere is it going?Efficient long-context attention

Launch collection — full story, visual, sound & share card

Why it matters: Powering ChatGPT, Gemini, Claude, and the entire generative AI revolution.

Discoverers: Ashish Vaswani et al. (2017)

What does it mean?

Each token attends to all others—weighted by query-key similarity, scaled by dimension.

Why should I care?

Powering ChatGPT, Gemini, Claude, and the entire generative AI revolution.

Equation Compass

North — Prerequisites

West — History

Ashish Vaswani et al.

East — Applications

South — Derivations

Derivation

Variables & Units

Symbol	Name	Unit	Meaning
$Q$	Query	—	Query matrix
$K$	Key	—	Key matrix
$V$	Value	—	Value matrix
$d_k$	Dimension	—	Key dimension for scaling

Worked Example

In 'The cat sat', 'sat' attends strongly to 'cat' as subject.

AI Guide (Pro)

Ask questions about equations and get answers grounded in the Equation Universe catalog.

Upgrade to Pro Try demo (7 days)

Continue your trail

Neural Network Artificial Intelligence AI Engineer θ

Sources & further reading

Search Wikipedia Search Wolfram MathWorld Open in Wolfram|Alpha Watch explainers on YouTube

Equation Universe