
Attention, Visualized

Type a sentence. Click a word. See what it "looks at" — and how that's the whole secret behind transformers, ChatGPT, and every modern LLM.

Before you start

Read this short panel first. It tells you what the lab is, what it is trying to make you see, and how you will know if you got there.

🎯 Purpose

This lab is an interactive visualization of self-attention, the mechanism at the heart of every modern transformer. You type a sentence, click on any word, and see — as actual colored bars and arcs computed from real scaled dot-product attention — which other words that word is "looking at" and how strongly. You can also switch between 1, 2, 3, and 4 attention heads and adjust the temperature (how sharp or diffuse the attention distribution becomes).

💡 What it is trying to make you see

That attention is a soft, learned search: for every token, the model decides how much to look at every other token before deciding what that token means in context. The pattern (pronouns pulling toward antecedents, verbs pulling toward subjects, modifiers pulling toward head nouns) is not coded by humans — it emerges from training. The companion article walks through the math; this lab lets you click on "it" in the cat-mat example and see which words it attends to.

✅ What you should understand after playing

After a minute of clicking, you should leave able to:

  • Say in one sentence what "token X attends to token Y" means, in terms of the model's representation of X.
  • Predict, for a pronoun in a sentence, which other token it will attend to most strongly — and explain why.
  • Describe what changes when you flip from 1 head to 4 heads, and what the temperature slider is doing to the softmax.

If those three are true for you when you leave, the lab did its job. If not, re-read the worked example below and try one more sentence.

How to use it — 30 seconds

  1. Type or pick a sentence. Use the input below or click one of the example chips.
  2. Click any token. Colored bars appear under every token showing how much the clicked one attends to each. Brighter = more attention.
  3. Switch heads or move temperature. Watch how the distribution changes. Heads specialise; temperature sharpens or softens.

A worked example — try this sentence

Use the default sentence: "The cat sat on the mat because it was tired." Click on the word "it".

Look at the bars. "It" should attend most strongly to "cat", not to "mat". A trained model learns exactly this: pronouns referring to animate things pull toward likely animate antecedents.

Nobody coded "pronouns attend to their antecedents." The pattern emerged from training. You are watching that pattern in action.

Now bump to 4 heads. Different heads will attend differently — some still focus on "cat", others may pull toward "tired" or the sentence structure. This is why multi-head attention is more expressive than single-head: each head can learn its own pattern.

1. Type a sentence

Or pick one of our examples below. Self-attention scores will be computed for every word against every other word.

2. Click any word to see what it attends to

The selected word's attention weights to every other word are shown as a colored bar underneath each token. Brighter = more attention.

3. The full attention matrix

Every row is one word; every column is one word. Each cell shows how much the row's word attends to the column's word. This matrix is what self-attention computes — for every layer, every head, every model.

4. Try different heads & temperatures

Real transformers have multiple "heads" — different attention patterns computed in parallel. Temperature controls how sharply attention focuses.

What's actually happening

Self-attention in one sentence: for every word in the sentence, the model decides how much to look at every other word — including itself — before deciding what that word "means" in context.

The math: each word becomes a vector (embedding). We compute three more vectors from it — Q (query: "what am I looking for?"), K (key: "what do I offer?"), and V (value: "what info do I carry?"). The attention score from word i to word j is the dot product Q_i · K_j, divided by the square root of the vector dimension, then softmaxed across all words. The resulting weights say how much of each word's V flows into word i's new representation.
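If you want to see that end to end, here is a minimal NumPy sketch of the same computation for one clicked word, using random embeddings the way this demo does. The dimensions, seed, and variable names are illustrative choices, not this lab's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat because it was tired".split()
d = 16                                   # embedding size (illustrative)

E = rng.normal(size=(len(tokens), d))    # random embeddings, as in this demo
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # learned in a real model
Q, K, V = E @ W_q, E @ W_k, E @ W_v      # query, key, value vectors for every token

i = tokens.index("it")                   # the clicked word
scores = Q[i] @ K.T / np.sqrt(d)         # dot product with every key, scaled by sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax: attention from "it" to each word

context = weights @ V                    # weighted mix of values = new meaning of "it"

for tok, w in zip(tokens, weights):
    print(f"{tok:>8}  {w:.2f}")          # these are the bars the lab draws
```

With random weights the printout is noisy, which is exactly the point made below: training is what turns this arithmetic into meaningful patterns.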

Why it matters: in the sentence "the cat sat on the mat because it was tired" — the word "it" needs to attend strongly to "cat" to figure out what it refers to. A well-trained model learns exactly that pattern. The visualization above uses random embeddings so the patterns are noisy — real models train millions of these vectors over trillions of words until the patterns become meaningful.

Heads: rather than one attention pattern, transformers compute several in parallel ("multi-head attention"). Different heads learn different things — one might focus on syntax, another on coreference, another on rare words. This demo shows 4 random heads; real models use 8–128.
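As a rough sketch of the multi-head idea (again with random, untrained projections; real heads are learned), each head simply repeats the computation above with its own smaller projections:

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = "the cat sat on the mat because it was tired".split()
d, n_heads, d_head = 16, 4, 8            # illustrative sizes

E = rng.normal(size=(len(tokens), d))    # random embeddings again
i = tokens.index("it")

for h in range(n_heads):
    # Each head gets its own projections, so its own attention pattern.
    W_q = rng.normal(size=(d, d_head))
    W_k = rng.normal(size=(d, d_head))
    Q, K = E @ W_q, E @ W_k
    scores = Q[i] @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    print(f"head {h}: 'it' looks hardest at '{tokens[int(np.argmax(w))]}'")
```

In a real transformer each head also produces its own value mixture, and the head outputs are concatenated and projected back together; trained heads end up specialising the way the paragraph above describes.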

Temperature: higher temperature spreads attention across many words (soft, blurry); lower temperature concentrates it on the single best match (sharp). Training learns the right level automatically; the slider lets you feel the difference.
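To feel what the slider does numerically, here is the same softmax applied to a few made-up scores at three temperatures (the numbers are illustrative, not taken from this lab):

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = scores / temperature             # temperature rescales scores before softmax
    z = np.exp(z - z.max())
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.2, -0.5]) # made-up attention scores for four words

for t in (0.3, 1.0, 3.0):
    print(f"T={t}: {np.round(softmax(scores, t), 2)}")
# Low T piles nearly all the weight on the best match; high T spreads it out.
```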

Further reading: Vaswani et al., "Attention Is All You Need" (2017) — the paper that introduced the transformer. Jay Alammar, "The Illustrated Transformer" for an excellent visual walkthrough.

This is our first interactive lab. More AI explainers are coming. Feedback and suggestions are welcome.