1 — Input Text & Tokenization
Tokens (click any token to inspect it)
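A minimal sketch of the tokenization step in Python, assuming a toy whitespace tokenizer and a small hypothetical vocabulary (real models use learned subword schemes such as BPE):

    # Toy whitespace tokenizer with a hypothetical vocabulary (illustration only).
    vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

    def tokenize(text):
        # Map each whitespace-separated word to its id, falling back to <unk>.
        return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

    print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]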
2 — Token Embeddings
E[token_id] = Embedding_Matrix[token_id, :] ∈ ℝ^d_model
Embedding matrix: each row = one token vector | columns = dimensions
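A sketch of the lookup E[token_id] = Embedding_Matrix[token_id, :], assuming a random toy matrix and the ids from the tokenizer above (d_model = 8 and vocab_size = 6 are arbitrary choices):

    import numpy as np

    d_model, vocab_size = 8, 6          # arbitrary toy sizes
    rng = np.random.default_rng(0)
    embedding_matrix = rng.normal(size=(vocab_size, d_model))  # one row per token

    token_ids = [1, 2, 3]               # e.g. "the cat sat"
    X = embedding_matrix[token_ids]     # row lookup: shape (seq_len, d_model)
    print(X.shape)                      # (3, 8)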
3 — Positional Encoding sin/cos at different frequencies
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
PE signal + token embeddings: X = Embed + PE (input to the transformer)
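A sketch of the sinusoidal encoding above and the Embed + PE sum, using numpy (the sequence length and d_model are arbitrary toy values):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
        angles = pos / 10000 ** (2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                 # even dimensions
        pe[:, 1::2] = np.cos(angles)                 # odd dimensions
        return pe

    embed = np.random.randn(3, 8)                    # stand-in for token embeddings
    X = embed + positional_encoding(3, 8)            # input to the transformer blocks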
4 — Q, K, V Projections three learned linear maps per head
Q = X·W_Q   K = X·W_K   V = X·W_V   with W_Q, W_K, W_V ∈ ℝ^(d_model × d_head)
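A sketch of the three projections for a single head, assuming random weights (d_head = 4 is an arbitrary toy size):

    import numpy as np

    seq_len, d_model, d_head = 3, 8, 4
    X = np.random.randn(seq_len, d_model)
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each has shape (seq_len, d_head)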
5 — Self-Attention Scores hover any cell to see the relationship
Scores = Q·Kᵀ / √d_k   AttnWeights = softmax(Scores)
Raw scores (Q·Kᵀ / √d_k) → softmax → attention weights (each row sums to 1.0)
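A sketch of the scaled dot-product scores and the row-wise softmax, continuing the toy shapes above:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    d_head = 4
    Q = np.random.randn(3, d_head)
    K = np.random.randn(3, d_head)
    scores = Q @ K.T / np.sqrt(d_head)            # (seq_len, seq_len)
    attn_weights = softmax(scores, axis=-1)       # each row sums to 1.0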
6 — Value Aggregation + Residual context-enriched representations
AttnOut = AttnWeights · V   X′ = LayerNorm(X + AttnOut)
Context-aware token vectors: each token now sees all the others
Residual stream magnitude: bars show per-dimension values after LayerNorm
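A sketch of the value aggregation and the residual + LayerNorm step, with a minimal LayerNorm (no learned gain or bias, for brevity) and stand-in weights and values sized d_model for simplicity:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)      # omits the learned gain/bias

    seq_len, d_model = 3, 8
    X = np.random.randn(seq_len, d_model)
    attn_weights = np.full((seq_len, seq_len), 1 / seq_len)  # stand-in uniform weights
    V = np.random.randn(seq_len, d_model)

    attn_out = attn_weights @ V                   # weighted sum of value vectors
    X_prime = layer_norm(X + attn_out)            # residual connection + LayerNorm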
7 — Feed-Forward Network applied independently to each token position
FFN(x) = GELU(x·W_1 + b_1)·W_2 + b_2   d_ff = 4 × d_model
After FFN + LayerNorm: final per-token representations
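A sketch of the position-wise FFN with a tanh approximation of GELU, applied to every token row independently with the same weights (toy sizes, random weights):

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    d_model = 8
    d_ff = 4 * d_model
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

    def ffn(x):
        return gelu(x @ W1 + b1) @ W2 + b2        # same weights at every position

    X = np.random.randn(3, d_model)
    out = ffn(X)                                  # (3, d_model)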
8 — Next Token Prediction logits → softmax → sample
logits = h_last·W_vocab   P(w) = softmax(logits / temperature)
Temperature: 0.8
Generated sequence
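A sketch of temperature-scaled sampling from the logits, assuming a toy vocabulary size and random weights:

    import numpy as np

    vocab_size, d_model = 6, 8
    h_last = np.random.randn(d_model)              # final hidden state of the last token
    W_vocab = np.random.randn(d_model, vocab_size)

    logits = h_last @ W_vocab
    temperature = 0.8
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                           # softmax over the vocabulary
    next_id = np.random.choice(vocab_size, p=probs)  # sample the next token id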
Model Config
Vocabulary Sample
Step Explanation
Selected Token
Click a token to inspect its embedding, attention patterns, and activations