1 — Input Text & Tokenization
Tokens (click any token to inspect it)
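A minimal sketch of the tokenization step in Python, assuming a toy whitespace tokenizer and a small hypothetical vocabulary (real models use learned subword schemes such as BPE):

    # Toy whitespace tokenizer with a hypothetical vocabulary (illustration only).
    vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

    def tokenize(text):
        # Map each whitespace-separated word to its id, falling back to <unk>.
        return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

    print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]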
2 — Token Embeddings
E[token_id] = Embedding_Matrix[token_id, :] ∈ ℝ^d_model
Embedding matrix: each row = one token vector | columns = dimensions
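A sketch of the lookup E[token_id] = Embedding_Matrix[token_id, :], assuming a random toy matrix and the ids from the tokenizer above (d_model = 8 and vocab_size = 6 are arbitrary choices):

    import numpy as np

    d_model, vocab_size = 8, 6          # arbitrary toy sizes
    rng = np.random.default_rng(0)
    embedding_matrix = rng.normal(size=(vocab_size, d_model))  # one row per token

    token_ids = [1, 2, 3]               # e.g. "the cat sat"
    X = embedding_matrix[token_ids]     # row lookup: shape (seq_len, d_model)
    print(X.shape)                      # (3, 8)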
3 — Positional Encoding sin/cos at different frequencies
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
PE signal + token embeddings: X = Embed + PE (input to the transformer)
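A sketch of the sinusoidal encoding above and the Embed + PE sum, using numpy (the sequence length and d_model are arbitrary toy values):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
        angles = pos / 10000 ** (2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                 # even dimensions
        pe[:, 1::2] = np.cos(angles)                 # odd dimensions
        return pe

    embed = np.random.randn(3, 8)                    # stand-in for token embeddings
    X = embed + positional_encoding(3, 8)            # input to the transformer blocks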
4 — Q, K, V Projections three learned linear maps per head
Q = X·W_Q   K = X·W_K   V = X·W_V   with W_Q, W_K, W_V ∈ ℝ^(d_model × d_head)
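A sketch of the three projections for a single head, assuming random weights (d_head = 4 is an arbitrary toy size):

    import numpy as np

    seq_len, d_model, d_head = 3, 8, 4
    X = np.random.randn(seq_len, d_model)
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each has shape (seq_len, d_head)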
5 — Self-Attention Scores hover any cell to see the relationship
Scores = Q·Kᵀ / √d_k   AttnWeights = softmax(Scores)
Raw scores (Q·Kᵀ / √d_k) → softmax → attention weights (each row sums to 1.0)
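A sketch of the scaled dot-product scores and the row-wise softmax, continuing the toy shapes above:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    d_head = 4
    Q = np.random.randn(3, d_head)
    K = np.random.randn(3, d_head)
    scores = Q @ K.T / np.sqrt(d_head)            # (seq_len, seq_len)
    attn_weights = softmax(scores, axis=-1)       # each row sums to 1.0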
6 — Value Aggregation + Residual context-enriched representations
AttnOut = AttnWeights · V   X′ = LayerNorm(X + AttnOut)
Context-aware token vectors: each token now sees all the others
Residual stream magnitude: bars show per-dimension values after LayerNorm
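A sketch of the value aggregation and the residual + LayerNorm step, with a minimal LayerNorm (no learned gain or bias, for brevity) and stand-in weights and values sized d_model for simplicity:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)      # omits the learned gain/bias

    seq_len, d_model = 3, 8
    X = np.random.randn(seq_len, d_model)
    attn_weights = np.full((seq_len, seq_len), 1 / seq_len)  # stand-in uniform weights
    V = np.random.randn(seq_len, d_model)

    attn_out = attn_weights @ V                   # weighted sum of value vectors
    X_prime = layer_norm(X + attn_out)            # residual connection + LayerNorm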
7 — Feed-Forward Network applied independently to each token position
FFN(x) = GELU(x·W_1 + b_1)·W_2 + b_2   d_ff = 4 × d_model
After FFN + LayerNorm: final per-token representations
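A sketch of the position-wise FFN with a tanh approximation of GELU, applied to every token row independently with the same weights (toy sizes, random weights):

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    d_model = 8
    d_ff = 4 * d_model
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

    def ffn(x):
        return gelu(x @ W1 + b1) @ W2 + b2        # same weights at every position

    X = np.random.randn(3, d_model)
    out = ffn(X)                                  # (3, d_model)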
8 — Next Token Prediction logits → softmax → sample
logits = h_last·W_vocab   P(w) = softmax(logits / temperature)
Temperature: 0.8
Generated sequence
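A sketch of temperature-scaled sampling from the logits, assuming a toy vocabulary size and random weights:

    import numpy as np

    vocab_size, d_model = 6, 8
    h_last = np.random.randn(d_model)              # final hidden state of the last token
    W_vocab = np.random.randn(d_model, vocab_size)

    logits = h_last @ W_vocab
    temperature = 0.8
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                           # softmax over the vocabulary
    next_id = np.random.choice(vocab_size, p=probs)  # sample the next token id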
Model Config
Vocabulary Sample
Step Explanation
Selected Token
Click a token to inspect its embedding, attention patterns, and activations