Module 1 of 5

Foundations

How language becomes numbers, how the Transformer processes them, and why scale turns next-token prediction into surprisingly general capabilities.

Classification pipelines, retrieval systems, simulated survey respondents: every LLM application in this course rests on the same machinery. Understanding it is the difference between using these models as opaque tools and using them as instruments whose behaviour you can reason about, debug, and justify in a methods section.

This module builds the foundations from the ground up: how text becomes numbers, how the Transformer processes those numbers in context, why "predict the next token" turns out to be a powerful learning signal, and how scale and cost shape what is practical for research. Each section introduces a concept you will rely on throughout the course.

After This Module You Will Be Able To

  1. Explain how word embeddings encode meaning as geometry and why that matters for measuring culture and bias.
  2. Describe how tokenization shapes what a model sees, and identify where it introduces inequality.
  3. Trace a self-attention computation step by step and distinguish encoder-only from decoder-only architectures.
  4. Explain why next-token prediction produces general capabilities and how cross-entropy loss drives learning.
  5. Read scaling-law plots and reason about the relationship between model size, data, compute, and performance.
  6. Estimate context window and memory requirements for a given model and research task.

Representing Meaning

Before a language model can do anything useful, it needs a way to represent words as numbers. Not just any numbers: numbers that capture meaning. This section traces the path from the simplest text representations to the geometric spaces that power modern NLP.

Bag of Words: Counting What Appears

The simplest idea: count which words appear. A document mentioning "economy," "inflation," and "growth" is probably about economic policy. One mentioning "goal," "league," and "match" is probably about sports.

This is called a bag-of-words (BoW) representation. It throws away word order, grammar, and nuance, keeping only word frequencies. Despite this, BoW has powered decades of productive social science research: TF-IDF for document retrieval, topic models like LDA and STM for discovering themes in large corpora, and dictionary methods for measuring sentiment or policy focus.

But BoW has a hard ceiling. Every word is a separate dimension, equally distant from every other word. "Cat" is as far from "dog" as from "democracy." Two sentences about the same topic can look completely different if they use different vocabulary. BoW captures what words appear, but not what they mean.

Word Embeddings: Meaning as Geometry

The breakthrough idea (Mikolov et al., 2013): instead of treating every word as a separate symbol, learn a dense vector for each word from its context. Words that appear in similar contexts get similar vectors.

This is the distributional hypothesis: "you shall know a word by the company it keeps" (Firth, 1957). A word that often appears near "government," "legislation," and "vote" will end up with a vector close to those of other political terms, even if the words never co-occur in the same sentence.

Definition

Word Embedding

A learned mapping $f: \mathcal{V} \to \mathbb{R}^d$ from a vocabulary $\mathcal{V}$ to a $d$-dimensional real vector space, where semantic and syntactic relationships between words are encoded as geometric relationships between their vectors.

Word2Vec learns embeddings by training a shallow neural network on a simple task: given a word, predict its neighbours (Skip-gram), or given the neighbours, predict the word (CBOW). GloVe (Pennington et al., 2014) achieves a similar result by factorising a global word co-occurrence matrix. Both produce vectors where semantic similarity corresponds to geometric proximity.


Schematic illustration of word embeddings in 2D. In a trained embedding space, words from the same domain cluster together, even though the model was never told these categories exist. This structure emerges from word co-occurrence patterns. (Layout is simplified for clarity; real projections via PCA or t-SNE are noisier.)

Cosine Similarity: Measuring Closeness

How do we measure whether two word vectors are "close"? The standard tool is cosine similarity: it measures the angle between two vectors, ignoring their length.

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|}$$

Result is the cosine of the angle between the two vectors: 1.0 = identical direction, 0.0 = orthogonal (no overlap), −1.0 = opposite.


Cosine similarity measures the angle between vectors, not their magnitude. Drag vector b to see how the cosine value changes: 1.0 when aligned, 0.0 when orthogonal, −1.0 when opposite.
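If you want to see the formula in code, here is a minimal NumPy sketch; the 4-dimensional vectors are made up for illustration, not real embedding values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-dimensional vectors (real embeddings have 100+ dimensions).
v_king  = np.array([0.8, 0.6, 0.1, 0.2])
v_queen = np.array([0.7, 0.7, 0.2, 0.1])
v_river = np.array([0.1, 0.2, 0.9, 0.7])

print(cosine_similarity(v_king, v_queen))  # close to 1: similar direction
print(cosine_similarity(v_king, v_river))  # smaller: different direction
```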

Analogy Arithmetic

The most striking property of word embeddings: relationships become directions. The direction from "man" to "king" captures something like "royalty." Add that same direction to "woman," and you land near "queen."

$$\vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$$

Vector arithmetic captures semantic relationships. The direction from "man" to "king" is approximately the same as the direction from "woman" to "queen."

This works because the embedding space organises concepts along roughly consistent axes. Gender is one direction. Geography is another ("Paris − France + Germany ≈ Berlin"). Tense, plurality, and many other relationships are encoded as well. That said, subsequent work has shown that analogy results are sensitive to evaluation methodology and less robust than the original papers suggested: the effect is real, but noisier than the clean examples imply.
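A hedged sketch of the analogy test using gensim's pre-trained-vector downloader. The model name glove-wiki-gigaword-100 and the download on first run are assumptions about your environment, and the exact nearest neighbours will vary with the vectors used.

```python
import gensim.downloader as api

# Pre-trained GloVe vectors (lowercased vocabulary); downloaded on first run.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?  most_similar computes this vector offset and
# returns the nearest words by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Geography works the same way: Paris - France + Germany ≈ Berlin.
print(vectors.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```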

Semantic Projection: Reading Hidden Dimensions

A 100-dimensional vector seems abstract, but it encodes rich, interpretable structure. The technique of semantic projection reveals this: define a meaningful direction using pairs of anchor words, then project other words onto that axis to see where they fall.

For example, a "gender" direction can be defined using anchors like (man, woman), (he, she), (him, her). Projecting occupations onto this axis reveals how strongly the training corpus associates each occupation with gender. This is not just a curiosity: it is a research method.


Occupations projected onto a gender axis derived from GloVe embeddings. The bias reflects patterns in the training corpus, not ground truth.

The projection above uses illustrative data, but the method is real. The same technique, applied rigorously to trained embeddings, has become a foundational tool for computational social science.
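A sketch of semantic projection, assuming the vectors object from the previous snippet is available; the anchor pairs and occupation list are illustrative choices, not the exact sets used in the cited studies.

```python
import numpy as np

def semantic_axis(pairs, vectors):
    """Average the (pole_a - pole_b) difference vectors over several anchor pairs."""
    diffs = [vectors[a] - vectors[b] for a, b in pairs]
    axis = np.mean(diffs, axis=0)
    return axis / np.linalg.norm(axis)

def project(word, axis, vectors):
    """Scalar projection of a (normalised) word vector onto a unit-length semantic axis."""
    v = vectors[word]
    return float(np.dot(v, axis) / np.linalg.norm(v))

gender_axis = semantic_axis([("woman", "man"), ("she", "he"), ("her", "him")], vectors)

for occupation in ["nurse", "engineer", "teacher", "mechanic", "librarian"]:
    print(f"{occupation:>10s}  {project(occupation, gender_axis, vectors):+.3f}")
# Positive values lean toward the "woman" pole, negative toward the "man" pole:
# a property of the training corpus, not ground truth.
```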

Social Science Application: Embeddings as Cultural Measurement. The same geometry that makes embeddings useful for NLP makes them a research instrument for studying culture and bias. A foundational line of work demonstrates this progression:

  • Caliskan, Bryson & Narayanan (2017) showed in Science that word embeddings replicate the full range of implicit biases measured in humans via the Implicit Association Test (IAT). Embedding geometry absorbs the biases present in training text.
  • Garg, Schiebinger, Jurafsky & Zou (2018) extended this by training embeddings on text from each decade of the 20th century, tracking how gender and ethnic stereotypes in American English evolved over time (PNAS): a computational history of bias that aligns with known social changes.
  • Kozlowski, Taddy & Evans (2019) generalised the approach in American Sociological Review, establishing a framework for mapping cultural dimensions — social class, affluence, education — as directions in embedding space. Projecting words onto these axes reveals the cultural associations that texts carry.

Stop and Think

If word embeddings place semantically similar words near each other, what happens to words with multiple meanings: like "bank" (river bank vs. financial bank)? What limitation does this reveal about static embeddings?

Reveal

Static embeddings assign a single vector per word, regardless of context. "Bank" gets one representation that averages across all its senses. This is a fundamental limitation: the same word in different sentences gets the same numbers. Solving this problem is exactly what the Transformer architecture does, which we turn to after tokenization.

In the notebook: Sections 1–2 walk you through building BoW vectors, implementing cosine similarity from scratch, testing word analogies, and projecting occupations onto a gender axis.

Resources

Tokenization

We flagged a fundamental limitation of static embeddings: one vector per word, regardless of context. Solving that is the job of the Transformer (next section). But before we get there, we need to address something even more basic: what counts as a "word" in the first place? Models don't actually operate on words as we think of them. Before text enters a model, it must be split into discrete units the model can process. This step, tokenization, bridges raw text and the embedding layer. It determines what the model "sees," and its consequences for research are more significant than most users realise.

The Problem: Words Are Not Enough

A word-level vocabulary is appealing but impractical. English alone has hundreds of thousands of words. Add misspellings, names, code, and other languages, and the vocabulary explodes. Any word not in the vocabulary becomes an unknown <UNK> token: invisible to the model.

Character-level tokenization solves the unknown-word problem (any text can be spelled out letter by letter) but creates sequences that are extremely long and hard to learn from. The model must work out from individual letters that "c-a-t" refers to a single concept.

Subword tokenization is the compromise that modern LLMs actually use. It keeps common words as single tokens ("the," "and") but breaks rare words into smaller, reusable pieces ("unhappiness" → "un" + "happiness" or "un" + "happi" + "ness").

Definition

Byte-Pair Encoding (BPE)

A tokenization strategy that splits text into units smaller than words but larger than characters. Starting from individual bytes or characters, BPE iteratively merges the most frequent adjacent pair until a target vocabulary size is reached.

How BPE Works

BPE originated as a data compression algorithm (Gage, 1994) and was adapted for neural machine translation by Sennrich et al. (2016). The idea is elegantly simple. Start with individual characters. Count every adjacent pair in the training corpus. Merge the most frequent pair into a new token. Repeat until you reach the desired vocabulary size.


Step through BPE merge operations on real words. At each step, the most frequent adjacent pair in the training corpus is merged into a single token. Notice how common subwords ("un", "ing", "tion") emerge as reusable building blocks.
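A toy implementation of the merge loop, for intuition only; production tokenizers (e.g. GPT-2's or Llama's) work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges.

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)

        # Merge that pair everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["unhappy", "unhappiness", "happily", "unkind", "kindness"] * 10
merges, vocab = bpe_merges(corpus, num_merges=8)
print(merges)           # common subwords ("un", "happ", ...) emerge as merges
print(list(vocab)[:3])  # words are now tuples of learned subword units
```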

Why Tokenization Matters for Research

Tokenization is not just preprocessing. It fundamentally shapes what a model "sees."

Multilingual inequality. Tokenizers are trained predominantly on English text. The same sentence in Hindi, Arabic, or Yoruba often requires two to ten times more tokens than in English. More tokens means lower effective resolution, higher API costs, and faster context-window exhaustion. When working with multilingual social science data, always check how your text tokenizes.

Number fragmentation. Numbers are often split into seemingly arbitrary pieces ("2024" → "202" + "4"). This is why models struggle with arithmetic: they do not see "2024" as a single number.

The "strawberry" problem. Ask a model "how many r's in strawberry?" and it may answer incorrectly. The tokenizer splits "strawberry" into subwords like "straw" + "berry": the model never sees individual letters, so it cannot count them.

Key Takeaway

Tokenization is part of your measurement instrument. When you send text to an LLM, the tokenizer decides how that text is represented. Multilingual tokenizers often allocate fewer tokens to non-English text, giving the model lower effective resolution for those languages. Before running experiments, inspect your tokenization.

In the notebook: Section 3 lets you explore tokenization hands-on: comparing how different models split the same text and observing multilingual tokenization inequality directly.

Resources

The Transformer

We now have vectors for words and a way to split text into tokens. But recall the limitation we flagged: static embeddings give each word one vector, regardless of context. "Bank" gets the same representation whether it appears next to "river" or "deposit." This section introduces the architecture that solved this problem and, in doing so, made modern AI possible.

The Polysemy Problem

Consider these two sentences:

  • "I deposited money at the bank."
  • "We sat on the bank of the river."

With GloVe, both instances of "bank" get the exact same vector. A model using static embeddings cannot distinguish the two meanings. It must rely on other words in the sentence to disambiguate, but the embedding itself carries no context.

What we need is a mechanism that lets each token "look at" the rest of the sequence and adjust its representation accordingly. That mechanism is self-attention, and the architecture built around it is the Transformer (Vaswani et al., 2017).

Before Transformers: The Sequential Bottleneck

Earlier models (RNNs, LSTMs) processed tokens one at a time, left to right. Each token's representation depended on the previous hidden state, creating a chain. This had two problems:

  • Speed: Sequential processing cannot be parallelised across tokens. Training was slow.
  • Long-range dependencies: Information from early tokens had to survive through many steps of the chain to influence later tokens. In practice, it often degraded or was lost.

The Transformer solved both problems by replacing recurrence with attention. The core mechanism allows every token to directly attend to any other token, regardless of distance, and all positions are processed simultaneously. (Decoder-only models add a causal mask that restricts each token to past positions only — covered in the section below.)

Self-Attention: The Core Mechanism

Self-attention works by having each token "ask a question" of every other token and collect relevant information. This happens through three learned projections:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information should I provide if I'm relevant?"

Each token's input embedding is multiplied by three separate weight matrices to produce its Q, K, and V vectors. Then, for each token, we compute the dot product of its Query with every Key. High dot products mean high relevance. These scores are scaled and passed through softmax to become weights (summing to 1). Finally, the weighted sum of Values gives the token's new, context-enriched representation.

Definition

Self-Attention

A mechanism where each position in a sequence computes a weighted sum over all positions. Each token produces three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention weights are determined by the compatibility between Queries and Keys.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Scaled dot-product attention. The $\sqrt{d_k}$ factor prevents dot products from growing too large in high dimensions, which would push softmax into regions with near-zero gradients.

Let's trace this computation step by step on a toy example with tiny 4-dimensional vectors. Real models use hundreds or thousands of dimensions, but the math is identical. (This widget uses causal (autoregressive) attention: each token attends only to itself and preceding positions — exactly as in decoder-only models. Select earlier tokens to see how context grows through the sequence; the causal masking section below explains the mechanism.)

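A NumPy sketch of the same computation, with random toy weights instead of learned ones; the token count and dimensions match the walkthrough, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n = 4, 4, 6           # toy sizes; real models use hundreds of dimensions

X = rng.normal(size=(n, d_model))   # one 4-d embedding per token of "I sat by the river bank"
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)     # (n, n) relevance scores, scaled by sqrt(d_k)

# Causal mask: position i may not attend to positions j > i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax over each row turns scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                # context-enriched representation per token
print(np.round(weights, 2))         # lower-triangular: each token attends only to the past
```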

Contextual Embeddings: The Payoff

This is how context changes representation. After passing through self-attention, "bank" in "I sat by the river bank" produces a vector close to "river" and "shore." The same word "bank" in "I went to the bank to deposit money" produces a vector close to "finance" and "account." The model builds a different representation for each occurrence, informed by context.

Definition

Contextual Embedding

A representation where each token's vector depends on the entire surrounding sequence. The same word produces different vectors in different contexts, resolving ambiguity that static embeddings cannot.

This shift, from one vector per word type to one vector per word token in context, is what makes modern language models so powerful. Every downstream task you will encounter in this course depends on it. When you classify text (Module 3), you use the model's contextual representation of the input. When you build a RAG pipeline (Module 4), contextual embeddings power the retrieval step. When you prompt a model (Module 2), you are writing input that the model processes through layers of contextual attention.

Multi-Head Attention

You just traced a single attention head in detail. A single head can only capture one type of relationship at a time. Multi-head attention runs several heads in parallel, each with its own WQ, WK, WV matrices. Different heads learn to focus on different things: one might track syntactic dependencies (subject–verb), another semantic relationships (bank–river), another broad context mixing. Their outputs are concatenated and projected back to the original dimension with a learned WO.

The interactive block below uses the same six token embeddings xi and the same recipe as the walkthrough: for each head, qi, ki, vi from the embeddings, then softmax over qi · kj / √dk (here dk = 4), then a weighted sum of v vectors. Head 2 reuses the exact same WQ, WK, WV as in the single-head widget above, so its heatmap matches that computation; Heads 1 and 3 use different weights so you can compare patterns.

Multi-head attention: 3 heads, d = dk = 4.

$$\text{head}_h = \sum_j \alpha^{(h)}_{i,j}\,\mathbf{v}^{(h)}_j, \qquad \text{MultiHead} = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\,W^O$$

Each square is an attention weight αi,j for query row i and key column j (darker = higher). Hover a cell for the exact percentage; click any cell in a row to pin that query token across all heads. The panel below shows each head’s contextual output as a heatmap strip (same encoding as the single-head widget), then a toy projection back to 4-D.

  • Head 1, syntactic (distinct WQ, WK, WV): "I" ↔ "sat", subject–verb agreement.
  • Head 2, semantic (same WQ, WK, WV as the single-head demo): "bank" → "river", meaning from context.
  • Head 3, broad context (distinct WQ, WK, WV): content ↔ function words, gathering context.

Three parallel heads read the same embeddings but apply different projections, so attention matrices differ. Concatenating head outputs and multiplying by WO fuses those views back into one d-dimensional vector per token. Toy matrices here; real models use many heads and layers, and heads are rarely as tidy as these illustrations.
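A sketch of the multi-head computation that reuses the single-head recipe above; the weights are random, and the toy choice d_k = d_model differs from real models, which typically set d_k = d_model / n_heads.

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head causal scaled dot-product attention (as in the earlier sketch)."""
    d_k = W_K.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
n, d_model, n_heads = 6, 4, 3
d_k = d_model                         # toy choice for readability

X = rng.normal(size=(n, d_model))
heads = []
for _ in range(n_heads):              # each head gets its own projections
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(causal_attention(X, W_Q, W_K, W_V))

concat = np.concatenate(heads, axis=-1)          # (n, n_heads * d_k)
W_O = rng.normal(size=(n_heads * d_k, d_model))  # learned output projection
multihead_out = concat @ W_O                     # fused back to (n, d_model)
print(multihead_out.shape)
```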

The Full Transformer Block

A Transformer block combines self-attention with a few other components:

[Diagram: Input Embeddings → Multi-Head Attention → Add & Layer Norm (residual) → Feed-Forward Network → Add & Layer Norm (residual) → Output Embeddings. The block is repeated N times (e.g. 96 layers in GPT-3) to form the full Transformer.]

One Transformer block. Input flows through multi-head attention, then through a feed-forward network. Residual connections (dashed lines) add the input back to the output at each stage, and layer normalisation stabilises training. A full model stacks many such blocks.

Residual connections (He et al., 2016) add each sub-layer's input directly to its output. This lets gradients flow through the network without degrading, enabling very deep stacks (GPT-3 uses 96 layers). Layer normalisation (Ba, Kiros & Hinton, 2016) stabilises the scale of activations at each layer.

The Feed-Forward Network

The block diagram above shows a Feed-Forward Network (FFN) sandwiched between the two residual connections. It is two linear transformations with a nonlinearity between them, applied independently to every token position:

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Two learned linear transformations with a nonlinearity $\sigma$ between them. The inner dimension $d_\text{ff}$ is typically $4 \times d_\text{model}$, so the FFN first expands each token's representation and then compresses it back. $\sigma$ is GELU in GPT-style models, SwiGLU in Llama.

The FFN's role becomes clear when you set it alongside attention. Self-attention is the cross-position operation: it lets each token look at all others and gather relevant context. After that mixing step, every token has a richer representation — but the attention mechanism itself applies the same weighted-sum operation everywhere. The FFN is the per-position computation that follows: it applies a learned, nonlinear transformation to each token's representation independently, giving the model capacity to "process" whatever attention assembled. A useful intuition: attention decides what information to retrieve; the FFN decides what to do with it.
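A minimal sketch of the position-wise FFN with a tanh-approximated GELU; the dimensions are toy values.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, the nonlinearity used in GPT-style models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand to d_ff, apply nonlinearity, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff, n = 8, 32, 6           # d_ff is typically 4 x d_model

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = rng.normal(size=(n, d_model))     # one vector per token position
print(ffn(X, W1, b1, W2, b2).shape)   # (6, 8): same shape, applied independently per position
```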

Research on mechanistic interpretability has found that FFN layers function partly as key-value memories (Geva et al., 2021): the first weight matrix acts like keys that match input patterns, and the second acts like values that write information into the residual stream when a key fires. This framing has become productive for understanding where and how factual knowledge is stored and updated in large models — directly relevant if you are using LLMs as sources of world knowledge in research applications.

Positional Encodings

Self-attention processes all tokens in parallel: there is nothing in the mechanism that distinguishes position 1 from position 100. Without help, "dog bites man" and "man bites dog" would produce identical representations.

Positional encodings solve this by adding position information to each token's embedding before it enters the attention layers. The original Transformer used sinusoidal functions of the position. Modern models often use learned positional embeddings or rotary position encodings (RoPE) (Su et al., 2021), which generalise better to long sequences.

Causal Masking: Why Decoders Only See the Past

The mechanism just described allows every token to attend to every other token in the sequence. This works well for some architectures, but it creates a fundamental problem for autoregressive language models. Consider training on the sequence "I sat by the river bank." To learn to predict "bank," the model must not be allowed to see "bank" in its own input context — that would be trivially cheating. More generally, to predict token t, the model should only attend to tokens 1 through t − 1. If token 3 can attend to token 6 during training, the learning signal is corrupted.

The solution is a causal mask (also called an autoregressive mask). Before the softmax step, all attention scores that point forward in the sequence — from position i to any position j > i — are set to −∞. Because e−∞ = 0, softmax assigns exactly zero weight to those positions. Each token can only gather information from itself and earlier tokens: the attention matrix becomes lower-triangular.

[Figure: 6×6 causal attention mask for "I sat by the river bank". Rows are query tokens, columns are keys; cells above the diagonal are set to −∞ (zero weight after softmax), cells on or below the diagonal can attend. The highlighted cell shows "bank" attending to "river".]

The causal attention mask for the six-token sequence used in the widget above. Each row is a query token; each column is a key it may attend to. Blue cells are permitted; grey cells are forced to −∞ before softmax, making their weight exactly zero. "bank" (bottom row) can see all five preceding tokens — which is how its meaning is disambiguated in context — but it cannot attend to itself when the model is trying to predict it during training. The highlighted cell shows "bank" attending most strongly to "river," consistent with the widget's computed weights.

This masking is applied identically at every layer and every head. Models that use it are called decoder-only (or causal) Transformers: GPT, Llama, Mistral, Claude, and most contemporary chat and generation models fall into this category. The attention widget above already applied this causal mask — you can see that early tokens have fewer positions to attend to. In a deployed decoder-only model, the upper triangle of every attention matrix is always zeroed out.

The counterpart — models that keep full, unmasked, bidirectional attention — are called encoder-only models. This architectural choice pairs with a different training objective and produces a different family of strengths and limitations, covered in the next subsection.

Encoder-Only vs. Decoder-Only: Choosing the Right Tool

Two choices — which tokens a position may attend to, and what the model is trained to predict — combine to produce the two main model families. Understanding the difference is practical, not just theoretical: the wrong choice of architecture for a task is one of the most common errors in computational social science workflows.

[Figure: side-by-side attention matrices. Left, encoder-only (BERT · RoBERTa · DeBERTa): trained to predict [MASK] tokens, every token sees every other token; typical uses are classification, embeddings, extraction. Right, decoder-only (GPT · Llama · Mistral · Claude): trained to predict the next token, each token sees only past tokens; typical uses are generation, chat, prompting.]

Left: an encoder-only model's attention matrix — every token attends to every other token in both directions. Right: a decoder-only model's attention matrix — lower-triangular, each token attending only to past positions. The difference in attention pattern follows from the difference in training objective.

Encoder-only: masked language modeling

Encoder-only models are trained with the Masked Language Model (MLM) objective, introduced by BERT (Devlin et al., 2019). At training time, roughly 15% of input tokens are randomly replaced with a special [MASK] token. The model predicts the original token at each masked position, using context from both sides simultaneously. Because prediction can draw on future tokens, no causal mask is needed or used.

Definition

Masked Language Model (MLM)

A training objective in which a random subset of input tokens (typically 15 %) is replaced by a special [MASK] token, and the model is trained to predict the original tokens from the surrounding context. Unlike causal language modeling, the model can use context from both directions simultaneously.

The payoff is contextual representations that are informed by the complete surrounding context at once. A [CLS] token prepended to the input accumulates a sequence-level representation that can be fed to a classifier; alternatively, individual token vectors can be pooled. Either way, the model's output is a set of rich embeddings — not a probability distribution over next tokens.

This makes encoder-only models the natural choice for:

  • Text classification: sentiment, ideology, topic, relevance labelling. Fine-tune the [CLS] representation on labelled examples and the model learns to separate categories in its embedding space.
  • Named-entity recognition and extraction: token-level predictions (per-position output rather than a single sequence label).
  • Semantic similarity and dense retrieval: sentence-transformer models (e.g. SBERT) use pooled encoder outputs to embed documents into a comparable vector space — the same space powering RAG pipelines (Module 4).

The key limitation: encoder-only models are not designed to generate free text. There is no autoregressive decoding loop, no next-token sampling. Prompting an encoder model with "Summarise this document" does not produce a summary.

Notable encoder-only models: BERT (Devlin et al., 2019); RoBERTa (Liu et al., 2019), a robustly optimised BERT with better training recipes and no next-sentence prediction (NSP) objective; and DeBERTa (He et al., 2021), which adds disentangled attention over content and position. RoBERTa and DeBERTa remain strong baselines for classification tasks in social science.

Encoder–decoder: sequence-to-sequence

A third family combines both components: the encoder reads the full input with bidirectional attention; the decoder generates the output causally, attending to both its own past tokens and the encoder's representations (via cross-attention). This sequence-to-sequence design is suited to tasks where the output is a transformed version of the input: translation, abstractive summarisation, structured extraction. Key models include T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). In practice, large decoder-only models prompted appropriately handle many of these tasks, so encoder-decoder models are less dominant in current workflows — but worth knowing when you encounter older pipelines or fine-tuned summarisation models.

Social Science Application. The dominant workflow in computational social science text annotation has been: fine-tune a pre-trained encoder (typically RoBERTa) on a labelled sample of your corpus, then apply it at scale. The performance gains from moving to RoBERTa over earlier methods were large (Laurer et al., 2024). Module 3 covers how to build, evaluate, and validate this pipeline rigorously — including the question of when a fine-tuned encoder beats a prompted decoder-only model, and when it does not.

Stop and Think

In the attention walkthrough above, the raw dot products were divided by √dk before softmax. With our tiny dk = 4 the difference is modest. What would happen in a real model with dk = 64 or 128?

Reveal

The dot products scale with dimension: their variance grows roughly proportional to dk. Without the √dk divisor, large dot products push softmax into near-zero-gradient regions (outputs very close to 0 or 1), making learning extremely slow. The scaling keeps variance near 1 so softmax stays in its useful range.

Key Takeaway

The Transformer replaced sequential processing with parallel attention, solving the polysemy problem along the way. Every token can attend to any other token directly, regardless of distance — and in decoder-only models, a causal mask restricts this to past positions, enabling left-to-right generation. The result: contextual embeddings, where the same word produces different vectors in different contexts. This is the foundation for everything that follows in this course.

In the notebook: Section 4 walks you through self-attention by hand. You assign attention weights manually, then compare your intuitions against the model's actual attention patterns.

Resources

Language Modeling

We now have the architecture: a machine that can read a sequence of tokens and build context-aware representations for each one. But what does it learn? The answer is disarmingly simple: predict the next token. This training objective sounds trivial. Its consequences are not.

Definition

Autoregressive Language Model

A model that generates a sequence one token at a time, left to right. At each step it predicts a probability distribution over the vocabulary for the next token, conditioned on all tokens generated so far.

The Chain Rule of Probability

Any joint probability over a sequence can be decomposed into a product of conditional probabilities. This is not an approximation: it is an exact identity from probability theory:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{\lt t})$$

The joint probability of a sequence equals the product of each token's probability given all preceding tokens $x_{\lt t}$.

In plain language: the probability of a whole sentence equals the probability of the first word, times the probability of the second word given the first, times the probability of the third given the first two, and so on. An autoregressive language model learns each of these conditional distributions.

Why Next-Token Prediction Is So Powerful

Consider what it takes to predict the next word accurately across diverse text. To predict the word after "The capital of France is," the model must learn patterns that function like factual recall. To predict the next word in a Python function, it must capture regularities that mirror syntactic knowledge. To continue a logical argument, it must develop internal representations that approximate reasoning.

The prediction objective is simple, but solving it well on internet-scale data requires the model to build sophisticated internal representations of grammar, semantics, facts, and reasoning patterns. These representations then transfer to downstream tasks: summarisation, translation, question answering, and more.

[Diagram: the input sequence "The cat sat on the" enters the model, which outputs a probability distribution over the next token, e.g. "mat" 42%, "table" 18%, "floor" 12%.]

An autoregressive model takes a sequence of tokens and outputs a probability distribution over the vocabulary for the next token. During generation, it samples from this distribution, appends the chosen token, and repeats.
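A minimal decoding loop that makes this cycle concrete, assuming the Hugging Face transformers library and the small gpt2 checkpoint (downloaded on first run); production systems add temperature, top-p filtering, and batching, but the predict, sample, append pattern is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # small, freely available model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                                # generate five tokens, one at a time
        logits = model(input_ids).logits[0, -1]       # scores for the next token only
        probs = torch.softmax(logits, dim=-1)         # distribution over the vocabulary
        next_id = torch.multinomial(probs, num_samples=1)          # sample one token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)  # append and repeat

print(tokenizer.decode(input_ids[0]))
```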

From Hidden State to Logits: The Unembedding

The diagram above shows a probability distribution appearing at the output, but we have not yet said how it gets there. After the input tokens are processed through all N Transformer blocks, each position t has a final hidden state ht ∈ ℝd — a rich, context-informed vector. To produce predictions over the vocabulary, one more step is needed: the unembedding.

A learned weight matrix WU ∈ ℝ|𝒱|×d projects the final hidden state to a vector of |𝒱| raw scores, one per vocabulary token. These scores are the logits.

$$\mathbf{z}_t = \mathbf{h}_t\,W_U^\top, \qquad W_U \in \mathbb{R}^{|\mathcal{V}| \times d}$$

The final hidden state $\mathbf{h}_t \in \mathbb{R}^d$ is projected to a vector of $|\mathcal{V}|$ raw scores (logits) by the unembedding matrix $W_U$. Softmax then turns these into the next-token probability distribution. In some model families (GPT-2, BERT), $W_U = W_E^\top$ (weight tying): the embedding and unembedding matrices share parameters. Many recent LLMs (Llama, Mistral) use separate matrices instead.

[Diagram: last Transformer block (layer N) → hidden state ht ∈ ℝd → unembedding WU → logits zt ∈ ℝ|𝒱| → softmax → P(w | x<t).]

The output pipeline for a single position. The final hidden state is projected to one logit per vocabulary token by the unembedding matrix; softmax normalises these into a probability distribution. In some models (GPT-2, BERT) the unembedding matrix is the transpose of the input embedding matrix (weight tying); many recent LLMs use separate matrices instead.
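In code, the unembedding step is a single matrix multiplication followed by softmax; the dimensions below are toy values.

```python
import numpy as np

d_model, vocab_size = 8, 10           # toy sizes; real models: d ~ 4096+, |V| ~ 32k-128k
rng = np.random.default_rng(3)

h_t = rng.normal(size=d_model)                  # final hidden state at position t
W_U = rng.normal(size=(vocab_size, d_model))    # unembedding matrix

logits = W_U @ h_t                              # one raw score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax: the next-token distribution
print(probs.round(3), probs.sum())              # probabilities sum to 1
```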

Learning the conditionals

The chain rule is not only a way to write the joint probability of a sequence: it is also the blueprint for learning. Each factor $P(x_t \mid x_{\lt t})$ is one prediction problem at position $t$. We just saw the pipeline that produces a prediction: the final hidden state is projected to logits, and softmax turns them into a distribution over the full vocabulary. The formal version:

$$P_\theta(w \mid x_{\lt t}) = \frac{\exp(z_w)}{\sum_{w' \in \mathcal{V}} \exp(z_{w'})}$$

Logits $z$ come from the last linear layer of the model; softmax turns them into a valid probability distribution over the entire vocabulary $\mathcal{V}$. Learning adjusts $\theta$ so that mass concentrates on the token that actually appears in the data—by competing across all alternatives, not by ignoring the rest.

Training texts are treated as samples from the data-generating process we want to approximate; standard practice is to maximize the likelihood of the observed tokens under the model. Taking logs turns the product of conditionals into a sum that is easier to optimize:

$$\log P_\theta(x_1, \ldots, x_T) = \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{\lt t})$$

Maximum likelihood estimation chooses $\theta$ to make the observed sequence as probable as possible under the model. Parameters $\theta$ enter through the neural network that defines each conditional $P_\theta(\cdot \mid x_{\lt t})$.

The cross-entropy objective

Maximizing log-likelihood is the same as minimizing the negative log-likelihood. At each position we pay a penalty $-\log P_\theta(x_t^\ast \mid x_{\lt t})$ that is large when the model assigns low probability to the true next token $x_t^\ast$ and small when it assigns high probability. That per-step quantity is the cross-entropy between a one-hot target (the actual token) and the model's predicted distribution:

$$\ell_t = -\log P_\theta(x_t^\ast \mid x_{\lt t})$$

At position $t$, $x_t^\ast$ is the true next token from the training text. This is the cross-entropy (log loss) for a single step: it is high when the model assigns low probability to the correct token.

Averaging over the $T$ positions in a sequence (and, in practice, over minibatches of sequences) yields the standard loss used in pre-training. It is the average negative log-likelihood—often called cross-entropy loss or simply loss:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{\lt t})$$

The usual training loss is the average negative log-likelihood per token. Minimizing $\mathcal{L}$ is equivalent to maximum likelihood. Gradients flow back through softmax and the rest of the network so that predicted probabilities improve.

Training minimizes $\mathcal{L}$ on the training corpus. Test loss is the same functional form evaluated on held-out text: how surprised the model is, on average, by tokens it did not train on. Scaling-law curves plot precisely this quantity (or equivalent summaries such as perplexity below) as compute, data, or model size grow.

Perplexity: Measuring Model Quality

The average negative log-likelihood is the quantity we minimize directly, but it is not the most intuitive scale. A common alternative is perplexity, which re-expresses the same underlying quantity as an "effective branching factor" per step. To compare language models on held-out data, compare their perplexity (or equivalently their average NLL): lower means the model assigns higher probability to the actual text and predicts more accurately.

$$\text{PPL}(X) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{\lt t})\right)$$

Perplexity is the exponentiated average negative log-likelihood. Lower is better. A perplexity of $k$ means the model is, on average, as uncertain as choosing uniformly among $k$ options.

Intuitively, a perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options. A perfect model that always predicts the right token would have perplexity 1. The next section takes up the same average negative log-likelihood (or perplexity) as an empirical phenomenon: scaling laws describe how it decreases when you increase compute, data, or model size.
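A toy illustration connecting the loss and perplexity formulas; the per-token probabilities are invented.

```python
import numpy as np

# Model-assigned probabilities for the true next token at 5 positions (illustrative).
p_true = np.array([0.42, 0.10, 0.65, 0.03, 0.30])

per_token_nll = -np.log(p_true)          # cross-entropy at each position
loss = per_token_nll.mean()              # average negative log-likelihood
perplexity = np.exp(loss)                # exponentiated average NLL

print(per_token_nll.round(2))            # low-probability true tokens cost the most
print(round(loss, 3), round(perplexity, 2))
# A perfect model (p_true = 1 everywhere) gives loss 0 and perplexity 1.
```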

Stop and Think

A model trained only to predict the next word can summarise documents, answer questions, translate between languages, and write code. Why would next-token prediction produce such general capabilities?

Reveal

To predict the next token accurately across diverse text, the model must build internal representations of syntax, semantics, factual knowledge, reasoning patterns, and more. The prediction objective is simple, but solving it well on internet-scale data requires sophisticated internal representations that transfer to downstream tasks. In a sense, next-token prediction is a universal learning signal: any pattern in text that helps prediction can, in principle, be learned.

In the notebook: Section 5 lets you generate text from a real language model, experiment with temperature and sampling strategies, and compute perplexity on different texts.

Resources

Scaling, Data & Capabilities

We now have all the ingredients: embeddings, tokenization, the Transformer, and the next-token prediction objective—minimizing average cross-entropy (negative log-likelihood) on training text. What happens when we make this bigger? One of the most consequential empirical findings in modern AI is that test loss—the same loss on held-out data—improves predictably as you increase compute, data, or model size, following smooth power-law curves. But the deeper story is what these improvements mean in practice: lower loss translates into qualitatively new capabilities.

Three Power Laws

Kaplan et al. (2020) showed that test loss follows the same functional form against three independent variables: total compute, dataset size, and number of model parameters. On a log-log plot, each relationship is a straight line.

$$L(X) \propto X^{-\alpha}$$

Loss scales as a power law with each of three variables: compute $C$, dataset size $D$, and model parameters $N$. Each follows the same functional form with different exponents $\alpha$.

Kaplan et al. modelled each variable's contribution separately. Double the compute, and loss drops by a predictable factor. Double the data, same pattern. Double the parameters, same again. The exponents differ, but the power-law form holds across model families and training setups.

[Figure: three schematic log-log plots of test loss against compute (PF-days), dataset size (tokens), and model parameters. Each panel is a straight line on log-log axes.]

Test loss as a function of compute, dataset size, and model parameters (log-log scale). Each relationship follows a smooth power law: a straight line on log-log axes. Schematic illustration of the relationship described in Kaplan et al. (2020).
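A numeric sketch of the power-law form with invented constants, just to show why the curve is a straight line on log-log axes yet delivers shrinking absolute gains.

```python
# Illustrative exponent and scale only; these are not fitted values from any paper.
alpha, a = 0.05, 5.0

def loss(compute):
    return a * compute ** -alpha          # L(C) = a * C^(-alpha)

for c in [1e3, 1e4, 1e5, 1e6, 1e7]:
    print(f"compute {c:>9.0e}  loss {loss(c):.3f}")

# Each 10x increase in compute multiplies loss by 10**-alpha (~0.89 here):
# a constant ratio (hence a straight line on log-log axes), but a shrinking
# absolute improvement as the loss itself gets smaller.
print(round(10 ** -alpha, 3))
```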

From Lower Loss to New Capabilities

Power laws are not just descriptive. They are predictive: train a few small models, fit the curve, and extrapolate. This is how frontier models get planned. But for social scientists, the question is not "what will the loss be?" It is: what can the model actually do at this scale?

The empirical record shows that as models get larger and loss decreases, qualitatively new behaviours appear. GPT-2 (1.5B parameters) could generate coherent paragraphs but struggled with factual recall. GPT-3 (175B) demonstrated few-shot learning: give it a handful of examples in the prompt, and it could perform new tasks without any fine-tuning (Brown et al., 2020). Subsequent models showed improved performance on arithmetic, code generation, multi-step reasoning, and nuanced text understanding.

Whether these transitions represent genuine "emergent" capabilities or are artifacts of how we measure performance is an active debate. Schaeffer et al. (2024) argued that apparent emergence often disappears when you use continuous rather than threshold-based metrics. The practical implication, however, is real: a 70B model handles nuanced political text classification substantially better than a 7B model, and model choice is a research design decision with predictable consequences for task performance.

Diminishing Returns and Complementary Strategies

Power laws have diminishing returns. Each 10× increase in compute yields a smaller absolute improvement. Hoffmann et al. (2022) showed that most early large models were undertrained: balancing model size and training data matters as much as raw scale.

These constraints are pushing the field toward complementary strategies: better data curation, improved architectures (mixture-of-experts), reasoning-focused training, and a fundamentally different approach: test-time compute scaling, spending more compute at inference rather than training.

Pre-training Data: What Models Learn From

The D in the scaling law is not an abstraction. Modern frontier models are trained on 1–15 trillion tokens of text: hundreds of times the volume of English Wikipedia, drawn from web pages, books, code, academic papers, forums, and more. The composition of that data determines what a model "knows," what perspectives it reflects, and where it systematically fails — making it a direct concern for research validity.

The dominant ingredient is Common Crawl — a periodic snapshot of the public web, measured in petabytes. Models do not train on it raw. The standard pipeline filters aggressively for quality (removing spam, near-duplicates, and incoherent text), then upweights higher-quality sources: books, curated encyclopaedias, code repositories, and academic papers. The exact mixture matters — Hoffmann et al. (2022) showed that how tokens are selected is as consequential as how many there are. For closed models such as GPT-4 the recipe is proprietary. For open models such as Llama 3, the broad composition is at least described.

Several properties of this data pipeline carry direct methodological implications for social science:

  • Knowledge cutoffs. Training data is collected up to a fixed date; events after that date are invisible to the model. This is not a minor caveat. If you are analysing discourse about a recent election, a legislative session, or a breaking crisis, you need to verify whether the model's training predates it. Most frontier models publish their cutoff; check it before designing a study.
  • Linguistic and cultural skew. English accounts for roughly 50–70% of most pre-training corpora, despite representing a small fraction of world languages. Within English, the internet over-represents Western, urban, formally-educated writing. A model's implicit priors — what counts as a "normal" political view, a "typical" family, a "reasonable" argument — are shaped by this skew. Santurkar et al. (2023) showed that the opinions expressed by several major LLMs align most closely with liberal, educated, white Americans rather than the broader public.
  • Domain depth varies. Domains with heavy online presence in English — mainstream news, scientific literature, software documentation — are well-represented. Oral traditions, regional political systems, non-Western legal codes, and paywalled corpora are not. A model may handle US Senate floor speeches well and struggle with municipal council minutes in Welsh or parliamentary debates in Swahili.
  • Benchmark contamination. Evaluation datasets are text. If they appear in the training corpus — and many do — the model's performance on them measures memorisation as much as generalisation. This is a live problem for published leaderboards (Jacovi et al., 2023), and a reason to prefer held-out or custom evaluation sets in published research.

Social Science Application. Santurkar et al. (2023) systematically measured the opinion distributions expressed by several major LLMs against US public opinion polls. They found that models do not reflect a neutral or globally-representative viewpoint: responses aligned most strongly with the demographic profile of liberal, college-educated, white Americans. For any research that elicits model "opinions" or uses models to simulate survey respondents, this skew is a validity threat that must be acknowledged and, where possible, tested for.

Key Takeaway

Pre-training data is part of your measurement instrument, not just an engineering detail. A model trained primarily on English web text from before a given date, filtered by quality criteria that are often unpublished, carries specific demographic biases, temporal blind spots, and domain limitations. Using such a model for social science research requires characterising these properties — not as disclaimers, but as part of a rigorous measurement strategy.

Coming in Module 2: We explore test-time compute scaling: using prompting strategies like chain-of-thought reasoning to extract more capability at inference. This represents a fundamental shift: instead of only scaling training, we can also scale thinking.

Social Science Application. Spirling (2023) argues in Nature Computational Science that transparency, reproducibility, and data control require access to open-source models. When a model is a black box controlled by a company, researchers cannot fully understand or verify their instrument. The architectural foundations covered in this module (pre-training data, model weights, tokenization) connect directly to fundamental questions of scientific integrity.

Social Science Application. Bail (2024) offers in PNAS a balanced assessment of integrating LLMs into social science workflows. He emphasises valid measurement, reproducibility, and the distinction between tasks where LLMs augment human researchers and tasks where they might introduce systematic error. A good frame for everything in this course.

Stop and Think

If scaling laws have diminishing returns, what strategies beyond raw scale might improve model performance? Think about the full pipeline: data, architecture, training objective, and inference.

Reveal

Several complementary approaches are being explored. Data curation: higher-quality, deduplicated, and domain-specific training data yields more per token. Architecture improvements: mixture-of-experts, longer contexts, and more efficient attention variants. Training objectives: reinforcement learning from human feedback (RLHF), constitutional AI, and reasoning-focused training. Test-time compute: chain-of-thought prompting, self-consistency, and tree search let models "think longer" at inference: a form of scaling that does not require retraining.

Resources

Context, Memory & Cost

Everything in this module connects to practical realities you will face when using LLMs for research. The Transformer architecture has a direct consequence that shapes how you work with these models: attention computation scales quadratically with sequence length. Whether a model uses full bidirectional attention (every token attending to every other, as in encoder-only models) or causal attention (each token attending only to preceding tokens, as in decoder-only models), the attention matrix has O(n²) entries. Understanding this relationship is essential for budgeting research projects and making informed model choices.

The Quadratic Cost of Attention

In standard self-attention, each of n tokens computes attention scores against all n tokens. The attention matrix has n² entries. Double the sequence length, and the computation quadruples. This is why processing a 200-page PDF is fundamentally more expensive than a one-paragraph email, and why context windows have hard upper limits.

The KV Cache: Why Inference Eats Memory

During text generation, the model produces tokens one at a time. To avoid recomputing attention over the entire sequence at each step, models cache the Key and Value matrices for all previously generated tokens. This is the KV cache.

$$\text{Memory}_{\text{KV}} = 2 \times n_{\text{layers}} \times d_{\text{model}} \times n_{\text{tokens}} \times \text{precision}$$

The KV cache stores two matrices (Keys and Values) for every layer, for every token generated so far. This is why long conversations consume GPU memory and why context windows have hard limits.

The KV cache grows linearly with sequence length and linearly with the number of layers. For a model like Llama-3-70B (80 layers, d_model = 8192) processing 8,192 tokens in FP16 precision, the KV cache alone requires roughly 20 GB of GPU memory before compression techniques like Grouped-Query Attention (the calculator below lets you toggle GQA to see the difference). This is why longer conversations consume more resources and why models get slower with longer inputs.

[Interactive calculator: choose a model and a sequence length (512 to 128K tokens) to see the KV cache as a fraction of a 24 GB GPU's memory, alongside the model weights.]
Methodology & assumptions

KV cache memory = 2 × n_layers × d_model × n_tokens × bytes_per_param. Factor of 2 is for Key and Value matrices. FP16 precision (2 bytes) is used throughout.

Attention FLOPs = 2 × n_tokens² × d_model per layer (for QK^T and attention × V), summed over all layers. This is the attention-only cost; feed-forward layers add roughly 2× more.

GQA savings: Grouped-Query Attention shares each Key/Value head across several query heads, shrinking the KV cache by the ratio of query heads to KV heads. Llama-3-8B uses 8 KV heads against 32 query heads (a 4× reduction); Llama-3-70B uses 8 KV heads against 64 query heads (an 8× reduction).

Model weight memory is estimated as parameters × 2 bytes (FP16). Real deployments often use quantization (4-bit, 8-bit) which reduces this.

GPU memory baseline: 24 GB (typical consumer/research GPU, e.g. RTX 4090 or A10G). Models requiring more than this need multi-GPU setups or quantization.
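A minimal calculator implementing the assumptions above; the Llama-3 configurations are approximate public values, and the GQA reduction is taken as the ratio of query heads to KV heads.

```python
def kv_cache_gb(n_layers, d_model, n_tokens, bytes_per_param=2, kv_reduction=1):
    """2 (K and V) x layers x d_model x tokens x precision, divided by the GQA reduction.
    Returns gibibytes (2**30 bytes)."""
    return 2 * n_layers * d_model * n_tokens * bytes_per_param / kv_reduction / 2**30

# Approximate public configurations; kv_reduction = query heads / KV heads.
models = {
    "Llama-3-8B":  dict(n_layers=32, d_model=4096, kv_reduction=4),   # 32 Q heads, 8 KV heads
    "Llama-3-70B": dict(n_layers=80, d_model=8192, kv_reduction=8),   # 64 Q heads, 8 KV heads
}

for name, cfg in models.items():
    for n_tokens in (8_192, 131_072):
        no_gqa = kv_cache_gb(cfg["n_layers"], cfg["d_model"], n_tokens)
        gqa = kv_cache_gb(n_tokens=n_tokens, **cfg)
        print(f"{name:>12s} @ {n_tokens:>7,d} tokens: "
              f"{no_gqa:6.1f} GB without GQA, {gqa:6.1f} GB with GQA")
```

At FP16 precision, the 70B configuration at 8,192 tokens lands at roughly 20 GB without GQA, matching the figure quoted above; with GQA the same cache fits in a few gigabytes.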

What This Means for Your Research

These architectural constraints translate directly into research design decisions. The calculator above lets you see the tradeoffs concretely: pick a model, increase the sequence length, and watch memory requirements grow. A few implications worth internalising:

Context window limits. Every model has a maximum context length (4K, 32K, 128K, or more tokens). If your documents exceed this, you need a strategy: summarisation, chunking, or retrieval-augmented generation (Module 4). The context window is not just a technical limit; it shapes what questions you can ask of a single model call.

Cost scales with tokens. API providers charge per token, and longer inputs cost more. When processing a corpus of 10,000 parliamentary speeches, the total cost depends on how many tokens each speech contains, which in turn depends on the tokenizer (recall the multilingual inequality from the tokenization section). Budget accordingly.

Efficiency innovations matter. Toggle the GQA switch in the calculator to see how grouped-query attention reduces memory. Techniques like GQA and multi-latent attention (MLA, used in DeepSeek models) compress Key/Value representations across attention heads. These are not just engineering details: they determine which models can run on available hardware and at what cost.

Key Takeaway

Context windows, token costs, and memory constraints are not incidental engineering details. They are parameters of your research design. Choosing a model, setting a context length, and deciding how to handle long documents are methodological decisions that belong in your methods section, informed by the architectural foundations covered in this module.

Module Summary

This module traced the path from raw text to a functioning language model. Embeddings map words to a geometric space where proximity encodes meaning, turning text into numbers that models can process. Tokenization determines the units the model actually sees, with direct consequences for multilingual research and cost. The Transformer replaces sequential processing with parallel attention, producing contextual representations where the same word gets different vectors in different contexts.

Language modeling gives the Transformer its learning signal: predict the next token. This deceptively simple objective, optimised via cross-entropy loss on internet-scale data, forces the model to build internal representations of syntax, facts, and reasoning. Scaling laws describe how predictably loss decreases with more compute, data, and parameters, and how the composition of pre-training data shapes what a model knows and where it systematically fails. Finally, context windows and memory costs translate these architectural choices into practical constraints that shape research design.

Every concept introduced here recurs throughout the course. Prompting (Module 2) works because of next-token prediction. Classification and RAG (Modules 3–4) rely on contextual embeddings. Model selection requires understanding the scaling and cost tradeoffs introduced here.

Coming in Module 2: We move from understanding these foundations to using them. You will learn how post-training transforms a next-token predictor into a useful assistant, and how prompting lets you steer model behaviour for research tasks.