Module 1 of 5

Foundations

From Embeddings to Transformers: how machines represent meaning as geometry, and how the architecture that changed everything actually works.

Representing Meaning

Before a language model can do anything useful, it needs a way to represent words as numbers. Not just any numbers: numbers that capture meaning. This section traces the path from the simplest text representations to the geometric spaces that power modern NLP.

Bag of Words: Counting What Appears

The simplest idea: count which words appear. A document mentioning "economy," "inflation," and "growth" is probably about economic policy. One mentioning "goal," "league," and "match" is probably about sports.

This is called a bag-of-words (BoW) representation. It throws away word order, grammar, and nuance, keeping only word frequencies. Despite this, BoW has powered decades of productive social science research: TF-IDF for document retrieval, topic models like LDA and STM for discovering themes in large corpora, and dictionary methods for measuring sentiment or policy focus.

But BoW has a hard ceiling. Every word is a separate dimension, equally distant from every other word. "Cat" is as far from "dog" as from "democracy." Two sentences about the same topic can look completely different if they use different vocabulary. BoW captures what words appear, but not what they mean.
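Despite its limits, BoW is trivially easy to implement, which is part of its appeal. A minimal sketch (the two-document corpus is invented for illustration):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

# Two toy documents (invented for illustration)
docs = ["the economy and inflation", "the match and the league"]

# Shared vocabulary: one dimension per distinct word, in sorted order
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [bow_vector(d, vocab) for d in docs]

print(vocab)    # ['and', 'economy', 'inflation', 'league', 'match', 'the']
print(vectors)  # [[1, 1, 1, 0, 0, 1], [1, 0, 0, 1, 1, 2]]
```

Notice that the two vectors share almost no dimensions beyond stopwords: exactly the vocabulary-mismatch problem described above.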

Word Embeddings: Meaning as Geometry

The breakthrough idea (Mikolov et al., 2013): instead of treating every word as a separate symbol, learn a dense vector for each word from its context. Words that appear in similar contexts get similar vectors.

This is the distributional hypothesis: "you shall know a word by the company it keeps" (Firth, 1957). A word that often appears near "government," "legislation," and "vote" will end up with a vector close to other political terms, even if they never co-occurred in the same sentence.

Definition

Word Embedding

A learned mapping $f: \mathcal{V} \to \mathbb{R}^d$ from a vocabulary $\mathcal{V}$ to a $d$-dimensional real vector space, where semantic and syntactic relationships between words are encoded as geometric relationships between their vectors.

Word2Vec learns embeddings by training a shallow neural network on a simple task: given a word, predict its neighbours (Skip-gram), or given the neighbours, predict the word (CBOW). GloVe (Pennington et al., 2014) achieves a similar result by factorising a global word co-occurrence matrix. Both produce vectors where semantic similarity corresponds to geometric proximity.

(Figure: words such as "government," "economy," "river," and "mother" plotted in 2D, forming four labelled clusters: Politics, Economy, Nature, and Family.)

Schematic illustration of word embeddings in 2D. In a trained embedding space, words from the same domain cluster together, even though the model was never told these categories exist. This structure emerges from word co-occurrence patterns. (Layout is simplified for clarity; real projections via PCA or t-SNE are noisier.)

Cosine Similarity: Measuring Closeness

How do we measure whether two word vectors are "close"? The standard tool is cosine similarity: it measures the angle between two vectors, ignoring their length.

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|}$$

The result is the cosine of the angle between the two vectors: 1.0 means identical direction, 0.0 means orthogonal (no overlap), and −1.0 means opposite directions.
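The formula translates directly into code. A from-scratch sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  (same direction, length ignored)
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

The first line of output shows why cosine similarity is preferred over Euclidean distance for embeddings: [1, 0] and [2, 0] differ in magnitude but point the same way, and magnitude often reflects word frequency rather than meaning.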


Cosine similarity measures the angle between vectors, not their magnitude. Drag vector b to see how the cosine value changes: 1.0 when aligned, 0.0 when orthogonal, −1.0 when opposite.

Analogy Arithmetic

The most striking property of word embeddings: relationships become directions. The direction from "man" to "king" captures something like "royalty." Add that same direction to "woman," and you land near "queen."

$$\vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$$

Vector arithmetic captures semantic relationships. The direction from "man" to "king" is approximately the same as the direction from "woman" to "queen."

This works because the embedding space organises concepts along roughly consistent axes. Gender is one direction. Geography is another ("Paris − France + Germany ≈ Berlin"). Tense, plurality, and many other relationships are encoded as well. That said, subsequent work has shown that analogy results are sensitive to evaluation methodology and less robust than the original papers suggested: the effect is real, but noisier than the clean examples imply.
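With toy vectors the arithmetic is easy to verify by hand. The sketch below uses invented 2-dimensional embeddings (dimension 0 loosely encodes "royalty," dimension 1 "gender"); real experiments use trained vectors with hundreds of dimensions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Invented 2D vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender"
emb = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by finding the nearest neighbour of b - a + c."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]  # standard exclusion
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman", emb))  # queen
```

Note the exclusion of the three query words from the candidate set: without it, the nearest neighbour of b − a + c is often b or c itself, one of the methodological sensitivities mentioned above.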

Semantic Projection: Reading Hidden Dimensions

A 100-dimensional vector seems abstract, but it encodes rich, interpretable structure. The technique of semantic projection reveals this: define a meaningful direction using pairs of anchor words, then project other words onto that axis to see where they fall.

For example, a "gender" direction can be defined using anchors like (man, woman), (he, she), (him, her). Projecting occupations onto this axis reveals how strongly the training corpus associates each occupation with gender. Caliskan et al. (2017) showed that embeddings replicate a wide range of implicit biases measured in humans: a finding with profound implications for social science.
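A sketch of the projection recipe, using invented 2-dimensional vectors in place of trained GloVe embeddings (the sign convention, positive toward the "female" anchors, is a choice made here for illustration):

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gender_axis(pairs, emb):
    """Average the (female - male) difference over anchor pairs, then normalise."""
    d = len(next(iter(emb.values())))
    axis = [0.0] * d
    for male, female in pairs:
        for i in range(d):
            axis[i] += emb[female][i] - emb[male][i]
    return normalise(axis)

def project(word, axis, emb):
    """Signed projection onto the axis: positive leans toward the female anchors."""
    return sum(emb[word][i] * axis[i] for i in range(len(axis)))

# Invented 2D vectors standing in for trained GloVe embeddings; the occupation
# positions mimic the corpus biases documented by Caliskan et al. (2017)
emb = {
    "man": [1.0, 0.0], "woman": [-1.0, 0.0],
    "he":  [0.9, 0.1], "she":   [-0.9, 0.1],
    "nurse": [-0.5, 0.6], "engineer": [0.5, 0.6],
}
axis = gender_axis([("man", "woman"), ("he", "she")], emb)
print(project("nurse", axis, emb), project("engineer", axis, emb))
```

With real embeddings, the same three functions reveal how strongly the training corpus associates each occupation with gender.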


Occupations projected onto a gender axis derived from GloVe embeddings. The bias reflects patterns in the training corpus, not ground truth.

Stop and Think

If word embeddings place semantically similar words near each other, what happens to words with multiple meanings, like "bank" (river bank vs. financial bank)? What limitation does this reveal about static embeddings?

Reveal

Static embeddings assign a single vector per word, regardless of context. "Bank" gets one representation that averages across all its senses. This is a fundamental limitation that contextual embeddings (produced by Transformers) address by generating different representations for the same word depending on surrounding context. We return to this in the From Static to Contextual section.

In the notebook: Exercises 1–3 walk you through building BoW vectors, implementing cosine similarity from scratch, testing word analogies, and projecting occupations onto a gender axis.


Tokenization

In the previous section, we talked about vectors for "words." But what counts as a word? Models don't actually operate on words as we think of them. Before text enters a model, it must be split into discrete units the model can process. This step, tokenization, bridges raw text and the embedding layer. The choice of tokenization strategy has real consequences for what the model "sees" and how well it handles different languages, numbers, and edge cases.

The Problem: Words Are Not Enough

A word-level vocabulary is appealing but impractical. English alone has hundreds of thousands of words. Add misspellings, names, code, and other languages, and the vocabulary explodes. Any word not in the vocabulary becomes an unknown <UNK> token, invisible to the model.

Character-level tokenization solves the unknown-word problem (any text can be spelled out letter by letter) but creates sequences that are extremely long and hard to learn from. The model must figure out, from individual letters alone, that "c-a-t" forms a single concept.

Subword tokenization is the compromise that modern LLMs actually use. It keeps common words as single tokens ("the," "and") but breaks rare words into smaller, reusable pieces ("unhappiness" → "un" + "happiness" or "un" + "happi" + "ness").

Definition

Byte-Pair Encoding (BPE)

A tokenization strategy that splits text into units smaller than words but larger than characters. Starting from individual bytes or characters, BPE iteratively merges the most frequent adjacent pair until a target vocabulary size is reached.

How BPE Works

BPE, introduced by Sennrich et al. (2016), is elegantly simple. Start with individual characters. Count every adjacent pair in the training corpus. Merge the most frequent pair into a new token. Repeat until you reach the desired vocabulary size.

  • Step 0, characters (6 tokens): l o w e s t
  • Step 1, merge "es", the most frequent pair in the corpus (5 tokens): l o w es t
  • Step 2, merge "est", the next most frequent pair (4 tokens): l o w est

BPE repeats this until the vocabulary reaches its target size.

BPE starts with individual characters and iteratively merges the most frequent adjacent pair. Common words end up as single tokens; rare words are composed of reusable subword pieces.
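The merge loop can be implemented in a few lines. The toy corpus and its word frequencies below are invented, and ties between equally frequent pairs are broken here by first occurrence, a detail real implementations handle in various ways:

```python
from collections import Counter

# Toy corpus as word -> frequency (counts invented for illustration);
# each word starts as a tuple of single characters
corpus = {"low": 5, "lower": 2, "lowest": 6, "newest": 3}
words = {tuple(w): f for w, f in corpus.items()}

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for syms, freq in words.items():
        for p in zip(syms, syms[1:]):
            pairs[p] += freq
    return max(pairs, key=pairs.get)  # ties broken by first occurrence

def apply_merge(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    out = {}
    for syms, freq in words.items():
        merged, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                merged.append(syms[i] + syms[i + 1])
                i += 2
            else:
                merged.append(syms[i])
                i += 1
        out[tuple(merged)] = freq
    return out

merges = []
for _ in range(4):  # four merge steps
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = apply_merge(words, pair)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

After four merges, the frequent word "low" is a single token while "lowest" is the reusable pieces "low" + "est", exactly the behaviour described above.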

Why Tokenization Matters for Research

Tokenization is not just preprocessing. It fundamentally shapes what a model "sees."

Multilingual inequality. Tokenizers are trained predominantly on English text. The same sentence in Hindi, Arabic, or Yoruba often requires two to ten times more tokens than in English. More tokens means lower effective resolution, higher API costs, and faster context-window exhaustion. When working with multilingual social science data, always check how your text tokenizes.

Number fragmentation. Numbers are often split into seemingly arbitrary pieces ("2024" → "202" + "4"). This is why models struggle with arithmetic: they do not see "2024" as a single number.

The "strawberry" problem. Ask a model "how many r's in strawberry?" and it may answer incorrectly. The tokenizer splits "strawberry" into subwords like "straw" + "berry", so the model never sees the individual letters and cannot count them.

Key Takeaway

Tokenization is part of your measurement instrument. When you send text to an LLM, the tokenizer decides how that text is represented. Multilingual tokenizers often allocate fewer tokens to non-English text, giving the model lower effective resolution for those languages. Before running experiments, inspect your tokenization.

In the notebook: Section 3 lets you explore tokenization hands-on: comparing how different models split the same text and observing multilingual tokenization inequality directly.


From Static to Contextual

We now know how text is split into tokens, and that each token gets its own embedding vector. But recall a limitation we flagged in the first section: static embeddings give each word type a single vector. "Bank" gets one representation that averages across all its senses: river bank, financial bank, blood bank. This section shows how modern models solve that problem.

The Polysemy Problem

Consider these two sentences:

  • "I deposited money at the bank."
  • "We sat on the bank of the river."

With GloVe, both instances of "bank" get the exact same vector. A model using static embeddings cannot distinguish the two meanings. It must rely on other words in the sentence to disambiguate, but the embedding itself carries no context.

Contextual Representations

A Transformer-based model builds a different representation for each token at each position, informed by context. In encoder models like BERT, each token attends to the full sequence in both directions. In autoregressive (decoder-only) models like GPT, each token attends only to preceding tokens. Either way, the same word "bank" produces a vector close to "finance" and "account" in one sentence, and close to "river" and "shore" in another.

Definition

Contextual Embedding

A representation where each token's vector depends on the entire surrounding sequence. The same word produces different vectors in different contexts, resolving ambiguity that static embeddings cannot.

This shift, from one vector per word type to one vector per word token in context, is what makes modern language models so powerful. The representations carry far more information because they incorporate the available context (the full sequence in encoder models, or the preceding context in autoregressive models).

Stop and Think

If contextual embeddings produce different vectors for the same word in different contexts, what mechanism allows the model to "mix in" information from surrounding words? (Hint: we cover this in detail in the Transformer section.)

Reveal

Self-attention. Each token computes a weighted sum over all other tokens in the sequence. The weights are learned, so the model decides which surrounding words are most relevant for building each token's representation. A token like "bank" attends heavily to "deposited" and "money" in one context, and to "river" and "sat" in another, producing very different output vectors.

Why This Matters for the Rest of the Course

Contextual embeddings are the foundation for everything that follows. When you classify text (Day 3), you use the model's contextual representation of the input. When you build a retrieval-augmented generation (RAG) pipeline (Day 4), embeddings power the retrieval step, and contextual embeddings produce much better search results than static ones. When you prompt a model (Day 2), you are writing input that the model will process through layers of contextual attention. Understanding how context shapes representations helps you write better prompts and debug unexpected behaviour.

Key Takeaway

Static embeddings: one vector per word, context-blind. Contextual embeddings: one vector per word in its specific context, produced by reading the full sequence through layers of self-attention. This is the leap that makes modern language models work.

Language Modeling

Contextual embeddings don't appear from nowhere: they are produced by a model trained on a specific objective. At its core, a language model learns a probability distribution over sequences of tokens. The dominant paradigm for modern LLMs is autoregressive language modeling: predict the next token given all previous tokens. This objective sounds simple. Its consequences are profound.

Definition

Autoregressive Language Model

A model that generates a sequence one token at a time, left to right. At each step it predicts a probability distribution over the vocabulary for the next token, conditioned on all tokens generated so far.

The Chain Rule of Probability

Any joint probability over a sequence can be decomposed into a product of conditional probabilities. This is not an approximation: it is an exact identity from probability theory:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{\lt t})$$

The joint probability of a sequence equals the product of each token's probability given all preceding tokens $x_{\lt t}$.

In plain language: the probability of a whole sentence equals the probability of the first word, times the probability of the second word given the first, times the probability of the third given the first two, and so on. An autoregressive language model learns each of these conditional distributions.
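In code, the chain rule is just a running product of the model's per-step probabilities (the numbers below are invented for illustration):

```python
import math

# Hypothetical per-step probabilities a model might assign to the sequence
# "the cat sat" (numbers invented): P(the), P(cat | the), P(sat | the, cat)
cond_probs = [0.20, 0.05, 0.10]

# Chain rule: the joint probability is the product of the conditionals
joint = math.prod(cond_probs)
print(joint)

# In practice we sum log-probabilities instead, to avoid numerical underflow
# on long sequences; exp(log_joint) recovers the same value
log_joint = sum(math.log(p) for p in cond_probs)
```

The log-space form matters: for a thousand-token document, a product of a thousand small probabilities underflows to zero in floating point, while the sum of log-probabilities stays perfectly representable.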

Why Next-Token Prediction Is So Powerful

Consider what it takes to predict the next word accurately across diverse text. To predict the word after "The capital of France is," the model must learn patterns that function like factual recall. To predict the next word in a Python function, it must capture regularities that mirror syntactic knowledge. To continue a logical argument, it must develop internal representations that approximate reasoning.

The prediction objective is simple, but solving it well on internet-scale data requires the model to build sophisticated internal representations of grammar, semantics, facts, and reasoning patterns. These representations then transfer to downstream tasks: summarisation, translation, question answering, and more.

(Figure: the input sequence "The cat sat on the" is fed to the model, which outputs a probability distribution over next tokens: "mat" 42%, "table" 18%, "floor" 12%, and so on.)

An autoregressive model takes a sequence of tokens and outputs a probability distribution over the vocabulary for the next token. During generation, it samples from this distribution, appends the chosen token, and repeats.

Perplexity: Measuring Model Quality

How do we know whether one language model is better than another? The standard metric is perplexity: it measures how "surprised" the model is by a held-out test set. Lower perplexity means the model assigns higher probability to the actual text, predicting it more accurately.

$$\text{PPL}(X) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{\lt t})\right)$$

Perplexity is the exponentiated average negative log-likelihood. Lower is better. A perplexity of $k$ means the model is, on average, as uncertain as choosing uniformly among $k$ options.

Intuitively, a perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options. A perfect model that always predicts the right token would have perplexity 1.
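Perplexity is a one-liner once you have the model's probability for each observed token (the probabilities below are invented to make the two limiting cases concrete):

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that gives every correct token probability 1/20 has perplexity 20:
# it is as uncertain as a uniform choice among 20 options
print(perplexity([0.05] * 10))

# A perfect model (probability 1.0 on every correct token) has perplexity 1
print(perplexity([1.0] * 10))
```

The first call prints 20 (up to floating-point error) and the second prints exactly 1, matching the intuition in the paragraph above.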

Stop and Think

A model trained only to predict the next word can summarise documents, answer questions, translate between languages, and write code. Why would next-token prediction produce such general capabilities?

Reveal

To predict the next token accurately across diverse text, the model must build internal representations of syntax, semantics, factual knowledge, reasoning patterns, and more. The prediction objective is simple, but solving it well on internet-scale data requires sophisticated internal representations that transfer to downstream tasks. In a sense, next-token prediction is a universal learning signal: any pattern in text that helps prediction can, in principle, be learned.

In the notebook: Section 5 lets you generate text from a real language model, experiment with temperature and sampling strategies, and compute perplexity on different texts.


The Transformer

The Transformer architecture (Vaswani et al., 2017) is the engine behind every modern large language model. Its key innovation is self-attention: a mechanism that lets each token attend to every other token in the sequence, learning which parts of the context matter most for each prediction.

Before Transformers: The Sequential Bottleneck

Earlier models (RNNs, LSTMs) processed tokens one at a time, left to right. Each token's representation depended on the previous hidden state, creating a chain. This had two problems:

  • Speed: Sequential processing cannot be parallelised across tokens. Training was slow.
  • Long-range dependencies: Information from early tokens had to survive through many steps of the chain to influence later tokens. In practice, it often degraded or was lost.

The Transformer solved both problems by replacing recurrence with attention. Every token can directly attend to every other token, regardless of distance. And all positions are processed simultaneously.

Self-Attention: The Core Mechanism

Self-attention works by having each token "ask a question" of every other token and collect relevant information. This happens through three learned projections:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information should I provide if I'm relevant?"

Each token's input embedding is multiplied by three separate weight matrices to produce its Q, K, and V vectors. Then, for each token, we compute the dot product of its Query with every Key. High dot products mean high relevance. These scores are scaled and passed through softmax to become weights (summing to 1). Finally, the weighted sum of Values gives the token's new, context-enriched representation.

Definition

Self-Attention

A mechanism where each position in a sequence computes a weighted sum over all positions. Each token produces three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention weights are determined by the compatibility between Queries and Keys.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Scaled dot-product attention. The $\sqrt{d_k}$ factor prevents dot products from growing too large in high dimensions, which would push softmax into regions with near-zero gradients.
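The whole computation fits in a short pure-Python sketch. The Q, K, V matrices below are tiny hand-picked stand-ins; in a real model they come from learned projections of the token embeddings:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one row per token)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # compatibility of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # context-enriched output: weighted sum of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three tokens with d_k = 2 (values hand-picked for illustration)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, weighted by how well that token's query matched every key.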


Click each token to see its attention pattern. Then switch sentences and click "bank" again: notice how the attention shifts entirely from "river" to "deposit" and "money." (Attention weights are illustrative, not extracted from a specific model.)

Multi-Head Attention

A single attention head can only capture one type of relationship at a time. Multi-head attention runs several attention heads in parallel, each with its own Q, K, V weight matrices. Different heads learn to attend to different things: one head might track syntactic dependencies (subject–verb agreement), another might track coreference ("she" → "Marie"), and another might capture semantic relationships. Their outputs are concatenated and projected back to the original dimension.

Head 1 (Syntactic): "sat" → "I" (subject–verb agreement), "the" → "river" (determiner–noun)

Head 2 (Semantic): "bank" → "river" (word meaning), "sat" → "bank" (location)

Head 3 (Positional): each token attends most to its immediate neighbours

Illustrative attention patterns for three hypothetical heads in the same layer, processing "I sat by the river bank." Each head captures a different type of relationship: syntactic structure, semantic meaning, or local position. The model combines all heads to build a rich, multi-faceted representation.

The Full Transformer Block

A Transformer block combines self-attention with a few other components:

(Figure: Input Embeddings → Multi-Head Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm → Output Embeddings, with residual connections around each sub-layer; the block is repeated N times, e.g. 96 layers in GPT-3, to form the full Transformer.)

One Transformer block. Input flows through multi-head attention, then through a feed-forward network. Residual connections (dashed lines) add the input back to the output at each stage, and layer normalisation stabilises training. A full model stacks many such blocks.

Residual connections (He et al., 2016) add each sub-layer's input directly to its output. This lets gradients flow through the network without degrading, enabling very deep stacks (GPT-3 uses 96 layers). Layer normalisation (Ba, Kiros & Hinton, 2016) stabilises the scale of activations at each layer.

Positional Encodings

Self-attention processes all tokens in parallel: there is nothing in the mechanism that distinguishes position 1 from position 100. Without help, "dog bites man" and "man bites dog" would produce identical representations.

Positional encodings solve this by adding position information to each token's embedding before it enters the attention layers. The original Transformer used sinusoidal functions of the position. Modern models often use learned positional embeddings or rotary position encodings (RoPE) (Su et al., 2021), which generalise better to long sequences.
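The original sinusoidal scheme is easy to reproduce. A sketch of the Vaswani et al. (2017) formulation, with sine on even dimensions and cosine on odd ones:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine on even dimensions, cosine on odd ones, with wavelengths
    increasing geometrically with the dimension index."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Each position gets a distinct pattern that is added to its token embedding
print(sinusoidal_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(5, 8))
```

Because the wavelengths range over many scales, nearby positions get similar encodings while distant positions differ, giving attention a usable notion of order and distance.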

Stop and Think

Why do we divide the dot product by $\sqrt{d_k}$ in the attention formula? What would happen without this scaling?

Reveal

When the dimension $d_k$ is large, the dot products of Q and K vectors tend to grow large in magnitude. Large inputs to softmax push the output into regions where the gradient is extremely small (near 0 or 1), making learning very slow. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products at roughly 1, ensuring that softmax operates in a useful range. This is a simple but critical engineering detail.

Key Takeaway

The Transformer replaced sequential processing with parallel attention. Every token can attend to every other token directly, regardless of distance. This architectural choice is why LLMs could scale to billions of parameters trained on trillions of tokens: the foundation of their remarkable capabilities.

In the notebook: Section 4 walks you through self-attention by hand. You assign attention weights manually, then compare your intuitions against the model's actual attention patterns.


Scaling Laws

Now that we understand the Transformer architecture, a natural question arises: what happens when we make it bigger? One of the most consequential discoveries in modern AI is that language model performance improves predictably as you increase compute, data, or model size. Across a wide range of scales, the relationship follows smooth power-law curves, a remarkably strong empirical regularity. This insight transformed AI development from experimentation into engineering.

Three Power Laws

Kaplan et al. (2020) showed that test loss follows the same functional form against three independent variables: total compute, dataset size, and number of model parameters. On a log-log plot, each relationship is a straight line.

$$L(X) \propto X^{-\alpha}$$

Loss scales as a power law with each of three variables: compute $C$, dataset size $D$, and model parameters $N$. Each follows the same functional form with different exponents $\alpha$.

Kaplan et al. modelled each variable's contribution separately. Double the compute, and loss drops by a predictable factor. Double the data, same pattern. Double the parameters, same again. The exponents differ, but the power-law form holds across model families and training setups. (We will see shortly that the variables are not truly independent: how you balance model size and data matters.)


Test loss as a function of compute, dataset size, and model parameters (log-log scale). Each relationship follows a smooth power law: a straight line on log-log axes. Adapted from Kaplan et al. (2020).

Why This Matters: Forecasting Performance

Power laws are not just descriptive. They are predictive. Train a few small models, measure their loss, fit the curve, and extrapolate to any target scale. Before spending hundreds of millions of dollars on a training run, a lab can estimate the result.
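That workflow, fit in log-log space and then extrapolate, can be sketched directly. The pilot-run numbers below are synthetic, generated from a known power law so the fit recovers it exactly; real measurements are noisy:

```python
import math

# Hypothetical pilot runs: (compute, test loss). The losses are generated
# from L = 10 * C**-0.05, so the fit below recovers the law exactly
runs = [(1e3, 10 * 1e3 ** -0.05),
        (1e5, 10 * 1e5 ** -0.05),
        (1e7, 10 * 1e7 ** -0.05)]

# A power law L = a * C**-alpha is a straight line in log-log space:
# log L = log a - alpha * log C. Fit it by ordinary least squares.
xs = [math.log(c) for c, _ in runs]
ys = [math.log(loss) for _, loss in runs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
alpha, a = -slope, math.exp(my - slope * mx)

# Extrapolate to a compute budget 100x beyond the largest pilot run
predicted_loss = a * 1e9 ** -alpha
print(alpha, predicted_loss)
```

Labs do a more careful version of exactly this: fit the curve on affordable runs, then commit the large training budget at the predicted point.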

This is how frontier models get planned. GPT-4, Claude, and Gemini were not shots in the dark. Their developers trained smaller pilot models, verified the scaling curve, and then committed compute at the predicted optimal point. Scaling laws turned AI development from guesswork into engineering.

For researchers using these models, the implications are direct. Understanding scaling explains why a 70B model handles nuanced political text better than a 7B model, why API costs differ by orders of magnitude, and why model choice is a research design decision, not just a technical one.

Implications for Model Development

Scaling laws drove the race from GPT-2 (1.5B) to GPT-3 (175B) to frontier models with hundreds of billions of dense parameters, or over a trillion total in mixture-of-experts architectures. Each generation validated the power-law prediction: more scale, better loss.

But power laws have diminishing returns. Each 10× increase in compute yields a smaller absolute improvement. The first doubling is dramatic; the tenth is marginal. Combined with rising costs of compute, energy, and data, pure scale hits practical limits.

Hoffmann et al. (2022) showed that most early large models were undertrained: balancing model size and training data matters as much as raw scale. This insight reshaped how training budgets are allocated.

These constraints are pushing the field toward complementary strategies: better data curation, improved architectures, reasoning-focused training (e.g. reinforcement learning from human feedback), and a fundamentally different approach, test-time compute scaling, which spends more compute at inference rather than at training time.

Coming in Day 2: We explore test-time compute scaling: using prompting strategies like chain-of-thought reasoning to extract more capability at inference. This represents a fundamental shift: instead of only scaling training, we can also scale thinking.

Stop and Think

If scaling laws have diminishing returns, what strategies beyond raw scale might improve model performance? Think about the full pipeline: data, architecture, training objective, and inference.

Reveal

Several complementary approaches are being explored. Data curation: higher-quality, deduplicated, and domain-specific training data yields more per token. Architecture improvements: mixture-of-experts, longer contexts, and more efficient attention variants. Training objectives: reinforcement learning from human feedback (RLHF), constitutional AI, and reasoning-focused training. Test-time compute: chain-of-thought prompting, self-consistency, and tree search let models "think longer" at inference, a form of scaling that does not require retraining.

Key Takeaway

Scaling laws are both an explanation of the field's trajectory and a practical tool for forecasting. They tell us why larger models are better, how much better they will be, and when pure scaling hits diminishing returns. For social scientists, this means model choice is not arbitrary: it is a design decision with predictable consequences for task performance, cost, and capability.


Social Science Applications

The foundational concepts in this module are not just theoretical. They have direct, substantive applications in social science research. This section highlights key papers that demonstrate how word embeddings, language models, and the Transformer architecture open new research possibilities and introduce new methodological challenges.

Bias Detection in Word Embeddings

The semantic projection technique we saw earlier has a powerful application: measuring societal biases in language. Because embeddings are trained on human-generated text, they absorb the biases present in that text.

Caliskan, Bryson & Narayanan (2017), Semantics derived automatically from language corpora contain human-like biases. Science. Demonstrated that word embeddings replicate a wide range of implicit biases measured in humans via the Implicit Association Test (IAT), including associations between European-American names and pleasant words, and between African-American names and unpleasant words. This paper established that embedding geometry is a lens for studying culture.

Garg, Schiebinger, Jurafsky & Zou (2018), Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. PNAS. By training embeddings on text from each decade of the 20th century, they tracked how gender and ethnic stereotypes in American English evolved over time, a computational history of bias that aligns with known social changes.

Cultural Analysis Through Geometry

Kozlowski, Taddy & Evans (2019), The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review. Established a framework for using vector space geometry to map cultural dimensions. They showed that the concepts of social class, affluence, and education are encoded as directions in embedding space, and that projecting words onto these axes reveals the cultural associations that texts carry. This approach transforms embeddings from a technical tool into a method for computational cultural analysis.

Open Science and Reproducibility

Spirling (2023), Why open-source generative AI models are an ethical imperative for social science. Nature Computational Science. Argues that transparency, reproducibility, and data control require access to open-source models. When a model is a black box controlled by a company, researchers cannot fully understand or verify their instrument. This paper connects the architectural foundations we have studied (pre-training data, model weights, tokenization) to fundamental questions of scientific integrity.

Broader Impact

Bail (2024), Can Generative AI Improve Social Science? PNAS. A balanced assessment of the potential benefits and pitfalls of integrating LLMs into social science workflows. Emphasises valid measurement, reproducibility, and the distinction between tasks where LLMs augment human researchers and tasks where they might introduce systematic error. A good frame for thinking about everything you will learn in the rest of this course.

Key Takeaway

The foundations covered in this module: embeddings, tokenization, language modeling, attention, and scaling: are not just technical prerequisites. They are the building blocks of a new research methodology. Understanding how models represent and process language is essential for using them responsibly as instruments of social science inquiry. The subsequent modules build directly on these foundations: prompting (Day 2) works because of next-token prediction, classification and RAG (Days 3–4) rely on contextual embeddings, and model selection (Day 2–3) requires understanding the scaling tradeoffs introduced here.