Module 1 of 5
Foundations
From Embeddings to Transformers: how machines represent meaning as geometry, and how the architecture that changed everything actually works.
Representing Meaning
Before a language model can do anything useful, it needs a way to represent words as numbers. Not just any numbers: numbers that capture meaning. This section traces the path from the simplest text representations to the geometric spaces that power modern NLP.
Bag of Words: Counting What Appears
The simplest idea: count which words appear. A document mentioning "economy," "inflation," and "growth" is probably about economic policy. One mentioning "goal," "league," and "match" is probably about sports.
This is called a bag-of-words (BoW) representation. It throws away word order, grammar, and nuance, keeping only word frequencies. Despite this, BoW has powered decades of productive social science research: TF-IDF for document retrieval, topic models like LDA and STM for discovering themes in large corpora, and dictionary methods for measuring sentiment or policy focus.
But BoW has a hard ceiling. Every word is a separate dimension, equally distant from every other word. "Cat" is as far from "dog" as from "democracy." Two sentences about the same topic can look completely different if they use different vocabulary. BoW captures what words appear, but not what they mean.
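To make this concrete, here is a minimal bag-of-words sketch in plain Python. The vocabulary and document are toy examples; libraries like scikit-learn provide production implementations.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["economy", "inflation", "growth", "goal", "league", "match"]
doc = "Inflation slowed while the economy returned to growth growth"
print(bow_vector(doc, vocab))  # [1, 1, 2, 0, 0, 0]
```

Note that the vector records only frequencies: word order and any word outside the vocabulary are discarded, which is exactly the ceiling described above.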
Word Embeddings: Meaning as Geometry
The breakthrough idea (Mikolov et al., 2013): instead of treating every word as a separate symbol, learn a dense vector for each word from its context. Words that appear in similar contexts get similar vectors.
This is the distributional hypothesis: "you shall know a word by the company it keeps" (Firth, 1957). A word that often appears near "government," "legislation," and "vote" will end up with a vector close to other political terms, even if they never co-occurred in the same sentence.
Definition
Word Embedding
A learned mapping $f: \mathcal{V} \to \mathbb{R}^d$ from a vocabulary $\mathcal{V}$ to a $d$-dimensional real vector space, where semantic and syntactic relationships between words are encoded as geometric relationships between their vectors.
Word2Vec learns embeddings by training a shallow neural network on a simple task: given a word, predict its neighbours (Skip-gram), or given the neighbours, predict the word (CBOW). GloVe (Pennington et al., 2014) achieves a similar result by factorising a global word co-occurrence matrix. Both produce vectors where semantic similarity corresponds to geometric proximity.
Schematic illustration of word embeddings in 2D. In a trained embedding space, words from the same domain cluster together, even though the model was never told these categories exist. This structure emerges from word co-occurrence patterns. (Layout is simplified for clarity; real projections via PCA or t-SNE are noisier.)
Cosine Similarity: Measuring Closeness
How do we measure whether two word vectors are "close"? The standard tool is cosine similarity: it measures the angle between two vectors, ignoring their length.
$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|}$$
Result is the cosine of the angle between the two vectors: 1.0 = identical direction, 0.0 = orthogonal (no overlap), −1.0 = opposite.
Cosine similarity measures the angle between vectors, not their magnitude. Drag vector b to see how the cosine value changes: 1.0 when aligned, 0.0 when orthogonal, −1.0 when opposite.
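The formula translates directly into a few lines of NumPy; the 2-D vectors below are toy examples chosen to show the three landmark values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (length-independent)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # 1.0: same direction, different length
print(cosine_similarity(a, np.array([0.0, 1.0])))   # 0.0: orthogonal
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0: opposite
```

Because the norms in the denominator cancel out magnitude, a long vector and a short vector pointing the same way score exactly 1.0.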
Analogy Arithmetic
The most striking property of word embeddings: relationships become directions. The direction from "man" to "king" captures something like "royalty." Add that same direction to "woman," and you land near "queen."
$$\vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$$
Vector arithmetic captures semantic relationships. The direction from "man" to "king" is approximately the same as the direction from "woman" to "queen."
This works because the embedding space organises concepts along roughly consistent axes. Gender is one direction. Geography is another ("Paris − France + Germany ≈ Berlin"). Tense, plurality, and many other relationships are encoded as well. That said, subsequent work has shown that analogy results are sensitive to evaluation methodology and less robust than the original papers suggested: the effect is real, but noisier than the clean examples imply.
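A sketch of analogy arithmetic, using hand-made 2-D vectors constructed so the gender offset is consistent. Real embeddings are learned and typically 100+ dimensional; as in the standard evaluation, the nearest neighbour is found after excluding the three query words.

```python
import numpy as np

# Toy vectors chosen by hand so that king - man + woman lands on queen.
emb = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([4.0, 1.2]),
    "queen": np.array([4.0, 3.2]),
    "apple": np.array([9.0, 0.5]),
}

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine) to b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as standard analogy evaluations do.
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("man", "king", "woman", emb))  # queen
```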
Semantic Projection: Reading Hidden Dimensions
A 100-dimensional vector seems abstract, but it encodes rich, interpretable structure. The technique of semantic projection reveals this: define a meaningful direction using pairs of anchor words, then project other words onto that axis to see where they fall.
For example, a "gender" direction can be defined using anchors like (man, woman), (he, she), (him, her). Projecting occupations onto this axis reveals how strongly the training corpus associates each occupation with gender. Caliskan et al. (2017) showed that embeddings replicate a wide range of implicit biases measured in humans: a finding with profound implications for social science.
Occupations projected onto a gender axis derived from GloVe embeddings. The bias reflects patterns in the training corpus, not ground truth.
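The projection recipe can be sketched as follows. The vectors are hand-made toy examples standing in for trained GloVe vectors, and the sign convention (positive values lean toward the "male" anchors) is arbitrary.

```python
import numpy as np

# Toy stand-ins for trained embeddings; real ones are learned from a corpus.
emb = {
    "he":       np.array([ 1.0, 0.2]), "she":   np.array([-1.0, 0.2]),
    "man":      np.array([ 0.9, 0.5]), "woman": np.array([-0.9, 0.5]),
    "nurse":    np.array([-0.6, 1.0]),
    "engineer": np.array([ 0.7, 1.1]),
}

# Gender axis: average the (male - female) anchor differences, then normalise.
pairs = [("he", "she"), ("man", "woman")]
axis = np.mean([emb[m] - emb[f] for m, f in pairs], axis=0)
axis /= np.linalg.norm(axis)

# Project each occupation onto the axis via the dot product.
scores = {w: emb[w] @ axis for w in ["nurse", "engineer"]}
for word, score in scores.items():
    print(f"{word:10s} {score:+.2f}")
```

Averaging over several anchor pairs makes the axis less sensitive to the quirks of any single word pair, which is why Caliskan et al. and related work use multiple anchors.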
If word embeddings place semantically similar words near each other, what happens to words with multiple meanings, like "bank" (river bank vs. financial bank)? What limitation does this reveal about static embeddings?
Reveal
Static embeddings assign a single vector per word, regardless of context. "Bank" gets one representation that averages across all its senses. This is a fundamental limitation that contextual embeddings (produced by Transformers) address by generating different representations for the same word depending on surrounding context. We return to this in the From Static to Contextual section.
In the notebook: Exercises 1–3 walk you through building BoW vectors, implementing cosine similarity from scratch, testing word analogies, and projecting occupations onto a gender axis.
Resources
- Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space: the foundational Word2Vec paper.
- Pennington, Socher & Manning (2014), GloVe: Global Vectors for Word Representation: combines global matrix factorisation with local context windows.
- Lena Voita: NLP Course: Word Embeddings: an excellent in-depth explainer.
- TensorFlow Embedding Projector: interactive 3D visualisation of embedding spaces.
- Grand et al. (2022), Semantic Projection: the method for mapping words onto interpretable semantic scales.
Tokenization
In the previous section, we talked about vectors for "words." But what counts as a word? Models don't actually operate on words as we think of them. Before text enters a model, it must be split into discrete units the model can process. This step, called tokenization, bridges raw text and the embedding layer. The choice of tokenization strategy has real consequences for what the model "sees" and how well it handles different languages, numbers, and edge cases.
The Problem: Words Are Not Enough
A word-level vocabulary is appealing but impractical. English alone has hundreds of thousands of words. Add misspellings, names, code, and other languages, and the vocabulary explodes. Any word not in the vocabulary is mapped to an unknown <UNK> token, invisible to the model.
Character-level tokenization solves the unknown-word problem (any text can be spelled out letter by letter) but creates sequences that are extremely long and hard to learn from. The model must learn, from individual letters alone, that "c-a-t" forms a single concept.
Subword tokenization is the compromise that modern LLMs actually use. It keeps common words as single tokens ("the," "and") but breaks rare words into smaller, reusable pieces ("unhappiness" → "un" + "happiness" or "un" + "happi" + "ness").
Definition
Byte-Pair Encoding (BPE)
A tokenization strategy that splits text into units smaller than words but larger than characters. Starting from individual bytes or characters, BPE iteratively merges the most frequent adjacent pair until a target vocabulary size is reached.
How BPE Works
BPE, introduced by Sennrich et al. (2016), is elegantly simple. Start with individual characters. Count every adjacent pair in the training corpus. Merge the most frequent pair into a new token. Repeat until you reach the desired vocabulary size.
BPE starts with individual characters and iteratively merges the most frequent adjacent pair. Common words end up as single tokens; rare words are composed of reusable subword pieces.
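The merge loop can be written out in a few dozen lines. This is a simplified word-frequency sketch in the spirit of Sennrich et al. (2016), with a toy corpus; production tokenizers add byte-level fallback, pre-tokenization rules, and much larger vocabularies.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (simplified sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
print(bpe_merges(corpus, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Notice that frequent fragments ("ug", "un") are merged before whole words: common words end up as single tokens only after their pieces have been learned.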
Why Tokenization Matters for Research
Tokenization is not just preprocessing. It fundamentally shapes what a model "sees."
Multilingual inequality. Tokenizers are trained predominantly on English text. The same sentence in Hindi, Arabic, or Yoruba often requires two to ten times more tokens than in English. More tokens means lower effective resolution, higher API costs, and faster context-window exhaustion. When working with multilingual social science data, always check how your text tokenizes.
Number fragmentation. Numbers are often split into seemingly arbitrary pieces ("2024" → "202" + "4"). This is one reason models struggle with arithmetic: they do not see "2024" as a single number.
The "strawberry" problem. Ask a model "how many r's in strawberry?" and it may answer incorrectly. The tokenizer splits "strawberry" into subwords like "straw" + "berry", so the model never sees individual letters and cannot easily count them.
Key Takeaway
Tokenization is part of your measurement instrument. When you send text to an LLM, the tokenizer decides how that text is represented. Multilingual tokenizers often allocate fewer tokens to non-English text, giving the model lower effective resolution for those languages. Before running experiments, inspect your tokenization.
In the notebook: Section 3 lets you explore tokenization hands-on: comparing how different models split the same text and observing multilingual tokenization inequality directly.
Resources
- Sennrich, Haddow & Birch (2016), Neural Machine Translation of Rare Words with Subword Units: introduces BPE for NLP.
- Tiktokenizer: visualise how OpenAI's tokenizers split text in real time.
- Hugging Face Tokenizer Playground: compare how different tokenizers split the same input.
- Andrej Karpathy: "Let's Build the GPT Tokenizer": a hands-on deep dive into BPE.
- HuggingFace NLP Course, Chapter 6: thorough walkthrough of BPE and other subword algorithms.
- 3Blue1Brown: "But what is a GPT?": Chapter 5 covers tokenization with excellent visuals.
From Static to Contextual
We now know how text is split into tokens, and that each token gets its own embedding vector. But recall a limitation we flagged in the first section: static embeddings give each word type a single vector. "Bank" gets one representation that averages across all its senses: river bank, financial bank, blood bank. Now we see how modern models solve this problem.
The Polysemy Problem
Consider these two sentences:
- "I deposited money at the bank."
- "We sat on the bank of the river."
With GloVe, both instances of "bank" get the exact same vector. A model using static embeddings cannot distinguish the two meanings. It must rely on other words in the sentence to disambiguate, but the embedding itself carries no context.
Contextual Representations
A Transformer-based model builds a different representation for each token at each position, informed by context. In encoder models like BERT, each token attends to the full sequence in both directions. In autoregressive (decoder-only) models like GPT, each token attends only to preceding tokens. Either way, the same word "bank" produces a vector close to "finance" and "account" in one sentence, and close to "river" and "shore" in another.
Definition
Contextual Embedding
A representation where each token's vector depends on the entire surrounding sequence. The same word produces different vectors in different contexts, resolving ambiguity that static embeddings cannot.
This shift from one vector per word type to one vector per word token in context is what makes modern language models so powerful. The representations carry far more information because they incorporate the available context (the full sequence in encoder models, or the preceding context in autoregressive models).
If contextual embeddings produce different vectors for the same word in different contexts, what mechanism allows the model to "mix in" information from surrounding words? (Hint: we cover this in detail in the Transformer section.)
Reveal
Self-attention. Each token computes a weighted sum over all other tokens in the sequence. The weights are learned, so the model decides which surrounding words are most relevant for building each token's representation. A token like "bank" attends heavily to "deposited" and "money" in one context, and to "river" and "sat" in another, producing very different output vectors.
Why This Matters for the Rest of the Course
Contextual embeddings are the foundation for everything that follows. When you classify text (Day 3), you use the model's contextual representation of the input. When you build a retrieval-augmented generation (RAG) pipeline (Day 4), embeddings power the retrieval step, and contextual embeddings produce much better search results than static ones. When you prompt a model (Day 2), you are writing input that the model will process through layers of contextual attention. Understanding how context shapes representations helps you write better prompts and debug unexpected behaviour.
Key Takeaway
Static embeddings: one vector per word, context-blind. Contextual embeddings: one vector per word in its specific context, produced by reading the full sequence through layers of self-attention. This is the leap that makes modern language models work.
Language Modeling
Contextual embeddings don't appear from nowhere: they are produced by a model trained on a specific objective. At its core, a language model learns a probability distribution over sequences of tokens. The dominant paradigm for modern LLMs is autoregressive language modeling: predict the next token given all previous tokens. This objective sounds simple. Its consequences are profound.
Definition
Autoregressive Language Model
A model that generates a sequence one token at a time, left to right. At each step it predicts a probability distribution over the vocabulary for the next token, conditioned on all tokens generated so far.
The Chain Rule of Probability
Any joint probability over a sequence can be decomposed into a product of conditional probabilities. This is not an approximation: it is an exact identity from probability theory:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{\lt t})$$
The joint probability of a sequence equals the product of each token's probability given all preceding tokens $x_{\lt t}$.
In plain language: the probability of a whole sentence equals the probability of the first word, times the probability of the second word given the first, times the probability of the third given the first two, and so on. An autoregressive language model learns each of these conditional distributions.
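A quick numerical illustration of the chain rule, with made-up conditional probabilities standing in for a model's per-step predictions:

```python
import math

# Hypothetical per-step probabilities P(x_t | x_<t) for a 4-token sequence;
# the numbers are invented for illustration.
step_probs = [0.20, 0.40, 0.90, 0.75]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(step_probs)
print(f"P(sequence) = {joint:.4f}")  # 0.0540

# Equivalent in log space, which implementations use for numerical stability.
log_joint = sum(math.log(p) for p in step_probs)
print(f"exp(sum log P) = {math.exp(log_joint):.4f}")  # 0.0540
```

Even a short, plausible sequence has a small joint probability, which is why log-probabilities (and perplexity, below) are the practical currency of language modeling.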
Why Next-Token Prediction Is So Powerful
Consider what it takes to predict the next word accurately across diverse text. To predict the word after "The capital of France is," the model must learn patterns that function like factual recall. To predict the next word in a Python function, it must capture regularities that mirror syntactic knowledge. To continue a logical argument, it must develop internal representations that approximate reasoning.
The prediction objective is simple, but solving it well on internet-scale data requires the model to build sophisticated internal representations of grammar, semantics, facts, and reasoning patterns. These representations then transfer to downstream tasks: summarisation, translation, question answering, and more.
An autoregressive model takes a sequence of tokens and outputs a probability distribution over the vocabulary for the next token. During generation, it samples from this distribution, appends the chosen token, and repeats.
Perplexity: Measuring Model Quality
How do we know whether one language model is better than another? The standard metric is perplexity: it measures how "surprised" the model is by a held-out test set. Lower perplexity means the model assigns higher probability to the actual text: it predicts more accurately.
$$\text{PPL}(X) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{\lt t})\right)$$
Perplexity is the exponentiated average negative log-likelihood. Lower is better. A perplexity of $k$ means the model is, on average, as uncertain as choosing uniformly among $k$ options.
Intuitively, a perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options. A perfect model that always predicts the right token would have perplexity 1.
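The perplexity formula is a one-liner. The uniform case below recovers the intuition exactly: if the model assigns probability 1/20 to every observed token, perplexity is 20.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    T = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / T)

print(perplexity([1 / 20] * 10))          # 20.0: uniform among 20 options
print(perplexity([0.9, 0.8, 0.95, 0.7]))  # close to 1: a confident model
print(perplexity([1.0, 1.0, 1.0]))        # 1.0: a perfect model
```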
A model trained only to predict the next word can summarise documents, answer questions, translate between languages, and write code. Why would next-token prediction produce such general capabilities?
Reveal
To predict the next token accurately across diverse text, the model must build internal representations of syntax, semantics, factual knowledge, reasoning patterns, and more. The prediction objective is simple, but solving it well on internet-scale data requires sophisticated internal representations that transfer to downstream tasks. In a sense, next-token prediction is a universal learning signal: any pattern in text that helps prediction can, in principle, be learned.
In the notebook: Section 5 lets you generate text from a real language model, experiment with temperature and sampling strategies, and compute perplexity on different texts.
Resources
- Lena Voita: NLP Course: Language Modeling: clear, visual introduction to language modeling concepts.
- HuggingFace: Perplexity of Fixed-Length Models: technical details on computing perplexity.
- Radford et al. (2018), Improving Language Understanding by Generative Pre-Training: GPT-1: demonstrating that generative pre-training transfers to downstream tasks.
- Radford et al. (2019), Language Models are Unsupervised Multitask Learners: GPT-2: scaling up and the emergence of zero-shot capabilities.
The Transformer
The Transformer architecture (Vaswani et al., 2017) is the engine behind every modern large language model. Its key innovation is self-attention: a mechanism that lets each token attend to every other token in the sequence, learning which parts of the context matter most for each prediction.
Before Transformers: The Sequential Bottleneck
Earlier models (RNNs, LSTMs) processed tokens one at a time, left to right. Each token's representation depended on the previous hidden state, creating a chain. This had two problems:
- Speed: Sequential processing cannot be parallelised across tokens. Training was slow.
- Long-range dependencies: Information from early tokens had to survive through many steps of the chain to influence later tokens. In practice, it often degraded or was lost.
The Transformer solved both problems by replacing recurrence with attention. Every token can directly attend to every other token, regardless of distance. And all positions are processed simultaneously.
Self-Attention: The Core Mechanism
Self-attention works by having each token "ask a question" of every other token and collect relevant information. This happens through three learned projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information should I provide if I'm relevant?"
Each token's input embedding is multiplied by three separate weight matrices to produce its Q, K, and V vectors. Then, for each token, we compute the dot product of its Query with every Key. High dot products mean high relevance. These scores are scaled and passed through softmax to become weights (summing to 1). Finally, the weighted sum of Values gives the token's new, context-enriched representation.
Definition
Self-Attention
A mechanism where each position in a sequence computes a weighted sum over all positions. Each token produces three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). Attention weights are determined by the compatibility between Queries and Keys.
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Scaled dot-product attention. The $\sqrt{d_k}$ factor prevents dot products from growing too large in high dimensions, which would push softmax into regions with near-zero gradients.
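A minimal NumPy sketch of scaled dot-product attention for a single head, with random weight matrices standing in for learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
x = rng.normal(size=(seq_len, d_k))                     # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_k, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)              # (4, 8): one context-enriched vector per token
print(weights.sum(axis=-1))   # each token's attention weights sum to 1
```

Each row of `weights` is one token's distribution over the sequence: its Query scored against every Key, softmaxed, then used to mix the Values.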
Click a token to see what it attends to (Query → Keys)
Click each token to see its attention pattern. Then switch sentences and click "bank" again: notice how the attention shifts entirely from "river" to "deposit" and "money." (Attention weights are illustrative, not extracted from a specific model.)
Multi-Head Attention
A single attention head can only capture one type of relationship at a time. Multi-head attention runs several attention heads in parallel, each with its own Q, K, V weight matrices. Different heads learn to attend to different things: one head might track syntactic dependencies (subject–verb agreement), another might track coreference ("she" → "Marie"), and another might capture semantic relationships. Their outputs are concatenated and projected back to the original dimension.
Head 1: Syntactic
"sat" → "I" (subject–verb), "the" → "river" (determiner–noun)
Head 2: Semantic
"bank" → "river" (meaning), "sat" → "bank" (location)
Head 3: Positional
Each token attends most to its immediate neighbours
Illustrative attention patterns for three hypothetical heads in the same layer, processing "I sat by the river bank." Each head captures a different type of relationship: syntactic structure, semantic meaning, or local position. The model combines all heads to build a rich, multi-faceted representation.
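Multi-head attention can be sketched by giving each head its own slice of the projections, then concatenating the heads and projecting back. This simplified version slices column blocks of shared weight matrices rather than keeping separate per-head parameters, which is equivalent up to how the weights are stored.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """h parallel heads of width d_model // h, concatenated and projected."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    def attend(Q, K, V):
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)  # this head's projection slice
        heads.append(attend(x @ Wq[:, sl], x @ Wk[:, sl], x @ Wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo    # concat heads, project back

rng = np.random.default_rng(0)
d_model, seq = 16, 5
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=4)
print(out.shape)  # (5, 16): same shape as the input, ready for the next layer
```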
The Full Transformer Block
A Transformer block combines self-attention with a few other components:
One Transformer block. Input flows through multi-head attention, then through a feed-forward network. Residual connections (dashed lines) add the input back to the output at each stage, and layer normalisation stabilises training. A full model stacks many such blocks.
Residual connections (He et al., 2016) add each sub-layer's input directly to its output. This lets gradients flow through the network without degrading, enabling very deep stacks (GPT-3 uses 96 layers). Layer normalisation (Ba, Kiros & Hinton, 2016) stabilises the scale of activations at each layer.
Positional Encodings
Self-attention processes all tokens in parallel: there is nothing in the mechanism that distinguishes position 1 from position 100. Without help, "dog bites man" and "man bites dog" would produce identical representations.
Positional encodings solve this by adding position information to each token's embedding before it enters the attention layers. The original Transformer used sinusoidal functions of the position. Modern models often use learned positional embeddings or rotary position encodings (RoPE) (Su et al., 2021), which generalise better to long sequences.
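The original sinusoidal scheme is easy to write down; this sketch follows the sin/cos interleaving of Vaswani et al. (2017).

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]         # (seq, 1) positions
    i = np.arange(d_model // 2)[None, :]      # (1, d/2) dimension pairs
    freqs = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(freqs)  # even dimensions: sine
    pe[:, 1::2] = np.cos(freqs)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=64)
print(pe.shape)   # (50, 64): one encoding vector per position
print(pe[0, :4])  # position 0: [0, 1, 0, 1] since sin(0)=0 and cos(0)=1
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique pattern, and nearby positions get similar ones.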
Why do we divide the dot product by $\sqrt{d_k}$ in the attention formula? What would happen without this scaling?
Reveal
When the dimension $d_k$ is large, the dot products of Q and K vectors tend to grow large in magnitude. Large inputs to softmax push the output weights toward saturation (near 0 or 1), where gradients are extremely small and learning becomes very slow. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products at roughly 1, ensuring that softmax operates in a useful range. This is a simple but critical engineering detail.
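The effect is easy to demonstrate: dot products of random unit-variance vectors in $d_k$ dimensions have standard deviation around $\sqrt{d_k}$, and dividing the logits by a factor greater than 1 always flattens the softmax output.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
# Dot products of random unit-variance vectors have variance ~ d_k.
q, k = rng.normal(size=(8, d_k)), rng.normal(size=(8, d_k))
scores = q @ k.T

print(scores.std())  # ~ sqrt(512) ≈ 22.6: huge logits before scaling
unscaled_max = softmax(scores[0]).max()                 # saturated, near 1
scaled_max = softmax(scores[0] / np.sqrt(d_k)).max()    # a usable distribution
print(unscaled_max, scaled_max)
```

Without scaling, one weight dominates and the rest are effectively zero, so almost no gradient flows; with scaling, attention stays spread out enough to learn.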
Key Takeaway
The Transformer replaced sequential processing with parallel attention. Every token can attend to every other token directly, regardless of distance. This architectural choice is why LLMs could scale to billions of parameters trained on trillions of tokens: the foundation of their remarkable capabilities.
In the notebook: Section 4 walks you through self-attention by hand. You assign attention weights manually, then compare your intuitions against the model's actual attention patterns.
Resources
- Vaswani et al. (2017), Attention Is All You Need: the paper that introduced the Transformer architecture.
- Jay Alammar: The Illustrated Transformer: the canonical visual walkthrough of the architecture.
- 3Blue1Brown: "Attention in Transformers, Visually Explained": intuitive geometric explanation of attention.
- Brendan Bycroft's LLM Visualization: interactive 3D walkthrough of an entire Transformer.
- Transformer Explainer (Polo Club of Data Science): interactive visual explanation of how Transformers work.
- Jay Alammar: The Illustrated GPT-2: how the decoder-only Transformer works in practice.
- Anthropic: In-context Learning and Induction Heads (2022): a mechanistic view of how Transformers learn.
- He et al. (2016), Deep Residual Learning for Image Recognition: introduced skip/residual connections that enable training of very deep networks.
- Ba, Kiros & Hinton (2016), Layer Normalization: the normalisation variant used in Transformers (applied per-sample, not per-batch).
- Su et al. (2021), RoFormer: Enhanced Transformer with Rotary Position Embedding: the positional encoding used in most modern LLMs (Llama, Mistral, etc.).
Scaling Laws
Now that we understand the Transformer architecture, a natural question arises: what happens when we make it bigger? One of the most consequential discoveries in modern AI: language model performance improves predictably as you increase compute, data, or model size. Across a wide range of scales, the relationship follows smooth power-law curves: a remarkably strong empirical regularity. This insight transformed AI development from experimentation into engineering.
Three Power Laws
Kaplan et al. (2020) showed that test loss follows the same functional form against three independent variables: total compute, dataset size, and number of model parameters. On a log-log plot, each relationship is a straight line.
$$L(X) \propto X^{-\alpha}$$
Loss scales as a power law with each of three variables: compute $C$, dataset size $D$, and model parameters $N$. Each follows the same functional form with different exponents $\alpha$.
Kaplan et al. modelled each variable's contribution separately. Double the compute, and loss drops by a predictable factor. Double the data, same pattern. Double the parameters, same again. The exponents differ, but the power-law form holds across model families and training setups. (We will see shortly that the variables are not truly independent: how you balance model size and data matters.)
Test loss as a function of compute, dataset size, and model parameters (log-log scale). Each relationship follows a smooth power law: a straight line on log-log axes. Adapted from Kaplan et al. (2020).
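The fit-and-extrapolate workflow can be sketched with synthetic pilot-run data; the exponent, coefficient, and FLOP counts below are made up for illustration.

```python
import numpy as np

# Synthetic losses following L = a * C^(-alpha) with small noise, mimicking
# measurements from a handful of small pilot runs (all numbers invented).
alpha_true, a = 0.05, 4.0
compute = np.logspace(18, 22, 6)  # FLOPs of the pilot runs
rng = np.random.default_rng(0)
loss = a * compute ** -alpha_true * np.exp(rng.normal(0, 0.01, 6))

# A power law is a straight line on log-log axes: log L = log a - alpha log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted alpha = {-slope:.3f}")  # close to the true 0.05

# Extrapolate to a target scale before committing the training budget.
target = 1e25
predicted = np.exp(intercept + slope * np.log(target))
print(f"predicted loss at 1e25 FLOPs: {predicted:.3f}")
```

This is the whole trick: because the relationship is linear in log space, a handful of cheap runs pins down the line, and the expensive run's loss is read off the extrapolation.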
Why This Matters: Forecasting Performance
Power laws are not just descriptive. They are predictive. Train a few small models, measure their loss, fit the curve, and extrapolate to any target scale. Before spending hundreds of millions of dollars on a training run, a lab can estimate the result.
This is how frontier models get planned. GPT-4, Claude, and Gemini were not shots in the dark. Their developers trained smaller pilot models, verified the scaling curve, and then committed compute at the predicted optimal point. Scaling laws turned AI development from guesswork into engineering.
For researchers using these models, the implications are direct. Understanding scaling explains why a 70B model handles nuanced political text better than a 7B model, why API costs differ by orders of magnitude, and why model choice is a research design decision, not just a technical one.
Implications for Model Development
Scaling laws drove the race from GPT-2 (1.5B) to GPT-3 (175B) to frontier models with hundreds of billions of dense parameters (or, in mixture-of-experts architectures, over a trillion total). Each generation validated the power-law prediction: more scale, better loss.
But power laws have diminishing returns. Each 10× increase in compute yields a smaller absolute improvement. The first doubling is dramatic; the tenth is marginal. Combined with rising costs of compute, energy, and data, pure scale hits practical limits.
Hoffmann et al. (2022) showed that most early large models were undertrained: balancing model size and training data matters as much as raw scale. This insight reshaped how training budgets are allocated.
These constraints are pushing the field toward complementary strategies: better data curation, improved architectures, reasoning-focused training (reinforcement learning from human feedback), and a fundamentally different approach, test-time compute scaling, which spends more compute at inference rather than at training time.
Coming in Day 2: We explore test-time compute scaling: using prompting strategies like chain-of-thought reasoning to extract more capability at inference. This represents a fundamental shift: instead of only scaling training, we can also scale thinking.
If scaling laws have diminishing returns, what strategies beyond raw scale might improve model performance? Think about the full pipeline: data, architecture, training objective, and inference.
Reveal
Several complementary approaches are being explored. Data curation: higher-quality, deduplicated, and domain-specific training data yields more per token. Architecture improvements: mixture-of-experts, longer contexts, and more efficient attention variants. Training objectives: reinforcement learning from human feedback (RLHF), constitutional AI, and reasoning-focused training. Test-time compute: chain-of-thought prompting, self-consistency, and tree search let models "think longer" at inference: a form of scaling that does not require retraining.
Key Takeaway
Scaling laws are both an explanation of the field's trajectory and a practical tool for forecasting. They tell us why larger models are better, how much better they will be, and when pure scaling hits diminishing returns. For social scientists, this means model choice is not arbitrary: it is a design decision with predictable consequences for task performance, cost, and capability.
Resources
- Kaplan et al. (2020), Scaling Laws for Neural Language Models: quantified the power-law relationships between compute, data, parameters, and loss. The paper that turned scaling into a predictive science.
- Hoffmann et al. (2022), Training Compute-Optimal Large Language Models: showed that optimal scaling requires balancing model size and training data. Essential reading for understanding modern training decisions.
Social Science Applications
The foundational concepts in this module are not just theoretical. They have direct, substantive applications in social science research. This section highlights key papers that demonstrate how word embeddings, language models, and the Transformer architecture open new research possibilities, and introduce new methodological challenges.
Bias Detection in Word Embeddings
The semantic projection technique we saw earlier has a powerful application: measuring societal biases in language. Because embeddings are trained on human-generated text, they absorb the biases present in that text.
Caliskan, Bryson & Narayanan (2017), Semantics derived automatically from language corpora contain human-like biases. Science. Demonstrated that word embeddings replicate a wide range of implicit biases measured in humans via the Implicit Association Test (IAT), including associations between European-American names and pleasant words, and between African-American names and unpleasant words. This paper established that embedding geometry is a lens for studying culture.
Garg, Schiebinger, Jurafsky & Zou (2018), Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. PNAS. By training embeddings on text from each decade of the 20th century, they tracked how gender and ethnic stereotypes in American English evolved over time: a computational history of bias that aligns with known social changes.
Cultural Analysis Through Geometry
Kozlowski, Taddy & Evans (2019), The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review. Established a framework for using vector space geometry to map cultural dimensions. They showed that the concepts of social class, affluence, and education are encoded as directions in embedding space, and that projecting words onto these axes reveals the cultural associations that texts carry. This approach transforms embeddings from a technical tool into a method for computational cultural analysis.
Open Science and Reproducibility
Spirling (2023), Why open-source generative AI models are an ethical imperative for social science. Nature Computational Science. Argues that transparency, reproducibility, and data control require access to open-source models. When a model is a black box controlled by a company, researchers cannot fully understand or verify their instrument. This paper connects the architectural foundations we have studied (pre-training data, model weights, tokenization) to fundamental questions of scientific integrity.
Broader Impact
Bail (2024), Can Generative AI Improve Social Science? PNAS. A balanced assessment of the potential benefits and pitfalls of integrating LLMs into social science workflows. Emphasises valid measurement, reproducibility, and the distinction between tasks where LLMs augment human researchers and tasks where they might introduce systematic error. A good frame for thinking about everything you will learn in the rest of this course.
Key Takeaway
The foundations covered in this module (embeddings, tokenization, language modeling, attention, and scaling) are not just technical prerequisites. They are the building blocks of a new research methodology. Understanding how models represent and process language is essential for using them responsibly as instruments of social science inquiry. The subsequent modules build directly on these foundations: prompting (Day 2) works because of next-token prediction, classification and RAG (Days 3–4) rely on contextual embeddings, and model selection (Days 2–3) requires understanding the scaling tradeoffs introduced here.