Module 2 of 5

From Models to Tools

Post-training alignment, prompting strategies, reasoning techniques, and how to evaluate and choose the right model for your research.

Post-Training

In Day 1, we built the core loop of a language model: tokenize, embed, attend, predict. We saw that a base model generates fluent text, but it just completes whatever you give it. Ask it to classify a tweet, and it writes another tweet. It is not being uncooperative; it was trained to predict the next token, so that is exactly what it does.

Something happens between a base model and ChatGPT that turns a next-token predictor into something that answers questions, follows instructions, and refuses harmful requests. That something is post-training.

The Base-to-Assistant Gap

Give a base model the prompt “Classify this tweet as positive or negative: I love this weather! Answer:” and it might continue with another tweet, a news headline, or anything else that looks plausible as a continuation. An instruction-tuned model would respond with “positive” because it has learned, through post-training, that this kind of prompt expects a label.

Both models have the same knowledge: they were pre-trained on the same data. The difference is entirely in how they were taught to behave. Post-training is a pipeline of techniques that reshapes behaviour without fundamentally changing what the model knows.

[Diagram: Pre-trained base model (next-token predictor) → SFT with (instruction, response) pairs (format and instruction following) → RLHF/DPO with (preferred, dispreferred) pairs (quality via human preference) → safety training with red-team prompts and constitutions (refusal boundaries) → aligned assistant, ready for use.]

The post-training pipeline. Each stage builds on the previous one and uses different training data (shown in blue below each box). Not every model follows this exact sequence (some combine stages or use variants), but the general pattern is consistent across major model families. Illustrative diagram.

Supervised Fine-Tuning (SFT)

The first stage: show the model curated (instruction, response) pairs (from as few as hundreds to hundreds of thousands) and train it to imitate the responses. This is analogous to giving someone a style guide with worked examples: “when asked X, respond like Y.”

Definition

Supervised Fine-Tuning (SFT)

Training a pretrained model on curated (instruction, response) pairs so it learns to produce task-oriented responses instead of open-ended text completions. SFT teaches the model the format of helpful behaviour: how to structure answers, when to ask for clarification, and how to handle different request types.

After SFT, the model can follow instructions and produce formatted responses. But it has only a limited sense of which valid response is better than another: it has learned from high-quality demonstrations, but it cannot reliably distinguish between two on-topic responses that differ in subtle quality dimensions like depth, accuracy, or nuance. Something more is needed to teach fine-grained quality discrimination.

Reward Modeling & RLHF

The key idea behind Reinforcement Learning from Human Feedback (RLHF): let humans define “quality” by comparing pairs of model outputs. For a given prompt, the model generates two responses. A human annotator decides which is better. This preference data is then used to train a reward model: a separate neural network that predicts how much a human would prefer a given response.

Definition

RLHF (Reinforcement Learning from Human Feedback)

A training procedure where human preference judgments are distilled into a reward model, which is then used to optimise the language model via reinforcement learning (typically PPO). The model learns to produce responses that score highly on the reward model while remaining close to the original SFT model.

Once the reward model is trained, the language model is optimised using Proximal Policy Optimization (PPO) to produce responses that score highly. But there is a critical constraint: the optimised model must not stray too far from the SFT model. Without this constraint, the model could discover degenerate strategies that exploit the reward model: producing high-scoring but nonsensical text. The KL divergence penalty prevents this.

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot|x)}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi(\cdot|x) \;\|\; \pi_{\text{ref}}(\cdot|x)\right]$$

The policy $\pi$ is trained to maximise expected reward $r(x, y)$ while staying close to the reference policy $\pi_{\text{ref}}$ (the SFT model). The coefficient $\beta$ controls how strongly the model is penalised for deviating from the reference policy: keeping generations close to the SFT model's distribution and preventing degenerate outputs.
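To make the objective concrete, here is a minimal sketch in plain Python (a toy illustration with made-up distributions, not a training loop): it computes the per-example quantity reward − β·KL for discrete distributions, showing how the KL penalty reduces the effective reward as the policy drifts from the reference.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(reward, policy_dist, ref_dist, beta=0.1):
    """Toy per-example RLHF objective: reward minus the KL penalty.

    `reward` is the scalar r(x, y) from the reward model; the KL term
    penalises the policy for drifting away from the SFT reference.
    """
    return reward - beta * kl_divergence(policy_dist, ref_dist)

# A policy identical to the reference pays no KL penalty...
ref = [0.5, 0.3, 0.2]
print(rlhf_objective(1.0, ref, ref))        # 1.0
# ...while a drifted policy sees its effective reward reduced.
drifted = [0.8, 0.15, 0.05]
print(rlhf_objective(1.0, drifted, ref))
```

Raising β pulls the policy harder toward the SFT model; lowering it lets the policy chase reward more aggressively, at the risk of reward hacking.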

[Diagram: the two phases of RLHF. Phase 1 (collect preferences): human annotators compare response pairs (“A is better than B”), and this preference data trains a separate reward model that predicts r(x, y), which is then frozen. Phase 2 (optimise policy): the policy model generates a response y, the reward model scores it, and PPO updates the policy to maximise reward − β · KL; a frozen reference model (the SFT checkpoint) anchors the KL term, keeping the policy near the SFT distribution. The loop iterates: generate, score, update.]

The RLHF architecture. Two separate neural networks interact: the reward model (trained once on human preferences, then frozen) scores responses, while PPO updates the policy model to maximise reward. The reference model (a frozen copy of the SFT checkpoint) provides the KL constraint that prevents the policy from producing degenerate, high-scoring text. Illustrative diagram.

The RLHF framework was developed by Christiano et al. (2017) for general reinforcement learning and adapted for language models by Ouyang et al. (2022) in the InstructGPT paper. The core innovation was showing that relatively modest amounts of human preference data (tens of thousands of comparisons, not millions) could dramatically improve model behaviour.

Direct Preference Optimization (DPO)

Rafailov et al. (2023) showed that the separate reward model in RLHF is not strictly necessary. Their key insight: the language model itself can serve as an implicit reward model. DPO directly optimises the model on preference pairs without training a separate reward model or running an RL loop.

Definition

Direct Preference Optimization (DPO)

A training method that aligns a language model with human preferences by directly optimising on pairs of preferred and dispreferred responses, without training a separate reward model. DPO achieves comparable results to RLHF with simpler training infrastructure.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

DPO directly increases the probability of the preferred response $y_w$ relative to the dispreferred $y_l$, each measured against the reference model. The language model itself serves as an implicit reward model: no separate reward model is needed.

Notation note: the DPO loss writes the policy as $\pi_\theta$ to emphasise that $\theta$ are the parameters being optimised; it is the same policy written as $\pi$ in the RLHF objective above.

In practice, DPO tends to produce results comparable to RLHF with substantially simpler training infrastructure: no reward model to maintain, no RL training loop to stabilise. Many recent open models, including Llama 3 and Qwen 2.5, use DPO or its variants for preference alignment.
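The DPO loss for a single preference pair is simple enough to compute by hand. A minimal sketch in plain Python, using invented log-probability values for illustration:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    Inputs are total log-probabilities of the two responses under the
    policy (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy favours the
    # preferred response than the reference does, minus the same for
    # the dispreferred response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Before training, policy == reference: the margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))   # ≈ 0.693
# If the policy raises the preferred response's log-prob, the loss falls.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

Gradient descent on this loss pushes the policy to assign relatively more probability to $y_w$ and less to $y_l$, with the reference model serving as the anchor, exactly as the equation above describes.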

Safety Training & Constitutional AI

The final stage of post-training addresses safety: teaching the model when not to comply.

Red-teaming: human testers deliberately try to elicit harmful outputs (instructions for violence, discriminatory content, private information). The model is then trained to refuse these inputs.

Constitutional AI (Bai et al., 2022): the model critiques its own outputs against a set of written principles (a “constitution”) and is trained to prefer the self-revised versions. This reduces reliance on human annotators for safety-related feedback.

Safety training introduces an alignment tax: the model may become overly cautious, refusing benign requests that superficially resemble harmful ones. For researchers, this manifests as models declining to engage with sensitive topics even when the research purpose is legitimate. Understanding this trade-off helps diagnose unexpected refusals in research workflows.

What Post-Training Changes (And What It Doesn’t)

Post-training primarily reshapes behaviour rather than adding knowledge. The vast majority of the model’s factual knowledge, language capabilities, and reasoning patterns come from pre-training. If the base model has weak coverage of a domain, post-training alone is unlikely to fill that gap: the instruct version will just express its ignorance more politely, or (worse) confabulate an answer more convincingly.

Definition

Hallucination (Confabulation)

The generation of confident, well-structured text that is factually incorrect, unsupported, or fabricated. Language models do not “know” whether their outputs are true: they produce text that is statistically plausible given the context. Post-training can make hallucinations more dangerous: RLHF trains the model to produce authoritative-sounding responses, which means fabricated claims may be delivered with the same confidence as accurate ones. For research, this means model outputs should never be treated as ground truth without independent verification.

For research: when a model gets a classification wrong, the problem is usually a capability limitation from pre-training or a prompting issue, not a post-training failure. Post-training determines the format (one-word answer vs. paragraph), the style (hedged vs. confident), and the boundaries (comply vs. refuse). It does not determine the accuracy.

But post-training also introduces systematic biases that matter for research. RLHF annotators encode their preferences into the reward model. The demographic composition of the annotator pool, their cultural context, and the guidelines they follow all shape what the model considers a “good” answer. We return to this in the Social Science Applications section.

Stop and Think

RLHF annotators decide what “helpful” means. If the annotator pool skews toward a particular demographic, political leaning, or cultural context, how might that affect a model’s usefulness for research on politically sensitive topics?

Reveal

The model may systematically favour framings, perspectives, or conclusions that align with the annotator pool’s views. On politically contested topics, it may present one position as more “balanced” or “helpful.” For social science research, this means the model is not a neutral instrument: its outputs carry the preferences baked in during alignment. This is a form of measurement bias that researchers must account for.

In the notebook: Exercise 1 puts you in the role of an RLHF annotator. You rank pairs of model responses and discover first-hand how subjective “better” can be, especially on contested topics like immigration policy or causal inference methodology.

Key Takeaway

Post-training turns a language model into a tool. SFT teaches format: how to respond. Preference optimization (RLHF or DPO) teaches quality: which responses are better. Safety training teaches boundaries: when not to respond. In practice these stages interact (SFT data quality affects output quality, and preference optimization can reshape format as well as content), but the separation is a useful mental model. These are design choices made by the model provider, and they shape every output the model produces in your research.


Prompting

Post-training gave us a model that follows instructions. The quality of those instructions (the prompt) now determines the quality of the output. For social scientists, this is not just a practical concern. A prompt is a measurement instrument: it defines what construct you measure, how reliably you measure it, and whether your results replicate.

Prompting as a Research Instrument

Consider the analogy to survey design. A survey question’s wording determines what respondents report. Leading questions produce biased responses. Ambiguous questions produce noisy responses. The same is true for prompts: a classification prompt that asks for “sentiment” measures something different from one that asks for “stance,” even when applied to the same text.

Just as survey methodology demands pre-testing, piloting, and reporting the exact question wording, prompt-based research should demand the same rigour. The prompt is the instrument. Report it fully, test its sensitivity, and validate it against known ground truth.

Zero-Shot Prompting

The simplest approach: describe the task and provide the input, with no examples. The model relies entirely on its pre-trained knowledge and post-training to interpret the request.

Zero-shot prompting works well when the task is unambiguous and aligns with the model’s training distribution: standard sentiment analysis, language identification, simple factual questions. It tends to struggle with tasks that require a specific interpretation of categories, domain conventions, or nuanced distinctions (such as the difference between stance and sentiment).

Few-Shot Prompting & In-Context Learning

A striking property of large language models: they can learn new tasks from a handful of examples provided in the prompt itself. This is in-context learning, demonstrated at scale by Brown et al. (2020) in the GPT-3 paper.

In few-shot prompting, you include labeled examples before the target input. The model infers the classification pattern from these examples and applies it to the new input. This is not fine-tuning: no weights are updated. The model processes the examples as part of its context and adapts its behaviour accordingly.

Definition

In-Context Learning

The ability of a language model to adapt its behaviour based on examples provided in the prompt, without any weight updates. The model infers the task from the pattern of input–output pairs and applies it to new inputs. In-context learning tends to improve with model size: larger models learn from examples more reliably.

Key practical considerations for few-shot prompting: balance the examples across classes (do not provide five “support” examples and one “oppose” example). Shuffle the order, since models can be sensitive to which class appears last. And keep examples representative of the cases the model will encounter, not just easy ones.
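A minimal sketch of assembling a few-shot prompt that follows these considerations; the tweets, labels, and prompt layout are invented for illustration, not a required format:

```python
import random

def build_few_shot_prompt(instruction, examples, target_text, seed=42):
    """Assemble a few-shot classification prompt from labeled examples.

    Shuffling avoids a fixed class always appearing last, which some
    models are sensitive to; a fixed seed keeps the prompt reproducible.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    lines = [instruction, ""]
    for text, label in shuffled:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {target_text}\nLabel:")
    return "\n\n".join(lines)

# Balanced examples: two per class.
examples = [
    ("Great turnout at the march today!", "support"),
    ("These protesters are blocking ambulances.", "oppose"),
    ("Proud to stand with the marchers.", "support"),
    ("The march achieved nothing but traffic.", "oppose"),
]
prompt = build_few_shot_prompt(
    "Classify the stance of each tweet toward the march as support or oppose.",
    examples,
    "I love this weather!",
)
print(prompt)
```

Because the prompt string is built deterministically from the seed, it can be released verbatim with replication materials.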

System Prompts & Role Specification

Most API-based models accept a system prompt that frames the model’s identity and behavioural constraints before the user’s message. System prompts can set context (“You are a political science research assistant”), define output format (“Always respond with valid JSON”), and establish boundaries (“Do not provide your own opinion on the topic”).

For research applications, well-crafted system prompts can improve consistency by anchoring the model’s behaviour across many classifications. However, system prompts are not a guarantee: sufficiently unusual inputs can override them. They are best thought of as strong defaults, not hard constraints.

Structured Outputs

When building classification pipelines, you need machine-parseable output, not free-form text. Structured output means asking the model to respond in a specific format: JSON, XML, or a constrained schema.

Larger models comply with formatting instructions more reliably. Smaller models (under ~10B parameters) often produce nearly-valid JSON with small errors: a missing comma, an extra field. Modern APIs from major providers now offer guaranteed structured output modes that constrain the model’s generation to valid JSON conforming to a provided schema, eliminating parsing failures at the cost of slightly constrained generation.
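When guaranteed structured output modes are unavailable, a strict parse with a lenient fallback is a common defensive pattern. A minimal sketch, where the `label` field name and the allowed values are illustrative assumptions:

```python
import json
import re

def parse_label(raw_output, allowed=("positive", "negative")):
    """Parse a model's JSON response, falling back to a regex scan.

    Smaller models often emit nearly-valid JSON (surrounding chatter,
    trailing text), so try a strict parse first, then a lenient search.
    """
    try:
        data = json.loads(raw_output)
        label = data.get("label")
    except (json.JSONDecodeError, AttributeError):
        # Fallback: look for "label": "<value>" anywhere in the text.
        match = re.search(r'"label"\s*:\s*"(\w+)"', raw_output)
        label = match.group(1) if match else None
    # Reject anything outside the expected label set.
    return label if label in allowed else None

print(parse_label('{"label": "positive"}'))                         # positive
print(parse_label('Sure! {"label": "negative"} Hope that helps.'))  # negative
print(parse_label('{"label": "banana"}'))                           # None
```

Logging the inputs that hit the fallback (or return None) gives you a direct measure of how often the model violates your format.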

Temperature & Sampling Parameters

Beyond the prompt itself, sampling parameters control how the model selects tokens from its predicted probability distribution. These settings directly affect output variability and reproducibility.

Definition

Temperature

A parameter that scales the model’s output logits before applying softmax. Temperature = 0 (or near-zero) makes the model deterministic, always selecting the highest-probability token. Higher temperatures flatten the distribution, increasing randomness and diversity. For classification and annotation tasks, low temperature (0–0.2) is standard to ensure consistent, reproducible outputs.

A related parameter, top-p (nucleus sampling), restricts token selection to the smallest set whose cumulative probability exceeds a threshold (e.g., 0.95). In practice, most research workflows set temperature to 0 and leave top-p at its default. The key point: always report your sampling parameters. A classification study run at temperature 0 and the same study run at temperature 0.7 can produce meaningfully different results, and failing to report this makes replication impossible.
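Temperature scaling is a one-line transformation of the logits. A minimal sketch (with made-up logits) showing how low temperature sharpens the distribution and high temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by temperature first.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it, increasing sampling diversity.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # moderate spread
print(softmax_with_temperature(logits, 0.1))   # near-deterministic: top token ~1
print(softmax_with_temperature(logits, 5.0))   # close to uniform
```

Temperature 0 is the limit of this process: the distribution collapses onto the highest-probability token, which is why it yields deterministic (greedy) decoding.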

Prompt Sensitivity & Reproducibility

Here is the finding that should concern every social scientist using LLMs: small changes to prompt wording can produce large changes in classification results.

Sclar et al. (2024) systematically tested how minor prompt variations (rephrasing the instruction, changing label names, reordering components) affect model outputs. They found that accuracy can swing by over 10 percentage points from seemingly equivalent prompts. This is the LLM equivalent of question wording effects in survey research.

Definition

Prompt Sensitivity

The degree to which a model’s output changes in response to semantically equivalent prompt reformulations. High prompt sensitivity means that classification results are fragile: they depend on arbitrary wording choices rather than the underlying signal in the data.

Implications for research: if your results depend on how you phrase the prompt, you have a reproducibility problem. Best practice is to test multiple prompt variants, report the range of results, and select the prompt based on validation against a gold-standard set, not based on which variant happens to produce the best-looking numbers.
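A minimal sketch of the bookkeeping for a sensitivity check: score each prompt variant against gold labels and report the full range. The predictions here are invented to illustrate the pattern, not real model outputs:

```python
def accuracy(predictions, gold):
    """Fraction of predictions matching the gold-standard labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical results: the same 10 tweets classified under three
# semantically equivalent prompt wordings.
gold = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
variants = {
    "v1: 'Classify the sentiment'":
        ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "pos", "pos", "neg"],
    "v2: 'Is this tweet positive or negative?'":
        ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos", "neg"],
    "v3: 'Label the tweet sentiment'":
        ["pos", "neg", "pos", "pos", "pos", "neg", "pos", "neg", "pos", "neg"],
}

scores = {name: accuracy(preds, gold) for name, preds in variants.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
# Report the full range, not just the best-looking number.
print(f"range: {min(scores.values()):.2f}-{max(scores.values()):.2f}")
```

The same loop scales to real pipelines: swap the hard-coded predictions for API calls and keep the gold set fixed.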


Stop and Think

You’re designing a prompt to classify political speeches as populist or non-populist. Would you use zero-shot, few-shot, or chain-of-thought prompting? What are the trade-offs?

Reveal

Few-shot with chain-of-thought is likely best for this task. Zero-shot risks inconsistent criteria: the model may apply its own implicit definition of populism, which may not match yours. Few-shot examples anchor the model’s understanding of your specific operationalization. Chain-of-thought reasoning makes the classification rationale transparent and auditable. The trade-off is token cost and latency per classification, which matters at scale.

Stop and Think

You run the same classification task with three prompt variants and get F1 scores of 0.82, 0.71, and 0.78. What should you report in your paper?

Reveal

Report all three. If you only report 0.82, you are cherry-picking the most flattering result: the same methodological error as running multiple survey question versions and only reporting the one that confirms your hypothesis. Best practice: report the range, explain your selection criterion (e.g., validated against a gold standard), and ideally release all prompt variants with your replication materials.

In the notebook: Exercises 2–4 walk you through zero-shot classification, few-shot classification (with varying numbers of examples), and a systematic prompt sensitivity experiment on real political tweets. You will see first-hand how small wording changes shift classification results.

Key Takeaway

Prompting is experimental design. Your prompt determines what you measure (construct validity), how reliably you measure it (reproducibility), and whether examples improve performance (in-context learning). Treat prompt design with the same rigour you would apply to survey question design: pre-test, validate, and report fully.


Reasoning

Standard prompting, whether zero-shot or few-shot, works well for many tasks. But some cases require more: sarcastic tweets, implicit stance, ambiguous framing, or tasks that involve multiple logical steps. For these, asking the model to reason before answering can substantially improve accuracy.

This section traces the evolution from a simple prompting technique (chain-of-thought) to a new class of models specifically trained to reason, and connects both to the scaling laws from Day 1.

Chain-of-Thought Prompting

Wei et al. (2022) showed that adding the phrase “think step by step” or providing worked examples with explicit reasoning dramatically improves performance on tasks requiring multi-step inference. The idea is simple: instead of asking for a direct answer, ask the model to show its work.

Definition

Chain-of-Thought (CoT) Prompting

A prompting strategy where the model is asked to produce intermediate reasoning steps before giving a final answer. By decomposing a problem into sub-steps, the model can handle tasks that require logical inference, disambiguation, or multi-step reasoning: tasks where a direct answer would often be wrong.

Why does this work? One hypothesis: the intermediate tokens generated during reasoning serve as a form of working memory. The model can “store” partial results in the generated text and attend back to them when producing the final answer. Without CoT, the model must compress all reasoning into the implicit computation within a single forward pass: a much harder task.

[Diagram: direct answer vs. chain-of-thought. The direct prompt (“Does this tweet support or oppose the march? Answer:”) yields “support” in a single forward pass: fast, but incorrect for a sarcastic tweet. The CoT prompt (“Explain the author's position, then classify: support or oppose.”) first notes that the tweet uses sarcasm (“great job marching”) to mock the protest, then correctly answers “oppose”.]

Direct prompting vs. chain-of-thought on a sarcastic tweet. The reasoning step gives the model an opportunity to identify sarcasm before committing to a label. Illustrative example.

Zero-Shot Chain-of-Thought

Kojima et al. (2022) discovered that simply appending “Let’s think step by step” to a prompt: without any examples: is often enough to elicit reasoning. This zero-shot CoT approach is remarkably effective for its simplicity: it requires no example engineering and adds minimal tokens to the prompt.

The trade-off is control. With few-shot CoT, you guide the model’s reasoning by showing what good reasoning looks like. With zero-shot CoT, the model decides its own reasoning strategy, which can be inconsistent across inputs.

Reasoning Models

A recent class of models takes the chain-of-thought idea further: instead of relying on the user to prompt for reasoning, the model is trained to reason before answering. These models generate an internal “thinking” process (sometimes visible to the user, sometimes hidden) before producing a final response.

Key examples include OpenAI’s o1 and o3 models, DeepSeek-R1 (open-weight, trained with reinforcement learning to develop reasoning strategies), and Claude with extended thinking. What makes these models notable is how reasoning emerges: DeepSeek-R1 showed that training a model with reinforcement learning, rewarding correct answers without specifying how to reason, can cause the model to independently develop chain-of-thought-like strategies.

Test-Time Compute Scaling

In Day 1, we saw that model performance follows power laws with respect to training compute, data, and parameters. Reasoning models introduce a different kind of scaling: test-time compute. Instead of making the model bigger or training it longer, you let it think longer on each input.

Definition

Test-Time Compute Scaling

Improving model performance by allocating more computation during inference (generating the answer) rather than during training. Reasoning models, chain-of-thought prompting, and self-consistency methods (generating multiple CoT responses and taking a majority vote) all exploit test-time compute: the model generates more tokens per response, effectively “thinking longer” about harder problems.
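Of the methods above, self-consistency is the easiest to implement yourself. A minimal sketch of the majority-vote step, using hypothetical final answers extracted from five sampled reasoning traces:

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Aggregate multiple chain-of-thought samples by majority vote.

    Each element is the final answer extracted from one sampled
    reasoning trace (temperature > 0 so the traces differ).
    Returns the winning answer and the agreement rate.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Hypothetical final answers from five CoT samples:
answers = ["oppose", "oppose", "support", "oppose", "oppose"]
label, agreement = self_consistency_vote(answers)
print(label, agreement)   # oppose 0.8
```

The agreement rate doubles as a rough uncertainty signal: low-agreement inputs are good candidates for human review.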

This represents a shift in how we think about model capability. Traditional scaling asked: how big should the model be? Test-time scaling asks: how long should the model think? The two are complementary. A smaller model that thinks for 30 seconds can sometimes match a larger model that answers instantly, at the cost of higher latency and more output tokens.

When to Use Reasoning

Reasoning, whether through CoT prompting or dedicated reasoning models, is not always worth the cost. The gains are largest on tasks that involve:

Ambiguity: sarcasm, irony, implicit stance, or texts where the surface meaning differs from the intended meaning.

Multi-step inference: tasks requiring combination of multiple pieces of information.

Complex categorisation: classification schemes with many categories or fine-grained distinctions.

For simple binary classification on clear-cut texts, standard prompting is usually sufficient and far cheaper. Running a reasoning model on 10,000 straightforward tweets is wasteful: the model will “think” about each one and reach the same answer it would have given instantly, at 10× the cost and latency.

Stop and Think

For binary sentiment classification on 10,000 tweets, would you use a reasoning model? What about for interpreting a single ambiguous policy speech?

Reveal

For 10,000 tweets: no. Most tweets are unambiguous, and the per-token cost and latency of reasoning models would be wasteful. Use a standard model with zero-shot or few-shot prompting, and reserve CoT for the subset of cases the model is uncertain about. For a single ambiguous speech: yes. The cost of extra tokens is negligible for one input, and the reasoning process helps the model handle complex framing, mixed signals, and implicit positions.

In the notebook: Exercise 5 tests chain-of-thought prompting on the tweets that the model misclassified with direct prompting. Exercise 6 demonstrates construct validity: the difference between classifying sentiment and classifying stance.

Key Takeaway

Chain-of-thought prompting and reasoning models let you trade compute at inference for accuracy on hard cases. Use them selectively: the gains are largest on ambiguous, multi-step tasks. For bulk classification of clear-cut texts, standard prompting is usually sufficient and far more cost-effective. The emerging paradigm of test-time compute scaling complements traditional training-time scaling: instead of only making models bigger, we can also let them think longer.


Evaluating & Choosing Models

You now know how to use a model: post-training made it follow instructions, prompting lets you frame tasks precisely, and reasoning techniques handle harder cases. But which model? The landscape includes hundreds of options (open and closed, small and large, general and specialised), and the right choice depends on your specific research context.

Perplexity Revisited

In Day 1, we introduced perplexity as a measure of how well a language model predicts text.

$$\text{PPL}(X) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{\lt t})\right)$$

Lower perplexity means the model predicts the text more confidently. A perplexity of $k$ means the model is, on average, as uncertain as choosing uniformly among $k$ options.
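The formula is straightforward to compute from per-token log-probabilities. A minimal sketch (with made-up probabilities) that also illustrates the "uniform among k options" interpretation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp(-mean(log P(x_t | x_<t)))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If every token has probability 1/4, perplexity is 4: the model is as
# uncertain as a uniform choice among 4 options.
uniform4 = [math.log(0.25)] * 10
print(perplexity(uniform4))   # ≈ 4.0

# More confident per-token predictions give lower perplexity.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
print(perplexity(confident))
```

Note the exponential of a mean: one very surprising token (a large negative log-probability) can dominate the perplexity of an otherwise well-predicted text.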

Perplexity is the standard training metric for language models, and lower perplexity generally correlates with better text generation within a model family. But for downstream tasks (classification, information extraction, summarisation), perplexity is a poor predictor of performance. A model with lower perplexity is not necessarily better at classifying your political texts. The connection between language modelling ability and task-specific accuracy is mediated by model architecture, training data composition, post-training choices, and the nature of the task.

Perplexity is useful for comparing variants of the same model (e.g., different checkpoints during training) but unreliable for comparing across model families, architectures, or sizes. Treat it as necessary but insufficient.

The Benchmark Landscape

The field evaluates models using standardised benchmarks. Each captures a different dimension of capability:

MMLU (Hendrycks et al., 2021): 57-subject multiple-choice test covering STEM, humanities, and social sciences. The most widely cited general benchmark. Scores above ~90% are now common among frontier models, leading to concerns about ceiling effects.

HumanEval: code generation benchmark testing whether models can write correct Python functions from docstrings. Important for research workflows that involve code generation, but narrow in scope.

GPQA (Rein et al., 2023): graduate-level science questions designed to be difficult even for domain experts. Intended to measure deep reasoning, but the test set is small, making scores noisy.

Chatbot Arena (Chiang et al., 2024): human preference rankings from blind comparisons on real conversations. Arguably the most ecologically valid benchmark, since it reflects how users actually experience model quality. But it measures general helpfulness, not research-specific capability.

Benchmark Limitations

Every benchmark has fundamental limitations. Three deserve particular attention:

Definition

Benchmark Contamination

The possibility that a model has seen benchmark questions or closely related content during pre-training, inflating its score beyond its true capability. Because language models are trained on large internet corpora, and benchmark questions often appear online, some degree of contamination is difficult to rule out for any widely used benchmark.

Contamination is endemic. Benchmark questions leak into training data, and models may have effectively memorised answers rather than demonstrating generalizable reasoning on novel problems. Newer benchmarks attempt to mitigate this with dynamic question sets, but the arms race between benchmark creators and training data curators is ongoing.

Gaming: model providers can optimise specifically for benchmark performance (through targeted fine-tuning, example selection, or architectural choices) without improving general capability. A model that scores well on MMLU is not guaranteed to handle your specific annotation task well.

Construct validity: this is the most important limitation for social scientists. No standard benchmark measures your task. MMLU tests broad knowledge; your research might require nuanced understanding of political rhetoric. Chatbot Arena tests conversational helpfulness; your research might require precise, consistent annotation. The benchmark that matters most is the one you build yourself.

Open vs. Closed Models

The choice between open-weight and closed (proprietary API) models involves real trade-offs across five dimensions:

Capability: frontier closed models (GPT-4o, Claude 3.5 Sonnet, Gemini) currently tend to outperform open models on the hardest tasks, though the gap has narrowed substantially. For many classification and extraction tasks, the best open models (Llama 3, Qwen 2.5, DeepSeek) perform comparably.

Cost: open models are free to use (you pay only for compute), while closed APIs charge per token. For large-scale annotation projects, the cost difference can be orders of magnitude. Self-hosting an open model on a GPU cluster is capital-intensive but can be dramatically cheaper per token at scale.

Transparency: open models give you full access to weights, architecture details, and (sometimes) training data documentation. Closed models are opaque: you cannot inspect why they produce a particular output. For research that requires understanding the instrument, transparency matters.

Data privacy: with open models, your data never leaves your infrastructure. With closed APIs, your data is sent to the provider’s servers. For sensitive data (survey responses, medical records, classified documents), this may be unacceptable under IRB protocols or data protection regulations.

Reproducibility: open model weights are fixed. You can pin a specific version and guarantee identical outputs indefinitely. Closed models can change without notice: the provider may update the model between when you run your study and when reviewers try to replicate it. For scientific work, this is a significant concern.
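To make the cost dimension concrete, the per-token arithmetic can be sketched as a back-of-envelope estimate. All numbers below (document counts, token counts, and per-million-token prices) are hypothetical assumptions for illustration, not current rates for any provider.

```python
# Back-of-envelope API cost estimate for a large-scale annotation project.
# Prices and token counts are illustrative assumptions, not real quotes.

def estimate_api_cost(n_docs, tokens_per_doc, tokens_per_response,
                      price_in_per_m, price_out_per_m):
    """Estimated USD cost: input and output tokens priced per million."""
    input_cost = n_docs * tokens_per_doc * price_in_per_m / 1_000_000
    output_cost = n_docs * tokens_per_response * price_out_per_m / 1_000_000
    return input_cost + output_cost

# 1M tweets, ~120 input tokens each (instructions + tweet), ~5 output tokens
# (a label), at hypothetical prices of $2.50 / $10.00 per million tokens.
cost = estimate_api_cost(1_000_000, 120, 5, 2.50, 10.00)
print(f"${cost:,.0f}")  # prints "$350"
```

Swapping in a hypothetical frontier model at ten times the price turns the same run into a few thousand dollars, which is why the cost dimension dominates at annotation scale.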

Cost-Capability Frontiers

Definition

Cost-Capability Frontier

The boundary of achievable capability at each cost level. Models on the frontier offer the best performance for their price; models below it are dominated: another model achieves equal or better performance at equal or lower cost. The frontier shifts over time as new models are released.

The key insight for researchers: you rarely need the most capable model. If a 70B open model achieves 92% accuracy on your task and the frontier closed model achieves 95%, the open model may be the better choice for a project that requires processing millions of documents, especially if data privacy is a concern.
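The dominance check in the definition above is mechanical enough to sketch in code. The model names, costs, and capability scores below are hypothetical, chosen only to mirror the 92%-vs-95% example.

```python
# A minimal sketch of the dominance check behind the cost-capability frontier.
# Names, costs (USD per million tokens), and scores are hypothetical.

models = {
    "open-70b":   {"cost": 0.60,  "score": 92.0},
    "frontier-x": {"cost": 10.00, "score": 95.0},
    "small-8b":   {"cost": 0.10,  "score": 84.0},
    "legacy-lg":  {"cost": 12.00, "score": 90.0},  # costs more, scores less
}

def on_frontier(name, models):
    """A model is dominated if another matches or beats it on capability at
    equal or lower cost, with at least one strict improvement."""
    m = models[name]
    for other, o in models.items():
        if other == name:
            continue
        if (o["score"] >= m["score"] and o["cost"] <= m["cost"]
                and (o["score"] > m["score"] or o["cost"] < m["cost"])):
            return False
    return True

frontier = [n for n in models if on_frontier(n, models)]
print(frontier)  # "legacy-lg" is dominated by "frontier-x"; the rest remain
```

Note that three of the four hypothetical models sit on the frontier at different price points: the frontier is a curve of reasonable choices, not a single winner.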

The cost-capability frontier shifts rapidly. A model that was frontier-quality six months ago may be outperformed by one that costs a fraction of the price today. Check current comparisons (e.g., Artificial Analysis) before committing to a model for a large project.

Approximate positions based on public benchmarks and pricing as of early 2025. Capability reflects an approximate composite of MMLU, Arena ELO, and coding benchmarks; no single number captures true capability (see text). The dashed line marks the cost-capability frontier. The landscape changes rapidly, so verify current data before making decisions.

Building Task-Specific Evaluations

Because no standard benchmark measures your task, you need to build your own evaluation. The process follows the same logic as validating any measurement instrument:

1. Create a gold-standard set. Manually label a subset of your data (50–200 items is often sufficient for initial evaluation). Use multiple coders and compute inter-annotator agreement.

2. Pilot multiple models. Run your classification prompt on the gold-standard set with 2–4 candidate models. Compare accuracy, F1, and, crucially, patterns of disagreement with your human coders.

3. Assess failure modes. Where does the model disagree with humans? Are the errors random, or systematic? Systematic errors suggest a model-task mismatch; random errors may be tolerable.

4. Document and iterate. Record the model version, prompt, temperature setting, and evaluation metrics. These are your instrument specifications: they belong in the methods section of your paper.
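Steps 2 and 3 can be sketched with a toy gold-standard set. The labels below are illustrative, and the pure-Python Cohen's kappa is a minimal sketch; in practice a library implementation (e.g. from scikit-learn) would do the same job.

```python
# A toy sketch of steps 2-3: scoring one candidate model against a
# hand-coded gold-standard set. Labels are illustrative, not real data.
from collections import Counter

gold  = ["pos", "neg", "pos", "neu", "neg", "pos", "neu", "neg", "pos", "pos"]
model = ["pos", "neg", "neu", "neu", "neg", "pos", "pos", "neg", "pos", "neg"]

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def cohens_kappa(gold, pred):
    """Chance-corrected agreement; also usable between two human coders."""
    n = len(gold)
    p_obs = sum(g == p for g, p in zip(gold, pred)) / n
    gc, pc = Counter(gold), Counter(pred)
    p_exp = sum(gc[lab] * pc[lab] for lab in gc | pc) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Step 3: tally disagreements; a repeated (gold, model) pair is a hint of
# systematic error rather than random noise.
confusions = Counter((g, p) for g, p in zip(gold, model) if g != p)

print(f"accuracy = {accuracy(gold, model):.2f}")      # 0.70
print(f"kappa    = {cohens_kappa(gold, model):.2f}")  # 0.53
print(confusions.most_common())
```

The gap between raw accuracy and kappa illustrates why chance-corrected agreement belongs in your instrument specification: 70% raw agreement shrinks to 0.53 once chance agreement on the label distribution is accounted for.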

Stop and Think

A model scores 90% on MMLU but only 65% on your annotation task. What might explain the gap?

Reveal

Several factors could explain the discrepancy. MMLU tests broad factual knowledge, while your task may require domain-specific understanding, nuanced text interpretation, or adherence to a specific coding scheme. The model may also struggle with the particular format of your data (tweets vs. academic text, informal language, abbreviations). Additionally, MMLU uses multiple-choice format, which is fundamentally different from open-ended classification. This is precisely why task-specific evaluation is essential.

In the notebook: Exercise 7 has you save your best classification results and manually label 10 tweets. Day 3 opens by computing inter-annotator agreement between you and the model, the start of a proper validation pipeline.

Coming in Day 3: We move from single-model evaluation to building complete classification pipelines at scale: API access, batching, cost management, and systematic validation using inter-annotator agreement metrics. We also cover when prompting is not enough and you need to fine-tune.

Key Takeaway

No single model is best for everything. Standard benchmarks are useful for rough comparisons but suffer from contamination, gaming, and construct validity issues. The benchmark that matters most is the one you build yourself. Choose models based on your specific needs: data sensitivity, task complexity, budget, and reproducibility requirements.

Resources

Social Science Applications

The tools covered in this module (alignment, prompting, reasoning, evaluation) are already transforming social science methodology. This section highlights key papers that demonstrate how these techniques open new research possibilities and introduce new methodological challenges.

Alignment & Opinion Representation

Post-training does not just affect model style: it shapes what perspectives models represent. Understanding this is essential for any research that uses model outputs as data.

Santurkar et al. (2023), Whose Opinions Do Language Models Reflect? ICML. Demonstrates that RLHF shifts model outputs toward the opinion distributions of specific demographic groups, particularly those overrepresented among annotators. For social scientists using LLMs to simulate human responses or measure public opinion, this finding is critical: the model’s “opinions” are partly an artifact of whose preferences shaped its training.

Prompting as Experimental Design

The prompting techniques covered in this module are not just practical tools: they are methodological choices with direct implications for research validity.

Ziems et al. (2024), Can Large Language Models Transform Computational Social Science? Computational Linguistics. A comprehensive review of LLM applications across computational social science that treats prompt design as experimental design. The paper emphasises that methodological validity in CSS requires the same attention to instrument design that traditional social science demands, and that LLMs introduce new validity concerns (prompt sensitivity, model opacity, temporal instability) alongside their capabilities.

Argyle et al. (2023), Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis. Demonstrates that prompt framing (how you specify a persona’s demographic characteristics, backstory, and context) substantially determines the distribution of simulated survey responses. Under carefully controlled prompting, LLMs can approximate some population-level patterns, but the prompt is doing much of the work. This connects the prompting techniques from this module directly to the question of whether LLM outputs can stand in for human data.

Reasoning & Analytical Complexity

Huang & Chang (2023), Towards Reasoning in Large Language Models: A Survey. Findings of ACL. Surveys the reasoning capabilities and limitations of large language models. For social science applications, the key finding is that chain-of-thought prompting can improve performance on nuanced text annotation tasks (detecting implicit sentiment, sarcasm, or framing in political text) where direct classification tends to miss subtlety. However, the survey also cautions that model-generated reasoning is not always faithful to the model’s actual computation; the “explanation” may be a post-hoc rationalisation rather than a transparent reasoning trace.

Evaluation as Research Design

Weber & Reichardt (2023), Evaluation is All You Need. Offers a framework for thinking about LLM evaluation in social science contexts. The central argument: benchmark thinking translates directly to research design. Just as you would not trust a survey instrument without validating it, you should not trust an LLM classifier without building a task-specific evaluation. The paper provides practical guidance on constructing domain-specific evaluations before committing to a model.

Key Takeaway

The foundations covered in this module (post-training, prompting, reasoning, and evaluation) are not just technical prerequisites. They are methodological decisions that directly affect research validity. Alignment shapes what perspectives models represent. Prompt design defines your construct. Reasoning techniques trade cost for accuracy. Evaluation practices determine whether your results are trustworthy. The subsequent modules build directly on these tools: Day 3 covers classification pipelines at scale, validation frameworks, and the decision of when to fine-tune instead of prompt.