Module 2 of 5

From Models to Tools

Post-training alignment reshapes behaviour. Prompting defines your measurement instrument. Reasoning trades inference compute for accuracy. Model selection is a research design decision.

Module 1 ended with a machine that predicts the next token. That machine is powerful but not useful: ask it to classify a tweet and it writes another tweet. This module covers the three layers that turn a language model into a research instrument.

Post-training reshapes model behaviour without changing its knowledge. Supervised fine-tuning teaches format (how to respond to instructions), preference optimization teaches quality (which responses are better), and safety training teaches boundaries (when not to respond). Each of these stages is a design choice made by the model provider, and each introduces biases that matter for research: the annotators who define "helpful" shape what the model considers a good answer.

Prompting is the primary interface for working with post-trained models, and it functions as a measurement instrument. The same classification task with slightly different wording can produce substantially different results, just as survey questions do. Prompt design, temperature settings, structured output formats, and few-shot examples all affect what you measure and how reliably you measure it.

Reasoning techniques extend what prompting can do. Chain-of-thought prompting and dedicated reasoning models trade inference compute for accuracy on hard cases, a form of test-time compute scaling that complements the training-time scaling laws from Module 1. The module closes with the practical question this all leads to: given dozens of available models with different capabilities, costs, and transparency properties, how do you choose the right one for your research?

After This Module You Will Be Able To

  1. Explain how SFT, preference optimization, and safety training reshape model behaviour without changing model knowledge.
  2. Design a classification prompt with appropriate structural components and justify each choice as an experimental design decision.
  3. Predict when chain-of-thought reasoning will improve results and when it will waste budget.
  4. Evaluate models against task-specific criteria rather than relying on generic benchmarks.
  5. Identify the reproducibility, transparency, and validity trade-offs between open and closed models.

Post-Training

In Module 1, we built the core loop of a language model: tokenize, embed, attend, predict. We saw that a base model generates fluent text, but it just completes whatever you give it. Ask it to classify a tweet, and it writes another tweet. It is not being uncooperative; it was trained to predict the next token, so that is exactly what it does.

Something happens between a base model and ChatGPT that turns a next-token predictor into something that answers questions, follows instructions, and refuses harmful requests. That something is post-training.

The Base-to-Assistant Gap

Give a base model the prompt “Classify this tweet as positive or negative: I love this weather! Answer:” and it might continue with another tweet, a news headline, or anything else that looks plausible as a continuation. An instruction-tuned model would respond with “positive” because it has learned, through post-training, that this kind of prompt expects a label.

Both models have the same knowledge: they were pre-trained on the same data. The difference is entirely in how they were taught to behave. Post-training is a pipeline of techniques that reshapes behaviour without fundamentally changing what the model knows.

[Diagram: the post-training pipeline. Pre-trained base model (next-token predictor) → SFT: format and instruction following, trained on (instruction, response) pairs → RLHF / DPO: quality via human preference, trained on (preferred, dispreferred) pairs → Safety: red-teaming and refusal boundaries, trained on red-team prompts and constitutions → Aligned assistant: follows instructions, prefers better answers, knows when to refuse, ready for use.]

The post-training pipeline. Each stage builds on the previous one and uses a different kind of training data. Not every model follows this exact sequence: some combine stages or use variants, but the general pattern is consistent across major model families. Illustrative diagram.

Supervised Fine-Tuning (SFT)

Definition

Supervised Fine-Tuning (SFT)

Continued training of a pre-trained language model on curated (instruction, response) pairs. SFT teaches the model format and style—how to respond to instructions—without fundamentally changing the knowledge acquired during pre-training. Uses standard cross-entropy loss on far fewer examples (thousands to tens of thousands) than the billions of tokens in pre-training.

The first stage: show the model curated (instruction, response) pairs and train it to imitate the responses. This is analogous to giving someone a style guide with worked examples: “when asked X, respond like Y.”

A concrete training example might look like this. Instruction: “Classify the following tweet as positive, negative, or neutral. Tweet: ‘Finally some sunshine after weeks of rain!’” Response: “Positive. The tweet expresses relief and happiness about an improvement in weather.” Thousands of such pairs, across diverse tasks and formats, teach the model what an “answer” looks like.

The scale involved is modest by pre-training standards. InstructGPT used roughly 13,000 demonstration examples for SFT—compare that to the hundreds of billions of tokens consumed during pre-training. SFT does not rewrite the model’s knowledge; it adjusts the surface layer of behaviour. The same cross-entropy loss used in pre-training drives learning, but now on curated examples instead of raw internet text.

After SFT, the model can follow instructions and produce formatted responses. But it has only a limited sense of which valid response is better than another. It has learned from high-quality demonstrations, but it cannot reliably distinguish between two on-topic responses that differ in subtle quality dimensions like depth, accuracy, or nuance.

For researchers, SFT also opens a door: if an off-the-shelf model performs poorly on your domain, you can fine-tune it on your own (instruction, response) pairs to improve task-specific performance. Module 3 covers when and how to pursue this.

Preference Optimization: Teaching Quality

Definition

Preference Optimization (RLHF & DPO)

Training a model to prefer higher-quality responses using human comparison data. RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model on preference pairs, then uses reinforcement learning to optimise the language model. DPO (Direct Preference Optimization) skips the reward model and optimises the language model directly on the preference data with a simpler training loop. Both methods keep the model close to its SFT baseline to prevent degenerate outputs.

The key idea: let humans define “quality” by comparing pairs of model outputs. For a given prompt, the model generates two responses. A human annotator decides which is better. This preference data is then used to train the model to favour responses that humans prefer.

Two main approaches exist. RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model on the preference data, then uses reinforcement learning to optimise the language model against that reward. DPO (Direct Preference Optimization) skips the reward model entirely: it directly optimises the language model on the preference pairs. DPO tends to produce comparable results with simpler training infrastructure, and is used by many recent open models including Llama 3 and Qwen 2.5.

In both approaches, the optimised model must stay close to the SFT model. Without this constraint, the model could discover degenerate strategies that exploit the preference signal: producing high-scoring but nonsensical text. The mathematical mechanism differs (a KL divergence penalty in RLHF, an implicit constraint in DPO), but the purpose is the same: keep the model useful while improving quality.

Safety Training & Constitutional AI

The final stage addresses safety: teaching the model when not to comply. Red-teaming involves human testers deliberately trying to elicit harmful outputs, and the model is then trained to refuse these inputs. Constitutional AI (Bai et al., 2022) takes a different approach: the model critiques its own outputs against a set of written principles (a “constitution”) and is trained to prefer the self-revised versions, reducing reliance on human annotators for safety feedback.

Safety training introduces an alignment tax: the model may become overly cautious, refusing benign requests that superficially resemble harmful ones. For researchers, this manifests as models declining to engage with sensitive topics even when the research purpose is legitimate.

What Post-Training Changes (And What It Doesn’t)

Post-training primarily reshapes behaviour rather than adding knowledge. If the base model has weak coverage of a domain, post-training alone is unlikely to fill that gap: the instruct version will just express its ignorance more politely, or (worse) confabulate an answer more convincingly.

Definition

Hallucination (Confabulation)

The generation of confident, well-structured text that is factually incorrect, unsupported, or fabricated. Post-training can make hallucinations more dangerous: RLHF trains the model to produce authoritative-sounding responses, which means fabricated claims may be delivered with the same confidence as accurate ones. For research, this means model outputs should never be treated as ground truth without independent verification.

For research: when a model gets a classification wrong, the problem is usually a capability limitation from pre-training or a prompting issue, not a post-training failure. Post-training determines the format (one-word answer vs. paragraph), the style (hedged vs. confident), and the boundaries (comply vs. refuse). It does not determine the accuracy.

But post-training also introduces systematic biases that matter for research. RLHF annotators encode their preferences into the reward model. The demographic composition of the annotator pool, their cultural context, and the guidelines they follow all shape what the model considers a “good” answer.

Social Science Application. Santurkar et al. (2023), Whose Opinions Do Language Models Reflect? ICML. RLHF introduces a form of construct validity threat: the model’s outputs reflect not just the pre-training data but the preferences of the specific annotator pool that defined “helpful.” If that pool skews toward particular demographic, political, or cultural positions, the model’s “opinions” will too. For social scientists using LLMs to measure attitudes or classify political text, this is analogous to running a survey through translators who share a systematic bias: the instrument is no longer neutral.

Stop and Think

RLHF annotators decide what “helpful” means. If the annotator pool skews toward a particular demographic, political leaning, or cultural context, how might that affect a model’s usefulness for research on politically sensitive topics?

Reveal

The model may systematically favour framings, perspectives, or conclusions that align with the annotator pool’s views. On politically contested topics, it may present one position as more “balanced” or “helpful.” For social science research, this means the model is not a neutral instrument: its outputs carry the preferences baked in during alignment. This is a form of measurement bias that researchers must account for.

In the notebook: Exercise 1 puts you in the role of an RLHF annotator. You rank pairs of model responses and discover first-hand how subjective “better” can be, especially on contested topics like immigration policy or causal inference methodology.

Key Takeaway

Post-training turns a language model into a tool. SFT teaches format: how to respond. Preference optimization (RLHF or DPO) teaches quality: which responses are better. Safety training teaches boundaries: when not to respond. These are design choices made by the model provider, and they shape every output the model produces in your research.

Resources

Prompting

Post-training gave us a model that follows instructions. But which instructions? The quality of the prompt now determines the quality of the output, and here a new problem emerges: the same task, stated slightly differently, can produce substantially different results. For social scientists, this is not just a practical concern. A prompt is a measurement instrument: it defines what construct you measure, how reliably you measure it, and whether your results replicate.

Prompting as a Research Instrument

Consider the analogy to survey design. A survey question’s wording determines what respondents report. Leading questions produce biased responses. Ambiguous questions produce noisy responses. The same is true for prompts: a classification prompt that asks for “sentiment” measures something different from one that asks for “stance,” even when applied to the same text.

Just as survey methodology demands pre-testing, piloting, and reporting the exact question wording, prompt-based research should demand the same rigour. The prompt is the instrument. Report it fully, test its sensitivity, and validate it against known ground truth.

Ziems et al. (2024), Can Large Language Models Transform Computational Social Science? Computational Linguistics. A comprehensive review that treats prompt design as experimental design. LLMs introduce new validity concerns (prompt sensitivity, model opacity, temporal instability) alongside their capabilities.

Anatomy of a Classification Prompt

A well-designed prompt has distinct structural components, each serving a specific function. The interactive diagram below breaks down a complete classification prompt. Click any highlighted region to see what it does, why it matters, and what happens if you leave it out.

The notebook exercises walk you through building prompts at each level of complexity: zero-shot (just a task description), few-shot (with labeled examples), and chain-of-thought (with reasoning steps). The interactive above shows the fully-specified version. The Minimal toggle shows what you lose when you strip it down.
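
As a concrete point of reference, a fully specified classification prompt might assemble its structural components (role, task definition, label set, few-shot examples, output format) like this. The wording and component names below are illustrative assumptions, not the exact prompt used in the notebook.

ROLE = "You are a careful annotator of political social media posts."
TASK = ("Classify the stance of the tweet toward the policy as exactly one of: "
        "support, oppose, neutral.")
EXAMPLES = (                                      # few-shot anchors for the labels
    'Tweet: "Finally, a policy that puts workers first!"\nLabel: support\n'
    'Tweet: "Another empty promise from the usual suspects."\nLabel: oppose'
)
OUTPUT_FORMAT = "Respond with the label only, no explanation."

def build_prompt(tweet):
    """Join role, task definition, examples, the input, and the format instruction."""
    return "\n\n".join([ROLE, TASK, EXAMPLES, f'Tweet: "{tweet}"', OUTPUT_FORMAT])

print(build_prompt("Finally some sunshine after weeks of rain!"))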

Temperature & Sampling

Definition

Temperature

A scalar parameter T that controls the sharpness of the probability distribution over next tokens. Logits are divided by T before softmax: at T = 0 the model always picks the highest-probability token (greedy, deterministic); at T = 1 the distribution is unchanged; at T > 1 the distribution flattens, making rare tokens more likely. For classification tasks, T = 0 maximises reproducibility.

Beyond the prompt itself, sampling parameters control how the model selects tokens from its predicted probability distribution. The most important parameter is temperature. Recall from Module 1 that the model produces logits which softmax converts into probabilities. Temperature modifies this step:

$$P(w) = \frac{\exp(z_w / T)}{\sum_{w' \in \mathcal{V}} \exp(z_{w'} / T)}$$

Temperature $T$ scales the logits before softmax. At $T = 1$ the distribution is unchanged. As $T \to 0$ all mass concentrates on the highest-scoring token (greedy decoding). As $T \to \infty$ the distribution approaches uniform, making rare tokens as likely as common ones.
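
To make the effect concrete, here is a minimal numpy sketch that applies the formula to a toy logit vector; the numbers are illustrative, not real model logits.

import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax; small T sharpens, large T flattens."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.0, 1.0, 0.5]          # toy scores for four candidate tokens
for T in (0.1, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(logits, T), 3)}")
# T=0.1 concentrates nearly all mass on the top token (near-greedy);
# T=2.0 flattens the distribution, giving rare tokens more probability.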

Drag the slider below and watch what happens to the distribution over candidate output tokens.

Two additional sampling parameters are common. Top-k sampling restricts the model to the k highest-probability tokens, zeroing out everything else before renormalising and sampling. Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., 0.9). Both methods prevent the model from selecting extremely unlikely tokens while preserving more diversity than greedy decoding.
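
A minimal sketch of the nucleus (top-p) step on a toy distribution; the probabilities are invented for illustration.

import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability exceeds p,
    zero out the rest, and renormalise before sampling."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    n_keep = np.searchsorted(cumulative, p) + 1  # how many tokens to keep
    keep = order[:n_keep]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = [0.55, 0.25, 0.12, 0.05, 0.02, 0.01]
print(np.round(top_p_filter(probs, p=0.9), 3))
# Only the first three tokens survive (0.55 + 0.25 + 0.12 = 0.92 > 0.9);
# the long tail of unlikely tokens is removed before sampling.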

For research classification tasks, set temperature to 0. Greedy decoding makes the output deterministic: the same prompt produces the same label every time, which is essential for reproducibility. Reserve higher temperatures for tasks where diversity matters, such as generating synthetic survey responses or brainstorming coding prompts.

Prompt Sensitivity

Here is the finding that should concern every social scientist using LLMs: small changes to prompt wording can produce large changes in classification results. Sclar et al. (2024) found that accuracy can swing by over 10 percentage points from semantically equivalent prompts. This is the LLM equivalent of question wording effects in survey research.

The visualisation below shows the same twelve tweets classified under five prompt variants. Notice which tweets are stable across all variants and which ones flip. Click any row to inspect the tweet and the failure pattern.

Social Science Application. Prompt sensitivity is the LLM equivalent of question-wording effects in survey methodology. In political science, decades of research show that small changes in question phrasing produce large shifts in measured opinion (e.g., “welfare” vs. “assistance to the poor”). The same dynamic applies here: semantically equivalent prompts can produce classification swings of 10+ percentage points (Sclar et al., 2024). The methodological implication is clear: a single prompt is a single question wording. Robust research requires testing multiple prompt variants and reporting the range, just as survey research requires pre-testing and reporting exact question wording.
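
In code, a basic sensitivity check is straightforward: run each variant over the same gold-labelled texts and report the spread. The sketch below assumes a hypothetical classify() wrapper around your model call; it is not a real library function, and the variant wordings are illustrative.

PROMPT_VARIANTS = {
    "v1": "Classify the sentiment of this tweet as positive or negative:\n{text}",
    "v2": "Is the following tweet positive or negative? Answer with one word.\n{text}",
    "v3": "Sentiment (positive/negative) of the tweet below:\n{text}",
}

def classify(prompt):
    """Hypothetical wrapper around your model call (run at temperature 0)."""
    raise NotImplementedError("Replace with your API or local-model call.")

def evaluate_variants(texts, gold_labels):
    """Accuracy per prompt variant, so the range across wordings can be reported."""
    scores = {}
    for name, template in PROMPT_VARIANTS.items():
        preds = [classify(template.format(text=t)) for t in texts]
        scores[name] = sum(p == g for p, g in zip(preds, gold_labels)) / len(gold_labels)
    return scores

# scores = evaluate_variants(tweets, labels)
# print(scores, "spread:", max(scores.values()) - min(scores.values()))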

Stop and Think

You run the same classification task with three prompt variants and get F1 scores of 0.82, 0.71, and 0.78. What should you report in your paper?

Reveal

Report all three. If you only report 0.82, you are cherry-picking the most flattering result: the same methodological error as running multiple survey question versions and only reporting the one that confirms your hypothesis. Best practice: report the range, explain your selection criterion (e.g., validated against a gold standard), and ideally release all prompt variants with your replication materials.

Stop and Think

You’re designing a prompt to classify political speeches as populist or non-populist. Would you use zero-shot, few-shot, or chain-of-thought prompting? What are the trade-offs?

Reveal

Few-shot with chain-of-thought is likely best for this task. Zero-shot risks inconsistent criteria: the model may apply its own implicit definition of populism, which may not match yours. Few-shot examples anchor the model’s understanding of your specific operationalization. Chain-of-thought reasoning makes the classification rationale transparent and auditable. The trade-off is token cost and latency per classification, which matters at scale.

In the notebook: Exercises 2–4 walk you through zero-shot classification, few-shot classification (with varying numbers of examples), and a systematic prompt sensitivity experiment on real political tweets. Exercise 6 demonstrates construct validity: the difference between classifying sentiment and classifying stance.

Social Science Application. Argyle et al. (2023) demonstrated in Political Analysis that prompt framing and demographic "backstories" determine the distribution of simulated survey responses. The same survey question, posed with different persona prompts, produces different opinion distributions. This is the prompting-as-experimental-design argument applied to simulation: the prompt does not just extract information, it constructs the instrument. The same principle applies to the extraction and RAG pipelines in Module 4, where prompt design determines what the model extracts and how it synthesises evidence.

Structured Outputs

When building classification pipelines, the model’s response needs to be machine-readable. A response like “I would classify this as support because...” is useful for a human reader but breaks automated parsing. Structured output formats constrain the model to produce responses in a predictable schema.

Consider the difference. A free-text response might say: “This tweet expresses opposition to the policy, primarily through sarcastic framing. I would classify it as oppose.” A structured JSON response to the same prompt might return:

{
  "label": "oppose",
  "confidence": "high",
  "reasoning": "Sarcastic framing ('really changed the world') signals opposition."
}

The first version requires custom parsing that will inevitably break on edge cases. The second can be processed with a single line of code across 10,000 documents.

The simplest approach is to instruct the model to respond with a single word or a JSON object. Most API providers also offer JSON mode, which guarantees syntactically valid JSON at the decoding level, and function calling (or “tool use”), which constrains the output to match a predefined schema. These features typically work by modifying the token sampling process: at each generation step, tokens that would violate the required format are masked out, so the output conforms to the requested structure. This eliminates most of the parsing failures that plague free-text classification at scale.

For research pipelines, structured outputs serve two purposes. First, reliability: when processing 10,000 documents, you cannot afford to manually fix malformed responses. Second, auditability: a JSON response with separate fields for “label” and “reasoning” lets you log both the classification and the model’s rationale, enabling systematic error analysis. Module 3 covers the full pipeline from structured outputs through batching and cost management.
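
A minimal parsing-and-validation sketch, using the field names from the JSON example above; the raw responses are invented for illustration. Malformed outputs are flagged for review rather than silently dropped.

import json

ALLOWED_LABELS = {"support", "oppose", "neutral"}

def parse_response(raw):
    """Parse one model response and validate its label against the coding scheme."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": None, "error": "invalid JSON", "raw": raw}
    if record.get("label") not in ALLOWED_LABELS:
        return {"label": None, "error": "unexpected label", "raw": raw}
    return record          # keeps "label", "confidence", "reasoning" for the audit log

raw_responses = [
    '{"label": "oppose", "confidence": "high", "reasoning": "Sarcastic framing."}',
    "I would classify this as support because...",   # free text: fails validation
]
parsed = [parse_response(r) for r in raw_responses]
print(sum(p["label"] is not None for p in parsed), "of", len(parsed), "responses usable")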

Key Takeaway

Prompting is experimental design. Your prompt determines what you measure (construct validity), how reliably you measure it (reproducibility), and whether examples improve performance (in-context learning). Treat prompt design with the same rigour you would apply to survey question design: pre-test, validate, and report fully.

Resources

Reasoning

Even a well-designed prompt has limits. Standard prompting works well for clear-cut classification, but sarcastic tweets, implicit stance, ambiguous framing, or tasks that involve multiple logical steps often defeat direct prompts. The model produces a label in a single forward pass without any intermediate reasoning—and for hard cases, that is not enough.

In Module 1, we saw that model performance follows power laws with training compute, data, and parameters. But there is a second axis of scaling that operates at inference rather than training: letting the model think longer before answering. Chain-of-thought prompting and dedicated reasoning models exploit this axis, trading output tokens and latency for accuracy on tasks that require multi-step inference, disambiguation, or nuanced judgment.

Chain-of-Thought Prompting

Definition

Chain-of-Thought (CoT) Prompting

A prompting strategy that elicits intermediate reasoning steps before a final answer. The generated tokens serve as external working memory, allowing the model to decompose multi-step problems. CoT can be elicited by providing worked examples (few-shot CoT) or by simply appending “Let’s think step by step” (zero-shot CoT).

Wei et al. (2022) showed that adding the phrase “think step by step” or providing worked examples with explicit reasoning dramatically improves performance on tasks requiring multi-step inference. Instead of asking for a direct answer, ask the model to show its work.

Why does this work? The intermediate tokens serve as a form of working memory. The model can “store” partial results in the generated text and attend back to them when producing the final answer. Without CoT, the model must compress all reasoning into a single forward pass. Kojima et al. (2022) showed that even simply appending “Let’s think step by step” without any examples can elicit this behaviour.
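
As a sketch, the difference between direct and zero-shot chain-of-thought prompting is just a change in the prompt plus a step to extract the final label. The wording and the "Label:" marker below are illustrative assumptions, not a fixed convention.

TWEET = "Wow, the Women's March really changed the world... 🙄"

direct_prompt = (
    "Does this tweet support or oppose the Women's March? Answer with one word.\n"
    f'Tweet: "{TWEET}"'
)

cot_prompt = (
    "Does this tweet support or oppose the Women's March?\n"
    f'Tweet: "{TWEET}"\n'
    "Let's think step by step, then give the final answer on a new line starting with \"Label:\"."
)

def extract_label(model_output):
    """Take the text after the last 'Label:' marker as the final classification."""
    return model_output.rsplit("Label:", 1)[-1].strip().lower()

print(extract_label("The eye-roll emoji suggests sarcasm.\nLabel: oppose"))   # -> oppose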

[Diagram: direct answer vs. chain-of-thought on the same prompt, “Does this tweet support or oppose the Women's March?” The direct answer returns “support” (wrong: missed the sarcasm). The chain-of-thought prompt asks the model to first explain the author's position, then classify; it notes that “the eye-roll emoji and phrasing ‘really changed the world’ suggest sarcasm” and returns “oppose” (correct: reasoning caught the sarcasm).]

Direct prompting vs. chain-of-thought on a sarcastic tweet. The reasoning step gives the model an opportunity to identify sarcasm before committing to a label. Illustrative example.

Reasoning Models & Test-Time Compute

A recent class of models takes chain-of-thought further: instead of relying on the user to prompt for reasoning, the model is trained to reason before answering. These include OpenAI’s o1 and o3, DeepSeek-R1 (open-weight, where reasoning emerged from reinforcement learning without explicit instruction), and Claude with extended thinking.

This represents a shift in how we think about model capability. In Module 1, we saw that performance follows power laws with training compute, data, and parameters. Reasoning models introduce test-time compute scaling: instead of making the model bigger, let it think longer. A smaller model that reasons for 30 seconds can sometimes match a larger model that answers instantly, at the cost of higher latency and more output tokens.

Huang & Chang (2023), Towards Reasoning in Large Language Models: A Survey. Findings of ACL. Chain-of-thought prompting improves performance on nuanced annotation tasks (implicit sentiment, sarcasm, framing in political text), but model-generated reasoning is not always faithful to the model’s actual computation: the “explanation” may be a post-hoc rationalisation rather than a transparent reasoning trace.

When to Use Reasoning: The Cost Trade-off

Reasoning is not always worth the cost. The gains are largest on ambiguous, multi-step tasks and negligible on clear-cut classification. The interactive below lets you explore the trade-off at different corpus sizes.

Stop and Think

For binary sentiment classification on 10,000 tweets, would you use a reasoning model? What about for interpreting a single ambiguous policy speech?

Reveal

For 10,000 tweets: no. Most tweets are unambiguous, and the per-token cost of reasoning models would be wasteful. Use a standard model with zero-shot or few-shot prompting, and reserve CoT for the subset of cases the model is uncertain about. For a single ambiguous speech: yes. The cost of extra tokens is negligible for one input, and the reasoning process helps the model handle complex framing, mixed signals, and implicit positions.

In the notebook: Exercise 5 tests chain-of-thought prompting on the tweets that the model misclassified with direct prompting. You will see exactly which cases benefit from reasoning and which do not.

Key Takeaway

Chain-of-thought prompting and reasoning models let you trade compute at inference for accuracy on hard cases. Use them selectively: the gains are largest on ambiguous, multi-step tasks. For bulk classification of clear-cut texts, standard prompting is sufficient and far more cost-effective.

Resources

Evaluating & Choosing Models

Post-training determines how a model behaves, prompting determines what you measure, and reasoning techniques extend accuracy on hard cases. But all of this assumes you have already chosen a model. Which one? The landscape includes hundreds of options across a wide range of capabilities, costs, and transparency properties. The choice is not just technical: it is a research design decision with consequences for reproducibility, data privacy, and budget—and unlike a survey instrument, the model you chose might silently change between your pilot and your final data collection.

Social Science Application. Choosing a model is analogous to choosing a measurement instrument. Just as the decision between a structured survey and open-ended interviews shapes what you can measure and how you should interpret results, the decision between GPT-4, Claude, and Llama shapes your study’s validity, reproducibility, and scope. Each model embodies different training data, alignment choices, and capability profiles. This choice belongs in your methods section, with explicit justification, just as you would justify your choice of survey instrument or sampling strategy.

Benchmarks: Useful but Insufficient

Definition

Benchmark

A standardised test set with known correct answers, used to compare model performance on a fixed task. Benchmarks function like standardised tests (GRE, LSAT): they measure something real and enable cross-model comparison, but they do not measure the specific construct you care about. A model’s benchmark score is a necessary but insufficient predictor of its performance on your research task.

The field evaluates models using standardised benchmarks. MMLU (Hendrycks et al., 2021) tests broad knowledge across 57 subjects. HumanEval measures code generation. GPQA (Rein et al., 2023) tests graduate-level reasoning. Chatbot Arena (Chiang et al., 2024) ranks models by blind human preference on real conversations.

Think of benchmarks like standardised tests (GRE, LSAT): they correlate with something real, but they do not measure what you specifically care about. Three fundamental problems undermine them. Contamination: models may have seen benchmark questions during training, inflating scores. Gaming: providers can optimise specifically for benchmark performance without improving general capability. Construct validity: no standard benchmark measures your task. A high MMLU score does not mean the model will classify your political texts well. The benchmark that matters most is the one you build yourself.

Open vs. Closed Models

The choice between open-weight and closed (proprietary API) models involves trade-offs across five dimensions that map directly onto research design concerns.

Capability: frontier closed models currently lead on the hardest tasks, though the gap has narrowed substantially. For many social science classification tasks, the difference is small enough that other dimensions dominate the decision.

Cost: open models are free to use (you pay only for compute). At scale, self-hosting can be dramatically cheaper per token—a difference that matters when processing millions of documents.

Transparency: open models give you full access to weights and architecture. Closed models are opaque: you cannot inspect what training data was used, what safety filters are applied, or how the model was post-trained.

Data privacy: with open models, your data never leaves your infrastructure. This may be essential under IRB protocols, GDPR requirements, or when working with sensitive political, health, or legal data.

Reproducibility: this is the dimension researchers most often underestimate. Open model weights are fixed: a snapshot you download today produces the same outputs in two years. Closed models can change without notice. If you submit a paper in June using GPT-4 and reviewers try to replicate your results in December, the model behind the API may have been updated, retrained, or replaced entirely. Chen et al. (2023) documented substantial behaviour drift in GPT-4 and GPT-3.5 over just a few months, with accuracy on specific tasks changing by 10+ percentage points between API snapshots.

The term “open” itself exists on a spectrum. Fully open models release weights, training code, and training data (e.g., OLMo). Open-weight models release weights but not training data or code (e.g., Llama, Mistral). API-only models provide no access to internals (e.g., GPT-4, Claude). For reproducibility, any model whose weights you can download is a significant improvement over API-only access, even if the training data remains proprietary.

Social Science Application. Chen et al. (2023), How is ChatGPT’s behavior changing over time?, documented that GPT-4’s accuracy on specific tasks changed by over 10 percentage points between API snapshots taken just months apart. For empirical research, this is a temporal validity threat: the instrument you validated in your pilot study may not be the same instrument producing your final results. When using closed API models, always record the exact model version string (e.g., gpt-4-0613), pin it if the API allows, and acknowledge the version drift risk in your methods section.
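
A minimal sketch of recording the instrument specification alongside your results; the field values are placeholders, and the exact version string should come from your provider's documentation or API response.

import datetime
import json

run_metadata = {
    "model": "gpt-4-0613",             # exact version string, not just "GPT-4"
    "temperature": 0,
    "prompt_variant": "v1",
    "run_date": datetime.date.today().isoformat(),
    "notes": "pilot run on the 200-item gold-standard set",
}

with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)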

Cost-Capability Frontiers

The key insight for researchers: you rarely need the most capable model. If a 70B open model achieves 92% accuracy on your task and the frontier closed model achieves 95%, the open model may be the better choice when processing millions of documents, especially if data privacy is a concern. The frontier shifts rapidly: check current comparisons (e.g., Artificial Analysis) before committing to a model for a large project.

Illustrative chart based on approximate positions from public benchmarks and API pricing (early 2026). Model names, positions, and costs change frequently. This chart shows the structure of the landscape (frontier, open/closed clustering), not precise coordinates. Always verify current data before choosing a model.

How this chart was constructed

Capability scores are approximate composites drawn from Chatbot Arena ELO ratings, Artificial Analysis intelligence index scores, and benchmark results reported on the Vellum and Onyx leaderboards as of March 2026. No single benchmark captures true capability; the vertical axis reflects a rough consensus across sources, not a precise measurement.

Cost scores are based on blended API token pricing (input + output at an approximate 3:1 ratio) from official provider documentation and TLDL and CostGoat comparison data. Open-weight model costs reflect typical hosted API rates (e.g., via Together AI or Fireworks); self-hosting costs would be lower at scale but depend on infrastructure.

This chart is pedagogical, not a buying guide. It illustrates the structure of the cost-capability landscape (the frontier, the open/closed divide, the clustering patterns). Positions are approximate and will shift as new models release and prices change. Always check current data before committing to a model for a research project.

Building Task-Specific Evaluations

Because no standard benchmark measures your task, you need to build your own evaluation. The process follows the same logic as validating any measurement instrument:

1. Create a gold-standard set. Manually label a subset of your data (50–200 items is often sufficient for initial evaluation). Use multiple coders and compute inter-annotator agreement.

2. Pilot multiple models. Run your classification prompt on the gold-standard set with 2–4 candidate models. Compare accuracy, F1, and, crucially, patterns of disagreement with your human coders.

3. Assess failure modes. Where does the model disagree with humans? Are the errors random, or systematic? Systematic errors (e.g., all sarcastic tweets misclassified) suggest a model-task mismatch; random errors may be tolerable.

4. Document and iterate. Record the model version, prompt, temperature setting, and evaluation metrics. These are your instrument specifications: they belong in the methods section of your paper.

Step 2 mentions F1. For classification tasks, accuracy alone is misleading when classes are imbalanced. If 95% of your tweets are neutral, a model that labels everything “neutral” achieves 95% accuracy while being completely useless. Precision, recall, and F1 give a more honest picture:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Of all the items the model labelled positive, how many actually are? High precision means few false alarms.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Of all the items that actually are positive, how many did the model find? High recall means few missed cases.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean of precision and recall. $F_1$ penalises models that sacrifice one for the other: a model that labels everything positive gets perfect recall but terrible precision, and hence a low $F_1$.

For multi-class tasks, report macro-averaged F1 (average F1 across all classes, giving equal weight to rare categories) or per-class F1. When comparing model labels to human labels, treat the comparison as an inter-annotator agreement problem: Cohen’s kappa or Krippendorff’s alpha account for agreement that would occur by chance, giving a more rigorous measure than raw accuracy. Module 3 walks through computing these metrics on your own classification results.
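
A short sketch of these metrics with scikit-learn (assuming it is available in your environment); the label vectors are toy values, not real results.

from sklearn.metrics import classification_report, cohen_kappa_score, f1_score

human = ["support", "oppose", "neutral", "oppose", "neutral", "support"]
model = ["support", "neutral", "neutral", "oppose", "neutral", "oppose"]

print(classification_report(human, model))              # per-class precision, recall, F1
print("macro F1:      ", round(f1_score(human, model, average="macro"), 3))
print("Cohen's kappa: ", round(cohen_kappa_score(human, model), 3))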

Weber & Reichardt (2023), Evaluation is All You Need. The central argument: benchmark thinking translates directly to research design. Just as you would not trust a survey instrument without validating it, you should not trust an LLM classifier without building a task-specific evaluation.

Stop and Think

A model scores 90% on MMLU but only 65% on your annotation task. What might explain the gap?

Reveal

Several factors could explain the discrepancy. MMLU tests broad factual knowledge, while your task may require domain-specific understanding, nuanced text interpretation, or adherence to a specific coding scheme. The model may also struggle with the particular format of your data (tweets vs. academic text, informal language, abbreviations). Additionally, MMLU uses multiple-choice format, which is fundamentally different from open-ended classification. This is precisely why task-specific evaluation is essential.

In the notebook: Exercise 7 has you save your best classification results and manually label 10 tweets. Module 3 walks you through computing inter-annotator agreement between you and the model: the start of a proper validation pipeline.

Coming in Module 3: We move from single-model evaluation to building complete classification pipelines at scale: API access, batching, cost management, and systematic validation using inter-annotator agreement metrics. We also cover when prompting is not enough and you need to fine-tune.

Key Takeaway

No single model is best for everything. Standard benchmarks are useful for rough comparisons but suffer from contamination, gaming, and construct validity issues. The benchmark that matters most is the one you build yourself. Choose models based on your specific needs: data sensitivity, task complexity, budget, and reproducibility requirements.

Resources

Module Summary

This module traced the path from a raw language model to a usable research instrument. Post-training reshapes behaviour: SFT teaches format, preference optimization teaches quality, and safety training teaches boundaries. None of these stages add knowledge; they determine how the model expresses what it already learned during pre-training, and each introduces biases that researchers must account for.

Prompting is the primary interface to post-trained models, and it functions as experimental design. Prompt wording, temperature, structured output schemas, and few-shot examples all shape what you measure and how reliably you measure it. Prompt sensitivity is the LLM equivalent of question-wording effects in survey research: small changes can produce large swings in results. Reasoning techniques—chain-of-thought prompting and dedicated reasoning models—extend accuracy on hard cases by trading inference compute for depth, a form of test-time compute scaling that complements the training-time scaling laws from Module 1.

Finally, model selection is itself a research design decision. Standard benchmarks provide rough orientation but suffer from contamination, gaming, and construct validity gaps. The benchmark that matters most is the one you build yourself: a gold-standard evaluation set drawn from your own data, validated against human coders, and documented in your methods section. The choice between open and closed models involves trade-offs across capability, cost, transparency, data privacy, and reproducibility—dimensions that map directly onto the concerns of any rigorous research design.

Coming in Module 3: We move from single-model evaluation to building complete classification pipelines at scale. You will learn how to access models programmatically, batch thousands of requests efficiently, manage costs, and validate results systematically using inter-annotator agreement metrics. We also cover when prompting reaches its limits and fine-tuning becomes the right tool.