Module 3 of 5
Deploying for Research
Fine-tuning, working at scale, and validation: building rigorous classification pipelines from prompt to production.
In Module 2, you built a prompt, chose a model, and classified a sample of texts. You also manually labelled tweets yourself, the first step toward comparing your judgment to the model’s. Today starts from that comparison and asks a harder question: can you turn that prototype into a research instrument you would stake a publication on?
Doing so requires three capabilities: adapting models to your specific task when prompting hits a ceiling, running them reliably at the scale your corpus demands, and, most critically, verifying that the results are trustworthy enough to build an argument on.
The thread connecting all three: the model is a measurement instrument, and instruments require calibration.
After This Module You Will Be Able To
- Decide when prompting, RAG, or fine-tuning is the right approach for a given task.
- Explain how LoRA adapts a large model efficiently and choose between encoder and decoder architectures for classification.
- Build a scalable classification pipeline with structured outputs, batching, and cost estimation.
- Validate LLM classifications using Cohen’s kappa, per-class F1, and qualitative error analysis.
- Identify systematic biases (positional, sycophancy, cultural, temporal, distribution shift) and design mitigations.
Page & notebook ordering. This page covers fine-tuning first, then scaling, then validation. The notebook reverses the first and third: Exercises 1–3 start with validation (building on your Module 2 labels), Exercises 4–5 cover fine-tuning, and Section 3 provides API and scale infrastructure. Both orderings work; the concepts are circular rather than strictly sequential. If you are working through both side-by-side, read the validation section of this page before starting the notebook exercises.
Fine-Tuning
Module 2 showed how prompting and reasoning techniques turn a language model into a classification instrument. For many tasks, this works well. But prompting has a ceiling: when your classification scheme is domain-specific, when the model makes systematic errors that better examples cannot fix, or when per-token costs make API-based prompting prohibitive at scale, fine-tuning adapts a model’s weights to your specific task.
The Decision Framework: Prompt, RAG, or Fine-Tune
Before investing in fine-tuning, ask whether it is necessary. Three approaches exist for improving model performance, and each fits different circumstances.
Prompting (covered in Module 2) is the right starting point when zero-shot or few-shot performance meets your validation threshold, when you need flexibility across multiple tasks, when you have little or no labeled data, or when the task aligns well with the model’s pre-training distribution.
Retrieval-Augmented Generation (RAG) is the right choice when the model needs domain knowledge that exceeds its training data or its context window: your corpus of parliamentary records, legal texts, or historical archives. RAG uses embeddings (from Module 1) to retrieve relevant documents and feed them to the model alongside the query. We cover RAG in Module 4.
Fine-tuning is the right choice when prompting produces systematic errors that more or better examples cannot resolve, when you need consistent output formatting across thousands of classifications, when you have labeled data (hundreds to thousands of examples), or when domain-specific language or conventions are not handled well by general-purpose models.
A simplified decision tree for choosing between prompting, RAG, and fine-tuning. Real decisions often involve hybrid approaches, for example using RAG and fine-tuning together. The key factor is validation: if your current approach meets your threshold, added complexity is not justified. Illustrative diagram.
Social Science Application. The decision to fine-tune is also a decision about construct operationalization. When you fine-tune a model on your labeled examples, your training data is the operationalization: it defines what “populism” or “negative sentiment” or “economic framing” means in measurable terms. This makes the quality of your training labels a construct validity concern, not just a data quality issue. The same theoretical construct, operationalized through different training examples, will produce a different measurement instrument. Document your labeling criteria with the same rigour you would apply to a survey codebook.
What Fine-Tuning Does
Fine-tuning continues training a pre-trained model on your labeled data. The model’s weights are adjusted to minimise prediction error on your examples. This is the same gradient-based optimisation from pre-training, but applied to a much smaller, task-specific dataset.
For decoder models (GPT, Llama, Qwen), fine-tuning typically uses Supervised Fine-Tuning (SFT): you provide (instruction, response) pairs—the same format you used for prompting in Module 2—and the model learns to produce those responses. The critical insight: the training data is just prompt–response pairs. The difference from few-shot prompting is that instead of the model seeing your examples at inference time, it internalises the patterns during training.
$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{\lt t}, x)$$
The model (with parameters $\theta$) is trained to predict each response token $y_t$ given the instruction $x$ and all preceding response tokens. This is the same next-token prediction objective from pre-training, applied to curated instruction–response pairs.
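To make this concrete, here is a minimal sketch (assuming a hypothetical instruction wording and label set) of how labeled tweets become the chat-format (instruction, response) pairs that SFT trainers expect:

```python
# A minimal sketch: turning labeled tweets into chat-format (instruction, response)
# pairs for supervised fine-tuning. Instruction wording and labels are illustrative.

INSTRUCTION = (
    "Classify the stance of the following tweet toward the protest movement. "
    "Answer with exactly one word: 'support' or 'oppose'.\n\nTweet: {text}"
)

def to_chat_example(text: str, label: str) -> dict:
    """One labeled tweet -> one SFT training example in chat format."""
    return {
        "messages": [
            {"role": "user", "content": INSTRUCTION.format(text=text)},
            {"role": "assistant", "content": label},  # the response the model learns to produce
        ]
    }

labeled = [
    ("The march was inspiring - proud of everyone who showed up.", "support"),
    ("Blocking roads helps nobody. Enough of this.", "oppose"),
]
train_data = [to_chat_example(t, y) for t, y in labeled]
```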
LoRA: Parameter-Efficient Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 7B-parameter model, this means updating seven billion numbers, requiring multiple high-end GPUs and risking catastrophic forgetting, where the model loses general capabilities while learning your task.
Definition
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method (Hu et al., 2021) that freezes all pre-trained weights and injects small, trainable low-rank matrices into the model’s attention layers. Instead of updating billions of parameters, LoRA trains tens of millions, typically less than 1% of the total. This achieves performance comparable to full fine-tuning at a fraction of the compute cost.
The key observation behind LoRA: weight updates during fine-tuning tend to occupy a low-dimensional subspace. They do not need the full dimensionality of the weight matrix. LoRA exploits this by decomposing the update into two small matrices whose product approximates the full-rank update.
$$W' = W + \Delta W = W + BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d$$
The original weight matrix $W$ is frozen. The update $\Delta W$ is decomposed into two small matrices $B$ and $A$ whose product has the same shape as $W$. The rank $r$ is typically 8–64, far smaller than the hidden dimension $d$ (often 4096 or more). For self-attention, the Q, K, V, and O projections are $d \times d$, so both adapter matrices share the same outer dimension. This reduces trainable parameters by a factor of 100–1000×.
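A quick worked example of what this reduction looks like, assuming a 7B-style model with hidden dimension 4096, 32 layers, and adapters on the four attention projections:

```python
# Worked example of the LoRA parameter count, assuming d = 4096, rank r = 16,
# 32 layers, and adapters on the Q, K, V, O projections (assumed d x d).
d, r = 4096, 16
n_layers, adapted_matrices_per_layer = 32, 4

full_update = d * d            # parameters in one full d x d update
lora_update = d * r + r * d    # parameters in B (d x r) plus A (r x d)
total_lora  = n_layers * adapted_matrices_per_layer * lora_update

print(f"per-matrix reduction: {full_update / lora_update:.0f}x")    # 128x for r = 16
print(f"total trainable LoRA parameters: {total_lora / 1e6:.1f}M")  # ~17M vs ~7,000M in the base model
```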
LoRA injects small trainable matrices into the frozen model. The rank r controls the trade-off between expressiveness and efficiency. Higher rank captures more complex adaptations but uses more memory. Illustrative diagram; dimensions are not to scale.
The calculator below lets you explore how model size and LoRA rank affect the number of trainable parameters, memory requirements, and hardware feasibility. Try the presets to see realistic configurations, then adjust the sliders to match your own setup.
QLoRA (Dettmers et al., 2023) takes this further: it loads the frozen base model in 4-bit precision (NormalFloat4 quantisation), reducing memory by roughly 75% compared to 16-bit storage. The LoRA adapters themselves remain in full precision. This combination means a 7B model fits on a single consumer GPU with 16GB of VRAM using QLoRA, and even a 4B model can be fine-tuned on a free Google Colab T4 (15GB).
The output of LoRA fine-tuning is a small adapter file—typically tens of megabytes—that can be shared, versioned, and loaded on top of the base model. Multiple adapters can be trained for different tasks and swapped without reloading the base model.
Encoder Models for Classification
LoRA makes fine-tuning decoder models practical on modest hardware. But for pure classification—assigning a fixed label to text, with no generation required—there is often a better option entirely. Everything covered so far (GPT, Claude, Llama, Qwen) uses a decoder-only architecture: these models read text left to right, generating one token at a time, and are designed for generation (writing text, following instructions, answering questions). For classification, a different architecture is often the better choice.
Encoder models (BERT, RoBERTa, DeBERTa) read the entire input simultaneously. Every token attends to every other token in both directions: forward and backward. In Module 1, we described how decoder models use causal masking so each token can only attend to preceding tokens. Encoder models remove this mask entirely: the representation of any word is informed by everything before and after it.
Definition
Encoder Model
A Transformer that processes the entire input sequence with bidirectional attention: every token can attend to every other token simultaneously. Encoder models produce a single fixed-length representation of the input (typically via a special [CLS] token) that is then mapped to a classification label. They cannot generate text token by token; they are designed for understanding, not production.
The classification mechanism is straightforward. The encoder processes the input and produces a contextualised vector for each token. A special [CLS] token, prepended to every input, aggregates information from the full sequence. A single linear layer maps this vector to a probability distribution over the label set.
$$P(y \mid x) = \text{softmax}\!\left(W_c \cdot \mathbf{h}_{\text{[CLS]}} + b\right)$$
The encoder produces a contextualised hidden state $\mathbf{h}_{\text{[CLS]}}$ for the special classification token. A learned linear layer $W_c$ maps this single vector to a probability distribution over the label set. The entire model (encoder and classification head) is fine-tuned end-to-end.
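A minimal sketch of this setup with the Hugging Face Transformers library; the model name and label set are illustrative, and the classification head is randomly initialised until you fine-tune it:

```python
# Sketch: an encoder with a classification head, as in the softmax equation above.
# The head is untrained here; real use requires fine-tuning on labeled data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "microsoft/deberta-v3-small"
labels = ["oppose", "support"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(labels)  # adds a linear head over the [CLS] representation
)

inputs = tokenizer("The march was inspiring.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_labels)
probs = torch.softmax(logits, dim=-1)        # P(y | x); meaningless until fine-tuned
print(dict(zip(labels, probs[0].tolist())))
```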
The Encoder Lineage: BERT → RoBERTa → DeBERTa
BERT (Devlin et al., 2019) established the encoder paradigm by training a model to predict randomly masked tokens from their surrounding context. This bidirectional pre-training produced representations that dramatically outperformed previous methods on classification tasks. RoBERTa (Liu et al., 2019) showed that BERT was substantially undertrained: training longer, on more data, and with larger batches achieved considerable improvements without changing the architecture.
DeBERTa (He et al., 2021) introduced disentangled attention, separating content and position into independent vectors for finer-grained representations. DeBERTaV3 (He et al., 2023) switched to a more sample-efficient pre-training method. The result: deberta-v3-base (184M parameters) and deberta-v3-small (44M parameters) remain, as of early 2026, among the strongest general-purpose encoder models for classification. When you see “fine-tuned BERT” in recent social science papers, the model used is almost always a DeBERTa variant.
Encoder vs. Decoder: When to Use Each
Encoder vs. decoder architecture for a classification task. The encoder produces a single classification in one pass; the decoder generates the answer token by token, which is slower but more flexible. Illustrative diagram; real models have many more layers and tokens.
For social science researchers, the practical guidance is straightforward:
Use an encoder model when your task is classification with a fixed set of categories, you have labeled training data (hundreds to thousands of examples), you need fast inference (thousands of texts per minute), and you do not need the model to generate free text. For most annotation tasks in computational social science (sentiment analysis, stance detection, topic labeling, framing analysis), a fine-tuned DeBERTa is likely the best choice in terms of accuracy per compute dollar.
Use a decoder model (via LoRA or prompting) when you need the model to generate text (summaries, explanations, structured extractions), when your task requires flexible instructions, when you are working with multiple tasks simultaneously, or when you have no labeled training data and must rely on zero/few-shot prompting.
Recent comparative work supports this distinction. Widmann & Wich (2023) found that fine-tuned encoder models (BERT-family) remain state-of-the-art for many classification tasks, sometimes outperforming much larger prompted decoder models. Benayas, Sicilia Urbán & Mora Cantallops (2024) confirmed in Language Resources and Evaluation that encoder-only architectures generally provide better efficiency-to-performance ratios for discriminative classification, while noting that fully fine-tuned large decoders can occasionally match or exceed encoder performance when sufficient data and compute are available.
Practical Considerations
Data and Training Settings
Data requirements: For encoder fine-tuning, a few hundred labeled examples often produce strong results; a few thousand approach diminishing returns. For LoRA fine-tuning of decoders, similar quantities work, though the per-example signal may be weaker because the model must also learn the output format.
Data splitting: Divide your labeled data into three non-overlapping subsets. The training set (typically 70–80%) is what the model learns from. The validation set (10–15%) is used during training to monitor overfitting and select the best checkpoint. The model never trains on it, but your decisions are informed by it. The test set (10–15%) is held out until the very end and used only once to report final performance. If you tune hyperparameters or select models based on the test set, it is no longer a test set; it has become a second validation set, and your reported numbers will be optimistic.
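A sketch of this three-way split with scikit-learn; the file name and column names are placeholders for your own labeled data:

```python
# Sketch of an 80/10/10 split. "labeled_tweets.csv" and its columns are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("labeled_tweets.csv")   # assumed columns: text, label

# First carve off 80% for training, stratified so class proportions are preserved.
train_df, rest_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42
)
# Split the remaining 20% evenly into validation (10%) and test (10%).
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["label"], random_state=42
)
# The test set now stays untouched until the final report.
```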
Hyperparameters: For encoders, the most critical parameter is the learning rate: typically 2×10⁻⁵ to 5×10⁻⁵, with 2–4 training epochs and batch sizes of 16–32. For LoRA, common settings include rank 8–16, lora_alpha 16–32, and dropout 0.05–0.1. These are well-established defaults; the notebook walks through specific configurations.
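A sketch of these LoRA settings expressed with the Hugging Face PEFT library; the base model identifier and target modules are illustrative and vary by architecture:

```python
# Sketch of a LoRA configuration with PEFT. The base model id is an example;
# target modules listed are typical for Llama/Qwen-style decoders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")  # example base model
config = LoraConfig(
    r=16,                        # rank of the low-rank update
    lora_alpha=32,               # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% of total parameters
```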
Overfitting: With small datasets, the model can memorise training examples rather than learning generalisable patterns. Monitor validation loss during training: if it starts increasing while training loss continues to decrease, the model is overfitting. Use early stopping to halt training at the best validation checkpoint.
Class imbalance: If your training data has far more examples of one category than another, the model will learn to favour the majority class. Address this at training time through loss weighting (multiplying the loss for minority-class examples by a factor proportional to the imbalance) or oversampling (repeating minority-class examples in the training set). Both are straightforward to configure in standard training libraries.
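A sketch of loss weighting in PyTorch, assuming integer class labels; weights are set inversely proportional to class frequency:

```python
# Sketch: class-weighted cross-entropy for an imbalanced training set.
import torch
from collections import Counter

train_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]    # toy example: 80/20 imbalance
counts = Counter(train_labels)
n, k = len(train_labels), len(counts)

weights = torch.tensor([n / (k * counts[c]) for c in sorted(counts)], dtype=torch.float)
# class 0 -> 0.625, class 1 -> 2.5: minority-class errors cost four times more

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
# Pass `loss_fn` (or the weights) into your training loop or Trainer subclass.
```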
Evaluation: Always evaluate on a held-out test set that the model has never seen during training or validation. Report per-class precision, recall, and F1, not just overall accuracy. A model with 90% accuracy may achieve 95% on the majority class and 60% on the minority class, which is unacceptable for most research purposes.
Workflow Patterns
The prompt–then–fine-tune loop: In practice, prompting and fine-tuning are not alternatives; they are stages. You develop your classification scheme via prompting (Module 2), validate it, identify systematic errors, and then use the model’s own correct outputs, verified against your gold standard, as training data for fine-tuning. This iterative loop is often more practical than manually labeling thousands of examples from scratch.
Label noise: When your training data comes from human coders (or from the prompt–then–fine-tune loop above), some labels will be wrong. Fine-tuning amplifies these errors: the model learns to reproduce the mistakes in your training set with high confidence. Clean your training data carefully, and treat suspiciously high training accuracy as a warning sign rather than a success.
You have 500 labeled political tweets and want to classify 100,000 more. Would you use a fine-tuned DeBERTa, a LoRA-adapted Llama, or zero-shot prompting with Claude? What factors drive your choice?
Reveal
For pure binary or multi-class classification with 500 labeled examples, a fine-tuned DeBERTa is likely the best choice. It will train in minutes, infer in milliseconds per tweet, and cost essentially nothing to run (no API fees, minimal GPU time). LoRA on Llama would also work but is slower and more resource-intensive for a task that does not require generation. Zero-shot Claude might match DeBERTa’s accuracy on straightforward cases, but at ~$1.50–3.00 per 100K tweets (depending on the model tier) and with no guarantee of consistency across API updates. The 500 labeled examples make fine-tuning the clear winner here.
In the notebook: Exercise 4 walks you through formatting training data for LoRA fine-tuning: converting labeled tweets into chat-format (instruction, response) pairs. Exercise 5 trains a LoRA adapter on Qwen 3 4B and probes whether fine-tuning fixes the hard cases from Module 2. The extension section fine-tunes DeBERTa for a direct encoder-vs-decoder comparison.
Key Takeaway
Fine-tuning is the tool you reach for when prompting hits a ceiling. LoRA makes fine-tuning accessible on consumer hardware by training less than 1% of parameters. For pure classification tasks with labeled data, encoder models (especially DeBERTa) are typically smaller, faster, and at least as accurate as fine-tuned decoders. The choice between approaches should be driven by your specific task requirements, available data, and validation results: not by which model is newest or largest. Whether you stay with prompting or move to fine-tuning, the next challenge is deploying your approach reliably at the scale your corpus demands.
Resources
Start here:
- Hu et al. (2021), LoRA: Low-Rank Adaptation of Large Language Models: the foundational LoRA paper; efficient fine-tuning by learning low-rank weight updates.
- Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized Language Models: 4-bit quantisation enabling fine-tuning on consumer GPUs.
- Hugging Face Transformers: step-by-step fine-tuning tutorials for both encoder and decoder models.
Further reading:
- Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers: the paper that established encoder-based NLP.
- He et al. (2021), DeBERTa: Decoding-enhanced BERT with Disentangled Attention: disentangled attention for better classification.
- He et al. (2023), DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training: among the strongest general-purpose encoder models for classification.
- Liu et al. (2019), RoBERTa: A Robustly Optimized BERT Pretraining Approach: showed that BERT’s training was substantially suboptimal.
- Hugging Face PEFT library: LoRA, QLoRA, and other parameter-efficient methods.
- Unsloth: 2× faster LoRA fine-tuning with lower memory usage.
Working at Scale
You have a model that works: prompted, fine-tuned, or encoder-based. It performs well on your 200-example validation set. But your corpus has 50,000 texts. Without proper infrastructure, a working prototype becomes a weekend of babysitting API calls, a corrupted results file at text 8,000, and a budget surprise when the invoice arrives.
Moving from notebook exploration to research-scale annotation requires programmatic access, structured outputs, error handling, and cost planning. This section covers the practical infrastructure that turns a working prototype into a reliable research pipeline.
API Access
Major model providers (OpenAI, Anthropic, Google) expose their models through REST APIs with a common structure: you send a JSON request containing your messages and parameters, and receive a JSON response with the model’s output. Python client libraries wrap this into clean function calls.
The core abstraction is the messages array: a list of (role, content) pairs representing the conversation. A system message sets the model’s behaviour, a user message provides the input, and the model returns an assistant message with its response. This is the same structure you used in Module 2 for prompting: the API formalises it.
Key parameters that affect output quality and cost: temperature (0 for deterministic classification, higher for creative tasks), max_tokens (cap the response length: for binary classification, 10 tokens is more than enough), and model (the specific model version, which should be pinned for reproducibility).
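A minimal sketch of a single classification call using the OpenAI Python client; the model name is a placeholder, and the same messages structure applies (with minor syntax differences) to other providers:

```python
# Sketch of one classification request. Pin an exact model version for real projects.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                      # placeholder model name
    temperature=0,                            # deterministic classification
    max_tokens=5,                             # the label needs only a few tokens
    messages=[
        {"role": "system", "content": "You classify tweets as 'support' or 'oppose'. Answer with one word."},
        {"role": "user", "content": "Tweet: The march was inspiring - proud of everyone who showed up."},
    ],
)
print(response.choices[0].message.content)    # e.g. "support"
```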
Structured Outputs
When building classification pipelines, you need machine-parseable output, not free-form text. Without constraints, a model asked to classify a tweet might return {"label": "support"} for one text, then The sentiment is positive for the next, then I'd classify this as "support" but with some caveats... for a third. At scale, these inconsistencies corrupt your data silently: your parser extracts a label from the first response, throws an exception on the second, and misparses the third.
Definition
Structured Output
A model response constrained to a specific format: typically JSON conforming to a provided schema. Modern APIs offer guaranteed structured output modes that constrain the model’s token generation to valid JSON, eliminating parsing failures. This is achieved by modifying the sampling process to only allow tokens that produce valid syntax at each step.
For research pipelines, structured outputs are essential. Without them, you need extensive post-processing to extract labels from free-text responses, and edge cases (the model adding qualifications, refusing to classify, or producing unexpected formats) can silently corrupt your data. With schema-constrained output, every response is guaranteed to parse correctly.
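A sketch of schema-constrained classification, assuming the OpenAI structured-outputs interface with a Pydantic schema; other providers offer equivalent mechanisms (for example, tool use with a JSON schema):

```python
# Sketch: constrain generation to a schema so every response parses. Assumes the
# OpenAI structured-outputs interface; schema fields are illustrative.
from typing import Literal
from pydantic import BaseModel
from openai import OpenAI

class StanceLabel(BaseModel):
    label: Literal["support", "oppose", "neutral"]   # only these values are valid
    confidence: Literal["high", "medium", "low"]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "Classify the author's stance toward the protest movement."},
        {"role": "user", "content": "Tweet: Blocking roads helps nobody. Enough of this."},
    ],
    response_format=StanceLabel,                  # generation is constrained to this schema
)
result = response.choices[0].message.parsed       # a validated StanceLabel instance
print(result.label, result.confidence)
```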
Batching & Async Processing
Processing texts one at a time is straightforward but slow. A single API call takes 0.5–2 seconds for classification; at that rate, 10,000 texts take 1.5–5.5 hours. Three patterns speed this up:
Asynchronous requests with rate limiting: Send multiple requests concurrently, using a semaphore to cap the number of simultaneous connections (typically 10–50, depending on your rate limit). This can reduce processing time by 10–50×. The notebook demonstrates this pattern with Python’s asyncio.
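A sketch of this pattern, assuming the async variant of the OpenAI client; adjust the semaphore limit to your provider's rate limit:

```python
# Sketch: concurrent classification requests capped by a semaphore.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)             # at most 20 requests in flight

async def classify(text: str) -> str:
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            max_tokens=5,
            messages=[
                {"role": "system", "content": "Answer 'support' or 'oppose'."},
                {"role": "user", "content": f"Tweet: {text}"},
            ],
        )
        return resp.choices[0].message.content

async def classify_all(texts: list[str]) -> list[str]:
    return await asyncio.gather(*(classify(t) for t in texts))

# labels = asyncio.run(classify_all(my_texts))
```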
Batch APIs: Anthropic and OpenAI offer dedicated batch endpoints where you submit a file of requests and receive results within hours. Anthropic’s Batch API provides a 50% cost reduction for non-urgent work. For large annotation projects where same-day results are acceptable, this is the most cost-effective approach.
Checkpointing: For any pipeline processing thousands of texts, save results incrementally. If the script crashes at text 8,000 of 10,000, you should be able to resume from where it stopped rather than re-running everything. This is not a performance optimisation: it is a reliability requirement. API connections fail, rate limits trigger, and machines restart.
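A sketch of incremental checkpointing with a JSONL file; the corpus format and the classify function are assumptions standing in for your own pipeline:

```python
# Sketch: append each result as it arrives, skip already-processed IDs on restart.
import json
from pathlib import Path

CHECKPOINT = Path("results.jsonl")

def load_done_ids() -> set:
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open() as f:
        return {json.loads(line)["id"] for line in f}

def save_result(text_id: str, label: str) -> None:
    with CHECKPOINT.open("a") as f:
        f.write(json.dumps({"id": text_id, "label": label}) + "\n")

done = load_done_ids()
for text_id, text in corpus:          # `corpus` is assumed: an iterable of (id, text) pairs
    if text_id in done:
        continue                      # resume seamlessly after a crash
    label = classify(text)            # your classification call from earlier
    save_result(text_id, label)
```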
Cost Estimation
API pricing is token-based: you pay per million input and output tokens. For classification tasks, the input is typically 50–200 tokens (instruction + text) and the output is 1–10 tokens (the label). The cost per classification is therefore dominated by input tokens.
The estimator below compares costs across model tiers for your specific corpus. Set the number of texts, average length, and prompt overhead; it calculates total token counts and shows what each provider would charge. Use it to decide whether a budget model meets your needs or whether self-hosting makes economic sense.
Social Science Application. Cost is a research design constraint, not just an operational detail. Just as survey researchers trade off sample size against measurement depth within a fixed budget, LLM researchers trade off model capability against corpus coverage. A cheaper model that classifies 100,000 texts may yield better aggregate estimates than an expensive frontier model applied to 10,000, especially if the accuracy difference is small. Frame cost decisions as you would any power analysis: how many observations do you need, at what measurement quality, to answer your research question?
For research budgeting: estimate your corpus size, multiply by the per-text cost for your chosen model, and add 20–30% for retries, prompt iteration, and validation runs. A project classifying 100,000 texts with a budget model costs under 10 USD; the same project with a frontier model costs 100–200 USD. Self-hosted open models eliminate per-token costs but require GPU access.
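A back-of-the-envelope estimator for this budgeting rule; the per-million-token prices are illustrative placeholders, so check current provider pricing before relying on the output:

```python
# Sketch: token-based cost estimate with a retry/validation buffer. Prices are
# illustrative USD per million tokens, not current provider rates.
def estimate_cost(n_texts, input_tokens_per_text, output_tokens_per_text,
                  price_in_per_m, price_out_per_m, buffer=0.25):
    input_cost = n_texts * input_tokens_per_text / 1e6 * price_in_per_m
    output_cost = n_texts * output_tokens_per_text / 1e6 * price_out_per_m
    return (input_cost + output_cost) * (1 + buffer)   # buffer covers retries and validation runs

# 100,000 texts, ~250 input tokens (prompt + text) and ~5 output tokens each:
print(f"budget model:   ${estimate_cost(100_000, 250, 5, 0.25, 1.25):.2f}")
print(f"frontier model: ${estimate_cost(100_000, 250, 5, 3.00, 15.00):.2f}")
```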
Prompt caching can substantially reduce these costs. Anthropic, OpenAI, and Google all offer caching mechanisms that store the processed representation of your system prompt. In a classification pipeline where every request shares the same instruction prefix, cached input tokens cost 50–90% less than uncached tokens. Since classification prompts are dominated by the (repeated) instruction rather than the (varying) input text, caching can cut your effective cost by half or more at scale. Enable it by default for any pipeline processing more than a few hundred texts.
Context window limits are another practical concern. If input texts exceed the model’s context window (typically 4,096–128,000 tokens depending on the model), you will need to truncate or chunk them. For classification, truncating to the first N tokens is often sufficient, since the opening of a document usually contains enough signal. For longer texts where the relevant content may appear anywhere, consider splitting the text into overlapping chunks, classifying each, and aggregating the results.
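A sketch of overlapping chunking; it approximates token counts with whitespace-separated words, which is rough but usually adequate for deciding chunk boundaries:

```python
# Sketch: split a long text into overlapping chunks for separate classification.
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap          # step forward, keeping an overlap
    return chunks

# Classify each chunk, then aggregate (e.g., majority vote across chunk labels).
```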
Self-Hosted Inference
When data privacy prevents sending text to external APIs, when per-token costs are prohibitive at scale, or when you need guaranteed reproducibility (no model version changes), self-hosted inference becomes necessary. You run the model on your own hardware and control the entire pipeline.
Definition
Self-Hosted Inference
Running a language model on infrastructure you control: a university GPU cluster, a cloud instance, or even a local machine. Your data never leaves your environment. You pin the exact model version and guarantee identical outputs indefinitely. The trade-off is that you handle infrastructure, maintenance, and the upfront cost of GPU access.
vLLM (Kwon et al., 2023) is the most widely used open-source inference engine. Its core innovation, PagedAttention, manages the KV cache (from Module 1) like virtual memory, reducing memory waste from 60–80% to under 4%. This allows serving 2–4× more concurrent requests on the same hardware. vLLM supports over 50 model architectures and exposes an OpenAI-compatible API, so existing code works without modification.
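A sketch of pointing the standard OpenAI client at a self-hosted vLLM server; the model name is illustrative, and the server is assumed to have been started with something like vllm serve <model>:

```python
# Sketch: reuse existing OpenAI-client code against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",        # vLLM's local endpoint
    api_key="not-needed-locally",               # placeholder; vLLM ignores it unless configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # must match the model being served
    temperature=0,
    max_tokens=5,
    messages=[{"role": "user", "content": "Classify as 'support' or 'oppose': Tweet: ..."}],
)
print(response.choices[0].message.content)
```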
SGLang (Zheng et al., 2024) achieves even higher throughput through RadixAttention, which automatically discovers and reuses shared prefixes across requests. When many requests share the same system prompt or instruction prefix (as in classification pipelines), SGLang can be significantly faster than vLLM. Benchmarks show 29% higher throughput for standard serving and up to 4.6× faster for concurrent requests.
Ollama provides the simplest path to local inference. It wraps llama.cpp (efficient C++ inference) in a Docker-like interface: install, pull a model, and run it with a single command. Ollama is the right choice for researchers who want to experiment with local models without managing GPU infrastructure. It runs on consumer hardware, including Apple Silicon laptops, though throughput is substantially lower than dedicated GPU serving.
A complete annotation pipeline. Each stage has a clear input and output. The validation step at the end feeds back into prompt refinement: this loop is what makes the pipeline rigorous. Illustrative diagram.
Your university IRB prohibits sending student survey responses to external services. You need to classify 50,000 responses. What are your options?
Reveal
Three viable approaches: (1) Fine-tune a DeBERTa model locally since it runs on a single GPU or even CPU for inference, and your data never leaves your machine. (2) Self-host an open-weight decoder model (e.g., Llama 3 via vLLM or Ollama) on your university’s GPU cluster. (3) Use a provider that offers data processing agreements (DPAs) compatible with your IRB requirements: some enterprise API tiers guarantee data is not used for training and is deleted after processing. Option 1 is typically simplest and cheapest for classification tasks.
In the notebook: Section 3 provides reference code for OpenAI and Anthropic API patterns, async batching with semaphores, and a cost comparison table. These are templates you can adapt for your own projects.
Key Takeaway
Scaling from notebook to pipeline requires three things: structured outputs for reliable parsing, async batching for speed, and checkpointing for resilience. Choose between API access (convenient, pay-per-token) and self-hosted inference (private, fixed-cost) based on your data sensitivity and budget. Whichever you choose, the pipeline is only as good as its validation: which is what we cover next.
Resources
Start here:
- Anthropic API documentation: structured outputs, batching, tool use, and message batches.
- OpenAI Cookbook: practical recipes for common API patterns.
- Artificial Analysis: live cost, speed, and quality comparisons across providers and models.
Further reading:
- Kwon et al. (2023), Efficient Memory Management for Large Language Model Serving with PagedAttention: introduces vLLM and the PagedAttention mechanism.
- Zheng et al. (2024), SGLang: Efficient Execution of Structured Language Model Programs: RadixAttention for fast structured inference.
- Ollama: the simplest way to run open models locally.
- Groq, Together AI, Fireworks AI: inference providers for open models; useful for cost comparison.
Validation
You now have a pipeline that can process 50,000 texts in hours. But speed without accuracy is worse than useless: it produces confident, large-scale wrong answers. Whether you use prompting, fine-tuning, or a dedicated encoder model, you face the same fundamental question: can you trust the results? A model that classifies 10,000 texts in minutes is worthless if 15% of those classifications are systematically wrong in ways that bias your analysis.
The answer requires the same methodological rigour that social scientists apply to any measurement instrument. In content analysis, reliability is established by having multiple human coders label the same texts and measuring their agreement. An LLM-based classifier is another coder. It needs the same scrutiny.
The LLM as Another Coder
A common mistake: treating the model’s output as ground truth. When a model classifies a tweet as “support,” that is a prediction: not a fact. The same tweet might be ambiguous, sarcastic, or genuinely borderline. The model’s classification is one coder’s judgment, and it should be evaluated against the same standards you would apply to a human research assistant.
This framing has a practical consequence: your accuracy ceiling is set by human inter-coder agreement, not by 100%. If two well-trained human coders agree on 85% of texts (κ ≈ 0.70), expecting the model to exceed that level is unreasonable. The model should match or approximate human-level reliability: not surpass it.
Social Science Application. The “LLM as coder” framing maps directly onto the classical content analysis methodology established by Krippendorff (2019). In content analysis, any coder—human or machine—is a measurement device requiring reliability assessment. The LLM classification pipeline follows the same design: define a coding scheme (your prompt or training labels), train the coder (fine-tuning or in-context learning), assess pilot reliability (validation on a gold-standard set), run production coding (at-scale classification), and compute post-hoc reliability (agreement metrics on the full corpus). Treating the LLM as “another coder” is not a metaphor: it is the methodologically correct framework.
Cohen’s Kappa: Beyond Accuracy
Raw accuracy, the percentage of correct classifications, is the most intuitive metric but also the most misleading. If 80% of your corpus expresses “support” and a model labels everything as “support,” it achieves 80% accuracy while being completely useless. The model has learned nothing; it exploits the class distribution.
Definition
Cohen’s Kappa (κ)
A measure of agreement between two coders that corrects for the agreement expected by chance. Unlike raw accuracy, kappa is not inflated by class imbalance. It is the standard reliability metric in content analysis and should be reported whenever an LLM is used as a text annotator.
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
$p_o$ is the observed agreement between two coders (the proportion of items on which they agree). $p_e$ is the agreement expected by chance, calculated from the marginal distributions of each coder’s labels. A $\kappa$ of 1.0 means perfect agreement; 0.0 means agreement no better than chance; negative values indicate systematic disagreement.
A common interpretation scale: κ > 0.80 is “almost perfect” agreement, 0.60–0.80 is “substantial,” 0.40–0.60 is “moderate,” and below 0.40 is “fair” to “poor.” These thresholds are conventional guidelines, not hard rules: the appropriate threshold depends on the difficulty of the task and the consequences of misclassification.
Beyond Kappa: F1 and Krippendorff’s Alpha
Kappa answers one question: how much do the coders agree overall? But it does not tell you where the disagreement concentrates. If your model achieves κ = 0.70 overall but systematically fails on one specific category, kappa alone will not reveal that. You need a metric that decomposes performance by class.
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
Precision is the fraction of predicted positives that are actually positive. Recall is the fraction of actual positives that the model found. $F_1$ is their harmonic mean: it penalises models that sacrifice one metric for the other. Report per-class $F_1$ to detect imbalanced performance.
F1 provides exactly this decomposition. While kappa measures overall agreement, per-class F1 reveals where the model fails. A model with high overall kappa but low F1 on the minority class is systematically missing cases that may be analytically important. Always report both: kappa for the overall picture, per-class F1 for the diagnostic detail.
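A sketch of both metrics with scikit-learn, using toy labels; gold is your manual coding and pred is the model's output on the same texts:

```python
# Sketch: Cohen's kappa plus per-class precision, recall, and F1.
from sklearn.metrics import cohen_kappa_score, classification_report

gold = ["support", "oppose", "support", "support", "oppose", "support"]
pred = ["support", "support", "support", "support", "oppose", "oppose"]

print(f"Cohen's kappa: {cohen_kappa_score(gold, pred):.2f}")
print(classification_report(gold, pred, digits=2))   # per-class metrics plus macro averages
```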
Both kappa and F1 assume a specific setup: two coders (you and the model) labelling the same texts. In practice, annotation projects are often messier. When your project involves more than two coders (e.g., the LLM, a research assistant, and you), when some texts are coded by different subsets of coders, or when your labels are ordinal rather than nominal (e.g., a 1–5 scale rather than discrete categories), Krippendorff’s alpha is the more appropriate metric.
$$\alpha = 1 - \frac{D_o}{D_e}$$
$D_o$ is the observed disagreement and $D_e$ is the disagreement expected by chance. Unlike Cohen’s $\kappa$, Krippendorff’s $\alpha$ handles any number of coders, accommodates missing data, and works with nominal, ordinal, interval, and ratio scales.
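A sketch using the krippendorff Python package (assumed installed via pip install krippendorff); rows are coders, columns are texts, and NaN marks texts a coder did not label:

```python
# Sketch: Krippendorff's alpha for three coders with missing labels.
# Values are illustrative numeric codes (e.g., 0 = oppose, 1 = support).
import numpy as np
import krippendorff

reliability_data = np.array([
    [1, 1, 0, 1, np.nan, 0],       # coder A (e.g., you)
    [1, 0, 0, 1, 1,      0],       # coder B (e.g., a research assistant)
    [1, 1, 0, 1, 1,      np.nan],  # coder C (e.g., the LLM)
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```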
The explorer below makes these metrics concrete. Edit the four cells of the confusion matrix, or try the preset scenarios, and watch how accuracy, kappa, and per-class F1 respond. Pay special attention to the “Accuracy trap” scenario: it is the single most common way accuracy misleads in social science classification.
Multi-Class Classification
The explorer above uses binary classification for clarity, but most real CSS tasks have three or more categories (e.g., support / oppose / neutral, or multiple framing types). The same metrics generalise: compute per-class precision, recall, and F1 for each category, then report both the per-class values and a summary. Macro-averaged F1 (the unweighted mean of per-class F1 scores) treats every class equally regardless of size; use it when minority classes are analytically important. Micro-averaged F1 (computed from total TP, FP, FN across all classes) gives more weight to larger classes and equals overall accuracy in the multi-class case. Cohen’s kappa and Krippendorff’s alpha both handle multiple categories natively; no modification is needed.
Class imbalance becomes more complex with more categories. A model can achieve high overall accuracy by performing well on two large classes while failing entirely on a small but important third class. Always check per-class metrics before reporting aggregate numbers.
Qualitative Error Analysis
Aggregate metrics tell you how much the model gets wrong. They do not tell you why. Qualitative error analysis, reading through misclassified texts and categorising the failure modes, is essential for diagnosis and improvement.
Five error patterns recur across LLM classification tasks:
Sarcasm and irony: The surface text expresses one sentiment while the intended meaning is the opposite. Models that rely on keyword matching tend to miss this. Chain-of-thought prompting (from Module 2) can help, as can providing sarcastic examples in few-shot prompts.
Mixed signals: Texts that contain both supportive and opposing language: e.g., acknowledging a movement’s goals while criticising its methods. The model must decide which signal dominates. Clearer instructions about what to prioritise (author’s overall stance vs. individual statements) can reduce these errors.
Ambiguity: Texts that are genuinely too short or context-dependent to classify reliably. These are not model failures; they reflect real uncertainty in the data. Consider adding an “ambiguous” or “uncertain” category rather than forcing a binary choice.
Indirect stance: The author reports or quotes someone else’s position without stating their own. A tweet saying “Protesters claim the march was a success” is reporting, not supporting. Instruct the model to classify the author’s stance, not the reported stance.
Domain mismatch: The model misinterprets domain-specific language, abbreviations, or cultural references. This is most common with non-English text, subculture-specific language, or highly specialised terminology. Fine-tuning or providing domain context in the prompt can address this.
Technique
Confidence Thresholding
Models provide uncertainty signals: encoder models produce softmax probabilities for each class, and decoder models expose log-probabilities (logprobs) for each generated token. When the model’s top-class probability is low (e.g., below 0.7), the text is likely ambiguous or borderline. Setting a confidence threshold below which texts are routed to human review can dramatically improve the reliability of your final dataset without requiring manual review of every text. In practice, thresholding 10–20% of the most uncertain cases for human adjudication is often the best cost-quality trade-off.
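A sketch of thresholding on softmax probabilities from an encoder classifier; the threshold and label set are illustrative:

```python
# Sketch: route low-confidence predictions to human review.
# `probs` stands in for an (n_texts, n_classes) array of softmax probabilities.
import numpy as np

labels = np.array(["oppose", "support"])
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],     # borderline case: below the threshold
                  [0.10, 0.90]])

THRESHOLD = 0.70
confidence = probs.max(axis=1)
predicted = labels[probs.argmax(axis=1)]

auto_accept = confidence >= THRESHOLD
for pred, conf, accept in zip(predicted, confidence, auto_accept):
    print(pred, f"{conf:.2f}", "auto" if accept else "-> human review")
```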
A complementary strategy is ensemble classification: running the same text through two or three models (or the same model multiple times at temperature > 0) and taking the majority vote. When all classifiers agree, confidence is high; when they disagree, the text is flagged for review. This is cheap for API-based pipelines (2–3× cost) and often more reliable than confidence thresholding alone, because it catches systematic model-specific errors that a single model’s confidence score would not reveal.
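A sketch of majority voting across runs; unanimous agreement is treated as confident, while disagreement flags the text for review:

```python
# Sketch: majority vote across classifiers (or repeated runs of the same model).
from collections import Counter

def majority_vote(labels: list[str]) -> tuple[str, bool]:
    """Return the winning label and whether the vote was unanimous."""
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes == len(labels)

runs = ["support", "support", "oppose"]      # e.g., three models or three sampled runs
label, unanimous = majority_vote(runs)
if not unanimous:
    print(f"flag for review (got {runs}), provisional label: {label}")
```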
You analyse 50 misclassified texts and find that 30 involve sarcasm, 10 involve indirect reporting, and 10 are genuinely ambiguous. What does this tell you about your improvement strategy?
Reveal
The error distribution is actionable. Sarcasm dominates (60%), so the highest-impact fix is adding chain-of-thought reasoning or sarcastic few-shot examples. Indirect reporting (20%) suggests clarifying the instruction to target author stance specifically. The ambiguous cases (20%) may be irreducible, so consider an “uncertain” category. The key insight: not all errors are equal. Prioritise fixes that address the most common pattern first.
Systematic Biases
Beyond individual errors, LLMs exhibit systematic biases that affect classification at a population level. These are not random mistakes: they are directional tendencies that can corrupt aggregate statistics even when per-text accuracy appears acceptable. Five are especially relevant for social science research:
Social Science Application. The five biases below constitute a taxonomy of measurement validity threats specific to LLM-based research. Each maps to a classical validity concern. Positional and sycophancy biases threaten internal validity: they introduce systematic error unrelated to the construct being measured. Cultural and political biases threaten external validity: results from one cultural context may not generalise. Temporal instability threatens reliability: the same measurement procedure may produce different results at different times. Distribution shift threatens generalizability: validation on one subset does not guarantee performance on another. Documenting which threats apply and how they are mitigated belongs in every methods section.
Positional bias
When given ordered options (“support or oppose”), models tend to favour the first-listed option.
Mitigation: rotate label order across classifications and measure whether results change.
Sycophancy bias
Models tend toward the more “positive” or “agreeable” option, over-predicting support in stance detection or positive sentiment. A direct consequence of RLHF training (from Module 2).
Mitigation: check per-class recall; if the “positive” class has much higher recall than the “negative” class, sycophancy is likely at work.
Cultural and political bias
Models trained predominantly on English-language Western data may misinterpret rhetorical conventions from other cultural contexts and may exhibit political leanings from their training data and alignment process.
Mitigation: document as a measurement validity concern; test with politically diverse validation samples.
Temporal instability
Closed-source model providers update their models without notice. The exact same prompt may produce different results next month.
Mitigation: pin model versions where possible; document exact version, date, and provider. Open-weight models avoid this issue entirely.
Distribution shift
A model validated on one subset may perform differently on texts from a different time period, source, or domain. A classifier trained on 2020 tweets may struggle with 2024 tweets using different slang or referencing different events.
Mitigation: validate on each stratum separately. This is arguably the most common failure mode, and the one most often overlooked because aggregate metrics can mask stratum-specific degradation.
Building a Validation Plan
Before deploying any classification pipeline, you need a validation plan. This is not optional: it is the methodological equivalent of pre-registering your survey instrument.
1. Create a gold-standard set. Manually label a random sample of your corpus. For binary classification, 200–500 texts typically provides stable kappa estimates. Use stratified sampling if your corpus has known subgroups (e.g., different sources, time periods, or topics). Include edge cases deliberately; they are what differentiate good classifiers from mediocre ones.
2. Establish the human ceiling. Have at least two humans (ideally including yourself) label the same subset. Their inter-coder agreement sets the ceiling: if humans agree at κ = 0.75, that is the maximum you can reasonably expect from the model.
3. Set a threshold. Decide in advance what level of agreement is acceptable for your research question. κ ≥ 0.70 is a common threshold in political science; some tasks may justify lower (exploratory) or require higher (high-stakes policy analysis).
4. Analyse failures qualitatively. Do not just report the number. Read the misclassified texts. Categorise the error patterns. Determine whether the errors are random (tolerable) or systematic (threatening to your analysis).
5. Document everything. Your methods section should report: the model name and version, the exact prompt, the temperature and sampling settings, the gold-standard size and sampling method, the kappa score, per-class F1, the qualitative error analysis, and any modifications made based on validation results. This is not optional transparency: it is the minimum for replicable research.
How large does your gold-standard set need to be? The answer depends on how precise your kappa estimate needs to be: a small sample produces a wide confidence interval, making it impossible to distinguish “good” from “adequate” agreement. The calculator below shows how sample size, expected agreement, and the number of categories interact.
The validation cycle. Note that this is iterative: validation is not a one-time check at the end but a loop that informs every design decision in the pipeline. Illustrative diagram.
In the notebook: Exercises 1–3 walk you through the full validation workflow. You manually label 20 tweets (establishing your own inter-coder agreement with the gold standard), categorise error patterns in the model’s misclassifications, and design a validation plan for a hypothetical research project.
A colleague says: “My model gets 94% accuracy on my validation set. I don’t need to check kappa: 94% is clearly good enough.” What is wrong with this reasoning?
Reveal
Three problems. First, 94% accuracy with severe class imbalance (e.g., 90% of texts are positive) could mean the model is barely outperforming a naive “always predict positive” baseline. Kappa corrects for this. Second, accuracy does not reveal whether errors are concentrated in one class; per-class F1 does. Third, without knowing the human inter-coder agreement, 94% may or may not be close to the ceiling. If humans agree at 96%, the model is near optimal. If humans agree at 94%, the model is already at the ceiling and further improvement is unlikely.
Key Takeaway
Validation is not optional. Treat the LLM as another coder: compute Cohen’s kappa (not just accuracy), report per-class F1, and analyse errors qualitatively. Know your ceiling by measuring human–human agreement. Watch for systematic biases (positional, sycophancy, cultural, temporal) that corrupt aggregate statistics even when per-text accuracy appears adequate. Document every methodological choice. This applies whether you use prompting, fine-tuning, or any other approach.
Resources
Start here:
- Pangakis, Wolken & Fasching (2023), Automated Annotation with Generative AI Requires Validation: essential cautionary paper demonstrating that LLM annotation performance varies dramatically across tasks.
- Gilardi, Alizadeh & Kubli (2023), ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks: landmark comparison of LLM and human annotation quality, with important caveats.
- Scikit-learn classification metrics: Cohen’s κ, F1, precision, recall, and confusion matrix implementations.
Further reading:
- Törnberg (2024), ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages: zero-shot LLM classification on political stance.
- Krippendorff, K. (2019), Content Analysis: An Introduction to Its Methodology: the foundational text on annotation methodology and inter-coder reliability.
- Reiss (2023), Testing the Reliability of ChatGPT for Text Annotation and Classification: systematic assessment of reliability across annotation tasks.
Module Summary
This module addressed the gap between a working prototype and a publishable research pipeline. Fine-tuning adapts models when prompting hits a ceiling: LoRA makes this feasible on consumer hardware by training only a small low-rank update, while encoder models like DeBERTa offer a faster, often more accurate alternative for pure classification tasks. The decision between prompting, RAG, and fine-tuning is not purely technical; it is a research design choice that determines how your theoretical construct gets operationalized into a measurement procedure.
Working at scale turns a notebook experiment into a reliable data pipeline. Structured outputs guarantee parseable results across thousands of texts, async batching cuts processing time by orders of magnitude, and cost estimation prevents budget surprises. Whether you use a cloud API, a self-hosted open model, or a local encoder, the infrastructure choices shape what is practically feasible for your research.
Validation is the thread that connects everything. The model is another coder, and it requires the same scrutiny: kappa corrects for chance agreement, per-class F1 reveals where performance breaks down, qualitative error analysis identifies fixable patterns, and the five systematic biases (positional, sycophancy, cultural, temporal, distribution shift) are measurement validity threats that must be documented and mitigated. Every concept from Module 1 (embeddings, attention, scaling) and Module 2 (post-training, prompting, model selection) feeds into the pipeline built here.
Coming in Module 4: We move beyond classification to ask what else language models can do for social science research. You will learn how to extract structured information from unstructured text, retrieve and synthesise evidence from large corpora using RAG, and evaluate these pipelines with metrics that go far beyond accuracy.
Putting It Together
You now have the individual pieces: fine-tuning to adapt models when prompting is insufficient, infrastructure to run them at scale, and validation to verify that the results are trustworthy. The remaining question is how these pieces fit together in practice.
The End-to-End Workflow
Consider a concrete scenario: you have 50,000 parliamentary speeches and want to classify each into one of four framing categories (economic, moral, security, rights). Here is how the three sections of today’s session connect into a single workflow.
Stage 1: Prototype with prompting (Module 2). Write a classification prompt with clear definitions and a few examples of each category. Test it on 50–100 speeches using a capable model (e.g., Sonnet, GPT-5). Iterate until the prompt produces sensible results on easy cases.
Stage 2: Validate the prototype. Manually label 300 speeches (stratified across categories and time periods). Have a second human label the same set to establish the inter-coder agreement ceiling. Run the model on this sample, compute kappa and per-class F1, and analyse the errors qualitatively. If κ ≥ 0.70 and no category is systematically failing, the prompt may be sufficient, so proceed to Stage 4.
Stage 3: Fine-tune if needed. If validation reveals systematic errors that better prompting cannot fix, use the validated subset (plus any additional labeled data) to fine-tune. For four fixed categories, a fine-tuned DeBERTa is likely the best choice: fast inference, no API costs, strong classification performance. If you also need the model to extract quotes or provide reasoning, consider LoRA on a decoder model instead. Re-validate after fine-tuning: the same gold-standard set, the same metrics.
Stage 4: Deploy at scale. If using a prompted API model, set up async batching with checkpointing. Enable structured outputs to guarantee parseable JSON. Use the batch API for 50% cost savings if same-day turnaround is acceptable. If using a fine-tuned encoder, inference is fast enough that 50,000 speeches process in minutes on a single GPU. If data sensitivity prohibits external APIs, self-host via vLLM or run the encoder locally.
Stage 5: Validate again. After the full run, sample another 200–300 classified speeches (stratified by category and by any subgroups you care about, e.g., time periods, parties, chambers). Compute the same metrics. If performance degrades on a subgroup, investigate why. This is not a repeat of the Stage 2 validation: it checks whether your pipeline generalises across the full corpus, not just the subset you tested on.
Stage 6: Document and report. Your methods section should specify: the model name and version, the prompt (or the fine-tuning data and hyperparameters), temperature and sampling settings, gold-standard size and sampling method, inter-coder agreement (human–human and human–model), per-class F1, the qualitative error analysis, and any modifications made based on validation results.
The stages are not strictly sequential; you will loop between them. But the key discipline is: validate before scaling, and validate again after. Most failures in LLM-based research come from skipping one of these checks.
Key Papers
Ornstein et al. (2023), How to Train Your Stochastic Parrot: Large Language Models for Political Texts. The closest thing to a practitioner’s handbook for the full pipeline covered in this module. Covers prompt design, validation workflows, cost management, and the transition from interactive exploration to research-grade annotation. If you read one paper from today’s session, make it this one.
Social Science Application. Spirling (2023), Why open-source generative AI models are an ethical imperative for social science. Nature Computational Science. Reliance on closed-source models undermines three core scientific values: reproducibility (the model may change between your study and replication, as documented by Chen et al., 2023 in Module 2), transparency (you cannot inspect training data, safety filters, or alignment choices), and equitable access (API costs exclude researchers without well-funded labs). For the deployment decisions covered today, the implication is concrete: when possible, use open-weight models that can be version-pinned, inspected, and shared. When closed models are necessary, document every detail.
Key Takeaway
Current evidence strongly supports that LLMs are viable annotation tools, but only when deployed with the same methodological rigour applied to any measurement instrument. The workflow is: prototype with prompting, validate against human coders, fine-tune if needed, deploy at scale with structured outputs and checkpointing, and validate again on the full corpus. Document every methodological choice. Use open models when possible. Module 4 moves beyond classification to ask what else language models can do for social science research: extracting structured information from unstructured text, retrieving and synthesising evidence from large corpora with RAG, and evaluating these pipelines rigorously enough to publish on.