Module 3 of 5

Deploying for Research

Fine-tuning, working at scale, and validation: building rigorous classification pipelines from prompt to production.

Fine-Tuning

Day 2 showed how prompting and reasoning techniques turn a language model into a classification instrument. For many tasks, this works well. But prompting has a ceiling. When your classification scheme is domain-specific, when the model makes systematic errors that better examples cannot fix, or when per-token costs make API-based prompting prohibitive at scale, fine-tuning adapts a model’s weights to your specific task.

The Decision Framework: Prompt, RAG, or Fine-Tune

Before investing in fine-tuning, ask whether it is necessary. Three approaches exist for improving model performance, and each fits different circumstances.

Prompting (covered in Day 2) is the right starting point when zero-shot or few-shot performance meets your validation threshold, when you need flexibility across multiple tasks, when you have little or no labeled data, or when the task aligns well with the model’s pre-training distribution.

Retrieval-Augmented Generation (RAG) is the right choice when the model needs domain knowledge that exceeds its training data or its context window: your corpus of parliamentary records, legal texts, or historical archives. RAG uses embeddings (from Day 1) to retrieve relevant documents and feed them to the model alongside the query. We cover RAG in Day 4.

Fine-tuning is the right choice when prompting produces systematic errors that more or better examples cannot resolve, when you need consistent output formatting across thousands of classifications, when you have labeled data (hundreds to thousands of examples), or when domain-specific language or conventions are not handled well by general-purpose models.

Does zero/few-shot prompting meet your validation threshold?
  YES → Use prompting.
  NO → Does the model need domain knowledge or access to a large corpus?
    YES → Use RAG.
    NO → Do you have hundreds or more labeled examples?
      NO → Collect labels first.
      YES → Is your task pure classification with a fixed set of categories?
        YES → Encoder model (DeBERTa): smaller, faster, often more accurate.
        NO → LoRA on a decoder (Llama, Qwen): flexible, generative, multi-task.

A simplified decision tree for choosing between prompting, RAG, and fine-tuning. Real decisions often involve hybrid approaches, for example using RAG and fine-tuning together. The key factor is validation: if your current approach meets your threshold, added complexity is not justified. Illustrative diagram.

What Fine-Tuning Does

Fine-tuning continues training a pre-trained model on your labeled data. The model’s weights are adjusted to minimise prediction error on your examples. This is the same gradient-based optimisation from pre-training, but applied to a much smaller, task-specific dataset.

For decoder models (GPT, Llama, Qwen), fine-tuning typically uses Supervised Fine-Tuning (SFT): you provide (instruction, response) pairs, the same format you used for prompting in Day 2, and the model learns to produce those responses. The critical insight is that the training data is just prompt–response pairs. The difference from few-shot prompting is that instead of seeing your examples at inference time, the model internalises their patterns during training.

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{\lt t}, x)$$

The model is trained to predict each response token $y_t$ given the instruction $x$ and all preceding response tokens. This is the same next-token prediction objective from pre-training, applied to curated instruction–response pairs.
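Concretely, each labeled example becomes a short conversation. A minimal sketch; the instruction wording and label set here are illustrative, not the notebook's exact prompt:

```python
def to_sft_example(text, label):
    """Convert one labeled tweet into the chat-format (instruction, response)
    pair used for supervised fine-tuning."""
    return {
        "messages": [
            {"role": "system",
             "content": "Classify the author's stance toward the protest "
                        "movement as 'support' or 'oppose'."},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},  # the target response
        ]
    }

example = to_sft_example("This march rocks", "support")
```

During training, the loss is computed only on the assistant turn: the model learns to emit the label given the instruction and the text.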

LoRA: Parameter-Efficient Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 7B-parameter model, this means updating seven billion numbers, requiring multiple high-end GPUs and risking catastrophic forgetting, where the model loses general capabilities while learning your task.

Definition

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method (Hu et al., 2021) that freezes all pre-trained weights and injects small, trainable low-rank matrices into the model’s attention layers. Instead of updating billions of parameters, LoRA trains tens of millions, typically less than 1% of the total. This achieves performance comparable to full fine-tuning at a fraction of the compute cost.

The key observation behind LoRA: weight updates during fine-tuning tend to occupy a low-dimensional subspace. They do not need the full dimensionality of the weight matrix. LoRA exploits this by decomposing the update into two small matrices whose product approximates the full-rank update.

$$W' = W + \Delta W = W + BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll d$$

The original weight matrix $W$ is frozen. The update $\Delta W$ is decomposed into two small matrices $B$ and $A$ whose product has the same shape as $W$. The rank $r$ is typically 8–64, far smaller than the original dimensions $d$ and $k$ (often 4096 or more). This reduces trainable parameters by a factor of 100–1000×.
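To make the savings concrete, the parameter counts implied by the equation above can be computed directly. The dimensions below are illustrative values for a single attention projection matrix:

```python
d, k, r = 4096, 4096, 16       # attention dimensions and a typical LoRA rank

full_update = d * k            # parameters in a full-rank update of one matrix
lora_update = d * r + r * k    # parameters in the low-rank factors B and A

print(full_update)                 # 16777216
print(lora_update)                 # 131072
print(full_update // lora_update)  # 128
```

So at rank 16, each adapted projection trains 128× fewer parameters than a full update; lower ranks or larger matrices push the ratio further toward the 100–1000× range quoted above.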

W (d × k, frozen, ~7B parameters) + B (d × r) · A (r × k, trainable, ~10M parameters) = W′. Typical rank r = 8–64, versus attention dimensions d, k = 4096+. Adapters are injected into each attention layer’s Q, K, V, and O projections.

LoRA injects small trainable matrices into the frozen model. The rank r controls the trade-off between expressiveness and efficiency. Higher rank captures more complex adaptations but uses more memory. Illustrative diagram; dimensions are not to scale.

QLoRA (Dettmers et al., 2023) takes this further: it loads the frozen base model in 4-bit precision (NormalFloat4 quantisation), reducing memory by roughly 75% compared to 16-bit storage. The LoRA adapters themselves remain in full precision. This combination means a 7B model fits on a single consumer GPU with 16GB of VRAM, and even a 3B model can be fine-tuned on a free Google Colab T4 (15GB).

The output of LoRA fine-tuning is a small adapter file, typically tens of megabytes, that can be shared, versioned, and loaded on top of the base model. Multiple adapters can be trained for different tasks and swapped without reloading the base model.

Encoder Models for Classification

Everything covered so far (GPT, Claude, Llama, Qwen) uses a decoder-only architecture. These models read text left to right, generating one token at a time. They are designed for generation: writing text, following instructions, answering questions. But for pure classification, assigning a fixed label to a piece of text, a different architecture is often the better choice.

Encoder models (BERT, RoBERTa, DeBERTa) read the entire input simultaneously. Every token attends to every other token in both directions: forward and backward. In Day 1, we described how decoder models use causal masking so each token can only attend to preceding tokens. Encoder models remove this mask entirely: the representation of any word is informed by everything before and after it.

Definition

Encoder Model

A Transformer that processes the entire input sequence with bidirectional attention: every token can attend to every other token simultaneously. Encoder models produce a single fixed-length representation of the input (typically via a special [CLS] token) that is then mapped to a classification label. They cannot generate text token by token; they are designed for understanding, not production.

The classification mechanism is straightforward. The encoder processes the input and produces a contextualised vector for each token. A special [CLS] token, prepended to every input, aggregates information from the full sequence. A single linear layer maps this vector to a probability distribution over the label set.

$$P(y \mid x) = \text{softmax}\!\left(W_c \cdot \mathbf{h}_{\text{[CLS]}} + b\right)$$

The encoder produces a contextualised hidden state $\mathbf{h}_{\text{[CLS]}}$ for the special classification token. A learned linear layer $W_c$ maps this single vector to a probability distribution over the label set. The entire model (encoder and classification head) is fine-tuned end-to-end.
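The head itself is just a matrix multiply followed by a softmax. A pure-Python sketch with toy dimensions; the weights here are arbitrary numbers, not trained values:

```python
import math

def classification_head(h, W, b):
    """Map a [CLS] hidden state h to a probability distribution over labels."""
    logits = [sum(w * x for w, x in zip(row, h)) + b_j
              for row, b_j in zip(W, b)]
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# toy example: 4-dim hidden state, 2 labels (support / oppose)
h = [0.5, -1.0, 0.3, 2.0]
W = [[0.1, 0.2, -0.3, 0.4],
     [-0.2, 0.1, 0.5, -0.1]]
b = [0.0, 0.1]
probs = classification_head(h, W, b)
```

In a real model, $\mathbf{h}_{\text{[CLS]}}$ has hundreds of dimensions and $W_c$ is learned jointly with the encoder during fine-tuning; only the shapes differ from this sketch.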

The Encoder Lineage: BERT → RoBERTa → DeBERTa

BERT (Devlin et al., 2019) established the encoder paradigm. It introduced masked language modelling: randomly mask 15% of tokens and train the model to predict them from the surrounding context. This bidirectional pre-training produced representations that dramatically outperformed previous methods on classification, question answering, and other understanding tasks. BERT was, for several years, the standard tool for text classification in NLP research and social science.

RoBERTa (Liu et al., 2019) showed that BERT was substantially undertrained. By training longer, on more data, with larger batches, and with a refined masking strategy, RoBERTa achieved considerable improvements without changing the architecture. The lesson: training procedure matters as much as architecture.

DeBERTa (He et al., 2021) introduced two architectural innovations. Disentangled attention separates content and position into independent vectors, allowing the model to compute content-to-content, content-to-position, and position-to-content attention scores separately. This finer-grained attention captures more nuanced relationships. The enhanced mask decoder incorporates absolute position information at the prediction stage, addressing an ambiguity limitation in BERT’s approach.

DeBERTaV3 (He et al., 2023) replaced masked language modelling with ELECTRA-style pre-training, a more sample-efficient approach. The result: deberta-v3-base (184M parameters) and deberta-v3-small (44M parameters) currently represent the best general-purpose encoder models for classification. A 2025 comparison found that DeBERTaV3 remains superior to the newer ModernBERT in sample efficiency and overall benchmark performance, though ModernBERT offers advantages in long-context support.

Encoder vs. Decoder: When to Use Each

Encoder (DeBERTa): the input ([CLS] This march rocks) is read in one pass, every token attending to every other (bidirectional), and a classification head maps the [CLS] representation directly to the label (support). Size: 44M–400M parameters. Training: 2–5 minutes on a GPU. Inference: milliseconds per text. Best for fixed-category classification. Cannot generate text or follow instructions.

Decoder (Llama / GPT): the input (Classify: This march rocks) is read with causal attention (each token attends only to preceding tokens), and the answer (“The author… supports…” → support) is generated token by token. Size: 3B–70B+ parameters. Training (LoRA): 5–15 minutes. Inference: seconds per text. Best for flexible tasks, generation, and extraction. Cannot natively output a single classification score.

Encoder vs. decoder architecture for a classification task. The encoder produces a single classification in one pass; the decoder generates the answer token by token, which is slower but more flexible. Illustrative diagram; real models have many more layers and tokens.

For social science researchers, the practical guidance is straightforward:

Use an encoder model when your task is classification with a fixed set of categories, you have labeled training data (hundreds to thousands of examples), you need fast inference (thousands of texts per minute), and you do not need the model to generate free text. For most annotation tasks in computational social science (sentiment analysis, stance detection, topic labeling, framing analysis), a fine-tuned DeBERTa is likely the best choice in terms of accuracy per compute dollar.

Use a decoder model (via LoRA or prompting) when you need the model to generate text (summaries, explanations, structured extractions), when your task requires flexible instructions, when you are working with multiple tasks simultaneously, or when you have no labeled training data and must rely on zero/few-shot prompting.

Recent comparative work supports this distinction. Widmann & Wich (2023) found that fine-tuned encoder models (BERT-family) remain state-of-the-art for many classification tasks, sometimes outperforming much larger prompted decoder models. A 2024 analysis in Language Resources and Evaluation confirmed that encoder-only architectures generally provide better efficiency-to-performance ratios for discriminative classification, while noting that fully fine-tuned large decoders can occasionally match or exceed encoder performance when sufficient data and compute are available.

Practical Considerations

Data requirements: For encoder fine-tuning, a few hundred labeled examples often produce strong results; a few thousand approach diminishing returns. For LoRA fine-tuning of decoders, similar quantities work, though the per-example signal may be weaker because the model must also learn the output format.

Hyperparameters: For encoders, the most critical parameter is the learning rate: typically 2×10⁻⁵ to 5×10⁻⁵, with 2–4 training epochs and batch sizes of 16–32. For LoRA, common settings include rank 8–16, lora_alpha 16–32, and dropout 0.05–0.1. These are well-established defaults; the notebook walks through specific configurations.

Overfitting: With small datasets, the model can memorise training examples rather than learning generalisable patterns. Monitor validation loss during training: if it starts increasing while training loss continues to decrease, the model is overfitting. Use early stopping to halt training at the best validation checkpoint.
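The early-stopping rule can be sketched in a few lines; the `patience` value and the loss curve below are illustrative:

```python
def best_checkpoint(val_losses, patience=2):
    """Return the epoch index whose checkpoint should be kept: the best
    validation loss, stopping after `patience` epochs without improvement."""
    best_i, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_i, best = i, loss
        elif i - best_i >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_i

# validation loss falls, then rises: a classic overfitting curve
losses = [1.00, 0.80, 0.70, 0.75, 0.90, 0.95]
print(best_checkpoint(losses))  # → 2
```

Training frameworks ship this logic built in (e.g., an early-stopping callback); the sketch only makes the stopping criterion explicit.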

Evaluation: Always evaluate on a held-out test set that the model has never seen during training or validation. Report per-class precision, recall, and F1, not just overall accuracy. A model with 90% accuracy may achieve 95% on the majority class and 60% on the minority class, which is unacceptable for most research purposes.

Stop and Think

You have 500 labeled political tweets and want to classify 100,000 more. Would you use a fine-tuned DeBERTa, a LoRA-adapted Llama, or zero-shot prompting with Claude? What factors drive your choice?

Reveal

For pure binary or multi-class classification with 500 labeled examples, a fine-tuned DeBERTa is likely the best choice. It will train in minutes, infer in milliseconds per tweet, and cost essentially nothing to run (no API fees, minimal GPU time). LoRA on Llama would also work but is slower and more resource-intensive for a task that does not require generation. Zero-shot Claude might match DeBERTa’s accuracy on straightforward cases, but at ~$1.50–3.00 per 100K tweets (depending on the model tier) and with no guarantee of consistency across API updates. The 500 labeled examples make fine-tuning the clear winner here.

In the notebook: Exercise 4 walks you through formatting training data for LoRA fine-tuning: converting labeled tweets into chat-format (instruction, response) pairs. Exercise 5 trains a LoRA adapter on Qwen2.5-3B and probes whether fine-tuning fixes the hard cases from Day 2. The extension section fine-tunes DeBERTa for a direct encoder-vs-decoder comparison.

Key Takeaway

Fine-tuning is the tool you reach for when prompting hits a ceiling. LoRA makes fine-tuning accessible on consumer hardware by training less than 1% of parameters. For pure classification tasks with labeled data, encoder models (especially DeBERTa) are typically smaller, faster, and at least as accurate as fine-tuned decoders. The choice between approaches should be driven by your specific task requirements, available data, and validation results, not by which model is newest or largest. Once you have a fine-tuned model, the next question is how to deploy it at scale.

Resources

Working at Scale

You have chosen your approach: prompted classification, a fine-tuned decoder, or a fine-tuned encoder. The next challenge is infrastructure: how to process thousands or millions of texts efficiently, reliably, and within budget.

Running a model interactively in a chat interface works for exploration. Moving to research-scale annotation requires programmatic access, structured outputs, error handling, and cost planning. This section covers the practical infrastructure that turns a working prototype into a research pipeline.

API Access

Major model providers (OpenAI, Anthropic, Google) expose their models through REST APIs with a common structure: you send a JSON request containing your messages and parameters, and receive a JSON response with the model’s output. Python client libraries wrap this into clean function calls.

The core abstraction is the messages array: a list of (role, content) pairs representing the conversation. A system message sets the model’s behaviour, a user message provides the input, and the model returns an assistant message with its response. This is the same structure you used in Day 2 for prompting: the API formalises it.

Key parameters that affect output quality and cost: temperature (0 for deterministic classification, higher for creative tasks), max_tokens (cap the response length; for binary classification, 10 tokens is more than enough), and model (the specific model version, which should be pinned for reproducibility).
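Put together, a classification request is a small dictionary. A sketch of the common request shape; the model name is a placeholder, not a real version string, and the system prompt is illustrative:

```python
def build_request(text, model="example-model-2025-01-01"):
    """Assemble one classification request in the common messages format.
    Pin the exact provider model version in real use."""
    return {
        "model": model,
        "temperature": 0,     # deterministic output for classification
        "max_tokens": 10,     # a single label needs only a few tokens
        "messages": [
            {"role": "system",
             "content": "Classify the author's stance as 'support' or "
                        "'oppose'. Reply with exactly one word."},
            {"role": "user", "content": text},
        ],
    }

req = build_request("This march rocks")
```

Provider client libraries take these same fields as keyword arguments; the dictionary form makes the structure explicit and easy to log alongside results.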

Structured Outputs

When building classification pipelines, you need machine-parseable output: not free-form text. Asking a model to “respond with JSON” usually works for large models but can produce malformed output from smaller ones (a missing comma, an extra field, explanatory text outside the JSON).

Definition

Structured Output

A model response constrained to a specific format: typically JSON conforming to a provided schema. Modern APIs offer guaranteed structured output modes that constrain the model’s token generation to valid JSON, eliminating parsing failures. This is achieved by modifying the sampling process to only allow tokens that produce valid syntax at each step.

For research pipelines, structured outputs are essential. Without them, you need extensive post-processing to extract labels from free-text responses, and edge cases (the model adding qualifications, refusing to classify, or producing unexpected formats) can silently corrupt your data. With schema-constrained output, every response is guaranteed to parse correctly.
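When a guaranteed structured-output mode is not available, a defensive parser is a reasonable fallback: reject anything that is not valid JSON with the expected fields. A minimal sketch; the schema keys and label set are illustrative:

```python
import json

REQUIRED_KEYS = {"label", "confidence"}
VALID_LABELS = {"support", "oppose"}

def parse_response(raw):
    """Return the parsed dict if it conforms to the expected schema, else None
    so the pipeline can flag the item for retry or manual review."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    if obj["label"] not in VALID_LABELS:
        return None
    return obj

good = parse_response('{"label": "support", "confidence": 0.92}')
bad = parse_response("The stance is support.")  # free text fails to parse
```

The crucial design choice is returning None instead of guessing: a flagged failure is recoverable, a silently mis-parsed label is not.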

Batching & Async Processing

Processing texts one at a time is straightforward but slow. A single API call takes 0.5–2 seconds for classification; at that rate, 10,000 texts take 1.5–5.5 hours. Three patterns speed this up:

Asynchronous requests with rate limiting: Send multiple requests concurrently, using a semaphore to cap the number of simultaneous connections (typically 10–50, depending on your rate limit). This can reduce processing time by 10–50×. The notebook demonstrates this pattern with Python’s asyncio.
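The semaphore pattern can be sketched with Python’s asyncio; the `classify` coroutine below is a stand-in for a real API call, with a sleep simulating network latency:

```python
import asyncio

async def classify(text, sem):
    """Stand-in for one API call; the semaphore caps concurrent requests."""
    async with sem:
        await asyncio.sleep(0.01)  # simulated network latency
        return {"text": text, "label": "support"}

async def classify_all(texts, max_concurrent=20):
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(classify(t, sem) for t in texts))

results = asyncio.run(classify_all([f"tweet {i}" for i in range(100)]))
```

With 20 concurrent slots, 100 calls complete in roughly 5 round-trips of wall-clock time instead of 100; the same structure applies unchanged when `classify` wraps a real async client.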

Batch APIs: Anthropic and OpenAI offer dedicated batch endpoints where you submit a file of requests and receive results within hours. Anthropic’s Batch API provides a 50% cost reduction for non-urgent work. For large annotation projects where same-day results are acceptable, this is the most cost-effective approach.

Checkpointing: For any pipeline processing thousands of texts, save results incrementally. If the script crashes at text 8,000 of 10,000, you should be able to resume from where it stopped rather than re-running everything. This is not a performance optimisation: it is a reliability requirement. API connections fail, rate limits trigger, and machines restart.
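A minimal checkpointing sketch, using a JSONL file of (id, label) records; the stand-in labeler replaces a real model call:

```python
import json
import os
import tempfile

def load_done(path):
    """Return the set of ids already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["id"] for line in f}

def annotate(texts, path):
    """Append one JSON line per result; skip ids already checkpointed."""
    done = load_done(path)
    with open(path, "a") as f:
        for item in texts:
            if item["id"] in done:
                continue
            label = "support"  # stand-in for a real model call
            f.write(json.dumps({"id": item["id"], "label": label}) + "\n")

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
texts = [{"id": i, "text": f"tweet {i}"} for i in range(10)]
annotate(texts[:6], path)  # first run "crashes" after 6 texts
annotate(texts, path)      # resumed run processes only the remaining 4
```

Append-only JSONL is deliberately boring: each line is durable the moment it is written, and resuming is just re-reading the file.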

Cost Estimation

API pricing is token-based: you pay per million input and output tokens. For classification tasks, the input is typically 50–200 tokens (instruction + text) and the output is 1–10 tokens (the label). The cost per classification is therefore dominated by input tokens.

Estimated cost per 10,000 classifications assuming ~150 input tokens and ~5 output tokens per text. Prices as of early 2025; check provider pricing pages for current rates.

For research budgeting: estimate your corpus size, multiply by the per-text cost for your chosen model, and add 20–30% for retries, prompt iteration, and validation runs. A project classifying 100,000 texts with a budget model costs under $10; the same project with a frontier model costs $100–200. Self-hosted open models eliminate per-token costs but require GPU access.
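That arithmetic is easy to script. A sketch with illustrative per-million-token prices (not current rates for any real provider):

```python
def corpus_cost(n_texts, in_tokens=150, out_tokens=5,
                price_in=0.25, price_out=1.25):
    """Total USD cost for a classification run.
    Prices are USD per million tokens (illustrative, not current rates)."""
    per_text = in_tokens * price_in + out_tokens * price_out  # micro-dollars
    return n_texts * per_text / 1e6

base = corpus_cost(100_000)   # budget-tier pricing
budget = base * 1.25          # add ~25% for retries and validation runs
print(round(base, 2), round(budget, 2))
```

At these assumed rates, 100,000 classifications cost a few dollars, consistent with the "under $10 with a budget model" estimate above; swapping in frontier-tier prices scales the same formula into the $100–200 range.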

Self-Hosted Inference

When data privacy prevents sending text to external APIs, when per-token costs are prohibitive at scale, or when you need guaranteed reproducibility (no model version changes), self-hosted inference becomes necessary. You run the model on your own hardware and control the entire pipeline.

Definition

Self-Hosted Inference

Running a language model on infrastructure you control: a university GPU cluster, a cloud instance, or even a local machine. Your data never leaves your environment. You pin the exact model version and guarantee identical outputs indefinitely. The trade-off is that you handle infrastructure, maintenance, and the upfront cost of GPU access.

vLLM (Kwon et al., 2023) is the most widely used open-source inference engine. Its core innovation, PagedAttention, manages the KV cache (from Day 1) like virtual memory, reducing memory waste from 60–80% to under 4%. This allows serving 2–4× more concurrent requests on the same hardware. vLLM supports over 50 model architectures and exposes an OpenAI-compatible API, so existing code works without modification.

SGLang (Zheng et al., 2024) achieves even higher throughput through RadixAttention, which automatically discovers and reuses shared prefixes across requests. When many requests share the same system prompt or instruction prefix, as in classification pipelines, SGLang can be significantly faster than vLLM. Benchmarks show 29% higher throughput for standard serving and up to 4.6× faster performance for concurrent requests.

Ollama provides the simplest path to local inference. It wraps llama.cpp (efficient C++ inference) in a Docker-like interface: install, pull a model, and run it with a single command. Ollama is the right choice for researchers who want to experiment with local models without managing GPU infrastructure. It runs on consumer hardware, including Apple Silicon laptops, though throughput is substantially lower than dedicated GPU serving.

Corpus (raw texts) → Prompt template → Model (API or local) → Parse (extract label) → Store (checkpoint) → Validate (κ, F1, errors). Iterate: refine prompt, adjust model, re-validate.

A complete annotation pipeline. Each stage has a clear input and output. The validation step at the end feeds back into prompt refinement: this loop is what makes the pipeline rigorous. Illustrative diagram.

Stop and Think

Your university IRB prohibits sending student survey responses to external services. You need to classify 50,000 responses. What are your options?

Reveal

Three viable approaches: (1) Fine-tune a DeBERTa model locally; it runs on a single GPU or even a CPU for inference, and your data never leaves your machine. (2) Self-host an open-weight decoder model (e.g., Llama 3 via vLLM or Ollama) on your university’s GPU cluster. (3) Use a provider that offers data processing agreements (DPAs) compatible with your IRB requirements; some enterprise API tiers guarantee data is not used for training and is deleted after processing. Option 1 is typically simplest and cheapest for classification tasks.

In the notebook: Section 3 provides reference code for OpenAI and Anthropic API patterns, async batching with semaphores, and a cost comparison table. These are templates you can adapt for your own projects.

Key Takeaway

Scaling from notebook to pipeline requires three things: structured outputs for reliable parsing, async batching for speed, and checkpointing for resilience. Choose between API access (convenient, pay-per-token) and self-hosted inference (private, fixed-cost) based on your data sensitivity and budget. Whichever you choose, the pipeline is only as good as its validation, which is what we cover next.

Resources

Validation

Whether you use prompting, fine-tuning, or a dedicated encoder model, you face the same fundamental question: can you trust the results? A model that classifies 10,000 texts in minutes is useless if 15% of those classifications are systematically wrong in ways that bias your analysis.

The answer requires the same methodological rigour that social scientists apply to any measurement instrument. In content analysis, reliability is established by having multiple human coders label the same texts and measuring their agreement. An LLM-based classifier is another coder. It needs the same scrutiny.

The LLM as Another Coder

A common mistake: treating the model’s output as ground truth. When a model classifies a tweet as “support,” that is a prediction, not a fact. The same tweet might be ambiguous, sarcastic, or genuinely borderline. The model’s classification is one coder’s judgment, and it should be evaluated against the same standards you would apply to a human research assistant.

This framing has a practical consequence: your accuracy ceiling is set by human inter-coder agreement, not by 100%. If two well-trained human coders agree on 85% of texts (κ ≈ 0.70), expecting the model to exceed that level is unreasonable. The model should match or approximate human-level reliability, not surpass it.

Cohen’s Kappa: Beyond Accuracy

Raw accuracy, the percentage of correct classifications, is the most intuitive metric but also the most misleading. If 80% of your corpus expresses “support” and a model labels everything as “support,” it achieves 80% accuracy while being completely useless. The model has learned nothing; it exploits the class distribution.

Definition

Cohen’s Kappa (κ)

A measure of agreement between two coders that corrects for the agreement expected by chance. Unlike raw accuracy, kappa is not inflated by class imbalance. It is the standard reliability metric in content analysis and should be reported whenever an LLM is used as a text annotator.

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

$p_o$ is the observed agreement between two coders (the proportion of items on which they agree). $p_e$ is the agreement expected by chance, calculated from the marginal distributions of each coder’s labels. A $\kappa$ of 1.0 means perfect agreement; 0.0 means agreement no better than chance; negative values indicate systematic disagreement.
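The formula is only a few lines of code. A pure-Python implementation, checked on a toy example where the two coders agree on 3 of 4 items but half of that agreement is expected by chance:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items."""
    n = len(coder_a)
    # observed agreement: proportion of items with matching labels
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # chance agreement from each coder's marginal label distribution
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# p_o = 0.75, p_e = 0.5, so kappa = (0.75 - 0.5) / 0.5
print(cohens_kappa(["s", "s", "o", "o"], ["s", "o", "o", "o"]))  # → 0.5
```

Libraries such as scikit-learn provide the same computation (`cohen_kappa_score`); writing it out once makes clear why imbalanced labels inflate $p_e$ and deflate kappa.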

A common interpretation scale: κ > 0.80 is “almost perfect” agreement, 0.60–0.80 is “substantial,” 0.40–0.60 is “moderate,” and below 0.40 is “fair” to “poor.” These thresholds are conventional guidelines, not hard rules: the appropriate threshold depends on the difficulty of the task and the consequences of misclassification.

Beyond Kappa: F1 and Krippendorff’s Alpha

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Precision is the fraction of predicted positives that are actually positive. Recall is the fraction of actual positives that the model found. $F_1$ is their harmonic mean: it penalises models that sacrifice one metric for the other. Report per-class $F_1$ to detect imbalanced performance.

F1 provides complementary information to kappa. While kappa measures overall agreement, per-class F1 reveals where the model fails. A model with high overall kappa but low F1 on the minority class is systematically missing cases that may be analytically important.
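A per-class report is straightforward to compute from paired label lists. A pure-Python sketch with toy labels:

```python
def per_class_report(gold, pred):
    """Per-class precision, recall, and F1 from paired label lists."""
    report = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[label] = {"precision": prec, "recall": rec, "f1": f1}
    return report

report = per_class_report(
    gold=["support", "support", "oppose", "oppose"],
    pred=["support", "oppose", "oppose", "oppose"],
)
```

In practice scikit-learn's `classification_report` produces the same numbers; the point of the sketch is that the per-class breakdown is cheap enough that there is no excuse for reporting accuracy alone.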

When your project involves more than two coders (e.g., the LLM, a research assistant, and you), or when some texts are coded by different subsets of coders, Krippendorff’s alpha is the more appropriate metric.

$$\alpha = 1 - \frac{D_o}{D_e}$$

$D_o$ is the observed disagreement and $D_e$ is the disagreement expected by chance. Unlike Cohen’s $\kappa$, Krippendorff’s $\alpha$ handles any number of coders, accommodates missing data, and works with nominal, ordinal, interval, and ratio scales.

Qualitative Error Analysis

Aggregate metrics tell you how much the model gets wrong. They do not tell you why. Qualitative error analysis, reading through misclassified texts and categorising the failure modes, is essential for diagnosis and improvement.

Five error patterns recur across LLM classification tasks:

Sarcasm and irony: The surface text expresses one sentiment while the intended meaning is the opposite. Models that rely on keyword matching tend to miss this. Chain-of-thought prompting (from Day 2) can help, as can providing sarcastic examples in few-shot prompts.

Mixed signals: Texts that contain both supportive and opposing language, e.g., acknowledging a movement’s goals while criticising its methods. The model must decide which signal dominates. Clearer instructions about what to prioritise (author’s overall stance vs. individual statements) can reduce these errors.

Ambiguity: Texts that are genuinely too short or context-dependent to classify reliably. These are not model failures; they reflect real uncertainty in the data. Consider adding an “ambiguous” or “uncertain” category rather than forcing a binary choice.

Indirect stance: The author reports or quotes someone else’s position without stating their own. A tweet saying “Protesters claim the march was a success” is reporting, not supporting. Instruct the model to classify the author’s stance, not the reported stance.

Domain mismatch: The model misinterprets domain-specific language, abbreviations, or cultural references. This is most common with non-English text, subculture-specific language, or highly specialised terminology. Fine-tuning or providing domain context in the prompt can address this.

Stop and Think

You analyse 50 misclassified texts and find that 30 involve sarcasm, 10 involve indirect reporting, and 10 are genuinely ambiguous. What does this tell you about your improvement strategy?

Reveal

The error distribution is actionable. Sarcasm dominates (60%), so the highest-impact fix is adding chain-of-thought reasoning or sarcastic few-shot examples. Indirect reporting (20%) suggests clarifying the instruction to target author stance specifically. The ambiguous cases (20%) may be irreducible; consider an “uncertain” category. The key insight: not all errors are equal. Prioritise fixes that address the most common pattern first.

Systematic Biases

Beyond individual errors, LLMs exhibit systematic biases that affect classification at a population level. These are not random mistakes: they are directional tendencies that can corrupt aggregate statistics even when per-text accuracy appears acceptable.

Positional bias: When given ordered options (“support or oppose”), models tend to favour the first-listed option. Rotating label order across classifications and measuring whether results change can detect and partially mitigate this.

Sycophancy bias: Models tend toward the more “positive” or “agreeable” option. In stance detection, this manifests as over-predicting support. In sentiment analysis, it manifests as over-predicting positive sentiment. This is a direct consequence of RLHF training (from Day 2), where annotators tend to prefer agreeable outputs.

Cultural and political bias: Models trained predominantly on English-language Western data may misinterpret rhetorical conventions from other cultural contexts. They may also exhibit political leanings reflected in their training data and alignment process. For research on politically sensitive topics, this is a measurement validity concern that must be documented.

Temporal instability: Closed-source model providers update their models without notice. The exact same prompt may produce different results next month than it does today. For research that requires replicability, this is a fundamental problem. Pin model versions where possible, and document the exact version, date, and provider in your methods section. Open-weight models avoid this issue entirely.

Building a Validation Plan

Before deploying any classification pipeline, you need a validation plan. This is not optional: it is the methodological equivalent of pre-registering your survey instrument.

1. Create a gold-standard set. Manually label a random sample of your corpus. For binary classification, 200–500 texts typically provide stable kappa estimates. Use stratified sampling if your corpus has known subgroups (e.g., different sources, time periods, or topics). Include edge cases deliberately; they are what differentiate good classifiers from mediocre ones.

2. Establish the human ceiling. Have at least two humans (ideally including yourself) label the same subset. Their inter-coder agreement sets the ceiling: if humans agree at κ = 0.75, that is the maximum you can reasonably expect from the model.

3. Set a threshold. Decide in advance what level of agreement is acceptable for your research question. κ ≥ 0.70 is a common threshold in political science; some tasks may justify lower (exploratory) or require higher (high-stakes policy analysis).

4. Analyse failures qualitatively. Do not just report the number. Read the misclassified texts. Categorise the error patterns. Determine whether the errors are random (tolerable) or systematic (threatening to your analysis).

5. Document everything. Your methods section should report: the model name and version, the exact prompt, the temperature and sampling settings, the gold-standard size and sampling method, the kappa score, per-class F1, the qualitative error analysis, and any modifications made based on validation results. This is not optional transparency: it is the minimum for replicable research.
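The kappa computation at the heart of steps 2–3 is simple enough to sketch in pure Python (in practice, scikit-learn's `cohen_kappa_score` and `classification_report` compute the same quantities); the gold labels and predictions below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(gold, pred):
    """Cohen's kappa: observed agreement between two coders, corrected for
    the agreement expected by chance given each coder's label frequencies."""
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n      # observed agreement
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    p_e = sum((gold_freq[c] / n) * (pred_freq[c] / n)      # chance agreement
              for c in set(gold) | set(pred))
    return (p_o - p_e) / (1 - p_e)

# Invented gold-standard labels and model predictions on a validation sample
gold = ["support", "oppose", "support", "neutral", "oppose", "support"]
pred = ["support", "support", "support", "neutral", "oppose", "oppose"]
print(round(cohens_kappa(gold, pred), 3))  # → 0.455
```

Note that raw agreement here is 4/6 ≈ 0.67, but kappa is substantially lower because chance agreement on this label distribution is already ≈ 0.39.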

[Diagram: the validation cycle. 1. Sample (200–500 texts) → 2. Label (human + model) → 3. Compute (κ, F1, confusion matrix) → 4. Analyse (error patterns) → 5. Decide (κ ≥ threshold? Errors random?) → deploy or iterate. If below threshold: refine prompt, add examples, or switch approach.]

The validation cycle. Note that this is iterative: validation is not a one-time check at the end but a loop that informs every design decision in the pipeline. Illustrative diagram.

In the notebook: Exercises 1–3 walk you through the full validation workflow. You manually label 20 tweets (establishing your own inter-coder agreement with the gold standard), categorise error patterns in the model’s misclassifications, and design a validation plan for a hypothetical research project.

Stop and Think

A colleague says: “My model gets 94% accuracy on my validation set. I don’t need to check kappa: 94% is clearly good enough.” What is wrong with this reasoning?

Reveal

Three problems. First, 94% accuracy with severe class imbalance (e.g., 90% of texts are positive) could mean the model is barely outperforming a naive “always predict positive” baseline. Kappa corrects for this. Second, accuracy does not reveal whether errors are concentrated in one class; per-class F1 does. Third, without knowing the human inter-coder agreement, 94% may or may not be close to the ceiling. If humans agree at 96%, the model is near optimal. If humans agree at 94%, the model is already at the ceiling and further improvement is unlikely.
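The class-imbalance point is easy to verify with arithmetic: on a corpus that is 90% positive, a degenerate model that always predicts “positive” scores 90% accuracy but exactly zero chance-corrected agreement.

```python
# 90% positive corpus; a model that always predicts "positive"
# scores 90% accuracy yet has zero chance-corrected agreement.
n, n_pos = 100, 90
accuracy = n_pos / n                                # 0.9 observed agreement
p_e = (n_pos / n) * 1.0 + (1 - n_pos / n) * 0.0     # chance agreement: 0.9
kappa = (accuracy - p_e) / (1 - p_e)
print(accuracy, kappa)  # → 0.9 0.0
```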

Key Takeaway

Validation is not optional. Treat the LLM as another coder: compute Cohen’s kappa (not just accuracy), report per-class F1, and analyse errors qualitatively. Know your ceiling by measuring human–human agreement. Watch for systematic biases (positional, sycophancy, cultural, temporal) that corrupt aggregate statistics even when per-text accuracy appears adequate. Document every methodological choice. This applies whether you use prompting, fine-tuning, or any other approach.

Resources

Social Science Applications

The tools covered in this module (fine-tuning, scaled deployment, and rigorous validation) are the infrastructure that makes LLM-based research viable. This section highlights key papers that demonstrate how researchers are deploying these techniques and the methodological lessons they have learned.

LLM-Based Annotation at Scale

Ornstein et al. (2023), How to Train Your Stochastic Parrot: Large Language Models for Political Texts. A practical guide to building LLM annotation pipelines at scale. Demonstrates how LLMs become transformative for social science by enabling the processing of corpora that would take human coders years. The paper provides concrete guidance on prompt design, validation workflows, and cost management: the exact topics covered in this module.

Fine-Tuning vs. Prompting: When Each Wins

Widmann & Wich (2023), Creating and Comparing Dictionary, Word Embedding, and Transformer-Based Sentiment Analysis Tools. Political Analysis. Systematically compares fine-tuned encoder models (BERT family) against prompted LLMs for sentiment classification. The finding: fine-tuned encoders remain state-of-the-art for many classification tasks, sometimes outperforming models orders of magnitude larger. This supports the encoder-vs-decoder guidance in this module: for fixed-category classification with labeled data, smaller fine-tuned models are often the better choice.

The Validation Imperative

Pangakis, Wolken & Fasching (2023), Automated Annotation with Generative AI Requires Validation. Replicated 27 annotation tasks across 11 datasets from published social science papers and found that LLM performance varied dramatically, from near-perfect to barely above chance, depending on the dataset and task. The takeaway: task-by-task validation is non-negotiable. The authors propose a five-step validation workflow and provide open-source software to implement it.

Gilardi, Alizadeh & Kubli (2023), ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks. PNAS. A widely cited study showing that ChatGPT’s zero-shot annotation quality exceeds that of crowd-sourced workers on several political text classification tasks. Important nuance: the comparison is against crowd workers (often low-training, high-noise), not against expert coders. When compared against trained research assistants, the gap narrows or disappears. The paper remains valuable for establishing that LLMs are a viable annotation tool, but the comparison baseline matters.

Reproducibility & Open Science

Spirling (2023), Why open-source generative AI models are an ethical imperative for social science. Nature Computational Science. Argues that reliance on closed-source models undermines reproducibility, transparency, and equitable access: core scientific values. For the deployment decisions in this module, Spirling’s argument is directly relevant: when possible, use open-weight models that can be version-pinned, inspected, and shared. When closed models are necessary (because of capability requirements), document the version, date, and provider meticulously.

Key Takeaway

The research covered in this module demonstrates a clear consensus: LLMs are viable annotation tools, but only when deployed with the same methodological rigour applied to any measurement instrument. Fine-tune when prompting is insufficient. Validate before trusting. Document everything. Use open models when possible. Day 4 moves beyond classification to tasks where LLMs offer capabilities that traditional methods cannot match: information extraction from unstructured text, retrieval-augmented generation over large corpora, and the use of language models to simulate human survey responses.