Module 4 of 5

Beyond Classification

Information extraction, retrieval-augmented generation, and evaluation: building and validating systems that read, extract, and synthesise evidence across large corpora.

Modules 2 and 3 treated the LLM as a classifier: assign a label from a fixed set, verify against human coders, deploy at scale. Today you ask it to do two harder things: extract structured information from unstructured text, and retrieve and synthesise evidence across corpora too large to read. Both move from closed-ended labels to open-ended outputs, and that shift changes everything about how you evaluate the results.

The thread connecting both tasks is faithfulness. When the model classifies a text, the worst it can do is pick the wrong label. When it extracts claims or generates an answer from retrieved passages, it can invent information that looks exactly like real data. Every technique in this module exists to make model outputs traceable to their sources, and every evaluation metric you will learn measures a different dimension of that traceability.

After This Module You Will Be Able To

  1. Design an extraction schema, prompt a model to populate it, and verify faithfulness against source text.
  2. Build a RAG pipeline (chunk, embed, index, retrieve, generate) and explain the design trade-offs at each stage.
  3. Evaluate extraction and RAG outputs using both automated metrics (RAGAS, provenance checks) and manual validation, and explain why neither alone is sufficient.
  4. Distinguish hallucination, omission, and distortion as failure modes, and design verification strategies for each.
  5. Connect extraction and RAG into a research workflow where every claim is traceable to a specific passage in the source.

Information Extraction & Summarization

In Modules 2 and 3, you classified text: assigning labels to documents. That answers questions like “is this tweet supportive or opposed?” Now you will do something harder: extract structured information from unstructured text. This is the difference between asking “is this article about democracy?” and asking “what specific claims does this article make about democracy, what evidence does it cite, and what methods does it use?”

This matters because review articles, legislative texts, interview transcripts, and policy documents are dense with extractable structure: claims, cited findings, causal arguments, policy positions. Manually cataloguing all of this across a large corpus would take weeks. An LLM can attempt it in seconds. The question is: can you trust what it extracts?

From Classification to Extraction

Classification maps a text to a label from a fixed set. Extraction is open-ended: the model must identify relevant pieces of information, structure them according to a schema you define, and ideally trace each piece back to its source in the text. The output is not a single word but a structured object: a list of claims, a table of entities, a set of cited findings with their provenance.

This makes extraction substantially harder to evaluate. With classification, you compare the model's label to a ground truth label. With extraction, you need to check: Did the model find all the relevant items? Did it invent any that are not in the text? Did it faithfully represent what the text actually says? These are three different failure modes, and each one can corrupt your data in different ways.

Definition

Information Extraction

The task of identifying and structuring specific pieces of information from unstructured text. Unlike classification (which assigns a label), extraction produces structured output: entities, relationships, claims, events, or arguments, each ideally traceable to a specific passage in the source.

The Three Failure Modes

When an LLM extracts information from text, three things can go wrong. Understanding these failure modes is essential for building trustworthy extraction pipelines.

Hallucination: the model invents information that is not in the source text. It might attribute a finding to a citation that does not exist, or claim the article makes an argument it never makes. Hallucinated extractions are particularly dangerous because they look exactly like real ones: confident, well-structured, and plausible. The only way to catch them is to check against the source.

Omission: the model misses important information that is clearly present in the text. It might extract three of five key claims, silently dropping the two that are most nuanced or that require more careful reading. Omissions are less dangerous than hallucinations (they don't add false data) but they bias your dataset toward whatever the model finds easiest to extract.

Distortion: the model extracts something that is partially correct but changes the meaning. It might report that “Smith (2020) found that immigration increases wages” when the article actually says “Smith (2020) found that high-skilled immigration increases wages in some sectors.” The claim is not invented, but the qualifications that make it accurate have been stripped away.

Stop and Think

Which failure mode is most dangerous for a literature review that uses LLM-extracted claims as its primary data source? Would your answer change if the extracted data were used for a meta-analysis versus a qualitative synthesis?

Reveal

For a meta-analysis, distortion is arguably most dangerous: a subtly altered effect size or a dropped qualifier changes the quantitative input without being obviously wrong. For a qualitative synthesis, hallucination is worse: an invented claim could lead to entirely false conclusions about the state of a field. Omission matters in both cases but is easier to detect (you can check coverage against a table of contents or known papers). The key point: different downstream uses require different verification strategies.

Faithfulness as the Core Requirement

The thread running through all of today's material is faithfulness: every piece of extracted or generated information should be traceable to a specific passage in the source text. This is the standard that separates a research tool from a hallucination machine.

Definition

Faithfulness (in extraction and generation)

The property that every claim in the model's output is supported by the source material provided to it. A faithful extraction contains only information present in the source text. A faithful summary does not add, invent, or distort. Faithfulness is distinct from correctness: a faithful extraction accurately represents what the source says, even if the source itself is wrong.

Verifying faithfulness requires comparing the model's output against the source text, item by item. Two practical approaches exist. String matching checks whether key phrases from each extracted claim appear in the source, which is fast but brittle (it misses paraphrases). LLM-as-judge uses a separate model call to evaluate whether each claim is supported by the passage, which is more flexible but introduces its own error rate. In practice, using both provides a reasonable check.
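A minimal sketch of the string-matching half of that check, assuming extracted claims arrive as a list of strings; the function names, token filter, and 0.6 threshold are illustrative choices rather than the notebook's implementation, and flagged claims should go to an LLM-as-judge call or manual review rather than being discarded outright.

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't block a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def string_match_check(claims: list[str], source: str, min_overlap: float = 0.6) -> list[dict]:
    """Flag extracted claims whose key tokens are mostly absent from the source text.

    This is the fast-but-brittle half of faithfulness verification: faithful
    paraphrases will also be flagged, so a flag means "check further", not "reject".
    """
    source_norm = normalise(source)
    results = []
    for claim in claims:
        # Keep only longer tokens so stopwords don't inflate the overlap score.
        tokens = [t for t in normalise(claim).split() if len(t) > 3]
        hits = sum(1 for t in tokens if t in source_norm)
        overlap = hits / len(tokens) if tokens else 0.0
        results.append({"claim": claim, "overlap": overlap, "flagged": overlap < min_overlap})
    return results
```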

Schema Design: Deciding What to Extract

Before writing a prompt, you need to decide what categories of information to extract. This is codebook design, the same skill from Module 3's classification work, applied to a harder problem.

For academic text, common extraction categories include: substantive claims (assertions the authors make), cited findings (results attributed to other researchers), methodological approaches, and causal arguments. But the categories are not always clean. Is a “cited finding” also a “claim” when the authors endorse it? Is a research design a “method” or a “framework”? These ambiguities are real and affect what the model extracts. Defining them precisely in the prompt is as important as defining classification labels.
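One way to make the schema concrete before prompting is to write it down as a typed structure and describe its fields in the prompt itself. The field names and category labels below are illustrative assumptions, not a fixed standard:

```python
from dataclasses import dataclass

@dataclass
class ExtractedItem:
    text: str                      # the claim as stated, quoted or closely paraphrased
    kind: str                      # e.g. "substantive claim", "cited finding", "causal argument", "method"
    source_quote: str              # verbatim passage the item is traced to (provenance)
    cited_work: str | None = None  # e.g. "Smith (2020)" when the item is attributed to another study

EXTRACTION_INSTRUCTIONS = """\
Extract every substantive claim, cited finding, causal argument, and methodological
approach from the passage below. Return a JSON list of objects with the fields
"text", "kind", "source_quote", and "cited_work". Use only information present in
the passage, copy source_quote verbatim, and return an empty list if nothing applies."""
```

Writing the schema as code forces the ambiguities above into the open: whatever you decide about endorsed cited findings or design-versus-method has to be stated in the field comments and the prompt, not left to the model's discretion.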

In the notebook: Exercise 1 puts you in the role of the extractor. You manually identify claims, cited findings, and methods from a passage of an Annual Review of Political Science article on AI governance. This builds your gold standard: when the model does the same extraction in Exercise 2, you compare its output against yours. Exercise 3 has you write a faithfulness verification function to check the model's work systematically.

Summarization: A Special Case of Extraction

Summarization is extraction where the target is the overall argument rather than individual items. The same failure modes apply: the model can hallucinate claims the article never makes, omit key arguments, or distort nuanced positions into simplified versions.

For research use, the critical question is whether a summary is displacive: does reading the summary make it unnecessary to read the original? If the summary is intended as a research tool (helping you decide which articles to read in full), some information loss is acceptable. If the summary is intended as a data source (coding the article's position for quantitative analysis), faithfulness is paramount and should be verified item by item.

Stop and Think

You use an LLM to summarize 200 policy documents and then code each summary for policy positions. A colleague points out that you are coding the model's interpretation, not the documents themselves. Is this a valid concern? How would you address it?

Reveal

It is a valid concern. The model's summary is a lossy transformation of the original: it reflects the model's choices about what to include, emphasize, or simplify. Coding the summary introduces an intermediate layer of interpretation that may systematically distort certain positions. Two mitigation strategies: (1) validate on a sample by coding both the original documents and the summaries and measuring agreement, and (2) use extraction rather than summarization, pulling specific claims with provenance rather than relying on the model's editorial judgment about what matters.

Key Takeaway

Information extraction turns unstructured text into structured data, but faithfulness is the bottleneck. The model can hallucinate, omit, and distort, and these errors are harder to detect than classification errors because the output space is open-ended. Every extraction pipeline needs a verification step. The methodological standard is simple: every extracted claim should be traceable to a specific passage in the source.


Retrieval-Augmented Generation (RAG)

In Section 1, you extracted information from a passage you already had. But what if you have a question and do not know which passages to look at? A colleague asks: “What does the recent political science literature say about how states approach AI governance?” The answer is somewhere in 24 review articles, totalling roughly 480 pages. Most models' context windows cannot hold all of that, and even the ones that can show degraded performance over very long contexts.
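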

RAG solves this: instead of feeding the model everything, you first retrieve the most relevant passages, then feed only those to the model for generation. The retrieval step uses embeddings: the same vector representations from Module 1. You embed your question, embed all your passages, and find the passages whose vectors are closest to your question's vector. Then you pass those passages to the model as context.

Definition

Retrieval-Augmented Generation (RAG)

A framework that augments an LLM's generation with retrieved context from an external knowledge base. Instead of relying on the model's training data (which may be outdated, incomplete, or fabricated), RAG grounds the model's output in specific documents you control. The pipeline has four stages: chunk the documents, embed and index the chunks, retrieve relevant chunks at query time, and generate an answer conditioned on the retrieved context.

The RAG Pipeline

Building a RAG system involves four design decisions, each of which affects the quality of the final output. Getting retrieval wrong means the model never sees the relevant information; getting generation wrong means the model does not use the information it does see. Both failure points need separate evaluation.

Step 1: Chunking

Documents must be split into pieces small enough for the embedding model to handle and specific enough for retrieval to be meaningful. This is chunking, and it is the most underappreciated design choice in RAG.

Chunk size is the primary trade-off. Small chunks (100-200 words) are precise: retrieval returns exactly the relevant sentence or paragraph. But they lose context: a claim extracted from a 100-word chunk may be missing the qualifications stated in the surrounding paragraphs. Large chunks (500-1000 words) preserve context but dilute relevance: a chunk about immigration policy that also discusses taxation may be retrieved for tax questions when the immigration content is irrelevant.

Overlap addresses boundary problems. Without overlap, a claim that spans two chunks is split between them, and neither chunk alone captures the full meaning. Overlapping chunks (sharing 50-100 words with their neighbours) ensure that boundary information appears in at least one chunk. The cost is redundancy: more chunks means more embeddings and a larger index.

Metadata attached to each chunk (article title, author, topic, year) enables filtered retrieval: search only within articles about a specific topic, or restrict to a particular time period. Without metadata, a query about Labour's immigration policy might retrieve chunks from Conservative manifestos that happen to use similar language.
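A sketch of word-window chunking with overlap and per-chunk metadata, assuming whitespace tokenisation; production pipelines often chunk by tokens or by document structure (paragraphs, sections) instead, and the default sizes simply echo the ranges discussed above.

```python
def chunk_document(text: str, meta: dict, chunk_size: int = 250, overlap: int = 50) -> list[dict]:
    """Split a document into overlapping word-window chunks, attaching metadata to each.

    chunk_size and overlap are in words. The overlap means a claim that straddles
    a chunk boundary still appears whole in at least one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "start_word": start,
            **meta,  # e.g. {"title": ..., "author": ..., "year": ...} enables filtered retrieval
        })
    return chunks
```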

Step 2: Embedding and Indexing

Each chunk is converted into a dense vector using an embedding model. In the notebook, you use all-MiniLM-L6-v2, a small (80MB) sentence-transformer that produces 384-dimensional vectors. It was trained so that semantically similar texts end up with similar vectors, which is exactly what retrieval requires.

These vectors are stored in a vector index (the notebook uses FAISS, Facebook's similarity search library) that enables fast nearest-neighbour search. When a question comes in, you embed it with the same model and find the k chunks whose vectors are closest to the question's vector.
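A minimal sketch of the embed-and-index step with sentence-transformers and FAISS, assuming an exact (flat) inner-product index over L2-normalised vectors, which is equivalent to cosine similarity and perfectly adequate at this corpus size:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[dict]) -> faiss.IndexFlatIP:
    """Embed each chunk and store the vectors in a flat FAISS inner-product index.

    Vectors are L2-normalised so inner-product search behaves like cosine
    similarity, matching how the query will be compared at retrieval time.
    """
    texts = [c["text"] for c in chunks]
    vectors = model.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # 384 dimensions for all-MiniLM-L6-v2
    index.add(np.asarray(vectors, dtype="float32"))
    return index
```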

Stop and Think

The embedding model converts text into vectors based on semantic similarity. But “semantically similar” is not the same as “relevant to the question.” Can you think of a case where a passage is highly relevant to a research question but would have low cosine similarity to the question's embedding?

Reveal

A passage that answers a question often uses different vocabulary from the question itself. The question “What causes democratic backsliding?” uses the term “backsliding,” but a relevant passage might discuss “erosion of institutional constraints” or “executive aggrandisement” without ever using the word “backsliding.” Good embedding models handle common synonyms, but domain-specific terminology and indirect answers remain a challenge. This is why evaluating retrieval quality is as important as evaluating generation quality.

Step 3: Retrieval

At query time, the question is embedded and the index returns the k most similar chunks. The choice of k is another design decision. Too few results may miss relevant information. Too many dilute the context with irrelevant passages, which can cause the model to generate text that blends relevant and irrelevant content. In practice, k = 3 to 10 is common, and the right value depends on how concentrated the relevant information is across the corpus.

Retrieval quality is the bottleneck. If the retriever returns the wrong passages, no amount of generation quality will save the output. The model will produce a confident, well-structured answer based on irrelevant context. Evaluating retrieval separately from generation is essential: check whether the retrieved chunks actually contain the information needed to answer the question before evaluating the generated answer.
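Continuing the sketch above, retrieval embeds the question with the same model and asks the index for the k nearest chunks; k = 5 is just a starting point, and the score attached to each result is the similarity used for ranking.

```python
def retrieve(question: str, index, chunks: list[dict], k: int = 5) -> list[dict]:
    """Return the k chunks whose embeddings are closest to the question embedding.

    Uses `model` and `np` from the indexing sketch in Step 2.
    """
    q_vec = model.encode([question], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    # FAISS returns -1 for empty slots when k exceeds the number of indexed chunks.
    return [{**chunks[i], "score": float(s)} for s, i in zip(scores[0], ids[0]) if i != -1]
```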

In the notebook: Exercise 4 has you rank passages by relevance manually, then compare your ranking to the embedding model's ranking. You will discover where semantic similarity captures relevance and where it fails.

Step 4: Generation

The retrieved chunks are formatted into a prompt alongside the question, and the model generates an answer. The generation prompt must enforce two constraints: the model should only use information from the provided passages, and it should cite which passage each claim comes from. Without these constraints, the model will freely mix retrieved information with its own training data, and you will not be able to tell which claims are grounded and which are fabricated.
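A sketch of a prompt builder that enforces both constraints; the exact wording is an assumption you should tune, and the numbered-passage format assumes the retrieve() sketch from Step 3.

```python
def build_rag_prompt(question: str, passages: list[dict]) -> str:
    """Format retrieved passages with numbered tags so the model can cite them."""
    context = "\n\n".join(
        f"[Passage {i + 1}] ({p.get('title', 'unknown source')})\n{p['text']}"
        for i, p in enumerate(passages)
    )
    return (
        "Answer the question using ONLY the passages below. "
        "Cite the passage number, e.g. [Passage 2], after every claim. "
        "If the passages do not contain the answer, say so explicitly.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```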

Model scale matters for generation quality. In the notebook, you compare a 3B-parameter local model with a 72B-parameter API model on the same retrieval results. The larger model is substantially better at synthesizing information across multiple passages without losing faithfulness. The smaller model tends to either copy passages verbatim (safe but not a synthesis) or attempt synthesis and introduce errors.

Evaluating RAG Output

RAG evaluation has two layers, and conflating them is a common mistake.

Retrieval evaluation: Did the system find the right passages? Check the retrieved chunks against the question and verify that the answer is actually contained in them. If retrieval fails, the generation cannot succeed, regardless of model quality.

Generation evaluation: Given the correct passages, did the model produce a faithful, well-grounded answer? This is the same faithfulness verification from Section 1, now applied to generated text rather than extracted items. Each claim in the answer should be traceable to a specific retrieved passage.

The notebook introduces a provenance check that uses embedding similarity to match each sentence in the generated answer to its closest retrieved passage. Sentences with low similarity to any passage are flagged as potentially ungrounded: the model may have drawn on its training data rather than the provided context.
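A sketch of that sentence-level provenance check, reusing the embedding model from Step 2; the crude full-stop sentence splitter and the 0.5 threshold are illustrative assumptions to be tuned on your own data.

```python
def provenance_check(answer: str, passages: list[dict], threshold: float = 0.5) -> list[dict]:
    """Match each answer sentence to its closest retrieved passage by cosine similarity.

    Sentences below the threshold are flagged as potentially ungrounded: the model
    may have drawn on its training data rather than the provided context.
    Uses `model` from the indexing sketch in Step 2.
    """
    sentences = [s.strip() for s in answer.split(".") if len(s.strip()) > 20]
    sent_vecs = model.encode(sentences, normalize_embeddings=True)
    pass_vecs = model.encode([p["text"] for p in passages], normalize_embeddings=True)
    sims = sent_vecs @ pass_vecs.T  # cosine similarity, since all vectors are normalised
    results = []
    for sentence, row in zip(sentences, sims):
        best = int(row.argmax())
        results.append({
            "sentence": sentence,
            "closest_passage": best,
            "similarity": float(row[best]),
            "flagged": float(row[best]) < threshold,
        })
    return results
```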

In production systems, teams usually combine several evaluation layers rather than relying on one score. A practical stack is: retrieval metrics (recall@k, precision@k, MRR) to test whether the right evidence is found, response metrics (faithfulness/groundedness, answer relevance, context relevance) to test whether the answer uses that evidence correctly, and human review on sampled outputs to detect subtle reasoning failures that automatic metrics miss.

This is where frameworks like RAGAS are useful: they standardize automatic scoring for common RAG failure modes and make model-to-model or prompt-to-prompt comparisons reproducible. But these metrics have trade-offs. They are fast and scalable, yet sensitive to judge model choice, prompt wording, and domain mismatch. Treat them as triage signals, not ground truth.

For research and policy work, the rule should be explicit: automated evaluation filters candidates; manual validation decides what is publishable. If a claim matters for an argument, include a citation and verify it against the source passage directly.

In the notebook: Exercise 5 has you write the generation prompt for the RAG pipeline, enforcing source-only generation and citation. Exercise 6 compares the RAG answer to a no-context answer on the same question: you will see hallucination in the no-context version and grounding in the RAG version. The provenance check function then verifies whether the RAG answer actually uses the retrieved passages. As an extension, evaluate a small batch with a RAGAS-style metric set, then manually audit the highest-scoring and lowest-scoring answers to see where metric scores align with researcher judgment and where they fail.

Stop and Think

Your RAG system retrieves 5 passages and generates an answer with 8 claims. The provenance check shows that 6 claims match retrieved passages with high similarity, but 2 do not match any passage. What could explain the ungrounded claims? How would you handle this in a research pipeline?

Reveal

Several explanations are possible. The model may have drawn on its training data to fill gaps in the retrieved context. The claims may be valid inferences from the passages that the provenance check fails to match (paraphrased too heavily for embedding similarity to catch). Or the model may have hallucinated. You cannot tell from the check alone. In a research pipeline, ungrounded claims should be flagged for manual review, not silently included. One practical approach: set a grounding threshold and automatically discard claims below it, then manually verify a sample of the claims above it.

RAG for Social Science Research

RAG enables research workflows that were previously impractical. Querying decades of parliamentary records, analyzing large legislative databases, building interactive research tools over document collections, or conducting systematic reviews across hundreds of papers: these all become feasible when retrieval and generation are combined into a pipeline.

But RAG does not eliminate the researcher's judgment. It changes where judgment is applied: from reading every document to designing the pipeline (chunk size, embedding model, retrieval parameters, generation prompt) and evaluating its output (retrieval quality, faithfulness, provenance). The pipeline is an instrument, and like any instrument, it needs validation before you trust its measurements.

Anthropic (2024), Contextual Retrieval. Demonstrates that prepending context (a brief summary of the document) to each chunk before embedding substantially improves retrieval accuracy. A practical technique for production RAG pipelines where retrieval quality is the bottleneck.

Coming in Module 5: The functions you build today (retrieve(), rag_answer(), verify_faithfulness()) become tools that an agent can call. Instead of you writing the pipeline steps manually, an agent decides which tool to use and when. The pipeline becomes a planning loop.

Key Takeaway

RAG grounds the model's answers in your actual data, scaling LLM-based analysis to corpora of any size. But it is not magic: retrieval quality determines generation quality, and both need separate evaluation. Every design choice (chunk size, overlap, embedding model, k, generation prompt) affects the output. The provenance thread from extraction carries through to RAG: every claim should be traceable to a source. Automated metrics help you scale evaluation, but manual validation remains the final quality gate.


Evaluating Extraction & RAG Pipelines

The previous two sections introduced faithfulness checks and provenance scoring as inline verification steps. This section steps back and asks a bigger question: how do you know your entire pipeline is working? Not just whether one answer is grounded, but whether the system as a whole produces outputs you can stake a research claim on.

Evaluation in extraction and RAG is harder than in classification for a structural reason. Classification has a single output (a label) and a single ground truth (the correct label). Extraction and RAG have compound outputs: multiple claims, each with a provenance chain, each potentially faithful or unfaithful, relevant or irrelevant. A single “accuracy” number cannot capture this. You need a stack of metrics, each testing a different failure mode.

Why You Need Separate Metrics for Retrieval and Generation

A RAG system can fail in two independent ways, and conflating them is the most common evaluation mistake.

Retrieval failure: the system finds the wrong passages. The answer will be wrong regardless of how good the generation model is, because it never sees the evidence it needs. Retrieval failure is invisible in end-to-end evaluation: the model generates a confident answer from whatever context it was given, and the answer may look plausible even though the evidence is irrelevant.

Generation failure: the system finds the right passages but the model misuses them. It may hallucinate beyond the context, ignore key evidence, or distort nuances. Generation failure is also invisible if you only check whether the answer “sounds right.”

Measuring both layers separately tells you where your pipeline is breaking, not just that it is breaking. A study by PremAI (2026) found that retrieval accuracy alone explains only about 60% of RAG quality variance; the remaining 40% comes from how the model uses retrieved context. If you only measure end-to-end quality, you cannot tell which layer to fix.

Retrieval Metrics

These metrics test whether the retriever found the right passages, independent of what the model does with them.

Precision@k: of the k retrieved chunks, how many are actually relevant to the query? If you retrieve 5 chunks and 3 contain the answer, precision@5 is 0.6. Low precision means the model's context is diluted with irrelevant material.

Recall@k: of all the relevant passages in the corpus, how many appear in the top k results? If the answer is spread across 4 passages and the retriever finds 3, recall@5 is 0.75. Low recall means the model is missing evidence.

Mean Reciprocal Rank (MRR): how early does the first relevant result appear? MRR rewards retrievers that rank the best evidence first, which matters because models tend to weight earlier passages more heavily.

Context relevancy: a semantic measure of how related the retrieved chunks are to the query, beyond keyword overlap. This catches cases where retrieval returns passages that use the same terms but discuss a different topic.
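Assuming you have hand-labelled which chunk IDs are relevant for each test query, the first three metrics reduce to a few lines; the helper names below are illustrative, and MRR is the mean of the reciprocal ranks across all queries.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for c in relevant if c in top_k) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```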

Definition

The RAG Triad

A simplified evaluation framework using three core metrics: context relevancy (did the retriever find the right passages?), faithfulness (did the generator only use what it was given?), and answer relevancy (did the output actually address the question?). Each metric maps to different pipeline parameters: context relevancy depends on chunk size and embedding model, faithfulness on the LLM and prompt, answer relevancy on the prompt template. Fixing a low score means tuning the right layer.

Generation Metrics

These metrics test whether the model used the retrieved passages correctly.

Faithfulness: the proportion of claims in the generated answer that are supported by the retrieved context. This is the single most important metric for research use. A faithfulness score below 0.8 means the model is routinely drawing on its training data rather than the provided passages. For regulated or publication-grade work, target 0.9 or above.

Answer relevancy: does the generated answer actually address the question asked? A model can be perfectly faithful to the retrieved context and still produce an irrelevant answer if the context itself doesn’t contain what was needed. High faithfulness combined with low answer relevancy usually points to a retrieval problem, not a generation problem.

Groundedness / provenance: for each claim in the output, can you trace it back to a specific retrieved passage? This is the sentence-level version of faithfulness. Ungrounded sentences are flagged for review. The provenance check from Section 2 (matching output sentences to source passages via embedding similarity) is one implementation of this metric.

Stop and Think

Your RAG pipeline scores 0.92 on faithfulness but 0.54 on answer relevancy. What is most likely going wrong, and which part of the pipeline would you fix first?

Reveal

High faithfulness means the model is not hallucinating: it sticks to the retrieved context. Low answer relevancy means the retrieved context does not contain what is needed to answer the question. This is a retrieval problem, not a generation problem. Fix the retriever first: try different chunk sizes, increase k, improve the embedding model, or add query rephrasing. Do not touch the generation prompt until retrieval is working.

RAGAS and Automated Evaluation Frameworks

RAGAS (Retrieval Augmented Generation Assessment) standardises the metrics above into a reusable evaluation pipeline. It is reference-free and LLM-driven: it uses a judge model to score faithfulness, answer relevancy, context precision, and context recall without requiring hand-labelled ground truth for every question. This makes it practical to run over hundreds of test queries, which is essential for catching failure modes that appear only on certain question types.
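A sketch of what a RAGAS run looks like, assuming the 0.1-style API (the interface and column names have changed across versions) and a judge-model API key configured in the environment; check your installed version's documentation before copying this.

```python
# Assumes: pip install ragas datasets, plus an OpenAI-compatible key for the judge model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question":     ["How do the reviewed articles define AI governance?"],
    "answer":       ["The articles define AI governance as ... [Passage 1]"],
    "contexts":     [["AI governance refers to the rules and institutions that ..."]],
    "ground_truth": ["A short reference answer, used by context_recall."],  # column name varies by version
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages over the test set
```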

Other frameworks serve similar roles. DeepEval provides the same core metrics plus regression testing: you can integrate it into a CI/CD pipeline so that a prompt change or model swap automatically triggers evaluation and fails the build if scores drop. TruLens adds tracing: it records every intermediate step (retrieved chunks, prompt, response) so you can debug individual failures.

These tools are valuable for development and iteration. But they have real limitations that you need to understand before relying on them.

Why Automated Metrics Are Not Enough

Every automated RAG metric works the same way: a judge model (typically GPT-4 class) reads the question, the retrieved passages, and the generated answer, then scores each dimension. This is LLM-as-judge evaluation, and it inherits all the biases you studied in Module 3.

Judge model sensitivity. The choice of judge model changes the scores. A study using three different judge models on the same outputs found faithfulness scores that varied by up to 0.15 depending on which model did the judging. If your metric depends on an LLM’s interpretation of “supported by the context,” then changing the judge is like changing the ruler.

Prompt sensitivity. The wording of the evaluation prompt matters. Small changes to how you ask the judge to score faithfulness (“is this claim supported?” vs. “can this claim be inferred?”) can shift scores. The metrics are not measuring a fixed property; they are measuring how a specific judge model interprets a specific evaluation prompt.

Domain mismatch. Judge models were trained on general text. They may not reliably score faithfulness in highly technical domains (legal, medical, domain-specific social science) where a subtle distortion requires expert knowledge to detect. A faithfulness score of 0.95 from a general-purpose judge does not mean the output is accurate by domain standards.

Failure mode blindness. Automated metrics are good at catching obvious hallucinations (the model claims something not in the context) and bad at catching subtle distortions (the model paraphrases a nuanced claim and loses the qualifications). For research, the subtle cases are exactly the ones that matter.

Research Note. Traditional NLP metrics like BLEU and ROUGE measure word overlap between the generated text and a reference answer. They are fast to compute and do not require a judge model. But they are essentially useless for RAG evaluation: a semantically correct paraphrase scores low on ROUGE, while a hallucinated sentence that happens to share vocabulary with the reference scores high. If a collaborator suggests evaluating your RAG system with BLEU or ROUGE, redirect them to the metrics described here.

Manual Validation: The Final Quality Gate

The uncomfortable truth is this: automated evaluation filters candidates; manual validation decides what is publishable. This is not a limitation of current tools that will be solved by better models. It is structural. The question “is this claim faithful to the source?” is ultimately a judgment about meaning, and meaning requires domain expertise.

A practical evaluation workflow has three layers:

Layer 1: Automated metrics at scale. Run RAGAS or DeepEval over your full test set (50–200 questions). Use the scores to identify which parts of the pipeline are weak. Low context recall? Fix the retriever. Low faithfulness? Fix the prompt or model. This is triage, not validation.

Layer 2: Stratified sampling for manual review. From the full test set, manually review a stratified sample: high-scoring outputs (to check whether the metrics are right), low-scoring outputs (to understand the failure modes), and edge cases (ambiguous questions, long documents, domain-specific terminology). A sample of 30–50 outputs is usually sufficient to calibrate your trust in the automated scores.

Layer 3: Domain expert audit. For any claim that will appear in a publication, verify it against the source passage directly. This means reading the original text, not the summary or the retrieved chunk. If a RAG answer says “Smith (2020) found X,” open Smith (2020) and check.
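A sketch of the Layer 2 stratified sample, assuming the automated scores sit in a pandas DataFrame with one row per evaluated question and a faithfulness column; the stratum sizes are illustrative.

```python
import pandas as pd

def stratified_review_sample(scores: pd.DataFrame, n_per_stratum: int = 10, seed: int = 42) -> pd.DataFrame:
    """Draw a manual-review sample from the bottom, middle, and top of the faithfulness distribution.

    Reviewing the extremes checks whether the metric agrees with your judgment;
    reviewing the middle checks whether your grounding threshold sits in the right place.
    """
    ranked = scores.sort_values("faithfulness").reset_index(drop=True)
    n = min(n_per_stratum, len(ranked))
    mid_start = max(len(ranked) // 2 - n // 2, 0)
    sample = pd.concat([
        ranked.head(n).assign(stratum="low"),
        ranked.iloc[mid_start:mid_start + n].assign(stratum="mid"),
        ranked.tail(n).assign(stratum="high"),
    ])
    # Shuffle so reviewers do not see outputs in score order, which would bias their judgments.
    return sample.sample(frac=1, random_state=seed)
```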

Stop and Think

A colleague argues that manual validation is too slow and expensive for a corpus of 10,000 documents, and that RAGAS scores above 0.9 should be sufficient for publication. How would you respond?

Reveal

Manual validation is not meant to cover 10,000 documents. It covers a sample. Even 50 manually reviewed outputs (0.5% of the corpus) can reveal systematic failure patterns that automated metrics miss: subtle distortions, domain-specific hallucinations, or consistent omission of minority viewpoints. The RAGAS score is a useful filter, but it measures what a judge model thinks about faithfulness, not what a domain expert knows. A score of 0.9 means the judge model flagged 10% of claims as potentially ungrounded. For a publication, you need to know what those 10% are and whether the pattern is random or systematic. That requires human eyes.

In the notebook: Exercise 7 walks you through a complete evaluation workflow. You run RAGAS metrics over a batch of RAG outputs, identify the highest- and lowest-scoring answers, then manually audit both groups to see where automated scores align with your domain judgment and where they diverge. The goal is not to replace the automated metrics but to learn when to trust them.

Key Takeaway

Evaluating extraction and RAG requires separate metrics for separate failure modes: retrieval metrics test whether you found the right evidence, generation metrics test whether you used it correctly. Automated frameworks like RAGAS make this scalable but inherit the biases of their judge models. The rule for research is non-negotiable: automated evaluation identifies problems; manual validation by someone who understands the domain decides what is trustworthy enough to publish. There is no score threshold that exempts you from reading the sources.


Putting It Together

You now have three capabilities: extracting structured claims from text, retrieving and synthesising evidence across large corpora, and evaluating both with a stack of metrics calibrated by manual review. The remaining question is how these fit together in a research workflow.

From Pipeline Components to Research Design

Consider a concrete project: you want to map how 200 political science articles discuss the relationship between AI governance and democratic accountability. No single technique from today is sufficient.

Extraction gives you the data. For each article, you extract substantive claims, cited findings, and causal arguments. The extraction schema defines what you are measuring: this is construct operationalisation, the same problem from Module 3’s codebook design, applied to a richer output space. Each extracted item is verified against its source passage using the faithfulness checks from Section 1.

RAG gives you synthesis. Once you have a database of verified extractions, you can query across the entire corpus: “What evidence do these articles cite for AI regulatory capture?” or “How do arguments about democratic oversight differ between US and EU contexts?” The RAG pipeline retrieves relevant chunks, generates a grounded synthesis, and the provenance chain lets you trace every claim back to a specific article.

Evaluation binds it together. The three-layer evaluation workflow from Section 3 applies at every stage: automated metrics for triage, stratified manual review for calibration, domain expert audit for anything you will publish. The same principle from Module 3’s validation pipeline holds: the model is an instrument, and instruments require calibration at every step.

Ziems et al. (2024), Can Large Language Models Transform Computational Social Science? provides the broadest review of where LLMs fit in CSS workflows. Their taxonomy of tasks (classification, extraction, generation, simulation) maps directly onto the progression from Module 2 through today. The paper’s central warning is that methodological shortcuts compound: unverified extraction feeds ungrounded RAG, which feeds unreliable conclusions. The evaluation infrastructure from Section 3 exists to break this chain.

Choosing the Right Tool

Not every research question requires extraction or RAG. The choice depends on what you are trying to measure.

Use classification (Modules 2–3) when the answer is a label from a fixed set. Sentiment, stance, topic category, frame type. The output is simple, evaluation is straightforward (kappa, F1), and the pipeline is well-understood.

Use extraction when you need structured data from within the text: claims, entities, relationships, cited findings. The output is richer and harder to evaluate, but it captures information that classification cannot.

Use RAG when you need to synthesise across a corpus too large to read: answering research questions, building evidence maps, conducting systematic reviews. RAG adds retrieval to generation, grounding the model’s output in your actual data.

Combine them when the research design demands it. Extract claims from individual articles, index the extractions, and use RAG to query across the full set. Each layer adds capability and complexity; add layers only when simpler approaches are insufficient.

Key Takeaway

Extraction and RAG extend the LLM pipeline from Module 3 in two directions: richer outputs (structured claims instead of labels) and larger scope (corpus-wide synthesis instead of document-level annotation). Both inherit the same core requirement: every claim must be traceable to a source. The validation infrastructure does not change; it deepens. The workflow is: extract with verification, retrieve with evaluation, synthesise with provenance, and validate manually before publishing.

Module Summary

This module moved beyond classification to tackle the two hardest tasks in LLM-assisted research: extracting structured information from unstructured text, and synthesising evidence across corpora too large to read manually. Information extraction turns documents into structured data (claims, entities, causal arguments), but the model can hallucinate, omit, and distort. Every extraction pipeline needs a faithfulness check that compares each output item against its source passage.

Retrieval-Augmented Generation grounds the model’s answers in your actual data, making corpus-scale analysis feasible. But RAG is a pipeline, not a single tool: chunk size, embedding model, retrieval parameters, and generation prompt each affect quality in different ways. Evaluating retrieval and generation separately is essential; conflating them hides where the pipeline is breaking.

Evaluation is the thread that ties everything together. Automated metrics like RAGAS make large-scale evaluation practical, but they inherit the biases of their judge models and miss the subtle distortions that matter most for research. The non-negotiable standard: automated evaluation identifies problems; manual validation by a domain expert decides what is publishable. Every concept from Module 1 (embeddings for retrieval), Module 2 (prompting for generation), and Module 3 (validation as calibration) feeds into the pipelines built here.

Coming in Module 5: The functions you built today—extract(), retrieve(), rag_answer(), verify_faithfulness()—become tools that an agent can call. Instead of you writing the pipeline steps manually, an agent decides which tool to use and when. The pipeline becomes a planning loop, and the evaluation challenge shifts from checking individual outputs to verifying entire reasoning chains.