Module 5 of 5

Agentic Workflows

Building autonomous research agents with tool use, the ReAct pattern, failure modes and safeguards, and the bridge from notebooks to production tools.

Over the past four days, you have built every piece of a research pipeline by hand: prompting, classifying, validating, extracting, retrieving, evaluating. Each step required you to decide what to do next. Today, you hand that decision-making to the model itself.

An agent is an LLM wrapped in a loop. You give it a goal and a set of tools — the same functions you wrote in Module 4 (search, extract, verify) — and it decides which tool to call, in what order, and what to do with the results. This is not a different kind of model. It is the same model, doing the same things, but orchestrating them autonomously.

The power is obvious: a multi-step literature review that would take you hours can run in minutes. The risk is equally obvious: every failure mode from the course — hallucination, omission, distortion, retrieval errors, unfaithful generation — can now compound across steps without a human checking each one. Building agents that are useful and trustworthy is the challenge of this session.

After This Module You Will Be Able To

  1. Explain how an LLM agent works: tools, system prompt, parser, and execution loop.
  2. Build a ReAct agent that alternates between reasoning and tool use, and trace its decision-making step by step.
  3. Identify the four failure modes specific to agents (hallucination on missing data, format drift, redundant loops, cascading errors) and implement safeguards for each.
  4. Evaluate agent output using provenance verification, connecting back to the faithfulness checks from Module 4.
  5. Use production tools (Codespaces, Claude Code, MCP) to move from notebook prototypes to real research workflows.

What Is an Agent?

In every session so far, you have been the decision-maker. You decided which prompt to write, which tool to call, which results to inspect. The model did what you asked, one step at a time.

An agent reverses this. You give the model a goal, and it decides what steps to take: which tools to call, in what order, and what to do with the results. The model becomes the planner; you become the supervisor.

This is not a different kind of model. It is the same LLM you have been using all week, wrapped in a loop that lets it take actions and observe results. The architecture is surprisingly simple: a system prompt that describes available tools, a parser that extracts tool calls from the model's output, and a while loop that executes tools and feeds results back. That's it.

Definition

LLM Agent

A system where a language model autonomously plans and executes multi-step tasks by calling external tools (search, code execution, APIs, databases) and using the results to inform subsequent actions. The model decides what to do and when; the surrounding code handles how (parsing, execution, error handling).

Tools as Building Blocks

An agent is only as useful as its tools. In the notebook, the tools come directly from your Module 4 work: search_literature() searches the political science corpus using the FAISS index you built, extract_findings() pulls structured claims from a passage, summarize() condenses text, and calculate() evaluates mathematical expressions.

Each tool is a Python function with a clear interface: it takes a string input and returns a string output. The agent does not know how the tools work internally. It knows only their names, their descriptions, and the format of their inputs. This separation is important: it means you can swap out the implementation (replace FAISS with a database, replace the local model with an API) without changing the agent's behaviour.
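A minimal sketch of this string-in / string-out interface, with the retrieval step stubbed out (in the notebook it queries the FAISS index from Module 4):

def search_literature(query: str) -> str:
    """Search the political science corpus; return the top passages as plain text."""
    # In the notebook this runs the Module 4 FAISS retrieval; stubbed here.
    passages = ["<passage 1>", "<passage 2>", "<passage 3>"]
    return "\n\n".join(passages)

# The agent sees only names and descriptions; the implementations stay hidden.
TOOLS = {
    "search_literature": search_literature,
    # "extract_findings": ..., "summarize": ..., "calculate": ...
}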

The pattern scales. Production agents like Claude Code use the same architecture with more powerful tools: file system operations, shell commands, web search, code execution. The Model Context Protocol (MCP) standardises this further, providing a universal interface so that any tool can work with any model.

Stop and Think

Think about your own research workflow. What are the “tools” you use repeatedly: searching a database, reading a paper, extracting data from a table, running a statistical test? If each of these were a function an agent could call, which parts of your workflow could be automated, and which require judgment that a model cannot provide?

Reveal

Most researchers find that the search → read → extract → organise loop is highly automatable: an agent can find relevant papers, pull out key claims, and compile structured summaries. The parts that resist automation are theoretical judgment (is this framework appropriate?), methodological criticism (is this identification strategy valid?), and creative synthesis (what does this pattern mean?). The practical sweet spot is using agents for the mechanical parts and reserving human judgment for the interpretive parts.

In the notebook: Exercise 1a is “Be the Algorithm” for agents. You plan the steps a model would take to answer a research question, writing out the Thought / Action / Observation sequence by hand. Exercise 1b has you test each tool individually before putting them in the loop.

The ReAct Pattern

Most LLM agents follow a framework called ReAct (Yao et al., 2023): the model alternates between Reasoning and Acting. At each step, it produces a thought (reasoning about what to do next), then an action (a tool call), then observes the result. This cycle repeats until the model has enough information to produce a final answer.

The power of ReAct is transparency. Because the model writes out its reasoning at every step, you can audit the decision-making process. If the agent gives a wrong answer, you can trace back through the thoughts and observations to find where it went wrong: Did it search for the wrong thing? Did it misinterpret a tool result? Did it synthesize correctly but from incomplete evidence?
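Written out, one pass through the cycle looks something like this (the tool names match the notebook; the question and observation content are purely illustrative):

Thought: I need evidence on whether social media use increases affective polarization.
Action: search_literature
Action Input: social media use and affective polarization
Observation: [three passages returned from the corpus, including one experimental study]
Thought: The passages cover both observational and experimental evidence. I have enough to answer.
Final Answer: The retrieved studies suggest ... [synthesis citing the passages above]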

The Three Components

Building a ReAct agent requires three pieces of varying difficulty:

The system prompt is the hardest part. It must describe the agent's role, list the available tools with precise descriptions, specify the exact Thought / Action / Action Input format, and set behavioural rules (one tool at a time, do not invent information, cite sources). If the format specification is imprecise, the model's output will be unparseable.
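A compressed sketch of such a prompt (the notebook's Exercise 2a version is longer and more explicit):

You are a research assistant. Answer the question using ONLY the tools below.

Tools:
- search_literature: search the political science corpus. Input: a search query.
- extract_findings: extract structured claims from a passage. Input: the passage text.

Use exactly this format, one tool call per turn:
Thought: <your reasoning about what to do next>
Action: <tool name>
Action Input: <input for the tool>

When you have enough evidence, respond with:
Final Answer: <your answer, citing the passages you used>

Never invent information that is not in an observation.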

The parser extracts tool calls from the model's text output. It looks for patterns like Action: search_literature and Action Input: social media polarization, or detects when the model produces a Final Answer: instead. This is string parsing, not conceptual work, but it must match the format specified in the system prompt exactly.
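A minimal parser for the format sketched above might look like this (the regular expressions and return values are one possible design, not the notebook's exact code):

import re

def parse_response(text: str):
    """Return ('final', answer), ('tool', name, tool_input), or ('error', text)."""
    if "Final Answer:" in text:
        return ("final", text.split("Final Answer:", 1)[1].strip())
    action = re.search(r"Action:\s*(\w+)", text)
    action_input = re.search(r"Action Input:\s*(.+)", text)
    if action and action_input:
        return ("tool", action.group(1), action_input.group(1).strip())
    return ("error", text)   # format drift: no parseable tool call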

The loop orchestrates everything. It sends the conversation to the model, parses the response, executes the requested tool, appends the observation to the conversation, and repeats. It also enforces safeguards: a maximum number of steps (preventing infinite loops), error handling for format violations, and cost tracking.
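Put together, the loop itself is short. This sketch reuses parse_response() and TOOLS from above and assumes a generate() helper that sends the conversation to whichever model you are using and returns its text:

def run_agent(question: str, max_steps: int = 8) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question}]
    for step in range(max_steps):
        reply = generate(messages)            # model call; generate() is an assumed helper
        parsed = parse_response(reply)
        if parsed[0] == "final":
            return parsed[1]
        messages.append({"role": "assistant", "content": reply})
        if parsed[0] == "error":              # format drift: remind the model of the format
            messages.append({"role": "user", "content":
                             "Reply using the Thought / Action / Action Input format."})
            continue
        _, tool_name, tool_input = parsed
        if tool_name in TOOLS:
            observation = TOOLS[tool_name](tool_input)
        else:
            observation = f"Unknown tool: {tool_name}"
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Step budget exhausted before a final answer was produced."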

Definition

ReAct (Reasoning + Acting)

A prompting framework where the model alternates between reasoning (thinking through what to do next) and acting (executing a tool call), using the results of each action to inform subsequent reasoning. This creates a transparent, step-by-step problem-solving loop that can be audited and debugged.

Text Parsing vs. Native Tool Calling

The notebook builds the ReAct loop from scratch: the model outputs text, and your code parses Action: tool_name from the response. This is pedagogically valuable because it exposes every moving part. But production systems rarely work this way.

Modern APIs (OpenAI, Anthropic, Google) support native tool calling: you define tools as a JSON schema, the model returns a structured tool-call object rather than free text, and the API handles the format for you. This eliminates format drift entirely — the model cannot produce an unparseable response because the tool call is a structured object, not a substring in prose.

Frameworks like LangGraph and the Anthropic tool-use API abstract this further, letting you define tools as Python functions and handling the serialisation, execution, and result injection automatically. The ReAct loop is still running underneath, but you no longer write the parser yourself.
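For comparison, here is roughly what the same search tool looks like with Anthropic's native tool calling; treat the model name as a placeholder and check the current API documentation before relying on the details:

import anthropic

client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # placeholder; substitute a current model
    max_tokens=1024,
    tools=[{
        "name": "search_literature",
        "description": "Search the political science corpus for relevant passages.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }],
    messages=[{"role": "user",
               "content": "What does the corpus say about social media and polarization?"}],
)

# No text parsing: the tool call arrives as a structured content block.
for block in response.content:
    if block.type == "tool_use":
        observation = search_literature(block.input["query"])
        # ...append a tool_result block to the messages and call the API again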

Research Note. The notebook uses text-parsed ReAct intentionally: you need to see how the loop works to understand what can go wrong. Once you have built it by hand, switch to native tool calling for any real project. The failure modes (hallucination, cascading errors, redundant loops) remain the same regardless of whether the tool call is parsed from text or returned as a structured object. Format drift is the one failure mode that native APIs eliminate.

Stop and Think

The system prompt for a ReAct agent is much longer and more prescriptive than a classification prompt. Why? What happens if the format specification is vague?

Reveal

A classification prompt produces a single output (a label). An agent prompt produces structured output that must be parsed by code. If the model writes "I'll search for: social media" instead of "Action: search_literature", the parser fails and the loop breaks. The system prompt must be prescriptive enough that the model produces machine-parseable output reliably across many steps. This is why larger models are better for agents: they follow structured formats more consistently.

In the notebook: Exercise 2a has you write the system prompt. Exercise 2b has you build the agent loop. Exercise 2c runs the agent on real research questions so you can watch it reason, act, and synthesize in real time.

When Agents Fail

Agents fail in ways that are qualitatively different from single-call LLM failures. A classification error affects one label. An agent error can cascade through multiple steps, producing a confident-sounding synthesis built on flawed intermediate results. Understanding these failure modes is essential for building agents you can trust.

Hallucination on Missing Data

When the corpus does not contain information relevant to the query, a well-behaved agent should say so. Instead, many agents fabricate plausible-sounding results. The agent might “search” the corpus, get back irrelevant passages, and then synthesize an answer as if the passages contained what it needed. The final answer reads as confidently grounded, but the connection between the retrieved text and the claims is fabricated.

This is more dangerous than single-call hallucination because the agent did use a tool. The presence of a search step creates a false sense of grounding. The answer looks like it went through proper retrieval when it did not.

Format Drift

After several steps, the model may drift out of the Thought / Action / Observation format. It might start writing conversational prose, skip the Action line, or merge its reasoning with a premature final answer. When the parser cannot extract a tool call, the loop stalls or produces an error.

Format drift is more common with smaller models and longer conversations. The accumulated context pushes the model away from the structured format established in the system prompt. Mitigation strategies include keeping the conversation short (fewer steps), re-injecting format reminders, or using models specifically trained for tool calling.

Infinite Loops and Redundant Actions

An agent might search for the same query repeatedly, extract from the same passage multiple times, or cycle between tools without making progress. This wastes tokens and money without improving the answer. A hard step limit (typically 5-10 steps) prevents runaway costs, but it does not prevent the agent from wasting those steps on redundant actions.

Cascading Errors

An early mistake propagates through subsequent steps. If the agent's first search retrieves irrelevant passages, the extraction step will extract irrelevant claims, and the synthesis will build on irrelevant evidence. Each step looks locally reasonable (the extraction is faithful to the passage it was given), but the chain is globally wrong because the first link was bad.

This is the agent-specific version of the “retrieval quality is the bottleneck” lesson from Module 4. In an agent, every tool call is a potential bottleneck, and errors compound.

Safeguards

Production agents address these failure modes with several mechanisms:

Step budgets cap the number of tool calls. After reaching a limit, the agent must synthesize from whatever it has. Some implementations inject a warning at step N-2: “You have 2 steps remaining. Summarize your findings.”
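Inside the loop sketched earlier, that warning is a two-line injection (same assumed variable names):

# At the top of each loop iteration, before the model call:
if step == max_steps - 2:
    messages.append({"role": "user",
                     "content": "You have 2 steps remaining. Summarize your findings."})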

Minimum evidence rules require the agent to search at least twice before producing a final answer, preventing premature synthesis from a single source.

Provenance verification runs after the agent finishes: a separate call checks whether every claim in the final answer is supported by an observation from a tool call. This is the same faithfulness check from Module 4, applied to agent output. The three-layer evaluation workflow (automated metrics, stratified manual review, domain expert audit) extends naturally to agents: run RAGAS-style metrics over the agent’s final answers, manually audit a sample of the reasoning traces, and verify any claim that will be published.
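A minimal post-hoc provenance check might look like the sketch below, again assuming the generate() helper; Exercise 2e in the notebook asks you to design your own version:

def verify_provenance(final_answer: str, observations: list[str]) -> str:
    """Ask the model whether every claim in the answer is supported by an observation."""
    evidence = "\n\n".join(observations)
    prompt = (
        "Below is an answer produced by a research agent, followed by the raw tool "
        "observations it had access to. For each claim in the answer, state whether "
        "it is SUPPORTED, UNSUPPORTED, or CONTRADICTED by the observations.\n\n"
        f"ANSWER:\n{final_answer}\n\nOBSERVATIONS:\n{evidence}"
    )
    return generate([{"role": "user", "content": prompt}])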

Human-in-the-loop checkpoints pause the agent at critical decision points and ask for human approval before proceeding. For high-stakes applications (modifying data, sending communications), this is essential.

Stop and Think

An agent answers your research question with a well-structured synthesis citing three sources. The provenance check shows that all three citations correspond to real passages in the corpus. Does this mean the answer is trustworthy?

Reveal

Not necessarily. Provenance confirms that the citations are real, but it does not confirm that the agent's interpretation of those passages is correct. The agent may have distorted the meaning (the same failure mode from Module 4's extraction section), selectively quoted to support a conclusion the passages do not actually support, or missed contradicting evidence because it did not search broadly enough. Grounded citations are a necessary condition for trust, not a sufficient one.

In the notebook: Exercise 2d stress-tests the agent with out-of-domain questions, adversarial inputs, and questions designed to trigger failure modes. Exercise 2e has you pick one safeguard and implement it: minimum evidence rules, step budget awareness, or post-hoc source verification.

Key Takeaway

An agent is a loop, not magic. The quality depends on three things: the model's ability to follow the Thought/Action format reliably, the quality and coverage of the tools, and safeguards against the failure modes catalogued above. In production, you add retry logic, context management, cost tracking, and human-in-the-loop approval for high-stakes actions.


From Notebooks to Production

The agent you build in the notebook works, but it lives in a Colab environment. Production agents need a real development environment (a terminal, file system, version control), persistent tool connections (databases, APIs, web search), better observability (logging, cost tracking, auditing), and human oversight for high-stakes actions.

Three tools bridge this gap. Your instructor will demonstrate each one live during the session.

GitHub Codespaces

Codespaces provides a full cloud development environment in your browser: VS Code with a terminal, file system, Git, Python, and anything else you need. No local installation required.

For agent development, Codespaces solves the “works on my machine” problem. You configure the environment once (in a devcontainer.json file), and every collaborator gets an identical setup. GitHub's free tier includes 120 core-hours per month (about 60 hours on the default 2-core machine), which is sufficient for experimentation.

Claude Code

Claude Code is a command-line agent that lives in your terminal. It can read and write files, execute shell commands, search codebases, and plan multi-step tasks. It uses the exact same ReAct architecture you built in the notebook, with a frontier model and real filesystem tools.

For researchers, Claude Code is useful for data analysis (“Load this CSV, compute summary statistics, and create a visualization”), pipeline building (“Write a Python script that cleans and merges these three datasets”), and iterative analysis (“The regression shows heteroskedasticity. Add robust standard errors and re-run”).

Installation requires Node.js and an Anthropic API key:

npm install -g @anthropic-ai/claude-code    # install the CLI globally (requires Node.js)
cd my-research-project/                     # run it from inside your project
claude                                      # start an interactive session

Model Context Protocol (MCP)

MCP is an open standard that lets AI models connect to external tools and data sources. Instead of defining tools as Python functions (as you do in the notebook), MCP defines a protocol so that any tool can work with any model.

Think of it as USB for AI: just as USB lets any peripheral work with any computer, MCP lets any tool work with any AI model. You can connect Claude to your university's database, Google Drive, or Zotero. You can build custom MCP servers that expose your datasets or analysis pipelines. And you can share tools across your research group: one person builds the server, everyone uses it.

MCP servers are lightweight programs (often 50-100 lines of Python or TypeScript). The MCP documentation has quickstart guides and examples.
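For a flavour of what this looks like, here is a minimal server following the quickstart pattern of the official MCP Python SDK; treat the exact import path and decorator as something to verify against the current documentation:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("corpus-search")

@mcp.tool()
def search_literature(query: str) -> str:
    """Search the political science corpus and return the top passages."""
    # In a real server this would query your FAISS index or database.
    return "<top passages for: " + query + ">"

if __name__ == "__main__":
    mcp.run()   # serves the tool over stdio so an MCP client (e.g. Claude) can call it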

Anthropic (2024), Model Context Protocol Specification. Open standard for connecting AI models to external tools and data sources. The specification defines how tools advertise their capabilities, how models request tool calls, and how results are returned.

Key Takeaway

The gap between a notebook prototype and a production workflow is tooling, not theory. Codespaces gives you a reproducible environment, Claude Code puts a frontier agent in your terminal, and MCP lets you connect any model to any data source. The agent architecture is the same in all three: tools, a loop, and safeguards. Start with the notebook and graduate to production tools when the task outgrows a single session.


Module Summary

This module introduced the agent: an LLM wrapped in a loop that decides what to do, executes tools, and uses the results to plan its next step. The ReAct pattern (Thought / Action / Observation) makes this decision-making transparent and auditable. Modern APIs replace the text-parsed version with native tool calling, but the underlying loop is the same.

Failure modes in agents are qualitatively different from single-call errors: hallucination on missing data looks grounded because a search step occurred, format drift breaks the loop silently, redundant actions waste budget, and cascading errors propagate early mistakes through every subsequent step. Safeguards (step budgets, minimum evidence rules, provenance verification, human-in-the-loop checkpoints) exist for each failure mode. None are optional for production use.

Production tools bridge the gap from notebooks to real workflows: Codespaces for reproducible environments, Claude Code for terminal-based agents, and MCP for connecting models to your data. These tools use the same architecture you built by hand, with more capable models and better safeguards.

The Full Arc: Five Modules, One Toolkit

The progression across the course has been deliberate. Module 1 built the foundations: how models represent meaning and generate text. Modules 2 and 3 taught you to control models: prompting, validation, fine-tuning, APIs at scale. Module 4 applied those skills to harder research tasks: extraction, RAG, and rigorous evaluation. Module 5 showed how to orchestrate all of these into autonomous workflows.

Each module’s tools became the next module’s building blocks. Module 1’s embeddings powered Module 4’s retrieval. Module 2’s prompting skills shaped Module 4’s extraction prompts and Module 5’s system prompt. Module 3’s validation framework carried through as the faithfulness checks in Modules 4 and 5. Module 4’s evaluation metrics (RAGAS, provenance, manual review) became Module 5’s agent safeguards. And Module 4’s functions became Module 5’s agent tools.

Stop and Think

Think about a research question from your own work. How could you decompose it into a multi-step pipeline using the techniques covered in this course? Which modules’ tools would each step draw on?

Where to Go from Here

The course gave you foundations. Here are concrete next steps, ordered by how quickly you can start:

This week: Apply the RAG pipeline from Module 4 to your own corpus (parliamentary records, news archives, legal texts). Build a classification pipeline from Module 3 for your own annotation task. Try Claude Code on a real research project.

This month: Set up a Codespace for your research group’s codebase. Build a custom MCP server that connects Claude to your data. Run a validation study comparing LLM annotations to human annotations on your specific task.

This year: Integrate LLM-based methods into a published paper. Contribute to the methodological conversation about validity, reproducibility, and best practices for LLMs in social science.

Key Takeaway

Agents can automate significant portions of the research workflow: literature reviews, data collection pipelines, iterative analysis. The key is understanding what to delegate and what requires human judgment. Start with well-defined, bounded tasks (literature search and extraction) before attempting open-ended research assistance. Build safeguards in from the start, not after something goes wrong. And remember the thread that ran through every day of this course: the model is an instrument, and instruments require calibration.

Stay connected. All notebooks and materials remain available at llmsforsocialscience.net. If you are interested in doing research on LLMs for social science, reach out: we are building a research lab focused on these methods.