Module 4 of 5

Social Science Applications

RAG pipelines for large corpora and using LLMs as simulated agents to study human behavior: applications at the frontier of computational social science.

Information Extraction, Summarization & RAG

Many social science corpora are too large to fit in a single LLM context window, or require answers grounded in specific source documents. Retrieval-Augmented Generation (RAG) addresses this by combining a retrieval system with an LLM: the retriever finds relevant passages in a large corpus, and the model reasons over them.

Definition

Retrieval-Augmented Generation (RAG)

A framework that augments an LLM's generation with retrieved context from an external knowledge base. The pipeline typically involves: document chunking → embedding and indexing → query-time retrieval → context-augmented generation.

We cover the full RAG pipeline: chunking strategies, embedding models for retrieval, vector stores, and evaluating RAG output quality. The emphasis is on building pipelines that are reliable enough for research use, where faithfulness to source documents is paramount.
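The pipeline stages above can be sketched end to end in a few functions. This is a minimal illustration, not a production implementation: the bag-of-words `embed` is a stand-in for a real sentence-embedding model, the list `index` is a stand-in for a vector store, and all names are invented for this sketch.

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Split text into overlapping fixed-size word windows (one simple chunking strategy)."""
    words = text.split()
    step = size - overlap  # assumes size > overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: lowercase bag-of-words counts.
    A real pipeline would call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k=2):
    """Query-time retrieval: rank indexed chunks by similarity to the query."""
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """Context-augmented generation step: assemble the prompt sent to the LLM."""
    context = "\n---\n".join(chunks)
    return f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

In use, you would chunk and index the corpus once (`index = [c for doc in corpus for c in chunk(doc)]`), then at query time call `retrieve` and pass `build_prompt`'s output to the LLM. The "ONLY the context" instruction is one common way to encourage faithfulness to the retrieved sources.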

Key Takeaway

RAG lets you scale LLM-based analysis to corpora of any size while maintaining source grounding. But retrieval quality is the bottleneck: if the retriever misses relevant passages or returns irrelevant ones, the generator will produce plausible-sounding but unfaithful output.
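Because retrieval is the bottleneck, it is worth evaluating the retriever directly against a small labeled set of (query, relevant passage) pairs before trusting the generator's output. One standard metric is recall@k: the fraction of queries for which at least one relevant passage appears in the top k results. A minimal sketch, with invented variable names:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Compute recall@k for a retriever.

    retrieved: dict mapping query -> ranked list of passage IDs returned.
    relevant:  dict mapping query -> set of gold (labeled-relevant) passage IDs.
    Returns the fraction of queries whose top-k results hit at least one gold ID.
    """
    hits = sum(
        1 for q, gold in relevant.items()
        if set(retrieved.get(q, [])[:k]) & gold
    )
    return hits / len(relevant)
```

For example, if query q1's gold passage appears at rank 2 and q2's gold passage is missed entirely, recall@2 is 0.5. A low score here tells you to fix chunking or the embedding model, not the generation prompt.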

LLMs as Simulated Agents

A provocative line of research uses LLMs as stand-ins for human respondents: what Argyle et al. (2023) call "silicon sampling." Can we prompt an LLM to simulate how a 65-year-old conservative voter in rural Ohio would respond to a survey? And if so, when should we trust the results?

This module examines the Homo silicus paradigm: its theoretical motivation, empirical validation studies, failure modes, and ethical considerations. We explore both the promise (cheap, fast, controllable simulated populations) and the risks (systematic biases, lack of genuine lived experience).
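In practice, silicon sampling starts by conditioning the model on a persona before posing the survey item. A minimal sketch of that prompt construction, assuming free-form key-value persona attributes; the template wording is illustrative, not a validated instrument:

```python
def persona_prompt(persona: dict, question: str) -> str:
    """Build a persona-conditioned survey prompt for an LLM.

    persona: demographic/attitudinal attributes, e.g. {"age": 65, "ideology": ...}.
    The template below is a hypothetical example of persona conditioning,
    not the exact wording used in published silicon-sampling studies.
    """
    backstory = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        f"You are answering a survey as a person with this background: {backstory}.\n"
        f"Answer the question as this person would, in one short phrase.\n"
        f"Question: {question}\n"
        f"Answer:"
    )

# One prompt per simulated respondent; validation then compares the
# distribution of model responses against a real benchmark survey.
prompt = persona_prompt(
    {"age": 65, "ideology": "conservative", "residence": "rural Ohio"},
    "Do you support increasing tariffs on imported goods?",
)
```

The key methodological step is not the prompt itself but the validation loop: sampling many personas drawn from real survey marginals and checking whether the simulated response distribution tracks the observed one.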

Stop and Think

Under what conditions might simulating survey responses with an LLM be a valid complement to traditional surveys? Under what conditions would it be misleading or even harmful?

Reveal

Simulation may be valid for exploring hypotheses, pilot-testing survey instruments, or studying well-documented populations where the model's training data includes representative perspectives. It becomes misleading when used as a substitute for data from underrepresented populations (the model has less to draw on), when the simulated responses are presented as genuine data, or when the research question depends on authentic lived experience rather than stereotypical patterns.