Explores whether large language models can simulate human behavior and produce reliable causal estimates, offering new tools and challenges for social science and AI research.
Researchers are applying large language models (LLMs) to social science in a growing number of ways: as text annotators, as summarizers of research literatures, as treatments in experiments, and as representations or simulations of human opinions and behavior. On this last use, sometimes referred to as silicon sampling, interest has expanded beyond the simulation of individual and aggregate psychology to the representation and modeling of experimental and causal inferences. In this paper, we outline some core considerations for this line of work, asking whether researchers can recover genuinely causal estimates from LLMs rather than the parallel observational estimates the models may have absorbed from their training data. We describe the type of research design needed to evaluate such differences thoroughly, implement it by replicating one published experiment from political science, and discuss the implications of this work for experimental research, for research using LLMs, and for the potential of generative AI more broadly.
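To make the design concrete, below is a minimal sketch of silicon sampling for an experimental estimand, assuming the OpenAI Python client; the model name, persona attributes, and vignette wording are illustrative placeholders, not the paper's actual materials or procedure.

```python
import random
import re
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

# Hypothetical persona attributes; a real silicon sample would draw these
# from the demographic margins of the original study's sample.
PERSONAS = [
    {"age": 34, "party": "Democrat"},
    {"age": 61, "party": "Republican"},
    {"age": 45, "party": "Independent"},
]

# Placeholder vignettes standing in for the original study's stimuli.
CONTROL_VIGNETTE = "Here is a neutral description of a proposed policy: ..."
TREATMENT_VIGNETTE = "Here is a description framing the proposed policy as costly: ..."

def simulated_response(persona, vignette):
    """Ask the model to answer as the persona; parse a 1-7 support rating."""
    prompt = (
        f"You are a {persona['age']}-year-old {persona['party']}. {vignette} "
        "On a scale from 1 (strongly oppose) to 7 (strongly support), "
        "how much do you support this policy? Reply with a single number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice, not the paper's
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-7]", resp.choices[0].message.content)
    return float(match.group()) if match else None

# Randomly assign each simulated respondent to treatment or control,
# mirroring the randomization in the human experiment being replicated.
treated, control = [], []
for persona in PERSONAS * 30:  # ~90 simulated respondents
    arm, outcomes = (
        (TREATMENT_VIGNETTE, treated) if random.random() < 0.5
        else (CONTROL_VIGNETTE, control)
    )
    y = simulated_response(persona, arm)
    if y is not None:
        outcomes.append(y)

# Difference in means: the silicon-sample analogue of the experimental ATE.
ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(f"Silicon-sample estimate of the average treatment effect: {ate_hat:.2f}")
```

Comparing such a silicon-sample estimate against the published experimental estimate, and against a parallel observational estimate elicited from the same model, is the kind of benchmarking design the paper describes; the sketch shows only the mechanics of generating the simulated estimate.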