Using Large Language Models to Simulate Multiple Humans

Gati Aher,1,3 Rosa I. Arriaga,2 Adam Tauman Kalai3

1 Franklin W. Olin College of Engineering
2 Georgia Institute of Technology
3 Microsoft Research

gaher@olin.edu

Abstract

We propose a method for using a large language model, such as GPT-3, to simulate responses of different humans in a given context. We test our method by attempting to reproduce well-established economic, psycholinguistic, and social experiments. The method requires prompt templates for each experiment. Simulations are run by varying the (hypothetical) subject details, such as name, and analyzing the text generated by the language model. To validate our methodology, we use GPT-3 to simulate the Ultimatum Game, garden path sentences, risk aversion, and the Milgram Shock experiments. To address concerns of exposure to these studies in training data, we also evaluate simulations on novel variants of these studies. We show that it is possible to simulate responses of different people and that their responses are largely consistent with prior human studies from the literature. Using large language models as simulators offers advantages but also poses risks. Our use of a language model for simulation is contrasted with anthropomorphic views of a language model as having its own behavior.

Introduction

Recent Large Language Models (LLMs) such as GPT-3 (Brown et al. 2020) and PaLM (Chowdhery et al. 2022) have been used to generate human-like text. Can these same LLMs thus be used to simulate the behavior of multiple different humans from a given population, in a given context? This capability could be useful for numerous applications in which involving a large and diverse cast of humans is costly, unethical, or impossible. It could be useful for forming hypotheses about human behavior that could later be tested in laboratory experiments. However, simulations also pose risks: they may produce biased or otherwise inaccurate results that may be misinterpreted as factual, and certain simulations may be traumatizing or offensive in nature.

As a first step in addressing human simulation, we design simulators for reproducing classic human studies. We propose an approach for using LLMs to simulate the behavior of different individuals in the context of four well-established studies from psycholinguistics, behavioral economics, and social psychology. We validate our results by comparing the distribution of simulated behaviors to those of prior studies. Our results suggest that LLMs can be used to simulate the behavior of multiple individuals and, in multiple cases, the distribution of behaviors matches what has been described in the literature. Our work is a proof of concept, and future refinements may improve simulation fidelity on the vast literature on human studies.

In this paper we define a simulator as a program that takes inputs describing the subject and other experimental conditions and then uses an LLM to output a record of the simulated experiment. The simulator can be run repeatedly to simulate as many runs of the experiment as statistically necessary. More generally, if each instance of the study requires the interaction of n ≥ 1 subjects, then each simulated run would require an input describing the n subjects, as we will illustrate.
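To make this concrete, the following Python sketch shows the general shape of such a simulator for a two-subject study in the style of the Ultimatum Game. The prompt template, subject names, and the `complete` callable (a stand-in for any LLM completion API) are illustrative assumptions for exposition, not the exact prompts or code used in our experiments.

```python
import random
from typing import Callable

# Illustrative prompt template (not the paper's actual prompt): the
# simulator fills in subject details and asks the LLM to continue the text.
TEMPLATE = (
    "{responder} is deciding whether to accept a split of $10.\n"
    "{proposer} offers to keep ${keep} and give {responder} ${give}.\n"
    "{responder} decides to"
)

def simulate(proposer: str, responder: str, keep: int,
             complete: Callable[[str], str]) -> str:
    """Run one simulated round; return the record (prompt + completion)."""
    prompt = TEMPLATE.format(proposer=proposer, responder=responder,
                             keep=keep, give=10 - keep)
    return prompt + complete(prompt)

# Stand-in for a real LLM completion call (e.g., a request to OpenAI's API).
def mock_llm(prompt: str) -> str:
    return random.choice([" accept the offer.", " reject the offer."])

# Repeat the run while varying the (hypothetical) subject name.
for name in ["Alice", "Bob", "Carlos"]:
    print(simulate("Taylor", name, keep=7, complete=mock_llm))
```

Running the simulator many times with different subject details yields a distribution of simulated behaviors that can be compared against the human distributions reported in the literature.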
Given a string of text, called a prompt, an LLM predicts the probability of the subsequent words or tokens following that prompt, and can be used to generate a randomized text completion, one word or token at a time.

Designing simulators requires choosing prompts. Figure 1 illustrates the difference between a typical prompt used for classification and ours for simulation. The record output at the end may consist of the prompt concatenated with the completion, or it may be stitched together from multiple calls to the LLM with different prompts. We suggest practices that we found useful for designing simulators.

We evaluate simulators for four studies across five available LLMs, using OpenAI's API to access GPT-3 models of five different sizes. First, we test how often the text generated by these models conforms to our validation criteria. Second, we test whether the method generates consistently different outcome distributions for different names; a model that is insensitive to different names, or one that varies randomly, may be no better than simulating a single individual. Finally, we test whether the output distribution is consistent with prior human studies. We find that many of the simulators generate distributions highly consistent with human experiments, with the more powerful models producing simulations that are more consistent with prior studies. This suggests that the problem of simulating multiple humans using LLMs may become more feasible with time, though model size alone cannot correct for biases present in training data. For example, Wikipedia itself recognizes that white males are vastly over-represented among its contributors (Wikipedia 2022).

We chose to simulate the ultimatum game, garden path sentences, risk aversion, and the Milgram Shock experiment. For the classic ultimatum game, simulated responders reject