Researchers at Stanford University paid 1,052 people $60 to read the first two lines of The Great Gatsby aloud into an app. After that, an AI that looked like a 2D sprite from an SNES-era Final Fantasy game asked the participants to tell the story of their lives. The researchers then took those interviews and used them to build AI agents that, they claim, replicate the participants’ behavior with 85% accuracy.
The study, titled “Generative Agent Simulations of 1,000 People,” is a joint project between Stanford and scientists working at Google’s DeepMind AI research lab. The pitch is that creating AI agents based on random people could help policymakers and businesses better understand the public. Why run focus groups or poll the public when you can talk to them once, spin up an LLM based on that conversation, and then have their thoughts and opinions forever? Or, at least, as close an approximation of those thoughts and opinions as an LLM can reproduce.
“This work lays the groundwork for new tools that can help investigate individual and collective behavior,” the article’s abstract reads.
“How might, for example, a diverse set of individuals respond to new public health policies and messages, react to product launches, or respond to major shocks?” the paper continues. “When simulated individuals are combined into collectives, these simulations could help pilot interventions, develop complex theories capturing nuanced causal and contextual interactions, and expand our understanding of structures like institutions and networks across domains such as economics, sociology, organizations, and political science.”
All of that possibility rests on a two-hour interview fed into an LLM that, in testing, answered questions largely the way its human counterpart did.
Much of the process was automated. The researchers contracted Bovitz, a market research firm, to recruit participants. The goal was to gather a sample of the US population that was as broad as possible within the constraint of 1,000 people. To complete the study, participants signed up for an account in a purpose-built interface, created a 2D sprite avatar, and began talking with the AI interviewer.
The interview questions and style are a modified version of those used by the American Voices Project, a joint project of Stanford and Princeton Universities that interviews people across the country.
Each interview began with the participant reading the first two lines of The Great Gatsby (“In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since. ‘Whenever you feel like criticizing any one,’ he told me, ‘just remember that all the people in this world haven’t had the advantages that you’ve had.’”) to calibrate the audio recording.
According to the paper, “The interview interface displayed a 2-D sprite avatar representing the interviewer agent at the center, with the participant’s avatar shown at the bottom, walking toward a goal post to signify progress. When the AI interviewer was speaking, it was signaled by a pulsing animation of the center circle with the interviewer avatar.”
The two-hour interviews produced transcripts averaging 6,491 words. The interviewer asked questions about race, gender, politics, income, social media use, job stress, and family composition. The researchers published the interview script and the questions the AI asked.
Those transcripts, typically under 10,000 words each, were then fed into another LLM that the researchers used to build generative agents meant to replicate the interviewees. The researchers then put both the participants and the AI clones through more questions and a set of economic games to see how they compared. “When an agent is queried, the entire interview transcript is injected into the model prompt, instructing the model to imitate the person based on their interview data,” the paper says.
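Mechanically, that setup is simple enough to sketch. Here is a minimal Python illustration of the prompt-stuffing approach the paper describes; the OpenAI client, model name, and prompt wording are my assumptions for the example, not the authors’ actual code.

```python
# Minimal sketch of an interview-conditioned agent, as the paper
# describes the idea. The client, model name, and prompt wording are
# illustrative assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()

def ask_agent(interview_transcript: str, question: str) -> str:
    """Answer a question the way the interviewed participant would.

    The entire interview transcript is placed in the prompt, and the
    model is instructed to imitate the person it describes.
    """
    system_prompt = (
        "You are role-playing a real person based on the interview "
        "transcript below. Answer every question the way this person "
        "would, staying consistent with their stated views, history, "
        "and manner of speaking.\n\n"
        f"--- INTERVIEW TRANSCRIPT ---\n{interview_transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper used its own model setup
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

One consequence of this design worth noticing: the agent has no memory or fine-tuning of its own; everything it “knows” about the person rides along in the prompt on every single query.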
This part of the process was as controlled as possible. The researchers used the General Social Survey (GSS) and the Big Five Personality Inventory (BFI) to test how well the LLM agents matched the people who inspired them, then ran both participants and agents through five economic games.
The results were mixed. The AI agents answered GSS questions the same way as their human counterparts about 85% of the time. On the BFI, agreement reached 80%. The numbers plummeted, however, once the agents started playing the economic games. The researchers offered the real participants cash incentives to play games such as the Prisoner’s Dilemma and the Dictator Game.
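Those percentages are normalized scores: as the paper describes it, an agent’s raw agreement with its participant is divided by how consistently the participant reproduced their own answers when retaking the same survey two weeks later. A minimal sketch of that arithmetic (the function names and toy data here are mine, not the paper’s):

```python
# Sketch of a normalized-accuracy calculation, assuming (as the paper
# describes) raw agent accuracy divided by the participant's own
# test-retest consistency. Names and data are illustrative only.

def agreement(answers_a: list[str], answers_b: list[str]) -> float:
    """Fraction of questions answered identically."""
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)

def normalized_accuracy(agent: list[str], human_t1: list[str],
                        human_t2: list[str]) -> float:
    """Agent-vs-human agreement, scaled by the human's own
    consistency across two sittings of the same survey."""
    raw = agreement(agent, human_t1)
    retest = agreement(human_t1, human_t2)
    return raw / retest

# Toy example: the agent matches 3 of 4 first-sitting answers (0.75),
# and the human repeats only 3 of 4 of their own answers (0.75), so
# the normalized accuracy is 1.0 -- the agent is as consistent with
# the human as the human is with themselves.
agent    = ["agree", "disagree", "agree", "neutral"]
human_t1 = ["agree", "disagree", "agree", "agree"]
human_t2 = ["agree", "disagree", "neutral", "agree"]
print(normalized_accuracy(agent, human_t1, human_t2))  # 1.0
```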
In the Prisoner’s Dilemma, participants can cooperate for mutual benefit or betray a partner for a chance at a bigger payoff. In the Dictator Game, one participant unilaterally decides how to split resources with other participants. The real subjects earned money on top of their initial $60 by playing these games.
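For readers unfamiliar with the games, here is a toy sketch of their payoff structures; the dollar values are illustrative, not the stakes used in the study.

```python
# Toy payoff structures for the two games mentioned. Values are
# illustrative only, not the stakes used in the study.

# Prisoner's Dilemma: (my_payoff, partner_payoff) for each pair of moves.
PD_PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation pays both
    ("cooperate", "defect"):    (0, 5),  # the defector exploits the cooperator
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # mutual defection pays least overall
}

def dictator_game(endowment: float, share_kept: float) -> tuple[float, float]:
    """One player unilaterally splits an endowment with a passive partner."""
    kept = endowment * share_kept
    return kept, endowment - kept

print(PD_PAYOFFS[("defect", "cooperate")])  # (5, 0)
print(dictator_game(10.0, 0.7))             # (7.0, 3.0)
```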
Faced with these economic games, the AI clones also failed to reproduce their real-life counterparts’ behavior. “On average, the generative agents achieved a normalized correlation of 0.66,” or roughly 66%.
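Because game decisions are continuous quantities (amounts kept, sent, or shared) rather than categorical survey answers, correlation stands in for raw agreement here. A sketch of the analogous normalized measure, under the same assumptions as above:

```python
# Sketch of a normalized correlation over continuous game decisions,
# again assuming normalization by the participants' own test-retest
# correlation. Illustrative only.
from statistics import correlation  # Python 3.10+

def normalized_correlation(agent_vals: list[float],
                           human_t1: list[float],
                           human_t2: list[float]) -> float:
    """Agent-vs-human correlation, scaled by the human's own
    consistency across two sittings of the same games."""
    return correlation(agent_vals, human_t1) / correlation(human_t1, human_t2)

# Toy data: dollar amounts each participant chose to keep per game.
print(normalized_correlation([7.0, 3.0, 5.0],
                             [8.0, 2.0, 5.0],
                             [7.5, 3.5, 4.0]))
```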
The paper is worth reading in full if you’re interested in how researchers think about AI agents and the public. It didn’t take much for the researchers to whittle a human personality down into an LLM that behaves similarly. Given more time and energy, they could probably bring the two even closer.
That worries me. Not because I don’t want to see the ineffable human spirit reduced to a spreadsheet, but because I know this kind of technology will be used for ill. We’ve already seen dumber LLMs, trained on public recordings, trick grandmothers into handing over bank information to an AI relative after a brief phone call. What happens when those machines have a script? What happens when they have access to purpose-built personalities based on social media activity and other publicly available information?
What happens when a corporation or a politician decides that the public wants and needs something based on an approximation of their will rather than their actual will?