Generative Agent Simulations of 1,000 People

1. Objective

2. Methodology

Interview Collection

AI Interviewer Architecture

3. Generative Agent Architecture

Memory System

Expert Reflection

Each reflection captures different behavioral, ideological, or demographic aspects of the participant.

Response Generation

4. Evaluation

Tasks Evaluated

Metric

Results

| Task       | Metric   | Interview-based Agents | Demographic-based | Persona-based |
|------------|----------|------------------------|-------------------|---------------|
| GSS        | Accuracy | 0.85                   | 0.71              | 0.70          |
| BFI        | Corr.    | 0.80                   | 0.55              | 0.75          |
| Econ Games | Corr.    | 0.66                   | 0.31              | lower         |

5. Ablation and Comparative Studies

Interview Ablation

Interview Summary Test

Composite Agent Test

6. Bias and Fairness

Findings

7. Agent Bank Access

Two Access Tiers

8. Contributions & Significance

9. Limitations & Future Work

# Thoughts:

The entire interview transcript is injected into the model prompt (average length: ~6k words). This will result in a significant increase in API usage costs. It would be better if they ran the transcript through a text summariser or note generator. It may not work as well as the raw interview, but it would be cheaper to do.
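A back-of-envelope estimate of the cost difference between injecting the raw transcript and a summary. The tokens-per-word ratio, per-token price, and 10x compression factor are all my illustrative assumptions, not figures from the paper:

```python
# Rough cost comparison: full ~6k-word transcript vs. a compressed summary
# in the prompt. All constants below are illustrative assumptions.
TOKENS_PER_WORD = 1.3        # typical English tokenization ratio (assumption)
PRICE_PER_1K_TOKENS = 0.01   # illustrative input price in USD (assumption)

def prompt_cost(words: int) -> float:
    """Approximate input cost of injecting `words` words into the prompt."""
    tokens = words * TOKENS_PER_WORD
    return tokens / 1000 * PRICE_PER_1K_TOKENS

full_cost = prompt_cost(6_000)   # raw interview transcript
summary_cost = prompt_cost(600)  # assumed 10x-compressed summary

print(f"full: ${full_cost:.4f}/query, summary: ${summary_cost:.4f}/query")
print(f"savings: {1 - summary_cost / full_cost:.0%}")
```

Since cost scales linearly with prompt tokens, a 10x-shorter summary cuts per-query input cost by roughly 90% under these assumptions.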

For experiments requiring multiple decision-making steps, agents were given memory of the previous stimuli in the form of short text descriptors.
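A minimal sketch of that memory mechanism, assuming memory is just an ordered list of descriptors replayed into each new prompt. The class name, method names, and prompt wording are mine, not the paper's:

```python
# Sketch of multi-step memory: short text descriptors of past stimuli
# are accumulated and prepended to each new prompt.
# All names and wording here are illustrative, not from the paper.
class AgentMemory:
    def __init__(self) -> None:
        self.descriptors: list[str] = []

    def record(self, descriptor: str) -> None:
        """Store a short text summary of a previous stimulus/response."""
        self.descriptors.append(descriptor)

    def build_prompt(self, new_stimulus: str) -> str:
        history = "\n".join(f"- {d}" for d in self.descriptors)
        return (
            f"Previously observed:\n{history}\n\n"
            f"Current stimulus: {new_stimulus}\n"
            "Respond as the participant."
        )

memory = AgentMemory()
memory.record("Round 1: offered a $10 split in the dictator game, kept $6.")
prompt = memory.build_prompt("Round 2: you are now the receiver.")
```

Keeping the history as compact descriptors rather than full past prompts keeps the context window small as the number of steps grows.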

The main difference from the earlier implementation of the simulacra is that this time they are trying to simulate real people.
The strength of the simulation is based on a few factors:

## Predicting an individual's attitudes and behaviours:
There were two approaches that were carried out:
- Demographic attributes:
    - Attributes such as age, gender, political outlook, etc. are captured based on the GSS (General Social Survey) questions.
- Paragraph summarisation of the person (persona-based):
    - Done by having the participant write a paragraph about themselves after the interview. This is similar to how they did it in their previous paper: Generative Agents: Interactive Simulacra of Human Behavior.
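The conditions being compared differ only in which description of the participant is injected into the prompt; the question stays the same. A schematic comparison, where the prompt template, placeholder strings, and the sample question are my own illustration:

```python
# Schematic of the agent conditions: same question, different
# participant description. Wording is illustrative, not the paper's.
def make_prompt(description: str, question: str) -> str:
    return (
        "You are role-playing the person described below.\n"
        f"{description}\n\n"
        f"Question: {question}"
    )

question = "Do you favour or oppose the death penalty for murder?"  # GSS-style item

# Interview-based: the full transcript (stand-in string here).
interview_prompt = make_prompt("<full ~6k-word interview transcript>", question)
# Demographic-based: GSS-style attributes only.
demographic_prompt = make_prompt("Age: 34; Gender: female; Politics: moderate", question)
# Persona-based: the participant's self-written paragraph.
persona_prompt = make_prompt("<paragraph the participant wrote about themselves>", question)
```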

## How the evaluation is done:

My main concern with this is, how secure is the evaluation? I feel like there is a high likelihood of data leakage in the form of the thematic similarities between the questions that are asked. The authors did address that the interview and the "testing" questionnaires are completely different, but they do overlap thematically.

Ok, so one of the methods they used was the point I brought up at the beginning of this section: running the interview through a text summarisation model. They did, and apparently it performed rather well.

Randomly removing 80% of the interview only reduced the GSS score from an average of 0.85 to 0.79, and text summarisation degraded it even less, reducing the score to just 0.83.
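The interview ablation can be pictured as randomly keeping 20% of the interview before rebuilding the agent prompt. A sketch; the unit of removal (question-answer turns) and the function name are my assumptions:

```python
import random

def ablate_interview(turns: list[str], keep_fraction: float = 0.2,
                     seed: int = 0) -> list[str]:
    """Randomly keep `keep_fraction` of interview turns, preserving order.

    Treating question-answer turns as the unit of removal is an
    assumption; the paper may ablate at a different granularity.
    """
    rng = random.Random(seed)
    k = max(1, round(len(turns) * keep_fraction))
    kept_idx = sorted(rng.sample(range(len(turns)), k))
    return [turns[i] for i in kept_idx]

turns = [f"Q{i}: ... A{i}: ..." for i in range(100)]
kept = ablate_interview(turns)
print(len(kept))  # 20 of the original 100 turns remain
```

That a random 20% of the interview still yields 0.79 suggests the signal is spread fairly redundantly across the transcript.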
