GPT-4 vs GPT-3.5 Performance in Game Simulations

Written by textmodels | Published 2025/09/24
Tech Story Tags: ai-reasoning-limits | world-simulation-ai | text-based-simulators | ai-state-prediction | virtual-world-modeling | reinforcement-learning | ai-evaluation-methods | gpt-4-simulation-accuracy

TL;DR: This article explores how GPT-4 and GPT-3.5 perform in simulated game environments under three conditions: human-written rules, LLM-generated rules, and no rules. The results show that GPT-4 significantly outperforms GPT-3.5, especially when rules are absent, underscoring its superior ability to apply common-sense knowledge and accurately predict game states.

Table of Links

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

D Prompts

The prompts introduced in this section include game rules, which can be either human-written or LLM-generated. For experiments without game rules, we simply remove the rules from the corresponding prompts.
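The with-rules and no-rules conditions can be produced from a single template by making the rules block optional. The sketch below is illustrative only; the function name and template slots are assumptions, not the paper's actual prompt-construction code.

```python
from typing import Optional

def build_prompt(task_description: str,
                 current_state: str,
                 action: str,
                 rules: Optional[str] = None) -> str:
    """Assemble a simulator prompt. When `rules` is None (the no-rules
    condition), the rules block is omitted entirely rather than left blank."""
    parts = [task_description]
    if rules is not None:  # human-written or LLM-generated rules
        parts.append("Game rules:\n" + rules)
    parts.append("Current state:\n" + current_state)
    parts.append("Action: " + action)
    return "\n\n".join(parts)
```

The same helper then serves all three experimental conditions by swapping which `rules` string, if any, is passed in.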

D.1 Prompt Example: F_act

D.1.1 Full State Prediction

D.1.2 State Difference Prediction

D.2 Prompt Example: F_env

D.2.1 Full State Prediction

D.2.2 State Difference Prediction

D.3 Prompt Example: F_R (Game Progress)

D.4 Prompt Example: F

D.4.1 Full State Prediction

D.4.2 State Difference Prediction

D.5 Other Examples

Below is an example of the rule of an action:

Below is an example of the rule of an object:

Below is an example of the score rule:

Below is an example of a game state:

Below is an example of a JSON that describes the difference of two game states:
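A state-difference JSON of this kind can be derived mechanically by comparing two state dictionaries object by object. The sketch below is a minimal illustration under assumed structure (states as JSON objects keyed by object ID); the exact diff format used in the paper may differ.

```python
import json

def state_diff(before: dict, after: dict) -> dict:
    """Return a JSON-serializable diff of two game states, keeping only
    the objects whose properties changed between the two states."""
    diff = {}
    for obj_id in set(before) | set(after):
        old = before.get(obj_id)
        new = after.get(obj_id)
        if old != new:  # added, removed, or modified object
            diff[obj_id] = {"before": old, "after": new}
    return diff

before = {"stove": {"on": False}, "pot": {"contains": "water"}}
after  = {"stove": {"on": True},  "pot": {"contains": "water"}}
print(json.dumps(state_diff(before, after)))
# -> {"stove": {"before": {"on": false}, "after": {"on": true}}}
```

Predicting only this diff, rather than the full next state, is what distinguishes the state difference prediction task from full state prediction.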

E GPT-3.5 results

Table 5 and Table 6 show the performance of a GPT-3.5 simulator predicting object properties and game progress, respectively. There is a large gap between GPT-4 and GPT-3.5 performance, providing yet another example of how quickly LLMs have developed over the past two years. It is also worth noting that the performance difference is larger when no rules are provided, indicating that GPT-3.5 is especially weak at applying common-sense knowledge to this few-shot world simulation task.

F Histograms

1. In Figure 3, we show detailed experimental results on the full state prediction task performed by GPT-4.

2. In Figure 4, we show detailed experimental results on the state difference prediction task performed by GPT-4.

3. In Figure 5, we show detailed experimental results on the full state prediction task performed by GPT-3.5.

4. In Figure 6, we show detailed experimental results on the state difference prediction task performed by GPT-3.5.

Figure 3: GPT-4 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 4: GPT-4 - Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 5: GPT-3.5 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 6: GPT-3.5 - Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Authors:

(1) Ruoyao Wang, University of Arizona ([email protected]);

(2) Graham Todd, New York University ([email protected]);

(3) Ziang Xiao, Johns Hopkins University ([email protected]);

(4) Xingdi Yuan, Microsoft Research Montréal ([email protected]);

(5) Marc-Alexandre Côté, Microsoft Research Montréal ([email protected]);

(6) Peter Clark, Allen Institute for AI ([email protected]);

(7) Peter Jansen, University of Arizona and Allen Institute for AI ([email protected]).


This paper is available on arxiv under CC BY 4.0 license.


Published by HackerNoon on 2025/09/24