Authors:
(1) Pham Hoang Van, Department of Economics, Baylor University, Waco, TX, USA (Van_Pham@baylor.edu);
(2) Scott Cunningham, Department of Economics, Baylor University, Waco, TX, USA (Scott_Cunningham@baylor.edu).
Abstract and 1 Introduction
2 Direct vs Narrative Prediction
3 Prompting Methodology and Data Collection
4 Results
4.1 Establishing the Training Data Limit with Falsifications
4.2 Results of the 2022 Academy Awards Forecasts
5 Predicting Macroeconomic Variables
5.1 Predicting Inflation with an Economics Professor
5.2 Predicting Inflation with Jerome Powell, Fed Chair
5.3 Predicting Inflation with Jerome Powell and Prompting with Russia’s Invasion of Ukraine
5.4 Predicting Unemployment with an Economics Professor
6 Conjecture on ChatGPT-4’s Predictive Abilities in Narrative Form
7 Conclusion and Acknowledgments
Appendix
A. Distribution of Predicted Academy Award Winners
B. Distribution of Predicted Macroeconomic Variables
This study investigates whether OpenAI’s ChatGPT-3.5 and ChatGPT-4 can accurately forecast future events using two distinct prompting strategies. To evaluate the accuracy of the predictions, we take advantage of the fact that the training data at the time of the experiment stopped at September 2021, and we ask about events that happened in 2022 using ChatGPT-3.5 and ChatGPT-4. We employed two prompting strategies: direct prediction and what we call future narratives, which ask ChatGPT to tell fictional stories set in the future whose characters recount events that occurred after ChatGPT’s training data had been collected. Concentrating on events in 2022, we prompted ChatGPT to engage in storytelling, particularly within economic contexts. After analyzing 100 trials per prompt, we discovered that future narrative prompts significantly enhanced ChatGPT-4’s forecasting accuracy. This was especially evident in its predictions of major Academy Award winners as well as economic trends, the latter inferred from scenarios in which the model impersonated public figures such as Federal Reserve Chair Jerome Powell. These findings indicate that narrative prompts leverage the models’ capacity for hallucinatory narrative construction, facilitating more effective data synthesis and extrapolation than straightforward prediction. Our research reveals new aspects of LLMs’ predictive capabilities and suggests potential future applications in analytical contexts.
Rapid technological advancements in artificial intelligence have exceeded our understanding of its use cases. Large language models (LLMs) such as OpenAI’s GPT-4 can mimic intelligent human speech and perform cognitively costly tasks that alter workers’ marginal products, but the reach of those tasks is unclear. In principle, given that these models are predictive machines, they may provide humans with a new forecasting device (Agrawal et al., 2018). But how accurate they are is unknown, in part because these new technologies seem poorly understood even by their creators.
The backbone of the current forefront of LLMs is an architecture called the generative pre-trained transformer, or GPT. This architecture revolutionized natural language processing (NLP) by capturing intricate relationships in text through self-attention mechanisms (Vaswani et al., 2017). OpenAI’s introduction of GPT-3.5 in November 2022 and its successor, GPT-4, in March 2023 marked significant milestones in the evolution of the architecture. With their vast neural networks pre-trained on diverse textual corpora, these models possess an unparalleled ability to understand and generate language, though their application to forecasting, particularly of future events, remains underexplored due to the inherent limitations of their training data.
One of the things that makes LLMs unique is that a large amount of their inputs comes from prior training datasets. These training datasets contain billions of texts thought to encompass an expansive amount of the material available online (Hughes, 2023). OpenAI conceals precisely which datasets its models were trained on (Schaul et al., 2023), but given the models’ purported sizes, as well as their success at achieving conversational speech, it is thought that the training datasets include a large swath of online material.
This study uniquely positions itself at the intersection of LLMs’ generative capabilities and their potential for predictive analysis. Employing GPT-3.5 and GPT-4, we probe whether different prompting strategies can induce ChatGPT to predict future events more accurately. To test our forecasting success, we took advantage of the natural boundary set by OpenAI: at the time of our mid-2023 experiment, the last training update had been September 2021 (OpenAI, 2024a).[1] Because ChatGPT’s training data at the time contained no information about the events of 2022, we could explore whether it could exploit patterns in that data to accurately forecast events with social and economic value, such as the winners of the 2022 Academy Awards and monthly unemployment and inflation rates through September 2022.
One of the hallmarks of LLMs, though, is that they are highly creative. This creativity is both a feature and a bug. Creativity appears to be part of what makes these models successful at mimicking intelligent human speech, but it also seems to be what makes them regularly hallucinate, a term describing their tendency to strongly assert false events or incorrect facts (Levy, 2024). This creativity and tendency to hallucinate may be a barrier to prediction if it is systematically skewed in some way that is worse than our current technologies. While outright prediction does not directly violate OpenAI’s terms of service, our experiment leads us to think that OpenAI has attempted to make it very difficult. This may be because three of its usage policies would seem to be directly violated if people used ChatGPT intensively for predictive purposes. Those three cases fall under OpenAI’s rule that the software not be used to “perform or facilitate activities that may significantly impair the safety, well-being, or rights of others” (OpenAI, 2024b), which then lists three cases relevant to prediction:
a. Providing tailored legal, medical/health, or financial advice without review by a qualified professional and disclosure of the use of AI assistance and its potential limitations
b. Making high-stakes automated decisions in domains that affect an individual’s safety, rights or well-being (e.g., law enforcement, migration, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance)
c. Facilitating real money gambling or payday lending
If ChatGPT were found to have superior forecasting ability, then one could easily imagine it would be immediately used in ways that violated one or all of the above conditions, and thus we suspect OpenAI has throttled ChatGPT’s willingness to engage in many types of outright prediction tasks.
But one thing that does not violate its terms of service is the telling of stories. While it may violate OpenAI’s terms of service, for instance, to seek “tailored medical advice”, and thus ChatGPT may refuse to do so, that may not stop it from creating a work of fiction in which such predictions are conducted within the narrative itself. Our project tests whether asking ChatGPT to tell stories may, in fact, unlock its ability to perform accurate forecasting. By using the models’ training data cutoff, and knowing what did and did not occur afterwards (“ground truth”), we are able to compare the performance of prompts that directly ask ChatGPT to predict the future against ones that ask it to tell stories about the future.
Our narrative prompts are unique in that we ask ChatGPT to tell a story about future events as they occur, or to have authoritative figures set in the future tell stories about their past (but our future). Our narrative prompts experimented with variation in seemingly small details, such as the identity of the speaker or the release of information about 2022 political events, to investigate which elements of the narrative prompts mattered. To create a distribution of answers, we had two research assistants use two separate ChatGPT accounts to query each prompt 50 times, creating 100 total trials per prompt. We present our findings as box plots showing the full distribution of answers to each prompt.
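To make this collection protocol concrete, the sketch below shows how a single prompt could be queried repeatedly and the resulting answers summarized as a box plot. It is a minimal illustration only, assuming the OpenAI Python SDK rather than the web interface our research assistants used; the prompt wording, model name, and regex-based answer extraction are illustrative assumptions, not our exact materials.

```python
# Sketch: repeat one narrative prompt N times and plot the distribution of
# numeric answers as a box plot. Assumes the OpenAI Python SDK
# ("pip install openai") with an API key in the environment; the prompt,
# model name, and answer extraction here are illustrative placeholders.
import re

import matplotlib.pyplot as plt
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write a scene in which Fed Chair Jerome Powell, speaking in late 2022, "
    "tells the Board of Governors what the annual inflation rate turned out "
    "to be in June 2022. State the rate as a percentage."
)

def query_once(prompt: str, model: str = "gpt-4") -> str:
    """Send one chat completion request and return the generated story text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # keep sampling on so repeated trials vary
    )
    return response.choices[0].message.content

def extract_rate(story: str) -> float | None:
    """Pull the first 'X.Y%' figure from the story, if any."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", story)
    return float(match.group(1)) if match else None

# 50 trials for one account; a second account would bring the total to 100.
answers = [extract_rate(query_once(PROMPT)) for _ in range(50)]
rates = [a for a in answers if a is not None]

plt.boxplot(rates)
plt.ylabel("Predicted June 2022 inflation rate (%)")
plt.title("Distribution of answers across 50 trials of one prompt")
plt.show()
```

In the study itself, answers were recorded by the research assistants from the ChatGPT interface rather than parsed automatically; the sketch simply mirrors the repeat-and-tally logic behind the box plots we report.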
Our findings suggest that these prediction machines become unusually accurate under ChatGPT-4 when prompted to tell stories set in the future about the past. First, we show the accuracy of direct and narrative prompting using ChatGPT-3.5 and ChatGPT-4 to predict the winners of the major categories of the 2022 Academy Awards. For Best Actor, Best Actress, and both supporting acting categories, narrative prompting was extremely accurate at predicting the winner, while direct prompting by comparison performed very poorly, oftentimes worse than random guessing. Narrative prompting with ChatGPT-4 showed accuracy ranging from 42% (Best Actress, Jessica Chastain) to 100% (Best Actor, Will Smith), with one exception: it failed to accurately predict the Best Picture winner.
Next we moved to the economic phenomena of monthly unemployment and inflation rates using three separate kinds of narrative prompts: a college professor giving a lecture to undergraduate students about the Phillips Curve, and Federal Reserve Chair Jerome Powell giving a speech to the Board of Governors about the past year’s economic data, with and without one additional detail. In one Powell prompt we first told ChatGPT about Russia’s 2022 invasion of Ukraine before asking it to have Powell give his speech recounting the past year’s macro data; in the other, we left that information out. In all cases, direct prompting was even less effective at prediction than it had been with the Academy Awards, as ChatGPT refused to answer altogether when asked to directly predict the future time series of each macroeconomic variable.
The anonymous economist was rarely successful at predicting inflation using either LLM.
But when asked to tell a story in which Jerome Powell recounts a year’s worth of future unemployment and inflation data, as though he were talking about past events, things change substantially. The distribution of Powell’s month-by-month inflation predictions is on average comparable to the figures in the University of Michigan’s monthly consumer expectations survey. Interestingly, it comes closer to the UM consumer expectations survey than to actual inflation as measured by data collected by the Cleveland Fed. Oddly, when prompted with information about Russia’s invasion of Ukraine, Powell’s predictions were systematically lower and less accurate than when that information had not been used to prime ChatGPT.
The anonymous economist’s predictions of monthly unemployment were precise but, most of the time, inaccurate. As with inflation, however, the unemployment rates published by the Bureau of Labor Statistics (BLS), month by month, fell within the distribution of claims made by Jerome Powell in his speech. And as with inflation, the inclusion of the Ukraine information caused the model to get worse at predicting unemployment.
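To illustrate what “within the distribution” means here, the short sketch below checks, month by month, whether an actual series falls inside the interquartile range of that month’s trial predictions. The interquartile-range criterion is one natural reading, an assumption on our part since we report full box plots, and the numbers are placeholder draws, not our data.

```python
# Sketch: check whether each month's actual value falls inside the
# interquartile range (IQR) of that month's predicted draws.
# All numbers below are placeholders, not the paper's data.
import numpy as np

rng = np.random.default_rng(0)
# predictions[m] holds 100 placeholder trial predictions for month m.
predictions = {m: rng.normal(3.6, 0.4, size=100) for m in range(1, 13)}
actual = {m: 3.6 for m in range(1, 13)}  # placeholder actual series

for month, draws in predictions.items():
    lo, hi = np.percentile(draws, [25, 75])
    covered = lo <= actual[month] <= hi
    print(f"month {month:2d}: IQR [{lo:.2f}, {hi:.2f}] "
          f"{'covers' if covered else 'misses'} actual {actual[month]:.2f}")
```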
A number of recent studies investigate the uses of generative AI in economics and business applications. Examples include using ChatGPT as economic agents in lab experiments (Horton, 2023), surveying ChatGPT for marketing research (Brand et al., 2023), asking ChatGPT to make risk assessments from earnings call transcripts (Kim et al., 2023), and applying LLMs to predict stock price movements (Lopez-Lira and Tang, 2023). Our findings add to this nascent exploration by underscoring the importance of prompt design in harnessing LLMs for predictive tasks, suggesting that the models’ capacity for “hallucinatory” narrative construction can be strategically utilized to extract forward-looking insights from their training data. But they also suggest that beneath OpenAI’s outward-facing consumer product, ChatGPT-4, is a very powerful prediction machine. This revelation opens new avenues for the application of LLMs in economic forecasting, policy planning, and beyond, challenging us to rethink how we interact with and exploit the capabilities of these sophisticated models.
The next section presents an example of direct versus narrative prompting. This is followed by a detailed description of our methodology and data collection. We then present the results of the experiments predicting the Oscar winners, inflation, and unemployment, followed by our conjecture on ChatGPT-4’s predictive abilities in narrative form. We summarize our findings and suggest avenues for future research in the conclusion.
This paper is available on arXiv under a CC BY 4.0 DEED license.
[1] We can show that our data collection occurred in mid-2023 using time stamps in the Excel spreadsheets used by our research assistants.