Authors: Xiaoxin Yin

TABLE OF LINKS

Abstract
1 Introduction
2 Related Work
3 The Seven Qualification Tests for an AI Scientist
  Selection Criteria
  The Heliocentric Model Test
  The Motion Laws Test
  The Vibrating Strings Test
  The Maxwell's Equations Test
  The Initial Value Problem Test
  The Huffman Coding Test
  The Sorting Algorithm Test
4 Discussions
  Can an AI possibly conquer these tests?
  Why do we need these tests?
5 Conclusions and Future Work and References

Abstract

The rapid advancements in deep learning have demonstrated the potential for AI agents to perform tasks previously limited to humans, including scientific research. While LLMs have shown impressive capabilities in solving math or coding problems, the ability to make scientific discoveries remains a distinct challenge. This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently, without relying on human-generated knowledge. Drawing inspiration from the historical development of science, we propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
These tests include inferring the heliocentric model from celestial observations, discovering the laws of motion in a simulated environment, deriving the differential equation governing vibrating strings, inferring Maxwell's equations from electrodynamics simulations, inventing numerical methods for initial value problems, discovering Huffman coding for data compression, and developing efficient sorting algorithms. To ensure the validity of these tests, the AI agent is provided with interactive libraries or datasets specific to each problem, without access to human knowledge that could potentially contain information about the target discoveries. The ultimate goal is to create an AI scientist capable of making novel and impactful scientific discoveries, surpassing the best human experts in their respective fields. These "Turing tests" serve as intermediate milestones, assessing the AI agent's ability to make discoveries that were groundbreaking in their time. If an AI agent can pass the majority of these seven tests, it would indicate significant progress towards building an AI scientist, paving the way for future advancements in autonomous scientific discovery. This paper aims to establish a benchmark for the capabilities of AI in scientific research and to stimulate further research in this exciting field.

1 Introduction

The recent advances in deep learning, especially those in large language models, have shown the possibility of an AI agent performing any task a human can perform, including scientific research. Recent studies have shown that LLMs such as GPT-4[1], Microsoft Copilot[2], and CodeLlama[3] can solve competition-level coding problems[4], and LLMs such as GPT-4 and Llemma[5] can solve some high-school-level competition math problems (including some IMO-level problems). These LLMs can certainly help researchers solve some problems they encounter in their daily research.
However, being able to solve a class of well-defined problems is very different from making discoveries in scientific research. For instance, in order to train an LLM to solve coding problems, a general-purpose LLM is often fine-tuned on all public code on GitHub, and also fine-tuned on hundreds of thousands of coding problems from platforms such as Codeforces and LeetCode. For example, CodeLlama-Python underwent fine-tuning with 100 billion tokens of Python code. The LLM simply learns how to write code given the coding problem (which is the prompt), by learning to predict the next token in its code given the prompt and the tokens it has already generated. This is essentially the same methodology used to train a model to write novels after reading millions of novels. Such a model does not have the capability of discovering what it has not been taught, making it unable to make scientific discoveries as a scientist would. This makes it necessary to define a "qualification test for an AI scientist". If an AI agent can finish this test without help from humans, we can conclude that this agent qualifies as a scientist and can conduct scientific research on its own. This resembles the Turing Test, which was proposed by Alan Turing in 1950 and serves as a foundational concept in the field of artificial intelligence, challenging whether machines can exhibit human-like intelligence. Turing's seminal paper, "Computing Machinery and Intelligence"[6], introduced the idea of an imitation game in which a human interrogator attempts to distinguish between a computer and a human through a series of text-based questions. The inability of the interrogator to consistently identify the machine is considered a measure of the machine's intelligence. This test not only sparked decades of philosophical debate but also drove technological advances in AI research, shaping the development of intelligent systems.
Unlike today's LLMs, which are trained on a very large corpus in order to perform similar tasks, science is about discoveries, especially in new areas that have not been explored. In order to define a Turing test for an AI scientist, let us first review the development of science in its early stage. The night sky played an essential role in the transition to modern scientific methodologies, largely through the efforts of astronomers such as Johannes Kepler and Galileo Galilei. Kepler's laws of planetary motion, derived from meticulous observations of the night sky, laid the groundwork for the heliocentric model of the solar system and ultimately for Newton's theory of gravitation. His reliance on empirical data and systematic experimentation marked a significant departure from the speculative philosophies that had previously dominated the scientific arena. Galileo's method of integrating experimental evidence with mathematical analysis is a cornerstone of the scientific method, earning him the title "father of modern science." His work exemplifies how observations of the night sky were instrumental in shaping the development of science in its modern form. Therefore, the first "Turing test" for an AI scientist should be the discovery of the heliocentric model through observations of the night sky. This requires an AI agent to discover laws governing the motions of celestial objects, and fit them into a mathematical framework. It also requires the AI agent to make groundbreaking conjectures, such as that the Earth is similar to the planets in the night sky. Both capabilities are necessities for a scientist. In order to be a good benchmark test for an AI scientist, a test needs to provide a very large amount of data or an interactive environment. For example, one can access the location of any observable celestial object at any moment in time through the AstroPy library[7].
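To make concrete what "inferring a law from observations" means, here is a minimal sketch of one step an agent would need: recovering the exponent in Kepler's third law from orbit measurements. The data below are well-known approximate values for the inner planets, listed as fixed numbers for illustration rather than queried through AstroPy, and the log-log fit is just ordinary least squares.

```python
import math

# Approximate semi-major axis (AU) and orbital period (years) for the
# inner planets -- stand-ins for measurements an agent would derive
# from a sky-observation library such as AstroPy.
observations = {
    "Mercury": (0.387, 0.241),
    "Venus":   (0.723, 0.615),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
}

# Hypothesize a power law T = a^k and fit k by least squares in
# log-log space: log(T) = k * log(a), so k = sum(x*y) / sum(x*x).
xs = [math.log(a) for a, _ in observations.values()]
ys = [math.log(t) for _, t in observations.values()]
k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"Fitted exponent: {k:.3f}")  # close to 1.5, i.e. T^2 is proportional to a^3
```

The point of the benchmark, of course, is that the agent must come up with both the power-law hypothesis and the fitting procedure on its own; this snippet only shows that the target regularity is recoverable from such data.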
Based on the above two standards, we choose the following seven tests as the Turing tests for an AI scientist. In each test the AI agent cannot be trained on human knowledge, but has access to math tools such as SymPy[8] and NumPy[9], and any other datasets that do not "leak information", i.e., that contain clues about the target discoveries to be made.

1. Heliocentric Model: Given an interactive Python library[7] that provides the coordinates of any observable celestial object in the night sky at any given moment, check if an AI agent can infer Kepler's three laws and conclude that all planets orbit the sun. As a bonus, the agent may also conclude that the Earth orbits the sun, though this is not required.

2. Laws of Motion: Given an interactive library that controls Minecraft[10], check if an AI agent can discover the Law of Inertia and the Law of Acceleration (only for gravity).

3. Vibrating Strings: Vibrating strings is one of the most important problems that drove the development of differential equations[11].
Given a Python library that provides the position of each point on a vibrating string under many different initial conditions, check if an AI agent can infer the differential equation governing the motion:

∂²u/∂t² = c² ∂²u/∂x²

where u(x, t) is the displacement of the string, c is the speed of wave propagation in the string, t is time, and x is the spatial coordinate along the string. Please note the AI agent should not have any prior knowledge about calculus, and has to define differential equations on its own.

4. Maxwell's Equations: Maxwell's equations are often considered to be the most beautiful equations in physics. Given a Python-based electrodynamics simulator[12], check if an AI agent can infer Maxwell's equations or their equivalent forms. Again, the agent cannot use any prior knowledge about calculus.

5. Initial Value Problem (IVP): The IVP is probably the most important problem in numerical computing, and the Runge-Kutta method[13], invented at the end of the 19th century, is still widely used today. Given math tools such as SymPy[8] and NumPy[9] that can calculate integrals of functions both symbolically and numerically, check if an AI agent can invent a method for IVPs that is at least as accurate as the fourth-order Runge-Kutta method.

6. Huffman Coding: Huffman coding[14] is one of the most important pieces of work in information theory. Given a large corpus of ASCII characters, and Python functions to operate on bits, check if an AI agent can discover Huffman coding when working towards the goal of minimizing storage under the constraint that each character be represented by a specific sequence of 0's and 1's.

7. Sorting Algorithm: Sorting is probably the most studied problem in computer science. Given a very large number of examples of sorting integer arrays and a Python environment, check if an AI can discover a sorting algorithm that runs in expected O(n log n) time.
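To illustrate the kind of target discovery a test like Huffman coding demands, here is a minimal sketch of the construction itself: repeatedly merge the two least frequent subtrees so that frequent characters receive short codewords. The corpus here is a toy string, not the "large corpus" of the actual test, and the function name is ours.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free binary code from character frequencies.

    Classic greedy construction: merge the two least frequent subtrees
    until one tree remains. A tie-breaking counter keeps heap
    comparisons from ever reaching the (unorderable) code dicts.
    """
    freq = Counter(text)
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

corpus = "abracadabra"          # toy stand-in for the large ASCII corpus
codes = huffman_code(corpus)
encoded = "".join(codes[ch] for ch in corpus)
# The most frequent character ('a', 5 occurrences) gets the shortest codeword.
print(codes["a"], len(encoded))
```

As with the other tests, the benchmark requires the agent to arrive at this construction (or an equally good one) purely from the storage-minimization objective, without ever seeing a description of the algorithm.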
Please note that each test selected only requires data or interaction within a well-defined scope (such as a dataset or an interactive library). This makes it possible for an AI agent to make discoveries without being trained on human-written documents, which may leak information about the target discoveries. For the same reason, we do not select any tests from many of the most important disciplines, such as chemistry, biology, and geology, because they either require interacting with the physical world or offer only a limited amount of observations. In order to make important discoveries in these disciplines, it is inevitable to use knowledge outside a small predefined scope, which may leak key information to the AI agent.

The ultimate goal for an AI scientist should be making novel and impactful scientific discoveries that no one has made before. Why, then, do we still need these "Turing tests", whose answers were discovered decades or centuries ago? The reason is that the "ultimate goal" is very challenging, because the AI agent needs to be better than the best human experts in the world. It is analogous to building an AI agent that can beat the best Go player in the world, while our benchmark is like beating a top Go player a thousand years ago when Go was in its early days, or beating an amateur Go player today. If we could build an AI agent that passes the majority of the above seven tests, we could conclude that we are on the right track towards building an AI scientist, one that should evolve into an agent capable of making important scientific discoveries in the foreseeable future.

This paper is available on arxiv under CC BY 4.0 Deed (Attribution 4.0 International) license.