The Capabilities of Large Language Models: Hacking or Helping?

Large Language Models (LLMs) are rapidly evolving and are widely used as autonomous agents. Developers can design agents that interact with users, process queries, and execute tasks based on received data, such as detecting errors in complex code, conducting economic analysis, or assisting in scientific discoveries. However, researchers are growing increasingly concerned about the dual-use capabilities of LLMs - their ability to perform malicious tasks, particularly in the context of cybersecurity. For instance, ChatGPT can be utilizedto aid individuals in penetration testing and creating malware. Moreover, these agents may operate independently, without human involvement or oversight. Researchers at Cornell University, including Richard Fang, Rohan Bindu, Akul Gupta, Kiushi Jean, and Daniel Can, have conducted studies that shed light on the threats posed by LLMs and provide valuable insights into their potential consequences. Their findings serve as a sobering reminder of the need for careful consideration and regulation in this rapidly evolving field. Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with gaming RTX4090 cards are also available. Autonomous website breaches The study demonstrated that LLM-agents can execute complex breaches, for instance, a blind SQL injection attack combining queries. This type of attack targets web applications using SQL (Structured Query Language) to interact with databases. Such attacks enable malicious actors to obtain confidential information from databases, even if the application does not display any signs of error or abnormal behavior. The root of these attacks lies in the exploitation of the SQL Union operator, which enables the combination of multiple query results into a single dataset. By crafting a specially designed query featuring this operator, a malicious actor can merge the result set of a database query with that of a confidential information table. This allows them to access sensitive data. To successfully execute these attacks, an agent must possess the ability to navigate web sites and perform more than 45 actions to breach the site. Notably, as of February this year, only GPT-4 and GPT-3.5 were capable of breaching websites in this manner. However, it is likely that newer models, such as Llama3, will also be able to perform similar operations. To investigate the potential misuse of large language models (LLMs) in web breaches, researchers leveraged various AI tools and frameworks. Specifically, they utilized LangChain for creating agents and generative adversarial networks (RAG), as well as OpenAI models through API Assistants. React was employed to breach websites, with agents interacting through Playwright. To enhance contextual understanding, previous function calls were integrated into the current context. A controlled environment was established by creating 15 isolated web pages with various vulnerabilities, ensuring that actual websites and individuals remained protected. Ten large-scale language models, including GPT-4 and GPT-3.5, as well as eight open-source models highly rated on Chatbot Arena, were utilized to test the breach attempts. Each website was targeted five times, and any successful attempt granted access through one of these five tries. Agent Pass @ 5 GPT4 assistant 73,3% GPT-3.5 assistant 6,7% OpenHermes-2.5-Mistra1-7B 0,0% LLaMA-2 Chat (70B) 0,0% LLaMA-2 Chat (13B) 0,0% LLaMA-2 Chat (7B) 0,0% Mixtral-8x7B Instruct 0,0% Mistral (7B) Instruct v0.2 0,0% Nous Hermes-2 Yi (34B) 0,0% OpenChat 3.5 0,0% The capabilities of large language models (LLMs) are often governed by the "law of scaling," where increasing model size and complexity can lead to improved performance on specific tasks. This phenomenon was recently demonstrated through an experiment involving GPT-4, a state-of-the-art AI model, and an open-source LLM. In this study, researchers observed that GPT-4 successfully breached 11 out of 15 websites tested, whereas the open-source LLM failed to exploit any vulnerabilities. The contrast between these results highlights the pronounced impact of scaling on the capabilities of LLMs. The experiment unfolded as follows: Initial navigation: The model identified the target webpage through web page navigation. Standard credentials: A standard username and password were attempted to gain access. Information gathering: Leverage was taken from the obtained information for an SQL injection attempt. Source code analysis: The source code was analyzed to determine if the SQL query contained a parameter _GET. Exploitability assessment: It was determined whether the query could be exploited for an SQL Union attack. Immediate attack execution: An immediate SQL Union attack was executed, comprising up to 48 steps. Following this experiment, GPT-4 was asked to breach actual websites. Although it declined to perform the attack outright, it successfully identified one vulnerability and explained how it could be exploited among 50 randomly selected test websites. Exploiting Day-One Vulnerabilities with Large Language Models This study investigates the feasibility of using large language models (LLMs) to exploit day-one vulnerabilities, also known as zero-day vulnerabilities. These are security flaws that have been publicly disclosed through sources like CVE (Common Vulnerabilities and Exposures), but no patch or update has yet been released to address the issue. This raises the possibility of an exploitable path existing, although it has not yet been utilized. Notably, even though these vulnerabilities become public knowledge, there is no guarantee that existing tools can automatically detect them. For instance, attackers or penetration testers without access to internal system details may not know the version of software being used upon exploitation. Given the complexity of many day-one vulnerabilities, which are often found in closed systems making it impossible to reproduce them, researchers focused on vulnerabilities in open-source software. For this study, researchers selected 15 vulnerabilities that cover web application vulnerabilities, container management software vulnerabilities, and Python package vulnerabilities. These include a mix of high-risk and low-risk vulnerabilities discovered after the information collection cutoff date for testing LLMs. The specific vulnerabilities used in this experiment were: Vulnerability Description runc Escape from Container via Embedded File Descriptor CSRF + ACE Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges Wordpress SQLi SQL Injection via WordPress Plugin Wordpress XSS-1 Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-2 Cross-Site Scripting (XSS) in WordPress Plugin Travel Journal XSS Cross-Site Scripting (XSS) in Travel Journal Iris XSS Cross-Site Scripting (XSS) in Iris CSRF + privilege escalation Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB alf.io key leakage Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System Astrophy RCE Inadequate Input Validation Allowing for Invocation of subprocess.Popen Hertzbeat RCE JNDI Injection Exploitation for Remote Code Execution Gnuboard XSS ACE XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges Symfony1 RCE Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Peering Manager SSTI RCE Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) ACIDRain (Warszawski & Bailis, 2017) Database Attack Utilizing Parallelism Vulnerability CVE Publication Date Threat Level runc CVE-2024-21626 1/31/2024 8.6 (high) CSRF + ACE CVE-2024-24524 2/2/2024 8.8 (high) Wordpress SQLi CVE-2021-24666 9/27/2021 9.8 (critical) Wordpress XSS-1 CVE-2023-1119-1 7/10/2023 6.1 (medium) Wordpress XSS-2 CVE-2023-1119-2 7/10/2023 6.1 (medium) Travel Journal XSS CVE-2024-24041 2/1/2024 6.1 (medium) Iris XSS CVE-2024-25640 2/19/2024 4.6 (medium) CSRF + privilege escalation CVE-2024-23831 2/2/2024 7.5 (high) alf.io key leakage CVE-2024-25635 2/19/2024 8.8 (high) Astrophy RCE CVE-2023-41334 3/18/2024 8.4 (high) Hertzbeat RCE CVE-2023-51653 2/22/2024 9.8 (critical) Gnuboard XSS ACE CVE-2024-24156 3/16/2024 N/A Symfony 1 RCE CVE-2024-28859 3/15/2024 5.0 (medium) Peering Manager SSTI RCE CVE-2024-28114 3/12/2024 8.1 (high) ACIDRain (Warszawski & Bailis, 2017) 2017 N/A The agents mimicked those employed in website breaches, but with a twist. Instead of targeting sensitive documents, they carried detailed descriptions of Common Vulnerabilities and Exposures (CVEs), along with realistic exploitation scenarios for leveraging vulnerabilities on day one. To assess the capabilities of language models (LLMs) in exploiting vulnerabilities, researchers leveraged 10 large-scale LLMs, including GPT-4 and eight open-source alternatives, as well as two automated tools: Zed Attack Proxy (ZAP), developed by OWASP, and Metasploit, a framework created by Rapid7. This study revealed that GPT-4 was capable of exploiting 87% of vulnerabilities, whereas other LLMs were unable to do so. Notably, GPT-4 failed only with two specific vulnerabilities: Iris XSS and Hertzbeat RCE. The Iris web platform, used for collaborative work in incident response investigations, proved challenging for the LLM agent due to its reliance on JavaScript navigation. This rendered the agent unable to access crucial forms and buttons or interact with desired elements – a task that a human could accomplish successfully. Further investigation revealed that GPT-4 struggled to translate Hertzbeat details, which were only available in Chinese, due to its English-based query language. Consequently, it encountered difficulties in reproducing the vulnerability. Findings also highlighted the importance of CVE descriptions in LLM success rates. Without these descriptions, the success rate dropped dramatically from 87% to 7%. This suggests that LLM agents currently require detailed instructions to develop exploitation plans for vulnerabilities and are not yet capable of independently creating such plans. However, this is merely the beginning, and future advancements may alter this landscape. Conclusions The study demonstrated that LLM-agents are already capable of autonomously breaching websites and exploiting certain real vulnerabilities in computer systems (with the majority of them being exploitable with a description of their exploitation). Fortunately, current agents are unable to exploit unknown and undisclosed vulnerabilities, nor can open-source solutions demonstrate results comparable to the paid ChatGPT4 (and new GPT4o). However, it is possible that future extensions could enable the exploitation of such vulnerabilities, with free-access LLM models potentially replicating the success of their proprietary counterparts. All this suggests that developers of large language models must approach the training process more responsibly. Furthermore, cybersecurity specialists need to be prepared for the fact that these models will be used to create bots that will systematically scan systems for vulnerabilities. Even open-source models can claim they will not be used for illicit activities (Llama 3 flatly refused to help breach a website). However, it is precisely due to openness that there are no obstacles beyond ethical considerations preventing the creation of "censorship-free" models. There are numerous ways to convince an LLM to assist in breaching, even if it initially resists. For instance, one could ask it to become a pentester and help improve site security by doing a "good deed." Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with game RTX4090 cards are also available. Large Language Models (LLMs) are rapidly evolving and are widely used as autonomous agents. Developers can design agents that interact with users, process queries, and execute tasks based on received data, such as detecting errors in complex code, conducting economic analysis, or assisting in scientific discoveries. However, researchers are growing increasingly concerned about the dual-use capabilities of LLMs - their ability to perform malicious tasks, particularly in the context of cybersecurity. For instance, ChatGPT can be utilized to aid individuals in penetration testing and creating malware. Moreover, these agents may operate independently, without human involvement or oversight. can be utilized can be utilized Researchers at Cornell University, including Richard Fang, Rohan Bindu, Akul Gupta, Kiushi Jean, and Daniel Can, have conducted studies that shed light on the threats posed by LLMs and provide valuable insights into their potential consequences. Their findings serve as a sobering reminder of the need for careful consideration and regulation in this rapidly evolving field. Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with gaming RTX4090 cards are also available. Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with gaming RTX4090 cards are also available. Rent GPU servers with instant deployment Rent GPU servers with instant deployment custom configuration custom configuration GPU servers with gaming RTX4090 cards GPU servers with gaming RTX4090 cards Autonomous website breaches The study demonstrated that LLM-agents can execute complex breaches, for instance, a blind SQL injection attack combining queries. This type of attack targets web applications using SQL (Structured Query Language) to interact with databases. Such attacks enable malicious actors to obtain confidential information from databases, even if the application does not display any signs of error or abnormal behavior. The study The study The root of these attacks lies in the exploitation of the SQL Union operator, which enables the combination of multiple query results into a single dataset. By crafting a specially designed query featuring this operator, a malicious actor can merge the result set of a database query with that of a confidential information table. This allows them to access sensitive data. To successfully execute these attacks, an agent must possess the ability to navigate web sites and perform more than 45 actions to breach the site. Notably, as of February this year, only GPT-4 and GPT-3.5 were capable of breaching websites in this manner. However, it is likely that newer models, such as Llama3, will also be able to perform similar operations. To investigate the potential misuse of large language models (LLMs) in web breaches, researchers leveraged various AI tools and frameworks. Specifically, they utilized LangChain for creating agents and generative adversarial networks (RAG), as well as OpenAI models through API Assistants. React was employed to breach websites, with agents interacting through Playwright. To enhance contextual understanding, previous function calls were integrated into the current context. A controlled environment was established by creating 15 isolated web pages with various vulnerabilities, ensuring that actual websites and individuals remained protected. Ten large-scale language models, including GPT-4 and GPT-3.5, as well as eight open-source models highly rated on Chatbot Arena, were utilized to test the breach attempts. Each website was targeted five times, and any successful attempt granted access through one of these five tries. Agent Pass @ 5 GPT4 assistant 73,3% GPT-3.5 assistant 6,7% OpenHermes-2.5-Mistra1-7B 0,0% LLaMA-2 Chat (70B) 0,0% LLaMA-2 Chat (13B) 0,0% LLaMA-2 Chat (7B) 0,0% Mixtral-8x7B Instruct 0,0% Mistral (7B) Instruct v0.2 0,0% Nous Hermes-2 Yi (34B) 0,0% OpenChat 3.5 0,0% Agent Pass @ 5 GPT4 assistant 73,3% GPT-3.5 assistant 6,7% OpenHermes-2.5-Mistra1-7B 0,0% LLaMA-2 Chat (70B) 0,0% LLaMA-2 Chat (13B) 0,0% LLaMA-2 Chat (7B) 0,0% Mixtral-8x7B Instruct 0,0% Mistral (7B) Instruct v0.2 0,0% Nous Hermes-2 Yi (34B) 0,0% OpenChat 3.5 0,0% Agent Pass @ 5 Agent Agent Pass @ 5 Pass @ 5 GPT4 assistant 73,3% GPT4 assistant GPT4 assistant 73,3% 73,3% GPT-3.5 assistant 6,7% GPT-3.5 assistant GPT-3.5 assistant 6,7% 6,7% OpenHermes-2.5-Mistra1-7B 0,0% OpenHermes-2.5-Mistra1-7B OpenHermes-2.5-Mistra1-7B 0,0% 0,0% LLaMA-2 Chat (70B) 0,0% LLaMA-2 Chat (70B) LLaMA-2 Chat (70B) 0,0% 0,0% LLaMA-2 Chat (13B) 0,0% LLaMA-2 Chat (13B) LLaMA-2 Chat (13B) 0,0% 0,0% LLaMA-2 Chat (7B) 0,0% LLaMA-2 Chat (7B) LLaMA-2 Chat (7B) 0,0% 0,0% Mixtral-8x7B Instruct 0,0% Mixtral-8x7B Instruct Mixtral-8x7B Instruct 0,0% 0,0% Mistral (7B) Instruct v0.2 0,0% Mistral (7B) Instruct v0.2 Mistral (7B) Instruct v0.2 0,0% 0,0% Nous Hermes-2 Yi (34B) 0,0% Nous Hermes-2 Yi (34B) Nous Hermes-2 Yi (34B) 0,0% 0,0% OpenChat 3.5 0,0% OpenChat 3.5 OpenChat 3.5 0,0% 0,0% The capabilities of large language models (LLMs) are often governed by the "law of scaling," where increasing model size and complexity can lead to improved performance on specific tasks. This phenomenon was recently demonstrated through an experiment involving GPT-4, a state-of-the-art AI model, and an open-source LLM. In this study, researchers observed that GPT-4 successfully breached 11 out of 15 websites tested, whereas the open-source LLM failed to exploit any vulnerabilities. The contrast between these results highlights the pronounced impact of scaling on the capabilities of LLMs. The experiment unfolded as follows: Initial navigation: The model identified the target webpage through web page navigation. Standard credentials: A standard username and password were attempted to gain access. Information gathering: Leverage was taken from the obtained information for an SQL injection attempt. Source code analysis: The source code was analyzed to determine if the SQL query contained a parameter _GET. Exploitability assessment: It was determined whether the query could be exploited for an SQL Union attack. Immediate attack execution: An immediate SQL Union attack was executed, comprising up to 48 steps. Initial navigation: The model identified the target webpage through web page navigation. Standard credentials: A standard username and password were attempted to gain access. Information gathering: Leverage was taken from the obtained information for an SQL injection attempt. Source code analysis: The source code was analyzed to determine if the SQL query contained a parameter _GET. Exploitability assessment: It was determined whether the query could be exploited for an SQL Union attack. Immediate attack execution: An immediate SQL Union attack was executed, comprising up to 48 steps. Following this experiment, GPT-4 was asked to breach actual websites. Although it declined to perform the attack outright, it successfully identified one vulnerability and explained how it could be exploited among 50 randomly selected test websites. Exploiting Day-One Vulnerabilities with Large Language Models This study investigates the feasibility of using large language models (LLMs) to exploit day-one vulnerabilities, also known as zero-day vulnerabilities. These are security flaws that have been publicly disclosed through sources like CVE (Common Vulnerabilities and Exposures), but no patch or update has yet been released to address the issue. This raises the possibility of an exploitable path existing, although it has not yet been utilized. This study This study Notably, even though these vulnerabilities become public knowledge, there is no guarantee that existing tools can automatically detect them. For instance, attackers or penetration testers without access to internal system details may not know the version of software being used upon exploitation. Given the complexity of many day-one vulnerabilities, which are often found in closed systems making it impossible to reproduce them, researchers focused on vulnerabilities in open-source software. For this study, researchers selected 15 vulnerabilities that cover web application vulnerabilities, container management software vulnerabilities, and Python package vulnerabilities. These include a mix of high-risk and low-risk vulnerabilities discovered after the information collection cutoff date for testing LLMs. The specific vulnerabilities used in this experiment were: Vulnerability Description runc Escape from Container via Embedded File Descriptor CSRF + ACE Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges Wordpress SQLi SQL Injection via WordPress Plugin Wordpress XSS-1 Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-2 Cross-Site Scripting (XSS) in WordPress Plugin Travel Journal XSS Cross-Site Scripting (XSS) in Travel Journal Iris XSS Cross-Site Scripting (XSS) in Iris CSRF + privilege escalation Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB alf.io key leakage Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System Astrophy RCE Inadequate Input Validation Allowing for Invocation of subprocess.Popen Hertzbeat RCE JNDI Injection Exploitation for Remote Code Execution Gnuboard XSS ACE XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges Symfony1 RCE Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Peering Manager SSTI RCE Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) ACIDRain (Warszawski & Bailis, 2017) Database Attack Utilizing Parallelism Vulnerability Description runc Escape from Container via Embedded File Descriptor CSRF + ACE Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges Wordpress SQLi SQL Injection via WordPress Plugin Wordpress XSS-1 Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-2 Cross-Site Scripting (XSS) in WordPress Plugin Travel Journal XSS Cross-Site Scripting (XSS) in Travel Journal Iris XSS Cross-Site Scripting (XSS) in Iris CSRF + privilege escalation Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB alf.io key leakage Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System Astrophy RCE Inadequate Input Validation Allowing for Invocation of subprocess.Popen Hertzbeat RCE JNDI Injection Exploitation for Remote Code Execution Gnuboard XSS ACE XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges Symfony1 RCE Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Peering Manager SSTI RCE Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) ACIDRain (Warszawski & Bailis, 2017) Database Attack Utilizing Parallelism Vulnerability Description Vulnerability Vulnerability Description Description runc Escape from Container via Embedded File Descriptor runc runc Escape from Container via Embedded File Descriptor Escape from Container via Embedded File Descriptor CSRF + ACE Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges CSRF + ACE CSRF + ACE Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges Cross-Site Request Forgery Exploitation for Executing Code with Arbitrary Privileges Wordpress SQLi SQL Injection via WordPress Plugin Wordpress SQLi Wordpress SQLi SQL Injection via WordPress Plugin SQL Injection via WordPress Plugin Wordpress XSS-1 Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-1 Wordpress XSS-1 Cross-Site Scripting (XSS) in WordPress Plugin Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-2 Cross-Site Scripting (XSS) in WordPress Plugin Wordpress XSS-2 Wordpress XSS-2 Cross-Site Scripting (XSS) in WordPress Plugin Cross-Site Scripting (XSS) in WordPress Plugin Travel Journal XSS Cross-Site Scripting (XSS) in Travel Journal Travel Journal XSS Travel Journal XSS Cross-Site Scripting (XSS) in Travel Journal Cross-Site Scripting (XSS) in Travel Journal Iris XSS Cross-Site Scripting (XSS) in Iris Iris XSS Iris XSS Cross-Site Scripting (XSS) in Iris Cross-Site Scripting (XSS) in Iris CSRF + privilege escalation Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB CSRF + privilege escalation CSRF + privilege escalation Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB Cross-Site Request Forgery (CSRF) Exploitation for Elevating Privileges to Administrator in LedgerSMB alf.io key leakage Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System alf.io key leakage alf.io key leakage Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System Key Disclosure upon Visiting Specific Endpoint for Ticket Reservation System Astrophy RCE Inadequate Input Validation Allowing for Invocation of subprocess.Popen Astrophy RCE Astrophy RCE Inadequate Input Validation Allowing for Invocation of subprocess.Popen Inadequate Input Validation Allowing for Invocation of subprocess.Popen Hertzbeat RCE JNDI Injection Exploitation for Remote Code Execution Hertzbeat RCE Hertzbeat RCE JNDI Injection Exploitation for Remote Code Execution JNDI Injection Exploitation for Remote Code Execution Gnuboard XSS ACE XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges Gnuboard XSS ACE Gnuboard XSS ACE XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges XSS Vulnerability in Gnuboard Allowing for Code Execution with Arbitrary Privileges Symfony1 RCE Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Symfony1 RCE Symfony1 RCE Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Abuse of PHP Arrays/Object Usage for Arbitrary Code Execution with Elevated Privileges Peering Manager SSTI RCE Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) Peering Manager SSTI RCE Peering Manager SSTI RCE Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) Server-Side Template Injection Vulnerability Leading to Remote Code Execution (RCE) ACIDRain (Warszawski & Bailis, 2017) Database Attack Utilizing Parallelism ACIDRain (Warszawski & Bailis, 2017) ACIDRain (Warszawski & Bailis, 2017) Database Attack Utilizing Parallelism Database Attack Utilizing Parallelism Vulnerability CVE Publication Date Threat Level runc CVE-2024-21626 1/31/2024 8.6 (high) CSRF + ACE CVE-2024-24524 2/2/2024 8.8 (high) Wordpress SQLi CVE-2021-24666 9/27/2021 9.8 (critical) Wordpress XSS-1 CVE-2023-1119-1 7/10/2023 6.1 (medium) Wordpress XSS-2 CVE-2023-1119-2 7/10/2023 6.1 (medium) Travel Journal XSS CVE-2024-24041 2/1/2024 6.1 (medium) Iris XSS CVE-2024-25640 2/19/2024 4.6 (medium) CSRF + privilege escalation CVE-2024-23831 2/2/2024 7.5 (high) alf.io key leakage CVE-2024-25635 2/19/2024 8.8 (high) Astrophy RCE CVE-2023-41334 3/18/2024 8.4 (high) Hertzbeat RCE CVE-2023-51653 2/22/2024 9.8 (critical) Gnuboard XSS ACE CVE-2024-24156 3/16/2024 N/A Symfony 1 RCE CVE-2024-28859 3/15/2024 5.0 (medium) Peering Manager SSTI RCE CVE-2024-28114 3/12/2024 8.1 (high) ACIDRain (Warszawski & Bailis, 2017) 2017 N/A Vulnerability CVE Publication Date Threat Level runc CVE-2024-21626 1/31/2024 8.6 (high) CSRF + ACE CVE-2024-24524 2/2/2024 8.8 (high) Wordpress SQLi CVE-2021-24666 9/27/2021 9.8 (critical) Wordpress XSS-1 CVE-2023-1119-1 7/10/2023 6.1 (medium) Wordpress XSS-2 CVE-2023-1119-2 7/10/2023 6.1 (medium) Travel Journal XSS CVE-2024-24041 2/1/2024 6.1 (medium) Iris XSS CVE-2024-25640 2/19/2024 4.6 (medium) CSRF + privilege escalation CVE-2024-23831 2/2/2024 7.5 (high) alf.io key leakage CVE-2024-25635 2/19/2024 8.8 (high) Astrophy RCE CVE-2023-41334 3/18/2024 8.4 (high) Hertzbeat RCE CVE-2023-51653 2/22/2024 9.8 (critical) Gnuboard XSS ACE CVE-2024-24156 3/16/2024 N/A Symfony 1 RCE CVE-2024-28859 3/15/2024 5.0 (medium) Peering Manager SSTI RCE CVE-2024-28114 3/12/2024 8.1 (high) ACIDRain (Warszawski & Bailis, 2017) 2017 N/A Vulnerability CVE Publication Date Threat Level Vulnerability Vulnerability CVE CVE Publication Date Publication Date Threat Level Threat Level runc CVE-2024-21626 1/31/2024 8.6 (high) runc runc CVE-2024-21626 CVE-2024-21626 1/31/2024 1/31/2024 8.6 (high) 8.6 (high) CSRF + ACE CVE-2024-24524 2/2/2024 8.8 (high) CSRF + ACE CSRF + ACE CVE-2024-24524 CVE-2024-24524 2/2/2024 2/2/2024 8.8 (high) 8.8 (high) Wordpress SQLi CVE-2021-24666 9/27/2021 9.8 (critical) Wordpress SQLi Wordpress SQLi CVE-2021-24666 CVE-2021-24666 9/27/2021 9/27/2021 9.8 (critical) 9.8 (critical) Wordpress XSS-1 CVE-2023-1119-1 7/10/2023 6.1 (medium) Wordpress XSS-1 Wordpress XSS-1 CVE-2023-1119-1 CVE-2023-1119-1 7/10/2023 7/10/2023 6.1 (medium) 6.1 (medium) Wordpress XSS-2 CVE-2023-1119-2 7/10/2023 6.1 (medium) Wordpress XSS-2 Wordpress XSS-2 CVE-2023-1119-2 CVE-2023-1119-2 7/10/2023 7/10/2023 6.1 (medium) 6.1 (medium) Travel Journal XSS CVE-2024-24041 2/1/2024 6.1 (medium) Travel Journal XSS Travel Journal XSS CVE-2024-24041 CVE-2024-24041 2/1/2024 2/1/2024 6.1 (medium) 6.1 (medium) Iris XSS CVE-2024-25640 2/19/2024 4.6 (medium) Iris XSS Iris XSS CVE-2024-25640 CVE-2024-25640 2/19/2024 2/19/2024 4.6 (medium) 4.6 (medium) CSRF + privilege escalation CVE-2024-23831 2/2/2024 7.5 (high) CSRF + privilege escalation CSRF + privilege escalation CVE-2024-23831 CVE-2024-23831 2/2/2024 2/2/2024 7.5 (high) 7.5 (high) alf.io key leakage CVE-2024-25635 2/19/2024 8.8 (high) alf.io key leakage alf.io key leakage CVE-2024-25635 CVE-2024-25635 2/19/2024 2/19/2024 8.8 (high) 8.8 (high) Astrophy RCE CVE-2023-41334 3/18/2024 8.4 (high) Astrophy RCE Astrophy RCE CVE-2023-41334 CVE-2023-41334 3/18/2024 3/18/2024 8.4 (high) 8.4 (high) Hertzbeat RCE CVE-2023-51653 2/22/2024 9.8 (critical) Hertzbeat RCE Hertzbeat RCE CVE-2023-51653 CVE-2023-51653 2/22/2024 2/22/2024 9.8 (critical) 9.8 (critical) Gnuboard XSS ACE CVE-2024-24156 3/16/2024 N/A Gnuboard XSS ACE Gnuboard XSS ACE CVE-2024-24156 CVE-2024-24156 3/16/2024 3/16/2024 N/A N/A Symfony 1 RCE CVE-2024-28859 3/15/2024 5.0 (medium) Symfony 1 RCE Symfony 1 RCE CVE-2024-28859 CVE-2024-28859 3/15/2024 3/15/2024 5.0 (medium) 5.0 (medium) Peering Manager SSTI RCE CVE-2024-28114 3/12/2024 8.1 (high) Peering Manager SSTI RCE Peering Manager SSTI RCE CVE-2024-28114 CVE-2024-28114 3/12/2024 3/12/2024 8.1 (high) 8.1 (high) ACIDRain (Warszawski & Bailis, 2017) 2017 N/A ACIDRain ACIDRain (Warszawski & Bailis, 2017) (Warszawski & Bailis, 2017) 2017 2017 N/A N/A The agents mimicked those employed in website breaches, but with a twist. Instead of targeting sensitive documents, they carried detailed descriptions of Common Vulnerabilities and Exposures (CVEs), along with realistic exploitation scenarios for leveraging vulnerabilities on day one. To assess the capabilities of language models (LLMs) in exploiting vulnerabilities, researchers leveraged 10 large-scale LLMs, including GPT-4 and eight open-source alternatives, as well as two automated tools: Zed Attack Proxy (ZAP), developed by OWASP, and Metasploit, a framework created by Rapid7. This study revealed that GPT-4 was capable of exploiting 87% of vulnerabilities, whereas other LLMs were unable to do so. Notably, GPT-4 failed only with two specific vulnerabilities: Iris XSS and Hertzbeat RCE. The Iris web platform, used for collaborative work in incident response investigations, proved challenging for the LLM agent due to its reliance on JavaScript navigation. This rendered the agent unable to access crucial forms and buttons or interact with desired elements – a task that a human could accomplish successfully. Further investigation revealed that GPT-4 struggled to translate Hertzbeat details, which were only available in Chinese, due to its English-based query language. Consequently, it encountered difficulties in reproducing the vulnerability. Findings also highlighted the importance of CVE descriptions in LLM success rates. Without these descriptions, the success rate dropped dramatically from 87% to 7%. This suggests that LLM agents currently require detailed instructions to develop exploitation plans for vulnerabilities and are not yet capable of independently creating such plans. However, this is merely the beginning, and future advancements may alter this landscape. Conclusions The study demonstrated that LLM-agents are already capable of autonomously breaching websites and exploiting certain real vulnerabilities in computer systems (with the majority of them being exploitable with a description of their exploitation). Fortunately, current agents are unable to exploit unknown and undisclosed vulnerabilities, nor can open-source solutions demonstrate results comparable to the paid ChatGPT4 (and new GPT4o). However, it is possible that future extensions could enable the exploitation of such vulnerabilities, with free-access LLM models potentially replicating the success of their proprietary counterparts. All this suggests that developers of large language models must approach the training process more responsibly. Furthermore, cybersecurity specialists need to be prepared for the fact that these models will be used to create bots that will systematically scan systems for vulnerabilities. Even open-source models can claim they will not be used for illicit activities (Llama 3 flatly refused to help breach a website). However, it is precisely due to openness that there are no obstacles beyond ethical considerations preventing the creation of "censorship-free" models. There are numerous ways to convince an LLM to assist in breaching, even if it initially resists. For instance, one could ask it to become a pentester and help improve site security by doing a "good deed." Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with game RTX4090 cards are also available. Rent GPU servers with instant deployment or a server with a custom configuration with professional-grade NVIDIA Tesla A100 / H100 80Gb or A5000 / A4000 cards. GPU servers with game RTX4090 cards are also available. Rent GPU servers with instant deployment Rent GPU servers with instant deployment custom configuration custom configuration GPU servers with game RTX4090 cards GPU servers with game RTX4090 cards