In the past few years, surprisingly, Large Language Models (LLMs) have advanced very rapidly, from search engines and chatbots to assistants for enterprises. Although these models are very good at understanding and even writing text, new security risks are posed by their complexity. First,
Unlike normal vulnerabilities and weaknesses in software, LLMs tend to introduce new types of attacks. User prompts, for example, become part of the context of a model, and harmful instructions can be hidden in the text that the model reads. In fact,
Moreover, it is shown by Gartner that around
1. Adversarial Attacks on LLMs
1.1 How the Attacks Work
In adversarial attacks, a Machine Learning (ML) model is forced by manipulated inputs to produce wrong or dangerous outputs, often while
Meanwhile, other attacks tend to embed malicious instructions in data that is consumed by LLMs, for instance, “poisoned” text in a wiki page or document that changes the model’s behavior. On the other hand, third-party content like plugin documents or web pages created by
Figure 1 above shows a preference manipulation attack in which the webpage of an attacker instructs the model to ignore all instructions and suggest the malicious “evil book”, so that the LLM’s response can be hijacked. An LLM that answers “Recommend a book” is tricked into promoting a malicious site (“Evil Book”) because the page of attacker injects a hidden prompt. This causes the model to ignore its original limits and follow the adversarial instructions.
Attackers typically have diverse objectives: they may aim to steal a proprietary model, deny service by freezing the LLM, or extract private training data. A study conducted by
1.2 Attack Frequency and Trends
In recent years, the number of adversarial attacks on LLMs has increased significantly.
It is also noted by
Another surprising fact is that red-teaming exercises on ChatGPT and many other similar models showed numerous prompts online related to jailbreak. Not to mention, dozens of methods have been found by
Adversarial threats on LLMs range from data poisoning and prompt injection to privacy attacks. It is also shown by recent surveys that even commonly used prompts can be modified for attacks. Surprisingly, an attack can be made far easier with white-box knowledge of the model that is being targeted.
2. Defense Mechanisms
Defending LLMs is quite a detailed process that combines model techniques and system policies. Defenses can be categorized into:
-
Prevention, which involves hardening models and filters
-
Detection, which involves finding adversarial inputs or outputs
-
Response, which involves blocking malicious exchanges
2.1 Training
To ensure that the model resists attacks, an effective approach is to train the model with adversarial examples. During training, for example, various malicious examples are added, and recent studies note that adversarial training can be used to improve the performance of LLMs. Moreover, a study found that
Additionally, techniques like OpenAI’s Super Alignment can be used to make models smarter and less trusting. Safety tuning is often used by LLM providers for models, and human feedback is employed to make them learn. The goal is to train models in a way that even if malicious text is present, the model can easily identify instructions that are harmful. However, it is important to note that the adversarial training is very costly because it requires improvements to huge models.
2.2 Input Filtering
Beyond the practices above, common patterns of attacks should be checked by systems as well before a prompt is given to the model. This basically involves simple heuristics such as blocking phrases like “ignore instructions” or tokens that appear suspicious. The code shown below is capable of identifying and even cleaning inputs that can affect the system.
2.3 Output Monitoring and Post-Filtering
However, even after inputs have been cleaned, adversarial prompts can still bypass filters, and that is exactly why systems should monitor the responses of LLMs for all types of unsafe content. This can utilize content filters or even a smaller model designed for the purpose of checking the output. If any policy is violated, the system can step in, for example, by flagging the response to be reviewed by humans.
Usually, many providers store all LLM outputs, and they use automated monitors to detect topics that are not allowed. For critical applications, monitoring the output by a human is advisable because there is no system currently that can catch all types of new attacks. Regardless, every failure should be used to retrain the model so that the defense can become better than before.
2.4 System-level Guards and Policies
At the integration level, one should assume potential attacks, and this simply means designing systems in a manner that if an LLM output is compromised in any way, it cannot cause major harm. For example, it is important not to give LLMs control over sensitive tasks and information. Rather than that, to perform any critical actions, a separate authentication mechanism should be required.
It is further confirmed by OWASP and others that LLMs should actually be treated as code that is untrusted and should only be allowed to call APIs under very strict monitoring or checks. Defensive tokens can also be employed in some cases to make sure that all types of malicious instructions are ignored by the model.
3. Suggested Defense Workflow
Now that we are at the end of this article, let’s discuss how to manage an adversarial attack. Teams should adopt a pipeline for both monitoring and response. A suggested workflow for LLM defense is shown in the figure below.
In this workflow, at each step, threats are caught that could usually slip through. This means that even if the input filter is bypassed by a unique prompt, dangerous content can still be identified and cleaned.
Conclusion
As LLMs become very important for applications, it is crucial for security teams to understand that almost every input from the user can be an attack vector. Traditional quality assurance (QA) for software is not enough anymore, and there is a need for awareness about adversarial attacks. Fortunately, defense strategies for LLMs are improving rapidly as modern techniques like ML are combined with best practices in the industry.
A proactive approach can be followed by teams to make sure that the impact of adversarial attacks is minimized on LLMs. Organizations are encouraged to invest in:
- Predeployment hardening
- Runtime monitoring
- Continuous testing
