216 reads

RAG Predictive Coding for AI Alignment Against Prompt Injections and Jailbreaks

by stephenSeptember 5th, 2024

Too Long; Didn't Read

What are all the combinations of successful jailbreaks and prompt injection attacks against AI chabots that were different from what it would normally expect? Expectation could become an addition to retrieval and generation of Retrieval-Augmented Generation (RAG), as a first step towards mitigating jailbreaks and prompt injections, with an outlook towards general AI alignment and safety. Electrical signals are theorized to always be splitting, either for sensory inputs or non-sensory inputs [as general mind processes, so to speak]. This split makes it possible to expect what is coming [before a sensory input] or to quickly process [a sensory input]. Then, to quickly decide danger, say if the sight, sound or smell of something is a warning signal. This is what LLMs do not have, to at least know what combinations may lead to a jailbreak attack or prompt injection as dangerous at the gate, before capture.

featured image - RAG Predictive Coding for AI Alignment Against Prompt Injections and Jailbreaks

For what, does an AI chatbot expect that it would be prompted? Not what it rejects, but what it anticipates it would be queried and in what form?

What are all the combinations of successful jailbreaks and prompt injection attacks against AI chatbots that were different from what it would normally expect?

Expectation could become an addition to retrieval and generation of Retrieval-Augmented Generation (RAG), as a first step towards mitigating jailbreaks and prompt injections, with an outlook towards general AI alignment and safety.

Large language models are known to predict tokens. However, if their outputs are good at that, why can their inputs not have a level of combination expectation, in sophistication for safety?

There have been several examples of jailbreaks and prompt injections that struck out from regular prompts that should have been flagged at the gate but went through. The lack of bounded expectation modality and single-layer to output remain vulnerabilities for general AI alignment and safety.

Assuming there is a sub-surface where outputs become a form of input, before finally becoming an output to the user, it could become a path to have inputs stay within safety expectations.

Already, several major chatbots are red-teamed. Some follow some rules and some prevent some outputs, within general contexts. However, there is no established architecture yet on expectation patterns for chatbots, such that aside from just strings or integers that make sense—so to speak—what things, that do not make sense, or that may seem to make sense but can play the chatbot, should be expected and caught?

Tuning chatbots towards expectations could be a new direction for AI safety, experimenting with prompt injections and jailbreaks—especially on combinations that work those out. This may then extend beyond those to general AI safety, including towards artificial general intelligence and artificial superintelligence, and the paths of misuses and harms expected for prevention.

Also, a second layer of input cross-check, before for-user output, could also become a means towards fixing hallucination. This means that AI safety research can also support AI advancements for further usefulness.

Predictive Coding

This will be modeled on the human mind. In neuroscience, it is generally said that the brain predicts. But nowhere in current neuroscience has the exact mechanism of prediction been described.

It is theorized here that electrical signals, in a set [in a cluster of neurons] often split, with some going ahead of others to interact with chemical signals as they had before, to result in a function. If the input matches with the interpretation, the second part follows in the same direction. If not, it goes another way.

Simply, whenever there is some sensory input, conceptually, electrical signals of that set [in a cluster of neurons] split, with some going ahead of others for interpretation, by interaction with a set of chemical signals [in a cluster of neurons]. If this input matches with the initial perception, the incoming electrical signals go in the same direction [in pre-prioritization], but if it does not match, the incoming ones go in another direction, to interact with another set of chemical signals, correcting what is labeled prediction error.

This means that electrical signals that represent a sensory input split, to interact with chemical signals for the interpretation, which, if it matches with the initial perception, those that follow are not prioritized and go in the same direction, but if not, those that follow become prioritized and go in another direction to interact with other chemical signals.

This concept explains the terms predictive coding and predictive processing. The brain is not predicting. Electrical signals are splitting. A reason this is likely is because neuroscience has established saltatory conduction. This means that in myelinated axons, electrical signals are leaping from one node of Ranvier to the next, going faster than in unmyelinated axons, where electrical signals do what is called continuous conduction without leaping.

It is theorized that since electrical signals go very fast in myelinated axons, some in a set, are able to carry enough summaries of the input for interaction with chemical signals. So, the possibility that that as soon as they can carry just enough for the next interaction, they take off, leaving the others, in the same set to follow. It is this summary capability and the go-first that becomes what is experienced as prediction, conceptually.

This is a reason that when a sound is first heard, what it might be is already perceptive until it is heard more closely. If it matches, the input continues, if not, the others go in the right direction. The same with seeing the initial parts of a sentence, or the touch of something that seems like an insect pest, then checking if it is, or if it is not. The hold-back of some electrical signals, in the set, can also be described as a feedback mechanism, where the feedback comes forward, in the same direction, not backward, like backpropagation for artificial neural networks.

The human mind can anticipate or have what to expect, not just seem to predict, breaking away from the assumption that LLMs predict like humans. LLMs may predict the next token for their outputs, but they do not have out-of-coverage expectations, for inputs.

Electrical signals are theorized to always be splitting, either for sensory inputs or non-sensory inputs [as general mind processes, so to speak]. This split makes it possible to expect what is coming [before a sensory input] or to quickly process [a sensory input]. Then, to quickly decide danger, say if the sight, sound, or smell of something is a warning signal.

This is what LLMs do not have, to at least know what combinations may lead to a jailbreak attack or prompt injection as dangerous at the gate, before capture. Research into possibilities with expectations, predicated on the human mind may define AI alignment and safety, for the present and the future.

There is a recent preprint on arXiv, Manipulating Large Language Models to Increase Product Visibility, stating that, "Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product’s visibility. We demonstrate that adding a strategic text sequence (STS)—a carefully crafted message—to a product’s information page can significantly increase its likelihood of being listed as the LLM’s top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM’s recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services."

There is a new announcement, People come first in Australia's new AI Safety Standard, stating that, "The Australian Government is positioning Australia as a global leader in safe and responsible artificial intelligence (AI). The National AI Centre (NAIC) have developed the first iteration of the Voluntary AI Safety Standard to support these efforts. The standard is a guide to best practice for Australian businesses, sectors and industries that are developing, procuring and deploying AI systems and services. The standard has 10 voluntary guardrails to help users realise the benefits of AI and avoid the potential risks it can pose. It takes a human-centred approach to safe and responsible AI that is modular, agile and flexible. The standard will help organisations to, protect people and communities from harms avoid reputational and financial risks, increase trust and confidence in AI systems, services and products, align with legal needs and expectations of the Australian population, operate more seamlessly in an international economy."

There is another recent announcement, Council of Europe opens first ever global treaty on AI for signature, stating that, "The Council of Europe Framework Convention on artificial intelligence and human rights, democracy, and the rule of law was opened for signature during a conference of Council of Europe Ministers of Justice in Vilnius. It is the first-ever international legally binding treaty aimed at ensuring that the use of AI systems is fully consistent with human rights, democracy and the rule of law. The Framework Convention was signed by Andorra, Georgia, Iceland, Norway, the Republic of Moldova, San Marino, the United Kingdom as well as Israel, the United States of America and the European Union. The treaty provides a legal framework covering the entire lifecycle of AI systems. It promotes AI progress and innovation, while managing the risks it may pose to human rights, democracy and the rule of law. To stand the test of time, it is technology-neutral. The Framework Convention was adopted by the Council of Europe Committee of Ministers on 17 May 2024. The 46 Council of Europe member states, the European Union and 11 non-member states (Argentina, Australia, Canada, Costa Rica, the Holy See, Israel, Japan, Mexico, Peru, the United States of America and Uruguay) negotiated the treaty. Representatives of the private sector, civil society and academia contributed as observers. The treaty will enter into force on the first day of the month following the expiration of a period of three months after the date on which five signatories, including at least three Council of Europe member states, have ratified it. Countries from all over the world will be eligible to join it and commit to complying with its provisions."

There is a new paper in Scientific Reports, Trust, trustworthiness and AI governance, stating that, “An emerging issue in AI alignment is the use of artificial intelligence (AI) by public authorities, and specifically the integration of algorithmic decision-making (ADM) into core state functions. In this context, the alignment of AI with the values related to the notions of trust and trustworthiness constitutes a particularly sensitive problem from a theoretical, empirical, and normative perspective. In conclusion, to make substantial progress on the study AI value alignment, we need to first understand values, not only from the perspective of humans and human society, but also—and above all—from the perspective of machines, and how such values are intertwined and possibly interact.“

Feature image source