Disclaimer: The author works at Microsoft. This article is a general exploration of user trust and AI, but it does reference some of the author’s company’s products, such as Copilot and Azure OpenAI.
“I knew I had just seen the most important advance in technology since the graphical user interface.” — Bill Gates, about ChatGPT acing an AP-level biology test
Generative AI’s surge in popularity has opened up new and innovative possibilities, transforming the way we live and work. According to a study conducted by PwC, AI could contribute up to $15.7 trillion to the global economy in 2030. A Forbes Advisor report shows that a staggering 97 percent of business owners believe ChatGPT will help their business. These are exciting times for new startups specializing in Generative AI.
However, there is a dichotomy in how end users feel about the Generative AI revolution. For instance, 46 percent of Americans say they are equally concerned and excited about AI. This kind of user mistrust and skepticism has the potential to hinder AI adoption. What are some ways to mitigate inaccurate or misleading content generated by Large Language Models? How can companies build products that users not only love but trust?
Large Language Models (LLMs) have been labeled stochastic parrots — a fitting metaphor for entities that regurgitate convincingly plausible answers without really understanding the meaning behind them. As Stephen Wolfram put it, a model like ChatGPT is simply adding one word at a time.
LLMs are trained on massive amounts of data from various sources, but by themselves they do not look anything up or refer to external sources of information.
This means that an LLM can make misleading claims or factual errors — while sounding fairly confident about the mistake. These counterfactual statements are, somewhat anthropomorphically, nicknamed hallucinations.
To some extent, LLMs are meant to hallucinate. It’s Generative AI’s version of “it’s not a bug — it’s by design.” The same capability that makes LLMs hallucinate is also what makes them powerful. On the flip side, there are situations when hallucinations are not desirable and where it is critical to get truthful and factual responses.
Working around the non-determinism of GenAI products continues to be a persistent challenge. However, there are ways to ensure that we can still maintain user trust while harnessing all the creativity and potential behind AI.
One way to mitigate hallucinations is to augment the knowledge an LLM possesses with a grounding source of truth. Imagine you’re making someone take a test on a complex, previously unseen topic. This person is a fluent English speaker, but they know next to nothing about the topic they’re being tested on. Now, they could simply fabricate answers for all the questions.
Or, you can give them an open textbook about the topic they’re being tested on so they have reference material from which they can answer. This example is a simplified explanation of what grounding represents in the context of LLMs.
Grounding is table stakes when creating any kind of custom LLM application, especially for internal data (such as customer service chatbots that rely on non–public-facing knowledge bases) or applications that retrieve answers from a corpus (in other words, search engines).
As far as implementation goes, Retrieval-Augmented Generation (RAG) is probably the most ubiquitous pattern currently used for grounding LLMs. There are numerous out-of-the-box RAG solutions on the market that take in a custom corpus and handle both retrieval and generation.
Azure OpenAI on your data is one such example, where users can upload a dataset or point to documents in blob storage and “chat” with them. The feature also returns citations to the source text along with each response. Making an LLM cite an authoritative source for a particular answer automatically increases transparency and adds credibility.
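As a rough illustration, here is what the RAG pattern can look like in code. This is a minimal sketch, assuming a hypothetical search_index() helper standing in for whatever vector or keyword retrieval you use, plus the OpenAI Python SDK with an API key already configured; the model name is just a placeholder.

```python
# A minimal sketch of the RAG pattern: retrieve grounding passages,
# hand them to the model, and ask for an answer with citations.
# `search_index` is a hypothetical stand-in for your own retrieval layer.
from openai import OpenAI

client = OpenAI()

def answer_with_citations(question: str, top_k: int = 3) -> str:
    # Retrieve the most relevant passages from the grounding corpus.
    passages = search_index(question, top_k=top_k)  # hypothetical helper
    context = "\n\n".join(
        f"[{i + 1}] {p['text']}" for i, p in enumerate(passages)
    )
    system_prompt = (
        "Answer ONLY from the numbered passages below. "
        "Cite passage numbers like [1] after each claim. "
        "If the passages do not contain the answer, say you don't know.\n\n"
        + context
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0,  # keep the answer close to the source text
    )
    return response.choices[0].message.content
```

Instructing the model to refuse when the passages don’t contain the answer is a small touch, but it goes a long way toward curbing hallucinations in grounded applications.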
Grounding can also help with evaluation and validation of response quality, a key component of testing Generative AI applications before they are production-worthy for public release. In the absence of a grounding dataset, there is no way to “compare” the responses of an LLM and check whether they are based on truth. So, choosing the right implementation of RAG and the right grounding dataset is critical for improving the factual accuracy of a GenAI application.
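One lightweight way to do that comparison is an “LLM as judge” pass over the grounding passages. The sketch below is purely illustrative, and the model name is an assumption; a real evaluation harness would be far more rigorous.

```python
# A rough sketch of a groundedness check: ask a second model pass to
# verify whether an answer is supported by the retrieved passages.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, passages: list[str]) -> bool:
    judge_prompt = (
        "Source passages:\n" + "\n\n".join(passages) + "\n\n"
        "Candidate answer:\n" + answer + "\n\n"
        "Is every factual claim in the candidate answer supported by the "
        "source passages? Reply with exactly YES or NO."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")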
Surveys have shown that providing too little context about AI decisions can erode user trust. Deep Learning models are optimized for fast, accurate predictions, but end users are often left clueless as to why the model generated a particular response.
A famous example of this black box behavior occurs in my favorite work of fiction, Douglas Adams’ brilliant The Hitchhiker’s Guide to the Galaxy.
In the book, the supercomputer Deep Thought is tasked with working out the Answer to The Ultimate Question of Life, the Universe, and Everything.
After 7.5 million years, Deep Thought responds that the answer is 42. The scientists are puzzled, but Deep Thought is quite confident about the answer: “I think the problem, to be quite honest with you, is that you’ve never actually known what the question was.”
In order to then calculate the Ultimate Question, a special computer is built, and hilarity ensues, but I won’t spoil the book any further.
Decades after the book was written, we now have ways to peek behind the curtain and understand why an AI generated a particular response. While traditional Machine Learning (ML) models use techniques like LIME and SHAP for explainability, Chain-of-Thought (CoT) prompting is a popular technique to add more transparency to Generative AI’s reasoning process.
In essence, CoT prompting asks the model to lay out the intermediate reasoning steps behind its output rather than jumping straight to an answer. In fact, researchers have noted that just adding a phrase like “Let’s think step by step” also increases the accuracy of generated responses, even in zero-shot scenarios.
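A zero-shot version of this can be as simple as appending that nudge to the prompt, as in the sketch below. The exact prompt wording and model name are placeholders, not a prescription.

```python
# A minimal zero-shot Chain-of-Thought sketch: add the "think step by
# step" nudge and ask for the reasoning before the final answer.
from openai import OpenAI

client = OpenAI()

def ask_with_cot(question: str) -> str:
    prompt = (
        f"{question}\n\n"
        "Let's think step by step. Show your reasoning, then give the "
        "final answer on a new line prefixed with 'Answer:'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The exposed reasoning also gives end users something concrete to inspect, which ties directly back to transparency.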
Apart from grounding, there are also various other techniques to reduce hallucinations, including advanced prompt engineering such as Chain-of-Verification (CoVe), feedback loops with iterative improvements (or “humans in the loop”), combining fine-tuned models with RAG, asking an LLM to critique its own output, as well as tuning the “creativity” or temperature settings exposed by LLM APIs.
However, none of these methods are infallible since an LLM is essentially non-deterministic and unpredictable.
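Still, two of the cheaper levers mentioned above, lowering the temperature and asking the model to critique its own draft, are easy to experiment with. The sketch below is one way to wire them together; the model names and prompt wording are assumptions, not a recommended recipe.

```python
# An illustrative sketch: a low-temperature first pass for the draft,
# then a second "self-critique" pass that flags unsupported claims.
from openai import OpenAI

client = OpenAI()

def draft_then_critique(question: str) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=0.2,  # lower temperature -> less "creative" output
    ).choices[0].message.content

    critique = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Review the answer below. List any claims that look "
                "unsupported or likely hallucinated, then provide a "
                f"corrected answer.\n\nQuestion: {question}\n\nAnswer: {draft}"
            ),
        }],
        temperature=0,
    ).choices[0].message.content
    return critique
```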
There could be grave ethical as well as practical implications from Artificial Intelligence making decisions without any oversight. Incorrect predictions could lead to potentially life-threatening consequences in fields like medical sciences, pharmacology, autonomous vehicle design, law enforcement, or disaster management.
In the end, the safest option is for humans to have the final say in accepting or rejecting any AI-driven predictions. Having the end user in charge of vetting AI decisions also serves as a great feedback loop for developers to learn from and iterate continuously.
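In code, that gate can be as simple as pausing for an explicit approval and logging the outcome. The sketch below is a bare-bones illustration; the review interface and feedback store in a real product would be far more involved.

```python
# A bare-bones human-in-the-loop gate: the AI proposes, a person approves
# or rejects, and the decision is logged as feedback for later iteration.
import json
import time

def review_and_log(suggestion: str, feedback_log: str = "feedback.jsonl") -> bool:
    print(f"AI suggestion:\n{suggestion}\n")
    decision = input("Accept this suggestion? [y/N] ").strip().lower() == "y"
    with open(feedback_log, "a") as f:
        f.write(json.dumps({
            "timestamp": time.time(),
            "suggestion": suggestion,
            "accepted": decision,
        }) + "\n")
    return decision
```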
Efforts to reduce hallucinations by improving factual accuracy and transparency will increase user confidence in AI products. In addition, accountability is a key Responsible AI principle that must be at the center of all AI design. AI can be an amazing tool to revolutionize the way we all live and work, but ultimately it is not meant to replace human judgment.