How to Effectively Evaluate Your RAG + LLM Applications

Duy Huynh | 17m | 2023/12/27

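Hey! Ever wondered why some of today's applications seem almost magically smart? A big part of that magic comes from RAG and LLMs. Think of RAG (Retrieval-Augmented Generation) as the brainy bookworm of the AI world: it digs through large amounts of information to find exactly what your question needs. Then there are LLMs (Large Language Models), such as the well-known GPT series, which turn that material into fluent answers thanks to their impressive text generation capabilities. Put the two together and you get an AI that is not only smart but also highly relevant and context-aware. It is like pairing a super-fast research assistant with a witty conversationalist. The combination is useful for everything from quickly finding specific information to holding a surprisingly natural chat.

But here's the catch: how do we know whether our AI is actually helpful and not just spouting fancy jargon? This is where evaluation comes in. It is essential, not merely a nice-to-have. We need to make sure our AI is not only accurate but also relevant, useful, and not wandering off on strange tangents. After all, what good is a smart assistant if it can't understand what you need or gives you answers that are way off base?

Evaluating our RAG + LLM application is like a reality check. It tells us whether we are really on track to build an AI that is genuinely helpful and not just technically impressive. So in this post we will dig into how to do exactly that, making sure our AI is as good in practice as it is in theory!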

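Development Stage

In the development stage, it helps to think along the lines of a typical machine learning model evaluation pipeline. In a standard AI/ML setup we usually work with several datasets, such as training, development, and test sets, and use quantitative metrics to measure the model's effectiveness. Evaluating Large Language Models (LLMs), however, presents unique challenges. Traditional quantitative metrics struggle to capture the quality of LLM output because these models excel at generating language that is both varied and creative. As a result, it is difficult to assemble a comprehensive set of labels for effective evaluation.

In academia, researchers may use benchmarks and scores such as MMLU to rank LLMs, and may hire human experts to assess the quality of LLM outputs. These methods, however, do not transfer seamlessly to production, where development moves quickly and practical applications need results right away. It is not only about the LLM's performance: real-world requirements cover the whole process, including data retrieval, prompt composition, and the LLM's contribution. Building a human-curated benchmark for every new system iteration, or whenever the documents or domain change, is impractical. Moreover, the pace of the industry cannot accommodate a long wait for human testers to evaluate every update before deployment. Adapting academic evaluation strategies to a fast, results-focused production environment is therefore a considerable challenge.

So if you find yourself in this situation, you might consider something like a pseudo score provided by an LLM. Such a score can combine automated evaluation metrics with a distilled essence of human judgment. This hybrid approach aims to bridge the gap between the nuanced understanding of human evaluators and the scalable, systematic analysis of machine evaluation.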

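For example, if your team is developing an in-house LLM trained on your specific domain and data, the process typically requires a collaborative effort from developers, prompt engineers, and data scientists. Each member plays a key role:

Developers are the architects. They build the framework of the application, making sure the RAG + LLM chain is seamlessly integrated and can handle different scenarios with ease.
Prompt engineers are the creatives. They design scenarios and prompts that mimic real-world user interactions. They think through the "what ifs" and push the system to handle a wide range of topics and questions.
Data scientists are the strategists. They analyze the responses, dig into the data, and use their statistical expertise to assess whether the AI's performance is up to standard.

The feedback loop here is essential. As our AI responds to prompts, the team scrutinizes every output. Did the AI understand the question? Was the response accurate and relevant? Could the language be smoother? That feedback is then fed back into the system for improvement.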

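To take this up a notch, imagine using a master LLM such as OpenAI's GPT-4 as the benchmark for evaluating your in-house LLM. Your goal is to match or even surpass the performance of the GPT series, which is known for its robustness and versatility. Here is how you might proceed:

Generate a relevant dataset: Start by creating a dataset that reflects the nuances of your domain. It can be curated by experts or synthesized with the help of GPT-4 to save time, as long as it meets your gold standard.
Define success metrics: Use the strengths of the master LLM to help define your metrics. Because the master LLM can handle more complex tasks, you are free to choose the metrics that best fit your goals. For community standards, you may want to look at the work in LangChain and other libraries such as ragas, which offer metrics like faithfulness, context recall, context precision, answer similarity, and so on.
Automate your evaluation pipeline: To keep up with rapid development cycles, set up an automated pipeline. It evaluates the application's performance against your predefined metrics consistently after every update or change. Automating the process ensures that evaluation is not only thorough but also efficient to iterate on, enabling quick optimization and refinement.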
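The three steps above can be sketched as a small evaluation harness. This is a minimal illustration and not the author's pipeline: `EvalCase`, `exact_match`, `run_evaluation`, and `check_thresholds` are hypothetical names, the application is assumed to be any callable that maps a question to an answer string, and `exact_match` is only a toy stand-in for richer metrics such as the LLM-judged ones discussed in this post.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One dataset entry: a question and its reference answer."""
    question: str
    ground_truth: str


# A metric maps (prediction, ground_truth) to a score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(prediction: str, ground_truth: str) -> float:
    """Toy metric: 1.0 when the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())


def run_evaluation(
    app: Callable[[str], str],
    cases: List[EvalCase],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Run the app on every case and return the mean score per metric."""
    if not cases:
        raise ValueError("the evaluation dataset is empty")
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for case in cases:
        prediction = app(case.question)
        for name, metric in metrics.items():
            scores[name].append(metric(prediction, case.ground_truth))
    return {name: mean(values) for name, values in scores.items()}


def check_thresholds(report: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose mean score is below its minimum."""
    return [name for name, minimum in thresholds.items() if report.get(name, 0.0) < minimum]


if __name__ == "__main__":
    cases = [
        EvalCase("What is the capital of France?", "Paris"),
        EvalCase("What is 2 + 2?", "4"),
    ]
    # Stand-in for a real RAG + LLM chain.
    fake_app = lambda question: "Paris" if "France" in question else "5"
    report = run_evaluation(fake_app, cases, {"exact_match": exact_match})
    failed = check_thresholds(report, {"exact_match": 0.8})
    print(report, failed)
```

Run after every change (for example in CI), a non-empty list from `check_thresholds` can be treated as a failed build, which keeps regressions from slipping through fast iteration cycles.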

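For example, in the demo below I will show how to use OpenAI's GPT-4 to automatically evaluate a variety of open-source LLMs on a simple document retrieval conversation task.

First, we use OpenAI's GPT-4 to create a synthetic dataset derived from a document, as follows: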

    import os
    import json

    import pandas as pd
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import CharacterTextSplitter

    # Few-shot prompt: ask for one question-answer pair per context, returned as JSON.
    QA_DATASET_GENERATION_PROMPT = PromptTemplate.from_template(
        "You are an expert at generating question-and-answer datasets based on a given context. You are given a context. "
        "Your task is to generate a question and answer based on the context. The generated question should be "
        "answerable using the given context. The generated question-and-answer pair must be grammatically "
        "and semantically correct. Your response must be in a json format with 2 keys: question, answer. For example,"
        "\n\n"
        "Context: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower."
        "\n\n"
        "Response: {{\n"
        "    \"question\": \"Where is France and what is its capital?\",\n"
        "    \"answer\": \"France is in Western Europe and its capital is Paris.\"\n"
        "}}"
        "\n\n"
        "Context: The University of California, Berkeley is a public land-grant research university in Berkeley, California. Established in 1868 as the state's first land-grant university, it was the first campus of the University of California system and a founding member of the Association of American Universities."
        "\n\n"
        "Response: {{\n"
        "    \"question\": \"When was the University of California, Berkeley established?\",\n"
        "    \"answer\": \"The University of California, Berkeley was established in 1868.\"\n"
        "}}"
        "\n\n"
        "Now your task is to generate a question-and-answer dataset based on the following context:"
        "\n\n"
        "Context: {context}"
        "\n\n"
        "Response: ",
    )

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    if OPENAI_API_KEY is None:
        raise ValueError("OPENAI_API_KEY is not set")

    llm = ChatOpenAI(
        model="gpt-4-1106-preview",
        api_key=OPENAI_API_KEY,
        temperature=0.7,
        # JSON mode is an OpenAI request parameter, so it is passed through model_kwargs.
        model_kwargs={"response_format": {"type": "json_object"}},
    )
    chain = LLMChain(prompt=QA_DATASET_GENERATION_PROMPT, llm=llm)

    # Load the PDF and split it into chunks that serve as contexts.
    file_loader = PyPDFLoader("./data/cidr_lakehouse.pdf")
    text_splitter = CharacterTextSplitter(chunk_size=1000)
    chunks = text_splitter.split_documents(file_loader.load())

    questions, answers = [], []
    for chunk in chunks:
        # Generate two question-answer pairs per chunk.
        for _ in range(2):
            # Pass the chunk text, not the Document object, into the prompt.
            response = chain.invoke({"context": chunk.page_content})
            obj = json.loads(response["text"])
            questions.append(obj["question"])
            answers.append(obj["answer"])

    df = pd.DataFrame({"question": questions, "answer": answers})
    df.to_csv("./data/cidr_lakehouse_qa.csv", index=False)

After running the code above, we get a CSV file as the result. It contains question-and-answer pairs about the document we supplied.
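Synthetic data is only as good as the responses that come back, so it can be worth validating each generated pair before it lands in the CSV. The sketch below is an optional addition, not part of the original script: `parse_qa_response` and `dedupe_pairs` are hypothetical helpers, and the only assumption carried over from the prompt above is that each response should be a JSON object with `question` and `answer` keys.

```python
import json
from typing import List, Optional, Tuple


def parse_qa_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer) if text is a JSON object with both keys
    present as non-empty strings; otherwise return None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    question, answer = obj.get("question"), obj.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    question, answer = question.strip(), answer.strip()
    if not question or not answer:
        return None
    return question, answer


def dedupe_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first pair for each question, compared case-insensitively."""
    seen = set()
    unique = []
    for question, answer in pairs:
        key = question.lower()
        if key not in seen:
            seen.add(key)
            unique.append((question, answer))
    return unique


if __name__ == "__main__":
    raw = [
        '{"question": "What is a lakehouse?", "answer": "A data architecture."}',
        '{"question": "what is a lakehouse?", "answer": "Duplicate question."}',
        "not json at all",
    ]
    parsed = [pair for pair in map(parse_qa_response, raw) if pair is not None]
    print(dedupe_pairs(parsed))
```

Skipping malformed responses instead of letting `json.loads` raise means one bad generation does not abort a long and fairly expensive run.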

Next, we use LangChain to build a simple DocumentRetrievalQA chain and swap in several open-source LLMs running locally via Ollama. You can find my earlier tutorial here.

    from tqdm import tqdm
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOllama
    from langchain.vectorstores import FAISS
    from langchain.embeddings import HuggingFaceEmbeddings

    # Index the same chunks that were used to generate the dataset.
    vector_store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
    retriever = vector_store.as_retriever()


    def test_local_retrieval_qa(model: str):
        """Answer every question with a local Ollama model and store the results in df."""
        chain = RetrievalQA.from_llm(
            llm=ChatOllama(model=model),
            retriever=retriever,
        )
        predictions = []
        for _, row in tqdm(df.iterrows(), total=len(df)):
            resp = chain.invoke({"query": row["question"]})
            predictions.append(resp["result"])
        df[f"{model}_result"] = predictions


    test_local_retrieval_qa("mistral")
    test_local_retrieval_qa("llama2")
    test_local_retrieval_qa("zephyr")
    test_local_retrieval_qa("orca-mini")
    test_local_retrieval_qa("phi")

    df.to_csv("./data/cidr_lakehouse_qa_retrieval_prediction.csv", index=False)

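To sum up, the code above sets up a simple document retrieval chain. We run it with several models: Mistral, Llama2, Zephyr, Orca-mini, and Phi. As a result, five extra columns are added to the existing DataFrame, one holding each model's predictions.

Now we define a master chain using OpenAI's GPT-4 to evaluate the predictions. In this setup we compute a correctness score that approximates the F1 score common in traditional AI/ML problems. To do so, we borrow the parallel concepts of true positives (TP), false positives (FP), and false negatives (FN), defined as follows:

TP: statements that are present in both the answer and the ground truth.
FP: statements present in the answer but not found in the ground truth.
FN: relevant statements found in the ground truth but omitted from the answer.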

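With these definitions, we can compute precision, recall, and the F1 score using the following formulas:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$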
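As a quick sanity check on the arithmetic, consider the two few-shot examples embedded in the evaluation prompt further down. The sun example lists one shared statement, two statements found only in the answer, and five omitted from it; the boiling-point example lists one, zero, and two. The helper below is a hypothetical stand-alone function, and it makes one assumption the original script does not: a zero denominator yields 0.0 instead of NaN.

```python
from typing import Tuple


def correctness_scores(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from statement counts.

    Assumption: a zero denominator yields 0.0 (the NumPy-based script
    in this post would produce NaN for such rows instead).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Sun example: TP=1, FP=2, FN=5.
    p, r, f1 = correctness_scores(1, 2, 5)
    print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.3333 0.1667 0.2222

    # Boiling-point example: TP=1, FP=0, FN=2.
    p, r, f1 = correctness_scores(1, 0, 2)
    print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 1.0000 0.3333 0.5000
```

The sun answer scores low on both axes because it asserts wrong facts (hurting precision) and leaves out most of the ground truth (hurting recall), which is the behavior the correctness score is meant to capture.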

    import os
    import json

    import numpy as np
    from tqdm import tqdm
    from langchain.chains import LLMChain
    from langchain.chat_models import ChatOpenAI
    from langchain.prompts import PromptTemplate

    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    if OPENAI_API_KEY is None:
        raise ValueError("OPENAI_API_KEY is not set")

    CORRECTNESS_PROMPT = PromptTemplate.from_template(
        """
    Extract following from given question and ground truth. Your response must be in a json format with 3 keys and does not need to be in any specific order:
    - statements that are present in both the answer and the ground truth
    - statements present in the answer but not found in the ground truth
    - relevant statements found in the ground truth but omitted in the answer

    Please be concise and do not include any unnecessary information. You should classify the statements as claims, facts, or opinions with semantic matching, no need exact word-by-word matching.

    Question:What powers the sun and what is its primary function?
    Answer: The sun is powered by nuclear fission, similar to nuclear reactors on Earth, and its primary function is to provide light to the solar system.
    Ground truth: The sun is actually powered by nuclear fusion, not fission. In its core, hydrogen atoms fuse to form helium, releasing a tremendous amount of energy. This energy is what lights up the sun and provides heat and light, essential for life on Earth. The sun's light also plays a critical role in Earth's climate system and helps to drive the weather and ocean currents.
    Extracted statements:
    [
        {{
            "statements that are present in both the answer and the ground truth": ["The sun's primary function is to provide light"],
            "statements present in the answer but not found in the ground truth": ["The sun is powered by nuclear fission", "similar to nuclear reactors on Earth"],
            "relevant statements found in the ground truth but omitted in the answer": ["The sun is powered by nuclear fusion, not fission", "In its core, hydrogen atoms fuse to form helium, releasing a tremendous amount of energy", "This energy provides heat and light, essential for life on Earth", "The sun's light plays a critical role in Earth's climate system", "The sun helps to drive the weather and ocean currents"]
        }}
    ]

    Question: What is the boiling point of water?
    Answer: The boiling point of water is 100 degrees Celsius at sea level.
    Ground truth: The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level, but it can change with altitude.
    Extracted statements:
    [
        {{
            "statements that are present in both the answer and the ground truth": ["The boiling point of water is 100 degrees Celsius at sea level"],
            "statements present in the answer but not found in the ground truth": [],
            "relevant statements found in the ground truth but omitted in the answer": ["The boiling point can change with altitude", "The boiling point of water is 212 degrees Fahrenheit at sea level"]
        }}
    ]

    Question: {question}
    Answer: {answer}
    Ground truth: {ground_truth}
    Extracted statements:""",
    )

    # The judge: GPT-4 at temperature 0 for more repeatable verdicts.
    judy_llm = ChatOpenAI(
        model="gpt-4-1106-preview",
        api_key=OPENAI_API_KEY,
        temperature=0.0,
        # JSON mode is an OpenAI request parameter, so it is passed through model_kwargs.
        model_kwargs={"response_format": {"type": "json_object"}},
    )
    judy_chain = LLMChain(prompt=CORRECTNESS_PROMPT, llm=judy_llm)

    KEY_MAP = {
        "TP": "statements that are present in both the answer and the ground truth",
        "FP": "statements present in the answer but not found in the ground truth",
        "FN": "relevant statements found in the ground truth but omitted in the answer",
    }


    def evaluate_correctness(column_name: str):
        """Score one prediction column and add recall, precision, and correctness columns to df."""
        TP, FP, FN = [], [], []
        for _, row in tqdm(df.iterrows(), total=len(df)):
            resp = judy_chain.invoke({
                "question": row["question"],
                "answer": row[column_name],
                "ground_truth": row["answer"],
            })
            obj = json.loads(resp["text"])
            # The few-shot examples wrap the object in a list; accept either shape.
            if isinstance(obj, list):
                obj = obj[0]
            TP.append(len(obj[KEY_MAP["TP"]]))
            FP.append(len(obj[KEY_MAP["FP"]]))
            FN.append(len(obj[KEY_MAP["FN"]]))

        # Convert to numpy arrays for element-wise arithmetic.
        TP, FP, FN = np.array(TP), np.array(FP), np.array(FN)

        # Rows with a zero denominator become NaN; pandas' mean() skips NaN by default.
        recall = TP / (TP + FN)
        precision = TP / (TP + FP)
        df[f"{column_name}_recall"] = recall
        df[f"{column_name}_precision"] = precision
        df[f"{column_name}_correctness"] = 2 * recall * precision / (recall + precision)


    MODELS = ["mistral", "llama2", "zephyr", "orca-mini", "phi"]
    for model in MODELS:
        evaluate_correctness(f"{model}_result")

    print("|====Model====|=== Recall ===|== Precision ==|== Correctness ==|")
    for model in MODELS:
        recall = df[f"{model}_result_recall"].mean()
        precision = df[f"{model}_result_precision"].mean()
        correctness = df[f"{model}_result_correctness"].mean()
        print(f"|{model:<13}| {recall:.4f} | {precision:.4f} | {correctness:.4f} |")
    print("|==============================================================|")

    df.to_csv("./data/cidr_lakehouse_qa_retrieval_prediction_correctness.csv", index=False)

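OK, we now have a simple benchmark for several models. It can be seen as a preliminary indicator of how each model handles the document retrieval task. The numbers provide a snapshot, but they are only the beginning of the story. They serve as a baseline for understanding which models are better at retrieving accurate and relevant information from a given corpus. You can find the source code here.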

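Human-in-the-Loop Feedback

When it comes to tuning our AI through human-in-the-loop feedback, the synergy between human testers and the master LLM is key. This relationship is not just about collecting feedback; it is about building a responsive AI system that adapts to and learns from human input.

The Interactive Process

Tester's input: Testers interact with the RAG + LLM chain and assess its outputs from a human perspective. They give feedback on aspects such as the relevance, accuracy, and naturalness of the AI's responses.
Feedback to the master LLM: This is where the magic happens. The human testers' feedback is passed directly to the master LLM. Unlike a standard model, the master LLM is meant to understand and interpret this feedback in order to refine subsequent outputs.
Prompt tuning by the master LLM: Armed with this feedback, the master LLM adjusts the prompts for our in-development LLM. The process is like a mentor guiding a student: the master LLM carefully revises how the in-development LLM interprets and responds to prompts, leading to a more effective and context-aware response mechanism.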
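The loop described above can be sketched in a few lines. This is one possible shape under stated assumptions, not a description of any specific product: `Feedback`, `build_revision_request`, and `revise_prompt` are hypothetical names, the 1-to-5 rating scale is invented for illustration, and `master_llm` is assumed to be any callable that takes a prompt string and returns text (for example a thin wrapper around a GPT-4 chain).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feedback:
    """One tester judgment on one response."""
    question: str
    response: str
    rating: int  # hypothetical scale: 1 (poor) to 5 (excellent)
    comment: str


def build_revision_request(current_prompt: str, problems: List[Feedback]) -> str:
    """Assemble a meta-prompt asking the master LLM to rewrite current_prompt."""
    lines = [
        "You are improving the prompt of a question-answering assistant.",
        "Current prompt:",
        current_prompt,
        "",
        "Human testers flagged these responses:",
    ]
    for item in problems:
        lines.append(f"- Question: {item.question}")
        lines.append(f"  Response: {item.response}")
        lines.append(f"  Rating: {item.rating}/5. Comment: {item.comment}")
    lines.append("")
    lines.append("Return only the revised prompt.")
    return "\n".join(lines)


def revise_prompt(
    current_prompt: str,
    feedback: List[Feedback],
    master_llm: Callable[[str], str],
    max_rating: int = 3,
) -> str:
    """Ask the master LLM for a revised prompt based on low-rated feedback.

    The current prompt is returned unchanged when no feedback is rated at
    or below max_rating, or when the master LLM returns an empty string.
    """
    problems = [item for item in feedback if item.rating <= max_rating]
    if not problems:
        return current_prompt
    revised = master_llm(build_revision_request(current_prompt, problems)).strip()
    return revised or current_prompt


if __name__ == "__main__":
    feedback = [
        Feedback("What is a lakehouse?", "I don't know.", 1, "Ignored the retrieved context."),
        Feedback("Who wrote the paper?", "The authors are listed on page 1.", 5, "Good."),
    ]
    # Stub standing in for a real master LLM call.
    stub_llm = lambda request: "Answer strictly from the provided context."
    print(revise_prompt("Answer the question.", feedback, stub_llm))
```

Filtering to low-rated items keeps the meta-prompt short and focused on what went wrong; a revised prompt would still need to go back through the automated evaluation before it replaces the current one.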

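The Dual Role of the Master LLM

The master LLM serves both as a benchmark for the in-house LLM and as an active participant in the feedback loop. It assesses the feedback, adjusts the prompts or model parameters, and essentially "learns" from human interactions.

The Benefit of Real-Time Adaptation

This process is transformative. It lets the AI adapt in real time, making it more agile and better aligned with the complexities of human language and thought. Real-time adaptation keeps the AI's learning curve steep and continuous.

The Cycle of Improvement

Through this cycle of interaction, feedback, and adaptation, our AI becomes more than a tool; it becomes a learning entity that improves with every interaction with a human tester. This human-in-the-loop model ensures that our AI does not stagnate but keeps evolving into a more efficient and intuitive assistant.

In short, human-in-the-loop feedback is not just about collecting human insights; it is about creating a dynamic, adaptable AI that can fine-tune its behavior to serve users better. This iterative process keeps our RAG + LLM application at the leading edge, delivering not just answers but contextually relevant, nuanced responses that reflect a real understanding of the user's needs.

For a simple demo, you can watch in this video how ClearML uses this concept to enhance Promptimizer.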

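Operation Stage

Moving into the operation stage is like going from dress rehearsal to opening night. Here our RAG + LLM applications are no longer hypothetical entities; they become active participants in the daily workflows of real users. This stage is the litmus test for all the preparation and fine-tuning done during development.

In this stage, our teams (operations, product, and analytics) coordinate to deploy and manage the applications, making sure that everything we have built not only works but thrives in a live environment. This is where we can consider A/B testing strategies to measure the effectiveness of our applications in a controlled way.

A/B testing framework: We split our user base into two segments: a control segment that keeps using the established version of the application (Version 1), and a test segment that tries the new features in Version 2 (in practice you can also run several A/B tests at the same time). This lets us gather comparative data on user experience, feature acceptance, and overall performance.
Operational rollout: The operations team is responsible for rolling out both versions smoothly, making sure the infrastructure is robust and that any version transition is seamless for users.
Product evolution: The product team keeps its finger on the pulse of user feedback and works to iterate on the product, making sure new features align with user needs and the overall product vision.
Analytical insights: The analytics team rigorously examines the data collected from the A/B test. Their insights are essential for deciding whether the new version outperforms the old one and whether it is ready for a wider release.
Performance metrics: Key performance indicators (KPIs) are monitored to measure the success of each version. They include user engagement metrics, satisfaction scores, and the accuracy of the application's outputs.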
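Two pieces of the A/B setup above lend themselves to a short sketch: assigning users to a segment, and checking whether a difference in a KPI is likely to be more than noise. The code below is an illustration under assumptions, not a prescribed method: `assign_variant` and `two_proportion_z_test` are hypothetical names, hashing is one common way to get stable assignments, the KPI is assumed to be a simple success rate (for example thumbs-up responses), and the counts in the demo are made up.

```python
import hashlib
import math
from typing import Tuple


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user id keeps a user in
    the same segment across sessions and decorrelates concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(
    success_a: int, total_a: int, success_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    print(assign_variant("user-42", "rag-v2-rollout"))

    # Made-up counts: satisfied responses out of total responses per version.
    z, p_value = two_proportion_z_test(400, 1000, 460, 1000)
    print(f"z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the gap between versions is unlikely to be chance alone, but it says nothing about whether the gap is large enough to matter; that call stays with the product and analytics teams.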

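The operation stage is dynamic, informed by continuous feedback loops that improve not only the applications but also user engagement and satisfaction. It is a stage characterized by monitoring, analysis, iteration, and above all, learning from live data.

As we go through this stage, our goal is not only to maintain the high standards set during development but to exceed them, keeping our RAG + LLM application at the forefront of innovation and usability.

Conclusion

In summary, the integration of Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) marks a significant advance in AI, blending deep data retrieval with sophisticated text generation. But it calls for a sound, effective evaluation method and an iterative development strategy. The development stage emphasizes tailoring the AI's evaluation and enhancing it with human-in-the-loop feedback, so that these systems are empathetic and adaptable to real-world scenarios. This approach highlights the evolution of AI from a mere tool to a collaborative partner. The operation stage tests these applications in real-world settings, using strategies such as A/B testing and continuous feedback loops to ensure effectiveness and ongoing evolution based on user interaction.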

About Author

Duy Huynh (@vndee)

Retired competitive programmer, passionate engineer, and a science enthusiast.
