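Learn How to Build an AI Agent for Research Paper Retrieval, Search, and Summarization

For researchers, staying up to date with the latest findings is like finding a needle in a haystack. Imagine an AI-powered assistant that not only retrieves the most relevant papers but also summarizes key insights and answers your specific questions, all in real time.

This article walks through building such an AI research agent using Superlinked's complex document embedding capabilities. By integrating semantic and temporal relevance, we remove the need for complex re-ranking while keeping retrieval efficient and accurate.

TL;DR: Build a real-time AI research agent with Superlinked's vector search. It skips complex RAG pipelines by embedding and querying documents directly, making research faster, simpler, and smarter.

(Want to jump straight to the code? The project is open source on GitHub. Ready to try semantic search for your own agentic use case? We're here to help.)

This article shows how to build an agent system that uses a kernel agent to handle queries. The code is also available as a Colab notebook.

Where to start building a research assistant system?

Traditionally, building such a system involves considerable complexity and resource investment. Search systems typically retrieve a broad initial set of documents based on relevance, then apply a secondary re-ranking step to refine and reorder the results. Re-ranking improves accuracy, but because it depends on that broad initial retrieval, it significantly increases computational complexity, latency, and overhead.

Building an agentic system with Superlinked

This AI agent can do three things:

- Find papers: search for research papers by topic (for example, "quantum computing"), then rank them by relevance and recency.
- Summarize papers: condense the retrieved papers into bite-sized insights.
- Answer questions: extract answers directly from specific research papers based on targeted user queries.

Superlinked removes the need for re-ranking by improving the relevance of the vector search itself. We use Superlinked's RecencySpace, which encodes temporal metadata so that recent documents are prioritized during retrieval, without a computationally expensive re-ranking step.

Step 1: Setting up the toolbox

    %pip install superlinked

To keep things simple and modular, I created an abstract Tool class, which streamlines building and adding tools.

    import pandas as pd
    import superlinked.framework as sl
    from datetime import timedelta
    from sentence_transformers import SentenceTransformer
    from openai import OpenAI
    import os
    from abc import ABC, abstractmethod
    from typing import Any, Optional, Dict
    from tqdm import tqdm
    from google.colab import userdata

    # Abstract Tool Class
    class Tool(ABC):
        @abstractmethod
        def name(self) -> str:
            pass

        @abstractmethod
        def description(self) -> str:
            pass

        @abstractmethod
        def use(self, *args, **kwargs) -> Any:
            pass

    # Get the API key from Google Colab secrets, falling back to the environment
    try:
        api_key = userdata.get('OPENAI_API_KEY')
    except userdata.SecretNotFoundError:
        api_key = os.environ.get("OPENAI_API_KEY")

    if not api_key:
        raise ValueError(
            "OPENAI_API_KEY not found. Add it under Colab user secrets "
            "or set the OPENAI_API_KEY environment variable."
        )

    # Initialize OpenAI Client
    client = OpenAI(api_key=api_key)
    model = "gpt-4"

Step 2: Understanding the dataset

This example uses a dataset of roughly 10,000 AI research papers, available on Kaggle. For convenience, just run the cell below and it will download the dataset to your working directory. You can also use your own data sources, such as research papers or other academic content.

    import pandas as pd

    !wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1FCR3TW5yLjGhEmm-Uclw0_5PWVEaLk1j' -O arxiv_ai_data.csv

For now, to keep things fast, we use a smaller subset of the papers, but feel free to try the example with the full dataset. One important technical detail: the dataset's timestamps are converted from strings (such as 1993-08-01 00:00:00+00:00) into pandas datetime objects.

    df = pd.read_csv('arxiv_ai_data.csv').head(100)

    # Convert to datetime but keep it as datetime (more readable and usable)
    df['published'] = pd.to_datetime(df['published'])

    # Ensure summary is a string
    df['summary'] = df['summary'].astype(str)

    # Add 'text' column for similarity search
    df['text'] = df['title'] + " " + df['summary']

    Debug: Columns in original DataFrame: ['authors', 'categories', 'comment', 'doi', 'entry_id', 'journal_ref', 'pdf_url', 'primary_category', 'published', 'summary', 'title', 'updated']

Understanding the dataset columns

Here is a brief overview of the key columns in our dataset, which play an important role in the following steps:

- published: the publication date of the research paper.
- summary: the paper's abstract, providing a concise overview.
- entry_id: the unique identifier of each paper from arXiv.

For this demo, we focus on four columns: entry_id, published, title, and summary. To optimize retrieval quality, the title and summary are combined into a single text column, which forms the core of our embedding and search process.

A note on Superlinked's in-memory index: it stores our dataset directly in RAM, which makes retrieval very fast and well suited to real-time search and rapid prototyping.

Step 3: Defining the Superlinked schema

To move forward, we need a schema to map our data. PaperSchema defines the key fields:

    class PaperSchema(sl.Schema):
        text: sl.String
        published: sl.Timestamp  # This will handle datetime objects properly
        entry_id: sl.IdField
        title: sl.String
        summary: sl.String

    paper = PaperSchema()

Defining Superlinked spaces for effective retrieval

An essential step in organizing and querying our dataset effectively is defining two specialized vector spaces: TextSimilaritySpace and RecencySpace.

TextSimilaritySpace

The TextSimilaritySpace encodes textual information, such as the titles and abstracts of research papers, into vectors. By converting text into embeddings, this space makes semantic search both simpler and more accurate.

    text_space = sl.TextSimilaritySpace(
        # Split long texts into overlapping chunks before embedding
        text=sl.chunk(paper.text, chunk_size=200, chunk_overlap=50),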
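        model="sentence-transformers/all-mpnet-base-v2"
    )

RecencySpace

The RecencySpace captures temporal metadata, emphasizing how recent a publication is. By encoding timestamps, this space gives newer documents greater weight. As a result, search results naturally balance content relevance against publication date, favoring recent insights.

    recency_space = sl.RecencySpace(
        timestamp=paper.published,
        period_time_list=[
            sl.PeriodTime(timedelta(days=365)),    # papers within 1 year
            sl.PeriodTime(timedelta(days=2*365)),  # papers within 2 years
            sl.PeriodTime(timedelta(days=3*365)),  # papers within 3 years
        ],
        negative_filter=-0.25
    )

Think of RecencySpace as a time-based filter, similar to sorting your email by date or viewing Instagram posts with the newest first.

- Smaller time periods (such as 365 days) allow finer-grained, year-by-year ranking.
- Larger time periods (such as 1,095 days) create broader time windows.

The negative_filter assigns a penalty score to papers older than the longest period. To make this clearer, consider the following example, in which two papers have identical content relevance but their ranking depends on their publication dates.

    Paper A: Published in 1996
    Paper B: Published in 1993

    Scoring example:
    - Text similarity score: Both papers get 0.8
    - Recency score:
      - Paper A: Receives the full recency boost (1.0)
      - Paper B: Gets penalized (-0.25 due to negative_filter)

    Final combined scores:
    - Paper A: Higher final rank
    - Paper B: Lower final rank

These spaces are key to making the dataset more accessible and effective. They enable both content-based and time-based search, and they help capture both the relevance and the recency of research papers.

Step 4: Building the index

Next, the spaces are combined into an index, which is the core of the search engine:

    paper_index = sl.Index([text_space, recency_space])

The DataFrame is then mapped to the schema and loaded into the in-memory store in batches (10 papers at a time):

    # Parser to map DataFrame columns to schema fields
    parser = sl.DataFrameParser(
        paper,
        mapping={
            paper.entry_id: "entry_id",
            paper.published: "published",
            paper.text: "text",
            paper.title: "title",
            paper.summary: "summary",
        }
    )

    # Set up in-memory source and executor
    source = sl.InMemorySource(paper, parser=parser)
    executor = sl.InMemoryExecutor(sources=[source], indices=[paper_index])
    app = executor.run()

    # Load the DataFrame with a progress bar using batches
    batch_size = 10
    data_batches = [df[i:i + batch_size] for i in range(0, len(df), batch_size)]
    for batch in tqdm(data_batches, total=len(data_batches), desc="Loading Data into Source"):
        source.put([batch])

The in-memory executor is why Superlinked shines here: 1,000 papers fit comfortably in RAM, and queries run without disk I/O bottlenecks.

Step 5: Crafting the query

Next comes query creation, where we define the query template.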
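To make the weighting concrete before the real query is defined, here is a toy illustration in plain Python. It reuses the scores from the Paper A / Paper B example above and blends them with the weights used later in this article (1.0 for relevance, 0.5 for recency). The `ScoredPaper` class and the `combined_score` and `rank` helpers are hypothetical names introduced for this sketch, and the additive formula is an assumption chosen for readability; it is not a description of how Superlinked computes results internally.

```python
from dataclasses import dataclass


@dataclass
class ScoredPaper:
    title: str
    similarity: float  # text similarity score, taken from the example above
    recency: float     # recency score, taken from the example above


def combined_score(paper: ScoredPaper, relevance_weight: float, recency_weight: float) -> float:
    """Hypothetical additive blend of the two per-space scores (an assumption)."""
    return relevance_weight * paper.similarity + recency_weight * paper.recency


def rank(papers, relevance_weight: float, recency_weight: float):
    """Return the papers sorted from highest to lowest combined score."""
    return sorted(
        papers,
        key=lambda p: combined_score(p, relevance_weight, recency_weight),
        reverse=True,
    )


paper_a = ScoredPaper("Paper A (1996)", similarity=0.8, recency=1.0)
paper_b = ScoredPaper("Paper B (1993)", similarity=0.8, recency=-0.25)

ranked = rank([paper_b, paper_a], relevance_weight=1.0, recency_weight=0.5)
for p in ranked:
    print(p.title, round(combined_score(p, 1.0, 0.5), 3))
```

With these weights Paper A scores 1.3 and Paper B scores 0.675, so Paper A ranks first. Raising the recency weight widens the gap, and setting it to 0 makes the two papers tie.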
The query template needs to balance relevance and recency:

    # Define the query
    knowledgebase_query = (
        sl.Query(
            paper_index,
            weights={
                text_space: sl.Param("relevance_weight"),
                recency_space: sl.Param("recency_weight"),
            }
        )
        .find(paper)
        .similar(text_space, sl.Param("search_query"))
        .select(paper.entry_id, paper.published, paper.text, paper.title, paper.summary)
        .limit(sl.Param("limit"))
    )

This lets us choose whether to prioritize content (relevance_weight) or recency (recency_weight), a very useful combination for our agent's needs.

Step 6: Building the tools

Now for the tools. We will build three of them.

Retrieval Tool: this tool plugs into Superlinked's index and pulls the top five papers for a query. It balances relevance (weight 1.0) against recency (weight 0.5) to achieve the "find papers" goal: finding papers that are relevant to the query.

    class RetrievalTool(Tool):
        def __init__(self, df, app, knowledgebase_query, client, model):
            self.df = df
            self.app = app
            self.knowledgebase_query = knowledgebase_query
            self.client = client
            self.model = model

        def name(self) -> str:
            return "RetrievalTool"

        def description(self) -> str:
            return "Retrieves a list of relevant papers based on a query using Superlinked."

        def use(self, query: str) -> pd.DataFrame:
            result = self.app.query(
                self.knowledgebase_query,
                relevance_weight=1.0,
                recency_weight=0.5,
                search_query=query,
                limit=5
            )
            df_result = sl.PandasConverter.to_pandas(result)
            # Ensure summary is a string
            if 'summary' in df_result.columns:
                df_result['summary'] = df_result['summary'].astype(str)
            else:
                print("Warning: 'summary' column not found in retrieved DataFrame.")
            return df_result

Next up is the Summarization Tool. It is designed for cases where a concise summary of a paper is needed. To use it, supply paper_ids, the IDs of the papers to summarize. If no paper_ids are provided, the tool cannot work, because these IDs are required to find the corresponding papers in the dataset.

    class SummarizationTool(Tool):
        def __init__(self, df, client, model):
            self.df = df
            self.client = client
            self.model = model

        def name(self) -> str:
            return "SummarizationTool"

        def description(self) -> str:
            return "Generates a concise summary of specified papers using an LLM."

        def use(self, query: str, paper_ids: list) -> str:
            # Look up the requested papers by their arXiv entry_id
            papers = self.df[self.df['entry_id'].isin(paper_ids)]
            if papers.empty:
                return "No papers found with the given IDs."
            summaries = papers['summary'].tolist()
            summary_str = "\n\n".join(summaries)
            prompt = f"""
            Summarize the following paper summaries:\n\n{summary_str}\n\nProvide a concise summary.
            """
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=500
            )
            return response.choices[0].message.content.strip()

Finally, we have the QuestionAnsweringTool. This tool chains the RetrievalTool to fetch relevant papers and then uses them to answer the question. If no relevant papers are found, it answers from general knowledge.

    class QuestionAnsweringTool(Tool):
        def __init__(self, retrieval_tool, client, model):
            self.retrieval_tool = retrieval_tool
            self.client = client
            self.model = model

        def name(self) -> str:
            return "QuestionAnsweringTool"

        def description(self) -> str:
            return "Answers questions about research topics using retrieved paper summaries or general knowledge if no specific context is available."

        def use(self, query: str) -> str:
            df_result = self.retrieval_tool.use(query)
            if 'summary' not in df_result.columns:
                # Tag as a general question if summary is missing
                prompt = f"""
                You are a knowledgeable research assistant.
                This is a general question tagged as [GENERAL].
                Answer based on your broad knowledge, not limited to specific paper summaries.
                If you don't know the answer, provide a brief explanation of why.

                User's question: {query}
                """
            else:
                # Use paper summaries for specific context
                contexts = df_result['summary'].tolist()
                context_str = "\n\n".join(contexts)
                prompt = f"""
                You are a research assistant.
                Use the following paper summaries to answer the user's question.
                If you don't know the answer based on the summaries, say 'I don't know.'
                Paper summaries:
                {context_str}

                User's question: {query}
                """
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=500
            )
            return response.choices[0].message.content.strip()

Step 7: Building the kernel agent

Next is the kernel agent. It acts as the central controller, keeping operations smooth and efficient. As the core component of the system, the kernel agent coordinates communication by routing each query according to its intent, which matters when multiple agents operate at the same time.

    class KernelAgent:
        def __init__(self, retrieval_tool: RetrievalTool, summarization_tool: SummarizationTool,
                     question_answering_tool: QuestionAnsweringTool, client, model):
            self.retrieval_tool = retrieval_tool
            self.summarization_tool = summarization_tool
            self.question_answering_tool = question_answering_tool
            self.client = client
            self.model = model

        def classify_query(self, query: str) -> str:
            prompt = f"""
            Classify the following user prompt into one of the three categories:
            - retrieval: The user wants to find a list of papers based on some criteria (e.g., 'Find papers on AI ethics from 2020').
            - summarization: The user wants to summarize a list of papers (e.g., 'Summarize papers with entry_id 123, 456, 789').
            - question_answering: The user wants to ask a question about research topics and get an answer (e.g., 'What is the latest development in AI ethics?').

            User prompt: {query}

            Respond with only the category name (retrieval, summarization, question_answering).
            If unsure, respond with 'unknown'.
            """
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=10
            )
            classification = response.choices[0].message.content.strip().lower()
            print(f"Query type: {classification}")
            return classification

        def process_query(self, query: str, params: Optional[Dict] = None) -> str:
            # Route the query to the tool that matches its classified intent
            query_type = self.classify_query(query)
            if query_type == 'retrieval':
                df_result = self.retrieval_tool.use(query)
                response = "Here are the top papers:\n"
                for i, row in df_result.iterrows():
                    # Ensure summary is a string and handle empty cases
                    summary = str(row['summary']) if pd.notna(row['summary']) else ""
                    response += f"{i+1}. {row['title']} \nSummary: {summary[:200]}...\n\n"
                return response
            elif query_type == 'summarization':
                if not params or 'paper_ids' not in params:
                    return "Error: Summarization query requires a 'paper_ids' parameter with a list of entry_ids."
                return self.summarization_tool.use(query, params['paper_ids'])
            elif query_type == 'question_answering':
                return self.question_answering_tool.use(query)
            else:
                return "Error: Unable to classify query as 'retrieval', 'summarization', or 'question_answering'."

At this stage, all components of the research agent system are configured. The system can now be initialized by giving the kernel agent the appropriate tools, after which it is fully operational.

    retrieval_tool = RetrievalTool(df, app, knowledgebase_query, client, model)
    summarization_tool = SummarizationTool(df, client, model)
    question_answering_tool = QuestionAnsweringTool(retrieval_tool, client, model)

    # Initialize KernelAgent
    kernel_agent = KernelAgent(retrieval_tool, summarization_tool, question_answering_tool, client, model)

Now let's test the system.

    # Test query
    print(kernel_agent.process_query("Find papers on quantum computing in last 10 years"))

Running this query activates the RetrievalTool. It fetches papers based on both relevance and recency and returns the relevant columns. If the returned result includes the summary column (indicating that the papers were retrieved from the dataset), it uses those summaries and returns them to us.

    Query type: retrieval
    Here are the top papers:
    1. Quantum Computing and Phase Transitions in Combinatorial Search
    Summary: We introduce an algorithm for combinatorial search on quantum computers that is capable of significantly concentrating amplitude into solutions for some NP search problems, on average. This is done by...

    1. The Road to Quantum Artificial Intelligence
    Summary: This paper overviews the basic principles and recent advances in the emerging field of Quantum Computation (QC), highlighting its potential application to Artificial Intelligence (AI). The paper provi...

    1. Solving Highly Constrained Search Problems with Quantum Computers
    Summary: A previously developed quantum search algorithm for solving 1-SAT problems in a single step is generalized to apply to a range of highly constrained k-SAT problems. We identify a bound on the number o...

    1. The model of quantum evolution
    Summary: This paper has been withdrawn by the author due to extremely unscientific errors....

    1. Artificial and Biological Intelligence
    Summary: This article considers evidence from physical and biological sciences to show machines are deficient compared to biological systems at incorporating intelligence. Machines fall short on two counts: fi...

Let's try another query. This time, a summarization.

    print(kernel_agent.process_query("Summarize this paper", params={"paper_ids": ["http://arxiv.org/abs/cs/9311101v1"]}))

    Query type: summarization
    This paper discusses the challenges of learning logic programs that contain the cut predicate (!). Traditional learning methods cannot handle clauses with cut because it has a procedural meaning. The proposed approach is to first generate a candidate base program that covers positive examples, and then make it consistent by inserting cut where needed. Learning programs with cut is difficult due to the need for intensional evaluation, and current induction techniques may need to be limited to purely declarative logic languages.

I hope this example is helpful for developing AI agents and agent-based systems. Much of the retrieval functionality shown here is powered by Superlinked, so consider starring the repository for future reference when your AI agents need accurate retrieval capabilities!
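A practical note on the routing in Step 7: `process_query` compares the model's reply against exact category names, so a reply such as "Retrieval." or "question answering" would fall through to the error branch. Below is a minimal sketch of a normalization step, assuming it would be applied to the reply inside `classify_query` before it is returned. `normalize_category` and `VALID_CATEGORIES` are hypothetical names introduced here; they are not part of the original notebook or of any library.

```python
import re

# The three routes the kernel agent knows about
VALID_CATEGORIES = ("retrieval", "summarization", "question_answering")


def normalize_category(raw: str) -> str:
    """Map a free-form model reply onto a known category, else 'unknown'."""
    # Lowercase, then collapse every run of non-letters into one underscore
    cleaned = re.sub(r"[^a-z]+", "_", raw.lower()).strip("_")
    if cleaned in VALID_CATEGORIES:
        return cleaned
    # Tolerate replies such as "Category: retrieval", but only when
    # exactly one known category name appears in the reply
    matches = [c for c in VALID_CATEGORIES if c in cleaned]
    return matches[0] if len(matches) == 1 else "unknown"


for reply in ["Retrieval.", "question answering", "Category: summarization", "not sure"]:
    print(repr(reply), "->", normalize_category(reply))
```

Ambiguous replies that mention more than one category are deliberately mapped to 'unknown' rather than guessed, so the agent reports a classification error instead of calling the wrong tool.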
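Takeaways

- Combining semantic and temporal relevance removes the need for complex re-ranking while maintaining search accuracy for research papers.
- The time-based penalty (negative_filter=-0.25) prioritizes recent research when papers have similar content relevance.
- The modular, tool-based architecture lets specialized components handle distinct tasks (retrieval, summarization, question answering) while keeping the system cohesive.
- Loading data in small batches (batch_size=10) with progress tracking improves system stability when processing large research datasets.
- Adjustable query weights let users balance relevance (1.0) and recency (0.5) according to specific research needs.
- The question-answering component degrades gracefully to general knowledge when paper-specific context is unavailable, which prevents dead-end user experiences.

Keeping up with the large volume of regularly published research papers can be challenging and time-consuming. An agentic AI assistant workflow that can efficiently find relevant research, summarize key insights, and answer specific questions from these papers can significantly streamline this process.

Contributors

- Vipul Maheshwari, author
- Filip Makraduli, reviewer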