
How to Extract and Generate JSON Data with GPT, LangChain, and Node.js

By Karol Horosin, 7 min read, 2023/08/21

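TL;DR

In this article, I explain how to use LangChain, a framework for AI-powered applications, together with GPT and Node.js to extract and generate structured JSON data. The tutorial covers installing and setting up LangChain, creating prompt templates, generating data with OpenAI models, handling errors, and extracting data from PDF files. I provide step-by-step instructions, code snippets, and examples to demonstrate the process. The tutorial shows how this approach can be used to build applications that work with structured data from various sources.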


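In this blog post, I'll share how to use LangChain, a flexible framework for building AI-driven applications, to extract and generate structured JSON data with GPT and Node.js. I'll provide code snippets and concise instructions to help you set up and run the project.

About LangChain

LangChain is a versatile framework designed to simplify the development of AI-driven applications.

With its modular architecture, it provides a comprehensive set of components for crafting prompt templates, connecting to different data sources, and interacting with various tools.

By simplifying prompt engineering, data source integration, and tool interaction, LangChain lets developers focus on core application logic and speeds up development.

LangChain offers both Python and JavaScript APIs, so developers can apply natural language processing and AI across multiple platforms and use cases.

LangChain includes tools for getting structured output (such as JSON) from LLMs. Let's use them to our advantage.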

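Installation and Setup

I assume you have one of the latest versions of Node.js. I'm using Node 18. Visit the LangChain website if you need more details.

First, create a new Node project:

Create a new directory for your project and navigate to it in your terminal.
Run npm init to initialize a new Node.js project.
Create an index.js file.

Then, let's install LangChain and configure the API keys. Other dependencies are included.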

    npm i langchain

    # configure credentials (easiest)
    export OPENAI_API_KEY=XXX
    export SERPAPI_API_KEY=XXX

This is for demonstration purposes only. I prefer not to export variables; I use the popular dotenv npm library instead.
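Whichever way the keys get into the environment, it helps to fail early when one is missing. Here's a minimal sketch of that idea. The `requireEnv` helper is a hypothetical name, not part of LangChain or dotenv, and it assumes the variables were already loaded (by dotenv or by an `export` in the shell).

```javascript
// Hypothetical helper (not part of LangChain or dotenv): read a required
// environment variable and fail early with a clear message if it is missing.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (assumes the key was loaded by dotenv or exported in the shell):
// const apiKey = requireEnv("OPENAI_API_KEY");
```

The second parameter exists only so the helper can be exercised with a plain object instead of the real environment.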

Let's import the required dependencies at the top of our JS file.

    import { z } from "zod";
    import { OpenAI } from "langchain/llms/openai";
    import { PromptTemplate } from "langchain/prompts";
    import {
      StructuredOutputParser,
      OutputFixingParser,
    } from "langchain/output_parsers";

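Generating Data

Let's start by generating some fake data to see what parsing makes possible.

Output Schema Definition

First, we need to tell the library what we want to get. LangChain supports defining the expected schema using a popular library called Zod: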

    const parser = StructuredOutputParser.fromZodSchema(
      z.object({
        name: z.string().describe("Human name"),
        surname: z.string().describe("Human surname"),
        age: z.number().describe("Human age"),
        appearance: z.string().describe("Human appearance description"),
        shortBio: z.string().describe("Short bio secription"),
        university: z.string().optional().describe("University name if attended"),
        gender: z.string().describe("Gender of the human"),
        interests: z
          .array(z.string())
          .describe("json array of strings human interests"),
      })
    );

Prompt Template

To use this schema, we need to create a LangChain construct called PromptTemplate. It will contain the format instructions from the parser:

    const formatInstructions = parser.getFormatInstructions();

    const prompt = new PromptTemplate({
      template: `Generate details of a hypothetical person.\n{format_instructions} Person description: {description}`,
      inputVariables: ["description"],
      partialVariables: { format_instructions: formatInstructions },
    });
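To make the roles of `inputVariables` and `partialVariables` concrete, here's a dependency-free sketch of placeholder substitution. It is not LangChain's implementation, and `fillTemplate` is a hypothetical name. It only illustrates that partial variables are fixed when the template is created, while input variables are supplied with each call.

```javascript
// Illustrative only: a simplified stand-in for what a prompt template does.
// Replaces {placeholders} with values and throws on a missing variable.
function fillTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) => {
    if (!Object.hasOwn(values, key)) {
      throw new Error(`Missing template variable: ${key}`);
    }
    return String(values[key]);
  });
}

const template =
  "Generate details of a hypothetical person.\n{format_instructions}\nPerson description: {description}";

// Partial variables are known when the template is created...
const partials = { format_instructions: "Respond with a JSON object." };

// ...while input variables arrive with each call.
const filled = fillTemplate(template, {
  ...partials,
  description: "A man, living in Poland",
});

console.log(filled);
```

Throwing on a missing variable is a design choice in this sketch: a half-filled prompt sent to the model is harder to debug than an early error.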

Try It Out

To generate structured output, call the OpenAI model with the formatted input:

    const model = new OpenAI({ temperature: 0.5, model: "gpt-3.5-turbo" });

    const input = await prompt.format({
      description: "A man, living in Poland",
    });

    const response = await model.call(input);

Here's what will be sent to the AI model. This will most likely change in future LangChain versions.

    Generate details of a hypothetical person.
    You must format your output as a JSON value that adheres to a given "JSON Schema" instance.
    "JSON Schema" is a declarative language that allows you to annotate and validate JSON documents.
    For example, the example "JSON Schema" instance {{"properties": {{"foo": {{"description": "a list of test words", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}}}
    would match an object with one required property, "foo". The "type" property specifies "foo" must be an "array", and the "description" property semantically describes it as "a list of test words". The items within "foo" must be strings.
    Thus, the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of this example "JSON Schema". The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.
    Your output will be parsed and type-checked according to the provided schema instance, so make sure all fields in your output match exactly!
    Here is the JSON Schema instance your output must adhere to:
    '''json
    {"type":"object","properties":{"name":{"type":"string","description":"Human name"},"surname":{"type":"string","description":"Human surname"},"age":{"type":"number","description":"Human age"},"appearance":{"type":"string","description":"Human appearance description"},"shortBio":{"type":"string","description":"Short bio secription"},"university":{"type":"string","description":"University name if attended"},"gender":{"type":"string","description":"Gender of the human"},"interests":{"type":"array","items":{"type":"string"},"description":"json array of strings human interests"}},"required":["name","surname","age","appearance","shortBio","gender","interests"],"additionalProperties":false,"$schema":"http://json-schema.org/draft-07/schema#"}
    '''
    Person description: A man, living in Poland.

The output of the model will look like this:

    {
      "name": "Adam",
      "surname": "Kowalski",
      "age": 21,
      "appearance": "Adam is a tall and slim man with short dark hair and blue eyes.",
      "shortBio": "Adam is a 21 year old man from Poland. He is currently studying computer science at the University of Warsaw.",
      "university": "University of Warsaw",
      "gender": "Male",
      "interests": ["Computer Science", "Cooking", "Photography"]
    }

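As you can see, we got just what we needed. We can generate whole identities with complex descriptions that match the other parts of the persona. If we needed to enrich our mock dataset, we could then ask another AI model to generate a photo based on the appearance.

Error Handling

You may wonder whether using an LLM in a production application is safe at all. Luckily, LangChain is focused on problems just like this. If the output needs fixing, use the OutputFixingParser. It will try to fix errors when your LLM outputs something that doesn't match your requirements.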

    try {
      console.log(await parser.parse(response));
    } catch (e) {
      console.error("Failed to parse bad output: ", e);

      const fixParser = OutputFixingParser.fromLLM(
        new OpenAI({ temperature: 0, model: "gpt-3.5-turbo" }),
        parser
      );
      const output = await fixParser.parse(response);
      console.log("Fixed output: ", output);
    }
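To see what kind of mismatch sends execution into the `catch` branch, here's a dependency-free sketch of the checks the schema expresses. It's a hand-written approximation, not Zod or LangChain code, and `personFields` and `findSchemaProblems` are hypothetical names.

```javascript
// Hand-written approximation of the schema defined earlier (not Zod itself).
// Each entry maps a field name to [expected type, required?].
const personFields = {
  name: ["string", true],
  surname: ["string", true],
  age: ["number", true],
  appearance: ["string", true],
  shortBio: ["string", true],
  university: ["string", false],
  gender: ["string", true],
  interests: ["string[]", true],
};

// Returns a list of human-readable problems; an empty list means the object fits.
function findSchemaProblems(person) {
  const problems = [];
  for (const [field, [type, required]] of Object.entries(personFields)) {
    const value = person[field];
    if (value === undefined) {
      if (required) problems.push(`${field}: missing`);
      continue;
    }
    const ok =
      type === "string[]"
        ? Array.isArray(value) && value.every((v) => typeof v === "string")
        : typeof value === type;
    if (!ok) problems.push(`${field}: expected ${type}`);
  }
  return problems;
}

// A typical LLM slip: the age comes back as a string and fields are dropped.
console.log(findSchemaProblems({ name: "Adam", surname: "Kowalski", age: "21" }));
```

Any non-empty result here corresponds to a reply that the real parser would reject, which is the case the fixing parser is meant to handle.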

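Extracting Data From Files

To load and extract data from files using LangChain, you can follow the steps below. In this example, we're going to load a PDF file. Conveniently, LangChain has utilities just for this purpose. We need one extra dependency.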

 npm install pdf-parse

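We're going to load a short bio of Elon Musk and extract the same fields we generated previously. Download the PDF file here: Google Drive.

First, let's create a new file, e.g., structured-pdf.js. Let's start by loading the PDF.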

    import { PDFLoader } from "langchain/document_loaders/fs/pdf";

    const loader = new PDFLoader("./elon.pdf");
    const docs = await loader.load();

    console.log(docs);

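We need to modify the prompt template to ask for extraction instead of generation. I also had to modify the prompt to fix a JSON rendering issue, as the results were inconsistent at times.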

    const prompt = new PromptTemplate({
      template: "Extract information from the person description.\n{format_instructions}\nThe response should be presented in a markdown JSON codeblock.\nPerson description: {inputText}",
      inputVariables: ["inputText"],
      partialVariables: { format_instructions: formatInstructions },
    });
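Because the prompt now asks for a markdown JSON code block, the reply has to be unwrapped before it can be read as JSON. The parser used in this tutorial is expected to handle that on its own; the sketch below only illustrates the idea in plain JavaScript. `extractJson` is a hypothetical helper, not a LangChain API, and the sample reply is made up for the demonstration.

```javascript
// Three backticks, built with repeat() so this snippet stays fence-safe.
const FENCE = "`".repeat(3);

// Hypothetical helper: pull the JSON payload out of a reply that may or may
// not be wrapped in a markdown code fence, then parse it.
function extractJson(reply) {
  const fenced = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const match = reply.match(fenced);
  const payload = match ? match[1] : reply;
  return JSON.parse(payload.trim());
}

// Made-up reply in the shape the prompt asks for.
const reply = [
  "Here is the extracted data:",
  FENCE + "json",
  '{ "name": "Elon", "surname": "Musk" }',
  FENCE,
].join("\n");

console.log(extractJson(reply).surname);
```

Falling back to the raw reply when no fence is found keeps the helper usable with both the generation prompt and the extraction prompt.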

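Finally, we have to raise the allowed output length (there's a bit more data than in the generation case), as the default is 256 tokens. We also need to call the model with our loaded document instead of a predetermined person description.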

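    const model = new OpenAI({
      temperature: 0.5,
      model: "gpt-3.5-turbo",
      maxTokens: 2000,
    });

    const input = await prompt.format({
      inputText: docs[0].pageContent,
    });

    // Call the model and parse the reply, as in the generation example
    const response = await model.call(input);
    console.log(await parser.parse(response));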

Thanks to these modifications, we get the following output:

    {
      name: 'Elon',
      surname: 'Musk',
      age: 51,
      appearance: 'normal build, short-cropped hair, and a trimmed beard',
      // truncated by me
      shortBio: 'Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is best known for his...',
      gender: 'male',
      interests: [
        'space exploration',
        'electric vehicles',
        'artificial intelligence',
        'sustainable energy',
        'tunnel construction',
        'neural interfaces',
        'Mars colonization',
        'hyperloop transportation'
      ]
    }
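The extraction code passes only `docs[0]` to the prompt, which is enough for a one-page bio. The loader typically returns one document per page, so a longer PDF would likely need its pages combined first. Here's a minimal sketch, assuming each loaded document exposes a `pageContent` string as in the code above. `joinPages` is a hypothetical helper, and the sample documents are stand-ins so the sketch runs without a PDF.

```javascript
// Hypothetical helper: merge the text of all loaded pages into one string.
// Assumes each document has a `pageContent` string, as in the loader output.
function joinPages(docs, separator = "\n\n") {
  return docs
    .map((doc) => doc.pageContent.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}

// Stand-in for loader output, used here so the sketch runs without a PDF.
const sampleDocs = [
  { pageContent: "First page text. " },
  { pageContent: "" },
  { pageContent: "Second page text." },
];

console.log(joinPages(sampleDocs));
```

With a helper like this, the prompt input would become `inputText: joinPages(docs)`. Keep in mind that longer input also counts against the model's context limit.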

By following these steps, we've extracted structured JSON data from a PDF file! This approach is versatile and can be adapted to suit your specific use case.