How to Extract and Generate JSON Data with GPT, LangChain, and Node.js

by Karol Horosin | 7 min read | 2023/08/21

Too Long; Didn't Read

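This article explains how to use LangChain, a framework for AI-powered applications, together with GPT and Node.js to extract and generate structured JSON data. The tutorial covers installing and setting up LangChain, creating prompt templates, generating data with OpenAI models, handling errors, and extracting data from PDF files. It provides step-by-step instructions, code snippets, and examples, and shows how this approach can be used to build robust applications that work with structured data from a variety of sources.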

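In this blog post, I will share how to use LangChain, a flexible framework for building AI-powered applications, to extract and generate structured JSON data with GPT and Node.js. I will provide code snippets and concise instructions to help you set up and run the project.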

About LangChain

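LangChain is an innovative and versatile framework designed to streamline the development of AI-powered applications.

Its modular architecture provides a comprehensive suite of components for crafting prompt templates, connecting to diverse data sources, and interacting with various tools.

By simplifying prompt engineering, data source integration, and tool interaction, LangChain lets developers focus on core application logic and speeds up the development process.

LangChain is available as both a Python and a JavaScript API, which makes it highly adaptable and lets developers use natural language processing and AI across multiple platforms and use cases.

LangChain includes tools for getting structured output (such as JSON) from LLMs. Let's use them to our advantage.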

Installation and Setup

I assume you have one of the recent versions of Node.js. I used Node 18. Visit the LangChain website if you need more details.

First, create a new Node project:

Create a new directory for your project and navigate to it in your terminal.
Run npm init to initialize a new Node.js project.
Create an index.js file.

Next, let's install LangChain and configure the API keys. The other dependencies are included.

    npm i langchain
    # configure credentials (easiest)
    export OPENAI_API_KEY=XXX
    export SERPAPI_API_KEY=XXX

This is for demonstration purposes only. I prefer not to export variables; instead, I use the popular dotenv npm library.
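
Whichever way the keys are supplied, it can help to fail early when one is missing rather than wait for the first API call to error out. The following is a minimal sketch of such a guard; `requireEnv` is a hypothetical helper introduced here for illustration, not part of LangChain or dotenv.

```javascript
// Hypothetical helper (not part of LangChain or dotenv): read a required
// environment variable and fail with a clear message if it is missing.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage: call once at startup, after the variables have been loaded
// (for example from a .env file):
// const apiKey = requireEnv("OPENAI_API_KEY");
```

The second parameter exists only to make the helper easy to test with a plain object instead of the real environment.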

Let's import the required dependencies at the top of our JS file.

    import { z } from "zod";
    import { OpenAI } from "langchain/llms/openai";
    import { PromptTemplate } from "langchain/prompts";
    import {
      StructuredOutputParser,
      OutputFixingParser,
    } from "langchain/output_parsers";

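Generating Data

Let's start by generating some fake data to see what parsing makes possible.

Defining the Output Schema

First, we need to tell the library what we want to get back. LangChain supports defining the expected schema with a popular library called Zod.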

    const parser = StructuredOutputParser.fromZodSchema(
      z.object({
        name: z.string().describe("Human name"),
        surname: z.string().describe("Human surname"),
        age: z.number().describe("Human age"),
        appearance: z.string().describe("Human appearance description"),
        shortBio: z.string().describe("Short bio secription"),
        university: z.string().optional().describe("University name if attended"),
        gender: z.string().describe("Gender of the human"),
        interests: z
          .array(z.string())
          .describe("json array of strings human interests"),
      })
    );
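
To make the requirements of this schema concrete, here is a rough plain-JavaScript sketch of the checks it implies: seven required fields with fixed types, an optional `university`, and `interests` as an array of strings. `validatePerson` and `REQUIRED_FIELDS` are hypothetical names used only for illustration; this is not how Zod or LangChain validate internally.

```javascript
// Plain-JavaScript illustration of what the schema above requires.
// validatePerson is a hypothetical helper, not a LangChain or Zod API.
const REQUIRED_FIELDS = {
  name: "string",
  surname: "string",
  age: "number",
  appearance: "string",
  shortBio: "string",
  gender: "string",
};

function validatePerson(person) {
  const errors = [];
  for (const [field, type] of Object.entries(REQUIRED_FIELDS)) {
    if (typeof person[field] !== type) {
      errors.push(`${field} must be a ${type}`);
    }
  }
  // university is optional, but must be a string when present.
  if (person.university !== undefined && typeof person.university !== "string") {
    errors.push("university must be a string when present");
  }
  // interests must be an array containing only strings.
  const interests = person.interests;
  if (!Array.isArray(interests) || !interests.every((i) => typeof i === "string")) {
    errors.push("interests must be an array of strings");
  }
  return errors;
}

// Lists every field that is missing or has the wrong type.
console.log(validatePerson({ name: "Adam", age: "21" }));
```

An empty array means the object satisfies the shape; in the tutorial itself, this job is done by the parser built from the Zod schema.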

Prompt Template

To use the schema in a prompt, we need to create a LangChain construct called PromptTemplate. It will contain the format instructions from the parser.

    const formatInstructions = parser.getFormatInstructions();

    const prompt = new PromptTemplate({
      template: `Generate details of a hypothetical person.\n{format_instructions} Person description: {description}`,
      inputVariables: ["description"],
      partialVariables: { format_instructions: formatInstructions },
    });
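
The template has two kinds of placeholders: `format_instructions` is fixed up front as a partial variable, while `description` is supplied on each call. As a rough mental model only, the substitution can be sketched in plain JavaScript. `fillTemplate` is a hypothetical helper and does not reflect LangChain's actual implementation.

```javascript
// Illustrative only: a hand-rolled stand-in for placeholder substitution.
// fillTemplate is a hypothetical helper, not LangChain's implementation.
function fillTemplate(template, partialVariables, inputValues) {
  const values = { ...partialVariables, ...inputValues };
  // Replace each {name} placeholder in a single pass, so braces inside the
  // substituted values are never re-interpreted as placeholders.
  return template.replace(/\{(\w+)\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      throw new Error(`Missing value for placeholder: ${name}`);
    }
    return String(values[name]);
  });
}

const text = fillTemplate(
  "Generate details of a hypothetical person.\n{format_instructions}\nPerson description: {description}",
  { format_instructions: "<format instructions from the parser>" },
  { description: "A man, living in Poland" }
);
console.log(text);
```

Throwing on a missing placeholder is a deliberate choice in this sketch: a prompt that silently goes out with an unfilled `{description}` is harder to debug than an early error.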

Try It Out

To get the structured output, call the OpenAI model with the formatted input:

    const model = new OpenAI({ temperature: 0.5, model: "gpt-3.5-turbo" });

    const input = await prompt.format({
      description: "A man, living in Poland",
    });

    const response = await model.call(input);

Here is what gets sent to the AI model. This will most likely change in future LangChain versions.

    Generate details of a hypothetical person.
    You must format your output as a JSON value that adheres to a given "JSON Schema" instance.

    "JSON Schema" is a declarative language that allows you to annotate and validate JSON documents.

    For example, the example "JSON Schema" instance {{"properties": {{"foo": {{"description": "a list of test words", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}}}
    would match an object with one required property, "foo". The "type" property specifies "foo" must be an "array", and the "description" property semantically describes it as "a list of test words". The items within "foo" must be strings.
    Thus, the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of this example "JSON Schema". The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

    Your output will be parsed and type-checked according to the provided schema instance, so make sure all fields in your output match exactly!

    Here is the JSON Schema instance your output must adhere to:
    '''json
    {"type":"object","properties":{"name":{"type":"string","description":"Human name"},"surname":{"type":"string","description":"Human surname"},"age":{"type":"number","description":"Human age"},"appearance":{"type":"string","description":"Human appearance description"},"shortBio":{"type":"string","description":"Short bio secription"},"university":{"type":"string","description":"University name if attended"},"gender":{"type":"string","description":"Gender of the human"},"interests":{"type":"array","items":{"type":"string"},"description":"json array of strings human interests"}},"required":["name","surname","age","appearance","shortBio","gender","interests"],"additionalProperties":false,"$schema":"http://json-schema.org/draft-07/schema#"}
    '''
    Person description: A man, living in Poland.

The model's output looks like this:

    {
      "name": "Adam",
      "surname": "Kowalski",
      "age": 21,
      "appearance": "Adam is a tall and slim man with short dark hair and blue eyes.",
      "shortBio": "Adam is a 21 year old man from Poland. He is currently studying computer science at the University of Warsaw.",
      "university": "University of Warsaw",
      "gender": "Male",
      "interests": ["Computer Science", "Cooking", "Photography"]
    }

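As you can see, we got just what we needed. We can generate whole identities with complex descriptions that match the other parts of the persona. If we needed to enrich our mock dataset, we could then ask another AI model to generate a photo based on the appearance.

Error Handling

You may wonder whether using an LLM in a production application is safe in any way. Fortunately, LangChain focuses on problems like this. If the output needs fixing, use OutputFixingParser. When the LLM returns something that does not match your requirements, it will try to fix the error.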

    try {
      console.log(await parser.parse(response));
    } catch (e) {
      console.error("Failed to parse bad output: ", e);

      const fixParser = OutputFixingParser.fromLLM(
        new OpenAI({ temperature: 0, model: "gpt-3.5-turbo" }),
        parser
      );
      const output = await fixParser.parse(response);
      console.log("Fixed output: ", output);
    }
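
The try/catch above follows a general pattern: parse, and on failure hand the bad text to a fixer and parse again. The sketch below shows that pattern in isolation with a bounded number of attempts. `parseWithFix` and `stubFix` are hypothetical names, and the stub fixer stands in for what would be another LLM call in a real application; this is not a description of how OutputFixingParser works internally.

```javascript
// Generic version of the "parse, then ask for a fix and parse again" pattern.
// parseWithFix and the stub fixer below are hypothetical, for illustration.
async function parseWithFix(text, parse, fix, maxAttempts = 2) {
  let current = text;
  let lastError;
  for (let attempt = 0; attempt <= maxAttempts; attempt++) {
    try {
      return await parse(current);
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) break;
      // Hand the bad output and the error message to the fixer
      // (in a real application this would be another LLM call).
      current = await fix(current, error.message);
    }
  }
  throw new Error(
    `Could not parse output after ${maxAttempts} fix attempts: ${lastError.message}`
  );
}

// Stub fixer: strips a trailing comma before a closing brace, a common JSON slip.
const stubFix = async (badText) => badText.replace(/,\s*}/g, "}");

parseWithFix('{"name": "Adam", "age": 21,}', JSON.parse, stubFix).then((person) =>
  console.log(person)
);
```

Capping the attempts matters in production: each fix is a paid model call, and an output that cannot be repaired should surface as an error rather than loop.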

Extracting Data from Files

To load and extract data from files with LangChain, follow the steps below. In this example, we will load a PDF file. Conveniently, LangChain has utilities for exactly this purpose. We need one extra dependency.

    npm install pdf-parse

We will load a short bio of Elon Musk and extract the same information we generated earlier. Download the PDF file here: Google Drive.

First, let's create a new file, for example structured-pdf.js. Let's start by loading the PDF.

    import { PDFLoader } from "langchain/document_loaders/fs/pdf";

    const loader = new PDFLoader("./elon.pdf");
    const docs = await loader.load();

    console.log(docs);

We need to modify the prompt template to ask for extraction rather than generation. I also had to modify the prompt to fix JSON rendering, because the results were sometimes inconsistent.

    const prompt = new PromptTemplate({
      template:
        "Extract information from the person description.\n{format_instructions}\nThe response should be presented in a markdown JSON codeblock.\nPerson description: {inputText}",
      inputVariables: ["inputText"],
      partialVariables: { format_instructions: formatInstructions },
    });
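
Asking for a markdown JSON code block means the reply may arrive wrapped in a fence, possibly with extra text around it. As a rough sketch of how such a reply could be unwrapped by hand, the hypothetical `extractJson` helper below looks for a fenced block and falls back to the whole reply. It is not a LangChain API; in the tutorial the parser handles the reply.

```javascript
// Hypothetical helper: pull a JSON value out of a model reply that may wrap
// it in a markdown code block. Not a LangChain API.
const FENCE = "`".repeat(3); // three backticks, built here to keep the snippet readable

function extractJson(reply) {
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const match = reply.match(pattern);
  // Fall back to the whole reply when no code block is present.
  const jsonText = match ? match[1] : reply;
  return JSON.parse(jsonText.trim());
}

const reply =
  "Here is the result:\n" + FENCE + 'json\n{"name": "Elon", "age": 51}\n' + FENCE;
console.log(extractJson(reply).name); // prints "Elon"
```

`JSON.parse` still throws on malformed content, so a helper like this only removes the wrapping; it does not replace validation or the fixing step described earlier.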

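Finally, we need to extend the allowed output length, since the default is 256 tokens (there is a bit more data here than in the generated case). We also need to call the model with the loaded document instead of a predetermined person description.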

    const model = new OpenAI({
      temperature: 0.5,
      model: "gpt-3.5-turbo",
      maxTokens: 2000,
    });

    const input = await prompt.format({
      inputText: docs[0].pageContent,
    });
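
The call above passes only docs[0]. If the loader returns more than one document (for instance one per page, which is an assumption about the loader's behavior worth checking against your own PDF), the remaining text would be left out. A minimal sketch of merging everything first, using a hypothetical `joinPages` helper and stand-in objects with the same `pageContent` shape:

```javascript
// Hypothetical helper: merge the text of all loaded documents into one string.
// Assumes each document has a pageContent string, as docs[0] does above.
function joinPages(docs, separator = "\n\n") {
  return docs
    .map((doc) => doc.pageContent.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}

// Stand-in objects with the same shape as the loaded documents.
const sampleDocs = [
  { pageContent: "Page one text. " },
  { pageContent: "" },
  { pageContent: "Page two text." },
];
console.log(joinPages(sampleDocs));
```

Keep in mind that a longer input consumes more of the model's context window, so for large files the merged text may need to be shortened or split.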

Thanks to these modifications, we get the following output:

    {
      name: 'Elon',
      surname: 'Musk',
      age: 51,
      appearance: 'normal build, short-cropped hair, and a trimmed beard',
      // truncated by me
      shortBio: "Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is best known for his...",
      gender: 'male',
      interests: [
        'space exploration',
        'electric vehicles',
        'artificial intelligence',
        'sustainable energy',
        'tunnel construction',
        'neural interfaces',
        'Mars colonization',
        'hyperloop transportation'
      ]
    }

By following these steps, we have extracted structured JSON data from a PDF file! This approach is versatile and can be adapted to your specific use case.