
How to Extract and Generate JSON Data with GPT, LangChain, and Node.js

by Karol Horosin, 7m, 2023/08/21

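Too Long; Didn't Read

This article shows how to use LangChain, a framework for building AI-driven applications, together with GPT and Node.js to extract and generate structured JSON data. The tutorial covers installing and setting up LangChain, creating prompt templates, generating data with OpenAI models, handling errors, and extracting data from PDF files, with step-by-step instructions, code snippets, and examples along the way. The same approach can be used to build applications that work with structured data from a variety of sources.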


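In this blog post, I will share how to use LangChain, a flexible framework for building AI-driven applications, to extract and generate structured JSON data with GPT and Node.js. I'll provide code snippets and concise instructions to help you set up and run the project.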

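About LangChain

LangChain is a versatile framework designed to streamline the development of AI-driven applications.

Its modular architecture provides a suite of components for crafting prompt templates, connecting to diverse data sources, and interacting with various tools.

By simplifying prompt engineering, data source integration, and tool interaction, LangChain lets developers focus on core application logic, which speeds up development.

LangChain is available as both a Python and a JavaScript API, so developers can apply natural language processing and AI across multiple platforms and use cases.

LangChain includes tools for getting structured output (for example, JSON) out of LLMs. Let's put them to use.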

Installation and setup

I am assuming you have one of the latest versions of Node.js. I used Node 18. Visit the LangChain website if you need more details.

First, create a new Node project, i.e.:

Create a new directory for your project and navigate to it in your terminal.
Run npm init to initialize a new Node.js project.
Create an index.js file.

Then, let's install LangChain and configure the API keys. Other dependencies are included.

    npm i langchain
    # configure credentials (easiest)
    export OPENAI_API_KEY=XXX
    export SERPAPI_API_KEY=XXX

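This is for demonstration purposes only. I prefer not to export variables; I use the popular dotenv npm library instead.

Let's import the required dependencies at the top of our JS file. The snippets use ES module imports and top-level await, so the file has to run as an ES module (for example, by setting "type": "module" in package.json).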
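Whichever way the keys are supplied (exported variables or a .env file loaded by dotenv), it can help to fail fast when one is missing. The following is only a minimal sketch: the helper name `requireEnv` is hypothetical and is not part of LangChain or dotenv.

```javascript
// Hypothetical helper: verify that required variables are present before any API call.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  // Return only the requested values.
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}

// Usage with a stand-in environment object (real code would rely on process.env).
const demoEnv = { OPENAI_API_KEY: "XXX" };
console.log(requireEnv(["OPENAI_API_KEY"], demoEnv));
```

Calling it once at startup, before constructing the model, surfaces a configuration problem as a clear error instead of a failed API request later on.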

    import { z } from "zod";
    import { OpenAI } from "langchain/llms/openai";
    import { PromptTemplate } from "langchain/prompts";
    import {
      StructuredOutputParser,
      OutputFixingParser,
    } from "langchain/output_parsers";

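Generating data

Let's start by generating some fake data to see what parsing can do.

Output schema definition

First, we need to tell the library what we want to get. LangChain supports defining the expected schema using a popular library called Zod.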

    const parser = StructuredOutputParser.fromZodSchema(
      z.object({
        name: z.string().describe("Human name"),
        surname: z.string().describe("Human surname"),
        age: z.number().describe("Human age"),
        appearance: z.string().describe("Human appearance description"),
        shortBio: z.string().describe("Short bio secription"),
        university: z.string().optional().describe("University name if attended"),
        gender: z.string().describe("Gender of the human"),
        interests: z
          .array(z.string())
          .describe("json array of strings human interests"),
      })
    );

Prompt template

To use this schema, we need to create a LangChain construct called PromptTemplate. It will contain the format instructions from the parser.

    // Ask the parser for the instructions that describe the expected JSON output.
    const formatInstructions = parser.getFormatInstructions();

    const prompt = new PromptTemplate({
      template:
        "Generate details of a hypothetical person.\n{format_instructions}\n" +
        "Person description: {description}",
      inputVariables: ["description"],
      partialVariables: { format_instructions: formatInstructions },
    });
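Conceptually, formatting the prompt means replacing each placeholder in braces with a value: partial variables are fixed up front, and input variables arrive at call time. The sketch below illustrates that idea in plain JavaScript. `fillTemplate` is a hypothetical stand-in, not LangChain's implementation, and it handles only simple `{name}` placeholders (no escaping of literal braces).

```javascript
// Hypothetical stand-in for template formatting; not LangChain's implementation.
function fillTemplate(template, variables) {
  return template.replace(/\{(\w+)\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(variables, name)) {
      throw new Error(`Missing value for template variable: ${name}`);
    }
    return String(variables[name]);
  });
}

const demoTemplate =
  "Generate details of a hypothetical person.\n{format_instructions}\nPerson description: {description}";

// Partial variables are fixed up front; input variables arrive at call time.
const partials = { format_instructions: "<format instructions go here>" };
const filled = fillTemplate(demoTemplate, {
  ...partials,
  description: "A man, living in Poland",
});
console.log(filled);
```

Throwing on a missing variable, rather than leaving the placeholder in place, keeps a half-filled prompt from reaching the model unnoticed.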

Try it out

To get the structured output, call the OpenAI model with the formatted input.

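    // Note: some LangChain JS releases name this option `modelName` rather than `model`.
    const model = new OpenAI({ temperature: 0.5, model: "gpt-3.5-turbo" });

    // Fill in the input variable; the format instructions are already bound.
    const input = await prompt.format({
      description: "A man, living in Poland",
    });
    const response = await model.call(input);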

Here is what will be sent to the AI model. This will most likely change in future LangChain versions.

    Generate details of a hypothetical person.
    You must format your output as a JSON value that adheres to a given "JSON Schema" instance.

    "JSON Schema" is a declarative language that allows you to annotate and validate JSON documents.

    For example, the example "JSON Schema" instance {{"properties": {{"foo": {{"description": "a list of test words", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}}}
    would match an object with one required property, "foo". The "type" property specifies "foo" must be an "array", and the "description" property semantically describes it as "a list of test words". The items within "foo" must be strings.
    Thus, the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of this example "JSON Schema". The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

    Your output will be parsed and type-checked according to the provided schema instance, so make sure all fields in your output match exactly!

    Here is the JSON Schema instance your output must adhere to:
    '''json
    {"type":"object","properties":{"name":{"type":"string","description":"Human name"},"surname":{"type":"string","description":"Human surname"},"age":{"type":"number","description":"Human age"},"appearance":{"type":"string","description":"Human appearance description"},"shortBio":{"type":"string","description":"Short bio secription"},"university":{"type":"string","description":"University name if attended"},"gender":{"type":"string","description":"Gender of the human"},"interests":{"type":"array","items":{"type":"string"},"description":"json array of strings human interests"}},"required":["name","surname","age","appearance","shortBio","gender","interests"],"additionalProperties":false,"$schema":"http://json-schema.org/draft-07/schema#"}
    '''

    Person description: A man, living in Poland.

The output from the model will look like this:

    {
      "name": "Adam",
      "surname": "Kowalski",
      "age": 21,
      "appearance": "Adam is a tall and slim man with short dark hair and blue eyes.",
      "shortBio": "Adam is a 21 year old man from Poland. He is currently studying computer science at the University of Warsaw.",
      "university": "University of Warsaw",
      "gender": "Male",
      "interests": ["Computer Science", "Cooking", "Photography"]
    }

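As you can see, we got just what we needed. We can generate whole identities with complex descriptions that match the other parts of the persona. If we needed to enrich our mock dataset, we could then ask another AI model to generate a photo based on the appearance.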
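Before relying on a generated record, it may be worth checking that every required field is actually present. The sketch below does this in plain JavaScript, using the "required" list from the JSON Schema printed earlier. The helper `missingFields` is hypothetical and is a much simpler check than the schema validation the parser performs.

```javascript
// Required keys copied from the "required" list in the JSON Schema shown earlier.
const requiredFields = [
  "name",
  "surname",
  "age",
  "appearance",
  "shortBio",
  "gender",
  "interests",
];

// Hypothetical helper: list the required keys that are absent from a parsed object.
function missingFields(person, required = requiredFields) {
  return required.filter((key) => person[key] === undefined);
}

// A deliberately incomplete record for illustration.
const samplePerson = JSON.parse(
  '{"name":"Adam","surname":"Kowalski","age":21,"gender":"Male","interests":["Cooking"]}'
);
console.log(missingFields(samplePerson)); // lists "appearance" and "shortBio"
```

Note that "university" is not in the list because the schema marks it as optional.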

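Error handling

You may wonder whether using LLMs in a production application is safe at all. Luckily, LangChain is focused on problems just like this. If the output needs fixing, use the OutputFixingParser. It will try to fix errors when your LLM outputs something that does not match your requirements.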

    try {
      console.log(await parser.parse(response));
    } catch (e) {
      console.error("Failed to parse bad output: ", e);

      // Ask a second, deterministic model to repair the malformed output.
      const fixParser = OutputFixingParser.fromLLM(
        new OpenAI({ temperature: 0, model: "gpt-3.5-turbo" }),
        parser
      );
      const output = await fixParser.parse(response);
      console.log("Fixed output: ", output);
    }
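The try/catch above follows a general pattern: parse, and on failure ask a fixer for a corrected string, then parse once more. The sketch below shows that control flow in isolation with injected functions. `parseWithFix`, `strictParse`, and `stubFix` are hypothetical names for illustration; in real code the parser would be the structured parser and the fixer would call an LLM.

```javascript
// Hypothetical helper: try `parse`, and on failure ask `fix` for a corrected string once.
async function parseWithFix(text, parse, fix) {
  try {
    return await parse(text);
  } catch (error) {
    const repaired = await fix(text, error);
    return parse(repaired); // a second failure propagates to the caller
  }
}

// Stand-ins for illustration: JSON.parse as the parser, and a stub "fixer"
// that removes a trailing comma (a real fixer would call an LLM).
const strictParse = (text) => JSON.parse(text);
const stubFix = async (text) => text.replace(/,\s*}/g, "}");

parseWithFix('{"name": "Adam", "age": 21,}', strictParse, stubFix).then((result) =>
  console.log(result)
);
```

Limiting the repair to a single attempt keeps the cost bounded: each fix is another model call, so an unbounded retry loop could become expensive.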

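Extracting data from files

To load and extract data from files using LangChain, you can follow these steps. In this example, we are going to load a PDF file. Conveniently, LangChain has utilities just for this purpose. We need one extra dependency.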

 npm install pdf-parse

We are going to load a short bio of Elon Musk and extract the same information we generated previously. Download the PDF file here: Google Drive.

First, let's create a new file, e.g. structured-pdf.js. Let's start by loading the PDF.

    import { PDFLoader } from "langchain/document_loaders/fs/pdf";

    const loader = new PDFLoader("./elon.pdf");
    const docs = await loader.load();

    console.log(docs);

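We need to modify the prompt template to indicate extraction instead of generation. I also had to modify the prompt to fix a JSON rendering issue, as the results were sometimes inconsistent. The template reuses the parser's formatInstructions from the earlier example, so the parser definition has to be available in this file too.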

    const prompt = new PromptTemplate({
      template:
        "Extract information from the person description.\n{format_instructions}\nThe response should be presented in a markdown JSON codeblock.\nPerson description: {inputText}",
      inputVariables: ["inputText"],
      partialVariables: { format_instructions: formatInstructions },
    });
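Because the prompt asks for the answer inside a markdown JSON code block, the reply has to be unwrapped before it can be read as JSON. The structured parser takes care of this; the sketch below only illustrates the idea in plain JavaScript and is not the parser's actual implementation. The helper name `extractJsonBlock` is hypothetical.

```javascript
// Three backticks, built indirectly to keep this listing readable.
const FENCE = "`".repeat(3);

// Hypothetical helper: pull the first fenced JSON block out of a model reply.
function extractJsonBlock(reply) {
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const match = reply.match(pattern);
  // Fall back to the whole reply when no fenced block is present.
  const jsonText = match ? match[1] : reply;
  return JSON.parse(jsonText.trim());
}

// A stand-in reply for illustration.
const demoReply = `Here you go:\n${FENCE}json\n{"name": "Elon", "surname": "Musk"}\n${FENCE}`;
console.log(extractJsonBlock(demoReply));
```

Asking for a fenced block gives the reply a predictable shape even when the model adds commentary around the JSON.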

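Finally, we have to extend the allowed output length, since the default is 256 tokens and there is a bit more data here than in the generated case. We also need to call the model with our loaded document instead of a predetermined person description.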

    const model = new OpenAI({
      temperature: 0.5,
      model: "gpt-3.5-turbo",
      maxTokens: 2000,
    });

    // Pass the text of the first loaded document as the description to extract from.
    const input = await prompt.format({
      inputText: docs[0].pageContent,
    });

Thanks to these modifications, we get the following output:

    {
      name: 'Elon',
      surname: 'Musk',
      age: 51,
      appearance: 'normal build, short-cropped hair, and a trimmed beard',
      // truncated by me
      shortBio: "Elon Musk, a 51-year-old male entrepreneur, inventor, and CEO, is best known for his...",
      gender: 'male',
      interests: [
        'space exploration',
        'electric vehicles',
        'artificial intelligence',
        'sustainable energy',
        'tunnel construction',
        'neural interfaces',
        'Mars colonization',
        'hyperloop transportation'
      ]
    }

By following these steps, we have extracted structured JSON data from a PDF file. This approach is versatile and can be adapted to suit your specific use case.
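One possible adaptation: the example sends only docs[0].pageContent to the model, which is enough for a short one-page bio. For longer files, and assuming the loader returns one document per page, the page texts could be joined before formatting the prompt. The helper `joinPages` below is hypothetical, and the stand-in documents only mimic the `pageContent` field used above.

```javascript
// Hypothetical helper: concatenate the text of every loaded document.
// Assumes each entry has a string `pageContent`, as docs[0] does in the code above.
function joinPages(docs, separator = "\n\n") {
  return docs.map((doc) => doc.pageContent.trim()).join(separator);
}

// Stand-in documents for illustration; real ones would come from loader.load().
const demoDocs = [
  { pageContent: "Page one text. " },
  { pageContent: "Page two text." },
];
console.log(joinPages(demoDocs));
```

Keep in mind that a longer input also consumes more of the model's context window, so very large files may still need to be split or summarized first.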