
Building a Product with Embeddings to Search Paul Graham's Essays Using Siri

by Embedbase, 11 min read, 2023/02/14

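TL;DR

"Embeddings" are a machine learning concept that makes it possible to compare data. Embedbase is an open-source API for building, storing, and retrieving embeddings. We will build a search engine for Paul Graham's essays and use it from an Apple Siri Shortcut.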


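Embedbase is an open-source API for building, storing, and retrieving embeddings.

Today we will build a search engine for Paul Graham's essays and use it with Apple Siri Shortcuts, so that we can, for example, ask Siri questions about these essays.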

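"Embeddings" are a machine learning concept that lets you compare data.

We will not dive into the technical details of embeddings today.

One way to think about embeddings is as putting similar things together in a bag. If you have a bag of toys and want to find a particular toy, you look in the bag and check which other toys are nearby to figure out which one you want. A computer can do the same with words: it puts similar words together in a bag and finds the word you want based on the other words near it.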
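To make the bag analogy slightly more concrete, here is a minimal sketch of how two embeddings can be compared. The three-dimensional vectors and the `cosineSimilarity` and `nearest` helpers are invented for illustration only; real embeddings are much larger, and this is not a description of how Embedbase or any vector database works internally.

```typescript
// Toy embeddings: hypothetical 3-dimensional vectors, invented for this example.
const toyEmbeddings: Record<string, number[]> = {
  dog: [0.9, 0.1, 0.0],
  puppy: [0.8, 0.2, 0.1],
  startup: [0.0, 0.2, 0.9],
};

// Cosine similarity: close to 1 when two vectors point in the same direction,
// close to 0 when they are orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same dimension');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the stored word whose vector is closest to the given word's vector.
function nearest(word: string): string {
  const query = toyEmbeddings[word];
  let best = '';
  let bestScore = -Infinity;
  for (const [other, vector] of Object.entries(toyEmbeddings)) {
    if (other === word) continue;
    const score = cosineSimilarity(query, vector);
    if (score > bestScore) {
      best = other;
      bestScore = score;
    }
  }
  return best;
}

console.log(nearest('dog')); // prints "puppy"
```

In the analogy, "dog" and "puppy" sit in the same corner of the bag, while "startup" sits far away.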

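When you want to use embeddings in production software, you need to be able to store and access them easily.

There are many vector databases and NLP models for storing and computing embeddings. There are also some additional tricks for dealing with embeddings.

For example, recomputing embeddings when it is not needed can be costly and inefficient. If a note contains "dog" and, after an edit, still contains "dog", you do not necessarily want to recompute, because the changed information may not be useful.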
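One possible way to avoid needless recomputation is to remember what was last embedded for each document and skip the work when nothing meaningful changed. The sketch below is an assumption made for illustration, not Embedbase's actual mechanism: `lastEmbedded`, `normalize`, and `needsRecompute` are hypothetical names, and a real system would more likely store a hash of the text than the text itself.

```typescript
// Hypothetical in-memory cache: document id -> normalized text that was last embedded.
const lastEmbedded = new Map<string, string>();

// Ignore trivial differences: letter case, surrounding and repeated whitespace.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, ' ');
}

// Returns true when the text changed enough that its embedding should be
// recomputed, and records the new text in that case.
function needsRecompute(id: string, text: string): boolean {
  const current = normalize(text);
  if (lastEmbedded.get(id) === current) return false;
  lastEmbedded.set(id, current);
  return true;
}

console.log(needsRecompute('note-1', 'dog'));    // true: first time we see this note
console.log(needsRecompute('note-1', '  Dog ')); // false: only case and spacing changed
```

The caller would compute (and pay for) a new embedding only when `needsRecompute` returns true.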

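In addition, these vector databases have a steep learning curve and require some understanding of machine learning.

Embedbase lets you sync and semantically search your data in a few lines of code, without knowing anything about machine learning, vector databases, or computation optimization.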

Searching Paul Graham's essays with Siri

In order, we will:

Deploy Embedbase locally or on Google Cloud Run
Build a crawler for Paul Graham's essays with Crawlee that ingests the data into Embedbase
Build an Apple Siri Shortcut that lets you search Paul Graham's essays through Embedbase with voice and natural language

Build time!

Tech stack

Embedbase
TypeScript
Crawlee + Playwright crawler
Google Cloud Run for deployment
Apple Siri Shortcuts for querying the index

Cloning the repo

    git clone https://github.com/another-ai/embedbase
    cd embedbase

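Setting up Pinecone

Go to the Pinecone website, log in, and create an index.

Name it "paul" and use dimension "1536" (it is important to get this number right: under the hood, it is the "size" of the OpenAI "embedding" data structure). The other settings matter less.

You need to get your Pinecone API key, which lets Embedbase communicate with Pinecone.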

Configuring OpenAI

Now get your OpenAI configuration at https://platform.openai.com/account/api-keys (create an account if needed).

Press "Create a new key".

Also get your organization ID at https://platform.openai.com/account/org-settings.

Creating your Embedbase config

Now write and fill in the values in the file "config.yaml" (in the embedbase directory):

    # embedbase/config.yaml
    # https://app.pinecone.io/
    pinecone_index: "my index name"
    # replace this with your environment
    pinecone_environment: "us-east1-gcp"
    pinecone_api_key: ""
    # https://platform.openai.com/account/api-keys
    openai_api_key: "sk-xxxxxxx"
    # https://platform.openai.com/account/org-settings
    openai_organization: "org-xxxxx"
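Before starting Embedbase, it can help to check that no value was left blank (the sample above ships with an empty "pinecone_api_key"). The following is a small sketch of such a check, assuming the YAML file has already been parsed into a plain object by a YAML library of your choice. `missingKeys` is a hypothetical helper, not part of Embedbase; only the key names come from the config above.

```typescript
// Keys taken from the config.yaml shown above.
const REQUIRED_KEYS = [
  'pinecone_index',
  'pinecone_environment',
  'pinecone_api_key',
  'openai_api_key',
  'openai_organization',
] as const;

type EmbedbaseConfig = Partial<Record<(typeof REQUIRED_KEYS)[number], string>>;

// Hypothetical helper: lists the keys that are missing, empty, or whitespace-only.
function missingKeys(config: EmbedbaseConfig): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = config[key];
    return value === undefined || value.trim() === '';
  });
}

// The sample values from the article, with the Pinecone key still empty.
const sample: EmbedbaseConfig = {
  pinecone_index: 'paul',
  pinecone_environment: 'us-east1-gcp',
  pinecone_api_key: '',
  openai_api_key: 'sk-xxxxxxx',
  openai_organization: 'org-xxxxx',
};

console.log(missingKeys(sample)); // [ 'pinecone_api_key' ]
```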

Running Embedbase

🎉 You can now run Embedbase!

Start Docker. If you do not have it, install it by following the instructions on the official website.

Then run Embedbase:

    docker-compose up

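(Optional) Cloud deployment

This part is optional; feel free to skip to the next part.

Don't want to deal with infrastructure? A hosted version will be released soon.

If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure you have a Google Cloud project and have installed the "gcloud" command-line tool from the official documentation.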

    # login to gcloud
    gcloud auth login

    # Get your Google Cloud project ID
    PROJECT_ID=$(gcloud config get-value project)

    # Enable container registry
    gcloud services enable containerregistry.googleapis.com

    # Enable Cloud Run
    gcloud services enable run.googleapis.com

    # Enable Secret Manager
    gcloud services enable secretmanager.googleapis.com

    # create a secret for the config
    gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

    # add a secret version based on your yaml config
    gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

    # Set your Docker image URL
    IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

    # Build the Docker image for cloud deployment
    docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

    # Push the docker image to Google Cloud Docker registries
    # Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
    docker push ${IMAGE_URL}

    # Deploy Embedbase to Google Cloud Run
    gcloud run deploy embedbase-paul-graham \
      --image ${IMAGE_URL} \
      --region us-central1 \
      --allow-unauthenticated \
      --set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1

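Building the Paul Graham essays crawler

A web crawler lets you download all the pages of a website. This is the basic technique behind search engines such as Google.

Clone the repository and install the dependencies:

    git clone https://github.com/another-ai/embedbase-paul-graham
    cd embedbase-paul-graham
    npm i

Let's look at the code. If you feel overwhelmed by all the files a TypeScript project needs, don't worry and just ignore them.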

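    // src/main.ts
    import { PlaywrightCrawler } from 'crawlee';
    // the router is defined in src/routes.ts (import path assumed from the file names)
    import { router } from './routes.js';

    // Here we want to start from the page that lists all of Paul's essays
    const startUrls = ['http://www.paulgraham.com/articles.html'];

    const crawler = new PlaywrightCrawler({
      requestHandler: router,
    });

    await crawler.run(startUrls);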

You can see that the crawler is initialized with "routes", but what are these mysterious routes?

    // src/routes.ts
    import { Dataset, createPlaywrightRouter } from 'crawlee';

    export const router = createPlaywrightRouter();

    router.addDefaultHandler(async ({ enqueueLinks, log }) => {
      log.info(`enqueueing new URLs`);
      await enqueueLinks({
        // Here we tell the crawler to only accept pages that are under
        // the "http://www.paulgraham.com/" domain name.
        // For example, if we find a link on Paul's website to a URL
        // like "https://ycombinator.com/startups", it will be ignored
        globs: ['http://www.paulgraham.com/**'],
        label: 'detail',
      });
    });

    router.addHandler('detail', async ({ request, page, log }) => {
      // Here we run some logic on all pages under
      // the "http://www.paulgraham.com/" domain name,
      // for example, collecting the page title
      const title = await page.title();
      // getting the essay's content
      const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
      if (!blogPost) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
      }
      log.info(`${title}`, { url: request.loadedUrl });

      // Remember that AI models and databases usually have limits on input size,
      // so we split essays into chunks of paragraphs:
      // split the blog post into chunks on \n\n
      const chunks = blogPost.split(/\n\n/);
      if (!chunks) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
      }

      // If you are not familiar with Promises, don't worry for now,
      // it's just a means to do things faster
      await Promise.all(chunks.map((chunk) => {
        const d = {
          url: request.loadedUrl,
          title: title,
          blogPost: chunk,
        };
        // Here we just want to send the interesting page content
        // into Embedbase (don't mind Dataset, it's optional local storage)
        return Promise.all([Dataset.pushData(d), add(title, chunk)]);
      }));
    });
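Splitting on "\n\n" can leave chunks that are empty or contain only whitespace, and each of those would still be sent for embedding. The sketch below shows one way to tidy the chunks first. `splitIntoChunks` is a hypothetical helper, not code from the example repository.

```typescript
// Split an essay into paragraph chunks on blank lines,
// trimming whitespace and dropping chunks that end up empty.
function splitIntoChunks(essay: string): string[] {
  return essay
    .split(/\n\s*\n/)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}

const essay = 'First paragraph.\n\nSecond paragraph.\n\n\n\n  \n\nThird.';
console.log(splitIntoChunks(essay));
// [ 'First paragraph.', 'Second paragraph.', 'Third.' ]
```

In the "detail" handler, such a helper could stand in for the plain `blogPost.split(/\n\n/)` call, so that only non-empty paragraphs reach Embedbase.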

What is add()?

    const add = (title: string, blogPost: string) => {
      // note "paul" in the URL: it can be anything you want
      // and helps you segment your data into isolated parts
      const url = `${baseUrl}/v1/paul`;
      const data = {
        documents: [{
          data: blogPost,
        }],
      };
      // send the data to Embedbase using the "node-fetch" library;
      // the promise is returned so that callers can wait for the request to finish
      return fetch(url, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
      }).then((response) => {
        return response.json();
      }).then((data) => {
        console.log('Success:', data);
      }).catch((error) => {
        console.error('Error:', error);
      });
    };
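Since the payload holds a "documents" array, one could presumably send all chunks of an essay in a single request instead of one request per chunk. The sketch below only builds such a request and does not send it. `buildAddRequest` is a hypothetical helper, and whether Embedbase accepts many documents per call (and up to what size) is an assumption to verify against your version.

```typescript
interface AddRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build one POST request carrying every chunk of an essay.
// The payload shape ({ documents: [{ data }] }) is copied from add() above;
// baseUrl and vaultId are supplied by the caller.
function buildAddRequest(baseUrl: string, vaultId: string, chunks: string[]): AddRequest {
  return {
    url: `${baseUrl}/v1/${vaultId}`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ documents: chunks.map((data) => ({ data })) }),
    },
  };
}

const request = buildAddRequest('http://localhost:8000', 'paul', ['first chunk', 'second chunk']);
console.log(request.url); // http://localhost:8000/v1/paul
```

The result can be passed to fetch as `fetch(request.url, request.init)`.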

You can now run the crawler. It should take less than a minute to download everything and ingest it into Embedbase.

This will use less than $1 of OpenAI credits.

    npm start

If you deployed Embedbase to the cloud, run:

    # you can get your Cloud Run URL like this:
    CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
    npm run playground ${CLOUD_RUN_URL}

You should see some activity in your terminals (both the Embedbase Docker container and the Node process) without errors (otherwise, feel free to reach out for help).

(Optional) Searching Embedbase in your terminal

"src/playground.ts" in the example repository is a simple script that lets you interact with Embedbase from your terminal. The code is straightforward:

    // src/playground.ts
    const search = async (query: string) => {
      const url = `${baseUrl}/v1/paul/search`;
      const data = {
        query,
      };
      return fetch(url, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
      }).then((response) => {
        return response.json();
      }).then((data) => {
        console.log('Success:', data);
      }).catch((error) => {
        console.error('Error:', error);
      });
    };

    const p = prompt();

    // this is an interactive terminal that let you search in paul graham
    // blog posts using semantic search
    // It is an infinite loop that will ask you for a query
    // and show you the results
    const start = async () => {
      console.log('Welcome to the Embedbase playground!');
      console.log('This playground is a simple example of how to use Embedbase');
      console.log('Currently using Embedbase server at', baseUrl);
      console.log('This is an interactive terminal that let you search in paul graham blog posts using semantic search');
      console.log('Try to run some queries such as "how to get rich"');
      console.log('or "how to pitch investor"');
      while (true) {
        const query = p('Enter a semantic query:');
        if (!query) {
          console.log('Bye!');
          return;
        }
        await search(query);
      }
    };

    start();

If you are running Embedbase locally, you can run it like this:

    npm run playground

Or, if you deployed Embedbase to the cloud:

    npm run playground ${CLOUD_RUN_URL}


(Optional) Building the Apple Siri Shortcut

Fun time! Let's build an Apple Siri Shortcut so that you can ask Siri questions about Paul Graham's essays 😜

First, open Apple Shortcuts.

Create a new shortcut.

Name this shortcut "Search Paul" (keep in mind that this is how you will ask Siri to start the shortcut, so pick something easy to say).

In plain English, this shortcut asks the user for a query, calls Embedbase with it, and tells Siri to read aloud the essays it found.

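"Dictate text" lets you speak your search query (choose English).
For clarity, store the Embedbase endpoint in a "Text" action. Change it according to your setup ("https://localhost:8000/v1/search" if you run Embedbase locally).
Set the endpoint in a variable, again for clarity.
Do the same for the dictated text.
"Get contents of" then performs an HTTP POST request to Embedbase, using "paul" as the "vault_id" (the one defined earlier while crawling) and the variable "query" for the "query" property.
"Get value for ... in" extracts the "similarities" property from the Embedbase response.
"Repeat with each" does the following for each similarity:
1. Get the "document_path" property
2. Add it to the variable "paths" (a list)
"Combine" joins the results with new lines.
(Optional, shown further down) A fun trick you can add to spice up the shortcut: use OpenAI GPT3 to rework part of the result text so that it sounds better when Siri reads it.
Assemble the result into a "Text" so that it is voice-friendly.
Ask Siri to speak it.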
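For readers who think more easily in code than in Shortcuts actions, the steps above can be sketched roughly as follows. The response shape (a "similarities" list whose items carry a "document_path") is inferred from the steps and is an assumption, as are the helper names `buildSearchBody` and `combinePaths`. Note also that the crawler and playground code put the vault name in the URL path ("/v1/paul/..."), whereas the shortcut sends it as "vault_id" in the body; which form applies depends on your Embedbase version.

```typescript
// Assumed response shape, inferred from the shortcut steps above.
interface SearchResponse {
  similarities: { document_path: string }[];
}

// Body of the POST request: the "vault_id" and "query" properties.
function buildSearchBody(query: string, vaultId = 'paul'): string {
  return JSON.stringify({ vault_id: vaultId, query });
}

// "Get value for", "Repeat with each" and "Combine": collect every
// document_path and join them with new lines so they can be read aloud.
function combinePaths(response: SearchResponse): string {
  const paths: string[] = [];
  for (const similarity of response.similarities) {
    paths.push(similarity.document_path);
  }
  return paths.join('\n');
}

// Made-up response, only to show the data flow.
const fakeResponse: SearchResponse = {
  similarities: [{ document_path: 'essay-a' }, { document_path: 'essay-b' }],
};

console.log(buildSearchBody('how to get rich'));
console.log(combinePaths(fakeResponse));
```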

You can use this working GPT3 shortcut to transform the results into nicer text.

(Fill in the "Authorization" value with "Bearer [YOUR OPENAI KEY]".)