Building an Embeddings-Powered Product to Search Paul Graham's Essays with Siri

by Embedbase, 11 min read, 2023/02/14

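Embedbase is an open-source API for building, storing, and retrieving embeddings.

Today we will build a search engine for Paul Graham's essays and use it from an Apple Siri Shortcut, so that you can ask Siri questions about these essays.

"Embeddings" are a concept in machine learning that lets you compare pieces of data.

We will not dive into the technical details of embeddings today.

One way to think about embeddings is as putting similar things together in a bag. If you have a bag of toys and you want to find a certain toy, you look in the bag and see which other toys are near it to figure out which one you want. A computer can do the same thing with words: it puts similar words together in a bag, then finds the word it wants based on the other words near it.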
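To make the "bag" analogy a little more concrete, here is a minimal sketch of how two embeddings can be compared. It uses cosine similarity, a common choice for this kind of comparison, though the article does not say which measure Embedbase relies on. The three-number vectors and the names `toyVectors` and `nearest` are made up for illustration; real embeddings are much longer lists of numbers.

```typescript
// Cosine similarity: close to 1 when two vectors point the same way,
// close to 0 when they are unrelated.
const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Hypothetical toy "embeddings" (invented values, only 3 dimensions).
const toyVectors: Record<string, number[]> = {
  dog: [0.9, 0.1, 0.0],
  puppy: [0.8, 0.2, 0.1],
  car: [0.0, 0.1, 0.9],
};

// Find the word whose vector is most similar to the given word's vector.
const nearest = (word: string): string => {
  let best = "";
  let bestScore = -Infinity;
  for (const [other, vector] of Object.entries(toyVectors)) {
    if (other === word) continue;
    const score = cosineSimilarity(toyVectors[word], vector);
    if (score > bestScore) {
      best = other;
      bestScore = score;
    }
  }
  return best;
};

console.log(nearest("dog")); // prints "puppy"
```

In the analogy, "dog" and "puppy" end up in the same bag because their vectors point in nearly the same direction, while "car" lands elsewhere.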

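When you want to use embeddings in production software, you need to be able to store and access them easily.

There are many vector databases and NLP models for storing and computing embeddings. There are also some additional tricks for dealing with embeddings.

For example, recomputing embeddings when it is not necessary can become costly and inefficient. If a note contains "dog" and, after an edit, still contains "dog", you don't necessarily want to recompute its embedding, because the change may carry no useful new information.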
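The idea of skipping unnecessary recomputation can be sketched as a small cache in front of the embedding function. This is only an illustration of the principle, not how Embedbase implements it: `makeCachedEmbedder` and `fakeEmbed` are hypothetical names, the cache is keyed on the exact text, and the stand-in embedder returns a dummy vector instead of calling a real model.

```typescript
type Embedder = (text: string) => number[];

// Wrap an embedder so that identical text is only embedded once.
const makeCachedEmbedder = (embed: Embedder) => {
  const cache = new Map<string, number[]>();
  let computations = 0;
  return {
    embed: (text: string): number[] => {
      const cached = cache.get(text);
      if (cached !== undefined) return cached; // unchanged text: reuse
      computations += 1;
      const vector = embed(text);
      cache.set(text, vector);
      return vector;
    },
    // How many times the underlying (expensive) embedder actually ran.
    computations: (): number => computations,
  };
};

// Stand-in for a real, paid embedding call (hypothetical).
const fakeEmbed: Embedder = (text) => [text.length];

const cachedEmbedder = makeCachedEmbedder(fakeEmbed);
cachedEmbedder.embed("dog");
cachedEmbedder.embed("dog"); // same content, served from the cache
cachedEmbedder.embed("dog and cat"); // new content, computed
console.log(cachedEmbedder.computations()); // prints 2
```

A production setup would more likely store a hash of the content rather than the full text, but the saving is the same: the second "dog" costs nothing.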

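In addition, these vector databases have a steep learning curve and require an understanding of machine learning.

Embedbase lets you sync and semantically search your data in just a few lines of code, without needing to know anything about machine learning, vector databases, or optimizing computations.

Searching Paul Graham's essays with Siri

In order, we will:

Deploy Embedbase locally or on Google Cloud Run
Build a crawler for Paul Graham's essays using Crawlee and ingest the data into Embedbase
Build an Apple Siri Shortcut that lets you search Paul Graham's essays through Embedbase with your voice and natural language

Time to build!

Tech stack

Embedbase
TypeScript
Crawlee with Playwright for crawling
Google Cloud Run for deployment
Apple Siri Shortcuts for querying the index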

Clone the repo

    git clone https://github.com/another-ai/embedbase
    cd embedbase

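Set up Pinecone

Head to the Pinecone website, log in, and create an index:

We will name it "paul" and use the dimension "1536". It is important to get this number right: under the hood, it is the "size" of the OpenAI "embedding" data structure. The other settings matter less.

You need to get your Pinecone API key so that Embedbase can communicate with Pinecone:

Configure OpenAI

Now you need to get your OpenAI configuration at https://platform.openai.com/account/api-keys (create an account if needed).

Press "Create new key":

Also get your organization ID from the organization settings page (https://platform.openai.com/account/org-settings):

Create your Embedbase config

Now create the file "config.yaml" (in the embedbase directory) and fill in the values: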

    # embedbase/config.yaml
    # https://app.pinecone.io/
    pinecone_index: "my index name"
    # replace this with your environment
    pinecone_environment: "us-east1-gcp"
    pinecone_api_key: ""
    # https://platform.openai.com/account/api-keys
    openai_api_key: "sk-xxxxxxx"
    # https://platform.openai.com/account/org-settings
    openai_organization: "org-xxxxx"

Run Embedbase

🎉 You can now run Embedbase!

Start Docker. If you don't have it, install it by following the instructions on the official website.

Now run Embedbase:

 docker-compose up

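(Optional) Cloud deployment

This is optional, so feel free to skip to the next part!

Don't want to deal with infrastructure? A hosted version is coming soon, and you can sign up to be the first to know when it is released.

If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure you have a Google Cloud project and have installed the "gcloud" command-line tool by following the official documentation.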

    # login to gcloud
    gcloud auth login

    # Get your Google Cloud project ID
    PROJECT_ID=$(gcloud config get-value project)

    # Enable container registry
    gcloud services enable containerregistry.googleapis.com

    # Enable Cloud Run
    gcloud services enable run.googleapis.com

    # Enable Secret Manager
    gcloud services enable secretmanager.googleapis.com

    # create a secret for the config
    gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

    # add a secret version based on your yaml config
    gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

    # Set your Docker image URL
    IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

    # Build the Docker image for cloud deployment
    docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

    # Push the docker image to Google Cloud Docker registries
    # Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
    docker push ${IMAGE_URL}

    # Deploy Embedbase to Google Cloud Run
    gcloud run deploy embedbase-paul-graham \
      --image ${IMAGE_URL} \
      --region us-central1 \
      --allow-unauthenticated \
      --set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1

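Building the crawler for Paul Graham's essays

A web crawler lets you download all the pages of a website by following the links it finds on each page. It is the same basic technique that search engines such as Google use to discover pages.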
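As a rough illustration of what a crawler does, the sketch below walks a tiny in-memory "website" breadth-first, following links and skipping pages it has already seen. Everything here is hypothetical (the `fakeSite` map, its page names, and the `crawl` function); the real project delegates this work to Crawlee.

```typescript
// Hypothetical in-memory "website": each URL maps to the links found on that page.
const fakeSite: Record<string, string[]> = {
  "/articles.html": ["/essay-a.html", "/essay-b.html"],
  "/essay-a.html": ["/articles.html", "/essay-b.html"],
  "/essay-b.html": ["/articles.html", "https://example.com/elsewhere"],
};

// Visit every page reachable from `start`, each one exactly once.
const crawl = (start: string, site: Record<string, string[]>): string[] => {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];
  while (queue.length > 0) {
    const url = queue.shift() as string;
    order.push(url);
    for (const link of site[url] ?? []) {
      // Skip links that leave the site and pages we have already queued.
      if (!link.startsWith("/") || visited.has(link)) continue;
      visited.add(link);
      queue.push(link);
    }
  }
  return order;
};

console.log(crawl("/articles.html", fakeSite));
// prints [ '/articles.html', '/essay-a.html', '/essay-b.html' ]
```

The two ingredients to notice are the queue of pages still to visit and the filter that keeps the crawl on a single site.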

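Clone the repository and install the dependencies:

    git clone https://github.com/another-ai/embedbase-paul-graham
    cd embedbase-paul-graham
    npm i

Let's look at the code. If you feel overwhelmed by all the files a TypeScript project needs, don't worry, just ignore them.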

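    // src/main.ts
    // Imports follow the standard Crawlee project template (assumed here)
    import { PlaywrightCrawler } from 'crawlee';
    import { router } from './routes.js';

    // Here we want to start from the page that lists all of Paul's essays
    const startUrls = ['http://www.paulgraham.com/articles.html'];

    const crawler = new PlaywrightCrawler({
        requestHandler: router,
    });

    await crawler.run(startUrls);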

You can see that the crawler is initialized with "routes". What are these mysterious routes?

    // src/routes.ts
    // Imports follow the standard Crawlee project template (assumed here)
    import { Dataset, createPlaywrightRouter } from 'crawlee';

    export const router = createPlaywrightRouter();

    router.addDefaultHandler(async ({ enqueueLinks, log }) => {
        log.info(`enqueueing new URLs`);
        await enqueueLinks({
            // Here we tell the crawler to only accept pages that are under
            // the "http://www.paulgraham.com/" domain name.
            // For example, if we find a link on Paul's website to a URL
            // like "https://ycombinator.com/startups", it will be ignored.
            globs: ['http://www.paulgraham.com/**'],
            label: 'detail',
        });
    });

    router.addHandler('detail', async ({ request, page, log }) => {
        // Here we run some logic on every page under
        // the "http://www.paulgraham.com/" domain name,
        // for example, collecting the page title
        const title = await page.title();
        // and getting the essay's content
        const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
        if (!blogPost) {
            log.info(`no blog post found for ${title}, skipping`);
            return;
        }
        log.info(`${title}`, { url: request.loadedUrl });

        // Remember that AI models and databases usually have limits on input size,
        // so we split each essay into chunks of paragraphs (on "\n\n")
        // and drop chunks that contain only whitespace
        const chunks = blogPost.split(/\n\n/).filter((chunk) => chunk.trim().length > 0);
        if (chunks.length === 0) {
            log.info(`no blog post found for ${title}, skipping`);
            return;
        }

        // If you are not familiar with Promises, don't worry for now:
        // it's just a means to do things faster
        await Promise.all(chunks.map((chunk) => {
            const d = {
                url: request.loadedUrl,
                title: title,
                blogPost: chunk,
            };
            // Here we just want to send the interesting page content
            // into Embedbase (don't mind Dataset, it's optional local storage)
            return Promise.all([Dataset.pushData(d), add(title, chunk)]);
        }));
    });
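The chunking step is worth looking at in isolation, since input-size limits are the reason it exists. The helper below is a hypothetical variant (the name `splitIntoChunks` is not from the repository) that splits on blank lines, trims each chunk, and drops empty ones so that no empty document is sent for embedding.

```typescript
// Split a long text into paragraph-sized chunks.
const splitIntoChunks = (text: string): string[] =>
  text
    .split(/\n\s*\n/) // one or more blank lines separate paragraphs
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);

const sampleEssay = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\n\nThird.";
console.log(splitIntoChunks(sampleEssay).length); // prints 3
```

Single line breaks inside a paragraph are preserved; only blank lines start a new chunk.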

What is add()?

    const add = (title: string, blogPost: string) => {
        // note "paul" in the URL, it can be anything you want
        // that will help you segment your data in
        // isolated parts
        const url = `${baseUrl}/v1/paul`;
        const data = {
            documents: [{
                data: blogPost,
            }],
        };
        // send the data to Embedbase using the "node-fetch" library;
        // the promise is returned so that callers can wait for the request to finish
        return fetch(url, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify(data),
        }).then((response) => {
            return response.json();
        }).then((data) => {
            console.log('Success:', data);
        }).catch((error) => {
            console.error('Error:', error);
        });
    };
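The request body above wraps a single chunk in a `documents` array. If the endpoint accepts several documents per request (the array shape suggests it, but this is an assumption not confirmed by the article), chunks could be grouped to reduce the number of HTTP calls. The sketch below only builds the request bodies; `toBatches` and `AddRequestBody` are hypothetical names.

```typescript
// Shape of the request body, mirroring the `data` object built in add().
interface AddRequestBody {
  documents: { data: string }[];
}

// Group chunks into request bodies holding at most `batchSize` documents each.
const toBatches = (chunks: string[], batchSize: number): AddRequestBody[] => {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const bodies: AddRequestBody[] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    bodies.push({
      documents: chunks.slice(i, i + batchSize).map((data) => ({ data })),
    });
  }
  return bodies;
};

const requestBodies = toBatches(["a", "b", "c", "d", "e"], 2);
console.log(requestBodies.length); // prints 3
console.log(JSON.stringify(requestBodies[2])); // prints {"documents":[{"data":"e"}]}
```

Each element of `requestBodies` could then be passed to `JSON.stringify` and posted the same way add() posts a single chunk.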

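Now you can run the crawler. Downloading everything and ingesting it into Embedbase should take less than a minute.

This will use OpenAI credits, for a cost of less than $1.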

 npm start

If you deployed Embedbase to the cloud, use

    # you can get your cloud run URL like this:
    CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
    npm run playground ${CLOUD_RUN_URL}

You should see some activity in your terminals (both the Embedbase Docker container and the Node process) with no errors. If you do see errors, feel free to reach out for help.

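(Optional) Searching Embedbase in your terminal

In the example repository you may notice "src/playground.ts", a simple script that lets you interact with Embedbase from your terminal. The code is straightforward: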

    // src/playground.ts
    const search = async (query: string) => {
        const url = `${baseUrl}/v1/paul/search`;
        const data = {
            query,
        };
        return fetch(url, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify(data),
        }).then((response) => {
            return response.json();
        }).then((data) => {
            console.log('Success:', data);
        }).catch((error) => {
            console.error('Error:', error);
        });
    };

    const p = prompt();

    // this is an interactive terminal that let you search in paul graham
    // blog posts using semantic search
    // It is an infinite loop that will ask you for a query
    // and show you the results
    const start = async () => {
        console.log('Welcome to the Embedbase playground!');
        console.log('This playground is a simple example of how to use Embedbase');
        console.log('Currently using Embedbase server at', baseUrl);
        console.log('This is an interactive terminal that let you search in paul graham blog posts using semantic search');
        console.log('Try to run some queries such as "how to get rich"');
        console.log('or "how to pitch investor"');
        while (true) {
            const query = p('Enter a semantic query:');
            if (!query) {
                console.log('Bye!');
                return;
            }
            await search(query);
        }
    };

    start();

If you are running Embedbase locally, you can run it like this:

    npm run playground

Or like this, if you deployed Embedbase to the cloud:

    npm run playground ${CLOUD_RUN_URL}

Result:

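(Optional) Building the Apple Siri Shortcut

Fun time! Let's build an Apple Siri Shortcut so that we can ask Siri questions about Paul Graham's essays 😜

First, let's launch Apple Shortcuts:

Create a new shortcut:

We will name this shortcut "Search Paul" (be aware that this is the phrase you will say to Siri to launch the shortcut, so pick something simple).

In plain English, this shortcut asks the user for a query, calls Embedbase with it, and tells Siri to read aloud the essays it found.

"Dictate text" lets you speak your search query (choose English)
We store the Embedbase endpoint in a "Text" for clarity; change it according to your setup ("https://localhost:8000/v1/search" if you run Embedbase locally)
We set the endpoint in a variable, again for clarity
Same for the dictated text
Now "Get contents of" will make an HTTP POST request to Embedbase, using "paul" (the name we defined earlier during crawling) as the "vault_id" and the variable "query" as the "query" property

"Get for in" will extract the "similarities" property from the Embedbase response
For each similarity, "Repeat with each item" will:
1. Get the "document_path" property
2. Add it to the variable "paths" (a list)
"Combine" will join the results with new lines
(Optional, shown below) This is a fun trick you can add to the shortcut to spice it up: use OpenAI GPT3 to rewrite part of the result text so that it sounds better when Siri says it
We assemble the result into a "Text" so that it is voice-friendly
Ask Siri to speak it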
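For readers who think more easily in code than in Shortcuts actions, the steps above can be sketched roughly as follows. This is an approximation: the response shape (`similarities`, each with a `document_path`) and the request fields (`vault_id`, `query`) are inferred from the step descriptions, and the names `SearchResponse`, `buildSearchBody`, and `toSpokenText` are hypothetical.

```typescript
// Assumed response shape, inferred from the shortcut steps.
interface SearchResponse {
  similarities: { document_path: string }[];
}

// "Get contents of": the JSON body of the HTTP POST request.
const buildSearchBody = (query: string): string =>
  JSON.stringify({ vault_id: "paul", query });

// "Get ... for ... in", "Repeat with each item" and "Combine":
// collect every document_path into a list, then join with new lines.
const toSpokenText = (response: SearchResponse): string => {
  const paths: string[] = [];
  for (const similarity of response.similarities) {
    paths.push(similarity.document_path);
  }
  return paths.join("\n");
};

// Made-up response, for illustration only.
const exampleResponse: SearchResponse = {
  similarities: [{ document_path: "essay-a" }, { document_path: "essay-b" }],
};

console.log(buildSearchBody("how to get rich"));
// prints {"vault_id":"paul","query":"how to get rich"}
console.log(toSpokenText(exampleResponse));
// prints essay-a and essay-b on separate lines
```

The string returned by `toSpokenText` corresponds to the "Text" that the shortcut finally hands to Siri.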

You can use this working GPT3 shortcut to turn the results into nicer text.

(Fill in the "Authorization" value with "Bearer [YOUR OPENAI KEY]")