Embedbase is an open-source API for building, storing & retrieving embeddings.
Today we will build a search engine for Paul Graham’s essays and use it with Apple Siri Shortcuts, e.g., asking Siri questions about these essays.
“Embeddings” are a concept in machine learning that allows you to compare data.
We will not dive into the technical topic of embeddings today.
A way to think about "embeddings" is like putting similar things together in a bag. So if you have a bag of toys, and you want to find a certain toy, you look in the bag and see what other toys are near it to figure out which one you want. A computer can do the same thing with words, putting similar words together in a bag, and then finding the word it wants based on the other words near it.
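A toy way to make this “nearness” concrete is cosine similarity. The sketch below uses made-up 3-dimensional vectors (real models such as OpenAI’s use 1536 dimensions); in a real system these vectors would come from an embedding model, not be written by hand:

```typescript
// Toy 3-dimensional "embeddings" (hand-made for illustration only;
// real OpenAI embeddings have 1536 dimensions).
const dog = [0.9, 0.1, 0.0];
const puppy = [0.8, 0.2, 0.0];
const car = [0.0, 0.1, 0.9];

// Cosine similarity: values closer to 1 mean "nearer in the bag".
const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

console.log(cosine(dog, puppy) > cosine(dog, car)); // true: "dog" is nearer "puppy" than "car"
```

Semantic search is essentially this comparison, done at scale against every stored vector.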
When you want to use embeddings in your production software, you need to be able to store and access them easily.
There are many vector databases and NLP models for storing and calculating embeddings. There are also some additional tricks to dealing with embeddings.
For example, it can become costly and inefficient to recompute embeddings when it is not necessary, e.g., if a note contained “dog” and still contains “dog” after an edit, you don’t necessarily want to recompute, because nothing meaningful has changed.
In addition, these vector databases come with a steep learning curve and require some understanding of machine learning.
Embedbase lets you synchronize and semantically search data in a few lines of code, without knowing anything about machine learning, vector databases, or optimizing calculations.
First, clone the Embedbase repository:
git clone https://github.com/another-ai/embedbase
cd embedbase
Head to the Pinecone website, log in, and create an index:
We will name it “paul” and use dimension “1536” (it is important to get this number right; under the hood, it is the size of OpenAI’s “embeddings” data structure). The other settings are less important.
You need to get your Pinecone API key that will let Embedbase communicate with Pinecone:
Now you need to get your OpenAI configuration at https://platform.openai.com/account/api-keys (create an account if needed).
Press “Create a new key”:
Also, get your organization ID here:
Now write and fill the values in the file “config.yaml” (in embedbase directory):
# embedbase/config.yaml
# https://app.pinecone.io/
pinecone_index: "paul"
# replace this with your environment
pinecone_environment: "us-east1-gcp"
pinecone_api_key: ""
# https://platform.openai.com/account/api-keys
openai_api_key: "sk-xxxxxxx"
# https://platform.openai.com/account/org-settings
openai_organization: "org-xxxxx"
🎉 You can run Embedbase now!
Start Docker. If you don’t have it, install it by following the instructions on the official website.
Now run Embedbase:
docker-compose up
This is optional, feel free to skip to the next part!
If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure you have a Google Cloud project and have installed the “gcloud” command line tool by following the official documentation.
# login to gcloud
gcloud auth login
# Get your Google Cloud project ID
PROJECT_ID=$(gcloud config get-value project)
# Enable container registry
gcloud services enable containerregistry.googleapis.com
# Enable Cloud Run
gcloud services enable run.googleapis.com
# Enable Secret Manager
gcloud services enable secretmanager.googleapis.com
# create a secret for the config
gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic
# add a secret version based on your yaml config
gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml
# Set your Docker image URL
IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"
# Build the Docker image for cloud deployment
docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile
# Push the docker image to Google Cloud Docker registries
# Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
docker push ${IMAGE_URL}
# Deploy Embedbase to Google Cloud Run
gcloud run deploy embedbase-paul-graham \
--image ${IMAGE_URL} \
--region us-central1 \
--allow-unauthenticated \
--set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1
Web crawlers let you download all the pages of a website; crawling is the underlying technique Google uses to index the web.
Clone the repository and install the dependencies:
git clone https://github.com/another-ai/embedbase-paul-graham
cd embedbase-paul-graham
npm i
Let’s look at the code. If you are overwhelmed by all the files a TypeScript project requires, don’t worry and ignore them.
// src/main.ts
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

// Here we want to start from the page that lists all of Paul's essays
const startUrls = ['http://www.paulgraham.com/articles.html'];
const crawler = new PlaywrightCrawler({
requestHandler: router,
});
await crawler.run(startUrls);
You can see that the crawler is initialized with “routes”, what are these mysterious routes?
// src/routes.ts
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
log.info(`enqueueing new URLs`);
await enqueueLinks({
// Here we tell the crawler to only accept pages that are under
// "http://www.paulgraham.com/" domain name,
// for example, if we find a link on Paul's website to a URL
// like "https://ycombinator.com/startups", it will be ignored
globs: ['http://www.paulgraham.com/**'],
label: 'detail',
});
});
router.addHandler('detail', async ({ request, page, log }) => {
// Here we will do some logic on all pages under
// "http://www.paulgraham.com/" domain name
// for example, collecting the page title
const title = await page.title();
// getting the essays' content
const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
if (!blogPost) {
log.info(`no blog post found for ${title}, skipping`);
return;
}
log.info(`${title}`, { url: request.loadedUrl });
// Remember that usually AI models and databases have some limits in input size
// and thus we will split essays in chunks of paragraphs
// split blog post in chunks on the \n\n
const chunks = blogPost.split(/\n\n/).filter((chunk) => chunk.trim().length > 0);
if (!chunks.length) {
log.info(`no content found for ${title}, skipping`);
return;
}
// If you are not familiar with Promises, don't worry for now:
// Promise.all just runs all these requests concurrently, which is faster
await Promise.all(chunks.map((chunk) => {
const d = {
url: request.loadedUrl,
title: title,
blogPost: chunk,
};
// Here we just want to send the page interesting
// content into Embedbase (don't mind Dataset, it's optional local storage)
return Promise.all([Dataset.pushData(d), add(title, chunk)]);
}));
});
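Since models and databases have input-size limits, as the comment above notes, the paragraph split can be hardened with a maximum chunk size. Here is a minimal sketch; the `maxLen` value is an arbitrary illustration, not an actual Embedbase or OpenAI limit:

```typescript
// Split text into paragraph chunks, further slicing any paragraph
// longer than maxLen characters (maxLen is an illustrative default).
const toChunks = (text: string, maxLen = 1000): string[] =>
  text
    .split(/\n\n/)
    .filter((p) => p.trim().length > 0)
    .flatMap((p) => {
      const slices: string[] = [];
      for (let i = 0; i < p.length; i += maxLen) {
        slices.push(p.slice(i, i + maxLen));
      }
      return slices;
    });

console.log(toChunks('a\n\nb', 3)); // [ 'a', 'b' ]
console.log(toChunks('abcdef', 3)); // [ 'abc', 'def' ]
```
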
What is add() ?
const add = (title: string, blogPost: string) => {
// note "paul" in the URL, it can be anything you want
// that will help you segment your data in
// isolated parts
const url = `${baseUrl}/v1/paul`;
const data = {
documents: [{
data: blogPost,
}],
};
// send the data to Embedbase using "node-fetch" library
return fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(data),
}).then((response) => {
return response.json();
}).then((data) => {
console.log('Success:', data);
}).catch((error) => {
console.error('Error:', error);
});
};
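Note that the “documents” field is an array, so one request could carry several chunks at once. This is a hedged sketch of a hypothetical batched variant, assuming the endpoint accepts more than one document per request (which the array shape suggests) and the same `baseUrl` as above:

```typescript
// Assumed base URL, as in the tutorial (local Embedbase instance).
const baseUrl = 'http://localhost:8000';

// Build one payload carrying several chunks in a single request.
const buildPayload = (chunks: string[]) => ({
  documents: chunks.map((chunk) => ({ data: chunk })),
});

// Hypothetical batched variant of add(): one HTTP call per batch of chunks.
const batchAdd = (chunks: string[]) =>
  fetch(`${baseUrl}/v1/paul`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildPayload(chunks)),
  }).then((response) => response.json());

console.log(buildPayload(['a', 'b']).documents.length); // 2
```

Batching reduces the number of HTTP round-trips when ingesting many chunks.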
Now you can run the crawler; it should take less than a minute to download and ingest everything into Embedbase.
OpenAI credits will be used (less than $1):
npm start
If you deployed Embedbase to the cloud, use:
# you can get your cloud run URL like this:
CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
npm run playground ${CLOUD_RUN_URL}
You should see some activity in your terminal (both the Embedbase Docker container and the node process) without errors (feel free to reach out for help otherwise).
In the example repository, you will notice “src/playground.ts”, a simple script that lets you interact with Embedbase in your terminal. The code is straightforward:
// src/playground.ts
const search = async (query: string) => {
const url = `${baseUrl}/v1/paul/search`;
const data = {
query,
};
return fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(data),
}).then((response) => {
return response.json();
}).then((data) => {
console.log('Success:', data);
}).catch((error) => {
console.error('Error:', error);
});
};
const p = prompt();
// this is an interactive terminal that lets you search Paul Graham's
// blog posts using semantic search
// It is an infinite loop that asks you for a query
// and shows you the results
const start = async () => {
console.log('Welcome to the Embedbase playground!');
console.log('This playground is a simple example of how to use Embedbase');
console.log('Currently using Embedbase server at', baseUrl);
console.log('This is an interactive terminal that lets you search Paul Graham blog posts using semantic search');
console.log('Try to run some queries such as "how to get rich"');
console.log('or "how to pitch investor"');
while (true) {
const query = p('Enter a semantic query:');
if (!query) {
console.log('Bye!');
return;
}
await search(query);
}
};
start();
You can run it like this if you are running Embedbase locally:
npm run playground
Or, like this, if you deployed Embedbase to the cloud:
npm run playground ${CLOUD_RUN_URL}
Results:
Fun time! Let’s build an Apple Siri Shortcut to be able to ask Siri questions about Paul Graham essays 😜
First, let’s start Apple Shortcuts:
Create a new shortcut:
We will name this shortcut “Search Paul” (be aware that it will be how you ask Siri to start the Shortcut, so pick something easy)
In plain English, this shortcut asks the user for a query, calls Embedbase with it, and has Siri read the essays it found out loud.
“Get for in” will extract the property “similarities” from the Embedbase response
“Repeat with each item in” will, for each similarity:
“Combine” will “join” the results with a new line
(Optional, shown below) This is a fun trick you can add to the shortcut for spiciness: using OpenAI GPT-3 to rewrite the result text so that it sounds better when Siri pronounces it
We assemble the result into a “Text” to be voice-friendly
Ask Siri to speak it
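The post-processing the shortcut does can be sketched in TypeScript. The response exposing a “similarities” array is taken from the steps above; the per-item `data` field is an assumption about the response shape, mirroring the document format used when ingesting:

```typescript
// Assumed shape of the search response: a "similarities" array
// whose items carry the matched chunk in a "data" field (assumption).
interface SearchResponse {
  similarities: { data: string }[];
}

// Mirror the shortcut: extract "similarities", take each result's text,
// then combine them with new lines into one voice-friendly string.
const toSpeech = (response: SearchResponse): string =>
  response.similarities.map((s) => s.data).join('\n');

const example: SearchResponse = {
  similarities: [{ data: 'Essay one.' }, { data: 'Essay two.' }],
};
console.log(toSpeech(example)); // Essay one.
                                // Essay two.
```
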
You can transform the results into nicer text using this functional GPT-3 shortcut
(fill “Authorization” value with “Bearer [YOUR OPENAI KEY]”)
With Embedbase, you can build a semantic-search-powered product in no time, without the headache of building, storing, and retrieving embeddings, while keeping costs low.