https://www.youtube.com/shorts/9700RsFxMBc?embedable=true

Embedbase is an open-source API for building, storing & retrieving embeddings.

Today we will build a search engine for Paul Graham’s essays that we will use with Apple Siri Shortcuts, e.g. asking Siri questions about these essays.

“Embeddings” are a concept in machine learning that allows you to compare data. We will not dive into the technical topic of embeddings today.

A way to think about "embeddings" is like putting similar things together in a bag. So if you have a bag of toys, and you want to find a certain toy, you look in the bag and see what other toys are near it to figure out which one you want. A computer can do the same thing with words, putting similar words together in a bag, and then finding the word it wants based on the other words near it.

When you want to use embeddings in your production software, you need to be able to store and access them easily. There are many vector databases and NLP models for storing and calculating embeddings. There are also some additional tricks to dealing with embeddings.

For example, it can become costly and inefficient to recompute embeddings when it is not necessary, e.g., if the note contains “dog” and then “dog,” you don’t necessarily want to recompute because the changed information might not be useful.

In addition, these vector databases have a steep learning curve and require an understanding of machine learning.

Embedbase allows you to synchronize and semantically search data in a few lines of code, without knowing anything about machine learning, vector databases, or optimizing calculations.

Searching Paul Graham essays with Siri

In order, we will:

1. Deploy Embedbase locally or on Google Cloud Run
2. Build a crawler for Paul Graham essays using Crawlee that ingests data into Embedbase
3. Build an Apple Siri Shortcut that lets you search Paul Graham’s essays through Embedbase with voice & natural language

Build time!
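Before building, here is a small illustration of the “dog” versus “dog,” point from the introduction. One way to avoid recomputing an embedding after a trivial edit is to fingerprint a normalized version of the text and skip the recompute when the fingerprint has not changed. This is only a sketch of the idea, not Embedbase's actual implementation; `normalize`, `fingerprint`, `needsRecompute`, and the in-memory `seen` map are hypothetical names, and the FNV-1a hash is just a dependency-free stand-in for whatever hash you prefer.

```typescript
// Hypothetical normalization: lowercase, drop punctuation, collapse whitespace,
// so that "dog" and "dog," end up as the same string.
const normalize = (text: string): string =>
  text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();

// 32-bit FNV-1a hash (non-cryptographic). With a hash this small, collisions
// are possible; a real system would likely use a longer digest.
const fingerprint = (text: string): number => {
  let h = 0x811c9dc5;
  const s = normalize(text);
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
};

// Hypothetical in-memory store: document id -> fingerprint at last embedding time.
const seen = new Map<string, number>();

// Returns true when the embedding should be (re)computed for this document.
const needsRecompute = (id: string, text: string): boolean => {
  const fp = fingerprint(text);
  if (seen.get(id) === fp) return false;
  seen.set(id, fp);
  return true;
};

console.log(needsRecompute('note-1', 'dog'));  // true: first time we see it
console.log(needsRecompute('note-1', 'dog,')); // false: only punctuation changed
console.log(needsRecompute('note-1', 'cat'));  // true: the content really changed
```

How aggressive the normalization should be is a judgment call: stripping punctuation saves OpenAI credits on cosmetic edits, but it also means a change that only touches punctuation never reaches the index.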
Tech stack

- Embedbase
- Typescript
- Crawlee + Playwright crawler
- Google Cloud Run for deployment
- Apple Siri Shortcuts for querying the index

Cloning the repo

    git clone https://github.com/another-ai/embedbase
    cd embedbase

Setting up Pinecone

Head to the Pinecone website, log in, and create an index.

We will name it “paul” and use dimension “1536” (important to get this number right: under the hood, it is the “size” of the OpenAI “embeddings” data structure). The other settings are less important.

You need to get your Pinecone API key, which will let Embedbase communicate with Pinecone.

Configuring OpenAI

Now you need to get your OpenAI API key at https://platform.openai.com/account/api-keys (create an account if needed).

Press “Create a new key”.

Also, get your organization ID here: https://platform.openai.com/account/org-settings

Creating your Embedbase config

Now write and fill the values in the file “config.yaml” (in the embedbase directory):

    # embedbase/config.yaml
    # https://app.pinecone.io/
    pinecone_index: "my index name"
    # replace this with your environment
    pinecone_environment: "us-east1-gcp"
    pinecone_api_key: ""
    # https://platform.openai.com/account/api-keys
    openai_api_key: "sk-xxxxxxx"
    # https://platform.openai.com/account/org-settings
    openai_organization: "org-xxxxx"

Running Embedbase

🎉 You can run Embedbase now!

Start Docker. If you don’t have it, please install it by following the instructions on the official website.

Now run Embedbase:

    docker-compose up

(Optional) Cloud deployment

This is optional, feel free to skip to the next part!

Don’t want to handle infra? We’re launching a hosted version soon.

If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure you have a Google Cloud project and have installed the “gcloud” command-line tool by following the official documentation.

    # login to gcloud
    gcloud auth login

    # Get your Google Cloud project ID
    PROJECT_ID=$(gcloud config get-value project)

    # Enable container registry
    gcloud services enable containerregistry.googleapis.com

    # Enable Cloud Run
    gcloud services enable run.googleapis.com

    # Enable Secret Manager
    gcloud services enable secretmanager.googleapis.com

    # create a secret for the config
    gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

    # add a secret version based on your yaml config
    gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

    # Set your Docker image URL
    IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

    # Build the Docker image for cloud deployment
    docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

    # Push the docker image to Google Cloud Docker registries
    # Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
    docker push ${IMAGE_URL}

    # Deploy Embedbase to Google Cloud Run
    gcloud run deploy embedbase-paul-graham \
      --image ${IMAGE_URL} \
      --region us-central1 \
      --allow-unauthenticated \
      --set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1

Building the Paul Graham essays’ crawler

Web crawlers allow you to download all the pages of a website. Crawling is also how search engines such as Google discover the pages they index.

Clone the repository and install the dependencies:

    git clone https://github.com/another-ai/embedbase-paul-graham
    cd embedbase-paul-graham
    npm i

Let’s look at the code. If you are overwhelmed by all the files necessary for a Typescript project, don’t worry and ignore them.

    // src/main.ts
    import { PlaywrightCrawler } from 'crawlee';
    import { router } from './routes.js';

    // Here we want to start from the page that lists all of Paul's essays
    const startUrls = ['http://www.paulgraham.com/articles.html'];

    const crawler = new PlaywrightCrawler({
        requestHandler: router,
    });

    await crawler.run(startUrls);

You can see that the crawler is initialized with “routes”. What are these mysterious routes?
    // src/routes.ts
    import { Dataset, createPlaywrightRouter } from 'crawlee';

    export const router = createPlaywrightRouter();

    router.addDefaultHandler(async ({ enqueueLinks, log }) => {
        log.info(`enqueueing new URLs`);
        await enqueueLinks({
            // Here we tell the crawler to only accept pages that are under
            // the "http://www.paulgraham.com/" domain name.
            // For example, if we find a link on Paul's website to a URL
            // like "https://ycombinator.com/startups", it will be ignored
            globs: ['http://www.paulgraham.com/**'],
            label: 'detail',
        });
    });

    router.addHandler('detail', async ({ request, page, log }) => {
        // Here we will do some logic on all pages under
        // the "http://www.paulgraham.com/" domain name,
        // for example, collecting the page title
        const title = await page.title();
        // getting the essay's content
        const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
        if (!blogPost) {
            log.info(`no blog post found for ${title}, skipping`);
            return;
        }
        log.info(`${title}`, { url: request.loadedUrl });

        // Remember that AI models and databases usually have some limits on input size,
        // so we split essays into chunks of paragraphs:
        // split the blog post on "\n\n" and drop empty chunks
        const chunks = blogPost.split(/\n\n/).filter((chunk) => chunk.trim().length > 0);
        if (chunks.length === 0) {
            log.info(`no blog post found for ${title}, skipping`);
            return;
        }

        // If you are not familiar with Promises, don't worry for now,
        // it's just a means to do things faster (all chunks are sent in parallel)
        await Promise.all(chunks.map((chunk) => {
            const d = {
                url: request.loadedUrl,
                title: title,
                blogPost: chunk,
            };
            // Here we just want to send the interesting content of the page
            // into Embedbase (don't mind Dataset, it's optional local storage)
            return Promise.all([Dataset.pushData(d), add(title, chunk)]);
        }));
    });

What is add()?
add() is a small helper that sends one chunk to Embedbase:

    // baseUrl is the URL of your Embedbase server, defined elsewhere in the file
    const add = (title: string, blogPost: string) => {
        // note "paul" in the URL, it can be anything you want
        // that will help you segment your data in
        // isolated parts
        const url = `${baseUrl}/v1/paul`;
        const data = {
            documents: [{
                data: blogPost,
            }],
        };
        // send the data to Embedbase using the "node-fetch" library;
        // the promise is returned so that the caller's Promise.all can wait for it
        return fetch(url, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify(data),
        }).then((response) => {
            return response.json();
        }).then((data) => {
            console.log('Success:', data);
        }).catch((error) => {
            console.error('Error:', error);
        });
    };

Now you can run the crawler. It should take less than a minute to download & ingest everything in Embedbase. This will use OpenAI credits, for less than $1.

    npm start

If you deployed Embedbase to the cloud, please use:

    # you can get your cloud run URL like this:
    CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")

    npm run playground ${CLOUD_RUN_URL}

You should see some activity in your terminal (both the Embedbase Docker container & the node process) without errors (feel free to reach out for help otherwise).

(Optional) Searching through Embedbase in your terminal

In the example repository you can notice “src/playground.ts”, a simple script that lets you interact with Embedbase in your terminal. The code is straightforward:

    // src/playground.ts
    const search = async (query: string) => {
        const url = `${baseUrl}/v1/paul/search`;
        const data = {
            query,
        };
        return fetch(url, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify(data),
        }).then((response) => {
            return response.json();
        }).then((data) => {
            console.log('Success:', data);
        }).catch((error) => {
            console.error('Error:', error);
        });
    };

    const p = prompt();

    // This is an interactive terminal that lets you search Paul Graham's
    // blog posts using semantic search.
    // It is an infinite loop that asks you for a query
    // and shows you the results
    const start = async () => {
        console.log('Welcome to the Embedbase playground!');
        console.log('This playground is a simple example of how to use Embedbase');
        console.log('Currently using Embedbase server at', baseUrl);
        console.log('This is an interactive terminal that let you search in paul graham blog posts using semantic search');
        console.log('Try to run some queries such as "how to get rich"');
        console.log('or "how to pitch investor"');
        while (true) {
            const query = p('Enter a semantic query:');
            if (!query) {
                console.log('Bye!');
                return;
            }
            await search(query);
        }
    };

    start();

You can run it like this if you are running Embedbase locally:

    npm run playground

Or, like this, if you deployed Embedbase to the cloud:

    npm run playground ${CLOUD_RUN_URL}

(Optional) Building Apple Siri Shortcut

Fun time!
Let’s build an Apple Siri Shortcut to be able to ask Siri questions about Paul Graham essays 😜

First, let’s start Apple Shortcuts and create a new shortcut.

We will name this shortcut “Search Paul” (be aware that this is how you will ask Siri to start the Shortcut, so pick something easy).

In plain English, this shortcut asks the user for a query, calls Embedbase with it, and tells Siri to read out loud the essays it found.

- “Dictate text” lets you ask your search query with voice (choose English language)
- We store the endpoint of Embedbase in a “Text” for clarity; change it according to your setup (“https://localhost:8000/v1/search” if you run Embedbase locally)
- We set the endpoint in a variable, again for clarity
- Same for the dictated text
- Now “Get contents of” will do an HTTP POST request to Embedbase, using the “vault_id” we defined during the crawling (“paul”) and the variable “query” for the “query” property
- “Get for in” will extract the property “similarities” from the Embedbase response
- “Repeat with each item in” will, for each similarity:
  - Get the “document_path” property
  - Add it to a variable “paths” (a list)
- “Combine” will “join” the results with a new line
- (Optional, will show how below) This is a fun trick you can add to the shortcut for spiciness: using OpenAI GPT3 to transform the result text a bit so that it sounds better when Siri pronounces it
- We assemble the result into a “Text” to be voice-friendly
- Ask Siri to speak it

You can transform the results into nicer text using this functional GPT3 shortcut (fill the “Authorization” value with “Bearer [YOUR OPENAI KEY]”).

With Embedbase, you can build a semantic-powered product in no time without having the headache of building, storing & retrieving embeddings while keeping the cost low.

Conclusion

Please try Embedbase, drop an issue, and, if you don’t mind supporting this effort, star the repository ❤️
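As a complement to the Shortcut walkthrough, the same sequence of actions can be sketched in TypeScript, which is handy for checking what Siri will read out before wiring up the Shortcut. This is a minimal sketch under assumptions taken from the Shortcut steps only: the request body carries “vault_id” and “query”, and the response has a “similarities” list whose items have a “document_path”. The names `SearchResponse`, `toSpokenText`, and `searchPaul` are hypothetical, other response fields are ignored, and the global `fetch` assumes Node 18 or later.

```typescript
// Response shape as described in the Shortcut steps; any other fields
// Embedbase returns are left out here (assumption, not the full schema).
interface SearchResponse {
  similarities: { document_path: string }[];
}

// Mirrors the "Get for in", "Repeat with each item in" and "Combine" actions:
// collect every document_path and join them with a new line.
const toSpokenText = (response: SearchResponse): string =>
  response.similarities.map((s) => s.document_path).join('\n');

// Mirrors the "Get contents of" action: POST the dictated query to Embedbase.
// `endpoint` is whatever you stored in the Shortcut's "Text" action.
const searchPaul = async (endpoint: string, query: string): Promise<string> => {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ vault_id: 'paul', query }),
  });
  if (!res.ok) {
    throw new Error(`Embedbase returned HTTP ${res.status}`);
  }
  return toSpokenText((await res.json()) as SearchResponse);
};

// Offline check with a hand-written response (the paths are made up).
console.log(
  toSpokenText({
    similarities: [{ document_path: 'essay-a' }, { document_path: 'essay-b' }],
  }),
);
```

With a running server, `searchPaul(endpoint, 'how to get rich').then(console.log)` would print the text that the “Combine” action hands to Siri, one path per line.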