Building an Embeddings Powered Product to Search Paul Graham Essays Using Siri


by Embedbase, February 14th, 2023

Too Long; Didn't Read

“Embeddings” are a concept in machine learning that allows you to compare data. “Embedbase” is an open-source API for building, storing & retrieving embeddings. We will build a search engine for Paul Graham’s essays that we will use with Apple Siri Shortcuts.


Embedbase is an open-source API for building, storing & retrieving embeddings.


Today we will build a search engine for Paul Graham’s essays that we will use with Apple Siri Shortcuts, e.g. asking Siri questions about these essays.


“Embeddings” are a concept in machine learning that allows you to compare data.

We will not dive into the technical topic of embeddings today.


A way to think about "embeddings" is like putting similar things together in a bag. So if you have a bag of toys, and you want to find a certain toy, you look in the bag and see what other toys are near it to figure out which one you want. A computer can do the same thing with words, putting similar words together in a bag, and then finding the word it wants based on the other words near it.


https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8
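To make the idea concrete, here is a minimal sketch of how two embeddings are compared, using cosine similarity and made-up 3-dimensional vectors (real OpenAI embeddings have 1536 dimensions):

```typescript
// Cosine similarity: close to 1 means "similar meaning",
// close to 0 means "unrelated"
const cosineSimilarity = (a: number[], b: number[]): number => {
    const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
    const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return dot / (norm(a) * norm(b));
};

// Toy vectors, hand-made for illustration only
const dog = [0.9, 0.1, 0.0];
const puppy = [0.8, 0.2, 0.0];
const car = [0.0, 0.1, 0.9];

// "dog" is closer to "puppy" than to "car"
console.log(cosineSimilarity(dog, puppy) > cosineSimilarity(dog, car)); // true
```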


When you want to use embeddings in your production software, you need to be able to store and access them easily.


There are many vector databases and NLP models for storing and calculating embeddings. There are also some additional tricks to dealing with embeddings.


For example, it can become costly and inefficient to recompute embeddings when it is not necessary; e.g., if a document contained “dog” and, after an edit, still contains “dog,” you don’t necessarily want to recompute because the information has not meaningfully changed.


In addition, these vector databases come with a steep learning curve and require some understanding of machine learning.


Embedbase lets you synchronize and semantically search data in a few lines of code, without knowing anything about machine learning, vector databases, or how to optimize the calculations.

Searching Paul Graham essays with Siri


In order, we will:


  1. Deploy Embedbase locally or on Google Cloud Run
  2. Build a crawler for Paul Graham’s essays using Crawlee that ingests data into Embedbase
  3. Build an Apple Siri Shortcut that lets you search Paul Graham’s essays through Embedbase with voice & natural language

Build time!

Tech stack


Cloning the repo


git clone https://github.com/another-ai/embedbase
cd embedbase


Setting up Pinecone

Head to the Pinecone website, log in, and create an index:

Creating a Pinecone index



We will name it “paul” and use dimension “1536” (it is important to get this number right; under the hood, it is the size of OpenAI’s “embeddings” data structure); the other settings are less important.


Pinecone index configuration


You need to get your Pinecone API key that will let Embedbase communicate with Pinecone:


Getting Pinecone API key



Configuring OpenAI


Now you need to get your OpenAI configuration at https://platform.openai.com/account/api-keys (create an account if needed).


Press “Create a new key”:

Creating an OpenAI key


Also, get your organization ID here:

Getting OpenAI organization ID


Creating your Embedbase config


Now create the file “config.yaml” (in the embedbase directory) and fill in the values:

# embedbase/config.yaml
# https://app.pinecone.io/
pinecone_index: "paul"
# replace this with your environment
pinecone_environment: "us-east1-gcp"
pinecone_api_key: ""

# https://platform.openai.com/account/api-keys
openai_api_key: "sk-xxxxxxx"
# https://platform.openai.com/account/org-settings
openai_organization: "org-xxxxx"


Running Embedbase


🎉 You can run Embedbase now!

Start Docker. If you don’t have it, install it by following the instructions on the official website.

Starting Docker on Mac

Now run Embedbase:

docker-compose up

(Optional) Cloud deployment

If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure you have a Google Cloud project and have installed the “gcloud” command line tool by following the official documentation.


# login to gcloud
gcloud auth login

# Get your Google Cloud project ID
PROJECT_ID=$(gcloud config get-value project)

# Enable container registry
gcloud services enable containerregistry.googleapis.com

# Enable Cloud Run
gcloud services enable run.googleapis.com

# Enable Secret Manager
gcloud services enable secretmanager.googleapis.com

# create a secret for the config
gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

# add a secret version based on your yaml config
gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

# Set your Docker image URL
IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

# Build the Docker image for cloud deployment
docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

# Push the docker image to Google Cloud Docker registries
# Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
docker push ${IMAGE_URL}

# Deploy Embedbase to Google Cloud Run
gcloud run deploy embedbase-paul-graham \
  --image ${IMAGE_URL} \
  --region us-central1 \
  --allow-unauthenticated \
  --set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1


Building the Paul Graham essays’ crawler

Web crawlers let you download all the pages of a website; this is the underlying technique used by Google.


Clone the repository and install the dependencies:

git clone https://github.com/another-ai/embedbase-paul-graham
cd embedbase-paul-graham
npm i


Let’s look at the code. If you are overwhelmed by all the files a TypeScript project needs, don’t worry and ignore them.


// src/main.ts
import { PlaywrightCrawler } from 'crawlee';

import { router } from './routes.js';

// Here we want to start from the page that lists all Paul's essays
const startUrls = ['http://www.paulgraham.com/articles.html'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);


You can see that the crawler is initialized with “routes”. What are these mysterious routes?


// src/routes.ts
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        // Here we tell the crawler to only accept pages that are under
        // the "http://www.paulgraham.com/" domain name;
        // for example, if we find a link on Paul's website to a URL
        // like "https://ycombinator.com/startups", it will be ignored
        globs: ['http://www.paulgraham.com/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    // Here we will do some logic on all pages under
    // "http://www.paulgraham.com/" domain name

    // for example, collecting the page title
    const title = await page.title();

    // getting the essays' content
    const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
    if (!blogPost) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
    }
    log.info(`${title}`, { url: request.loadedUrl });
    
    // Remember that usually AI models and databases have some limits in input size
    // and thus we will split essays in chunks of paragraphs
    // split blog post in chunks on the \n\n
    const chunks = blogPost.split(/\n\n/);
    // "split" always returns an array, so check for emptiness instead
    if (chunks.length === 0) {
        log.info(`no content found for ${title}, skipping`);
        return;
    }
    // If you are not familiar with Promises, don't worry for now,
    // it's just a means to do things faster
    await Promise.all(chunks.map((chunk) => {
        const d = {
            url: request.loadedUrl,
            title: title,
            blogPost: chunk,
        };
        // Here we just want to send the page's interesting
        // content into Embedbase (don't mind Dataset, it's optional local storage)
        return Promise.all([Dataset.pushData(d), add(title, chunk)]);
    }));
});


What is add()?


import fetch from 'node-fetch';

// Embedbase endpoint; change this to your Cloud Run URL if you deployed to the cloud
const baseUrl = 'http://localhost:8000';

const add = (title: string, blogPost: string) => {
    // note "paul" in the URL, it can be anything you want
    // that will help you segment your data in
    // isolated parts
    const url = `${baseUrl}/v1/paul`;
    const data = {
        documents: [{
            data: blogPost,
        }],
    };
    // send the data to Embedbase using "node-fetch" library
    fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};


Now you can run the crawler; it should take less than a minute to download & ingest everything into Embedbase.

OpenAI credits will be used: less than $1.

OpenAI cost


npm start


If you deployed Embedbase to the cloud, please use:


# you can get your cloud run URL like this:
CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
npm start ${CLOUD_RUN_URL}


You should see some activity in your terminal (both Embedbase Docker container & the node process) without errors (feel free to reach out for help otherwise).


(Optional) Searching through Embedbase in your terminal

In the example repository, you will notice “src/playground.ts”, a simple script that lets you interact with Embedbase in your terminal. The code is straightforward:

// src/playground.ts
import fetch from 'node-fetch';
import prompt from 'prompt-sync';

// Use the URL passed on the command line, or the local instance by default
const baseUrl = process.argv[2] || 'http://localhost:8000';

const search = async (query: string) => {
    const url = `${baseUrl}/v1/paul/search`;
    const data = {
        query,
    };
    return fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};

const p = prompt();

// this is an interactive terminal that lets you search Paul Graham's
// blog posts using semantic search.
// It is an infinite loop that asks you for a query
// and shows you the results
const start = async () => {
    console.log('Welcome to the Embedbase playground!');
    console.log('This playground is a simple example of how to use Embedbase');
    console.log('Currently using Embedbase server at', baseUrl);
    console.log('This is an interactive terminal that lets you search Paul Graham blog posts using semantic search');
    console.log('Try to run some queries such as "how to get rich"');
    console.log('or "how to pitch investor"');
    while (true) {
        const query = p('Enter a semantic query:');
        if (!query) {
            console.log('Bye!');
            return;
        }
        await search(query);
    }
};

start();


You can run it like this if you are running Embedbase locally:

npm run playground


Or, like this, if you deployed Embedbase to the cloud:

npm run playground ${CLOUD_RUN_URL}


Results:

(Optional) Building Apple Siri Shortcut

Fun time! Let’s build an Apple Siri Shortcut to be able to ask Siri questions about Paul Graham essays 😜


First, let’s start Apple Shortcuts:


Starting Apple Shortcuts


Create a new shortcut:

Creating a new Apple Siri Shortcuts


We will name this shortcut “Search Paul” (be aware that it will be how you ask Siri to start the Shortcut, so pick something easy)


In plain English, this shortcut asks the user for a query, calls Embedbase with it, and tells Siri to read the essays it found out loud.


The first part of the shortcut


  1. “Dictate text” lets you say your search query out loud (choose the English language)
  2. We store the endpoint of Embedbase in a “Text” for clarity; change it according to your setup (“http://localhost:8000/v1/search” if you run Embedbase locally)
  3. We set the endpoint in a variable, again for clarity
  4. Same for the dictated text
  5. Now “Get contents of” will do an HTTP POST request to Embedbase, using the “vault_id” “paul” we defined during crawling, and the variable “query” for the “query” property
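Expressed in code, the request that step 5 performs is equivalent to this sketch (the builder function is hypothetical; the URL and body shape match the search endpoint used earlier):

```typescript
// Build the request the Shortcut sends to Embedbase's search endpoint
const buildSearchRequest = (baseUrl: string, query: string) => ({
    url: `${baseUrl}/v1/paul/search`,
    options: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query }),
    },
});

// e.g. fetch(request.url, request.options) with the dictated text as query
const request = buildSearchRequest('http://localhost:8000', 'how to get rich');
console.log(request.url); // http://localhost:8000/v1/paul/search
```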


The last part of the shortcut


  1. “Get for in” will extract the property “similarities” from the Embedbase response

  2. “Repeat with each item in” will, for each similarity:

    1. Get the “document_path” property
    2. Add to a variable “paths” (a list)
  3. “Combine” will “join” the results with a new line

  4. (Optional, will show how below) This is a fun trick you can add to the shortcut for spiciness: using OpenAI GPT-3 to transform the result text a bit so it sounds better when Siri pronounces it

  5. We assemble the result into a “Text” to be voice-friendly

  6. Ask Siri to speak it
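Steps 1 to 3 above can be sketched in code; the response shape below (a “similarities” list with “document_path” entries) is assumed from the Shortcut’s steps:

```typescript
// Assumed shape of the Embedbase search response used by the Shortcut
type SearchResponse = {
    similarities: { document_path: string }[];
};

// Extract "similarities", collect each "document_path" into a list,
// then combine the list with new lines, like the Shortcut does
const formatResults = (response: SearchResponse): string => {
    const paths = response.similarities.map((s) => s.document_path);
    return paths.join('\n');
};
```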


You can transform the results into nicer text using this functional GPT3 shortcut


(fill “Authorization” value with “Bearer [YOUR OPENAI KEY]”)


A functional GPT3 shortcut to transform anything



With Embedbase, you can build a semantic-powered product in no time without having the headache of building, storing & retrieving embeddings while keeping the cost low.

Conclusion

Please try Embedbase, drop an issue if you run into problems, and, if you don’t mind supporting this effort, star the repository ❤️