https://www.youtube.com/shorts/9700RsFxMBc?embedable=true

Embedbase is an open-source API for building, storing & retrieving embeddings.

Today we will build a search engine for Paul Graham’s essays that we will use with Apple Siri Shortcuts, e.g. asking Siri questions about these essays.

“Embeddings” are a concept in machine learning that allows you to compare data. We will not dive into the technical topic of embeddings today.

A way to think about "embeddings" is like putting similar things together in a bag. So if you have a bag of toys, and you want to find a certain toy, you look in the bag and see what other toys are near it to figure out which one you want. A computer can do the same thing with words, putting similar words together in a bag, and then finding the word it wants based on the other words near it.

When you want to use embeddings in your production software, you need to be able to store and access them easily. There are many vector databases and NLP models for storing and calculating embeddings. There are also some additional tricks to dealing with embeddings. For example, it can become costly and inefficient to recompute embeddings when it is not necessary, e.g., if a note changes from “dog” to “dog,” you don’t necessarily want to recompute, because the change carries no useful information.

In addition, these vector databases have a steep learning curve and require you to understand machine learning.

Embedbase allows you to synchronize and semantically search data in a few lines of code, without knowing anything about machine learning, vector databases, or optimizing calculations.

Searching Paul Graham essays with Siri

In order, we will:

1. Deploy Embedbase locally or on Google Cloud Run
2. Build a crawler for Paul Graham essays using Crawlee that ingests data into Embedbase
3. Build an Apple Siri Shortcut that lets you search Paul Graham’s essays through Embedbase with voice & natural language

Build time!
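To make the recompute point above concrete, here is a minimal TypeScript sketch of one way to skip unnecessary work: normalize the text, hash it, and only call the embedding model when the hash is new. This is an illustration under stated assumptions, not a description of how Embedbase works internally. The names `normalize`, `contentKey`, `createCachedEmbedder` and the `embed` callback are hypothetical, and the normalization rule (lowercase, ASCII letters and digits only) is an assumption you would tune to your own data.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: make trivial edits (case, punctuation, extra
// whitespace) map to the same string. ASCII-only on purpose, as an assumption.
export function normalize(text: string): string {
    return text
        .toLowerCase()
        .replace(/[^a-z0-9\s]/g, "") // drop punctuation and symbols
        .replace(/\s+/g, " ")        // collapse runs of whitespace
        .trim();
}

// SHA-256 of the normalized text, used as a cache key.
export function contentKey(text: string): string {
    return createHash("sha256").update(normalize(text)).digest("hex");
}

// Hypothetical wrapper around any embedding function. The cache lives in
// memory here; a real service would persist it.
export function createCachedEmbedder(embed: (text: string) => Promise<number[]>) {
    const cache = new Map<string, number[]>();
    let calls = 0;
    return {
        async get(text: string): Promise<number[]> {
            const key = contentKey(text);
            const cached = cache.get(key);
            if (cached !== undefined) return cached; // nothing meaningful changed
            calls += 1;
            const vector = await embed(text);
            cache.set(key, vector);
            return vector;
        },
        // Number of times the underlying embed function was actually called.
        calls: (): number => calls,
    };
}
```

With this sketch, “dog” and “dog,” share one key, so the second request returns the cached vector instead of paying for a new embedding.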
Tech stack

- Embedbase
- TypeScript with Crawlee + Playwright for the crawler
- Google Cloud Run for deployment
- Apple Siri Shortcuts for querying the index

Cloning the repo

git clone https://github.com/another-ai/embedbase
cd embedbase

Setting up Pinecone

Head to the Pinecone website, log in and create an index.

We will name it “paul” and use dimension “1536” (it is important to get this number right: under the hood, it is the “size” of the OpenAI “embeddings” data structure); the other settings are less important.

You need to get your Pinecone API key, which will let Embedbase communicate with Pinecone.

Configuring OpenAI

Now you need to get your OpenAI configuration at https://platform.openai.com/account/api-keys (create an account if needed). Press “Create a new key”. Also, get your organization ID at https://platform.openai.com/account/org-settings.

Creating your Embedbase config

Now write and fill in the values in the file “config.yaml” (in the embedbase directory):

# embedbase/config.yaml
# https://app.pinecone.io/
pinecone_index: "my index name"
# replace this with your environment
pinecone_environment: "us-east1-gcp"
pinecone_api_key: ""

# https://platform.openai.com/account/api-keys
openai_api_key: "sk-xxxxxxx"
# https://platform.openai.com/account/org-settings
openai_organization: "org-xxxxx"

Running Embedbase

🎉 You can now run Embedbase! Start Docker; if you don’t have it, please install it by following the instructions on the official website.

Now run Embedbase:

docker-compose up

(Optional) Cloud deployment

This is optional, feel free to skip to the next part!

Don’t want to handle infra? We’re launching a hosted version soon.

If you are motivated, you can deploy Embedbase to Google Cloud Run. Make sure to have a Google Cloud project and to have installed the “gcloud” command line tool by following the official documentation.

# login to gcloud
gcloud auth login

# Get your Google Cloud project ID
PROJECT_ID=$(gcloud config get-value project)

# Enable container registry
gcloud services enable containerregistry.googleapis.com

# Enable Cloud Run
gcloud services enable run.googleapis.com

# Enable Secret Manager
gcloud services enable secretmanager.googleapis.com

# create a secret for the config
gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

# add a secret version based on your yaml config
gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

# Set your Docker image URL
IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

# Build the Docker image for cloud deployment
docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

# Push the docker image to Google Cloud Docker registries
# Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
docker push ${IMAGE_URL}

# Deploy Embedbase to Google Cloud Run
gcloud run deploy embedbase-paul-graham \
  --image ${IMAGE_URL} \
  --region us-central1 \
  --allow-unauthenticated \
  --set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1

Building the Paul Graham essays’ crawler

Web crawlers allow you to download all the pages of a website; crawling is the underlying technique that search engines such as Google use to discover pages.

Clone the repository and install the dependencies:

git clone https://github.com/another-ai/embedbase-paul-graham
cd embedbase-paul-graham
npm i

Let’s look at the code. If you are overwhelmed by all the files a TypeScript project needs, don’t worry & ignore them.

// src/main.ts
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

// Here we want to start from the page that lists all of Paul's essays
const startUrls = ['http://www.paulgraham.com/articles.html'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

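await crawler.run(startUrls);

You can see that the crawler is initialized with “routes”. What are these mysterious routes?

// src/routes.ts
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();
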
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        // Here we tell the crawler to only accept pages that are under
        // the "http://www.paulgraham.com/" domain name;
        // for example, if we find a link on Paul's website to a URL
        // like "https://ycombinator.com/startups", it will be ignored
        globs: ['http://www.paulgraham.com/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    // Here we will do some logic on all pages under
    // "http://www.paulgraham.com/" domain name

    // for example, collecting the page title
    const title = await page.title();

    // getting the essays' content
    const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
    if (!blogPost) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
    }
    log.info(`${title}`, { url: request.loadedUrl });
    
    // Remember that AI models and databases usually have limits on input size,
    // so we split each essay into paragraph-sized chunks on blank lines ("\n\n")
    // and drop chunks that contain only whitespace
    const chunks = blogPost.split(/\n\n/).filter((chunk) => chunk.trim().length > 0);
    if (chunks.length === 0) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
    }
    // If you are not familiar with Promises, don't worry for now:
    // they are just a means of sending all the chunks concurrently
    await Promise.all(chunks.flatMap((chunk) => {
        const d = {
            url: request.loadedUrl,
            title: title,
            blogPost: chunk,
        };
        // Here we just want to send the page's interesting content
        // to Embedbase (don't mind Dataset, it's optional local storage)
        return Promise.all([Dataset.pushData(d), add(title, chunk)]);
    }));
});

What is add()?

const add = (title: string, blogPost: string) => {
    // note "paul" in the URL, it can be anything you want
    // that will help you segment your data in
    // isolated parts
    const url = `${baseUrl}/v1/paul`;
    const data = {
        documents: [{
            data: blogPost,
        }],
    };
    // send the data to Embedbase using the "node-fetch" library;
    // returning the promise lets the caller's Promise.all wait for the request
    return fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};

Now you can run the crawler; it should take less than a minute to download & ingest everything into Embedbase. OpenAI credits will be used, costing less than $1.

npm start

If you deployed Embedbase to the cloud, please use:

# you can get your cloud run URL like this:
CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
npm run playground ${CLOUD_RUN_URL}

You should see some activity in your terminal (both the Embedbase Docker container & the node process) without errors (feel free to reach out for help otherwise).

(Optional) Searching through Embedbase in your terminal

In the example repository you can notice “src/playground.ts”, a simple script that lets you interact with Embedbase in your terminal. The code is straightforward:

// src/playground.ts
const search = async (query: string) => {
    const url = `${baseUrl}/v1/paul/search`;
    const data = {
        query,
    };
    return fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};

const p = prompt();

// this is an interactive terminal that lets you search Paul Graham's
// blog posts using semantic search
// It is an infinite loop that will ask you for a query
// and show you the results
const start = async () => {
    console.log('Welcome to the Embedbase playground!');
    console.log('This playground is a simple example of how to use Embedbase');
    console.log('Currently using Embedbase server at', baseUrl);
    console.log('This is an interactive terminal that let you search in paul graham blog posts using semantic search');
    console.log('Try to run some queries such as "how to get rich"');
    console.log('or "how to pitch investor"');
    while (true) {
        const query = p('Enter a semantic query:');
        if (!query) {
            console.log('Bye!');
            return;
        }
        await search(query);
    }
};

start();

You can run it like this if you are running Embedbase locally:

npm run playground

Or, like this, if you deployed Embedbase to the cloud:

npm run playground ${CLOUD_RUN_URL}

(Optional) Building Apple Siri Shortcut

Fun time! Let’s build an Apple Siri Shortcut to be able to ask Siri questions about Paul Graham essays 😜

First, start Apple Shortcuts and create a new shortcut. We will name this shortcut “Search Paul” (be aware that this name is how you ask Siri to start the Shortcut, so pick something easy to say).

In plain English, this shortcut asks the user for a query, calls Embedbase with it, and tells Siri to read out loud the essays it found.

1. “Dictate text” lets you ask your search query with voice (choose English language)
2. We store the Embedbase endpoint in a “Text” for clarity; change it according to your setup (“https://localhost:8000/v1/search” if you run Embedbase locally)
3. We set the endpoint in a variable, again for clarity
4. Same for the dictated text
5. Now “Get contents of” will do an HTTP POST request to Embedbase, using the “vault_id” we defined during the crawling, “paul”, and the variable “query” for the “query” property
6. “Get for in” will extract the property “similarities” from the Embedbase response
7. “Repeat with each item in” will, for each similarity:
   - Get the “document_path” property
   - Add it to a variable “paths” (a list)
8. “Combine” will “join” the results with a new line
9. (Optional, will show how below) This is a fun trick you can add to the shortcut for spiciness: using OpenAI GPT3 to rewrite the result text a bit so it sounds better when Siri pronounces it
10. We assemble the result into a “Text” to be voice-friendly
11. Ask Siri to speak it

https://www.youtube.com/shorts/9700RsFxMBc?embedable=true

You can transform the results into nicer text using this functional GPT3 shortcut (fill the “Authorization” value with “Bearer [YOUR OPENAI KEY]”).

With Embedbase, you can build a semantic-powered product in no time without having the headache of building, storing & retrieving embeddings while keeping the cost low.

Conclusion

Please try Embedbase, drop an issue, and, if you don’t mind, support this effort by starring the repository ❤️

https://github.com/another-ai/embedbase-paul-graham?embedable=true
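For readers who prefer code to screenshots, the Siri Shortcut steps described earlier can be sketched in TypeScript. This is only a rough equivalent under assumptions: the response shape (a “similarities” list whose items carry a “document_path”) and the request body (“vault_id” plus “query”) are taken from the Shortcut description, the function names `collectPaths` and `searchPaul` are hypothetical, and the global `fetch` of Node 18+ is assumed instead of the “node-fetch” library used in the repository.

```typescript
// Response shape assumed from the Shortcut steps, not from an API reference.
interface Similarity {
    document_path?: string;
}

interface SearchResponse {
    similarities?: Similarity[];
}

// Mirrors "Repeat with each item in" + "Combine": collect every
// document_path and join them with a new line.
export function collectPaths(response: SearchResponse): string {
    const paths: string[] = [];
    for (const similarity of response.similarities ?? []) {
        if (similarity.document_path) {
            paths.push(similarity.document_path);
        }
    }
    return paths.join("\n");
}

// Mirrors "Get contents of": POST the dictated query to the Embedbase
// endpoint. The endpoint is a parameter because it depends on your setup.
export async function searchPaul(endpoint: string, query: string): Promise<string> {
    const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ vault_id: "paul", query }),
    });
    if (!response.ok) {
        throw new Error(`Embedbase search failed with status ${response.status}`);
    }
    return collectPaths((await response.json()) as SearchResponse);
}
```

Calling `searchPaul` with your own endpoint and a query such as "how to get rich" would return the text that the Shortcut hands to Siri, provided your Embedbase version accepts this request shape.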

The code in this story is for educational purposes. The readers are solely responsible for whatever they build with it.

Building an Embeddings Powered Product to Search Paul Graham Essays Using Siri
