With all the buzz about large language models, vectors, and vector search, it's important to take a step back and understand how these artificial intelligence technology advancements translate into results for an organization and, ultimately, its customers.
In an earlier article, I told the story of a hypothetical contractor who was hired to help implement AI/ML solutions at a big-box retailer. Here, we continue that journey as our distributed systems and AI specialist uses vector search to drive results with customer promotions.
Today, we met with the Promotions team. They’re looking for our help in making some smarter decisions about customer advertisements, offers, and coupons. Right now, promotions are mainly based on geographic markets. So, promotions sent out to customers in one city will differ from those offered to customers in another city. Thus, these campaigns have to be planned out ahead of time.
The issue is that the marketing department is putting pressure on them to provide more strategic methods of targeting customers, geographic regions notwithstanding. The thought is that specific customers might be more likely to take advantage of certain offers based on their purchase history. If we can find a way to make those offers in real time, such as a 10-percent discount on a related product (per cart), we might be able to drive some additional sales.
First, I decided to have a look at some of our anonymized order data in their Apache Cassandra cluster. It’s clear that there are definitely some patterns in the data, with some customers purchasing the same items (mostly grocery products) with some predictable regularity. Maybe there’s some way we can leverage this data?
One thing that we have going for us: Our customers tend to engage with us via multiple channels. Some folks use the website, some use the mobile app, and some folks will still walk into any one of our 1,000 or so brick-and-mortar stores. And more than half of those customers in the store use the mobile app concurrently.
Another interesting point: If we aggregate the item sale data by household address instead of just by customer ID, we see even more rigid shopping patterns. After pulling data together from a few different sources, we can start to paint a picture of just what this data looks like.
Example: A couple has a dog. Usually, one spouse buys the dog food. But sometimes, the other spouse does. Those events at the individual customer level don’t make much of a pattern.
But when aggregated together at the household level, they do. In fact, they consistently buy more than one 6-pound roll of “HealthyFresh - Chicken raw dog food” each week. Judging by the recommended serving sizes, they also either have a half-dozen little dogs or one big dog, probably the latter. A visualization of this data can be seen in Figure 1.
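To make the household idea concrete, here is a minimal sketch of the same order records totaled two ways: by customer and by household. The record layout and every identifier in it (`cust-1`, `house-A`, and so on) are hypothetical and not taken from the retailer's actual schema.

```python
from collections import defaultdict

# Hypothetical order records: (customer_id, household_id, product_name, quantity).
# None of these identifiers come from the retailer's real data.
orders = [
    ("cust-1", "house-A", "HealthyFresh - Chicken raw dog food", 1),
    ("cust-2", "house-A", "HealthyFresh - Chicken raw dog food", 1),
    ("cust-1", "house-A", "HealthyFresh - Chicken raw dog food", 1),
    ("cust-3", "house-B", "Sparkling water", 2),
]


def total_quantity(records, key_index):
    """Sum quantities per (key, product); key_index picks the grouping column."""
    totals = defaultdict(int)
    for record in records:
        key, product, quantity = record[key_index], record[2], record[3]
        totals[(key, product)] += quantity
    return dict(totals)


by_customer = total_quantity(orders, key_index=0)   # split across two spouses
by_household = total_quantity(orders, key_index=1)  # one combined total
```

Grouped by customer, the dog food purchases are split across two people. Grouped by household, they collapse into a single, larger total, which is the kind of signal that is easier to spot as a pattern.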
The stores that each customer frequents can also come into play here. For example, our customer Marie might be traveling for work and might visit one of our stores that she has never been to. If the store is a significant distance from her list of frequent stores, we can assume that Marie is likely shopping for specific items related to her travel. In this case, her normal shopping habits would not apply (she probably will not buy dog food), and we do not need to present her with a promotion to encourage her to spend.
However, if Marie is in one of her frequent stores, it might make sense to trigger a promotion on her mobile device. If we encourage our customers to scan items with their phones, we will know which items are in their physical shopping cart. Then, we can present similar items to complement the products already in the cart.
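The "frequent store" check could be sketched as a simple distance test. This is only an illustration: the haversine formula is a standard great-circle approximation, but the 50 km cutoff, the function names, and the coordinates are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_near_frequent_store(current_store, frequent_stores, max_km=50.0):
    """True if the current store is within max_km of any frequent store.

    max_km is an assumed threshold, not a value from the Promotions team.
    """
    return any(
        haversine_km(*current_store, *store) <= max_km
        for store in frequent_stores
    )


# Hypothetical coordinates for Marie's usual stores and two candidate stores.
marie_frequent_stores = [(44.98, -93.27)]
nearby_store = (44.95, -93.09)
faraway_store = (41.88, -87.63)

trigger_promotion = is_near_frequent_store(nearby_store, marie_frequent_stores)
skip_promotion = not is_near_frequent_store(faraway_store, marie_frequent_stores)
```

A service could run a check like this before deciding whether to look up a promotion at all.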
Finding similar products means we’ll need to compute similarity vectors for our products. There are a few ways that we can do this. In the interest of putting a minimum viable product together, we can focus only on the product names and build a natural language processing (NLP) model based on a “bag of words” approach.
In this approach, we take every word from all of the product names and give each unique word an entry. This is our vocabulary. The similarity vectors that we create and store with each product become an array of ones and zeros, indicating whether the current product name possesses that word, as shown below in Table 1. We can use a platform like
Table 1 - A bag of words NLP vocabulary for product names under the Pet Supplies category, showing how each vector is assembled.
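As a rough sketch of how vectors like these could be assembled, the snippet below builds a vocabulary from product names and encodes each name as ones and zeros. The product list is illustrative only: the full "Beef" product name and the dog toy are assumptions, so this vocabulary is smaller than the 14-word one described in the article, and the helper names are made up for the example.

```python
import re


def tokenize(name):
    """Lowercase a product name and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())


def build_vocabulary(names):
    """Collect each unique word once, in the order it is first seen."""
    vocabulary = []
    seen = set()
    for name in names:
        for word in tokenize(name):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def vectorize(name, vocabulary):
    """Return 1 for each vocabulary word present in the name, else 0."""
    words = set(tokenize(name))
    return [1 if word in words else 0 for word in vocabulary]


# Illustrative names only; the second and third entries are assumptions.
product_names = [
    "HealthyFresh - Chicken raw dog food",
    "HealthyFresh - Beef raw dog food",
    "Squeaky rubber dog toy",
]

vocabulary = build_vocabulary(product_names)
vectors = {name: vectorize(name, vocabulary) for name in product_names}
```

Because the vocabulary is ordered by first appearance, the first product's five words occupy the first five positions, so its vector starts with five ones followed by zeros.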
One problem with the “bag of words” approach is that the vectors can end up with many more zeros than ones. This can lead to longer model training times and longer prediction times. To cut down on those problems, we’ll build a unique vocabulary for each major product category. Vectors will not be usable across different categories, but that’s OK because we can filter by category at query time.
We can then create a table in our Apache Cassandra cluster to support vector searching for each specific category. As our vocabulary contains 14 words, the size of our vector will also need to be 14:
CREATE TABLE pet_supply_vectors (
    product_id TEXT PRIMARY KEY,
    product_name TEXT,
    product_vector vector<float, 14>);
For the vector search to function appropriately, we will need to create a storage-attached index (SAI) on the table:
CREATE CUSTOM INDEX ON pet_supply_vectors(product_vector) USING 'StorageAttachedIndex';
We can then load the output from our ML model into Cassandra. With the data present, the next step is to add a new service, which performs the vector search using the following query:
SELECT * FROM pet_supply_vectors
ORDER BY product_vector
ANN OF [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] LIMIT 2;
This runs an approximate nearest neighbor (ANN) algorithm using the vector stored with the current product. That will return both the current product and the next-closest (nearest neighbor) by vector search, thanks to the LIMIT 2 clause.
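To build some intuition for what that query returns, here is a brute-force sketch that ranks a handful of vectors by cosine similarity and keeps the top two. This is not how Cassandra executes the search (an ANN index avoids comparing against every row), and it assumes cosine similarity as the comparison, which I understand to be the default for vector indexes. Only `pf1843` and its vector come from the article; the other two rows are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, rows, limit=2):
    """Exact (not approximate) nearest neighbors: score every row, keep the best."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row[1]),
        reverse=True,
    )
    return ranked[:limit]


# (product_id, product_vector) rows; the last two are hypothetical examples.
rows = [
    ("pf1843", [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("pf0001", [1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("pf0002", [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]),
]

query_vector = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
top_two = nearest(query_vector, rows, limit=2)
```

The query vector matches its own row exactly, so that row ranks first, and the row sharing four of its five words ranks second.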
In the above query, we are using the vector from the “HealthyFresh - Chicken raw dog food” product, assuming that a customer has either just added it to their cart online or scanned it with their phone. We process this event and compose the following message:
customer_id: a3f5c9a3
device_id: e6f40454
product_id: pf1843
product_name: “HealthyFresh - Chicken raw dog food”
product_vector: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
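One way to put a message like this on the wire is to serialize it as JSON. The sketch below assumes JSON encoding and a hypothetical `build_cart_event` helper; the article does not say which serialization format or schema is actually used.

```python
import json


def build_cart_event(customer_id, device_id, product_id, product_name, product_vector):
    """Serialize a cart event to UTF-8 JSON bytes (assumed format, for illustration)."""
    event = {
        "customer_id": customer_id,
        "device_id": device_id,
        "product_id": product_id,
        "product_name": product_name,
        "product_vector": product_vector,
    }
    return json.dumps(event).encode("utf-8")


payload = build_cart_event(
    customer_id="a3f5c9a3",
    device_id="e6f40454",
    product_id="pf1843",
    product_name="HealthyFresh - Chicken raw dog food",
    product_vector=[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
)
```

The resulting bytes are the kind of payload a messaging client would typically accept for publishing.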
We then send this message to an Apache Pulsar topic. We consume from the topic and use the above data to call the getPromotionProduct endpoint on our Promotions microservice. This runs the query indicated above, returning both of the “HealthyFresh” flavors. We disregard the data for the product_vector that matches at 100% (the product we already have) and trigger a promotion for the “HealthyFresh - Beef” flavor on their device.
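The "disregard the product we already have" step might look something like the sketch below. Note one swap: instead of comparing vectors for a 100% match, it filters on `product_id`, which is simpler and gives the same result when the exact match is the scanned product itself. The `pick_promotion` name, the row shape, and the `pf0001` ID are hypothetical.

```python
def pick_promotion(current_product_id, search_results):
    """Return the first search result that is not the product already in the cart.

    search_results is assumed to be ordered from most to least similar.
    Returns None when there is nothing else to promote.
    """
    for row in search_results:
        if row["product_id"] != current_product_id:
            return row
    return None


# Hypothetical rows shaped like the output of the vector search query.
search_results = [
    {"product_id": "pf1843", "product_name": "HealthyFresh - Chicken raw dog food"},
    {"product_id": "pf0001", "product_name": "HealthyFresh - Beef raw dog food"},
]

promotion = pick_promotion("pf1843", search_results)
```

Returning `None` when only the scanned product comes back lets the caller skip the promotion rather than offer a discount on something already in the cart.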
After running with this logic in place for a couple of weeks, our Promotions team reached out and reported that our methods are triggering an additional sale roughly 25% of the time. While it is tempting to call that a “win,” there are definitely some additional things we could look at improving.
We went with a “bag of words” NLP approach just to get an initial (software) product out the door. I’ve read good things about different NLP algorithms like “Word2Vec,” which may be a better approach in the long run. Our model was also only concerned with building a vocabulary with the words that made up the product names. Perhaps expanding our model inputs to include other product details (e.g., size, color, brand) might help to fine-tune things a bit?
In an upcoming article, we’ll show how we used vector search to help our transportation services team improve their delivery route efficiency.
Check out the GitHub repository for an example
By Aaron Ploetz, DataStax
Also published here.