Deep learning has become a hot topic in recent years, and it benefits many industries such as retail, e-commerce, and finance. I work at a global retail and e-commerce company with millions of products and customers.
In my daily work, I use the power of data and deep learning to provide personalized recommendations for our customers, and recently I tried the embedding-based approach, which performs very well compared with our current algorithms.
In this blog, I will describe the embedding technique that I developed and how to implement it in a large-scale machine learning system. Basically, I trained product embeddings from customers’ shopping sequences, used time decay to differentiate short-term and long-term interests, and fed them into neural networks to get personalized recommendations. The whole process is implemented with Google Cloud Platform and Apache Airflow.
In 2013, Google released the word2vec project, which provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. Since then, word embeddings have been widely used in the NLP domain, where people previously used high-dimensional sparse vectors such as one-hot encodings. At the same time, researchers have found that embeddings can also be used in other domains like search and recommendations, where we can encode the latent meaning of products and use it in machine learning tasks through neural networks.
Word embeddings are learned from the co-occurrence of words and their neighboring words, with the assumption that words that appear together are more likely to be related than those that are far apart.
Following the same idea, we can make an analogy like this: a word is like a product; a sentence is like ONE customer’s shopping sequence; an article is like the shopping sequences of ALL customers. This embedding technique allows us to represent products or users as low-dimensional continuous vectors, whereas one-hot encoding leads to the curse of dimensionality for machine learning models.
Assume that we have clickstream data from N users, and each user has a sequence of products (p1, p2, …, pm) with each pi ∈ P, where P is the union of products clicked by the users. Given this dataset, the aim is to learn a d-dimensional real-valued representation v(Pi) ∈ R^d of each unique product Pi. Choosing the right ‘d’ is a tradeoff between model performance and the memory needed for the vector calculations. After multiple offline experiments, I chose d=50 as the vector length.
First, prepare the shopping sequence. To make the sequence similar to a real sentence, I eliminate users who interacted with fewer than five products. Then I list the products in ascending time order for each user.
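The preparation step can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the `(user_id, product_id, timestamp)` tuple layout, the `build_sequences` name, and the choice to count distinct products are all assumptions made for the example.

```python
from collections import defaultdict

def build_sequences(clicks, min_products=5):
    """Group clicks by user and return time-ordered product sequences.

    `clicks` is an iterable of (user_id, product_id, timestamp) tuples
    (a hypothetical layout chosen for this sketch).
    """
    events_by_user = defaultdict(list)
    for user_id, product_id, timestamp in clicks:
        events_by_user[user_id].append((timestamp, product_id))

    sequences = {}
    for user_id, events in events_by_user.items():
        # Assumption: "fewer than five products" means distinct products.
        if len({product for _, product in events}) < min_products:
            continue
        # Ascending time order; ties keep their input order (stable sort).
        events.sort(key=lambda event: event[0])
        sequences[user_id] = [product for _, product in events]
    return sequences

clicks = [
    ("u1", "p3", 30), ("u1", "p1", 10), ("u1", "p2", 20),
    ("u1", "p5", 50), ("u1", "p4", 40),
    ("u2", "p1", 5), ("u2", "p2", 6),  # only two products, so u2 is dropped
]
sequences = build_sequences(clicks)
```

Here `u1` survives the filter with its products sorted by timestamp, while `u2` is removed for having too short a history.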
Now the sequence data is ready to be fed into a neural network, from which we can obtain the product embeddings. In practice, I tried two approaches. The first is to modify Google’s word2vec TensorFlow code. Specifically, I use the skip-gram model with negative sampling and update the weights with stochastic gradient descent. The second is StarSpace, a general-purpose neural embedding model developed by Facebook AI Research that can solve a wide variety of problems.
With this, I obtained product embeddings for 98% of our products using one week of clickstream data, resulting in high-quality low-dimensional representations.
I use two approaches to validate that the product embeddings are meaningful. The first one is the cosine similarity between pairs of d-dimensional vectors. For example, the similarity between an iPhone X and a Samsung Galaxy should be higher than the similarity between an iPhone X and a chair.
The second approach is to use t-SNE to visualize the embeddings. We expect similar products to be closer together in the embedding space, and the visualization shows that this is indeed the case. So we can conclude that the embeddings can be used to efficiently calculate the similarities between products, and we can use them as input for our neural networks (not the network that produces the embeddings, but the one that produces personalized outputs).
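To make the skip-gram idea concrete, here is a small sketch of how training pairs could be generated from one shopping sequence. It is not the word2vec or StarSpace code itself: the function names and the window size are hypothetical, and negatives are drawn uniformly for simplicity, whereas word2vec samples them from a smoothed frequency distribution.

```python
import random

def skipgram_pairs(sequence, window=2):
    """Yield (center, context) pairs for products within `window` positions."""
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sequence[j]

def sample_negatives(vocabulary, exclude, k, rng):
    """Draw up to k products uniformly at random, skipping those in `exclude`.

    Simplification: uniform sampling instead of a frequency-based distribution.
    """
    candidates = [p for p in vocabulary if p not in exclude]
    return rng.sample(candidates, min(k, len(candidates)))

sequence = ["p1", "p2", "p3", "p4"]
pairs = list(skipgram_pairs(sequence, window=1))

vocabulary = ["p1", "p2", "p3", "p4", "p5", "p6"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = sample_negatives(vocabulary, exclude={"p1", "p2"}, k=2, rng=rng)
```

Each positive pair pulls the two product vectors together during training, while the sampled negatives are pushed apart from the center product.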
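The cosine-similarity check can be sketched as follows. The three-dimensional vectors are made-up toy values for illustration only, not real embeddings from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, chosen only to illustrate the check).
embeddings = {
    "iphone_x": [0.9, 0.1, 0.0],
    "galaxy": [0.8, 0.2, 0.1],
    "chair": [0.0, 0.1, 0.9],
}
phone_vs_phone = cosine_similarity(embeddings["iphone_x"], embeddings["galaxy"])
phone_vs_chair = cosine_similarity(embeddings["iphone_x"], embeddings["chair"])
```

With meaningful embeddings, `phone_vs_phone` should come out clearly higher than `phone_vs_chair`.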
To get personalized recommendations, we should put the customer information into the model. We can aggregate a user’s browsing history by taking a time-decay weighted average of the d-dimensional product embeddings, with the assumption that recently viewed products play a more important role in the customer’s final decision than products viewed a long time ago.
The weights here are softmax probabilities over time, calculated by the formula on the left. D is a parameter controlling how important the recent events are, and t is a function of the time period between the current time and the past event time.
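As a rough sketch of the aggregation, the snippet below assumes the weights take the form softmax(-t / D), where t is the age of an event (for example in days). The exact formula is not reproduced in the text, so this form, the function names, and the sample values are assumptions for illustration.

```python
import math

def time_decay_weights(ages, decay):
    """Softmax over -age / decay, so more recent events get larger weights.

    Assumed form: w_i = exp(-t_i / D) / sum_j exp(-t_j / D).
    """
    scores = [-age / decay for age in ages]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

def user_embedding(product_vectors, ages, decay=1.0):
    """Time-decay weighted average of a user's product embeddings."""
    weights = time_decay_weights(ages, decay)
    dim = len(product_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, product_vectors))
        for k in range(dim)
    ]

vectors = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-dimensional product embeddings
ages = [7.0, 1.0]                   # first product viewed 7 days ago, second 1 day ago
weights = time_decay_weights(ages, decay=2.0)
user_vec = user_embedding(vectors, ages, decay=2.0)
```

Under this assumed form, a smaller D concentrates the weight on the most recent events, while a larger D moves the result toward a plain average.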
I use Apache Airflow to get user embeddings from raw clickstream data and pre-trained product embeddings. After four BigQueryOperator tasks, I export the processed embeddings to Google Cloud Storage for the modeling part.
Once we have the user embeddings, we can feed them into a neural network for recommendation. The target is the most recent product for each user, and the features are user embeddings computed from all of that user’s product embeddings except the most recent one. I use dense layers with ReLU activation functions, followed by Dropout layers and BatchNormalization layers. The final layer is an M-class softmax classifier, where M is the number of unique products in our trained embeddings.
I conducted offline experiments with several metrics, such as next-add-to-cart hit rate and mean reciprocal rank, and this model shows a significant improvement over our current models.
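The way a training example is built from one user’s sequence can be sketched like this. The helper names are hypothetical, and a plain mean stands in for the time-decay weighted average to keep the example short.

```python
def mean_vector(vectors):
    """Plain average of equal-length vectors (stand-in for the time-decay average)."""
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]

def make_training_example(sequence, product_vectors, aggregate):
    """Split a time-ordered sequence into (features, target).

    The last product is the target; the features are aggregated from
    every earlier product in the sequence.
    """
    *history, target = sequence
    history_vectors = [product_vectors[p] for p in history]
    return aggregate(history_vectors), target

# Toy embeddings (hypothetical values).
product_vectors = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}

# Map each product to a class index for the M-class softmax output.
class_index = {p: i for i, p in enumerate(sorted(product_vectors))}

features, target = make_training_example(["p1", "p2", "p3"], product_vectors, mean_vector)
label = class_index[target]
```

Holding out the most recent product keeps the target from leaking into the features.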
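For reference, the two metrics can be computed along these lines. This is a generic sketch with made-up rankings; the function names and the choice of k are assumptions, and in this setting the target would be each user’s next add-to-cart product.

```python
def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of users whose target appears in their top-k recommendations."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1 / rank of the target, counting 0 when the target is absent."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)

# Made-up recommendations for three users and their true next products.
ranked_lists = [["p1", "p2", "p3"], ["p2", "p3", "p1"], ["p3", "p1", "p2"]]
targets = ["p1", "p3", "p9"]

hit_at_2 = hit_rate_at_k(ranked_lists, targets, k=2)
mrr = mean_reciprocal_rank(ranked_lists, targets)
```

Hit rate only asks whether the target made the top k, while mean reciprocal rank also rewards placing it nearer the top.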
More embeddings: I only use click data in the current embedding-based model. But any type of event can form a customer sequence: search, add to cart, purchase.
Besides the event type, I can also get embeddings from the product information such as product descriptions, product images, and features.
However, the tradeoff between latency (model complexity) and performance needs to be considered.
Global positive samples: Inspired by the KDD 2018 best paper from Airbnb, Real-time Personalization using Embeddings for Search Ranking at Airbnb, I will consider adding the final purchased items in the sessions as global positive samples.
The reason is that many factors may influence a customer’s final shopping decision, and I think the purchased products have some latent relationship with the other products that were clicked but not purchased.
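One way this idea could look in the pair-generation step is sketched below: the purchased product is added as a context for every product in the session, regardless of the window. This is an illustrative sketch of the idea rather than an implemented part of the system, and the function name and window size are hypothetical.

```python
def pairs_with_global_positive(session, purchased, window=2):
    """Skip-gram pairs plus the purchased product as a global context.

    Every clicked product gets an extra (center, purchased) pair,
    even when the purchase falls outside its context window.
    """
    pairs = []
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
        # Global positive: added regardless of distance to the purchase.
        if center != purchased:
            pairs.append((center, purchased))
    return pairs

session = ["p1", "p2", "p3", "p4"]  # clicks in time order; "p4" was purchased
pairs = pairs_with_global_positive(session, purchased="p4", window=1)
```

With a window of 1, `p1` would normally never be paired with `p4`; the global positive adds that pair, so early clicks are still tied to the final purchase.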