Bits of Thought: Yelp Content As Embeddings

Written by bmarquie | Published 2023/12/06
Tech Story Tags: vector-embedding | machine-learning | transformers | nlp | ai | yelp-engineering | mlm-models | content-organization

TL;DR: A few months ago, the Yelp engineering team published an article about how they use embeddings to represent and organize online content, particularly within the context of Yelp’s efforts to offer easily accessible high-quality content. If you want an intro to embeddings, learn about the cool things done at Yelp, and later play with the off-the-shelf models available on Hugging Face, let's dive in together!

A few months ago, the Yelp engineering team published an article about how they use embeddings to represent and organize online content, particularly within the context of Yelp’s efforts to offer easily accessible high-quality content.

https://engineeringblog.yelp.com/2023/04/yelp-content-as-embeddings.html

Yelp aims to offer easily accessible high-quality content. We need to tag, organize and rank online content to attain this goal. For this purpose, Yelp engineers have started using general embeddings on different data. It improves usability and efficiency for all kinds of model development.

How refreshing it is not to have yet another article about embeddings in the context of RAG (Retrieval-Augmented Generation)! That was the hook for me!

Additionally, having worked many times on enterprise search systems (from intricate implementation details to ranking/relevancy aspects), I was also looking for use cases that were not search-oriented, or at least not directly. Exploring embeddings in the context of semantic search (vs. full-text search) or hybrid search is a valid topic by itself, but that’s not something for this time. :)

Without further ado, let’s dive deep into this article! I’m excited to learn about how Yelp supports its various use cases related to tagging, organizing, and ranking. Embeddings are the common backbone for many ML models and facilitate cross-team collaborations.

I will use this article to introduce some machine learning concepts along the way that can help you better understand the domain.


But first, this is all about embeddings!

Text embedding has been researched in depth in the scientific community. First, embeddings were generated with sparse vectors representing words.

Spoiler alert: the article not only explores text embeddings but also delves into photo embeddings (yep, Yelp has a lot of different content…)!

At its core, one of the article’s focuses is embeddings, which are vectors utilized to represent diverse types of content. Named appropriately, these vectors are ‘embedded’ into a designated space, serving as a numerical representation for the encoded content. This space is a coordinate system where each vector has a specific position. The arrangement of vectors in this space captures the relationships and context between them.

In the context of embeddings, each vector in the space has multiple dimensions. Each dimension represents a specific feature or characteristic of the content.

Embeddings developed further with context-aware embeddings since the same word can have different meanings depending on how it is used in a sentence. With the use of transformers in recent years, we now have text snippet embeddings that capture more semantic meaning.

Consider a simplified text embedding with two dimensions: one representing the frequency of technology-related terms and the other for health-related terms. A vector [4, 1] might represent a text heavily focused on technology. Embeddings can encode entire text passages, not just single words, and can have numerous dimensions tailored for specific use cases (e.g., similarity, sentiment analysis). This is a simplified representation of text embeddings. The model used by Yelp is more advanced than this example. A good starting point to learn about word embeddings is to explore Word2Vec.
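If you want to get a hands-on feel for word embeddings before going further, here is a minimal sketch with gensim’s Word2Vec (the tiny corpus and the parameters are made up purely for illustration):

# Minimal Word2Vec sketch with gensim (toy corpus, illustrative parameters only).
from gensim.models import Word2Vec

corpus = [
    ["great", "pizza", "and", "friendly", "staff"],
    ["terrible", "pizza", "slow", "service"],
    ["friendly", "staff", "fast", "service"],
]

# vector_size is the number of dimensions of each word embedding
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

vector = model.wv["pizza"]             # a 50-dimensional numpy array
print(model.wv.most_similar("staff"))  # words whose vectors are closest in the space

On a real corpus, words used in similar contexts end up close together in that 50-dimensional space, which is exactly the property everything below builds on.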

Whether vectors represent words, photos, or text snippets, proximity in space implies meaningful connections. This representation enables advanced operations and use cases.

Small thing: sparse vs dense vectors. Sparse vectors offer memory efficiency but may be computationally expensive, while dense vectors provide richer representations at the cost of increased memory requirements. Both are compact representations of the content.
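To make the sparse vs. dense distinction concrete, here is a small illustrative sketch (the vocabulary and the numbers are mine): a bag-of-words vector is as long as the vocabulary and mostly zeros, while a dense embedding packs information into a few hundred floats.

import numpy as np

# Sparse representation: one slot per vocabulary word, mostly zeros.
vocabulary = ["pizza", "staff", "service", "slow", "friendly"]  # real vocabularies have tens of thousands of words
sentence = "friendly staff friendly service"
sparse_vector = np.array([sentence.split().count(word) for word in vocabulary])
# -> [0, 1, 1, 0, 2]; with a 50,000-word vocabulary, almost every entry would be 0

# Dense representation: a fixed, small number of learned dimensions, all informative.
dense_vector = np.random.rand(512)  # stand-in for what an encoder like USE would output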

To wrap up this refresher on embeddings (though we spent quite some time there), in the realm of multimodal models (models that can handle different types of data, such as both text and images), the direct use of raw data representations is infrequent. Instead, content is effectively encoded as vectors, providing a convenient means for manipulation. Once the content is represented numerically, basic mathematical operations become feasible (the basis for many ML algorithms).

Oh, and I almost forgot: how do we create these embeddings? A transformer is a neural network architecture designed for various tasks, including generating nuanced and semantically meaningful embeddings not limited to text alone. It considers broader contexts, whether within a sentence, text snippets, or other data types like images in multimodal settings. Here’s the introductory paper, “Attention Is All You Need” (https://arxiv.org/abs/1706.03762), if you want to learn more.


So what does Yelp want to do?

Semantic comprehension of the text is essential for Yelp. Yelp reviews are our most valuable asset since they contain a lot of business context and sentiment. We want to capture the essence of each review text to serve their information to our users better. We looked for versatility in our embedding as we try to use the same embedding in various tasks: tagging, information extraction, sentiment analysis and ranking.

I briefly mentioned it earlier, but here’s the core idea! The goal is to have versatile embeddings that can be applied in different contexts to accomplish various tasks. Once these embeddings are obtained, they can be used as input for different machine learning models, each tailored to specific tasks.

To achieve this versatility, a transformer embedding model is employed, trained on labeled input data relevant to the desired tasks. In the end, the space where the embeddings are encoded captures the knowledge necessary for these diverse tasks.

Embeddings based on reviews are currently generated by the Universal Sentence Encoder off-the-shelf model offered by Tensorflow. This blog section will present the USE model, any modifications tested to improve it and its advantages for the Yelp dataset.

The Universal Sentence Encoder (USE) serves as the base transformer model they’ve chosen. If you’re interested in more details, you can refer to their article or read the paper introducing the Universal Sentence Encoder.

Alright, I’ll skip this part. I’m not particularly interested in the inner workings of the USE transformer; it’s a bit too detailed for me. I’m sure it’s impressive, though! ;)

You can find the base model on Hugging Face, and it also includes a comprehensive description of its internals.

https://huggingface.co/Dimitre/universal-sentence-encoder

Here’s something helpful for better visualizing these embeddings: earlier, I provided an example of a 2-dimensional vector used to encode whether a text was more focused on technology or health. USE generates, by default, 512-dimensional embeddings.

That helps put into perspective how these vectors can capture enough context to be used for all these various tasks.
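For reference, here is roughly what generating those 512-dimensional embeddings looks like with the pre-trained USE model from TensorFlow Hub (a minimal sketch; the example sentences are mine):

import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder from TensorFlow Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "The tacos here are amazing and the staff is super friendly.",
    "Slow service and cold food, I won't come back.",
]
embeddings = embed(sentences)
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence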


Then, the article goes on to describe how they used this model to encode reviews and the results.

By nature, most NLP models will perform better when trained on domain-specific text. With this hypothesis, we developed and compared a Yelp fine-tuned encoder with the pre-trained USE model available on TensorFlow-Hub. We aimed to create a better model adapted to the Yelp domain than the pre-trained model. After fine-tuning the model, we wanted to use it to generate an embedding for reviews specifically.

Some vocabulary clarifications first: the terms ‘off-the-shelf model,’ ‘foundation model,’ and ‘base model’ are often used interchangeably. They typically refer to pre-trained models available for immediate use without the need for extensive customization or additional training.

The development of a base model is a resource-intensive process, with a focus on data quantity over quality. The costs associated with this first stage can amount to millions of dollars in compute and can span several weeks of training. The exact details of large model development, including training time and dataset sizes, are proprietary information held by the developers and researchers involved in the project. Access to this information is crucial to identify biases and understand the ethical impact on applications that use them. Open-source models, like USE, have an advantage here.

Given the significant costs associated with creating base models, not every company can afford this investment. Consequently, having access to these pre-trained models off the shelf for free proves to be highly convenient.

In the second stage, these pre-trained base models can be fine-tuned to better capture business domain knowledge. While the first stage focuses on learning structure, features, and recurring patterns from extensive, general-purpose datasets, the second stage trains the base model on a smaller, domain- and task-specific dataset, prioritizing quality over quantity.
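For intuition, a fine-tuning setup for this second stage could look something like the sketch below. This is my assumption of a typical recipe, not Yelp’s actual pipeline: wrap the pre-trained USE encoder in a Keras model, add a small task head, and train on a modest labeled dataset.

import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained USE encoder, unfrozen so its weights can adapt to the domain.
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=True,
)

inputs = tf.keras.Input(shape=[], dtype=tf.string)      # raw review text
embedding = encoder(inputs)                              # 512-dimensional embedding
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(embedding)  # e.g., a sentiment head
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# reviews / labels stand in for a small, curated, domain-specific dataset
# model.fit(reviews, labels, epochs=3, batch_size=32)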

However, in Yelp’s case, the fine-tuned model did not outperform the base model during inference.

The model evaluation made on the Yelp domain showed that the ready-to-use model performed better than or as well as the Yelp pre-trained encoder for all tasks. These results happened because either the Yelp domain touches many generic subjects in the USE model or our experiments lacked task diversity to gain an edge. Based on these results, we decided to keep the off-the-shelf USE pre-trained model.

Definitely, the quality of the fine-tuning dataset could be a significant reason. Factors like diversity and biases might impede the fine-tuning process. This raises questions about how they assess the quality of their input data, specifically reviews, and whether they undergo preprocessing. Hyperparameters can also play a role in impacting the process.

Now, the choice of evaluation metrics is crucial. If the metrics used do not align with the goals of the fine-tuning tasks, they might not accurately reflect the model’s performance.

We computed the cosine similarity between different embedding representations of reviews

Cosine similarity is commonly used to compare text embeddings. It does not take into account the magnitude of the vectors (i.e., the length of the text). Instead, it focuses on the orientation of the vectors rather than their absolute values. It’s easy to understand that the semantic meaning of a sentence should not be impacted by the length, but if the vectors point in opposite directions, they probably express opposite meanings.
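As a reminder, cosine similarity only looks at the angle between two vectors, never at their length; a minimal numpy version makes that obvious:

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms:
    # 1.0 means same direction, 0.0 unrelated, -1.0 opposite directions.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy example: doubling a vector (a "longer text") does not change the similarity.
a = np.array([4.0, 1.0])
print(cosine_similarity(a, 2 * a))  # 1.0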

Note also that, as mentioned earlier, USE generates fixed-length vectors. In this case, it would not have been an issue with cosine similarity, but fixed-length vectors are easier to integrate with different ML algorithms (and this is what they want, remember: tagging, sentiment analysis, etc.).

… It transforms varying sentence lengths into a fixed-length vector representation.

But ultimately, what I’m really missing in this testing approach, beyond the metric that was used, is the process. I would have liked to know more about their testing approach, including details about the data they used, whether it is automated, some insights into the infrastructure, etc.


Now that they have explained how they generate text embeddings from user reviews of businesses, the article moves on: these review embeddings are used to derive another type of embedding, business embeddings.

We chose to base our business embedding on user content. We select the 50 most recent reviews and average their vector embeddings to create our first business embedding representation. It’s a great way to start since reviews contain quality content describing the businesses. The next step will be to add the photo embeddings as well.

So basically, they average the embeddings of a business’s 50 most recent reviews to generate a single embedding per business.

Something like that?

# element-wise average of the 50 review vectors (e.g., numpy arrays)
business_embedding = sum(top_50_embeddings) / len(top_50_embeddings)

After all, that’s the beauty of having vectors…

However, beyond recency, they don’t explain whether those 50 reviews are filtered or weighted in any way. Perhaps they include the review rating as part of the text embedding and use it down the road for ranking. A clearer explanation of how additional metadata is handled would have been nice.

And then there’s this assumption that “reviews contain quality content.” Really? I mean, as a user, when I drop a note on a business page, it’s not always a masterpiece… So, I’m still hung up on my earlier question about sizing up the quality of their input data.

And why do they do that…?

Business embeddings help generate a top-k similarity list to relate businesses to other businesses, users to businesses and users to users based on their matching business interaction history. This correlation matrix of similarity helps show significant recommendations like “Users like you also liked…” or “Since you like business A, you might like business B”. You can learn more about this use case in this blog post.

Ah, that’s cool — recommendations based on vectors. That’s a use case that makes sense (comparing related vectors) and adds a lot of end-user value. Well done!

The approach they mention is based on a technique named collaborative filtering, either item-based or user-based. In their case, the correlation matrix mentioned in the context of generating recommendations indicates that a similarity score, possibly Pearson correlation, is computed for each pair of businesses or users.
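As a sketch of how such a similarity matrix could be built from business embeddings (my own toy example, using cosine similarity rather than whatever metric Yelp actually uses):

import numpy as np

# business_embeddings: one fixed-length vector per business (toy random data here).
business_embeddings = np.random.rand(1000, 512)

# Normalize the rows so that a dot product equals cosine similarity.
normalized = business_embeddings / np.linalg.norm(business_embeddings, axis=1, keepdims=True)
similarity_matrix = normalized @ normalized.T  # (1000, 1000) pairwise similarities

# Top-k most similar businesses to business 0 (excluding itself).
k = 5
top_k = np.argsort(-similarity_matrix[0])[1 : k + 1]
print(top_k)  # indices of candidate "you might also like" businesses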

They refer to another post that provides a detailed explanation of their approach.

It reminds me of an amazing book by the late Toby Segaran: old, but incredibly well written, and it actually sparked an obsession with recommendations for me at the time. But there’s so much more than recommendations in this book!

https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/


Ok, we are almost at the end, 2/3 done…

They now introduce another type of embedding: photo embeddings. Well, this one is not for the faint of heart, and honestly, I will go through it quickly because if you want nitty-gritty details, head to the article. I think I have more newbie questions than comments.

This time, they chose to use a different model than USE to generate these embeddings: CLIP. CLIP (Contrastive Language-Image Pre-training) and USE are tailored for different purposes, with CLIP being more adept at handling image-text associations and abstract visual concepts.

Here are some of the advantages highlighted:

The CLIP model is a zero-shot model, meaning it can infer successfully from an unseen dataset. A zero-shot model is an opportunity for Yelp to better identify and tag unseen photo categories to improve photo search. Our classifier won’t need a thousand examples for each new tag or label added.

The next section compares the CLIP model with their existing ResNet50-based classifiers for the classification of Food, Drinks, Menu, Interior, Exterior, and Home Services Contractor photos. Without any fine-tuning, the pre-trained CLIP model outperforms ResNet50, as assessed by precision/recall measures.

The main advantages of CLIP include its ability:

  • To recognize categories it has never seen during training (zero-shot learning), while ResNet50 requires examples for each category.
  • To capture semantic relationships and concepts within images and text, unlike ResNet50, which lacks an inherent understanding of textual information associated with the images.
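To see what zero-shot classification with CLIP looks like in practice, here is a minimal sketch using the openly available model on Hugging Face (the photo path and the label set are mine; this is not Yelp’s actual classifier):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_business_photo.jpg")  # placeholder path
labels = ["food", "drinks", "menu", "interior", "exterior"]

# Score the image against each candidate label, with no task-specific training.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))

Adding a new photo category is then just a matter of adding a string to the label list, which is exactly the advantage the Yelp team highlights.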

This is a really interesting finding: just as with USE, an off-the-shelf pre-trained model (CLIP) outperforms the classifiers trained on Yelp’s own data (note: ResNet50 is not a transformer architecture).

It would have been interesting to know how a fine-tuned CLIP using Yelp data would have fared. Additionally, I would have liked more information about the testing methodology.

Finally, they look at some of CLIP’s vulnerabilities.

Before using CLIP and publishing its results, it’s better to know its vulnerabilities and how to optimize the model performance.

The key vulnerability highlighted in the text is related to what the authors refer to as ‘typographic attacks’ — the model’s susceptibility to being misled by the presence of specific words or text in images.

https://gizmodo.com/if-skynet-takes-over-try-writing-robot-on-your-shirt-1846433086

In summary, the vulnerability stems from CLIP’s sensitivity to textual elements in images, posing a risk of misclassification when certain words are strategically placed. That’s something to know when deploying CLIP in real-world scenarios, particularly in contexts where typographic attacks could be a concern.


I understand that USE could not handle the joint processing of images and text, leading them to explore CLIP.

However, the embeddings generated by the two models live in different spaces, which prevents cross-content operations that would mix these two types of embeddings.

I wonder if such a requirement could arise.

Additionally, I read that Hugging Face provides an encoder-decoder framework that can be used to build your own transformer architecture by combining several pre-trained base models, if I understand correctly.

https://huggingface.co/blog/warm-starting-encoder-decoder

Could they use this if needed? This is clearly beyond my expertise.


In conclusion, that was a very interesting article!

It showcased:

  • the usage of embeddings to power different use cases,

  • the usage of embeddings as an interface between products and teams,

  • the efficacy of off-the-shelf models compared to fine-tuned models (at least for Yelp),

  • and the advantage of having these base models available for free and, even better, open source (with a known training set and/or published parameters/weights).

Still, some open questions persist, such as:

  • validation of the quality of the input data (e.g., reviews),
  • details about the testing methodologies,
  • the storage infrastructure: where did they decide to store and serve these embeddings? It could be a basic data lake (blob storage) or a more advanced vector database; however, the required search capabilities are not really advanced at this point,
  • the approach to managing metadata, which relates to the previous point,
  • the sentiment analysis use case mentioned at the beginning, which I would have loved to learn more about,
  • and the usage of embeddings for graph representations, since embeddings effectively capture relationships between concepts.

I hope this post gives you some insights into what you could do with embeddings, how you could generate them, and which parts of your products could benefit from them.

These off-the-shelf models can help you easily incorporate this technology to make it work for you and deliver more end-user value.

The sky is the limit, and you don’t actually need to know all the intricacies.



Found this article useful? Follow me on LinkedIn and HackerNoon! Please 👏 this article to share it!


Written by bmarquie | Architect @Citrix. Passionate about machine learning, search, distributed systems
Published by HackerNoon on 2023/12/06