A Simple Guide to Latent Semantic Indexing (analysis) and How it Bolsters Search

Written by algolia | Published 2023/04/20

TLDR: The English language has evolved in interesting ways along with different groups of people’s lexicons. The language nuances various people are using and expecting to hear can be turned into information that software can more easily figure out and use. Latent semantic indexing (LSI) was developed for creating vectors and performing information retrieval.

You say “sweats”, someone else says “joggers”.

He says “fifth wheel”, she says “camper”, your neighbor says “RV”.

Some people call movie stars “wealthy”; others describe them as “rich”.

The English language has evolved in interesting ways along with different groups of people’s lexicons so that we now have a wealth (richness?) of similar words and phrases — synonyms — to use as communication options. Plus, some of our words (for example, “mouse”) mean two entirely different things altogether.

The ambiguity caused by this phenomenon becomes a problem when software has to interpret what people write. Through people’s use of natural language and unique words, massive amounts of information and collections of documents are created and shared every day: research studies, emails, school work, web pages, social media posts, blog posts, call transcripts, news stories, search queries. And in addition to the creation of new documents, there’s plenty of editing of old ones.

Thanks to natural language processing (NLP), the language nuances various people are using and expecting to hear can be turned into information that software can more easily figure out and use or apply “intelligently.”

Website and app search, in particular, benefit from top-notch NLP technology. After all, if someone is searching for “tights” on an ecommerce website where they’re only referred to as “leggings”, and the search engine hasn’t been apprised that these terms are used interchangeably, or differently by different geographically based groups, the shopper using the wrong search term is likely to be out of luck.

What is Latent Semantic Indexing?

Fortunately, we have latent semantic indexing (LSI; also called latent semantic analysis or LSA for short), developed for creating vectors and performing information retrieval. This technical method of doing NLP utilizes a mathematical technique called singular value decomposition (SVD), which looks for relationships between concepts and words in unstructured data.

LSA analyzes relationships between documents and the terms they contain by reducing the number of dimensions: a large term-document matrix is approximated by a much smaller set of latent “concept” dimensions. It’s good at understanding document relatedness by analyzing which terms are often used together (for example, “royalty” and “queen”) as a prelude to building a model of semantic relatedness.
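To make the idea concrete, here is a minimal sketch of LSA on a toy corpus, assuming NumPy is available. The five documents, the variable names, and the choice of k = 2 latent dimensions are all made up for illustration; a real system would typically weight the counts (for example with TF-IDF) and keep far more dimensions.

```python
import numpy as np

# Toy corpus (made up for illustration): three "royalty" docs, two "search" docs.
docs = [
    "queen royalty crown",
    "king royalty crown",
    "queen king palace",
    "search engine index",
    "search query index",
]

# Build a term-document count matrix: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
row_of = {term: i for i, term in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for col, doc in enumerate(docs):
    for word in doc.split():
        counts[row_of[word], col] += 1

# Singular value decomposition, truncated to the k largest singular values.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of latent dimensions kept (an arbitrary choice for this toy data)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Docs 0 and 2 share only the word "queen", yet land close together in latent space.
same_topic = cosine(doc_vectors[0], doc_vectors[2])
# Docs 0 and 3 come from unrelated topics.
cross_topic = cosine(doc_vectors[0], doc_vectors[3])
print(f"royalty vs royalty: {same_topic:.2f}")
print(f"royalty vs search:  {cross_topic:.2f}")
```

In this sketch the two royalty documents score close to 1 even though they overlap on a single word, because both sit on the same latent dimension, while the royalty and search documents score close to 0.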

LSI isn’t a new thing. This statistical model was developed in the 1980s, well before the web came on the scene, to facilitate text matching. It was patented by Bell Communications Research. One of its major uses then was helping search engines understand relationships between concepts and words in order to provide people with the most relevant search results for their search queries. And although the patent expired in 2008, the technique is still going strong in the world of data science.

It’s still being tapped for upgrading the search functionality on sites and apps. How the technology works is a bit complex, but suffice it to say that it creates vectors and performs information retrieval even when people enter “alternative” words to those housed in the site’s or app’s index. The benefit is its ability to help people find whatever piece of content they want faster, leading to both happier searchers and better metrics and revenues for organizations and businesses.

What are LSI Keywords?

You may have heard the term “LSI keywords” used in relation to Google search and search engine optimization (SEO) strategy. If so, you’re in good company, but we must temporarily digress.

According to Google senior search analyst John Mueller, “LSI keywords” aren’t a viable thing (nor are things like “LSI keyword suggestions”, the act of “finding LSI keywords”, or plans to “use LSI keyword generators” or set up a Google keyword planner), at least in the context of Google-related keywords and SEO tools.

Search Engine Journal’s Roger Montti notes that the myth of Google using LSI keywords may stem from use of such phrases as “‘Semantic Analysis,’ ‘Semantic Indexing’ and ‘Semantic Search’ having become SEO buzzwords, given life by Ask Jeeves’ semantic search technology and Google’s purchase of semantic analysis company Applied Semantics.” SEO guru Bill Slawski goes into depth in a blog post.

Good to know. But this Google situation aside, LSI is still a relevant concept in the world of search. Semantically related keywords are still something to think seriously about if you want an effective website or content marketing that works. Semantic search is a viable undertaking that has ramifications in terms of adding search functionality to a company’s site or app.

Synonyms: The Same Idea with Different Versions

To return accurate search results, search engine retrieval methods must be able to effectively “understand” and deal with synonyms, those pesky words and phrases that, through no fault of their own, have the same meaning as other words and phrases.

Synonyms certainly aren’t a rarity. Even in just a single industry, like retail, they abound because different groups of people (those with different dialects, those thinking about items in different contexts, or natives of different countries) refer to the same item or idea in diverse ways.

It happens more than you might think. A searcher on a clothing seller’s ecommerce store site might naturally think to enter a search query for a:

  • shirt, blouse, or top
  • handbag, bag, or purse
  • pants or slacks
  • jacket, parka, or coat
  • hat or cap

Some of these terms are victims of overuse; others are considered odd by conventional standards because only certain age groups use them, for instance. Regardless of the wide synonymy abyss, a search engine must intimately know them all.
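For contrast with the statistical approach described later, here is a sketch of the manual alternative: a hand-maintained synonym list used to expand a query before matching. The groups are copied from the list above; the names `SYNONYM_GROUPS` and `expand_query` are hypothetical and not part of any particular search product.

```python
# Hand-maintained synonym groups, taken from the clothing examples above.
SYNONYM_GROUPS = [
    {"shirt", "blouse", "top"},
    {"handbag", "bag", "purse"},
    {"pants", "slacks"},
    {"jacket", "parka", "coat"},
    {"hat", "cap"},
]


def expand_query(query):
    """Return, for each query word, the sorted list of terms to match on."""
    expanded = []
    for word in query.lower().split():
        # Use the first group containing the word, or the word alone if none does.
        group = next((g for g in SYNONYM_GROUPS if word in g), {word})
        expanded.append(sorted(group))
    return expanded


print(expand_query("red purse"))
# [['red'], ['bag', 'handbag', 'purse']]
```

A list like this works only for the synonyms someone remembered to add, which is the gap that learning word relationships from the documents themselves is meant to narrow.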

Polysemy: Different Things with the Same Name

Table, mouse, sharp: these are all polysemes, words with more than one meaning. And as is the case with synonyms, getting a search engine to confidently understand which meaning you’re referring to is key.

How Latent Semantic Indexing Refines Search

In terms of digital marketing, if a shopper can’t pull up a great results page for what they need, that’s a pressing problem. If you can’t suggest relevant related searches when a shopper gets no helpful results for their main keyword, that’s a big red flag. For online business success, every synonym must be accounted for; every related term known, every possible semantically related word or phrase duly anticipated.

Does this linguistically challenging situation stump the average modern search engine?

Not with LSI in the semantic search picture. By statistically analyzing which words appear together across text documents, it can recognize when different words are being used for the same thing and respond with the most accurate search engine results page (SERP). And when it comes to polysemy, the surrounding words help it tell which meaning of a word the searcher wants.
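As a rough illustration of the synonym case, the sketch below projects a one-word query into an LSI space built from a made-up product catalog, assuming NumPy is available. The product descriptions, the `search` helper, and k = 2 are hypothetical choices for the example, not how any particular search engine implements this.

```python
import numpy as np

# Made-up product descriptions: three legwear items, two outerwear items.
products = [
    "leggings stretch yoga",
    "tights stretch yoga",
    "leggings cotton stretch",
    "coat winter wool",
    "parka winter wool",
]

# Term-document count matrix: rows are terms, columns are products.
vocab = sorted({word for p in products for word in p.split()})
row_of = {term: i for i, term in enumerate(vocab)}
counts = np.zeros((len(vocab), len(products)))
for col, text in enumerate(products):
    for word in text.split():
        counts[row_of[word], col] += 1

# Keep the k strongest latent dimensions and project every product into them.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
U_k = U[:, :k]
product_vectors = (U_k.T @ counts).T


def search(query):
    """Score every product against the query by cosine similarity in latent space."""
    q = np.zeros(len(vocab))
    for word in query.lower().split():
        if word in row_of:  # words outside the vocabulary are ignored
            q[row_of[word]] += 1
    q_k = U_k.T @ q  # project the query into the same latent space
    scored = []
    for name, vec in zip(products, product_vectors):
        denom = np.linalg.norm(q_k) * np.linalg.norm(vec)
        score = float(q_k @ vec / denom) if denom > 0 else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in search("tights"):
    print(f"{score:5.2f}  {name}")
```

The product “leggings stretch yoga” never mentions “tights”, but because both words keep company with “stretch” and “yoga”, the query scores highly against it and near zero against the coats.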

The Secret: Vectors

At Algolia, this type of semantic search technology falls in the realm of vector search. Using machine learning models that detect semantic relationships between objects in an index, it finds related objects that have similar characteristics. (Here’s some background on how cosine similarity determines closeness in word meaning). Vector embeddings (also known as “word embeddings” or just “vectors”) are applied, along with spelling correction, language processing, and category matching.
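Cosine similarity itself is a short calculation: the dot product of two vectors divided by the product of their lengths. The sketch below uses hand-written three-dimensional vectors purely as stand-ins; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
embeddings = {
    "gardening": [0.9, 0.1, 0.2],
    "yard work": [0.8, 0.2, 0.3],
    "sneakers": [0.1, 0.9, 0.1],
}

related = cosine_similarity(embeddings["gardening"], embeddings["yard work"])
unrelated = cosine_similarity(embeddings["gardening"], embeddings["sneakers"])
print(f"gardening vs yard work: {related:.2f}")
print(f"gardening vs sneakers:  {unrelated:.2f}")
```

Vectors pointing in nearly the same direction score close to 1, and vectors pointing in different directions score closer to 0, regardless of how long each vector is.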

Artificial intelligence (AI) models powered by vector engines can instantly retrieve accurate information. They can figure out that words and phrases like “gardening” and “yard work” are related. They know that someone searching for “Puma” is looking for athletic shoes as their relevant content, not a large wildcat. They can quickly break through the limits of exact keyword matching and immediately deliver optimal, relevant results.

Our search solution combines this AI-based vector-space functionality with traditional keyword search to offer true hybrid search, covering all the bases. By applying full-text keyword search and vector search to each query, searchers get super accurate results, and fast.
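One simple way to picture hybrid search is a weighted blend of a keyword score and a vector score for each item. The sketch below is only that, a generic illustration: it is not Algolia’s ranking algorithm, and the `hybrid_rank` function, the `alpha` weight, the catalog, and the two-dimensional vectors are all hypothetical.

```python
import math


def keyword_score(query, text):
    """Fraction of query words that appear verbatim in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def hybrid_rank(query, query_vector, items, alpha=0.5):
    """Rank items by alpha * keyword score + (1 - alpha) * vector score."""
    scored = []
    for item in items:
        kw = keyword_score(query, item["title"])
        vec = cosine(query_vector, item["vector"])
        scored.append((alpha * kw + (1 - alpha) * vec, item["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Invented catalog; the vectors stand in for model-generated embeddings.
catalog = [
    {"title": "garden hose", "vector": [0.9, 0.1]},
    {"title": "yard work gloves", "vector": [0.85, 0.2]},
    {"title": "running shoes", "vector": [0.1, 0.95]},
]

ranking = hybrid_rank("gardening gloves", [0.9, 0.15], catalog)
print(ranking)
```

Here “yard work gloves” ranks first because it earns credit from both signals, “garden hose” follows on vector similarity alone, and “running shoes” trails on both.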

If you’ve been wanting to ensure that your users or shoppers get the absolute right search results, feel understood in their information quests, and have an excellent user experience regardless of the specific keywords they initially enter in the search bar, you’re in the right place.

  • We can help you add natural language text search to your site search functionality, create image-search capabilities, and build a powerful recommendation system.
  • We can meet the needs of any size dataset without additional overhead.
  • Our tie-breaking algorithm weighs and compares matching items, ensuring that the best matches appear first.

There’s an added bonus, too: upgrading your search functionality could lead to surprising improvements in your conversions; just ask our clients, who know this firsthand.

Sign up for a free demo or let us know when you’re ready to learn more.



Written by algolia | Algolia empowers Builders with the Search and Recommendation services they need to build world-class experiences.
Published by HackerNoon on 2023/04/20