How LinkedIn Uses NLP to Design their Help Search System by@harshit158



Harshit Sharma

ML Engineer @ Juniper Networks | https://medium.com/@harshit158


This is a summary of, and my key takeaways from, the original post by LinkedIn on how NLP is being used (as of 2019) in designing its Help Search System. It covers the problem statement and the successive solutions that were adopted, along with their shortcomings.

Problem Statement:

Given a query by a user, fetch the most relevant Help Article from the database.

(Image by Author) Problem Statement


Iteration 1: Initial Solution

  1. Indexed all the help articles (documents) in the database using a Lucene index. In short, this generates an inverted index that maps each term to all the documents it appears in.
    Source: Original Blog


2. The given query is used to fetch all the relevant documents (hits) using Lucene indexing.

3. Each hit is scored using the BM25F algorithm, which takes the document structure into account, giving the highest weight to matches in the Title, then to matches in the Keywords, and finally to matches in the Body, and returns a weighted score.

4. Return the best-scored articles.
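The four steps above can be sketched in a few lines of Python. This is a toy illustration, not LinkedIn's implementation: the corpus, the field weights, and the simplified BM25F scoring (per-field length normalization omitted) are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Assumed field weights: Title > Keywords > Body, mirroring the
# weighting described in step 3.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

# A tiny stand-in corpus of "help articles".
docs = {
    "doc1": {"title": "cancel premium account",
             "keywords": "premium subscription billing",
             "body": "steps to cancel your premium subscription"},
    "doc2": {"title": "update profile photo",
             "keywords": "profile picture",
             "body": "how to change your profile photo"},
}

# Step 1: build an inverted index mapping each term to the documents
# it appears in.
index = defaultdict(set)
for doc_id, fields in docs.items():
    for text in fields.values():
        for term in text.split():
            index[term].add(doc_id)

def score(query: str, doc_id: str, k1: float = 1.5) -> float:
    # Simplified BM25F: field-weighted term frequency combined with IDF.
    n_docs = len(docs)
    total = 0.0
    for term in query.split():
        tf = sum(w * docs[doc_id][f].split().count(term)
                 for f, w in FIELD_WEIGHTS.items())
        df = len(index.get(term, ()))
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        total += idf * tf * (k1 + 1) / (tf + k1)
    return total

# Steps 2-4: fetch the hits from the index, score each, return the best.
query = "cancel premium"
hits = set().union(*(index.get(t, set()) for t in query.split()))
best = max(hits, key=lambda d: score(query, d))
print(best)  # the highest-scoring article for the query
```

Because the retrieval step is purely term-based, this sketch also exhibits the failure mode discussed next: a query sharing no terms with any article returns no hits at all.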

Why it failed

Since the document retrieval system is term-based (syntactic) and does not take semantics into account, here are two example failure cases:

(Image by Author) Examples of use cases that failed


Iteration 2: Final Solution

Step 1: Text Normalization

“How canceling my premium accounts immediately” is normalized to “cancel premium account”

Source: Original Blog

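A minimal normalizer reproducing the example above might look as follows. This is a sketch only: the stopword list and suffix-stripping rules are stand-ins for the proper stopword removal and lemmatization a production pipeline would use.

```python
import re

# Assumed stopword list, chosen to cover the example query.
STOPWORDS = {"how", "my", "i", "a", "an", "the", "do", "can", "immediately"}

def stem(token: str) -> str:
    # Naive suffix stripping, a stand-in for real lemmatization.
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def normalize(query: str) -> str:
    # Lowercase, keep alphabetic tokens, drop stopwords, stem the rest.
    tokens = re.findall(r"[a-z]+", query.lower())
    kept = [stem(t) for t in tokens if t not in STOPWORDS]
    return " ".join(kept)

print(normalize("How canceling my premium accounts immediately"))
# -> "cancel premium account"
```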

Step 2: Query Mapping

It might happen that the normalized query has no words in common with the words in the articles. Hence, each query is mapped to a more representative query to bridge the gap between the user’s terminology and the articles’ terminology.

This is done in the following two steps:

  1. Query Grouping: Queries are grouped together based on similarity metrics
    (Image by Author) Illustration of Query grouping


2. Topic Mining and Rep Scoring: For each query in the Query Group, a repScore is calculated, and the top K queries are selected as Rep Queries

(Image by Author) Illustration of Topic Mining and Rep scoring


sim(RQ, Q2) is the similarity between the raw query RQ and another query Q2 in the group

sim(Q2, title) is the maximum similarity between Q2 and one of the topics mined from the article’s title (and similarly for the body)
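The Rep Query selection above can be sketched as follows. Note the assumptions: the exact repScore formula is in the original blog; here we simply assume it combines the two similarities multiplicatively, and we use Jaccard similarity on tokens as the similarity metric. The query group and topics are made-up examples.

```python
def jaccard(a: str, b: str) -> float:
    # Token-set Jaccard similarity, a stand-in for the blog's metric.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rep_score(raw_query: str, candidate: str, article_topics: list) -> float:
    # sim(RQ, Q2): similarity between the raw query and a candidate query.
    # max sim(Q2, topic): best match against topics mined from the article.
    # Multiplicative combination is an assumption for this sketch.
    return jaccard(raw_query, candidate) * max(
        jaccard(candidate, t) for t in article_topics)

group = ["cancel premium account", "stop premium subscription",
         "cancel my premium"]
topics = ["cancel premium subscription", "premium billing"]
raw = "cancel premium account"

# Rank the group by repScore and keep the top K as Rep Queries.
ranked = sorted(group, key=lambda q: rep_score(raw, q, topics), reverse=True)
top_k = ranked[:2]
print(top_k)
```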

Step 3: Intent Classification

Long-tail queries might not have a Rep Query; in that case, a CNN is used to classify the intent of the query.

For example: “Canceling Your Premium Subscription” and “Canceling or Updating a Premium Subscription Purchased on Your Apple Device” are considered to have the same intent of “cancel premium.”
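To make the CNN step concrete, here is a minimal text-CNN forward pass in NumPy: embed the query tokens, slide a convolution window over them, max-pool over time, and project to intent logits. Everything here is a made-up illustration, not LinkedIn's model: the vocabulary, intent labels, dimensions, and (untrained, random) weights are all assumptions; a real classifier would be trained on labeled help-search queries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and intent labels (assumed for the example).
VOCAB = {w: i for i, w in enumerate(
    ["cancel", "premium", "subscription", "update", "profile", "photo"])}
INTENTS = ["cancel_premium", "update_profile"]

EMB_DIM, FILTERS, WIN = 8, 4, 2
embeddings = rng.normal(size=(len(VOCAB), EMB_DIM))    # word embeddings
conv_w = rng.normal(size=(FILTERS, WIN * EMB_DIM))     # conv filters
out_w = rng.normal(size=(len(INTENTS), FILTERS))       # output layer

def predict_intent(query: str) -> str:
    # Look up in-vocab tokens (assumes at least WIN of them).
    ids = [VOCAB[w] for w in query.lower().split() if w in VOCAB]
    x = embeddings[ids]                                 # (seq_len, EMB_DIM)
    # 1D convolution: slide a window of WIN tokens over the sequence.
    windows = [x[i:i + WIN].ravel() for i in range(len(ids) - WIN + 1)]
    conv = np.maximum(0, np.stack(windows) @ conv_w.T)  # ReLU activations
    pooled = conv.max(axis=0)                           # max-over-time pooling
    logits = out_w @ pooled
    return INTENTS[int(np.argmax(logits))]

print(predict_intent("canceling your premium subscription"))
```

With trained weights, both example queries from above would map to the same "cancel premium" intent; with the random weights here, the output label is arbitrary.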

Overall Flow
(Image by Author) Overall Flow



