This is a summary of my key takeaways from the
1. Given a query by a user, fetch the most relevant Help Article from the database.
2. The query is used to fetch all relevant documents (hits) from the Lucene index.
3. Each hit is scored using the BM25F algorithm, which takes the document structure into account: matches in the Title carry the highest weight, then matches in the Keywords, then the Body, yielding a weighted score per document.
4. Return the best-scored articles.
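The field-weighted scoring above can be sketched roughly as follows. This is a simplified BM25F, not the production scorer: the field weights, k1, and the toy document/frequency values are illustrative assumptions.

```python
import math

# Illustrative field weights: Title > Keywords > Body (assumed values).
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}
K1 = 1.2  # standard BM25 term-frequency saturation default


def bm25f_score(query_terms, doc, avg_len, num_docs, doc_freq):
    """Simplified BM25F: combine per-field term frequencies using the
    field weights, then apply BM25-style saturation and IDF."""
    score = 0.0
    doc_len = sum(len(doc[f]) for f in FIELD_WEIGHTS)
    for term in query_terms:
        # Weighted term frequency across the structured fields.
        wtf = sum(w * doc[f].count(term) for f, w in FIELD_WEIGHTS.items())
        if wtf == 0:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        score += idf * wtf / (wtf + K1 * doc_len / avg_len)
    return score


doc = {
    "title": "cancel premium subscription".split(),
    "keywords": "cancel premium".split(),
    "body": "you can cancel your premium subscription at any time".split(),
}
df = {"cancel": 10, "premium": 12}  # made-up document frequencies
print(bm25f_score(["cancel", "premium"], doc, avg_len=15, num_docs=100, doc_freq=df))
```

Because the Title weight is largest, a term matching the Title lifts the score more than the same term matching only the Body.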
Since the document retrieval system is term-based (syntactic) and does not take semantics into account, the following are two example failure cases:
“how canceling my premium accounts immediately” is normalized to “cancel premium account”
It might happen that the normalized query has no words in common with the words in the articles. Hence, each query is mapped to a more representative query (Rep Query) to bridge the gap between the user's terminology and the article's terminology.
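The vocabulary-gap failure is easy to see with a toy term-overlap check (Jaccard similarity over token sets; the article text here is made up for illustration):

```python
def token_overlap(a, b):
    """Jaccard similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


query = "cancel premium account"
article = "how to terminate your paid subscription"
print(token_overlap(query, article))  # no shared terms -> 0.0
```

A purely term-based scorer gives this article a zero score even though it answers the query, which is exactly why the query-to-Rep-Query mapping below is needed.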
This is done in the following two steps:
1. Query Grouping: similar raw queries are clustered together into Query Groups.
2. Topic Mining and Rep Scoring: for each query in the Query Group, a repScore is calculated and the top K queries are selected as Rep Queries, where:
sim(RQ, Q2) is the similarity between the raw query RQ and another query Q2 in its group
sim(Q2, title) is the maximum similarity between Q2 and one of the topics mined from the title (and similarly for the body)
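A rough sketch of Rep Query selection, assuming stand-ins the text does not specify: token-set Jaccard as the similarity measure and an unweighted sum of the group, title, and body similarity terms.

```python
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def rep_score(q, group, title_topics, body_topics):
    """Assumed additive repScore: average similarity to the other queries
    in the group, plus the best match against title and body topics."""
    group_sim = sum(jaccard(q, q2) for q2 in group if q2 != q) / max(len(group) - 1, 1)
    title_sim = max(jaccard(q, t) for t in title_topics)
    body_sim = max(jaccard(q, t) for t in body_topics)
    return group_sim + title_sim + body_sim


group = ["cancel premium account", "stop premium payment", "cancel my premium"]
title_topics = ["cancel premium subscription"]
body_topics = ["cancel premium account"]

K = 1  # keep the top-K queries as Rep Queries
rep_queries = sorted(
    group,
    key=lambda q: rep_score(q, group, title_topics, body_topics),
    reverse=True,
)[:K]
print(rep_queries)  # -> ['cancel premium account']
```

The query closest both to its group and to the mined topics wins, matching the intuition that a Rep Query should speak the article's language.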
Long-tail queries might not have a Rep Query; in that case, a CNN is used to classify the intent of the query.
For example: “Canceling Your Premium Subscription” and “Canceling or Updating a Premium Subscription Purchased on Your Apple Device” are considered to have the same intent of “cancel premium.”
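The CNN itself is not reproduced here, but the intent label space can be illustrated with a toy rule-based mapper; the stop-word list and canonicalization table are made-up stand-ins for what the learned model captures.

```python
# Toy stand-in for the intent classifier: reduce a title to its
# action + object keywords and use those as the intent label.
STOPWORDS = {"your", "a", "an", "or", "on", "the", "purchased", "device",
             "apple", "updating", "subscription"}
CANONICAL = {"canceling": "cancel", "cancelling": "cancel"}


def intent(title):
    """Map a title to a coarse intent label like 'cancel premium'."""
    words = [CANONICAL.get(w, w) for w in title.lower().split()]
    return " ".join(w for w in words if w not in STOPWORDS)


a = intent("Canceling Your Premium Subscription")
b = intent("Canceling or Updating a Premium Subscription Purchased on Your Apple Device")
print(a, b, a == b)  # both reduce to "cancel premium"
```

Both titles collapse to the same label, mirroring how the classifier assigns them a shared “cancel premium” intent.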