Bite-Sized Tips To Make Chinese Full-Text Search

Written by gregdevogo | Published 2020/09/27
Tech Story Tags: search-engine | search-index | tokenization | semantic-segmentation | machine-learning | search | china | algorithms

TLDR Chinese belongs to the so-called CJK group of languages (Chinese, Japanese, and Korean). These are probably the most complicated languages to implement full-text search for: word meanings depend heavily on the many character variations and their sequences, and the characters are not split up into words. To find an exact match in a full-text search, we have to face the challenge of tokenization, whose main task is to break the text down into low-level units that the user can search for. The easiest approach to Chinese text segmentation is to use N-grams.
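To make the N-gram idea concrete, here is a minimal sketch in Python. It is not the author's implementation. The names `cjk_ngrams`, `build_index`, and `search` are hypothetical, and the regular expression covers only the basic CJK Unified Ideographs block, an assumption made to keep the example short. Each run of Chinese characters is cut into overlapping bigrams, the bigrams are stored in an inverted index, and a query is tokenized the same way.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF) is handled.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_ngrams(text, n=2):
    """Return overlapping character n-grams for every run of Chinese characters."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) < n:
            tokens.append(run)  # run shorter than n: keep it as a single token
            continue
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def build_index(docs, n=2):
    """Build a toy inverted index: n-gram -> set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in cjk_ngrams(text, n):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents that contain every n-gram of the query."""
    grams = cjk_ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))


docs = {1: "我爱北京天安门", 2: "北京大学"}
index = build_index(docs)
print(cjk_ngrams("我爱北京天安门"))
print(search(index, "北京"), search(index, "天安门"))
```

Because this toy index stores no token positions, a multi-bigram query returns candidate documents that contain all of the query's bigrams, not a guaranteed phrase match. A real engine would normally record positions as well so that adjacency can be verified.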


Written by gregdevogo | Experienced BackEnd dev, trying to balance between madness, creativity and procrastination
Published by HackerNoon on 2020/09/27