Weihe Wang

@weihewang2012

Design Auto-complete System in Python

Auto-complete is a key feature of many web services. When you type a phrase into Google, it presents a list of search suggestions. Sometimes these results use your input as a prefix, and sometimes they do not. How does Google achieve this so quickly and accurately? And how can we design a simplified working auto-complete system in Python?

Since we are designing the backend of a web service, we need to consider how data flows between the server and the database, as well as recovery mechanisms if a server goes down. Below is a list of features from a distributed-systems infrastructure perspective.

  • New and recovered application servers can be built from database.
  • Options to replicate and partition databases.
  • Application servers should be capable of updating the database with the latest usage data.
  • Application servers are capable of building a database from scratch.

From server perspective, we need to consider how to optimize performance.

  • How do we update the top results? If we were to handle thousands of requests per second, the delay must be minimized.
  • How frequently do we perform updates at all? Do we assume eventual consistency?
  • How do we delete phrases from the server if necessary?

Data Structures and Algorithms

To handle high-volume data, the server should be able to quickly search, insert, and delete phrases. We should also optimize the update operation.

Consider the basic case in which all suggestions share the same prefix as the user input. Then the most time-efficient data structure for storing the data is the prefix tree, also known as a Trie. We are not going to elaborate on how a Trie works, since there are plenty of articles out there for that purpose. Basically, given a list of phrases whose longest length is M, it takes O(M) time to search for any phrase in the Trie. The search operation is inherently fast thanks to the Trie.

However, we still need a careful design of the architecture to support the other operations. The Trie node design is as follows. A TrieNode is a node with a prefix string and pointers to its child and parent nodes. It stores top suggestions in a Python Counter, so we can access the most frequently used results efficiently with the built-in most_common() method. Also notice that it has a flag indicating whether the prefix in the node is a complete word; this flag is essential to the logic in various methods.

from collections import Counter

class TrieNode:
    def __init__(self, prefix=None, parent=None, is_word=False):
        """
        :param prefix: prefix of this node
        :param parent: parent node in the trie
        :param is_word: True if the prefix is a complete word
        """
        self.prefix = prefix
        self.children = dict()
        self.parent = parent
        self.count = 0
        self.top_results = Counter()
        if is_word:
            self.top_results[self.prefix] = 1
        self.isWord = is_word
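To make the structure concrete, here is a sketch of how a phrase might be inserted into a trie of such nodes. The insert helper and its exact bookkeeping are my assumptions for illustration, not the author's actual implementation, and the TrieNode below is simplified accordingly:

```python
from collections import Counter

class TrieNode:
    """Simplified version of the article's TrieNode (illustration only)."""
    def __init__(self, prefix="", parent=None, is_word=False):
        self.prefix = prefix
        self.children = dict()
        self.parent = parent
        self.count = 0
        self.top_results = Counter()
        self.isWord = is_word

def insert(root, word):
    """Walk the trie character by character, creating missing nodes."""
    node = root
    for i, ch in enumerate(word):
        if ch not in node.children:
            node.children[ch] = TrieNode(prefix=word[:i + 1], parent=node)
        node = node.children[ch]
    node.isWord = True
    node.count += 1
    node.top_results[word] += 1
    return node

root = TrieNode()
leaf = insert(root, "time")
print(leaf.prefix)  # -> time
print(leaf.isWord)  # -> True
```

Each insertion costs O(M) for a word of length M, matching the search bound discussed above.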

Since the main structure of the server is based on a Trie, the fundamental algorithms involved are graph algorithms. Basic graph-traversal algorithms such as depth-first search (DFS) and breadth-first search (BFS) are used extensively in the code whenever we need to traverse the whole graph. Of course, details vary from function to function, such as the DFS function signature.

As a simple example of BFS, the __delete_helper method for deleting phrases finds all the phrases in a subtree.

def __delete_helper(self, node):
    """
    Breadth-first search to find all children nodes that are words
    :param node: TrieNode, subtree root
    :return: set(str)
    """
    q = deque([node])
    res = set()
    while q:
        cur = q.popleft()
        if cur.isWord:
            res.add(cur.prefix)
        for _, child in cur.children.items():
            q.append(child)
    return res
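A self-contained usage sketch of the same BFS, with a stripped-down stand-in for TrieNode (the class and function names here are mine, for illustration only):

```python
from collections import deque

class Node:
    """Stripped-down stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix, is_word=False):
        self.prefix = prefix
        self.children = {}
        self.isWord = is_word

def words_in_subtree(node):
    # Same BFS as __delete_helper: visit every descendant, keep the words.
    q = deque([node])
    res = set()
    while q:
        cur = q.popleft()
        if cur.isWord:
            res.add(cur.prefix)
        for child in cur.children.values():
            q.append(child)
    return res

# Build the subtree under "ti" containing the words "tie" and "time".
ti = Node("ti")
tie = Node("tie", is_word=True)
tim = Node("tim")
time_ = Node("time", is_word=True)
ti.children = {"e": tie, "m": tim}
tim.children = {"e": time_}

print(sorted(words_in_subtree(ti)))  # -> ['tie', 'time']
```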

The __search_helper method, which searches for replacements of misspelled words, uses DFS. It traverses a word_list object, which is a nested list of strings, and returns all combinations of words.

@staticmethod
def __search_helper(word_list, idx, path, res):
    if idx == len(word_list):
        res.append(list(path))
        return
    for word in word_list[idx]:
        path.append(word)
        Server.__search_helper(word_list, idx + 1, path, res)
        path.pop()
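A quick standalone run of the same DFS shows how the combinations come out. The function is renamed without name mangling and given convenience defaults; both changes are mine:

```python
def search_helper(word_list, idx=0, path=None, res=None):
    """DFS over a nested word list, collecting one word per position."""
    if path is None:
        path, res = [], []
    if idx == len(word_list):
        res.append(list(path))
        return res
    for word in word_list[idx]:
        path.append(word)       # choose a word for this position
        search_helper(word_list, idx + 1, path, res)
        path.pop()              # backtrack and try the next word
    return res

combos = search_helper([["time", "tide"], ["flies"]])
print(combos)  # -> [['time', 'flies'], ['tide', 'flies']]
```

With two candidate replacements per position, a query of k words yields at most 2^k combinations, which is another reason to cap the number of replacements (discussed below).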

Database

We want to build an auto-complete server that is connected to a database. Our choice of database is Neo4j, an excellent option for expressing complex relationships as graphs. We use the py2neo Python package, which provides all the necessary APIs to communicate with the database. Here is a visualization of how the data looks in the Neo4j browser after inserting the 4 words {trying, tied, time, timing}.

Components Design

For the update operation, instead of searching all nodes in a subtree, we traverse from the leaves all the way up to the root to update the top search results. This optimization reduces the time complexity from exponential to polynomial.
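A minimal sketch of this bottom-up pass, assuming parent pointers as in TrieNode. The helper name, the top-k pruning, and the simplified Node class are my assumptions, not the author's code:

```python
from collections import Counter

class Node:
    """Simplified stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix="", parent=None):
        self.prefix = prefix
        self.parent = parent
        self.children = {}
        self.count = 0
        self.top_results = Counter()

def update_from_leaf(leaf, top_k=10):
    """Propagate a word's count from its leaf up to the root."""
    word, count = leaf.prefix, leaf.count
    node = leaf
    while node is not None:
        node.top_results[word] = count
        # keep only the k most common suggestions at each node
        node.top_results = Counter(dict(node.top_results.most_common(top_k)))
        node = node.parent

# Build the path for "time" and push its count up to the root.
root = Node()
node = root
for i, ch in enumerate("time"):
    child = Node(prefix="time"[:i + 1], parent=node)
    node.children[ch] = child
    node = child
node.count = 3  # the word "time" has been used 3 times
update_from_leaf(node)
print(root.top_results["time"])  # -> 3
```

Each leaf touches only the O(M) nodes on its path to the root, instead of every node in its subtree.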

However, when the server stores millions of phrases, updating the top search results takes a long time. We should find a reasonable update frequency to balance the consistency-latency trade-off. The server update frequency can be configured via a class attribute.

There were numerous challenges I encountered while designing the server class. I would like to share the thinking involved in solving some of these problems.

The first design challenge is how to update the database. A subset of terms might already be stored in the database, while others are new. When traversing the graph database, we have to determine whether each node already exists. If a node exists, the new counts are added to it; if it does not, new nodes are created at the right locations. But how do we update the count of each phrase in the database? The idea is that each node maintains the count of its own phrase; when updating the database, we always use this value for consistency.

When you use Google search, notice that even if you type in some gibberish, the system auto-corrects your input and returns reasonable search results. We want to achieve something similar. Extending the classic auto-corrector from Peter Norvig, we created a Spell class that returns a number of auto-correction results in case of misspelling. The idea is that if an input word is not in an English vocabulary (~40K words), its replacements are searched and inserted into the server. However, this design creates another problem: if we allow too many replacements, the sorting and ranking will increase latency significantly. Therefore, we limit the replacements for each misspelled word to a small number, such as 2 in the current build.
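For illustration, here is a minimal Norvig-style candidate generator with the replacement cap. The function names and the tiny vocabulary are placeholders, not the actual Spell class:

```python
import string

def edits1(word):
    """All strings one edit away from word (Norvig's classic construction)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def corrections(word, vocab, limit=2):
    """Return at most `limit` in-vocabulary replacements for a misspelling."""
    if word in vocab:
        return [word]
    candidates = sorted(w for w in edits1(word) if w in vocab)
    return candidates[:limit]

vocab = {"time", "tide", "tame"}  # placeholder for the ~40K-word vocabulary
print(corrections("timr", vocab))  # -> ['time']
```

The `limit` parameter plays the role of the cap of 2 mentioned above: it bounds how many replacement words get fed into the combination search.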

Serialization is essential for converting objects to a sequence of bytes in order to store them on disk or transfer them over the network. Serializing a complex object such as the Trie server is no easy task. We have to consider which data are the most essential to compress and how to rebuild an application server from the serialized representation. To reconstruct each TrieNode as accurately as possible, we serialize its prefix, number of children, top_results, and is_word flag. The serialize and deserialize operations come in pairs, and we decided to serialize nodes in depth-first-search order. With all these design decisions, we were able to serialize an app server and deserialize it to create a new one. Below is an example of server serialization with the single word “time”. The list is in DFS order, and each item encodes the data described above.

[['', '0', 'time 1', '1'],
['t', '0', 'time 1', '1'],
['ti', '0', 'time 1', '1'],
['tim', '0', 'time 1', '1'],
['time', '1', 'time 1', '0']]
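The DFS serialization above can be reproduced with a small sketch. The Node class and helper below are simplified stand-ins I wrote for illustration, assuming the per-node record is [prefix, is_word, top_results, number_children]:

```python
from collections import Counter

class Node:
    """Minimal stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix="", is_word=False):
        self.prefix = prefix
        self.children = {}
        self.isWord = is_word
        self.top_results = Counter()

def serialize(node, out):
    """Pre-order DFS: emit [prefix, is_word, top_results, num_children] per node."""
    top = " ".join(f"{w} {c}" for w, c in node.top_results.most_common())
    out.append([node.prefix,
                "1" if node.isWord else "0",
                top,
                str(len(node.children))])
    for child in node.children.values():
        serialize(child, out)
    return out

# Rebuild the single-word example: insert "time" and record it in the
# top_results of every node along the path.
root = Node()
root.top_results["time"] = 1
node = root
for i, ch in enumerate("time"):
    child = Node(prefix="time"[:i + 1], is_word=(i == 3))
    child.top_results["time"] = 1
    node.children[ch] = child
    node = child

rows = serialize(root, [])
print(rows[0])   # -> ['', '0', 'time 1', '1']
print(rows[-1])  # -> ['time', '1', 'time 1', '0']
```

Because each record carries its child count, a deserializer can consume the list in the same pre-order sequence and know exactly how many children to attach to each node.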

Future Work

Currently, users access the server by running app.py, which is based on the Tkinter package in the Python standard library. Future plans include providing a REST API using Flask to access the service. We can also add functionality to redirect user selections.

Open source

Please feel free to check out the GitHub repository for implementation details; we welcome any suggestions for improvement.
