Weihe Wang

@weihewang2012

Design Auto-complete System in Python

Auto-complete is a key feature of many web services. When you type a phrase into Google, it presents a list of search suggestions. Sometimes these results use your input as a prefix, and sometimes they do not. How does Google achieve this so quickly and accurately? And how can we design a simplified working auto-complete system in Python?

Since we are designing the backend of a web service, we need to consider how data flows between the server and the database, as well as recovery mechanisms if a server goes down. Below is a list of features from a distributed-systems infrastructure perspective.

  • New and recovered application servers can be built from database.
  • Options to replicate and partition databases.
  • Application servers should be capable of updating the database with the latest usage data.
  • Application servers are capable of building a database from scratch.

From server perspective, we need to consider how to optimize performance.

  • How do we update the top results? If we were to handle thousands of requests per second, the delay must be minimized.
  • How frequently do we perform updates at all? Do we assume eventual consistency?
  • How do we delete phrases from the server if necessary?

Data Structures and Algorithms

To handle high-volume data, the server should be able to quickly search, insert, and delete phrases. We should also optimize the update operation.

Consider the basic case in which all suggestions share the same prefix as the user input. Then the most time-efficient data structure for storing the data is the prefix tree, also known as a Trie. We are not going to elaborate on how a Trie works, since there are plenty of articles out there for that purpose. Basically, given a list of phrases whose longest length is M, it takes O(M) time to search for any phrase in the Trie. The search operation is inherently fast thanks to the Trie.

However, we still need a careful design of the architecture to support the other operations. The Trie node design is as follows. A TrieNode is a node with a prefix string and pointers to its child and parent nodes. It stores top suggestions in a Python Counter, so we can access the most frequently used results efficiently with the built-in most_common() method. Also notice that it has a flag indicating whether the prefix in the node is a complete word; this flag is essential to the logic in various methods.

from collections import Counter

class TrieNode:
    def __init__(self, prefix=None, parent=None, is_word=False):
        """
        :param prefix: prefix of this node
        :param parent: parent node in the trie
        :param is_word: True if the prefix is a complete word
        """
        self.prefix = prefix
        self.children = dict()
        self.parent = parent
        self.count = 0
        self.top_results = Counter()
        if is_word:
            self.top_results[self.prefix] = 1
        self.isWord = is_word
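To make the structure concrete, here is a sketch of how a phrase might be inserted into a trie of such nodes. The insert helper and its exact bookkeeping are my assumptions for illustration, not the author's actual implementation, and the TrieNode below is simplified accordingly:

```python
from collections import Counter

class TrieNode:
    """Simplified version of the article's TrieNode (illustration only)."""
    def __init__(self, prefix="", parent=None, is_word=False):
        self.prefix = prefix
        self.children = dict()
        self.parent = parent
        self.count = 0
        self.top_results = Counter()
        self.isWord = is_word

def insert(root, word):
    """Walk the trie character by character, creating missing nodes."""
    node = root
    for i, ch in enumerate(word):
        if ch not in node.children:
            node.children[ch] = TrieNode(prefix=word[:i + 1], parent=node)
        node = node.children[ch]
    node.isWord = True
    node.count += 1
    node.top_results[word] += 1
    return node

root = TrieNode()
leaf = insert(root, "time")
print(leaf.prefix)  # -> time
print(leaf.isWord)  # -> True
```

Each insertion costs O(M) for a word of length M, matching the search bound discussed above.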

Since the main structure of the server is based on a Trie, the fundamental algorithms involved are graph algorithms. Basic graph-traversal algorithms such as depth-first search (DFS) and breadth-first search (BFS) are used extensively in the code whenever we need to traverse the whole graph. Of course, details vary from function to function, such as the DFS function signature.

As a simple example of BFS, the __delete_helper method for deleting phrases finds all the phrases in a subtree.

def __delete_helper(self, node):
    """
    Breadth-first search to find all children nodes that are words
    :param node: TrieNode, subtree root
    :return: set(str)
    """
    q = deque([node])
    res = set()
    while q:
        cur = q.popleft()
        if cur.isWord:
            res.add(cur.prefix)
        for _, child in cur.children.items():
            q.append(child)
    return res
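A self-contained usage sketch of the same BFS, with a stripped-down stand-in for TrieNode (the class and function names here are mine, for illustration only):

```python
from collections import deque

class Node:
    """Stripped-down stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix, is_word=False):
        self.prefix = prefix
        self.children = {}
        self.isWord = is_word

def words_in_subtree(node):
    # Same BFS as __delete_helper: visit every descendant, keep the words.
    q = deque([node])
    res = set()
    while q:
        cur = q.popleft()
        if cur.isWord:
            res.add(cur.prefix)
        for child in cur.children.values():
            q.append(child)
    return res

# Build the subtree under "ti" containing the words "tie" and "time".
ti = Node("ti")
tie = Node("tie", is_word=True)
tim = Node("tim")
time_ = Node("time", is_word=True)
ti.children = {"e": tie, "m": tim}
tim.children = {"e": time_}

print(sorted(words_in_subtree(ti)))  # -> ['tie', 'time']
```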

The __search_helper method, which searches for replacements of misspelled words, uses DFS. It traverses a word_list object, which is a nested list of strings, and returns all combinations of words.

@staticmethod
def __search_helper(word_list, idx, path, res):
    if idx == len(word_list):
        res.append(list(path))
        return
    for word in word_list[idx]:
        path.append(word)
        Server.__search_helper(word_list, idx + 1, path, res)
        path.pop()
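A quick standalone run of the same DFS shows how the combinations come out. The function is renamed without name mangling and given convenience defaults; both changes are mine:

```python
def search_helper(word_list, idx=0, path=None, res=None):
    """DFS over a nested word list, collecting one word per position."""
    if path is None:
        path, res = [], []
    if idx == len(word_list):
        res.append(list(path))
        return res
    for word in word_list[idx]:
        path.append(word)       # choose a word for this position
        search_helper(word_list, idx + 1, path, res)
        path.pop()              # backtrack and try the next word
    return res

combos = search_helper([["time", "tide"], ["flies"]])
print(combos)  # -> [['time', 'flies'], ['tide', 'flies']]
```

With two candidate replacements per position, a query of k words yields at most 2^k combinations, which is another reason to cap the number of replacements (discussed below).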

Database

We want to build an auto-complete server that is connected to a database. Our choice of database is Neo4j, an excellent option for expressing complex relationships as graphs. We use the py2neo Python package, which provides all the necessary APIs to communicate with the database. Here is a visualization of how the data looks in the Neo4j browser after inserting the 4 words {trying, tied, time, timing}.

Components Design

For the update operation, instead of searching all nodes in a subtree, we traverse from the leaves all the way up to the root to update the top search results. This optimization reduces the time complexity from exponential to polynomial.
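A minimal sketch of this bottom-up pass, assuming parent pointers as in TrieNode. The helper name, the top-k pruning, and the simplified Node class are my assumptions, not the author's code:

```python
from collections import Counter

class Node:
    """Simplified stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix="", parent=None):
        self.prefix = prefix
        self.parent = parent
        self.children = {}
        self.count = 0
        self.top_results = Counter()

def update_from_leaf(leaf, top_k=10):
    """Propagate a word's count from its leaf up to the root."""
    word, count = leaf.prefix, leaf.count
    node = leaf
    while node is not None:
        node.top_results[word] = count
        # keep only the k most common suggestions at each node
        node.top_results = Counter(dict(node.top_results.most_common(top_k)))
        node = node.parent

# Build the path for "time" and push its count up to the root.
root = Node()
node = root
for i, ch in enumerate("time"):
    child = Node(prefix="time"[:i + 1], parent=node)
    node.children[ch] = child
    node = child
node.count = 3  # the word "time" has been used 3 times
update_from_leaf(node)
print(root.top_results["time"])  # -> 3
```

Each leaf touches only the O(M) nodes on its path to the root, instead of every node in its subtree.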

However, when the server stores millions of phrases, updating the top search results takes a long time. We should find a reasonable update frequency to balance the consistency-latency trade-off. The server update frequency can be configured via a class attribute.

There were numerous challenges I encountered while designing the server class. I would like to share the thinking involved in solving some of these problems.

The first design challenge is how to update the database. A subset of terms might already be stored in the database, while others are new. When traversing the graph database, we have to determine whether each node already exists. If a node exists, the new counts are added to it; if it does not, new nodes are created at the right locations. But how do we update the count of each phrase in the database? The idea is that each node maintains the count of its own phrase; when updating the database, we always use this value for consistency.

When you use Google search, notice that even if you type in some gibberish, the system auto-corrects your input and returns reasonable search results. We want to achieve something similar. Extending the classic auto-corrector from Peter Norvig, we created a Spell class that returns a number of auto-correction results in case of misspelling. The idea is that if an input word is not in an English vocabulary (~40K words), its replacements are searched and inserted into the server. However, this design creates another problem: if we allow too many replacements, the sorting and ranking will increase latency significantly. Therefore, we limit the replacements for each misspelled word to a small number, such as 2 in the current build.
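For illustration, here is a minimal Norvig-style candidate generator with the replacement cap. The function names and the tiny vocabulary are placeholders, not the actual Spell class:

```python
import string

def edits1(word):
    """All strings one edit away from word (Norvig's classic construction)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def corrections(word, vocab, limit=2):
    """Return at most `limit` in-vocabulary replacements for a misspelling."""
    if word in vocab:
        return [word]
    candidates = sorted(w for w in edits1(word) if w in vocab)
    return candidates[:limit]

vocab = {"time", "tide", "tame"}  # placeholder for the ~40K-word vocabulary
print(corrections("timr", vocab))  # -> ['time']
```

The `limit` parameter plays the role of the cap of 2 mentioned above: it bounds how many replacement words get fed into the combination search.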

Serialization is essential for converting objects to a sequence of bytes in order to store them on disk or transfer them over the network. Serializing a complex object such as the Trie server is no easy task. We have to consider which data are the most essential to compress and how to rebuild an application server from the serialized representation. To reconstruct each TrieNode as accurately as possible, we serialize its prefix, number of children, top_results, and is_word flag. The serialize and deserialize operations come in pairs, and we decided to serialize nodes in depth-first-search order. With all these design decisions, we were able to serialize an app server and deserialize it to create a new one. Below is an example of server serialization with the single word “time”. The list is in DFS order, and each item encodes the data described above.

[['', '0', 'time 1', '1'],
['t', '0', 'time 1', '1'],
['ti', '0', 'time 1', '1'],
['tim', '0', 'time 1', '1'],
['time', '1', 'time 1', '0']]
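The DFS serialization above can be reproduced with a small sketch. The Node class and helper below are simplified stand-ins I wrote for illustration, assuming the per-node record is [prefix, is_word, top_results, number_children]:

```python
from collections import Counter

class Node:
    """Minimal stand-in for TrieNode (illustration only)."""
    def __init__(self, prefix="", is_word=False):
        self.prefix = prefix
        self.children = {}
        self.isWord = is_word
        self.top_results = Counter()

def serialize(node, out):
    """Pre-order DFS: emit [prefix, is_word, top_results, num_children] per node."""
    top = " ".join(f"{w} {c}" for w, c in node.top_results.most_common())
    out.append([node.prefix,
                "1" if node.isWord else "0",
                top,
                str(len(node.children))])
    for child in node.children.values():
        serialize(child, out)
    return out

# Rebuild the single-word example: insert "time" and record it in the
# top_results of every node along the path.
root = Node()
root.top_results["time"] = 1
node = root
for i, ch in enumerate("time"):
    child = Node(prefix="time"[:i + 1], is_word=(i == 3))
    child.top_results["time"] = 1
    node.children[ch] = child
    node = child

rows = serialize(root, [])
print(rows[0])   # -> ['', '0', 'time 1', '1']
print(rows[-1])  # -> ['time', '1', 'time 1', '0']
```

Because each record carries its child count, a deserializer can consume the list in the same pre-order sequence and know exactly how many children to attach to each node.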

Future Work

Currently, users access the server by running app.py, which is based on the Tkinter package in the Python standard library. Future plans include providing a REST API using Flask to access the service. We can also add functionality to redirect user selections.

Open source

Please feel free to check out the GitHub repository for implementation details; we welcome any suggestions for improvement.
