Here’s a problem: say you’re ingesting roughly 4 million Reddit comments per day, and you need to detect the presence of 10,000 specific keywords in near real time.
The naive answer is obvious - loop through each comment and keyword. Done.
```python
for comment in reddit_comments:
    for keyword in keywords:
        if keyword in comment:
            ...  # alert user
```
It’s simple. Also visibly slow.
Even back-of-envelope, that’s 40 billion substring scans per day (4 million comments × 10,000 keywords). The work grows with the product of comments and keywords, and there’s no way you’d be able to run that on a tiny VPS.
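The arithmetic behind that estimate, using the numbers from the intro:

```python
comments_per_day = 4_000_000
keyword_count = 10_000

# one substring scan per (comment, keyword) pair
scans_per_day = comments_per_day * keyword_count
print(f"{scans_per_day:,}")  # 40,000,000,000
```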
That’s the problem Lewis Van Winkle came across when building F5Bot (a free Reddit scraping service). And he ran it on a tiny VPS with an anemic CPU, not some beefy cluster.
The solution? An algorithm from 1975 called Aho-Corasick. It lets you search for all of your patterns in a single pass over the text.
The $600 bibliographer problem
The problem surfaced at Bell Labs in 1975, when Margaret Corasick was working on a tool to search for keywords on government-provided tapes. A bibliographer hit a $600 usage limit while searching for hundreds of keywords at once - and the program still hadn’t finished. The kind of problem you hit when your algorithm doesn’t scale.
Corasick attended a seminar on algorithm design given by Alfred Aho, and together they came up with what we now know as the Aho-Corasick algorithm.
She implemented it. The bibliographer’s search now ran for $25 - 24x cheaper.
Why care
Aho-Corasick is actually a very popular algorithm. The original implementation of fgrep was Aho-Corasick. Virus scanners use it to scan for thousands of malware signatures at once. F5Bot uses it to monitor comments from Reddit and Hacker News.
Just look at the benchmark from F5Bot.
```
Loaded 3000 keywords to search on a text of 19377 chars
searching with strpos (naive): 0.384 seconds
searching with aho-corasick: 0.055 seconds (7x faster)
```
That’s a flex I’d be happy to have any day.
How it works
The basic idea is - if you’re searching for multiple keywords, you’re doing redundant work.
Let’s say you’re searching for "art", "cart", and "ted". Using the naive approach, you’d scan through your text three times - once for each keyword.
But the keywords overlap, and you can exploit that. Build a finite state machine that shares suffixes - basically, a tree (of strings) where each node represents a prefix of one of your keywords. As you scan through your text character by character, you move through this tree. Sounds eerily similar to a trie? That’s because it is a trie!
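Here’s a bare-bones sketch of that prefix tree using nested dicts (my own illustration - the names and structure are not from F5Bot or the paper):

```python
def build_trie(keywords):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = word  # mark that a complete keyword ends here
    return root

trie = build_trie(["art", "cart", "ted"])
# trie["c"]["a"]["r"]["t"]["$end"] == "cart"
```

Walking the text through this tree alone isn’t enough, though - on a mismatch you’d still restart from the root. That redundancy is what the failure links remove.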
What’s clever is that the algorithm builds “failure links” between nodes. In English, that’s: “if the next character doesn’t match, don’t start over from the root - jump to the node representing the longest suffix of what you’ve matched so far.”
Let me show you a simple example. If you’re searching for {a, ab, b, bc, bca, c, ca, caa}, the algorithm builds a trie (prefix tree) and adds the extra links you see in blue/green:
If you’re familiar with a trie, this is just a trie plus failure links. They let you instantly jump to the longest matching suffix instead of rescanning from the root. Say you’re searching "abccab": you walk through this trie (automaton) once, character by character. The text is scanned only once, so the search runs in O(n), where n is the length of the text.
A naive approach would have you scanning the text once per keyword: O(n * m * k), where m is the number of keywords and k is their average length. Aho-Corasick is O(n + L + z), where L is the total length of all the keywords (paid once, when building the automaton) and z is the number of matches. Even if you searched for millions of keywords, the scan over the text stays linear.
Implementation
You can find a simple implementation in my recent project, where I adapted it to search for 9 mutation patterns in a genome file (roughly 4M characters). It searched the whole thing in 0.75s. Sure, the patterns I’m using are tiny, but it’s pretty fast!
I’m showing a simple example so you get how it works.
```python
# naive approach
for comment in reddit_comments:
    for keyword in keywords:
        if keyword in comment:
            ...  # alert user
```
```
# with aho-corasick
step 1: build an automaton (trie) from the patterns
step 2: add failure links (with BFS)
step 3: walk the text through the automaton, following failure links on mismatches, and return the patterns that matched
```
I think the implementation is about 60 lines of actual code.
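Here’s a minimal sketch of those three steps (my own dict-based Python version, not F5Bot’s code):

```python
from collections import deque

def build_automaton(patterns):
    # step 1: build the trie. Parallel lists: goto[s] maps a character to
    # a child state, fail[s] is the failure link, out[s] holds the
    # keywords that end at state s.
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # step 2: add failure links with a BFS from the root. A failure link
    # points at the state for the longest proper suffix that is also a
    # path in the trie.
    queue = deque(goto[0].values())  # depth-1 states keep fail = root
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f][ch] if ch in goto[f] else 0
            out[child] += out[fail[child]]  # inherit matches that end here

    return goto, fail, out

def search(text, automaton):
    # step 3: walk the text through the automaton in a single pass
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # mismatch: follow failure links
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))  # (start index, keyword)
    return matches

auto = build_automaton(["a", "ab", "b", "bc", "bca", "c", "ca", "caa"])
print(search("abccab", auto)[:3])  # [(0, 'a'), (0, 'ab'), (1, 'b')]
```

That’s the entire algorithm: two build steps, then a single linear scan over the text.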
There’s also a ready-made library - pyahocorasick; it’s super easy to use.
Use this when
You don’t have to reach for Aho-Corasick when searching for three keywords in a paragraph. Use it when:
- you’re searching for thousands of keywords that don’t change often (F5Bot usecase)
- it needs to run often (think streams)
Closing thoughts
Aho-Corasick was worked out by Margaret Corasick with pen and paper 50 years ago, without any AI. And the core ideas of multi-pattern string matching haven’t changed much since.
Computers today are millions of times faster, and the $600 search problem would be trivial today. But if you’re processing millions of Reddit comments per day on a tiny VPS, you can’t just throw hardware at the problem.
Food for thought: there’s probably a 50-year-old algorithm that solves your problem better than whatever you’re about to implement. If you read this far, let me know which algorithms you find exciting!
Footnotes:
- The original 1975 paper Efficient string matching: An aid to bibliographic search is surprisingly readable.
- Margaret J. Corasick’s PhD dissertation (1974) was on “securing proprietary data within open systems” - early information security. Sad that she doesn’t have her own Wikipedia page, even though her algorithm and Alfred Aho (he’s the ‘A’ in AWK - Aho, Weinberger, Kernighan) both have their own pages.
- There’s a Python library called pyahocorasick that’s implemented in C for production use. I’m sure it’s way faster than my version, but the principles are identical. I’ll run a benchmark soon.
- KMP (Knuth-Morris-Pratt) uses a similar approach, compiling a pattern into a finite automaton. Basically Aho-Corasick for a single pattern. Both precompute a structure and exploit prefixes that show up repeatedly.
- The Aho-Corasick automaton can become massive - O(total characters in all keywords) states. If you have millions of patterns, it might need significant memory.
