At Superhuman, we make the fastest email experience in the world. We must therefore process massive amounts of text very rapidly. We need to find links, validate emails, parse invitations, and much more.

Most programmers process text with regular expressions. And rightly so: regular expressions are concise yet powerful.

But many programmers have also encountered the dark side of regular expressions. When regular expressions go wrong, they go devastatingly wrong.

The Dark Side of Regular Expressions

Superhuman automatically converts email addresses into mailto: links. In this case, we defined an email address as any string that matches this regular expression: /([^@]*)@([^@]*)/. You can read this as: “0 or more of any character except @, then an @ sign, then 0 or more of any character except @”.

But while reading the formal specification of email addresses, we found some that did not match our regular expression. For example, if the username is quoted then it can also contain an @ sign. Consider "joe@home"@example.com.

Now you might be thinking: “That’s crazy! Who would do that?!” But when you deal with user data, the unexpected happens all the time. Imagine you have 1 million users, and that each user has 50,000 emails. Between them, this is 50 billion emails. Even if it only happens once in every billion emails, that could still easily be 50 times!

Wanting to do the right thing, we naïvely changed our regular expression to /("[^"]*"|[^@])*@[^@]*/. This is exactly the same as before, but also allows the username to contain 0 or more quoted sections.

It only took a few days before we received an email saying: “Please help! Superhuman is using 100% CPU and not responding…” We could see that the problem started after we changed this regular expression, and we could see that Superhuman broke on one email in particular. In some cases, the CPU would lock up for days. But why?

It turns out that we had accidentally become vulnerable to regular expression denial of service (ReDoS).
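To make the difference between the two patterns concrete, here is a small sketch that runs both against the quoted address from the text. The variable names are illustrative only, and the snippet deliberately uses a short, well-formed input so that neither pattern backtracks heavily.

```javascript
// The two patterns quoted in the article.
const simple = /([^@]*)@([^@]*)/;
const quoted = /("[^"]*"|[^@])*@[^@]*/;

const address = '"joe@home"@example.com';

// The first pattern splits at the first @, even though it sits inside quotes.
const simpleMatch = simple.exec(address);
console.log(simpleMatch[1]); // "joe
console.log(simpleMatch[2]); // home"

// The second pattern consumes the quoted section as one unit,
// so the match covers the whole address.
const quotedMatch = quoted.exec(address);
console.log(quotedMatch[0] === address); // true
```

So the first pattern does find something in the quoted address, but not the address itself: it stops at the @ inside the quotes.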
Regular Expression Denial of Service

Theoretically, a regular expression is equivalent to a state machine that matches one character at a time. The state machine for our first email matcher looks like this:

[Figure: Example state machine for /[^@]*@[^@]*/]

This state machine has 3 states: A, B, and $. The machine starts in state A. While here, the machine will match any character except @ and remain in state A. If it encounters an @ sign, the machine will transition to state B.

In state B, the machine will match any character except @, and remain in state B. At the end of the string, the machine transitions to state $.

If the machine encountered an @ sign while in state B, it would error as there are no matching transitions. The error would show that the string did not match the regular expression.

So what changed when we updated our regular expression? The state machine for our second email matcher looks like this:

[Figure: Example state machine for /("[^"]*"|[^@])*@[^@]*/]

Do you see the problem? It’s not obvious, but we introduced non-determinism. In state A, if the machine sees " it has a choice: treat it as " and transition to state C, or treat it as any character except @ and stay in state A.

There are some theoretical solutions to this problem, but JavaScript and most modern programming languages take a dangerous approach: when the state machine has multiple paths, it will just choose one and continue. If that choice leads to the entire string matching, the machine will stop. If that choice does not lead to a match, it will backtrack and try the next path.

In the worst case, the state machine has to try every single possible combination of options before it can determine that there is no match. And the number of options very quickly becomes huge. In our example, every " character doubles the number of possibilities. Our regular expression can therefore take O(2ⁿ) attempts to match, where n is the length of the string.
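For contrast, the first, deterministic state machine (states A, B, and $) can be simulated directly. The sketch below is an illustration rather than how a regex engine is implemented: the function name is made up, and it assumes the whole string must match, as in the state machine description. Because every character has at most one possible transition, it reads each character exactly once and never backtracks.

```javascript
// Hand-written simulation of the state machine for /[^@]*@[^@]*/,
// applied to the whole input string. Hypothetical helper for illustration.
function matchesSimpleEmail(input) {
  let state = "A";
  for (const ch of input) {
    if (state === "A") {
      // In A: an @ moves to B, any other character stays in A.
      if (ch === "@") state = "B";
    } else {
      // In B: a second @ has no transition, so the match fails.
      if (ch === "@") return false;
    }
  }
  // At the end of the string, only state B can move to the final state $.
  return state === "B";
}

console.log(matchesSimpleEmail("joe@example.com")); // true
console.log(matchesSimpleEmail("joe"));             // false
console.log(matchesSimpleEmail("a@b@c"));           // false
```

The work here is proportional to the length of the input. The backtracking matcher for the second pattern has no such guarantee.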
" O ⁿ n If you don’t believe me, open your browser console and type: let regex = /("[^"]*"|[^@])*@([^@]*)/ t = performance.now() regex.test('"""""""""""""""""""""""""""""""""""""""') console.log(performance.now() - t) // about 3 seconds. Each additional character doubles the time required. This 40 character string takes about 3 seconds on a high-end MacBook Pro. A similar string of just 64 characters will take more than 2 years — assuming you don’t run out of battery in the meantime! " In summary: If our regular expression is run against a valid email address, it will not backtrack and it will run very quickly. If our regular expression is run against the vast majority of common input, it will backtrack a little and it will still run quickly. If our regular expression is run against a very specific pattern, it will backtrack catastrophically and may never end. In other words, ReDoS manifests only in specific conditions, is catastrophic when it does, and — worst of all — cannot be caught by traditional testing as the conditions needed to trigger it are rare. The solution We turned to academia, and found that identifying ReDoS is still a relatively new area of research. One paper, however, was particularly useful. In “ ”, the authors describe a beautiful theoretical approach to this problem. They also provide a tool to test regular expressions. It works like this: Static Analysis for Regular Expression Exponential Runtime via Substructural Logics Compile the regular expression into a state machine. Look for ambiguity within the loops of the state machine execution graph. Run a bounded search of the execution graph to determine if these ambiguities can be triggered in a loop. If so, this would indicate a ReDoS vulnerability. A brilliant side effect of this strategy is that it generates an example string that will trigger ReDoS. This can be extremely useful for debugging, as you can see which parts of the regular expression are triggered. 
The original tool is written in OCaml and requires many dependencies to run. To make it easily available for everyday use, we created regex.rip! This is the OCaml tool wrapped in an HTTP API, and with some extra JavaScript regex features.

With regex.rip, you can create regexes with confidence. You can test your regexes so that users don’t have to. And best of all, you can avoid catastrophic downtime that you would otherwise only see in production.

Please feel free to use regex.rip with all your regular expressions. If you want to contribute, please find the code on GitHub at superhuman/rxxr2.

Thanks to Elena, Joy, Ruchi, Terin, and Islam for comments and suggestions on this post.

At Superhuman we’re rebuilding the email experience for web & mobile. Think vim / Sublime for email: blazingly fast and visually gorgeous. If you enjoy working on user-facing problems that push existing solutions to their very limits, or beyond, join us! Learn more or email me.

How We Eliminated Regular Expression Denial of Service and How You Can Too
