For the Love of Jeez, Stop Using Regex in Software

If it can’t be done with regex, we can’t do it!

Do you know what a code obfuscator is? It’s a little program, or a piece of a compiler toolchain, that takes your program code and scrambles it into a bunch of seemingly random letters and symbols so that other people can’t read it (at least not without a matching deobfuscator to unscramble it). The code still works the same way. It’s just not readable by humans.

Imagine if you and I worked on some code together. One day, I check in some random piece of code and give it to you for code review. Except something’s different. Now I’ve sent it through the code obfuscator before it’s even compiled. The obfuscated code, the random-looking symbols, is sitting there in a pull request. In the source code. In Git. You have a simple CRUD app, and now there’s actual crud in it.

What would you do?

It’s hard to say, since the situation makes so little sense that it’s difficult to perceive it as anything but a practical joke. But this is exactly what happens when somebody, straight-faced, commits regex into a software repository.

Regex is a handy sysadmin tool for doing ad hoc tasks like log parsing. Nothing more, nothing less. It gives you a one-liner way of finding something while piping strings together in Bash on Linux.

Regex is not magic. A regex essentially “compiles” into a complicated state machine which tracks conditions with the equivalent of a number of if-statements. It is not guaranteed some magical SIMD performance boost. The opposite, in fact: for simple checks, regex is often slower than writing your own group of if-statements.
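To make the state machine point concrete, here is a minimal sketch of what a pattern like -?[0-9]+ (an optional minus sign, then one or more digits, matched against the whole string) boils down to when written by hand. The function name and the state labels are made up for illustration, and a real engine generates something far more general than this.

```python
def is_signed_integer(text: str) -> bool:
    """Hand-written equivalent of -?[0-9]+ applied to the whole string."""
    state = "start"  # possible states: "start", "sign", "digits"
    for ch in text:
        if state == "start" and ch == "-":
            # An optional leading minus sign is only allowed at the start.
            state = "sign"
        elif ch in "0123456789":
            # A digit is valid from any state and moves us to "digits".
            state = "digits"
        else:
            # Anything else (including a minus sign later on) is a mismatch.
            return False
    # Accept only if at least one digit was seen.
    return state == "digits"
```

Every branch here is something a reviewer can read, step through in a debugger, and change without fear.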

Regex does not belong in software because it is not readable. If you work in teams, you’ve probably rejected someone’s code in review because some part was too hard to understand. Regex is always hard to understand, because it’s not even code. It is difficult to confirm that a regex is correct, because you need a microscope to check and double-check all the escape characters and other magical symbols. You must hand-translate all of it. Nobody ever expects to be able to edit someone else’s regex. You might as well delete it and start over, except you don’t actually know what it does. It is not self-documenting. It’s self-obfuscating. Reading regex is a reminder of what it felt like before you learned how to program.

Regex does not belong in software because it’s a security flaw. In the backtracking engines that many mainstream languages ship, certain patterns contain segments that backtrack so badly that a well-crafted input (often just a long run of the same character followed by one character that doesn’t match) will cause the code to run nearly forever. An attacker can simply send that string as input to create a self-contained DoS attack against a service, commonly called ReDoS. You probably don’t have anybody on staff who can, on sight, tell you whether a given regex is a security flaw or not. (Because you can’t read regex.)
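Here is a rough sketch of the problem, using the textbook risky pattern ^(a+)+$ with Python’s built-in re module. The names RISKY, only_a and partitions_to_try are invented for this example. The regex is deliberately only run on short inputs; the helper just counts how many ways a run of identical characters can be split into groups, which is roughly the number of paths a naive backtracking engine may explore before it gives up on a near-miss input.

```python
import re

# Textbook example of a pattern prone to catastrophic backtracking.
RISKY = re.compile(r"^(a+)+$")


def only_a(text: str) -> bool:
    """Linear-time equivalent: non-empty and every character is 'a'."""
    return len(text) > 0 and all(ch == "a" for ch in text)


def partitions_to_try(run_length: int) -> int:
    """Ways to split a run of n >= 1 characters into non-empty groups: 2**(n-1)."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    return 2 ** (run_length - 1)


# Safe to run: the inputs are tiny.
print(bool(RISKY.match("aaaa")))    # True
print(bool(RISKY.match("aaaa!")))   # False
print(only_a("aaaa"), only_a("aaaa!"))  # True False

# Thirty 'a' characters followed by '!' already means about half a billion splits.
print(partitions_to_try(30))  # 536870912
```

The hand-written check looks at each character once, so a 100,000-character input costs 100,000 comparisons, no matter how it ends.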

Regex is not useful for parsing, serialization or deserialization of XML or JSON. A true regular expression can only recognize regular languages, but XML and JSON are nested formats that need at least a context-free grammar. You don’t want regex for dynamically mapping requests to routes in web service frameworks, either. You want a custom tree-like data structure.
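For the routing case, the tree-like structure could look something like the sketch below: a small trie keyed on path segments. RouteNode, Router and the ":name" parameter syntax are all assumptions made for this example, not any particular framework’s API. It is also intentionally simplistic: one parameter name per position, and no fallback from a literal segment to a parameter if the literal branch dead-ends.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[..., str]


class RouteNode:
    def __init__(self) -> None:
        # Literal path segments, e.g. "users".
        self.children: Dict[str, "RouteNode"] = {}
        # At most one parameter segment per node, stored as (name, child).
        self.param: Optional[Tuple[str, "RouteNode"]] = None
        self.handler: Optional[Handler] = None


class Router:
    def __init__(self) -> None:
        self.root = RouteNode()

    def add(self, path: str, handler: Handler) -> None:
        node = self.root
        for segment in filter(None, path.split("/")):
            if segment.startswith(":"):
                if node.param is None:
                    node.param = (segment[1:], RouteNode())
                node = node.param[1]
            else:
                node = node.children.setdefault(segment, RouteNode())
        node.handler = handler

    def match(self, path: str) -> Optional[Tuple[Handler, Dict[str, str]]]:
        node = self.root
        params: Dict[str, str] = {}
        for segment in filter(None, path.split("/")):
            if segment in node.children:
                node = node.children[segment]
            elif node.param is not None:
                name, child = node.param
                params[name] = segment
                node = child
            else:
                return None
        if node.handler is None:
            return None
        return node.handler, params


router = Router()
router.add("/users/:user_id", lambda user_id: f"user {user_id}")

found = router.match("/users/42")
if found is not None:
    handler, params = found
    print(handler(**params))  # user 42
```

Lookup cost grows with the number of segments in the request path, not with the number of registered routes, and a failed match tells you exactly which segment had nowhere to go.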

You don’t need regex for validation of anything. You need a series of if-statements. If-statements can be verified by humans (and machines). If-statements can be easily modified as business needs and constraints change. Your regex that precisely validates only RFC-compliant e-mail addresses must be thrown out entirely because, oops, your customer has a non-RFC-compliant e-mail address, because RFCs are made up.
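As an illustration, here is a sketch of e-mail validation written as plain if-statements. The function name and every rule in it (including the 254-character limit) are hypothetical business rules chosen for the example, not an attempt at RFC compliance. The point is that each rule sits on its own line and can be loosened or deleted when a real customer turns up with an address that breaks it.

```python
from typing import Optional


def email_problem(address: str) -> Optional[str]:
    """Return a human-readable reason the address is rejected, or None if accepted."""
    if len(address) > 254:  # hypothetical limit, adjust to taste
        return "too long"
    if any(ch.isspace() for ch in address):
        return "must not contain whitespace"
    if address.count("@") != 1:
        return "must contain exactly one @"
    local, domain = address.split("@")
    if not local:
        return "missing local part"
    if "." not in domain:
        return "domain must contain a dot"
    if domain.startswith(".") or domain.endswith("."):
        return "domain must not start or end with a dot"
    return None


print(email_problem("alice@example.com"))  # None
print(email_problem("alice@localhost"))    # domain must contain a dot
```

Unlike a failed regex match, a rejection here comes with a reason you can show the user or write to a log.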

Regex isn’t some magical, mysterious, high-class, final frontier of computer science. Actually, it sucks. Stop using it in software.