Regular expressions are a concise and powerful tool for processing text. However, they also come with a steep learning curve and plenty of opportunities to make mistakes. This is the first in a series of posts about some specific pitfalls of Java regular expressions that can lead to bugs, code that’s hard to understand, or worse: code that could crash your application. In this series we will give you some examples of issues in real code caused by these pitfalls, and discuss strategies (and rules!) for writing better, more readable and maintainable regular expressions. In this post I’ll start with pitfalls related to a very common feature of regular expressions: character classes.

Note that writing this blog post has been made possible thanks to the group effort of the whole SonarSource Java analysis team. Transforming our initial ideas into such features is a great collective achievement, which I’ll now share with you, speaking for the team!

Character classes allow the regex engine to match only one out of several characters. For instance, the character class [xy] can match either an x or a y.

You can also use ranges inside character classes: [e-p] matches any character between e and p.

You can invert or negate character classes with ^: by starting the character class with a single ^ you negate everything that follows in the class. So [^a-z] matches anything that's not a lowercase ASCII letter.

Where it starts to be tricky is that some characters have different meanings inside character classes than they do outside. The best example of this is probably the hyphen/minus character -, which gains the special meaning of creating ranges when used inside a character class. To match a literal -, you can escape it (\-) or move the - to the beginning or end of the character class. Another example is the multipliers. For instance, outside a character class, * means "repeated any number of times". Inside a character class it just means "asterisk".
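The character class basics described above can be checked with a few calls to java.util.regex. This is only a minimal sketch: the class name CharacterClassBasics and the helper matches are made up for illustration, and each call tests whether the whole input matches the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from any of the projects discussed in this post.
final class CharacterClassBasics {

    // Returns true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[xy]", "x"));    // true: x is one of the listed characters
        System.out.println(matches("[xy]", "xy"));   // false: a class consumes exactly one character
        System.out.println(matches("[e-p]", "k"));   // true: k lies between e and p
        System.out.println(matches("[^a-z]", "K"));  // true: K is not a lowercase ASCII letter

        // Three ways to get a literal hyphen, and one way to lose it.
        System.out.println(matches("[a\\-z]", "-")); // true: escaped hyphen
        System.out.println(matches("[-az]", "-"));   // true: hyphen in first position
        System.out.println(matches("[az-]", "-"));   // true: hyphen in last position
        System.out.println(matches("[a-z]", "-"));   // false: here the hyphen creates a range

        // The asterisk is a quantifier outside a class and a plain character inside.
        System.out.println(matches("a*", "aaa"));    // true
        System.out.println(matches("[a*]", "*"));    // true
        System.out.println(matches("[a*]", "aaa"));  // false
    }
}
```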
You probably think this "Character Classes" concept is easy and well understood by developers. However, after running our analyzer on a few GitHub open-source projects, we realized that it might not be the case at all. So let's take a look at real code and see how creative developers can be!

Problem 1: Wrong use of separators

There is a lot of confusion around the | character. Outside of a character class, it is an alternation operator. So it would allow you to select "red" or "blue", like so: red|blue. But inside a character class, it's just a normal character with no special behavior. For example in this “mobile-phone number” matcher:

    Pattern.compile("^1[3|4|5|7|8][0-9]{9}$")

Source (ValidateUtils.java)

The author should replace [3|4|5|7|8] with [34578] in the pattern.

Other developers make the same mistake with commas, as in this example:

    …[0,2,3,5-9]…

Source (PhoneRecognition.java)

And the negation symbol ^ should only be used at the beginning of the character class and not before each element, like in this NanoHTTPD code:

    …[^/^ ^;^,]…

Source (ContentType.java)

Problem 2: Wrong character

A more subtle potential bug is the uppercase and lowercase mix in character ranges, like in the Apache Camel code:

    …[\\.|a-z|A-z|0-9]…

Source (KafkaHeaderFilterStrategy.java)

Do you see the bug? Not the wrong | use, the other one? Because of the second lower-case z, the range [A-z] matches characters in the ASCII table from A to Z, plus [, \, ], ^, _, `, and adds from a to z on top of that. Isn't it strange? So now it should take you only one second to find a bug in this Elasticsearch code, which is commented "defined by RFC7230 section 3.2.6" for this expression:

    Pattern.compile("[a-zA-z0-9!#$%&'*+\\-.\\^_`|~]+");

Source (RestRequest.java)

Unfortunately, RFC7230 does not allow [, \, ] in HTTP header field values, so it's definitely a bug.
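To see what the separator and mixed-case classes above really accept, one option is to enumerate the printable ASCII characters each class matches. The sketch below assumes nothing beyond java.util.regex; the class SeparatorAndCasePitfalls and the helper matchedAscii are hypothetical names introduced for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from any of the projects discussed in this post.
final class SeparatorAndCasePitfalls {

    // Collects, in ASCII order, every printable ASCII character (0x20 to 0x7E)
    // that the given character class matches.
    static String matchedAscii(String characterClass) {
        Pattern pattern = Pattern.compile(characterClass);
        StringBuilder matched = new StringBuilder();
        for (char c = 0x20; c <= 0x7E; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                matched.append(c);
            }
        }
        return matched.toString();
    }

    public static void main(String[] args) {
        // The pipe and the comma are ordinary members of the class.
        System.out.println(matchedAscii("[3|4|5|7|8]")); // 34578|
        System.out.println(matchedAscii("[34578]"));     // 34578
        System.out.println(matchedAscii("[0,2,3,5-9]")); // ,02356789

        // So the phone-number pattern accepts a pipe as its second character.
        System.out.println(Pattern.matches("^1[3|4|5|7|8][0-9]{9}$", "1|123456789")); // true

        // Extra characters pulled in by [A-z] once the letters are removed.
        System.out.println(matchedAscii("[A-z]").replaceAll("[A-Za-z]", "")); // [\]^_`
    }
}
```

Enumerating the matches this way is slow but makes the surprise visible at a glance, which can be handy in a unit test for a validation pattern.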
A similar bug could also occur when you want to match the character - and forget to escape it or move it to the first or last position in the class (where it would lose its special meaning). Can you spot which - character is wrong in the following Jenkins code?

    USERINFO_CHARS_REGEX = "[a-zA-Z0-9%-._~!$&'()*+,;=]";

Source (UrlValidator.java)

It's the one in the range %-.; it does not match 3 characters but 10: %&'()*+,-. And because most of the matched characters are also listed after %-. in the character class, we know that the range was not intentional. Luckily, the accidental range is harmless here: it only covers characters the class was meant to match anyway, including the - itself. But sometimes this confusion can have a bigger impact:

    String safetextRegex = "^[a-zA-Z0-9 .,;-_€@$äÄöÖüÜ!?#&=]+$";

Source (ValidationBean.java)

Nice variable name, but unfortunately this character class is most probably not as safe as expected by its initial writer. Indeed, here ;-_ does not match 3 characters, but 37!

And don't forget that a character class matches one and only one character. If you want to match characters '0' '1' '2' '3', you can use [0-3]. But what do you think the following Apache Hadoop code is supposed to match?

    Pattern.compile("acl[0-31]")

Source (AoclDiagnosticOutputParser.java)

The trailing 1 could just be a redundancy and not a bug. But, if the intention was to match an acl number as defined by Intel from acl0 to acl31, then it's a bug.

Likewise, matching uppercase and lowercase requires two character ranges, [A-Za-z], and not only one like in this Apache Geode code:

    …[aA-zZ0-9-_.]…

Source (CreateLuceneCommandParametersValidator.java)

Problem 3: Wrong regex operator

Sometimes alternations like (jpg|png|gif) are wrongly written using character classes. Can you spot the bug in the following Alibaba's Tangram source code?

    Pattern.compile("(\\d+)x(\\d+)(_?q\\d+)?(\\.[jpg|png|gif])");

Source (Utils.java)

Good to know: * and ? are just normal characters when used in character classes and lose their meaning as quantifiers.
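The accidental ranges and the class-versus-alternation confusion above can be reproduced in a few lines. This is a rough sketch: RangeAndOperatorPitfalls and countAscii are hypothetical names, and the "acl0 to acl31" pattern is only one possible fix under the assumption that this was the intent, which the source does not confirm.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from any of the projects discussed in this post.
final class RangeAndOperatorPitfalls {

    // Counts how many of the 128 ASCII characters the given character class matches.
    static int countAscii(String characterClass) {
        Pattern pattern = Pattern.compile(characterClass);
        int count = 0;
        for (char c = 0; c < 128; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // An unescaped hyphen between two characters silently becomes a range.
        System.out.println(countAscii("[%-.]"));  // 10, not 3
        System.out.println(countAscii("[;-_]"));  // 37, not 3
        System.out.println(countAscii("[0-31]")); // 4: the range 0-3 plus a redundant 1

        // A class matches a single character, so two-digit numbers never match.
        System.out.println(Pattern.matches("acl[0-31]", "acl31"));              // false
        // Assumed intent (acl0 to acl31), written as an alternation.
        System.out.println(Pattern.matches("acl([12]?[0-9]|3[01])", "acl31"));  // true

        // Alternatives belong in a group, not in a character class.
        System.out.println(Pattern.matches("\\.[jpg|png|gif]", ".jpg")); // false
        System.out.println(Pattern.matches("\\.[jpg|png|gif]", ".|"));   // true
        System.out.println(Pattern.matches("\\.(jpg|png|gif)", ".jpg")); // true

        // Quantifier characters are literal inside a class.
        System.out.println(Pattern.matches("[*?]", "?")); // true
    }
}
```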
So in this next example, why would you add a ? inside a character class?

    String VALUE = "[[^\"]?]+"; // anything but a " in ""

Source (HiveHistoryUtil.java)

It's a complicated way to write [^\"]+, and probably the intention was actually to write [^\"]*.

The above bugs were found by our new rule java:S5869 - Character classes in regular expressions should not contain the same character twice. The initial goal of this rule was to spot tiny misunderstandings. But in the end, the findings far exceeded our expectations and will ultimately prevent some very painful bugs in your applications. S5869 is available today in SonarQube, SonarCloud and SonarLint.

The saying that with great power comes great responsibility is often attributed to Voltaire. But what we've learned in implementing rules for regular expressions is that with the great power of regular expressions also come great challenges to write them well. In this post I talked about what we found with rule S5869, but it's only one of the regex rules we've been working on. Next time I'll talk about regex boundaries and complexity.

Previously published at https://blog.sonarsource.com/regular-expressions-present-challenges
