We've been working recently on adding rules to help write better regular expressions in Java. I've talked already about rules to and with . Those rules help you make sure your regular expressions are accurate and maintainable. Today I will show you how to make sure that the regular expression won’t crash your program. find errors in character classes the use of boundary markers and overly complex regular expressions As a brief reminder: regular expressions are a terse and powerful mechanism for matching patterns in strings. Part of the power of regular expressions is that a pattern can concisely describe a string that might be much larger than the original pattern. Sometimes much, much larger. Sometimes enough to overflow the stack and crash your application. Due to the way regular expression matching is implemented in Java (and many other languages/libraries), matching a pattern may - depending on the regex - require stack space proportional to the length of the input. This means large inputs could cause the program to crash with a when you try to use the regex. StackOverflowException We ran into this problem in our own code when analysis tried  to find comments containing at least one word with more than a given number of characters. For this we used a regex similar to , which would cause the analysis to crash on source files containing sufficiently long comments. To prevent problems like this for our future selves as well as for you, dear users, we implemented a rule which can detect problems like this: #(.|\n)*\w{3,} java:S5998 - Regular expressions should not overflow the stack In theory any regex is susceptible to stack overflow if it contains a repetition that contains some sort of branching, such as another repetition or an alternative. However, for some regular expressions the length of input required to make them crash is much larger than for others. To account for this, our rule tries to estimate how much stack space a regular expression will consume relative to the input size and only raises issues on regular expressions whose stack consumption exceeds a configurable threshold. One way to avoid stack consumption in regular expressions is to use . Possessive quantifiers are created by adding a to a quantifier (e.g. instead of ). Doing so disables . In addition to avoiding issues with catastrophic backtracking, making a quantifier possessive allows a pattern to be matched without consuming stack space. The problem is that sometimes backtracking is necessary and using possessive quantifiers to disable it may leave you with a regex that can never match any input. possessive quantifiers + x*+ x* backtracking Consider for example the regex I cited above: : If we try to fix the stack overflow by making the quantifier possessive, we end up with . This regex can never match anything because any input that could be matched by will already have been matched by and, being possessive, it won’t give it back. Luckily we have a rule that warns you about issues like this: #(.|\n)*\w{3,} #(.|\n)*+\w{3,} \w (.|\n)*+ java:S5994 - Regex patterns following a possessive quantifier should not always fail So how would one properly fix this issue? In our case, we decided to get rid of the regex altogether since there wasn’t really a good reason to use a regex here. But the regex could also be salvaged easily enough by getting rid of the alternation. The intent of was to match any character, including line breaks, because . does not match line breaks by default. The proper way to match any character would be to enable the flag, which can also be enabled from within the regex using . So either or would work (the latter of which also works when using methods that don’t involve ). .|\n DOTALL (?s) Pattern.compile(“#.*\\w{3,}”, Pattern.DOTALL) ”(?s)#.*\\{3,}” Pattern.compile Another regex susceptible to stack overflows would be something as simple as . Here the alternation can be removed by using a character class instead, which doesn’t require stack space to match: . Luckily we have a rule that finds alternations that can be replaced by character classes and suggests just that replacement like in the following code: (a|b)* [ab]* Spring Boot java:S6035 - Single-character alternations in regular expressions should be replaced with character classes Regular expressions are a truly powerful feature. Because great power comes with great responsibility (and great opportunities for screwing things up, or confusing your colleagues and future-you), we now offer a total of 24 rules targeting regular expressions. They are all available today in SonarCloud. These three rules around avoiding stack overflows are available starting from SonarQube 8.7, but the rest are already released. You can see them all at . rules.sonarsource.com Finally, note that some of the rules perform in-depth automata-based analyses on your regular expressions to identify issues as accurately as possible. This is an extremely promising approach and we will probably discuss this feature in a future blog post because we truly believe it’s amazing. This will for sure bring A LOT to regex code quality in general, so stay tuned!

SonarSource

Getting Regex Boundaries Right: A How-To Guide

Avoid Stack Overflows By Writing Regexes Properly

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Developer-Led Code Security Will Dominate the SAST Market

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

Developer-Led Code Security Will Dominate the SAST Market

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps