
Benchmarking C# Regular Expressions to Measure Performance

by Dev Leader, April 15th, 2024

Too Long; Didn't Read

Regular expressions are powerful for pattern matching, but what about performance? Check out this article for details on C# regex performance from benchmarks!

Regular expressions in C# can save you when you need to do some complex pattern matching on strings. But as the language grows and evolves, we continue to find new ways to use regular expressions in C#. That’s why I wanted to take a moment to consider C# Regex performance by running benchmarks across the various ways I’m familiar with using regular expressions.


When we have options that look the same, I like staying curious and understanding what’s *actually* different. This turned out to be a fun exercise, and I hope you take away a few lessons, just like I did!

PLEASE NOTE:

There was an error in the benchmarks that I created because of an assumption that was invalid. The contents of this article can still be very valuable, but there is an update to the benchmark code and an accurate analysis in this follow-up article:

https://www.devleader.ca/2024/04/12/csharp-regular-expression-benchmarks-how-to-avoid-my-mistakes/


What is a Regular Expression?

Regular expressions, often referred to as regex, are powerful tools used for pattern matching in text. They allow you to define a search pattern that can be used to find, replace, or manipulate specific parts of a string. Regular expressions provide a concise and flexible way to search for and identify specific patterns within text data.
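
As a small illustration of finding and replacing, here is a minimal sketch using the static Regex.Matches and Regex.Replace methods from System.Text.RegularExpressions. The sample text and the date-like pattern are made up for this example and have nothing to do with the benchmark data.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input, not taken from the benchmark data.
string text = "Order 66 shipped on 2024-04-15; order 67 ships on 2024-04-18.";

// Find: every ISO-style date (four digits, two digits, two digits).
foreach (Match match in Regex.Matches(text, @"\d{4}-\d{2}-\d{2}"))
{
    Console.WriteLine($"Found '{match.Value}' at index {match.Index}");
}

// Replace: mask each date with a placeholder.
string masked = Regex.Replace(text, @"\d{4}-\d{2}-\d{2}", "<date>");
Console.WriteLine(masked);
```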


I have used regular expressions for many different things in my career, such as:

  • Pattern matching on user input
  • Scraping data from the web
  • Parsing data sources like logs or other files
  • Digital forensics data recovery
  • … and many more uses!


Regular expressions can be used for all sorts of advanced pattern matching.


Why Benchmark Regex Performance in C#?

Aside from when I am trying to profile and optimize my applications and services, I like benchmarking things when I am curious. This happens especially when I find there are seemingly multiple ways to do the same thing — it makes me wonder TRULY what the differences are aside from just syntax and usability. Recently, with collection initializers, I was completely blown away by the performance differences, so it’s always a good reminder to be curious.


Regular expressions in C# have several different flavors:

  • Static method call
  • Compiled flag or not
  • Source generators


Now, the compiled flag is supposed to give us a performance boost, but what’s the overhead of using just the static method call, given how convenient it is to call in code? And what the heck are these (relatively) new source generators for regular expressions in C#? Microsoft has some REALLY awesome documentation on compiled regular expressions and source generation, and I would never have stumbled upon it if I hadn’t been curious enough to benchmark.
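
To make the flavors concrete, here is a rough side-by-side sketch. The Flavors class and the sample sentence are hypothetical names made up for illustration; the [GeneratedRegex] attribute assumes .NET 7 or later with the regex source generator available.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "Testing the compiled and generated options";

// 1. Static method call: the Regex class keeps an internal cache of recent patterns.
Console.WriteLine(Regex.Matches(text, Flavors.Pattern).Count);

// 2. Instances created without and with the Compiled flag.
Regex interpreted = new(Flavors.Pattern);
Regex compiled = new(Flavors.Pattern, RegexOptions.Compiled);
Console.WriteLine(interpreted.Matches(text).Count);
Console.WriteLine(compiled.Matches(text).Count);

// 3. Source generator: the matching code is emitted at build time.
Console.WriteLine(Flavors.Generated().Matches(text).Count);

// Hypothetical holder type for this example; it must be partial so the
// source generator can supply the body of Generated().
public static partial class Flavors
{
    public const string Pattern = @"\b\w*(ing|ed)\b";

    [GeneratedRegex(Pattern)]
    public static partial Regex Generated();
}
```

All four calls should agree on the number of matches; what differs is when and how the pattern gets turned into something executable.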


While we won’t be comparing every Regex method we have access to, I did want to see the performance difference for getting all matches in a body of text. Considering the flavors listed above, I was curious to see which would pull ahead, given they all seem roughly the same on the surface.


Setting Up C# Regex Performance Benchmarks

I know you’re eager to jump STRAIGHT to the details, and while nothing is stopping you from scrolling to the bottom, I’m hopeful you’ll pause in this section to understand the benchmarks first. When I post benchmarking investigations, my ultimate goal is not to persuade you to code differently but rather to be curious about what you’re coding. I think we’ll see some obvious things to avoid in these benchmarks — but still, being curious is the goal.

The Test Data For Benchmarking

We’re not interested in absolute performance here (though you might be if you’re profiling and benchmarking your own application); we’re interested in relative performance between the various Regex options. Beyond the choice of option, a couple of other factors could affect the results:

  • The Regex pattern that we use could potentially influence how each mechanism performs
  • The source of data we try to match could have some sort of influence on the results


I call these things out because they are uncertain to me. The source data I feel shouldn’t be TOO big of an issue — but maybe different heuristics of matching don’t allow compiled or source-generated regular expressions to shine. Maybe the Regex pattern I’ve selected doesn’t allow compiled or source-generated regular expressions to have an advantage. Or maybe it does — and I should be able to call this out.

The point is that there are some variables that may influence results, and I don’t fully understand how. But this is me being transparent, and if you know better, feel free to share your insights!


To make this as “fair” as possible, I figured I would look for patterns in real text: words that end in “ing” or “ed.” For the text itself, I’m using data from Project Gutenberg, specifically this e-book (loaded as pg73346.txt in the benchmark code). It’s 2200+ lines of English text, so there are plenty of words that match the pattern.
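
Here is a quick sketch of what that pattern picks up, using the same expression as the benchmark code and a made-up sample sentence rather than the Project Gutenberg text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as the benchmarks: a whole word that ends in "ing" or "ed".
const string pattern = @"\b\w*(ing|ed)\b";

// Made-up sample sentence for illustration.
string sample = "She walked home while singing and he jumped.";

string[] words = Regex.Matches(sample, pattern)
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", words)); // walked, singing, jumped
```

Note that the pattern only looks at word endings, so it would also match words like “king” or “red.” That is fine for a benchmark, where the goal is simply to have plenty of matches.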

The C# Regex Benchmark Code

There’s nothing super unique about these benchmarks compared to others I normally create. However, I’ll list out some points of interest so that you can pick them out in the code:

  • I’m using [Params] to load the source file in case you want to try these benchmarks across different datasets
  • The source data is read during the global setup
  • I need to cache some Regex instances for some of the benchmarks, which is done in the global setup


You can find the benchmark code here on GitHub and in the code below:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

using System.Reflection;
using System.Text.RegularExpressions;

BenchmarkRunner.Run(
    Assembly.GetExecutingAssembly(), 
    args: args);

[MemoryDiagnoser]
[MediumRunJob]
public partial class RegexBenchmarks
{
    private const string RegexPattern = @"\b\w*(ing|ed)\b";

    private string? _sourceText;
    private Regex? _regex;
    private Regex? _regexCompiled;
    private Regex? _generatedRegex;
    private Regex? _generatedRegexCompiled;

    [GeneratedRegex(RegexPattern, RegexOptions.None, "en-US")]
    private static partial Regex GetGeneratedRegex();

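    // Note: the source generator ignores RegexOptions.Compiled, so this
    // variant is expected to behave the same as GetGeneratedRegex().
    [GeneratedRegex(RegexPattern, RegexOptions.Compiled, "en-US")]
    private static partial Regex GetGeneratedRegexCompiled();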

    [Params("pg73346.txt")]
    public string? SourceFileName { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _sourceText = File.ReadAllText(SourceFileName!);
        
        _regex = new(RegexPattern);
        _regexCompiled = new(RegexPattern, RegexOptions.Compiled);
        _generatedRegex = GetGeneratedRegex();
        _generatedRegexCompiled = GetGeneratedRegexCompiled();
    }

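    // Baseline: the static helper, which relies on the Regex class's internal cache.
    [Benchmark(Baseline = true)]
    public MatchCollection Static()
    {
        return Regex.Matches(_sourceText!, RegexPattern);
    }

    // Constructs a new interpreted Regex on every invocation.
    [Benchmark]
    public MatchCollection New()
    {
        Regex regex = new(RegexPattern);
        return regex.Matches(_sourceText!);
    }

    // Constructs a new compiled Regex on every invocation.
    [Benchmark]
    public MatchCollection New_Compiled()
    {
        Regex regex = new(RegexPattern, RegexOptions.Compiled);
        return regex.Matches(_sourceText!);
    }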

    [Benchmark]
    public MatchCollection Cached()
    {
        return _regex!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Cached_Compiled()
    {
        return _regexCompiled!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated()
    {
        return GetGeneratedRegex().Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Cached()
    {
        return _generatedRegex!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Compiled()
    {
        return GetGeneratedRegexCompiled().Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Cached_Compiled()
    {
        return _generatedRegexCompiled!.Matches(_sourceText!);
    }
}


For our benchmarks, we’ll treat the static method on the Regex class as the baseline — just so we have something to anchor to when looking at the performance results.
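
One detail worth keeping in mind when reading benchmarks shaped like these: Microsoft’s documentation states that Regex.Matches populates the returned MatchCollection lazily, and that accessing members such as Count forces the collection to be populated. A benchmark method that only returns the collection may therefore measure little of the actual matching work. The sketch below is my own illustration of that behavior with synthetic input; it is not the corrected benchmark code, which lives in the follow-up article linked in the note near the top.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Synthetic input for illustration only: three matching words, repeated.
string text = string.Join(" ", Enumerable.Repeat("walking talked running", 50_000));
Regex regex = new(@"\b\w*(ing|ed)\b");

Stopwatch sw = Stopwatch.StartNew();
MatchCollection lazy = regex.Matches(text); // no matching has been forced yet
TimeSpan createOnly = sw.Elapsed;

sw.Restart();
int count = lazy.Count;                     // forces every match to be found
TimeSpan enumerate = sw.Elapsed;

Console.WriteLine($"Matches() call: {createOnly.TotalMilliseconds:F3} ms");
Console.WriteLine($"Count ({count}): {enumerate.TotalMilliseconds:F3} ms");
```

In a BenchmarkDotNet method, returning regex.Matches(text).Count instead of the collection itself would be one way to make sure the matching is part of what gets timed.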


C# Regex Performance Results

With the BenchmarkDotNet code out of the way, let’s see the results:

[Image: C# Regex Performance Results]


From the above, we can see that creating a new Regex instance every time you want to perform a match is 100x slower than the static method. This is incredible: you should NOT do this if performance is important to you! But it gets worse… If you do this AND provide the compiled flag, it’s almost 1000x as slow as the static method, or 10x worse than newing up an uncompiled instance every time. These are two things you should avoid doing.


We can see with the cached variations that follow that we can effectively invert the situation, giving us a slight boost over the static method. While these runtimes are very fast, it looks like, in these situations, it’s nearly 30% faster. But temper your expectations, as I’m not convinced this scales with different data sets and different patterns!


The source-generated C# regular expressions are also much faster than the static method, but seem roughly on par with the two prior benchmarks. While the source-generated regular expressions do cache, and calling the source-generated method should add no overhead, there are two benchmark variations that suggest it’s marginally faster to keep your own cache. These could simply be outliers, though, given how close the results are.
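
To see the caching behavior of the source-generated method for yourself, a small check like the one below can help. The Patterns class and IngOrEd method are hypothetical names for this sketch, and it assumes .NET 7 or later; the expectation that both calls return the same instance is based on how the generator’s emitted code exposes a single shared instance.

```csharp
using System;
using System.Text.RegularExpressions;

Regex first = Patterns.IngOrEd();
Regex second = Patterns.IngOrEd();

// Prints True when both calls hand back the same cached instance.
Console.WriteLine(ReferenceEquals(first, second));

// Hypothetical holder type; only the [GeneratedRegex] attribute comes from .NET.
public static partial class Patterns
{
    [GeneratedRegex(@"\b\w*(ing|ed)\b")]
    public static partial Regex IngOrEd();
}
```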


You can check out the video for a full walk-through on these C# regex benchmarks:


Wrapping Up C# Regex Performance

The takeaways for optimizing C# Regex performance: Stop declaring regular expressions in C# right before you go to use them! And even worse, stop declaring them with the compiled flag if you’re declaring them right before using them! These two things will crush your performance.
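
As a rough sketch of that advice, compare a helper that builds the Regex on every call with one that reuses a single instance. The helper names here are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Created once, then reused for every call to CountCached.
Regex cached = new(@"\b\w*(ing|ed)\b", RegexOptions.Compiled);

// Avoid: parses (and, with Compiled, compiles) the pattern on every call.
int CountSlow(string input) =>
    new Regex(@"\b\w*(ing|ed)\b", RegexOptions.Compiled).Matches(input).Count;

// Prefer: reuses the instance that was built once.
int CountCached(string input) => cached.Matches(input).Count;

Console.WriteLine(CountSlow("walking and talked"));
Console.WriteLine(CountCached("walking and talked"));
```

In a class, the cached instance would typically live in a private static readonly field so that every caller shares it.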


Otherwise, the static methods on the Regex class look like a pretty safe default across the board, but you get the most benefit out of compiling and caching your regex. And according to Microsoft, source-generated regular expressions can improve on that even further in many situations.


Also published here.