A colleague recently pointed me to a blog post: On the Futility of Email Regex Validation. For the sake of brevity, I will refer to it as Futility in this article.

I admit that while writing a regex that can successfully identify whether a string conforms to the RFC 5322 definition of an Internet Message header is an entertaining challenge, Futility is not a useful guide for the practical programmer. This is because it conflates RFC 5322 message headers with RFC 5321 address literals; which, in simple language, means that what constitutes a valid SMTP email address is different from what constitutes a valid message header in general. It is also because it incites the reader to become preoccupied with edge cases that are theoretically possible from a standards point of view, but which, as I will demonstrate, have an infinitesimal probability of occurring “in the wild.”

This article will expand upon both of these assertions, will discuss a few possible use cases for email regex, and will conclude with annotated “cookbook” examples of practical email regex.

RFC 5321 Supersedes 5322

The universality of SMTP for the transmission of email means that as a practical matter, no examination of email address formatting is complete without a close reading of the relevant IETF RFC, which is 5321. RFC 5322 considers email addresses as simply a generic message header with no special case rules applying to it. This means that comments enclosed in parentheses are valid, even in a domain name. The test suite referenced in Futility includes 10 tests that contain comments, or diacritical or Unicode characters, and indicates that 8 of them represent valid email addresses.

This is incorrect because RFC 5321 is explicit in stating that the domain name portions of email addresses “are restricted for SMTP purposes to consist of a sequence of letters, digits, and hyphens drawn from the ASCII character set.”

In the context of constructing a regular expression, it is hard to overstate the degree to which this constraint simplifies matters, especially as regards determining excessive string length. The annotation of the examples will highlight this below. It also implies some other practical considerations in the context of validation that we will explore further on.

Mailbox Names in the Wild

As per both RFCs, the technical name for the portion of the email address to the left of the “@” symbol is “mailbox”. Both RFCs allow considerable latitude in what characters are allowable in the mailbox portion. The only significant practical constraint is that quotes or parentheses must be balanced, something that is a real challenge to verify in vanilla regex.

However, real-world mailbox implementations are again the measure that the practical programmer should employ. As a rule, the people who pay us frown upon 90% of our billable hours being directed to resolving the 10% of theoretical edge cases that might not even exist in real life.

Let’s look at the dominant email mailbox providers, for both consumers and businesses, and consider what types of email addresses they permit.

For consumer email, I did some primary research, using a list of 5,280,739 email addresses that were leaked from Twitter accounts. Based on 115 million Twitter accounts, this gives us a 99% confidence level with a 0.055% margin of error for the entire population of Twitter, which would be very representative of the general population of all Internet email addresses.
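The margin of error quoted above can be checked with a short calculation. The following is only a sketch under stated assumptions: it treats the leaked list as a simple random sample, uses the worst-case proportion of 0.5, and applies a finite population correction. The function name `margin_of_error` is hypothetical and not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_size, population_size, confidence=0.99, proportion=0.5):
    """Margin of error for a sampled proportion, with finite population correction."""
    # Two-sided critical value of the normal distribution (about 2.576 for 99%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of a proportion; 0.5 is the worst (widest) case.
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # Correction applied because the sample is a sizeable part of the population.
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


# Figures taken from the text: 5,280,739 addresses out of 115 million accounts.
moe = margin_of_error(sample_size=5_280_739, population_size=115_000_000)
print(f"{moe:.3%}")  # prints 0.055%
```

Under these assumptions the result agrees with the 0.055% figure; a leak is not a true random sample, so the number is best read as indicative.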
Here is what I learned:

- 82% of addresses contained only ASCII alphanumeric characters,
- 15% contained only ASCII alphanumeric and dots (ASCII periods), for 97% of all addresses,
- 3% contained only ASCII alphanumeric, dots, and dashes, for a nominal 100% of email addresses.

However, this is a rounded 100%. For the trivia lovers out there, I also found:

- 38 addresses with underscores for 0.00072% of the total,
- 27 with plus signs for 0.00051%, and
- 1 address with Unicode characters representing 0.00002% of the total.

The net effect is that assuming email address mailboxes contain only ASCII alphanumeric, dots, and dashes will give you roughly 5 9’s accuracy for consumer emails.

For business emails, Datanyze reports that 6,771,269 companies use 91 different email hosting solutions. However, the Pareto distribution holds, and 95.19% of those mailboxes are hosted by just 10 service providers.

Gmail for Business (34.35% Market Share)
Google allows only ASCII letters, numbers, and dots when creating a mailbox. It will however accept the plus sign when receiving email.

Microsoft Exchange Online (33.60%)
Allows only ASCII letters, numbers, and dots.

GoDaddy Email Hosting (14.71%)
Uses Microsoft 365, and allows only ASCII letters, numbers, and dots.

7 Additional Providers (12.53%)
Not documented.

Unfortunately, we can only be certain of 82% of businesses and we do not know how many mailboxes that represents. However, we do know that of the Twitter email addresses, only 400 out of 173,467 domains had more than 100 individual email mailboxes represented. I believe that most of the remaining 99% of domains were business email addresses.
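The character-class breakdown reported above for the consumer list could be reproduced along the following lines. This is a minimal sketch, not the script used for the original research: the class names, the helper functions `classify_mailbox` and `tally`, and the sample addresses are all hypothetical.

```python
import re
from collections import Counter

# Classes are checked from most to least restrictive; the first match wins.
MAILBOX_CLASSES = [
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
    ("alphanumeric+dots", re.compile(r"[A-Za-z0-9.]+")),
    ("alphanumeric+dots+dashes", re.compile(r"[A-Za-z0-9.\-]+")),
]


def classify_mailbox(address):
    """Return the narrowest character class that covers the mailbox part."""
    # Everything to the left of the last @ is treated as the mailbox.
    mailbox = address.rpartition("@")[0]
    for name, pattern in MAILBOX_CLASSES:
        if pattern.fullmatch(mailbox):
            return name
    return "other"


def tally(addresses):
    """Count how many addresses fall into each mailbox class."""
    return Counter(classify_mailbox(a) for a in addresses)


# Hypothetical sample data, for illustration only.
sample = [
    "alice@example.com",
    "bob.smith@example.org",
    "carol-ann.j@example.net",
    "dave_99@example.com",
]
counts = tally(sample)
for name, count in counts.items():
    print(f"{name}: {count / len(sample):.0%}")
```

Anything that lands in "other" (underscores, plus signs, Unicode) corresponds to the trivia counts listed above.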
In terms of mailbox naming policies at the server or domain level, I propose that it is reasonable to take these 237,592 email addresses as representing a population of 1 billion business email addresses with a 99% confidence level and 0.25% margin of error, giving us close to 3 9’s when assuming that an email address mailbox contains only ASCII alphanumeric, dots, and dashes.

Use Cases

Again, with practicality foremost in our minds, let us consider under what circumstances we might need to programmatically identify a valid email address.

New Account Creation/User Signups

In this use case, a prospective new customer is trying to create an account. There are two high-level strategies we might consider. In the first, we attempt to verify that the email address that the new user provides is valid and proceed with account creation synchronously.

There are two reasons why you might not want to take this approach. The first is that although you might be able to validate that the email address has a valid form, it might nonetheless not exist. The other reason is that at any kind of scale, synchronous is a red flag word, which should cause the pragmatic programmer to consider the second strategy instead: a fire-and-forget model where a stateless web front end passes form information to a microservice or API, which asynchronously validates the email by sending a unique link that triggers the completion of the account creation process.

Contact Forms

In the case of a simple contact form, of the sort often used to download white papers, the potential downside to accepting strings that look like a valid email but are not is that you lower the quality of your marketing database by failing to validate that the email address really exists. So once again, the fire-and-forget model is a better option than the programmatic validation of the string entered in a form.

Parsing of Referrer Logs and Other Large Volumes of Data
This leads us to the real use case for programmatic email address identification in general, and regex in particular: anonymizing or mining large chunks of unstructured text.

I first came across this use case assisting a security researcher who needed to upload referrer logs to a fraud detection database. The referrer logs contained email addresses that needed to be anonymized before leaving the company’s walled garden.

These were files with hundreds of millions of lines, and there were hundreds of files a day. “Lines” could be close to a thousand characters long. Iterating through the characters in a line, applying complex tests (e.g., is this the first occurrence of @ in the line and is it part of a file name such as imagefile@2x.png?) using loops and standard string functions would have created a time complexity that was impossibly large. In fact, the in-house development team of this (very large) company had declared it an impossible task.

I wrote the following compiled regex:

    import re

    search_pattern = re.compile(r"[a-zA-Z0-9\!\#\$\%\'\*\+\-\^\_\`\{\|\}\~\.]+(@|\%40)(?!(\w+\.)*(jpg|png))(([\w\-]+\.)+([\w\-]+))")

And dropped it into the following Python list comprehension:

    results = [search_pattern.sub("redacted@example.com", line) for line in file]

I cannot remember just how fast it was, but it was fast. My friend could run it on a laptop and be done in minutes. It was accurate. We clocked it at 5 9’s looking at both false negatives and false positives.

My job was made somewhat easier by the fact that, as referrer logs, the files could only contain URL “legal” characters, so I was able to map out any collisions, which I documented in the repo readme.

Also, I could have made it even simpler (and faster) if I had performed the email address analysis first and learned with assurance that all that was needed to get to the 5 9’s target was ASCII alphanumeric, dots, and dashes.
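A minimal sketch of what that simpler variant might look like follows. It assumes that mailboxes and domain labels contain only ASCII alphanumerics, dots, and dashes, and that the delimiter appears either as @ or as its URL-encoded form %40. The names `SIMPLE_EMAIL` and `redact` are hypothetical; this is not the pattern that was actually deployed.

```python
import re

SIMPLE_EMAIL = re.compile(
    r"[A-Za-z0-9.\-]+"                # mailbox: ASCII alphanumerics, dots, dashes
    r"(?:@|%40)"                      # literal @ or its URL-encoded form
    r"(?!(?:\w+\.)*(?:jpg|png)\b)"    # skip file names such as imagefile@2x.png
    r"(?:[A-Za-z0-9\-]+\.)+"          # one or more dot-terminated domain labels
    r"[A-Za-z0-9\-]+"                 # final domain label
)


def redact(line, replacement="redacted@example.com"):
    """Replace every email-like substring in a line with a fixed placeholder."""
    return SIMPLE_EMAIL.sub(replacement, line)


# Hypothetical log lines, for illustration only.
lines = [
    "GET /?u=jane.doe%40example.com&x=1",
    "GET /img/imagefile@2x.png",
]
results = [redact(line) for line in lines]
```

The first line has its address replaced while the image file name in the second is left alone, which is the collision the negative lookahead exists to avoid.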
Nonetheless, this is a good example of practicality and of scoping the solution to fit the actual problem to be solved. One of the greatest quotes in all of programming lore and history is the great Ward Cunningham’s admonition to take a second to remember exactly what you are trying to accomplish, and then ask yourself “What is the simplest thing that could possibly work?”

In the use case of parsing out (and optionally transforming) an email address from a large amount of unstructured text, this solution was definitely the simplest thing I could think of.

Annotated Cookbook

Like I said at the beginning, I found the idea of building an RFC 5322 compliant regex amusing, so I will show you composable chunks of regex that deal with various aspects of the standard and explain how each chunk polices its rule. In the end, I will show you what it looks like all assembled.

The structure of an email address is:

- The mailbox
  - Legal characters
  - Single dots (double dots are not legal)
  - Folded White Space (RFC 5322 craziness)
  - (A complete regex solution would also include balanced parentheses and/or quotes, but I do not have that yet. And very possibly never will.)
- The delimiter (@)
- The domain name
  - Standard dns parsable domains
  - IPv4 address literals
  - IPv6 address literals
    - IPv6-full
    - IPv6-comp (for compressed)
      - 1st form (2+ 16-bit groups of zero in the middle)
      - 2nd form (2+ 16-bit groups of zero in the beginning)
      - 3rd form (2 16-bit groups of zero at the end)
      - 4th form (8 16-bit groups of zero)
    - IPv6v4-full
    - IPv6v4-comp (compressed)
      - 1st form
      - 2nd form
      - 3rd form
      - 4th form

Now for the regex.

Mailbox

    ^(?<mailbox>([a-zA-Z0-9\+\!\#\$\%\&\'\*\-\/\=\?\+\_\{\}\|\~]|(?<singleDot>(?<!\.)(?<!^)\.(?!\.))|(?<foldedWhiteSpace>\s?\&\#13\;\&\#10\;.)){1,64})

First, we have ^ which “anchors” the first character at the beginning of the string. This is to be used if validating a string that is supposed to contain nothing but a valid email. It makes sure that the first character is legal. If the use case is instead to find an email in a longer string, omit the anchor.

Next, we have (?<mailbox> which names the capture group for convenience. Inside the captured group are the three regex chunks separated by the alternate match symbol | which means that a character can match any one of the three expressions.

Part of writing good (performant and predictable) regex is to make sure that the three expressions are mutually exclusive. That is to say that a substring that matches one will definitely not match either of the other two. To do this we use specific character classes instead of the dreaded .* pattern.

Unconditionally Legal Characters

    [a-zA-Z0-9\+\!\#\$\%\&\'\*\-\/\=\?\+\_\{\}\|\~]

The first alternate match is a character class enclosed in square brackets, which captures all ASCII characters that are legal in an email mailbox except the dot, “folded white space”, the double quote, and the parenthesis. The reason why we excluded them is that they are only conditionally legal, which is to say that there are rules about how you can use them that have to be validated. We handle them in the next 2 alternate matches.

singleDot

    (?<singleDot>(?<!\.)(?<!^)\.(?!\.))

The first such rule concerns the dot (period). In a mailbox, the dot is only allowed as a separator between two strings of legal characters, so two consecutive dots are not legal. To prevent a match if there are two consecutive dots, we use the regex negative lookbehind (?<!\.) which specifies that the next character (a dot) will not match if there is a dot preceding it.

Regex lookarounds can be chained. There is another negative lookbehind, (?<!^), before we get to the dot, which enforces the rule that the dot cannot be the first character of the mailbox.

After the dot, there is a negative lookahead, (?!\.), which prevents a dot from being matched if it is immediately followed by a dot.

foldedWhiteSpace

    (?<foldedWhiteSpace>\s?\&\#13\;\&\#10\;.)
This is some RFC 5322 nonsense about allowing multi-line headers in messages. I am ready to bet that in the history of email addresses, there has never been anyone who seriously created an address with a multiline mailbox (they might have done it as a joke). But I am playing the 5322 game so here it is, the string of Unicode characters that creates the Folded White Space as an alternate match.

Balanced Double Quotes and Parentheses

Both RFCs allow for the use of double quotes as a way of enclosing (or escaping) characters that would normally be illegal. They also allow for enclosing comments in parentheses so that they will be humanly readable, but not considered by the mail transfer agent (MTA) when interpreting the address. In both cases, the characters are only legal if balanced. This means that there has to be a pair of characters, one that opens and one that closes.

I am tempted to write that I have discovered a demonstrationem mirabilem; however, this probably only works posthumously. The truth is that this is non-trivial in vanilla regex. I have an intuition that the recursive nature of “greedy” regex might be exploited to advantage; however, I am unlikely to devote the time necessary to attack this problem for the next few years, and so in the very best tradition, I leave it as an exercise for the reader.

Mailbox Length

    {1,64}

Something that actually does matter is the maximum length of a mailbox: 64 characters. So after we close the mailbox capture group with a final closing parenthesis, we use a quantifier between curly braces to specify that we must match any of our alternates at least one time and no more than 64 times.

atSign

    \s?(?<atSign>(?<!\-)(?<!\.)\@(?!\@))

The delimiter chunk starts off with the special case \s? because according to Futility, a space is legal just before the delimiter, and I am just taking their word for it. The rest of the capture group follows a similar pattern as singleDot; it will not match if preceded by a dot or a dash or if followed immediately by another @.

Domain Name

Here, as in the mailbox, we have 3 alternate matches. And the last of these has nested in it another 4 alternate matches.

Standard DNS Parsable

    (?<dns>[[:alnum:]]([[:alnum:]\-]{0,63}\.){1,24}[[:alnum:]\-]{1,63}[[:alnum:]])

This will not pass several of the tests in Futility, but as mentioned earlier, it complies strictly with RFC 5321 which has the final word.

IPv4

    (?<IPv4>\[((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\])

There is not too much to say about this. This is a well-known and easily available regex for IPv4 addresses.

IPv6

    (?<IPv6>(?<IPv6Full>(\[IPv6(\:[0-9a-fA-F]{1,4}){8}\]))|(?<IPv6Comp1>\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,3}(\:([0-9a-fA-F]{1,4})){1,5}?\])|\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,5}(\:([0-9a-fA-F]{1,4})){1,3}?\]))|(?<IPv6Comp2>(\[IPv6\:\:(\:[0-9a-fA-F]{1,4}){1,6}\]))|(?<IPv6Comp3>(\[IPv6\:([0-9a-fA-F]{1,4}\:){1,6}\:\]))|(?<IPv6Comp4>(\[IPv6\:\:\:)\])|(?<IPv6v4Full>(\[IPv6(\:[0-9a-fA-F]{1,4}){6}\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3})(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\])|(?<IPv6v4Comp1>\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,3}(\:([0-9a-fA-F]{1,4})){1,5}?(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\])|\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,5}(\:([0-9a-fA-F]{1,4})){1,3}?(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp2>(\[IPv6\:\:(\:[0-9a-fA-F]{1,4}){1,5}(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp3>(\[IPv6\:([0-9a-fA-F]{1,4}\:){1,5}\:(((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp4>(\[IPv6\:\:\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3})(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\]))

I was unable to find a good regular expression for IPv6 (and IPv6v4) addresses, so I wrote my own, carefully following the Backus/Naur notated rules from RFC 5321. I will not annotate every subgroup of the IPv6 regex, but I have named every subgroup to make it easy to pick apart and see what is going on. Nothing too interesting really, except maybe the way I combined greedy matching on the “left” side and non-greedy on the “right” in the IPv6Comp1 capture group.

The Full Monty

I have saved the final regex, along with the test data from Futility, and enhanced by some IPv6 test cases of my own, to Regex101. I hope you have enjoyed this article, and that it proves to be useful and a time saver for many of you.

AZW

On the Practicality of Regex for Email Address Processing
