The story of JavaScript is the story of accidental implementation details that become unfixable WTF moments for almost all new developers. All popular languages are like this. See for example the explanation of why C and all C-syntax languages have the wrong precedence on the “&” operator. But I think JS has more than its fair share.

https://regexper.com/#%2FWTA%3FF%2F

A famous example is what happened when the engineers on Microsoft’s JScript took a look at the typeof operator and determined that (typeof 1) == "number", (typeof new Object()) == "object", and (typeof undefined) == "undefined". So far, so good, but how about (typeof null) == "object"? Yup, that’s right, the absence of an object is an object. So they dutifully made sure the same was true in Internet Explorer’s JScript interpreter. You can’t blame them: they had no choice but to be compatible, and web developers would not have thanked them for deviating from the “standard” even at that early stage. When we built V8 we did the same.

If you want to geek out for a moment and look at why the type of null is “object”, take a look at Axel Rauschmayer’s excellent explanation.

JavaScript regexps have some similar strangeness. I’ve written three JS-compatible regexp engines now, starting in 2008 when I, Christian Plesner and Lasse Reichstein Holst Nielsen sat down to write the fastest possible JS regexp engine. It’s called Irregexp and it’s now used in Chrome, Opera, Node.js, Firefox and the Dart VM.

From https://xkcd.com/208/ (yes, I have the T-shirt too)

As is often the case with JS, there’s the official ECMAScript spec, and then there’s Annex B with what you have to do to be compatible. And then there’s the stuff that’s not in Annex B, that you also have to do to be compatible.

For example, according to the main spec, unknown alpha escapes are disallowed, so /bac\k/ should be an error, but in Annex B you find out that it just matches the same as /back/. This causes trouble when we want to improve JS regexps, e.g. with named captures and named back-references.
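The Annex B treatment of an unknown alphabetic escape can be observed directly. This is a minimal sketch, assuming an engine that implements Annex B (browsers and Node.js do); the variable names are mine and purely illustrative.

```javascript
// Without the "u" flag, Annex B turns the unknown escape \k into a plain "k",
// so this pattern behaves like /back/ (the pattern has no named groups).
const legacy = new RegExp("bac\\k"); // same pattern as the literal /bac\k/
const legacyMatches = legacy.test("back");

// With the "u" flag Annex B does not apply, so compiling the same
// pattern text is expected to throw a SyntaxError.
let unicodeError = null;
try {
  new RegExp("bac\\k", "u");
} catch (e) {
  unicodeError = e;
}

console.log("legacy /bac\\k/ matches 'back':", legacyMatches);
console.log("unicode mode rejects it:", unicodeError instanceof SyntaxError);
```

Using the RegExp constructor rather than a literal for the Unicode-mode case keeps the error catchable at runtime; a bad regexp literal would fail when the script is parsed.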
If you add support for named back references using the syntax /\k<name>/ then that could break web pages that already use that to mean the same as /k<name>/.

Similarly, you can’t put a quantifier on an assertion, so a lookahead followed by an asterisk would be an error: /(?=foo)*/. But in Annex B we find out that some assertions are allowed quantifiers. Annex B doesn’t apply to regexp Unicode mode, which is opt-in, so /(?=(foo))?/u is a syntax error, whereas without the /u for Unicode mode it would be an optional lookahead that tells you via the captured text whether or not it matched. Which might actually be useful! You should be able to rewrite it by wrapping the lookahead in a non-capturing group like this: /(?:(?=(foo)))?/u, but that fails in Chromium. I filed a bug to fix this ‘valuable use case’.

If you put numbers after a backslash, that’s when things get really strange. Without looking at Annex B, can you guess the rules behind this?

    /\1/     // Matches Unicode code point 1 aka Ctrl-A
    /()\1/   // Empty capture followed by a backreference to that capture
    /()\01/  // Empty capture followed by code point 1
    /\11/    // Match a tab character, which is code point 9!
    /\18/    // Match code point 1, followed by "8"
    /\176/   // Match a tilde, "~"
    /\400/   // Match a space followed by a zero

Did you manage to reverse engineer the rule from these examples? The rule is that the whole number is taken as a decimal backreference number, but if it has leading zeros or it is out of range (there are not enough capture parentheses) we abandon that interpretation, switch number base, and reinterpret it as up to 3 digits of octal escape up to 255 (\377), possibly followed by literal numbers. (I filed a Safari bug while writing this blog post, because that’s not quite what Safari does.)
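The rule above can be sketched as a small function. This is my own illustration of the rule as stated, not code from Irregexp or any other engine: the name interpretNumericEscape and the shape of its result are hypothetical, and the total number of capture groups is assumed to be known already.

```javascript
// digits: the run of decimal digits after the backslash, e.g. "176".
// captureCount: how many capture groups the whole pattern contains.
function interpretNumericEscape(digits, captureCount) {
  // First try: the whole number as a decimal backreference.
  const asDecimal = Number.parseInt(digits, 10);
  if (digits[0] !== "0" && asDecimal <= captureCount) {
    return { kind: "backreference", index: asDecimal, literalTail: "" };
  }

  // Fallback: up to three octal digits, as long as the value stays <= 0o377.
  let value = 0;
  let used = 0;
  while (used < 3 && used < digits.length && /[0-7]/.test(digits[used])) {
    const next = value * 8 + Number(digits[used]);
    if (next > 0o377) break;
    value = next;
    used++;
  }

  if (used === 0) {
    // Starts with 8 or 9: not covered by the examples above. Treating the
    // digits as literal text is an assumption made for this sketch.
    return { kind: "literal", literalTail: digits };
  }
  // Whatever was not consumed as octal is matched as literal digits.
  return { kind: "octal", codePoint: value, literalTail: digits.slice(used) };
}

// The examples from the list above, as [digits, captureCount] pairs.
const examples = [["1", 0], ["1", 1], ["01", 1], ["11", 0], ["18", 0], ["176", 0], ["400", 0]];
for (const [digits, captures] of examples) {
  console.log("\\" + digits, JSON.stringify(interpretNumericEscape(digits, captures)));
}
```

For instance, interpretNumericEscape("400", 0) stops after "40" because a third digit would push the value past \377, leaving "0" as the literal tail.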
Every time I implement a parser for this, I’m convinced I can parse it in one pass, and every time, I am wrong and have to do it with a two-pass algorithm (the first one just counts the captures).

Apologies to http://abstrusegoose.com/93

But where we get into serious WTF-land is the \cx syntax. This stands for control-X and means Unicode code point 24, since x is the 24th letter of the alphabet. So far so esoteric-but-somebody-probably-finds-it-useful. The strange thing is what happens if you don’t put a-to-z after the “c”, for example /\c:/.

According to the main JS spec this should throw a syntax error, but that’s not what it does.

It could also just match "c:", following the rules of Annex B and the example set by /\k/. That’s also not what it does.

It could match some random control character determined by the colon, which is what used to happen on Safari. That’s also not what it does.

/\c:/ actually matches a literal backslash, followed by "c:". This makes it the only place in the regexp parser where a single backslash is interpreted literally. You will find this behaviour in all modern browsers and there are tests to make sure it stays this way.

My latest (hobby) project is Grut, an ahead-of-time regexp-to-machine code compiler, written in Dart, that uses LLVM for all its heavy lifting. To keep testing simple, I’m making sure to be compatible with other regexp engines, primarily Irregexp, so I’m reimplementing these features, with full wart-for-wart compatibility. The strange backslash-c behaviour is part of that. Hooray for backwards compatibility!
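For anyone who wants to see the backslash-c behaviour for themselves, here is a short check. It is a sketch that assumes an Annex B engine such as a current browser or Node.js; the variable names are only for illustration.

```javascript
// \cx is control-X, i.e. code point 24 (0x18).
const controlX = /\cx/.test("\x18");

// \c followed by something that is not a letter: the backslash itself
// is matched literally, followed by "c:".
const weird = new RegExp("\\c:"); // same pattern as the literal /\c:/
const matchesBackslashC = weird.test("\\c:"); // subject is backslash, "c", ":"
const matchesPlainC = weird.test("c:");       // subject has no backslash

// In Unicode mode Annex B does not apply, so the pattern is rejected.
let unicodeModeThrows = false;
try {
  new RegExp("\\c:", "u");
} catch (e) {
  unicodeModeThrows = e instanceof SyntaxError;
}

console.log({ controlX, matchesBackslashC, matchesPlainC, unicodeModeThrows });
```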

The madness of parsing real world JavaScript regexps
