The madness of parsing real world JavaScript regexps

Written by erik_68861 | Published 2017/01/15
Tech Story Tags: javascript | programming | regex | dartlang | dart


The story of JavaScript is the story of accidental implementation details that become unfixable WTF moments for all new developers. All popular languages are like this. See for example the explanation of why C and almost all C-syntax languages have the wrong precedence on the “&” operator. But I think JS has more than its fair share.

https://regexper.com/#%2FWTA%3FF%2F

A famous example is what happened when the engineers on Microsoft’s JScript took a look at the typeof operator in Netscape’s JavaScript and determined that

(typeof 1) == "number", (typeof new Object()) == "object", and (typeof undefined) == "undefined"

So far, so good, but how about (typeof null) == "object"?

Yup, that’s right, the absence of an object is an object. So they dutifully made sure the same was true in Internet Explorer’s JScript interpreter. You can’t blame them — they had no choice but to be compatible and web developers would not have thanked them for deviating from the “standard” even at that early stage. When we built V8 we did the same.
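As a quick check of the four results above, here is a minimal sketch that runs in any current JS engine. The helper name isNonNullObject is made up for illustration and is not from any library.

```javascript
// Ask typeof about each of the four values discussed above.
const typeofResults = [1, new Object(), undefined, null].map((value) => typeof value);
console.log(typeofResults.join(", ")); // number, object, undefined, object

// Hypothetical helper: because typeof null is "object", a check for
// "a real object" needs an explicit null test as well.
function isNonNullObject(value) {
  return typeof value === "object" && value !== null;
}

console.log(isNonNullObject({}));   // true
console.log(isNonNullObject(null)); // false
```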

If you want to geek out for a moment and look at why the type of null is “object”, take a look at Axel Rauschmayer’s excellent explanation.

JavaScript regexps have some similar strangeness. I’ve written three JS-compatible regexp engines now, starting in 2008 when I, Christian Plesner and Lasse Reichstein Holst Nielsen sat down to write the fastest possible JS regexp engine. It’s called Irregexp and it’s now used in Chrome, Opera, node.js, Firefox and the Dart VM.

From https://xkcd.com/208/ — yes, I have the T-shirt too

As is often the case with JS, there’s the official ECMAScript spec, and then there’s Annex B with what you have to do to be compatible. And then there’s the stuff that’s not in Annex B, that you also have to do to be compatible.

For example, according to the main spec, unknown alphabetic escapes are disallowed, so /bac\k/ should be an error, but Annex B says it simply matches the same as /back/ . This causes trouble when we want to improve JS regexps, e.g. with named captures and named back-references. If you add support for back-references using the /\k<name>/ syntax, that could break web pages that already use it to mean the same as /k<name>/ .
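The \k behaviour can be checked directly. This sketch assumes an engine that applies Annex B and that supports named groups as standardised in ES2018, where \k only becomes a back-reference when the pattern contains a named group. The helper name compiles is made up for illustration.

```javascript
// Hypothetical helper: true if the pattern compiles, false on a SyntaxError.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Annex B: with no named groups in the pattern, \k is just an escaped "k".
console.log(new RegExp("bac\\k").test("back"));       // true
console.log(new RegExp("\\k<name>").test("k<name>")); // true

// Unicode mode follows the main spec and rejects the unknown escape.
console.log(compiles("bac\\k", "u"));                 // false

// With a named group present, \k<name> is a real back-reference.
console.log(new RegExp("(?<name>a)\\k<name>").test("aa")); // true
```

The old meaning survives only in patterns without any named group, which is how legacy pages keep working.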

Similarly, you can’t put a quantifier on an assertion, so a lookahead followed by an asterisk would be an error: /(?=foo)*/ . But in Annex B we find out that some assertions are allowed quantifiers. Annex B doesn’t apply to regexp Unicode mode, which is opt-in, so /(?=(foo))?/u is a syntax error, whereas without the /u for Unicode mode it would be an optional lookahead that tells you via the captured text whether or not it matched. Which might actually be useful! You should be able to rewrite it by wrapping the lookahead in a non-capturing group, like this: /(?:(?=(foo)))?/u , but that fails in Chromium. I filed a bug to fix this ‘valuable use case’.
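A small sketch of which of these three patterns an engine accepts. It only checks syntax and deliberately makes no claim about what ends up in the capture, since that differs between engines. The variable names are made up for illustration.

```javascript
// [pattern source, flags] pairs taken from the paragraph above.
const candidates = [
  ["(?=(foo))?", ""],      // Annex B: quantified lookahead is accepted
  ["(?=(foo))?", "u"],     // Unicode mode: syntax error
  ["(?:(?=(foo)))?", "u"], // lookahead wrapped in a non-capturing group
];

// Try to compile each candidate and record whether it was accepted.
const accepted = candidates.map(([source, flags]) => {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
});

console.log(accepted.join(", ")); // true, false, true
```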

If you put numbers after a backslash, that’s when things get really strange. Without looking at Annex B, can you guess the rules behind this?

/\1/     // Matches Unicode code point 1, aka Ctrl-A
/()\1/   // Empty capture followed by a backreference to that capture
/()\01/  // Empty capture followed by code point 1
/\11/    // Match a tab character, which is code point 9!
/\18/    // Match code point 1, followed by "8"
/\176/   // Match a tilde, "~"
/\400/   // Match a space followed by a zero

Did you manage to reverse engineer the rule from these examples?

The rule is that the whole number is taken as a decimal backreference number, but if it has leading zeros or it is out of range (there are not enough capture parentheses) we abandon that interpretation, switch number base, and reinterpret it as an octal escape of up to 3 digits with a value of at most 255 (\377), possibly followed by literal digits. (I filed a Safari bug while writing this blog post, because that’s not quite what Safari does.)
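The rule can be written down as a short function. This is a sketch and not the Irregexp code: the name interpretDigitEscape and the shape of its result are made up for illustration. The case where the first digit is 8 or 9 is not covered by the rule as stated, so it is treated here as plain literal digits, which is an assumption based on how Annex B handles other unknown escapes.

```javascript
// digits: the run of decimal digits that follows the backslash.
// captureCount: the number of capture groups in the whole pattern.
function interpretDigitEscape(digits, captureCount) {
  // First try: the whole number as a decimal backreference.
  const asDecimal = Number.parseInt(digits, 10);
  if (digits[0] !== "0" && asDecimal <= captureCount) {
    return { kind: "backreference", group: asDecimal, literalTail: "" };
  }

  // Fallback: up to 3 octal digits, as long as the value stays <= 0o377.
  let value = 0;
  let used = 0;
  while (used < 3 && used < digits.length && digits[used] >= "0" && digits[used] <= "7") {
    const next = value * 8 + Number(digits[used]);
    if (next > 0o377) break;
    value = next;
    used++;
  }

  // Assumption: a leading 8 or 9 leaves no octal digits, so all are literal.
  if (used === 0) {
    return { kind: "literal", literalTail: digits };
  }
  return { kind: "octal", codePoint: value, literalTail: digits.slice(used) };
}

// Replay the examples from the puzzle above.
const examples = [["1", 0], ["1", 1], ["01", 1], ["11", 0], ["18", 0], ["176", 0], ["400", 0]];
for (const [digits, captureCount] of examples) {
  console.log(`\\${digits} with ${captureCount} capture(s):`, interpretDigitEscape(digits, captureCount));
}
```

Note that the function needs the capture count of the whole pattern, including groups that appear after the escape.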

Every time I implement a parser for this, I’m convinced I can parse it in one pass, and every time, I am wrong and have to do it with a two-pass algorithm (the first one just counts the captures).
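The first pass only has to count capture groups. A minimal sketch of such a counter follows. The name countCaptures is made up for illustration, and the handling of named groups such as (?<name>...) is an addition that postdates this post.

```javascript
// Counts the capture groups in a regexp source string without parsing it fully.
function countCaptures(source) {
  let count = 0;
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      i++; // Skip the escaped character, so "\(" is not counted.
      continue;
    }
    if (inClass) {
      if (ch === "]") inClass = false; // Parentheses inside [...] are literal.
      continue;
    }
    if (ch === "[") {
      inClass = true;
      continue;
    }
    if (ch !== "(") continue;
    if (source[i + 1] !== "?") {
      count++; // A plain "(" opens a capture group.
      continue;
    }
    // "(?<name>" captures, but "(?<=" and "(?<!" are lookbehinds.
    if (source[i + 2] === "<" && source[i + 3] !== "=" && source[i + 3] !== "!") {
      count++;
    }
  }
  return count;
}

console.log(countCaptures("()\\1"));      // 1
console.log(countCaptures("(?:a)(b)"));   // 1
console.log(countCaptures("[(](a)\\("));  // 1
```

The number only becomes known at the end of the pattern, and a backslash followed by digits near the start may depend on it. That is why the single-pass attempt keeps failing.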

Apologies to http://abstrusegoose.com/93

But where we get into serious WTF-land is the \cx syntax. This stands for control-X and means Unicode code point 24, since x is the 24th letter of the alphabet. So far so esoteric-but-somebody-probably-finds-it-useful. The strange thing is what happens if you don’t put a-to-z after the “c”, for example /\c:/ . According to the main JS spec this should throw a syntax error, but that’s not what it does.

It could also just match "c:" , following the rules of Annex B and the example set by /\k/ . That’s also not what it does.

It could match some random control character determined by the colon, which is what used to happen on Safari. That’s also not what it does.

/\c:/ actually matches a literal backslash, followed by "c:". This makes it the only place in the regexp parser where a single backslash is interpreted literally. You will find this behaviour in all modern browsers and there are tests to make sure it stays this way.
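Both halves of the \c story can be checked in a few lines. This is a sketch that assumes an engine applying Annex B in non-Unicode mode. The variable names are made up for illustration.

```javascript
// \cx: the letter's character code modulo 32 gives the control code point.
const controlX = String.fromCharCode("x".charCodeAt(0) % 32); // U+0018, i.e. 24
console.log(controlX.charCodeAt(0));            // 24
console.log(new RegExp("\\cx").test(controlX)); // true

// \c followed by a non-letter: the backslash itself becomes a literal.
const strange = new RegExp("\\c:");
console.log(strange.test("\\c:")); // true: backslash, "c", ":"
console.log(strange.test("c:"));   // false: the backslash is required
```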

My latest (hobby) project is Grut, an ahead-of-time regexp-to-machine code compiler, written in Dart, that uses LLVM for all its heavy lifting. To keep testing simple, I’m making sure to be compatible with other regexp engines, primarily Irregexp, so I’m reimplementing these features, with full wart-for-wart compatibility. The strange backslash-c behaviour is part of that. Hooray for backwards compatibility!


Published by HackerNoon on 2017/01/15