
Consistently bad parsing of YAML

by Jonathan Stoikovitch, December 22nd, 2017

Parsers are not easy to get right. libyaml, the reference parser for YAML, does most things right. However, there’s one little thing that it does wrong, and since it does everything else sooo right, this little thing has been ignored. Even worse, other parser implementations have knowingly been doing it wrong because “that’s how libyaml does it”. There is hope, though; keep reading.

Found unexpected ‘:’

If you’re parsing YAML (and chances are you are, one way or another), you may have stumbled upon this error while parsing things like:

urls: [https://medium.com]

(i.e. flow sequences)

or:

location: {url: https://medium.com}

(i.e. flow mappings)

even though the YAML spec explicitly says this is valid:

Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps.
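To make the quoted rule concrete, here is a minimal Python sketch of it. The function name colon_is_value_indicator is hypothetical, and the check covers only the rule quoted above (a “:” counts as an indicator when it is followed by white space or ends the input). A real parser also has to deal with quoting, flow indicators and more, so treat this as an illustration rather than a description of how any particular parser works.

```python
def colon_is_value_indicator(text: str, index: int) -> bool:
    """Return True if the ':' at text[index] acts as a mapping value indicator.

    Simplified reading of the quoted rule: the colon is an indicator only
    when it is followed by white space (or ends the input). Hypothetical
    helper for illustration, not part of any YAML library.
    """
    if text[index] != ":":
        raise ValueError("text[index] is not a colon")
    following = text[index + 1 : index + 2]  # empty string at end of input
    return following == "" or following.isspace()


line = "url: https://medium.com"
first_colon = line.index(":")                    # the colon right after "url"
second_colon = line.index(":", first_colon + 1)  # the colon inside "https://"

print(colon_is_value_indicator(line, first_colon))   # True: followed by a space
print(colon_is_value_indicator(line, second_colon))  # False: followed by "/"
```

Under this reading, the colon inside https://medium.com is just part of a plain scalar, which is why the examples above should parse.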

What’s happening?

This is because your YAML parser for <insert_your_language_here> either relies on libyaml (loading it as a shared library and providing bindings to it) or used libyaml as its reference parser, in other words as the “mother of all YAML parsers”, and mirrored its behavior rather than strictly following the YAML spec. There is nothing extremely wrong about that; I am just relaying the facts.

The good news is that there is an easy fix: quoting a string that contains a colon in a flow context will do (e.g. 'https://medium.com').
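If you generate YAML yourself, the workaround can be applied mechanically. The sketch below is plain Python with no YAML library involved. The helper names quote_flow_scalar and to_flow_sequence are made up for this example, and it only handles the colon case discussed here; values with other special characters would need more care.

```python
from typing import Sequence


def quote_flow_scalar(value: str) -> str:
    """Single-quote a scalar if it contains a colon (hypothetical helper).

    Inside YAML single quotes, a literal single quote is written as two
    single quotes, so any existing quotes are doubled.
    """
    if ":" not in value:
        return value
    return "'" + value.replace("'", "''") + "'"


def to_flow_sequence(key: str, values: Sequence[str]) -> str:
    """Render `key: [a, b, ...]` with colon-containing items quoted."""
    items = ", ".join(quote_flow_scalar(v) for v in values)
    return f"{key}: [{items}]"


print(to_flow_sequence("urls", ["https://medium.com"]))
# prints: urls: ['https://medium.com']
```

The quoted form is the one that the parsers listed below should accept, whether or not they follow the spec on plain scalars.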

The bad news is that parsers across languages seem to handle this inconsistently:

  • Python pyyaml throws an error; a PR that fixes this has been merged but hasn’t been released yet
  • Ruby psych throws an error
  • Golang go-yaml throws an error; an issue has been submitted
  • Java snakeyaml throws an error; an issue has been submitted
  • JavaScript JS-YAML handles this properly

and:

  • libyaml throws an error, but the other good news is that there is a PR that addresses it

To sum up, the only parser that handles this properly as of now is the JavaScript one. The problem is that if your stack consists of JavaScript and any other language, and you’re parsing YAML across the board, the same document may parse in one place and fail in another, and that’s not great.