Consistently bad parsing of YAML

Written by jstoiko | Published 2017/12/22
Tech Story Tags: computer-science | software-engineering | parsing | yaml


Parsers are not easy to get right. libyaml, the reference parser for YAML, does most things right. However, there’s one little thing that it does wrong, and since it does everything else sooo right, this little thing has been ignored. Even worse, other parser implementations have knowingly been getting it wrong too, because “that’s how libyaml does it”. There is hope though, so keep reading.

Found unexpected ‘:’

If you’re parsing YAML (and chances are you are, one way or another), you may have stumbled upon this error while parsing things like:

urls: [https://medium.com]

(i.e. flow sequences)

or:

location: {url: https://medium.com}

(i.e. flow mappings)

even though the YAML specs explicitly say it’s valid:

Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps.
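To make the rule in that passage concrete, here is a minimal Python sketch that reports which colons in a line would count as mapping value indicators under the "followed by white space" criterion alone. The function name `value_indicator_positions` is made up for this illustration, and the check is a deliberate simplification: it ignores quoting, comments, and the extra rules a real parser applies inside flow collections.

```python
from typing import List


def value_indicator_positions(line: str) -> List[int]:
    """Return the indices of ':' characters followed by white space or end of line.

    Simplified reading of the quoted rule: any other ':' is treated as
    part of a plain scalar (for example the one in 'https://').
    """
    return [
        i
        for i, ch in enumerate(line)
        if ch == ":" and (i + 1 == len(line) or line[i + 1] in " \t")
    ]


print(value_indicator_positions("urls: [https://medium.com]"))
print(value_indicator_positions("location: {url: https://medium.com}"))
```

Under this reading, the colon in `https://` is never reported, because it is followed by `/` rather than white space.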

What’s happening?

This is because your YAML parser for <insert_your_language_here> either relies on libyaml (loading it as a shared library and providing bindings to it) or was written using libyaml as its reference parser, in other words as the “mother of all YAML parsers”, and mirrors its behavior rather than strictly following the YAML specs. There is nothing extremely wrong with that; I am only relaying the facts.

The good news is that there is an easy fix. Quoting any string that contains a colon in a flow context will do (e.g. 'https://medium.com').
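If you generate YAML yourself, that quoting step can be sketched in a few lines of Python. The helper names `quote_for_flow` and `flow_sequence` are hypothetical and not part of any YAML library, and the sketch only handles the colon case discussed here: other characters that need quoting in YAML are out of scope.

```python
from typing import Iterable


def quote_for_flow(value: str) -> str:
    """Single-quote a scalar that contains ':' so it is safe in a flow context.

    Inside a YAML single-quoted scalar, a literal single quote is
    escaped by doubling it.
    """
    if ":" not in value:
        return value
    return "'" + value.replace("'", "''") + "'"


def flow_sequence(items: Iterable[str]) -> str:
    """Render the items as a YAML flow sequence, quoting where needed."""
    return "[" + ", ".join(quote_for_flow(item) for item in items) + "]"


print("urls: " + flow_sequence(["https://medium.com"]))
```

The printed line is `urls: ['https://medium.com']`, which is the quoted form suggested above.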

The bad news is that it seems like parsers across languages are inconsistently handling this:

  • Python pyyaml throws an error; a PR that fixes this has been merged but hasn’t been released yet
  • Ruby psych throws an error
  • Golang go-yaml throws an error, issue submitted here
  • Java snakeyaml throws an error, issue submitted here
  • JavaScript JS-YAML handles this properly

and:

  • libyaml throws an error, but the other good news is that there is a PR that addresses it

To sum up, the only parser that handles this properly as of now is the JavaScript one. The problem is that if your stack consists of JavaScript and any other language, and you’re parsing YAML across the board, you may end up with inconsistent parsing behaviors, and that’s not great.
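One way to find out where the parsers in your own stack stand is to feed each of them the two snippets from this article and record whether they raise. The sketch below is a hypothetical helper, not part of any library: `probe` takes whatever load function your parser exposes, and `SAMPLES` simply repeats the two examples above.

```python
from typing import Any, Callable, Dict

# The two documents from this article that libyaml-style parsers reject.
SAMPLES = {
    "flow sequence": "urls: [https://medium.com]",
    "flow mapping": "location: {url: https://medium.com}",
}


def probe(load: Callable[[str], Any]) -> Dict[str, bool]:
    """Map each sample name to True if `load` accepted it, False if it raised."""
    results = {}
    for name, text in SAMPLES.items():
        try:
            load(text)
            results[name] = True
        except Exception:  # error types differ from one parser to another
            results[name] = False
    return results
```

With PyYAML installed, for instance, you could call `probe(yaml.safe_load)` after `import yaml`. The result will depend on the version you have, so treat it as a check to run rather than a known answer.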


Published by HackerNoon on 2017/12/22