paint-brush
Tree - AST Crushes JSON, XML, YAML, TOML, etcโ€‚by@ninjin
2,626 reads
2,626 reads

Tree - AST Crushes JSON, XML, YAML, TOML, etc

by JinFebruary 17th, 2022
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

An in-depth analysis of existing text data formats was carried out, after which a new format devoid of all drawbacks was developed and its various applications were demonstrated.

Companies Mentioned

Mention Thumbnail
Mention Thumbnail

Coins Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - Tree - AST Crushes JSON, XML, YAML, TOML, etc
Jin HackerNoon profile picture


Hello, my name is Dmitry Karlovsky and I.. ride bicycles.. off-road.. against the wind.. uphill.. on skis. And today I invite you to take a ride with me along and across textual data formats and design the ideal format together.


I already talked about it 5 years ago, which led to heated debates that resulted in minor syntax changes. Therefore, let me tell you from scratch what it is at the moment.

Meta

Speech
	Speaker \Dmitry Karlovsky
	Place \PiterJS #47
	Time 2020-05-20

This is an extended text version of the speech of the same name on PiterJS#47. You can read it as an article or open it in the presentation interface or .

Plan

  • Analyze popular text data formats ๐Ÿ’ฉ
  • From scratch, develop a new format without flaws ๐Ÿ‘ฝ
  • Show examples of applying the new format ๐Ÿ‘พ

Formats

We will compare 5 formats.

Format

XML

JSON

YAML

TOML

tree

Only the deaf have not heard about the first three. But the last two are dark horses for many. Well, nothing, today I will shed light on them.

XML example

XML - once the most popular format, you can say "technological standard". But despite all its power, it is now becoming obsolete, as it is too complicated for a modern web developer.

<!DOCTYPE svg
	PUBLIC "-//W3C//DTD SVG 1.1//EN"
	"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"
>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg version="1.1" xmlns="http://www.w3.org/2000/svg">
	<circle r="30" cx="50" cy="50" fill="orange" />
</svg>

JSON example

XML is being replaced by a simpler and more daring data format - JSON.

{
	"name": "example",
	"version": "1.0.0",
	"description": "example package",
	"main": "index.js",
	"repository": "https://example.org",
	"author": "anonymous",
	"license": "MIT"
}

If you think that this is the ideal, then I ask you to excuse me in advance, as I will upset you further.

YAML example

Someone is already prophesying YAML to replace JSON.

Date: 2001-11-23 15:03:17-5
User: ed
fatal:
  Unknown variable "bar"
Where:
  file: TopClass.py
  line: 23
  code: |
    x = MoreObject("345\n")

Due to its better human readability, it has already gained popularity in the field of manually writing configuration files.

TOML Example

Few have heard of TOML. However, take a look at the example and it will become clear why I mention it at all.

[servers]

[servers.alpha]
ip="10.0.0.1"
dc="eqdc10"

[servers.beta]
ip="10.0.0.2"
dc="eqdc10"

Yes, it's actually a standardized INI config bitten by JSON. As a result, he absorbed the worst of both worlds.

Example Tree

Finally, as a spoiler, let me show you the minimal non-empty tree file that we will develop next.

spoiler

Data models

Different formats are based on different data models. The chosen model answers the following two questions.

  • What data can we write and read without a tambourine? ๐Ÿฅ
  • How to record data that does not fit into the model? ๐Ÿ‘ 

No single format is able to support the entire variety of types of subject areas, so the need inevitably arises for packing data into a certain format and then unpacking it back.

XML Model

XML is based on a typed element model that contains one dictionary of attributes and one list of nested typed nodes.

  • NodeList
  • Element Node (<br/>)
  • Attribute Node (tabindex="1")
  • Text Node(Hello, World!)
  • CDATA Node (<![CDATA[ ... ]]>)
  • Processing Instruction Node (<? ... ?>)
  • Comment Node (<!-- ... -->)
  • Document Node
  • Document Type Node (<!DOCTYPE html>)

Disadvantages of the XML Model

This model is quite flexible, but it has a number of limitations: only strings can be attribute values, and there can be only one nested list of nodes. Despite the fact that the XML format is already not the simplest, a banal dictionary with subtrees as values โ€‹โ€‹requires additional agreements. For example, this: some elements are used to describe the keys in the parent element and such elements in the parent should be only in one instance.

<panel>
	<head>Are you sure?</head>
	<body>
		<button>Yes</button>
		<button>No</button>
	</body>
</panel>

Here panel is a component, and body is no longer a component, but a parameter. It would have a place in the attributes, but only the strings can be placed in the attributes and nothing more.

XML model extensibility

Thanks to namespaces, many languages โ€‹โ€‹can be mixed up within one XML document without breaking the interpretation of each other.

<xsl:stylesheet
	version="1.0"
	xmlns="http://www.w3.org/1999/xhtml"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<xsl:template match="/">
		<html>
			<head>
				<link rel="stylesheet" href="web.css" />
			</head>
			<body>
				<xsl:apply-templates select="*" />
			</body>
		</html>
	</xsl:template>

</xsl:stylesheet>

This is a very powerful technique that is lacking in younger formats.

JSON Model

The JSON model is based on the fact that the entire tree consists of untyped lists and dictionaries. Plus a limited set of primitives as tree leaves.

  • Null
  • Boolean
  • Number
  • String
  • Array
  • Dictionary

Disadvantages of the JSON model

It would be naive to believe that two types of structural nodes are enough for everything. For example, let's take a dictionary. The keys in it are not ordered, that is, they can be returned by the parser in any order.

{
	"foo": 777
	"bar": 666
}

What if we need a dictionary with ordered keys?

[
	[ "foo" , 777 ],
	[ "bar" , 666 ]
]

We had to radically change the syntax and stick arrays of arrays. But this is just another type of dictionary.

Non-extensible JSON model

Well, the main drawback of the JSON model is its non-extensibility, which is why you have to introduce a bunch of tricky rules to cram all the variety of application types of their relations. Take, for example, a query to MongoDB, whose authors decided that JSON is a great fit for the role of a query language.

{
	"$or": [
		{
			"sex": "female",
			"age": { "$gt": 16 },
		},
		{
			hobby: {
				"$regex": "\\b(?:java|type)script\\b"
			}
		}
	]
}

We see that the paired logical operations OR and AND have a completely different syntax. The equality predicate is sorely lacking, because we still need the predicates "greater than", "less than" and even "matches the regular expression". And by the way, regular expressions themselves cannot be represented in JSON except as a string and an agreement that if it is in the dictionary for a key named "$regexp", then this is a serialized regular expression and when parsing, you need to create the corresponding object.

YAML Model

The YAML model is similar in many ways to the JSON model. Unless there is support for time and internal links.

  • !!null
  • !!bool
  • !!int
  • !!float
  • !!str
  • !!timestamp
  • !!seq
  • !!map
  • Anchor & Alias
  • Document
  • TypeTags

YAML model extensibility

The main advantage of YAML is in type annotations, which allow you to explain to the processor which algorithm to use to unpack the data.

--- !!omap
- foo:777
- bar: 666

In this example, we are telling the parser to "take this list of key-value pairs" and convert it to an OrderedMap object (an ordered dictionary).

TOML Model

The TOML model is like JSON, but a little more mundane. For example, integers and real numbers are distinguished here, which is important for compiled languages, and there is also time support.

  • Boolean
  • Integer
  • Float
  • String
  • datetime
  • Array
  • Dictionary

With extensibility, everything is just as bad here as in JSON.

Model Tree

Whatever set of basic types we choose, it will not be enough for everything. This means that some packing and unpacking code will inevitably be required. And it is easiest to work with such code when the number of different types of nodes is minimal, since for each type you need to write a separate branch of logic. At the same time, maximum flexibility is required. Therefore, only two types of nodes will suffice for us.

  • Structure Node
  • Data Node

Structural nodes serve to describe the hierarchy, while data nodes store raw binary data. Any node can store a list of any other nodes, which achieves flexibility unattainable in other formats.

Model extensibility

Total, in terms of extensibility, everything is very bad. Popular formats are either extensible, but incredibly overcomplicated, or simple, but not extensible at all.


XML

json

YAML

TOML

tree

Extendability

โœ…

โŒ

โœ…

โŒ

โœ…

Number of patterns

90

30

210

90

10

Pay attention to YAML. Its grammar has two hundred patterns. It is so complex that you will most likely not find any complete and correct implementation of its parser. Why, even two identically working JSON parsers you still need to search, but there would seem to be 30 patterns in total.

Our goal will be to create an extremely simple, unambiguous, but at the same time maximally extensible format.

Readability

Syntax clarity is important in a variety of scenarios for working with the format: when writing, when reviewing code, when resolving conflicts, when debugging, when studying.

The speed of your work and the predictability of its results directly depends on how the format is serialized. However, some formats have serious problems with this.


XML

json

YAML

TOML

tree

Readability

โŒ

โŒ

โœ…

โœ…

โœ…

XML readability

XML is built around text with tags interspersed with additional information. As long as there is not a lot of this information, everything is fine, but the more it is, the more difficult it is to perceive the text, which eliminates the usefulness of this feature.

Hello Alice!
How are you?
Could you bring me coffee now?

<message>
	<greeting>
		Hi <a href="http://example.org/user/alice">Alice</a>!
	</greeting>
	<body>
		<s>How are you?</s><br/>
		Could you bring me
		<time datetime="1979-10-14T12:00:00.001-04:00">now</time>
		coffee?
	</body>
</message>

JSON readability

XML at least supports multiline text, but JSON, for example, can no longer boast of this. Formats of this type come from an information structure, in which text and not only text values โ€‹โ€‹are already interspersed.

{ "greetings": "Hi Alice!\nHow are you?\nCould you bring me some coffee?\n" }

Severity

As a rule, there are no problems with understanding what is written. But YAML excelled here.


XML

json

YAML

TOML

tree

Unambiguous syntax

โœ…

โœ…

โŒ

โœ…

โœ…

YAML lax

a: true # boolean
b: tru # string
c: :-) # string
d: (-: # error

There are quite a few such jokes in YAML.

Escaping

A topic close to readability is escaping. The presence of this in one way or another inevitably leads to a decrease in readability. When designing escaping, the following points should be kept in mind.

  • It is necessary to distinguish format constructs from actual data ๐Ÿ˜ต
  • It is advisable not to lose data in visibility ๐Ÿค“
  • It is advisable not to overcomplicate editing ๐Ÿคฌ

Escaping in XML

XML is a wonderful example of how not to do escaping.

foo > 0 && foo < 10

From a simple and visual text, some kind of cryptotext is obtained, which has to be mentally interpreted in order to understand what is written here.

<code>foo &gt; 0 &amp;&amp; foo &lt; 10</code>

Escaping in JSON

There is a similar problem with JSON, albeit to a lesser extent. If you have ever written plugins for VSCode syntax highlighting, then you know that grammars are described there in JSON format, where regular expressions are written.

/"[\s\S]*"/

The regulars themselves are not the most visual things, but escaped ones are even worse. It is very easy to make a mistake in them in such conditions, and it is not very easy to debug them.

"\"[\\s\\S]*\""

Escaping in YAML

In YAML, the escaping problem is generally solved, but at what cost.

  • 5 types of strings ๐Ÿ˜ฃ
  • 4 whitespace handling modifiers ๐Ÿ˜ฅ

And all this you need to know in order to correctly read any YAML file.

Escaping in Tree

No ๐Ÿคช

The most readable escaping is no escaping. Therefore, we will not have it. You might think that I'm crazy, but a little later I'll show you how to achieve this.

Minification

Many formats support different ways of formatting the same data. But it's always a trade-off between size and readability.

  • Readable formatting weighs a lot ๐Ÿ˜
  • Compact formatting is hard to read ๐Ÿ’€

XML minification

<users>
	<user>
		<name>Alice</name>
		<age>20</age>
	</user>
</users>

If you minify XML, you can save several tens of percent in size, but the result is even more difficult to read.

<!-- 13% less -->
<users><user><name>Alice</name><age>20</age></user></users>

JSON minification

{
	"users": [
		{
			"name": "Alice",
			age: 20
		}
	]
}

With JSON, the savings are slightly greater, but readability suffers more - instead of closing tags, we see a string of square and curly brackets.

// 30% less
{"users":[{"name":"Alice","age":20}]}

Tree minification

No ๐Ÿ˜ฒ

Our path is uncompromising - the format must be both extremely compact and easily perceived by a person.

Statistics on minification


XML

json

YAML

TOML

tree

Readable

195%

140%

125%

110%

100%

Minified

170%

101%

-

-

-

Download sample files.

As you can see, it is possible to make a format that in a readable form weighs less than any other, even if they are minified. The whole secret is that readability is achieved by the structure of the format itself, and does not require additional formatting that bloats the volume.

Holy Wars

A common problem when working with different formats is endless arguments about seemingly trifles.

  • Tabs or spaces? ๐Ÿคผโ€โ™‚๏ธ
  • 2 or 4 spaces? ๐Ÿคผโ€โ™€๏ธ
  • Do you need a carriage return? โšก
  • Do we do alignment? ๐Ÿคบ
  • linter/format rules? ๐Ÿ”ฅ
  • when saving/committing/pushing? ๐Ÿšง

These arguments take time and emotions, but they are completely meaningless. It is better if the format has uniform, clearly defined rules that are equally understood by any tool and person. Therefore, our format will be extremely rigid, without any liberties.

Processing speed

Simplicity, rigidity and lack of escaping potentially gives a much higher possible processing speed.

For example, in JSON, to write an arbitrary string, you need to go through each character and output a backslash to the output buffer before certain ones. That is, we cannot even know in advance how much memory we can allocate for the output buffer. And during parsing, you need to do the reverse operation with the formation of a new line. We cannot reuse the original piece of memory.

serialization: foo\bar => "foo\\bar"

parsing: "foo\\bar" => foo\bar

When we do not have escaping, we can simply take chunks of memory and send them to the output stream during serialization, which is very fast. Conversely, when parsing, we can simply refer to pieces of the original buffer and not make extra memory allocations.

In my knee - length benchmark in the D language, the following results were obtained:

Tree: 299 ms
JSON: 421 ms

For comparison, I used the naive implementation of the tree parser and json parser from the standard library.

Error coordinates

During parsing, information about the original location of the nodes obtained from the format is often lost. For example, we received JSON, started processing it, and somewhere in the depths we suddenly realized that in the database we do not have the user specified in the file. At this moment, we must show an error, but in the text of this error, we cannot indicate in which place of which file it was made. This is because this information is lost during parsing. And this is a very common problem.


XML

json

YAML

TOML

tree

Address

โœ…

โŒ

โŒ

โŒ

โœ…

Position

โŒ

โŒ

โŒ

โŒ

โœ…

Range

โŒ

โŒ

โŒ

โŒ

โœ…

In XML nodes there is a link to the resource from which it was obtained, but where it is in this resource - look with your eyes. To solve this problem, there are special parsers that give the output not arrays and dictionaries, but an Abstract Syntax Tree. But working with him is no longer so easy, and even slowly this business.

Well, this information is important, and I suggest not to lose it. Never lose. Saving node coordinates will still come in handy when it comes to AST and sourcemaps.

Stream processing

It happens that there is a lot of data and little memory, but you need to work with data quickly. And it happens that the data does not end at all. For example, you need to continuously process logs as they come in. In these cases, the ability to stream data processing saves.


XML

json

YAML

TOML

tree

Streaming

โŒ

โŒ

โœ…

โœ…

โœ…

As you can see, the most common formats do not have streaming support. They require you to have exactly one complete document root, otherwise it's a parsing error. In the case of constantly arriving data such as logs, for example, adding them to a document while maintaining its correctness is not an easy task.

This does not mean that stream processing cannot be fastened to them. For example, for XML, there are lower-level SAX parsers that allow you to work not with a tree of elements, but with a stream of tags: such and such a tag opened, a string arrived, such and such a tag closed. And for JSON, there is a whole bunch of message streaming protocols. The main problem here is that not every format-supporting tool will be able to digest your data without additional gestures.

Formats that support stream processing can be easily supplemented by appending data to the end. You can glue several data streams into one and, conversely, cut into pieces. Can be processed in parts without waiting for the transfer to complete. And all this without losing the correctness of working with the format.

Tree format

Well, summarizing what was said earlier, let's formulate all the requirements for our new format.

  • Easy syntax โœŒ
  • No escaping ๐Ÿค˜
  • No liberties ๐Ÿค™
  • No minification ๐Ÿ‘
  • Minimum size ๐Ÿ‘
  • Guaranteed readability ๐Ÿ––
  • Stream processing ๐Ÿ’ช
  • Exact coordinates of nodes โ˜

Just a tree node

So, we need to create a node named "house". What is the minimum code for this?

house

We just write this name and that's it.

List of tree nodes

And if we need not one node, but a whole list?

house
roof
wall
door
window
floor

We just write them on separate lines.

Nesting tree nodes

But what if we want to add hierarchies and put the list of nodes inside the first one?

house
	roof
	wall
	door
	window
	floor

We just write nested nodes with a tab as an indent. Those familiar with the Python language may notice a similar approach here - using good code formatting style as the basis of the syntax, rather than an optional feature.

Deep tree hierarchy

By continuing to add padding, we can create hierarchies of any nesting.

house
	roof
	wall
		door
		window
			glass
	floor

Alone at home

Often there are situations when there is only one nested node, and then it will be somehow wasteful to increase the indentation level for all nested nodes because of it.

street
	house
		wall
			door
			window

Therefore, we simply line up such nodes in one line, separating them with spaces.

street house wall
	window
	door

Indented nodes are already nested in the last node on the previous line.

Raw data

When we need to write arbitrary data, the characters in which should not be processed in any special way, we simply write them after the backslash without any escaping.

\Any data \(^_^)/

The backslash is chosen to be associated with escaping. It sort of escapes the entire text to the end of the line. But, to be precise, it is rather not escaping, but a kind of quotation marks. The backslash is the opening mark, and the newline character is the trailing mark.

Multiline data

But how to write all the same multi-line text containing, among other things, newlines? It's simple: we take a data node and put a list of other data nodes in it.

\
	\Here ๐Ÿฑโ€๐Ÿ’ป
	\	many ๐Ÿฑโ€๐Ÿ‘“
	\		cats ๐Ÿฑโ€๐Ÿ‘ค

When requesting the string content of the root data node, all nested data nodes will be concatenated via a newline character.

Different types of nodes

Finally, we can use both types of nodes mixed in any combination. For example, let's describe some user.

user
	name \Jin
	age \35
	hobby
		\kendo ๐Ÿฑโ€๐Ÿ‘ค
		\dance ๐Ÿ•บ๐Ÿฝ
		\roleplay ๐ŸŽญ
			default

As you can see, everything is quite simple. To create the most advanced data format, we needed only 2 types of nodes and 4 special characters.

Languages โ€‹โ€‹based on formats

So far, we have only talked about formats, that is, about serialization methods. On their basis, languages โ€‹โ€‹are already being designed that add semantics to abstract format nodes.

Format

Languages

XML

XHTML, SVG, XSLT, ...

json

JSON Schema, json:api, ...

YAML

yaml.org/type

TOML

-

tree

xml.tree, json.tree, view.tree, ...

Any language is some subset of the format data model with restrictions on the possible types of nodes, their relative position and content.

Next, I will show some examples of such languages โ€‹โ€‹for the tree format.

Language grammar.tree

Language grammar.tree - designed to describe formal grammars. For example, let's write a complete formal grammar for the tree format itself.

tree .is .optional .list_of line

line .is .sequence
	.optional indent
	.optional nodes
	new_line

nodes .is .sequence
	.optional .list_of struct
	.optional data
	.with_delimiter space

struct .is .list_of .byte
	.except special
data .is .sequence
	data_prefix
	.optional .list_of .byte
		.except new_line

special .is .any_of
	new_line
	data_prefix
	indent
	space

new_line .is .byte \0A
indent .is .list_of .byte \09
data_prefix .is .byte \5C
space .is .list_of .byte \20

As you can see, the grammar of the format is really extremely simple, which allows you to write a parser in any language in just an hour without even resorting to parser generators.

This grammar can be read literally: tree is an optional list of lines, and a line is a sequence of an optional indentation, an optional list of nodes, and a mandatory newline character. Well, and so on.

Language grammar.tree vs EBNF

Comparing grammar.tree with Extended Backus Naur Form you can see that the former is somewhat verbose but clear and concise, while the latter is compact , but for understanding it requires preliminary preparation, the expressive possibilities are still somewhat inferior, and its focus on a single-line representation looks somewhat awkward when using multi-line writing.

tree .is .optional .list_of line

line .is .sequence
	.optional indent
	.optional nodes
	new_line

nodes .is .sequence
	.optional .list_of struct
	.optional data
	.with_delimiter space
tree = {line};

line=[indent],
	[ nodes ],
	new_line;

nodes = data |
	structure,
	{ space , struct },
	[ space , data ];

Language xml.tree vs XML

The xml.tree language is a way to represent an XML data model in tree format. Any kind of XML can be generated from it. Conversely, any XML can be converted to xml.tree.

! doctype html
html
	meta @ charset \utf-8
	link
		@ href \web.css
		@ rel \stylesheet
	script @ src \web.js
	body
		h1 \Procter & Gamble
<!doctype html>
<html>

	<meta charset="utf-8" />
	<link href="web.css" rel="stylesheet" />
	<script src="web.js"></script>

	<body>
		<h1>Procter & Gamble</div>
	</body>

</html>

It would be nice to have such integration in the IDE that when opening any XML, you can see and edit its xml.tree representation, but everything would be saved back to XML. This would eliminate the need to break your eyes over ampersands and make working with XML as easy and simple as, for example, with markdown.

Language json.tree vs JSON

And json.tree is a language for describing the json model.

* user *
	name \Jin
	age 35
	hobby /
		\kendo ๐Ÿฑโ€๐Ÿ‘ค
		\dance ๐Ÿ•บ๐Ÿฝ
	home \C:\users\jin\
{
	"user": {
		"name": "Jin",
		age: 35
		"hobby": [
			"kendo ๐Ÿฑโ€๐Ÿ‘ค",
			"dance ๐Ÿ•บ๐Ÿฝ",
		],
		"home": "C:\\users\\jin\\"
	}
}

We needed only 2 special characters - an asterisk to denote dictionaries and a slash to denote arrays.

json.tree extensions

The beauty of languages โ€‹โ€‹based on formats like XML and Tree is that they are easy to extend while staying within the format. For example, both json and tree as formats fundamentally do not support comments. But, for example, comments are necessary in configs. How to be?

*
	# \If disabled will be used platform specific delimiters
	# \CRLN on windows and LN on others
	unix_delimiters true

In tree, we easily extended the language to suit our needs by adding a special node type for comments.

{
	"unix_delimiters#1": "If disabled will be used platform specific delimiters",
	"unix_delimiters#2": "CRLN on windows and LN on others",
	"unix_delimiters": true,
}

In JSON, the limitations of the model are affected, because of which you have to write crutches.

Language view.tree vs TypeScript

Language view.tree - used for component composition in framework $mol developed by me.

$my_details $mol_view
	sub /
		<= Pager $mol_paginator
			value?val <=> page?val 0

This describes a component that owns another component and their properties are bidirectionally related to each other. You may notice that inside view.tree the json.tree language is also used to describe arrays, dictionaries, numbers and other JSON types.

From such a simple and concise code, a rather sprawling TypeScript class is generated. You can write it with your hands, but it's a chore and without a hierarchy it's not very clear.

class $my_details extends $mol_view {

	sub() { return [ this.Pager() ] }

	@ $mol_mem Pager() {
		const Pager = new $mol_paginator
		Pager.value = val => this.page( val )
		return pager
	}

	@ $mol_mem page( val = 0 ) {
		return value
	}

}

API

Finally, there are various APIs for interacting with the format from different programming languages.

Format

Languages โ€‹โ€‹

API

XML

XHTML, SVG, XSLT, ...

DOM, SAX, AST

json

JSON Schema, json:api, ...

Native, AST

YAML

yaml.org/type

Native, AST

TOML

-

Native, AST

tree

xml.tree, json.tree, ...

AST

For XML, for example, there is a fairly flexible DOM, and there is a low-level SAX. The formats that replaced it mainly return dictionaries, arrays, and so on native to the language. True, the JSON data model is not well represented in compiled languages, where integers and floats are completely different types. And of course, for all languages โ€‹โ€‹there is a representation in the form of an Abstract Syntax Tree. True, it is usually slow and inconvenient. We will make it fast and convenient, which will allow us not to fence the zoo of incompatible APIs.

JSON AST

Let's take a simple JSON file and put it in ASTExplorer.

{
  "user": {}
}
{
	"type" : "object",
	"children" : [
		{
			"type" : "Property",
			"key" : {
				"type": "Identifier",
				"value": "user"
			}
			"value": {
				"type": "object",
				"children": []
			}
		}
	]
}

As you can see, the AST turned out to be large and complex. JSON is generally very poorly suited to describe AST. It is not very easy to work with it without special utilities.

##AST Tree

Now let's take a slightly more complex tree file.

user
	name \Jin
	age 35
	hobby
		\kendo ๐Ÿฑโ€๐Ÿ‘ค
		\dance ๐Ÿ•บ๐Ÿฝ
		\roleplay ๐ŸŽญ

And look at his AST.

user
	name \Jin
	age 35
	hobby
		\kendo ๐Ÿฑโ€๐Ÿ‘ค
		\dance ๐Ÿ•บ๐Ÿฝ
		\roleplay ๐ŸŽญ

So, something is wrong. It's the same code. Ah, no, that's right, tree is its own AST.

Tree node properties

In the TypeScript implementation, each node has roughly the following interface.

interface $mol_tree2 {
	type: string
	value: string
	kids: $mol_tree2[]
	span: $mol_span
}

Span is a reference to a series of bytes in the original resource.

interface $mol_span {
	uri: string
	row: number
	col: number
	length: number
}

Derived Tree nodes

Each node has methods for creating new nodes based on it. These factories, when creating new nodes, push the span from the original node into them. This allows even after dozens of transformations to understand how it all began.

interface $mol_tree2 {
	struct: ( type , kids )=> $mol_tree2
	data: ( value , kids )=> $mol_tree2
	list: ( kids )=> $mol_tree2
	clone: ( kids )=> $mol_tree2
}

Error messages in Tree

For example, let's take the config, find the password in it, and if it doesn't work, we'll throw an exception, where it will be written in which place of which file the wrong password is written.

const config_path = './config.tree'
const config_text = fs.readFileSync( config_path )
const config = $mol_tree2.fromString( config_text , config_path )
// server auth
// 	login \root
// 	password \qwerty

const password = config.select( 'server' , 'auth' , 'password' , '' )

if( !auth( password.text() ) ) {
	// AuthError: Wrong password
	// \default
	// ./config.tree#5:3-11
	throw password.error( 'Wrong password' , AuthError )
}

Processing Tree

Or another example - we decided that "auth" is an unfortunate name and we need to replace it with "credentials". Therefore, we write a simple script for automatic refactoring:

// server credentials
// 	login \root
// 	password \qwerty
const new_config = config.list(
	config.hack({
	
		'auth' : ( auth , context )=> [
			auth.struct( 'credentials' , auth.hack( context ) ),
		] ,
		
	})
)
fs.writeFileSync( config_path , new_config )

And in this way you can easily refactor any languages โ€‹โ€‹based on the tree format without searching for a separate parser for each language and dealing with how it works with AST.

Support by editors

If you are using an editor for which there is no plugin yet, then this is a good opportunity to implement it. This will be easier to do than for any other language.

Language support

Again, I encourage those who are interested to implement support in their favorite language and try to put it to good use.

Results


XML

JSON

YAML

TOML

Tree

Size

195%

140%

125%

110%

100%

Number of patterns

90

30

210

90

10

Unambiguous syntax

โœ…

โœ…

โŒ

โœ…

โœ…

Readability

โŒ

โŒ

โœ…

โœ…

โœ…

No escaping needed

โŒ

โŒ

โŒ

โŒ

โœ…

Exact coordinates of nodes

โŒ

โŒ

โŒ

โŒ

โœ…

Streaming

โŒ

โŒ

โœ…

โœ…

โœ…

Extensible data model

โœ…

โŒ

โœ…

โŒ

โœ…

Widespread

โœ…

โœ…

โœ…

โŒ

โŒ

Ideas

And now let's dream up what else interesting things can be done using the tree format.

  • Requests to the DBMS
  • Domain description
  • Logging
  • Communication of console utilities
  • LISP-like language
  • Universal AST

sql.tree - queries to the DBMS

Remember those clumsy MongoDB queries? Let's try to write our SQL:

select
	from $users
	fetch
		@name
		@phone
		@photo *
			@uri
			@width
			@height
	where or
		and
			@sex = female
			@age > 16
		@hobby ~ \\b(?:java|type)script\b

Parsing the query in this form is a breeze, unlike real SQL. Please note that there is a uniform syntax for logical operations and predicates "is equal to", "greater than" and even "matches the regular expression". By the way, the regular expression can also be described in the tree format, which will make it much more supported.

select
	from $users
	fetch *
	where @hobby ~
		word-edge
		or
			\java
			\type
		\script
		word-edge

domain.tree - description of the domain

Since we are talking about databases. This is how I describe the domain model.

hyoo_api_person
	descr \Live service user
	inherit hyoo_api_entity
	field
		id
			descr \Unique human readable identifier
			example \person=jin
			key unique
			type text
			edit author
		avatar
			descr \Links to avatars
			type list hyoo_api_image
			edit author
		mail
			descr \Attached emails
			type set hyoo_api_mail

From such a formal description, a server API, ACL rules, a DBMS scheme and an admin panel are automatically generated to manage the whole thing.

Logs

A common practice is to output single-line messages to the logs. As long as they fit in the width of your terminal - everything is fine, but this is a rather rare situation. Much more often, messages still do not fit and begin to be transferred, turning the flow of messages into a real mess, which is difficult to read with your eyes, and even programmatically process them - pain and suffering.

log.tree - structured logs

But what if the logs are immediately displayed in a two-dimensional form, at the same time easily readable by both machines and humans?

193.34.12.132 - - [2011-10-20T12:46:08+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.md HTTP/1.1 200 4435
193.34.12.132 - - [2011-10-20T12:46:09+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.html HTTP/1.1 404 4435


access
	ip \193.34.12.132
	time \2011-10-20T12:46:08+04:00
	method \GET
	uri \/nin-jin/slides/edit/master/tree/readme.md
	protocol \HTTP/1.1
	response \200
	size \4435

The lower code is clearer. Isn't it?

tree-tools - CLI tree processing utilities

You can write utilities that would allow you to simply and efficiently process such logs. For example, we will read the log, filter by the value of one of the fields, select from the messages only fields that are interesting to us and display them as a sign.

> cat access.log.tree | pick ip time method uri | table

\193.34.12.132 2011-10-20T12:46:08+04:00 GET /index.html
\193.34.12.132 2011-10-20T12:46:10+04:00 GET /index.css
\193.34.12.132 2011-10-20T12:46:20+04:00 GET /index.js

> cat access.log.tree | filter time >= 2019-09 | pick ip uri | table

\193.34.12.132 /index.html
\193.34.12.132 /index.css
\193.34.12.132 /index.js

I have a prototype of such utility which I sometimes use to view live dev server logs. It will be great if someone undertakes to implement a complete set of tools. And when there are tools, then software developers will be motivated to write logs not randomly, but in a structured way.

tree as a communication protocol

You can go further and not just write logs in tree format, but in principle promote the idea that the output of any program should be structured. Many utilities have flags for outputting a response in the form of JSON or XML, but reading such an output is stressful for a person - you have to reopen the output in visual representation tools in order to understand what is returned there and how to approach it. Just imagine a world where the output can be read and immediately somehow transformed without picking mana in search of the desired combination of keys for the next program.

> gitlog

commit
	message \$mol_style: [email protected] compatibility
	sha \b1a8f07c839604d0d34430a186246f0c1f71e628
	date \2020-05-15T23:24:32+0300
	author \nin-jin <[email protected]>
commit
	message \$mol_regexp: concurrent parse ability
	sha \be1abfa50542728dd5c156517ea31f469e7fb4d4
	date \2020-05-15T23:03:30+0300
	author \nin-jin <[email protected]>

> git log | pick date message | table

\2020-05-15T23:24:32+0300 $mol_style: [email protected] compatibility
\2020-05-15T23:03:30+0300 $mol_regexp: concurrent parse ability

WAT

WebAssembly is a forward-thinking assembler that gets as close to the machine as possible without sacrificing portability. It has a text representation format based on Lisp s-expressions.

(func $fact (param $x i64) (result i64)
    (if $x (result i64)
      (i64.eqz
        (local.get $x))
      (then
        (i64.const 1))
      (else
        (i64.mul
          (local.get $x)
          (call $fact      
            (i64.sub
              (local.get $x)
              (i64.const 1)))))))

It is difficult to perceive it no matter how you format it. Unfortunately, this is the kind of code you will see when disassembling in browser devtools.

wasm.tree - assembler without tinsel

I'm currently working on a bytecode compiler for a more descriptive wasm.tree description.

func
	$fact
	param $x i64
	result i64
	body switch
		test i64.eqz local.get $x
		then i64.const 1
		else i64.mul
			local.get $x
			call $fact i64.sub
				local.get $x
				64.const 1

From this assembler, a list of bytecodes in the bin.tree language is generated, which is already distilled into a binary by an elementary function.

00
61
73
6d
01
00
00
00
.
.
.

When there is something more or less complete, I will try to push this syntax as WAT2.0. Who cares about the fate of WebAssembly - join the development.

jack.tree - LISP without brackets

In fact, writing in raw assembler is too verbose. Therefore, the next step is the implementation of a meta-language that allows you to extend the language by means of the same language itself. The core of such a language should turn out to be extremely minimalistic, and all idioms will be connected to it as third-party libraries written in the same language.

jack
	import wasm
	tree func $fact
		> $x #8
		< #8 switches
			test is-zero $x
			then #8 1
			else mul
				$x
				$fact sub
					$x
					#8 1

Roughly speaking, a program in this language iteratively modifies its own AST in such a way that the output is a wasm binary. It may sound intimidating, but thanks to the fact that tree saves the coordinates of the sources, it is not difficult to trace the source of the error. In the repository, you can look at a scanty prototype.

$mol_jack

Abolishing LLVM

You can go even further and generate not wasm bytecodes, but downright bytecodes of the target processor, simply by adding one more transformer to the pipeline.

compile pipelines:

                jack.tree => wasm.tree =============> bin.tree
                jack.tree => wasm.tree => arm.tree => bin.tree
any-dsl.tree => jack.tree => wasm.tree => arm.tree => bin.tree

At the same time, at any level, you can run additional transformers that can optimize the code using the information available at the corresponding levels of abstraction.

optimization middlewares:

jack.tree => jack.tree
wasm.tree => wasm.tree
arm.tree => arm.tree

At the same time, let me remind you that we do not lose touch with the original sources, which will allow us to display adequate messages. And any intermediate AST can always be dumped into text in a very visual form of the tree format.

Again, join the development, it can turn out to be a cool thing to replace LLVM.

One AST to rule them all

And finally, we come to the main idea of โ€‹โ€‹this report. Tree is a perfect candidate for a universal AST binder. Just look at how long the TypeScript code goes from source to the resulting bundle when building on a typical project.

code =(P)=> loader =(P)=> compiler =(SP)=> bundler =(SP)=> terser =(S)=> bundle

P - Parse
S - Serialize

And each tool re-parses your sources into its own AST, processes it, serializes it, and passes it on. If we agree on a single AST format, then we can significantly simplify the implementation of utilities and reduce the overhead for code processing.

code =(P)=> loader =====> compiler ======> bundler ======> terser =(S)=> bundle

Even if some of the utilities will run in separate processes (which means intermediate serialization is inevitable), the tree format will allow you to transfer the AST as quickly as possible, due to the minimum overhead for parsing and serialization.

Sandbox

tree.hyoo.ru - a sandbox where you can drive various transformations. Here are some examples:

  • view.tree โ‡’ view.ts - translation of the component description into TypeScript code.
  • view.tree โ‡’ locale.json - export of reference texts for localization in the form of JSON from the component description.
  • view.tree โ‡’ view.dts - export TypeScript types with embedded sorsmaps from component descriptions.
  • JSON โ‡’ json.tree - translation of JSON into json.tree.
  • xml.tree โ‡’ XML - translation of xml.tree into XML
  • XML โ‡’ xml.tree - translation of XML into xml.tree.
  • js.tree โ‡’ JS - translation of JavaScript AST into JavaScript proper.
  • wasm.tree โ‡’ WASM - compilation of WASM AST into a WASM binary and checking its correctness. This thing is still very raw: only 3 types of sections are supported, you can't run it right there in the sandbox. But as soon as there is time, I will finish the specification.
  • jack.tree โ‡’ JS eval is a translation of a meta-language with JavaScript generation with built-in sorsmaps and immediately its execution.
  • MarkedText โ‡’ JS - translation of MarkedText into JavaScript code with embedded sorsmaps, which generates a DOM tree using the DOM API.
  • grammar.tree check - grammar correctness check.tree syntax descriptions on the fly.
  • span.tree imprint/reuse - stitching of sources and mapping in span.tree tree, its intermediate serialization into a string, followed by restoration of the original tree without loss of mapping.
  • automate.tree (JS) is an example of writing your own transformation in JavaScript that converts a simple automation script into JavaScript code with built-in sorsmaps.
  • automate.tree (jack) is the same, but using the jack.tree language.

Where to go, where to go

I hope I managed to infect you with ideas about a brighter future. But in order to bring it closer, we need to work on it together. I'm afraid I won't be able to handle all of this. So write, call and do not disappear.


Also Published Here