Parsing your own language with ANTLR4

Written by fwouts | Published 2017/04/21
Tech Story Tags: antlr4 | programming | parsing | typescript

Have you ever wanted to write your own programming language? Let’s say that you have, because that’s my excuse to show you how ANTLR4 works.

We’ll take the example of a super-simple functional language where you can call methods with strings:

print(concat("Hello ", "World"))

We’ll call our language “C3PO”. It sounds like a good name.

First things first. How do you define the structure of a language?

Introducing ANTLR4 grammars

ANTLR4 is, you guessed it, the fourth version of ANTLR. ANTLR stands for ANother Tool for Language Recognition. Because why not.

ANTLR allows you to define the “grammar” of your language. Just like in English, a grammar lets you explain what structure is allowed (and what isn’t). Unlike English, however, an ANTLR grammar is formal: its rules are precise enough for a computer to apply them mechanically. Let me show you what it looks like!

[Embedded code: the C3PO grammar] https://medium.com/media/234ccc063f74f5b574bc3616a5a41c69/href

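Each of these blocks (methodCall, methodCallArguments, expression, NAME, STRING) is called a rule. For now, don’t worry too much about the difference between lowercase and uppercase rules: in short, uppercase rules describe tokens (the words of the language), and lowercase rules describe how those tokens combine.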
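To make the rule names concrete, here is a minimal sketch in TypeScript of a hand-written parser that follows the same five rules. The grammar in the leading comment is a reconstruction based on the rule names and the example program, so treat it as an assumption rather than the original listing. parseC3PO and the Expression type are hypothetical names. ANTLR generates far more capable code than this; the point is only to show how each rule maps to a piece of parsing logic.

```typescript
// Assumed grammar (reconstructed for illustration, may differ from the original):
//   methodCall          : NAME '(' methodCallArguments ')' ;
//   methodCallArguments : (expression (',' expression)*)? ;
//   expression          : methodCall | STRING ;
//   NAME                : [a-zA-Z]+ ;
//   STRING              : '"' ~["]* '"' ;
//   (whitespace between tokens is skipped)

type Expression =
  | { kind: "methodCall"; name: string; args: Expression[] }
  | { kind: "string"; value: string };

function parseC3PO(source: string): Expression {
  let pos = 0;

  function skipWhitespace(): void {
    while (pos < source.length && /\s/.test(source.charAt(pos))) pos++;
  }

  function expect(ch: string): void {
    skipWhitespace();
    if (source.charAt(pos) !== ch) {
      throw new Error(`Expected '${ch}' at position ${pos}`);
    }
    pos++;
  }

  // NAME : one or more letters
  function parseName(): string {
    skipWhitespace();
    const start = pos;
    while (pos < source.length && /[a-zA-Z]/.test(source.charAt(pos))) pos++;
    if (pos === start) throw new Error(`Expected a name at position ${pos}`);
    return source.slice(start, pos);
  }

  // STRING : everything between two double quotes
  function parseString(): Expression {
    expect('"');
    const start = pos;
    while (pos < source.length && source.charAt(pos) !== '"') pos++;
    if (pos >= source.length) throw new Error("Unterminated string");
    const value = source.slice(start, pos);
    pos++; // consume the closing quote
    return { kind: "string", value };
  }

  // expression : methodCall | STRING
  function parseExpression(): Expression {
    skipWhitespace();
    return source.charAt(pos) === '"' ? parseString() : parseMethodCall();
  }

  // methodCall : NAME '(' methodCallArguments ')'
  function parseMethodCall(): Expression {
    const name = parseName();
    expect("(");
    const args: Expression[] = [];
    skipWhitespace();
    if (source.charAt(pos) !== ")") {
      args.push(parseExpression());
      skipWhitespace();
      while (source.charAt(pos) === ",") {
        pos++; // consume the comma
        args.push(parseExpression());
        skipWhitespace();
      }
    }
    expect(")");
    return { kind: "methodCall", name, args };
  }

  const result = parseMethodCall();
  skipWhitespace();
  if (pos !== source.length) {
    throw new Error(`Unexpected input at position ${pos}`);
  }
  return result;
}

// Usage: parse the example program and print the resulting tree.
console.log(JSON.stringify(parseC3PO('print(concat("Hello ", "World"))'), null, 2));
```

Notice that each function mirrors one rule, and that rules can refer to each other recursively (an expression can be a methodCall, whose arguments are expressions). That recursion is what lets a handful of rules describe arbitrarily nested code.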

Of course, our programming language is missing key features such as support for numbers. Let’s not worry about that; you can add it yourself later.

Testing the grammar

Coming back to our example, let’s see how our code matches the grammar we’ve defined above.

First, set up ANTLR4 following the official instructions. Then run the following commands:

[Embedded code: commands to test the grammar] https://medium.com/media/da82886ce0106e3c2cc64560d16e92aa/href

Our code parsed successfully!

Now, just for fun, let’s try some incorrect code:

[Embedded code: parsing incorrect code] https://medium.com/media/66924cca5370549a2c6cb70e03fe04aa/href

Why did this fail? Because we didn’t define any grammar rules for numbers or for the addition sign, so 1 + 2 is illegal in our language.
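The failure can be traced to the very first stage of parsing, where the input is split into tokens. Below is a rough TypeScript sketch of that stage, assuming token patterns in the spirit of the uppercase rules (letters for NAME, double-quoted text for STRING, plus the punctuation a method call needs). tokenize and TOKEN_RULES are hypothetical names, and ANTLR's own error messages are worded differently.

```typescript
// Illustrative token rules; the patterns are assumptions, not the original grammar.
type Token = { type: "NAME" | "STRING" | "PUNCT"; text: string };

const TOKEN_RULES: Array<[Token["type"], RegExp]> = [
  ["NAME", /^[a-zA-Z]+/],
  ["STRING", /^"[^"]*"/],
  ["PUNCT", /^[(),]/],
];

function tokenize(source: string): { tokens: Token[]; errors: string[] } {
  const tokens: Token[] = [];
  const errors: string[] = [];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);

    // Whitespace separates tokens but is not a token itself.
    const whitespace = /^\s+/.exec(rest);
    if (whitespace) {
      pos += whitespace[0].length;
      continue;
    }

    // Try each rule in order and keep the first one that matches.
    let matched = false;
    for (const [type, pattern] of TOKEN_RULES) {
      const match = pattern.exec(rest);
      if (match) {
        tokens.push({ type, text: match[0] });
        pos += match[0].length;
        matched = true;
        break;
      }
    }

    // No rule covers this character: record an error and move on.
    if (!matched) {
      errors.push(`position ${pos}: no rule matches '${source.charAt(pos)}'`);
      pos++;
    }
  }
  return { tokens, errors };
}

// Usage: digits and '+' are not covered by any rule, so each is reported.
const outcome = tokenize("print(1 + 2)");
console.log(outcome.tokens.map((token) => token.text));
console.log(outcome.errors);
```

With these rules, print, ( and ) are recognized, while 1, + and 2 each produce an error. Supporting numbers would mean adding a rule for them, which is exactly the extension suggested earlier.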

How do I use this from code?

You probably don’t want to run a shell command whenever you need to parse code. Ideally, you want an API to access each node in the parsed tree.

It turns out ANTLR4 lets you generate parser code in a variety of languages: Java, C#, Python, Go, C++, Swift, JavaScript and even TypeScript!

In TypeScript, for example, here is what it could look like (after setting up antlr4ts):

[Embedded code: TypeScript example using antlr4ts] https://medium.com/media/a754db49ee21e6686c87a2b589ed8db8/href
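To give a feel for what “accessing each node in the parsed tree” means, here is a dependency-free sketch of a tree walk with enter and exit callbacks, similar in spirit to the listener pattern that ANTLR supports. The TreeNode shape, walk, and collectMethodNames are hypothetical and are not the antlr4ts API; the tree is built by hand for the example program, with punctuation left out for brevity.

```typescript
// Hypothetical tree shape for illustration only; the classes that ANTLR
// generates are richer and are named after your grammar rules.
interface TreeNode {
  rule: string; // rule name, e.g. "methodCall" or "STRING"
  text?: string; // source text, set on leaf (token) nodes only
  children?: TreeNode[];
}

interface TreeListener {
  enter?: (node: TreeNode, depth: number) => void;
  exit?: (node: TreeNode, depth: number) => void;
}

// Depth-first walk that calls the listener on the way in and on the way out.
function walk(node: TreeNode, listener: TreeListener, depth: number = 0): void {
  if (listener.enter) listener.enter(node, depth);
  for (const child of node.children || []) {
    walk(child, listener, depth + 1);
  }
  if (listener.exit) listener.exit(node, depth);
}

// Hand-built tree for: print(concat("Hello ", "World"))
const sampleTree: TreeNode = {
  rule: "methodCall",
  children: [
    { rule: "NAME", text: "print" },
    {
      rule: "methodCallArguments",
      children: [
        {
          rule: "expression",
          children: [
            {
              rule: "methodCall",
              children: [
                { rule: "NAME", text: "concat" },
                {
                  rule: "methodCallArguments",
                  children: [
                    { rule: "expression", children: [{ rule: "STRING", text: '"Hello "' }] },
                    { rule: "expression", children: [{ rule: "STRING", text: '"World"' }] },
                  ],
                },
              ],
            },
          ],
        },
      ],
    },
  ],
};

// Collect every method name by reacting to NAME nodes as they are entered.
function collectMethodNames(root: TreeNode): string[] {
  const names: string[] = [];
  walk(root, {
    enter: (node) => {
      if (node.rule === "NAME" && node.text !== undefined) names.push(node.text);
    },
  });
  return names;
}

console.log(collectMethodNames(sampleTree)); // the two names, outermost first: print, concat
```

Keeping the traversal (walk) separate from the logic (the listener) means the same tree can serve several purposes, such as listing method names, checking argument counts, or evaluating the program, without changing the walking code.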

A sample project is available at https://github.com/fwouts/sample-antlr4-typescript if you’d like to see this in action.

Now that you’ve seen how easy it is to parse your own language, you might wonder: what about existing programming languages? Can I parse them too? The answer is YES. In fact, the grammar you need is probably already defined in the grammars-v4 repo.

Did you enjoy this article? Please make sure to click the recommend button or send me your feedback: f “at” codonut.com

Have a great day!


Published by HackerNoon on 2017/04/21