917 reads

Pet Projects are Fun

by Matan GolanFebruary 4th, 2020

Too Long; Didn't Read

Go is a Go library that uses Go’s grammar system to create parsers. The Go library uses a different approach on creating parsers with Go. Parsing is all about taking a stream of tokens and building a parse tree from it (iff the syntax matches the grammar) Go uses go.mod files to manage your Go project dependencies. Go.mod is written in pure Go and parsers are written in Go's Go grammar system. Go's grammar system uses Go's “struct tags” to define the grammar of the language.

Companies Mentioned

I love pet projects, they are great excuse to use libraries and other technologies that you can’t use at work. Lately I’ve been working on a bigger pet project that needs to parse Go files, I’ve used ANTLR before to make this kind of things but unfortunately, ANTLR’s Go target has poor performance. So I began to search for alternatives written in pure Go and came across with this one, which took a different approach on creating parsers with Go, but before we’re going to understand how this library is different from others, let’s cover some basic concepts about parsing.

Sorry, Parsing?

For the sake of “formality” here is what Wikipedia has to say:

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).

Now, let’s try to break it down by using an example, if you use Go your’e probably familiar with go.mod files; but if you aren’t using it, go.mod files are used to manage your project dependencies. Here is a simple go.mod file:

module github.com/matang28/go-latest

go 1.13

require (
  github.com/some/dependency1 v1.2.3
  github.com/some/dependency2 v5.1.0 // indirect
)

replace github.com/some/dependency1 => github.com/some/dependency1@dev latest

before we can parse this file, we need two things:

A stream of symbols that will be obtained from the file itself.
A grammar definition that will be used by the parser to validate the syntax.

The first one is formally called a Lexer (or lexing). Lexers are all about turning strings into a list of predefined symbols that our parser uses, any symbol (or token) used by our language should be identified by the Lexer. In our case we need to capture the following tokens:

Keywords:
```
module
```
,
```
go
```
,
```
require
```
,
```
replace
```
.
Strings such as:
```
github.com/some/dependency1
```
,
```
1.13
```
.
Version string such as:
```
v1.2.3
```
,
```
v1.4.21-haijkjasd9ijasd
```
.
Other misc symbols such as:
```
=> , // , ( , )
```
, whitespaces, line-terminators and tabs.

The second one will give these tokens a syntactic meaning, you can see that

go.mod

files follows some basic rules:

You have to declare your own module name with the
```
module
```
directive.
You may declare you go version using the
```
go
```
directive.
You may list your dependencies with the
```
require
```
directive.
You may replace certain dependencies with the
```
replace
```
directive.
You may exclude certain dependencies with the
```
exclude
```
directive.

These rules are often called the grammar of the language (the go.mod language) and there is a more formal ways to represent them, one of them is the EBNF standard:


grammar GoMod;

// A source file should be structured in the following way:
sourceFile 		: moduleDecl stmt* EOF;

stmt 			: (versionDecl | reqDecl | repDecl | excDecl);

// The module directive:
moduleDecl 		: 'module' STRING;

// The version directive:
versionDecl 	: 'go' STRING;

// The require directive, it can be either a single line:
// require github.com/some/dep1 OR multi line:
// require (
//   github.com/some/dep1
//   github.com/some/dep2
// )
reqDecl 		: 'require' dependency
        		| 'require (' dependency* ')'
        		;

// The replace directive:
repDecl 		: 'replace' STRING '=>' dependency;

// The exclude directive:
excDecl 		: 'exclude' dependency;

dependency 		: STRING version comment?;

version 		: VERSION | 'latest';

comment 		: '//' STRING;

VERSION 		: [v]STRING+;
STRING 			: [a-zA-Z0-9_+.@/-]+;
WS  			  : [ \n\r\t\u000C]+ -> skip;

Now back to the definition, a Parser takes a stream of tokens (e.g the tokenization of our text file) and will try to match each token against the grammar, if our file doesn’t follow the grammar we will get a parse error but, if our file does follow the grammar we know for sure that the input file is valid (in term of syntax) and we can get the parse tree which we can traverse throughout the matched grammar rules.

Parsing is all about making this magic happen… taking a stream of tokens and building a parse tree from it (iff the syntax matches the grammar).

Whats so different?

At the beginning of the article I’ve said that one library was the spark to create this pet project, why is that?

Well parse trees aren’t the most comfortable data structure to extract data from, “walking” on a parse tree can be pretty confusing. Most of the tools available today uses code generation techniques to generate lexers and parsers from a grammar file and, in order to be as generic as possible most of them give you a listener interface which will get notified when the parser enters or exits each grammar rule.

Participle uses different approach, one that’s more common for Go developers. It uses Go’s structs system to represent the grammar (or more precisely the parts that we want to capture from the grammar), it also utilizes “struct tags” to define the grammar rules.

Wow! that was a mouthful, let’s learn by doing. At first we need some structs that represents a standard `go.mod` file:

// The root level object that represents a go.mod file
type GoModFile struct {
	Module     string
	Statements []Statement
}

type Statement struct {
	GoVersion    *string
	Requirements []Dependency
	Replacements []Replacement
	Excludes     []Dependency
}

// A struct that represents a go.mod dependency
type Dependency struct {
	ModuleName string
	Version    string
	Comment    *string
}

// A struct that represents a replace directive
type Replacement struct {
	FromModule string
	ToModule   Dependency
}

In order to write a parser, we need a stream of tokens. Participle got us covered by giving us a regex-based lexer (their docs says that they’ve been working on an EBNF-style one that could be really useful since this is the most cumbersome part of Participle IMO). Let’s see that in action:

import (
	"github.com/alecthomas/participle/lexer"
)

// The lexer uses named regex groups to tokenize the input:
var iniLexer = lexer.Must(lexer.Regexp(
	/* We want to ignore these characters, so no name needed */
	`([\s\n\r\t]+)` +
	/* Parentheses [(,)]*/
	`|(?P<Parentheses>[\(\)])` + 	
	/* Arrow [=>]*/
	`|(?P<Arrow>(=>))` +
	/* Version [v STRING] */
	`|(?P<Version>[v][a-zA-Z0-9_\+\.@\-\/]+)` + 
	/* String [a-z,A-Z,0-9,_,+,.,@,-,/]+  */
	`|(?P<String>[a-zA-Z0-9_\+\.@\-\/]+)`,
))

Now we can move on to define our grammar, think of it as a combination between the EBNF file and our Go structs representation. Here are some of the basic syntax operators Participle has to offer:

Use quotes to match a symbol for example:
```
"module"
```
will try to match the symbol
```
module
```
.
Use
```
@
```
to capture expression (i.e named group defined in our lexer), for example
```
@String
```
will match the
```
String
```
symbol and extract it into the struct field.
Use
```
@@
```
to let the underlaying struct to match the input, this is useful when you have multiple alternatives for a rule.
Use the common regex operators to capture groups:
```
*
```
to match zero or more,
```
+
```
to match at least one,
```
?
```
to match zero or one, etc...


// The root level object that represents a go.mod file
type GoModFile struct {
	Module     string      `"module" @String`
	Statements []Statement `@@*`
}

type Statement struct {
	GoVersion    *string       `( "go" @String )`
	Requirements []Dependency  `| (("require" "(" @@* ")") | ("require" @@))`
	Replacements []Replacement `| (("replace" "(" @@* ")") | ("replace" @@))`
	Excludes     []Dependency  `| (("exclude" "(" @@* ")") | ("exclude" @@))`
}

// A struct that represents a go.mod dependency
type Dependency struct {
	ModuleName string  `@String`
	Version    string  `(@Version | @"latest")`
	Comment    *string `("//" @String)?`
}

// A struct that represents a replace directive
type Replacement struct {
	FromModule string     `@String "=>"`
	ToModule   Dependency `@@`
}

All we have to do now is to build the parser from our grammar:

func Parse(source string) (*GoModFile, error) {
	p, err := participle.Build(&GoModFile{},
		participle.Lexer(iniLexer),
	)
	if err != nil {
		return nil, err
	}

	ast := &GoModFile{}
	err = p.ParseString(source, ast)
	return ast, err
}

And we’re done, we can parse `go.mod` files now! Let’s write a simple test to make sure everything works as expected:

func TestParse_WithMultipleRequirements(t *testing.T) {
	file, err := Parse(`
  module github.com/matang28/go-latest
  go 1.12
  require (
  	github.com/bla1/bla1 v1.23.1 // indirect
  	github.com/bla2/bla2 v2.25.8-20190701-fuasdjhasd8
  )
`)
	assert.Nil(t, err)
	assert.NotNil(t, file)

	assert.EqualValues(t, "github.com/matang28/go-latest", file.Module)
	assert.EqualValues(t, "1.12", *file.GoVersion)
	assert.EqualValues(t, 2, len(file.Requirements))

	assert.EqualValues(t, "github.com/bla1/bla1", file.Requirements[0].ModuleName)
	assert.EqualValues(t, "v1.23.1", file.Requirements[0].Version)
	assert.EqualValues(t, "indirect", *file.Requirements[0].Comment)

	assert.EqualValues(t, "github.com/bla2/bla2", file.Requirements[1].ModuleName)
	assert.EqualValues(t, "v2.25.8-20190701-fuasdjhasd8", file.Requirements[1].Version)
	assert.Nil(t, file.Requirements[1].Comment)

	assert.Nil(t, file.Replacements)
}

Where is the pet project?

Picking

go.mod

files for the examples wasn’t accidental, the tool I’ve decided to build is a simple automation tool. If you use Go in your organization you probably know that editing

go.mod

file manually can be tedious.

go-latest

to the rescue!

go-latest

will scan for

go.mod

files recursively, patching every dependency that matches your query to latest .

For example, for this file tree:

.
├── go.mod
├── subModule1
│   └── go.mod
└── subModule2
    └── go.mod

If I want to patch dependencies that matches my organization name to their latest version I’ll type:

$> go-latest ”organization.com” .

See?! Iv’e told you, pet projects are fun :) and I don’t care that we’ve could solve this problem by using simple regex expressions because this is the essence of pet projects!