I love pet projects; they’re a great excuse to use libraries and other technologies that you can’t use at work. Lately I’ve been working on a bigger pet project that needs to parse Go files. I’ve used ANTLR before for this kind of thing, but unfortunately ANTLR’s Go target has poor performance. So I began to search for alternatives written in pure Go and came across this one, which takes a different approach to creating parsers with Go. But before we get to how this library is different from the others, let’s cover some basic concepts about parsing.
For the sake of “formality”, here is what Wikipedia has to say:
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
Now, let’s try to break it down using an example. If you use Go, you’re probably familiar with go.mod files; if you aren’t, go.mod files are used to manage your project’s dependencies. Here is a simple go.mod file:
module github.com/matang28/go-latest
go 1.13
require (
	github.com/some/dependency1 v1.2.3
	github.com/some/dependency2 v5.1.0 // indirect
)
replace github.com/some/dependency1 => github.com/some/dependency1@dev latest
Before we can parse this file, we need two things:
The first one is formally called a Lexer (or lexing). Lexers are all about turning strings into a list of predefined symbols that our parser uses; any symbol (or token) used by our language should be identified by the Lexer. In our case we need to capture the following tokens:

- The keywords module, go, require and replace.
- Strings like github.com/some/dependency1 and 1.13.
- Versions like v1.2.3 and v1.4.21-haijkjasd9ijasd.
- The symbols =>, //, ( and ).
- Whitespaces, line-terminators and tabs.

The second one will give these tokens a syntactic meaning. You can see that go.mod files follow some basic rules, built around a handful of directives:

- The module directive.
- The go directive.
- The require directive.
- The replace directive.
- The exclude directive.

These rules are often called the grammar of the language (here, the go.mod language), and there are more formal ways to represent them. One of them is the EBNF standard; here is our grammar, written in ANTLR’s EBNF-like notation:
grammar GoMod;
// A source file should be structured in the following way:
sourceFile : moduleDecl stmt* EOF;
stmt : (versionDecl | reqDecl | repDecl | excDecl);
// The module directive:
moduleDecl : 'module' STRING;
// The version directive:
versionDecl : 'go' STRING;
// The require directive, it can be either a single line:
// require github.com/some/dep1 OR multi line:
// require (
// github.com/some/dep1
// github.com/some/dep2
// )
reqDecl : 'require' dependency
        | 'require' '(' dependency* ')'
        ;
// The replace directive:
repDecl : 'replace' STRING '=>' dependency;
// The exclude directive:
excDecl : 'exclude' dependency;
dependency : STRING version comment?;
version : VERSION | 'latest';
comment : '//' STRING;
VERSION : [v]STRING+;
STRING : [a-zA-Z0-9_+.@/-]+;
WS : [ \n\r\t\u000C]+ -> skip;
Now back to the definition: a Parser takes a stream of tokens (e.g. the tokenization of our text file) and tries to match each token against the grammar. If our file doesn’t follow the grammar we get a parse error, but if it does follow the grammar, we know for sure that the input file is valid (in terms of syntax) and we can get the parse tree, which we can traverse through the matched grammar rules.
Parsing is all about making this magic happen… taking a stream of tokens and building a parse tree from it (iff the syntax matches the grammar).
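To make the split between lexing and parsing concrete, here is a tiny sketch in plain Go (no library involved; the Token type and token names are made up for illustration) of the token stream a lexer might produce for a single require line:

// Token is a hypothetical lexer output: a symbol type plus the matched text.
type Token struct {
	Type  string
	Value string
}

// What a lexer might emit for: require github.com/some/dep1 v1.2.3
var tokens = []Token{
	{Type: "String", Value: "require"},
	{Type: "String", Value: "github.com/some/dep1"},
	{Type: "Version", Value: "v1.2.3"},
}

// The parser's job is to consume this stream, match it against the
// grammar rules above (reqDecl, dependency, version) and report an
// error on any mismatch.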
At the beginning of the article I said that one library was the spark for this pet project. Why is that?
Well, parse trees aren’t the most comfortable data structure to extract data from; “walking” a parse tree can be pretty confusing. Most of the tools available today use code generation techniques to generate lexers and parsers from a grammar file, and in order to be as generic as possible, most of them give you a listener interface which gets notified when the parser enters or exits each grammar rule.
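To see why that gets tedious, here is a hand-written sketch (not generated code; the interface and its method names are invented for illustration) of the kind of listener such tools hand you:

// A made-up listener interface in the style of generated parser listeners:
// you implement callbacks that fire while the parser walks the parse tree,
// and you reassemble the data you care about from those events yourself.
type GoModListener interface {
	EnterModuleDecl(name string)
	ExitModuleDecl()
	EnterDependency(module, version string)
	ExitDependency()
	// ...one Enter/Exit pair per grammar rule
}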
Participle uses a different approach, one that’s more natural for Go developers. It uses Go’s struct system to represent the grammar (or, more precisely, the parts we want to capture from the grammar), and it utilizes “struct tags” to define the grammar rules.
Wow! That was a mouthful, so let’s learn by doing. First we need some structs that represent a standard `go.mod` file:
// The root level object that represents a go.mod file
type GoModFile struct {
	Module     string
	Statements []Statement
}

type Statement struct {
	GoVersion    *string
	Requirements []Dependency
	Replacements []Replacement
	Excludes     []Dependency
}

// A struct that represents a go.mod dependency
type Dependency struct {
	ModuleName string
	Version    string
	Comment    *string
}

// A struct that represents a replace directive
type Replacement struct {
	FromModule string
	ToModule   Dependency
}
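Before wiring in a lexer, it’s worth seeing where we’re headed. Here is how the sample go.mod file from the beginning of the article should land in these structs once parsing works (a hand-written literal, not parser output; strPtr is a tiny helper I’m introducing for taking the address of a string literal):

// strPtr is a small helper for building *string fields.
func strPtr(s string) *string { return &s }

var sample = GoModFile{
	Module: "github.com/matang28/go-latest",
	Statements: []Statement{
		{GoVersion: strPtr("1.13")},
		{Requirements: []Dependency{
			{ModuleName: "github.com/some/dependency1", Version: "v1.2.3"},
			{ModuleName: "github.com/some/dependency2", Version: "v5.1.0", Comment: strPtr("indirect")},
		}},
		{Replacements: []Replacement{{
			FromModule: "github.com/some/dependency1",
			ToModule:   Dependency{ModuleName: "github.com/some/dependency1@dev", Version: "latest"},
		}}},
	},
}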
In order to write a parser we need a stream of tokens, and Participle has us covered with a regex-based lexer (the docs say they’ve been working on an EBNF-style one, which could be really useful since this is the most cumbersome part of Participle, IMO). Let’s see it in action:
import (
	"github.com/alecthomas/participle/lexer"
)

// The lexer uses named regex groups to tokenize the input:
var goModLexer = lexer.Must(lexer.Regexp(
	/* We want to ignore these characters, so no name needed */
	`([\s\n\r\t]+)` +
		/* Parentheses [(,)] */
		`|(?P<Parentheses>[\(\)])` +
		/* Arrow [=>] */
		`|(?P<Arrow>(=>))` +
		/* Version [v STRING] */
		`|(?P<Version>[v][a-zA-Z0-9_\+\.@\-\/]+)` +
		/* String [a-z,A-Z,0-9,_,+,.,@,-,/]+ */
		`|(?P<String>[a-zA-Z0-9_\+\.@\-\/]+)`,
))
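We can sanity-check the lexer on its own before building the parser. This is a rough sketch based on my reading of Participle’s v0 lexer API (Definition.Lex plus lexer.ConsumeAll, with the usual fmt and strings imports); treat the exact calls as an assumption rather than gospel:

// Dump the raw tokens for a single require line (assumes the v0 lexer API).
lex, err := goModLexer.Lex(strings.NewReader("require github.com/some/dep1 v1.2.3"))
if err != nil {
	panic(err)
}
tokens, err := lexer.ConsumeAll(lex)
if err != nil {
	panic(err)
}
for _, tok := range tokens {
	fmt.Printf("%q\n", tok.Value) // "require", "github.com/some/dep1", "v1.2.3", EOF
}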
Now we can move on to defining our grammar; think of it as a combination of the EBNF file and our Go struct representation. Here are some of the basic syntax operators Participle has to offer:

- "module" will try to match the literal symbol module.
- @ captures an expression (i.e. a named group defined in our lexer) into the struct field; for example, @String will match the String token and extract its value into the field.
- @@ lets the underlying struct match the input; this is useful when you have multiple alternatives for a rule.
- * matches zero or more, + matches at least one, ? matches zero or one, etc.
// The root level object that represents a go.mod file
type GoModFile struct {
	Module     string      `"module" @String`
	Statements []Statement `@@*`
}

type Statement struct {
	GoVersion    *string       `( "go" @String )`
	Requirements []Dependency  `| (("require" "(" @@* ")") | ("require" @@))`
	Replacements []Replacement `| (("replace" "(" @@* ")") | ("replace" @@))`
	Excludes     []Dependency  `| (("exclude" "(" @@* ")") | ("exclude" @@))`
}

// A struct that represents a go.mod dependency
type Dependency struct {
	ModuleName string  `@String`
	Version    string  `(@Version | @"latest")`
	Comment    *string `("//" @String)?`
}

// A struct that represents a replace directive
type Replacement struct {
	FromModule string     `@String "=>"`
	ToModule   Dependency `@@`
}
All we have to do now is build the parser from our grammar:
func Parse(source string) (*GoModFile, error) {
	p, err := participle.Build(&GoModFile{},
		participle.Lexer(goModLexer),
	)
	if err != nil {
		return nil, err
	}

	ast := &GoModFile{}
	err = p.ParseString(source, ast)
	return ast, err
}
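And since the result is just plain Go structs, extracting data is a matter of ranging over slices instead of walking a parse tree. A minimal (hypothetical) usage example that lists every required module:

// Print every required module and its version from a go.mod source string.
func printRequirements(source string) error {
	file, err := Parse(source)
	if err != nil {
		return err
	}
	for _, stmt := range file.Statements {
		for _, dep := range stmt.Requirements {
			fmt.Println(dep.ModuleName, dep.Version)
		}
	}
	return nil
}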
And we’re done; we can parse `go.mod` files now! Let’s write a simple test to make sure everything works as expected:
func TestParse_WithMultipleRequirements(t *testing.T) {
	file, err := Parse(`
module github.com/matang28/go-latest
go 1.12
require (
	github.com/bla1/bla1 v1.23.1 // indirect
	github.com/bla2/bla2 v2.25.8-20190701-fuasdjhasd8
)
`)
	assert.Nil(t, err)
	assert.NotNil(t, file)
	assert.EqualValues(t, "github.com/matang28/go-latest", file.Module)

	// GoVersion and Requirements live on the Statement structs,
	// so we reach them through file.Statements:
	assert.EqualValues(t, "1.12", *file.Statements[0].GoVersion)

	reqs := file.Statements[1].Requirements
	assert.EqualValues(t, 2, len(reqs))
	assert.EqualValues(t, "github.com/bla1/bla1", reqs[0].ModuleName)
	assert.EqualValues(t, "v1.23.1", reqs[0].Version)
	assert.EqualValues(t, "indirect", *reqs[0].Comment)
	assert.EqualValues(t, "github.com/bla2/bla2", reqs[1].ModuleName)
	assert.EqualValues(t, "v2.25.8-20190701-fuasdjhasd8", reqs[1].Version)
	assert.Nil(t, reqs[1].Comment)
	assert.Nil(t, file.Statements[1].Replacements)
}
Picking go.mod files for the examples wasn’t accidental; the tool I decided to build is a simple automation tool. If you use Go in your organization, you probably know that editing go.mod files manually can be tedious. go-latest to the rescue! go-latest will scan for go.mod files recursively, patching every dependency that matches your query to latest. For example, take this file tree:
.
├── go.mod
├── subModule1
│   └── go.mod
└── subModule2
    └── go.mod
If I want to patch every dependency that matches my organization’s name to its latest version, I’ll type:
$> go-latest "organization.com" .
See?! I told you, pet projects are fun :) And I don’t care that we could have solved this problem with simple regular expressions, because this is the essence of pet projects!
For more info about go-latest, check out its GitHub page.
And as always, thanks for reading...