Whenever you start wondering how to create your syntax or rules on expression, it’s probably time to check out this powerful tool you may be interested in. The Chevrotain allows you to build your own grammar rules, validate, and parse the result at the end.
What does this library structurally consist of?
First of all, it’s Lexer. Lexer is the class that basically goes through input text and match token to each symbol or sequence of symbols (words), it found. To create an instance of the Lexer class, I need to define all tokens that I want to use. The token should have a unique name, the pattern - means regex.
For example token ‘hackernoon‘ will match the exact word in the text.
const hackernoon = createToken({
name: 'hackernoon',
pattern: /hackernoon/
})
Also, there is a possibility to categorize tokens. For this purpose, I need to define one more token with a reserved pattern value Lexer.NA
. In this case, Lexer will not use it to match some characters. And in other tokens that I want to categorize, I add my new token to a property ‘categories’. Like in the example below, I introduce MathOperator
and add this token to MathPlus
and MathMinus
.
export const MathOperator = createToken({
name: Token.MathOperator,
pattern: Lexer.NA,
});
export const MathPlus = createToken({
name: Token.MathPlus,
pattern: /\+/,
categories: MathOperator,
});
export const MathMinus = createToken({
name: Token.MathMinus,
pattern: /-/,
categories: MathOperator,
});
export const Number = createToken({
name: Token.Number,
pattern: /[+-]?\d*\.?\d+(?:[Ee][+-]?\d+)?/
});
export const Whitespace = createToken({
name: Token.WhiteSpace,
pattern: /\s+/,
group: Lexer.SKIPPED,
});
When all tokens are defined, I need to create a new instance of the Lexer class and put there an array of these tokens.
const allTokens = [
WhiteSpace,
Number,
MathPlus,
MathMinus
]
const mathLexer = new Lexer(allTokens)
And from this moment the Lexer is ready to tokenize input strings. I need to invoke the function ‘tokenize’ of the Lexer, and as a result, I will receive an object with tokens. With them, I can go to the next step.
const lexingResult = mathLexer.tokenize(text);
lexingResult
is having next interface ILexingResult
. Where errors array can be very helpful to understand for the user what is wrong in the expression because it contains coordinates to the problem: line and column.
export interface ILexingResult {
tokens: IToken[]
groups: {
[groupName: string]: IToken[]
}
errors: ILexingError[]
}
The second important thing is Parser. It is a tool to describe the rules of your expression. For example, in math expression, you are expecting that the first variable will be a Number
the second should be some operator +/-
and the third must be a Number
as well. So rule will describe such behavior. First things first I create my own class the MathParser
inherited from CstParser
. That will accept tokens list in the constructor, and the constructor will contain all required rules and after them the this.performSelfAnalysis()
to check all written rules.
class MathParser extends CstParser {
public mathCond: RuleFn;
private valueOrParen: RuleFn;
private parentheses: RuleFn;
private value: RuleFn;
private mathOp: RuleFn;
constructor(tokens: TokenType[]) {
super(tokens);
const $ = this;
// Rules here
this.performSelfAnalysis(); // Must be after all rules
}
}
The mathCond
rule is the main entrance rule that will be invoked to perform the parsing. Basically, in my example, I assume that expression should be starting from some number or from parentheses. You are wondering how to read this rule? It actually tells if at least one number 123
or parentheses with a number (123)
or parentheses with some operation (123 + 456)
(the operation described in MANY
keyword), then the expression is valid.
Of course, as you can see there could be more parentheses and math operations inside one parentheses, in these terms - recursion. And MANY
says that expression can repeat zero or more times. As like the example with (123 + 456)
- it repeats one time if it will be like (123 + 456 - 789)
then 2 repeats. (123 + 456) - 789
this can be read: 1 mathCond
repeat in parentheses and one after them.
this.mathCond = $.RULE('mathCond', () => {
$.SUBRULE($.valueOrParen, { LABEL: 'lhs' });
$.MANY(() => {
$.SUBRULE1($.mathOp)
$.SUBRULE2($.valueOrParen, { LABEL: 'rhs' });
});
});
The rule valueOrParen
is expecting some Number or other parentheses with math expression.
this.valueOrParen = $.RULE('valueOrParen', () => {
$.OR([
{ ALT: () => $.SUBRULE($.value) },
{ ALT: () => $.SUBRULE($.parentheses) },
]);
});
In the next rule parentheses, I want to cover the case with braces ()
and inside braces, recursion happens. LeftParen
and RightParen
are two more tokens. mathCond
is the main rule function.
this.parentheses = $.RULE('parentheses', () => {
$.CONSUME(LeftParen);
$.SUBRULE($.mathCond);
$.CONSUME(RightParen);
});
mathOp
is the rule which looks for +
OR -
.
this.mathOp = $.RULE('mathOp', () => {
$.OR([
{ ALT: () => $.CONSUME(MathPlus) },
{ ALT: () => $.CONSUME(MathMinus) },
]);
});
The value
rule is created just to show that it can be extended with other specific types.
this.value = $.RULE('value', () => {
$.OR([
{ ALT: () => $.CONSUME(Number) }
]);
});
Once the parser is ready, create it and call the main function.
const mathParser = new MathParser(allTokens);
mathParser.mathCond();
As for Lexer, there are errors that help users, here as well, after calling the main function parser may contain errors with Token where the error happened. mathParser.errors
So this library helps you to verify the string that the user typed and actually evaluate the expression. And even more, you can try a custom Visitor to create your own expression tree.