In this part of creating your own programming language, we will improve our lexical analysis with Regex lookaheads and lookbehinds (please check out Part I and Part II).
Basically, lookaheads and lookbehinds allow you to create your own ^
and $
sub-expressions (anchors). With their help, you can set up a condition that will be met at the beginning and at the end of the line or won’t, and this condition will not be a part of the “matched” expression.
Lookbehind can “look” back and should be placed at the beginning of the regular expression. Lookahead “looks” forward and accordingly should be placed at the end.
There are four types of lookaheads and lookbehinds with corresponding regex syntax:
(?=pattern)
(?!pattern)
(?<=pattern)
(?<!pattern)
Let’s dive into code and see how we can improve lexical analysis for our programming language. We will go through the declared Token types with the corresponding regex value to define an appropriate lexeme:
if|then|end|print|input|struct|arg
Let’s suppose we defined a variable starting with a Keyword:
print_value = 5
print print_value
During the lexical analysis we will capture this print_value
variable by the Keyword
lexeme instead of being parsed as a variable. To read it properly, we need to be sure that after the keyword is placed either a space or the end of the line. It could be done with the help of a positive lookahead in the end of the Keyword
expression:
(if|then|end|print|input|struct|arg)(?=\\s|$)
Logical lexeme true|false
The same rule is applied to the Logical
lexeme:
true_value = true
To read the Logical
lexeme properly, it should be separated either by space or by the end of the line:
(true|false)(?=\\s|$)
[0-9]+
For the Numeric
values, we will introduce the ability to read floating-point numbers with a decimal point and numbers with a negative/positive sign in front of the number:
negative_value = -123
float_value = 123.456
leading_dot_float_value = .123
ending_dot_float_value = 123.
First, to match optional +
or -
we add a character class:
[+-]?[0-9]+
Then to read a float number, we introduce a dot character with arbitrary numbers set after it:
[+-]?[0-9]+[.]?[0-9]*
There could be a special case when our number starts with a leading dot, so we add an extra alternation:
[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)
We can also replace an alternation by applying a positive lookahead with a requirement to contain maximum 1 dot and at least 1 number after an arbitrary sign declaration. In this case, our main expression for the float number will have a greedy *
quantifier:
[+-]?((?=[.]?[0-9])[0-9]*[.]?[0-9]*)
Lastly, to map floating-point numbers, we modify NumericValue
to accept Double
type as a generic Value
type:
public class NumericValue extends Value<Double> {
public NumericValue(Double value) {
super(value);
}
@Override
public String toString() {
if ((getValue() % 1) == 0) //print integer without fractions
return String.valueOf(getValue().intValue());
return super.toString();
}
}
Operator ([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new)
struct Object
arg number
end
new_value = new Object [ 5 ]
For the new
operator, we as well need to introduce a positive lookahead separating value with a space or with the end of the line:
([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new(?=\s|$))
In this part we slightly improved our lexical analyzer with positive lookaheads. Now our toy language is a little bit smarter and able to parse more complex variable expressions.