In this tutorial, we will build our own programming language and compiler using Java (you can use any other language, preferably an object-oriented one). The purpose of the article is to help people who are looking for a way to create their own programming language and compiler. This is a toy example, but it should help you understand where to start and in which direction to move. The full source code is available over on GitHub.
Each language goes through several stages on the way from source code to the final executable file, and each stage transforms the incoming data in a certain way.
We will create a simple language with the following abilities:
struct Person
arg name
arg experience
arg is_developer
end
input your_name
input your_experience_in_years
input do_you_like_programming
person = new Person [your_name your_experience_in_years do_you_like_programming == "yes"]
print person
if person :: is_developer then
person_name = person :: name
print "hey " + person_name + "!"
experience = person :: experience
if experience > 0 then
started_in = 2022 - experience
print "you had started your career in " + started_in
end
end
First of all, we will start with the lexical analysis. Let’s imagine you got a message from a friend with the following content:
“Ienjoyreadingbooks”
This expression is a little difficult to read. It’s just a set of letters without any meaning. This is because our natural lexical analyzer can’t find any appropriate word from our dictionary. However, if we put spaces correctly, everything becomes clear:
“I enjoy reading books”
The lexical analyzer of a programming language works on the same principle. When we talk, we highlight individual words with intonation and pauses to understand each other. In the same way, we must provide code to the lexical analyzer in a form it can understand. If we write it incorrectly, the lexical analyzer will not be able to separate the individual lexemes, words and syntax constructions.
There are six kinds of lexemes (tokens) we will handle in our programming language during lexical analysis:
Whitespace and newlines. These lexemes do not carry much meaning on their own: you can't declare a code block or a function parameter using a space. Their main purpose is to help the developer divide the code into separate lexemes. Therefore, the lexical analyzer will first of all look for spaces and newlines in order to understand how to split the provided code into lexemes.
Operators: `+`, `-`, `=`, `<`, `>`, `::`. They can be part of more complex compound statements. The equals sign can mean not only the assignment operator but can also form the more complex equality comparison operator, consisting of two `=` signs in a row. In this case, the lexical analyzer reads expressions from left to right, trying to catch the longest operator.
Group dividers: `[`, `]`. Group dividers separate group lexemes from each other. For example, an open square bracket marks the beginning of some specific group and a close square bracket marks the end of the started group. In our language, square brackets will only be used to declare struct instance arguments.
Keywords: `print`, `input`, `struct`, `arg`, `end`, `new`, `if`, `then`. A keyword is a set of alphabetic characters with a specific meaning assigned by the compiler. For example, the combination of five letters `print` makes one word that the language compiler perceives as a console output statement. This meaning cannot be changed by a programmer. That's the basic idea of keywords: they are language axioms that we combine to create our own statements - programs.
In addition to the keywords, we also need to take variables into account. A variable is a sequence of characters with a meaning given by the programmer, not the compiler. At the same time, we need to put certain restrictions on variable names. Our variables will contain only letters, digits, and underscore characters. Other characters cannot occur within an identifier because most of the operators we described are delimiters and therefore cannot be part of another lexeme. A variable also cannot begin with a digit, because the lexical analyzer would detect the digit and immediately try to match it as a number. Finally, a variable cannot be expressed by a keyword: if our language defines the keyword `print`, then the programmer can't introduce a variable with the same set of characters in the same order.
If the given sequence of characters is not a keyword and not a variable, the last option remains: it is a literal constant. Our language will be able to define numeric, logical and text literals. Numeric literals contain only digits; for the sake of simplicity, we will not use floating-point numbers (you can implement them yourself later). Logical literals contain the boolean values false or true. A text literal is an arbitrary set of characters enclosed in double quotes.
Now that we have the basic information about all possible lexemes in our programming language, let's dive into the code and start declaring our token types. One of the easiest ways to find tokens in source text is to use regular expressions. By defining tokens with regular expressions in key-value form, we can scan the program code and grab all tokens with their values. Each token must consist of at least two fields: the token type and its value, for example, a token of type Keyword with the value `input`. We will use enum constants with a corresponding Pattern expression for each lexeme type:
@RequiredArgsConstructor
@Getter
public enum TokenType {
Whitespace("[\\s\\t\\n\\r]"),
Keyword("(if|then|end|print|input|struct|arg|new)"),
GroupDivider("(\\[|\\])"),
Logical("true|false"),
Numeric("[0-9]+"),
Text("\"([^\"]*)\""),
Variable("[a-zA-Z_]+[a-zA-Z0-9_]*"),
Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2})");
private final String regex;
}
To simplify literal parsing, I divided each literal type into a separate lexeme: Numeric, Logical and Text. For the Text literal we define a separate group `([^"]*)` to grab the literal value without double quotes. For the equals sign we declare the `{1,2}` range: with two equal signs `==` we expect the compare operator instead of the assignment. To access a structure field we declared the double colon operator `::`.
Now, to find a token in our code, we iterate over all TokenType values and apply each regex to the source code. To match the beginning of the string we prepend the `^` metacharacter to each regex when creating the Pattern instance. The Text token captures its value without quotes in a separate group; therefore, in order to access the unquoted value, we grab the value from the group with index 1 whenever there is at least one explicit group:
for (TokenType tokenType : TokenType.values()) {
Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
Matcher matcher = pattern.matcher(sourceCode);
if (matcher.find()) {
// group(1) is used to get text literal without double quotes
String token = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
}
}
To store found lexemes we need to declare the following Token class with type and value fields:
@Builder
@Getter
public class Token {
private final TokenType type;
private final String value;
}
Now we have everything to create our lexical parser. We will receive the source code as a String in the constructor and initialize tokens List:
public class LexicalParser {
private final List<Token> tokens;
private final String source;
public LexicalParser(String source) {
this.source = source;
this.tokens = new ArrayList<>();
}
...
}
In order to retrieve all tokens from the source code we need to cut the source after each found lexeme. We will declare the nextToken() method that accepts the current index of the source code as an argument and grabs the next token starting at the current position:
public class LexicalParser {
...
private int nextToken(int position) {
String nextToken = source.substring(position);
for (TokenType tokenType : TokenType.values()) {
Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
Matcher matcher = pattern.matcher(nextToken);
if (matcher.find()) {
if (tokenType != TokenType.Whitespace) {
// group(1) is used to get text literal without double quotes
String value = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
Token token = Token.builder().type(tokenType).value(value).build();
tokens.add(token);
}
return matcher.group().length();
}
}
throw new TokenException(String.format("invalid expression: `%s`", nextToken));
}
}
After a successful capture we create a Token instance and accumulate it in the tokens list. We do not add Whitespace lexemes, as they are only used to divide two lexemes from each other. In the end we return the found lexeme's length.
To capture all tokens in the source we create the parse() method with a while loop, increasing the source position each time we catch a token:
public class LexicalParser {
...
public List<Token> parse() {
int position = 0;
while (position < source.length()) {
position += nextToken(position);
}
return tokens;
}
...
}
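At this point the lexer can already be tried out. Here is a quick usage sketch (the input string is only an illustration, not part of the sample program):
LexicalParser lexicalParser = new LexicalParser("print \"hey \" + 2022");
List<Token> tokens = lexicalParser.parse();
// tokens now contain: Keyword "print", Text "hey ", Operator "+", Numeric "2022"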
Within our compiler model, the syntax analyzer will receive a list of tokens from the lexical analyzer and check whether this sequence can be generated by the language grammar. In the end this syntax analyzer should return an abstract syntax tree.
We will start the syntax analyzer by declaring the Expression interface:
public interface Expression {
}
This interface will be used to declare literals, variables and composite expressions with operators.
First of all, we create Expression implementations for our language's literal types Numeric, Text and Logical, with the corresponding Java types Integer, String and Boolean. We will create a base Value class with a generic type extending Comparable (it will be used later with the comparison operators):
@RequiredArgsConstructor
@Getter
public class Value<T extends Comparable<T>> implements Expression {
private final T value;
@Override
public String toString() {
return value.toString();
}
}
public class NumericValue extends Value<Integer> {
public NumericValue(Integer value) {
super(value);
}
}
public class TextValue extends Value<String> {
public TextValue(String value) {
super(value);
}
}
public class LogicalValue extends Value<Boolean> {
public LogicalValue(Boolean value) {
super(value);
}
}
We will also declare StructureValue for our structure instances:
public class StructureValue extends Value<StructureExpression> {
public StructureValue(StructureExpression value) {
super(value);
}
}
StructureExpression will be covered a bit later.
Variable expression will have a single field representing its name:
@AllArgsConstructor
@Getter
public class VariableExpression implements Expression {
private final String name;
}
In order to store a structure instance we need to know the structure definition and the argument values we are passing to create an object. Argument values can be any expressions, including literals, variables and more complex expressions implementing the Expression interface. Therefore, we will use Expression as the values type:
@RequiredArgsConstructor
@Getter
public class StructureExpression implements Expression {
private final StructureDefinition definition;
private final List<Expression> values;
}
@RequiredArgsConstructor
@Getter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class StructureDefinition {
@EqualsAndHashCode.Include
private final String name;
private final List<String> arguments;
}
Values can be of the VariableExpression type, so we need a way to access a variable's value by its name. I will delegate this responsibility to a Function field which will accept a variable name and return a Value object:
...
public class StructureExpression implements Expression {
...
private final Function<String, Value<?>> variableValue;
}
Now we can implement a method to retrieve an argument's value by name; it will be used to access the structure instance values. The getValue() method will be implemented a bit later:
...
public class StructureExpression implements Expression {
...
public Value<?> getArgumentValue(String field) {
return IntStream
.range(0, values.size())
.filter(i -> definition.getArguments().get(i).equals(field))
.mapToObj(this::getValue) //will be implemented later
.findFirst()
.orElse(null);
}
}
Don't forget that our StructureExpression is used as the parameter for the StructureValue generic type, which extends Comparable. Therefore, we must implement the Comparable interface for our StructureExpression:
...
public class StructureExpression implements Expression, Comparable<StructureExpression> {
...
@Override
public int compareTo(StructureExpression o) {
for (String field : definition.getArguments()) {
Value<?> value = getArgumentValue(field);
Value<?> oValue = o.getArgumentValue(field);
if (value == null && oValue == null) continue;
if (value == null) return -1;
if (oValue == null) return 1;
//noinspection unchecked,rawtypes
int result = ((Comparable) value.getValue()).compareTo(oValue.getValue());
if (result != 0) return result;
}
return 0;
}
}
We can also override the standard toString() method. It will be useful if we want to print an entire structure instance to the console:
...
public class StructureExpression implements Expression, Comparable<StructureExpression> {
...
@Override
public String toString() {
return IntStream
.range(0, values.size())
.mapToObj(i -> {
Value<?> value = getValue(i); //will be implemented later
String fieldName = definition.getArguments().get(i);
return fieldName + " = " + value;
})
.collect(Collectors.joining(", ", definition.getName() + " [ ", " ]"));
}
}
In our language, we will have unary operators with one operand and binary operators with two operands. Let's implement the Expression interface for each of our operators. We declare the OperatorExpression interface and create base implementations for unary and binary operators:
public interface OperatorExpression extends Expression {
}
@RequiredArgsConstructor
@Getter
public class UnaryOperatorExpression implements OperatorExpression {
private final Expression value;
}
@RequiredArgsConstructor
@Getter
public class BinaryOperatorExpression implements OperatorExpression {
private final Expression left;
private final Expression right;
}
We also need to declare an abstract calc() method for each of our operator implementations, taking the corresponding Value operands. This method will calculate a Value from the operands depending on each operator's semantics. The operands' transition from Expression to Value will be covered a bit later.
...
public abstract class UnaryOperatorExpression implements OperatorExpression {
...
public abstract Value<?> calc(Value<?> value);
}
...
public abstract class BinaryOperatorExpression implements OperatorExpression {
...
public abstract Value<?> calc(Value<?> left, Value<?> right);
}
After declaring the base operator classes we can create more detailed implementations. The first is the logical NOT operator `!`. This operator only works with a LogicalValue, returning a new LogicalValue instance with the inverted value:
public class NotOperator extends UnaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> value) {
if (value instanceof LogicalValue) {
return new LogicalValue(!((LogicalValue) value).getValue());
} else {
throw new ExecutionException(String.format("Unable to perform NOT operator for non logical value `%s`", value));
}
}
}
Now we will switch to the BinaryOperatorExpression implementations with two operands. The first one is the addition operator `+`. Its calc() method adds two NumericValue objects. For the other Value types we concatenate their toString() values, returning a TextValue instance:
public class AdditionOperator extends BinaryOperatorExpression {
public AdditionOperator(Expression left, Expression right) {
super(left, right);
}
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
if (left instanceof NumericValue && right instanceof NumericValue) {
return new NumericValue(((NumericValue) left).getValue() + ((NumericValue) right).getValue());
} else {
return new TextValue(left.toString() + right.toString());
}
}
}
The subtraction operator `-`: calc() performs a numeric subtraction only if both values are of the NumericValue type. Otherwise, both values are mapped to String and the second value's matches are removed from the first value:
public class SubtractionOperator extends BinaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
if (left instanceof NumericValue && right instanceof NumericValue) {
return new NumericValue(((NumericValue) left).getValue() - ((NumericValue) right).getValue());
} else {
return new TextValue(left.toString().replaceAll(right.toString(), ""));
}
}
}
The equality operator `==`: calc() compares the getValue() results only if both values have the same type. Otherwise, both values are compared as strings:
public class EqualsOperator extends BinaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
boolean result;
if (Objects.equals(left.getClass(), right.getClass())) {
result = ((Comparable) left.getValue()).compareTo(right.getValue()) == 0;
} else {
result = ((Comparable) left.toString()).compareTo(right.toString()) == 0;
}
return new LogicalValue(result);
}
}
The less than `<` and greater than `>` operators work the same way, comparing getValue() results when both values have the same type and falling back to a string comparison otherwise:
public class LessThanOperator extends BinaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
boolean result;
if (Objects.equals(left.getClass(), right.getClass())) {
result = ((Comparable) left.getValue()).compareTo(right.getValue()) < 0;
} else {
result = left.toString().compareTo(right.toString()) < 0;
}
return new LogicalValue(result);
}
}
public class GreaterThanOperator extends BinaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
boolean result;
if (Objects.equals(left.getClass(), right.getClass())) {
result = ((Comparable) left.getValue()).compareTo(right.getValue()) > 0;
} else {
result = left.toString().compareTo(right.toString()) > 0;
}
return new LogicalValue(result);
}
}
The structure field access operator `::`: to read a structure argument's value we expect the left operand to be of the StructureValue type:
public class StructureValueOperator extends BinaryOperatorExpression {
...
@Override
public Value<?> calc(Value<?> left, Value<?> right) {
if (left instanceof StructureValue)
return ((StructureValue) left).getValue().getArgumentValue(right.toString());
return left;
}
}
If we want to pass variables or more complex Expression implementations (including operators themselves) to the operators, we need a way to transform an Expression object into a Value. To do this we declare the evaluate() method in our Expression interface:
public interface Expression {
Value<?> evaluate();
}
The Value class will just return the this instance:
public class Value<T extends Comparable<T>> implements Expression {
...
@Override
public Value<?> evaluate() {
return this;
}
}
To implement evaluate() for the VariableExpression, we first must provide an ability to get a Value by the variable's name. We delegate this work to a Function field which will accept a variable name and return the corresponding Value object:
...
public class VariableExpression implements Expression {
...
private final Function<String, Value<?>> variableValue;
@Override
public Value<?> evaluate() {
Value<?> value = variableValue.apply(name);
if (value == null) {
return new TextValue(name);
}
return value;
}
}
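A quick sketch of how variable resolution behaves; the variables map here is just a stand-in for the one the parser will hold later:
Map<String, Value<?>> variables = new HashMap<>();
variables.put("name", new TextValue("Alex"));
System.out.println(new VariableExpression("name", variables::get).evaluate());    // Alex
System.out.println(new VariableExpression("surname", variables::get).evaluate()); // surname - unknown variables evaluate to their own name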
To evaluate the StructureExpression we will create a StructureValue instance. We also need to implement the missing getValue() method that accepts an argument index and returns its value depending on the Expression type:
...
public class StructureExpression implements Expression, Comparable<StructureExpression> {
...
@Override
public Value<?> evaluate() {
return new StructureValue(this);
}
private Value<?> getValue(int index) {
Expression expression = values.get(index);
if (expression instanceof VariableExpression) {
return variableValue.apply(((VariableExpression) expression).getName());
} else {
return expression.evaluate();
}
}
}
For the operator expressions we evaluate the operand values and pass them to the calc() method:
public abstract class UnaryOperatorExpression extends OperatorExpression {
...
@Override
public Value<?> evaluate() {
return calc(getValue().evaluate());
}
}
public abstract class BinaryOperatorExpression extends OperatorExpression {
...
@Override
public Value<?> evaluate() {
return calc(getLeft().evaluate(), getRight().evaluate());
}
}
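With evaluate() in place, operator trees can be built and evaluated directly. A small sketch of the behavior described above, assuming the operator classes keep the two-argument Expression constructors shown for AdditionOperator:
System.out.println(new AdditionOperator(new NumericValue(2), new NumericValue(3)).evaluate());      // 5
System.out.println(new AdditionOperator(new TextValue("hey "), new NumericValue(2022)).evaluate()); // hey 2022
System.out.println(new EqualsOperator(new NumericValue(5), new TextValue("5")).evaluate());         // true - different types are compared as strings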
Before we start creating our syntax analyzer we need to introduce a model for our language's statements. Let's start with the Statement interface with an execute() method which will be called during code execution:
public interface Statement {
void execute();
}
The first statement we will implement is the PrintStatement. This statement needs to know the expression to print. Our language supports printing literals, variables and other expressions implementing Expression, which will be evaluated during execution to calculate the Value object:
@AllArgsConstructor
@Getter
public class PrintStatement implements Statement {
private final Expression expression;
@Override
public void execute() {
Value<?> value = expression.evaluate();
System.out.println(value);
}
}
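A one-line sketch of the statement in action:
new PrintStatement(new TextValue("hey!")).execute(); // prints: hey!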
To declare an assignment we need to know the variable's name and the expression we want to assign. Our language will be able to assign literals, variables and more complex expressions implementing the Expression interface:
@AllArgsConstructor
@Getter
public class AssignStatement implements Statement {
private final String name;
private final Expression expression;
@Override
public void execute() {
...
}
}
During execution we need to store the variable values somewhere. Let's delegate this logic to a BiConsumer field which will receive the variable name and the value being assigned. Note that the Expression value needs to be evaluated when performing the assignment:
...
public class AssignStatement implements Statement {
...
private final BiConsumer<String, Value<?>> variableSetter;
@Override
public void execute() {
variableSetter.accept(name, expression.evaluate());
}
}
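A minimal sketch of wiring the statement to a plain map - the same variables::put reference the parser will pass in later:
Map<String, Value<?>> variables = new HashMap<>();
new AssignStatement("started_in", new NumericValue(2022), variables::put).execute();
System.out.println(variables.get("started_in")); // 2022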
In order to read an expression from the console we need to know the variable name to assign the value to. During execute() we must read a line from the console. This work can be delegated to a Supplier, so we will not create multiple input streams. After reading the line we parse it into the corresponding Value object and delegate the assignment to the BiConsumer instance, as we did for the AssignStatement:
@AllArgsConstructor
@Getter
public class InputStatement implements Statement {
private final String name;
private final Supplier<String> consoleSupplier;
private final BiConsumer<String, Value<?>> variableSetter;
@Override
public void execute() {
//just a way to tell the user in which variable the entered value will be assigned
System.out.printf("enter \"%s\" >>> ", name.replace("_", " "));
String line = consoleSupplier.get();
Value<?> value;
if (line.matches("[0-9]+")) {
value = new NumericValue(Integer.parseInt(line));
} else if (line.matches("true|false")) {
value = new LogicalValue(Boolean.valueOf(line));
} else {
value = new TextValue(line);
}
variableSetter.accept(name, value);
}
}
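Because the console read is hidden behind a Supplier, the statement is easy to exercise without a real console. A sketch with a fixed input line:
Map<String, Value<?>> variables = new HashMap<>();
new InputStatement("your_experience_in_years", () -> "15", variables::put).execute();
System.out.println(variables.get("your_experience_in_years")); // 15, stored as a NumericValue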
First, we will introduce the CompositeStatement class which contains an internal list of statements to execute:
@Getter
public class CompositeStatement implements Statement {
private final List<Statement> statements2Execute = new ArrayList<>();
public void addStatement(Statement statement) {
if (statement != null)
statements2Execute.add(statement);
}
@Override
public void execute() {
statements2Execute.forEach(Statement::execute);
}
}
This class can be reused later when we create composite statement constructions. The first of these constructions is the condition statement. To describe the condition we can use literals, variables and more complex constructions implementing the Expression interface. During execution we evaluate the condition Expression, make sure the value is a LogicalValue and that the result inside is true; only in this case do we execute the inner statements:
@RequiredArgsConstructor
@Getter
public class ConditionStatement extends CompositeStatement {
private final Expression condition;
@Override
public void execute() {
Value<?> value = condition.evaluate();
if (value instanceof LogicalValue) {
if (((LogicalValue) value).getValue()) {
super.execute();
}
} else {
throw new ExecutionException(String.format("Cannot compare non logical value `%s`", value));
}
}
}
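A short sketch of the composite behavior: the inner statement only runs when the condition evaluates to a true LogicalValue:
ConditionStatement conditionStatement = new ConditionStatement(new LogicalValue(true));
conditionStatement.addStatement(new PrintStatement(new TextValue("condition met")));
conditionStatement.execute(); // prints: condition met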
Now that we have the statements to work with our lexemes, we can build the StatementParser. Inside the class we declare a list of tokens and a mutable token position variable:
public class StatementParser {
private final List<Token> tokens;
private int position;
public StatementParser(List<Token> tokens) {
this.tokens = tokens;
}
...
}
Then we create the parseExpression() method which will handle the provided tokens and return a full-fledged statement:
public Statement parseExpression() {
...
}
Our language expressions can only begin with a variable declaration or a keyword. Therefore, first we need to read a token and validate that it has the Variable or the Keyword token type. To handle this we declare a separate next() method with token type varargs as an argument, which validates that the next token has one of the given types. If it does, we increment the StatementParser's position field and return the found token:
public Statement parseExpression() {
Token token = next(TokenType.Keyword, TokenType.Variable);
...
}
private Token next(TokenType type, TokenType... types) {
TokenType[] tokenTypes = org.apache.commons.lang3.ArrayUtils.add(types, type);
if (position < tokens.size()) {
Token token = tokens.get(position);
if (Stream.of(tokenTypes).anyMatch(t -> t == token.getType())) {
position++;
return token;
}
}
Token previousToken = tokens.get(position - 1);
throw new SyntaxException(String.format("After `%s` declaration expected any of the following lexemes `%s`", previousToken, Arrays.toString(tokenTypes)));
}
After we have found the appropriate token, we can build our statement depending on the token type.
Each variable assignment starts with the Variable token type, which we have already read. The next token we expect is the equals `=` operator. You can overload the next() method so that it accepts a TokenType together with a String operator representation (e.g. `=`) and returns the next token only if it has the requested type and value:
public Statement parseExpression() {
Token token = next(TokenType.Keyword, TokenType.Variable);
switch (token.getType()) {
case Variable:
next(TokenType.Operator, "="); //skip equals
Expression value;
if (peek(TokenType.Keyword, "new")) {
value = readInstance();
} else {
value = readExpression();
}
return new AssignStatement(token.getValue(), value, variables::put);
}
...
}
After the equals operator, we read an expression. An expression can start with the `new` keyword when we instantiate a structure; we will cover this case a bit later. To check a token's type and value without incrementing the position or throwing an error, we create an additional peek() method that returns true if the next token is the one we expect:
...
private boolean peek(TokenType type, String value) {
if (position < tokens.size()) {
Token token = tokens.get(position);
return type == token.getType() && token.getValue().equals(value);
}
return false;
}
...
To create an AssignStatement instance we need to pass a BiConsumer object to the constructor. Let's go to the StatementParser field declarations and add a new variables Map which will store the variable name as a key and the variable value as a value:
public class StatementParser {
...
private final Map<String, Value<?>> variables;
public StatementParser(List<Token> tokens) {
...
this.variables = new HashMap<>();
}
...
}
First, we will cover reading a plain variable expression, which can be a literal, another variable or a more complex expression with operators. The readExpression() method will therefore return an Expression with the corresponding implementation. Inside this method, we declare the left Expression instance. An expression can start only with a literal or with another variable, thus we expect to get a Variable, Numeric, Logical or Text token type from our next() method. After reading a proper token we transform it into the appropriate Expression object inside the following switch block and assign it to the left Expression variable:
private Expression readExpression() {
Expression left;
Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
String value = token.getValue();
switch (token.getType()) {
case Numeric:
left = new NumericValue(Integer.parseInt(value));
break;
case Logical:
left = new LogicalValue(Boolean.valueOf(value));
break;
case Text:
left = new TextValue(value);
break;
case Variable:
default:
left = new VariableExpression(value, variables::get);
}
...
}
After the left variable or literal has been read, we expect to catch an Operator lexeme:
private Expression readExpression() {
Expression left;
Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
String value = token.getValue();
switch (token.getType()) {
...
}
Token operation = next(TokenType.Operator);
Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
...
}
Then we map our Operator lexeme to an OperatorExpression object. To do that you can use a switch block returning the proper OperatorExpression class depending on the token value. I will use enum constants with the appropriate OperatorExpression type:
@RequiredArgsConstructor
@Getter
public enum Operator {
Not("!", NotOperator.class),
Addition("+", AdditionOperator.class),
Subtraction("-", SubtractionOperator.class),
Equality("==", EqualsOperator.class),
GreaterThan(">", GreaterThanOperator.class),
LessThan("<", LessThanOperator.class),
StructureValue("::", StructureValueOperator.class);
private final String character;
private final Class<? extends OperatorExpression> operatorType;
public static Class<? extends OperatorExpression> getType(String character) {
return Arrays.stream(values())
.filter(t -> Objects.equals(t.getCharacter(), character))
.map(Operator::getOperatorType)
.findAny().orElse(null);
}
}
In the end we read the right literal or variable as we did for the left Expression. In order not to duplicate code, we extract the Expression reading into a separate nextExpression() method:
private Expression nextExpression() {
Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
String value = token.getValue();
switch (token.getType()) {
case Numeric:
return new NumericValue(Integer.parseInt(value));
case Logical:
return new LogicalValue(Boolean.valueOf(value));
case Text:
return new TextValue(value);
case Variable:
default:
return new VariableExpression(value, variables::get);
}
}
Let's refactor the readExpression() method and read the right lexeme using nextExpression(). Before we read the right lexeme, we need to be sure that our operator supports two operands, which we can check by testing whether the operator extends BinaryOperatorExpression. Otherwise, if the operator is unary, we create the OperatorExpression using only the left expression. To create an OperatorExpression object we retrieve the suitable constructor for the unary or binary operator implementation and then create an instance with the previously obtained expressions:
@SneakyThrows
private Expression readExpression() {
Expression left = nextExpression();
Token operation = next(TokenType.Operator);
Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
Expression right = nextExpression();
return operatorType
.getConstructor(Expression.class, Expression.class)
.newInstance(left, right);
} else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
return operatorType
.getConstructor(Expression.class)
.newInstance(left);
}
return left;
}
Additionally, we can support long expressions with multiple operators, or with no operators at all (just a single literal or variable). Let's enclose the operator reading in a while loop with the condition that the next lexeme is an Operator. Each time we create an OperatorExpression, we assign it to the left expression, thus creating a tree of subsequent operators inside one Expression until we have read the whole expression. In the end we return the left expression:
@SneakyThrows
private Expression readExpression() {
Expression left = nextExpression();
//recursively read an expression
while (peek(TokenType.Operator)) {
Token operation = next(TokenType.Operator);
Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
Expression right = nextExpression();
left = operatorType
.getConstructor(Expression.class, Expression.class)
.newInstance(left, right);
} else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
left = operatorType
.getConstructor(Expression.class)
.newInstance(left);
}
}
return left;
}
private boolean peek(TokenType type) {
if (position < tokens.size()) {
Token token = tokens.get(position);
return token.getType() == type;
}
return false;
}
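Note that operators are folded strictly left to right; there is no precedence handling in this toy parser. For the tokens `2 - 1 + 3` the loop effectively builds the following tree (a sketch, assuming the two-argument Expression constructors shown earlier):
Expression left = new NumericValue(2);
left = new SubtractionOperator(left, new NumericValue(1)); // (2 - 1)
left = new AdditionOperator(left, new NumericValue(3));    // (2 - 1) + 3
System.out.println(left.evaluate()); // 4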
After completing the readExpression() implementation we can go back to parseExpression() and finish the readInstance() implementation that instantiates a structure.
According to our language's semantics, structure instantiation starts with the `new` keyword, so we skip that token by calling the next() method. The next lexeme signifies the structure name; we read it as a Variable token type. After the structure name we expect to receive arguments inside square brackets acting as group dividers. In some cases a structure can be created without any arguments at all, therefore we use peek() first. Each argument passed to the structure can be an expression, thus we call readExpression() and add the result to the arguments list. After collecting the structure arguments we can build our StructureExpression, first retrieving the appropriate StructureDefinition:
private Expression readInstance() {
next(TokenType.Keyword, "new"); //skip new
Token type = next(TokenType.Variable);
List<Expression> arguments = new ArrayList<>();
if (peek(TokenType.GroupDivider, "[")) {
next(TokenType.GroupDivider, "["); //skip open square bracket
while (!peek(TokenType.GroupDivider, "]")) {
Expression value = readExpression();
arguments.add(value);
}
next(TokenType.GroupDivider, "]"); //skip close square bracket
}
StructureDefinition definition = structures.get(type.getValue());
if (definition == null) {
throw new SyntaxException(String.format("Structure is not defined: %s", type.getValue()));
}
return new StructureExpression(definition, arguments, variables::get);
}
To retrieve the StructureDefinition by name we must declare a structures map as a StatementParser field; we will fill it later during the `struct` keyword analysis:
public class StatementParser {
...
private final Map<String, StructureDefinition> structures;
public StatementParser(List<Token> tokens) {
...
this.structures = new HashMap<>();
}
...
}
Now we can continue working on the parseExpression() method for the case when we receive a Keyword lexeme:
public Statement parseExpression() {
...
switch (token.getType()) {
case Variable:
...
case Keyword:
switch (token.getValue()) {
case "print":
case "input":
case "if":
case "struct":
}
...
}
}
Our language expressions can start with the `print`, `input`, `if` and `struct` keywords.
To read the value to print we call the already created readExpression() method. It will read a literal, variable or more complex Expression implementation, as we did for the variable assignment. Then we create and return an instance of the PrintStatement:
...
case "print":
Expression expression = readExpression();
return new PrintStatement(expression);
...
For the input statement, we need to know the variable name we assign the value to. Therefore, we ask the next() method to catch the next Variable token for us:
...
case "input":
Token variable = next(TokenType.Variable);
return new InputStatement(variable.getValue(), scanner::nextLine, variables::put);
...
To create an InputStatement instance we introduce a Scanner object in the StatementParser field declarations, which will help us read a line from the console:
public class StatementParser {
...
private final Scanner scanner;
public StatementParser(List<Token> tokens) {
...
this.scanner = new Scanner(System.in);
}
...
}
The next statement we will cover is the if/then condition.
When we receive the `if` keyword, we first read the condition expression by calling readExpression(). Then, according to our language semantics, we need to catch the `then` keyword. Inside the condition we can declare other statements, including other condition statements. Therefore, we recursively add parseExpression() results to the condition's inner statements until we read the finalizing `end` keyword:
...
case "if":
Expression condition = readExpression();
next(TokenType.Keyword, "then"); //skip then
ConditionStatement conditionStatement = new ConditionStatement(condition);
while (!peek(TokenType.Keyword, "end")) {
Statement statement = parseExpression();
conditionStatement.addStatement(statement);
}
next(TokenType.Keyword, "end"); //skip end
return conditionStatement;
...
Structure declaration is quite simple because it consists only of Variable and Keyword token types. After reading the `struct` keyword we expect to read the structure name as a Variable token type. Then we read the arguments until we reach the `end` keyword. Finally, we build our StructureDefinition and put it into the structures map so it can be accessed later when we create a structure instance:
...
case "struct":
Token type = next(TokenType.Variable);
Set<String> args = new LinkedHashSet<>(); //preserve the declared argument order
while (!peek(TokenType.Keyword, "end")) {
next(TokenType.Keyword, "arg");
Token arg = next(TokenType.Variable);
args.add(arg.getValue());
}
next(TokenType.Keyword, "end"); //skip end
structures.put(type.getValue(), new StructureDefinition(type.getValue(), new ArrayList<>(args)));
return null;
...
The complete parseExpression() method:
...
public Statement parseExpression() {
Token token = next(TokenType.Keyword, TokenType.Variable);
switch (token.getType()) {
case Variable:
next(TokenType.Operator, "="); //skip equals
Expression value;
if (peek(TokenType.Keyword, "new")) {
value = readInstance();
} else {
value = readExpression();
}
return new AssignStatement(token.getValue(), value, variables::put);
case Keyword:
switch (token.getValue()) {
case "print":
Expression expression = readExpression();
return new PrintStatement(expression);
case "input":
Token variable = next(TokenType.Variable);
return new InputStatement(variable.getValue(), scanner::nextLine, variables::put);
case "if":
Expression condition = readExpression();
next(TokenType.Keyword, "then"); //skip then
ConditionStatement conditionStatement = new ConditionStatement(condition);
while (!peek(TokenType.Keyword, "end")) {
Statement statement = parseExpression();
conditionStatement.addStatement(statement);
}
next(TokenType.Keyword, "end"); //skip end
return conditionStatement;
case "struct":
Token type = next(TokenType.Variable);
Set<String> args = new LinkedHashSet<>(); //preserve the declared argument order
while (!peek(TokenType.Keyword, "end")) {
next(TokenType.Keyword, "arg");
Token arg = next(TokenType.Variable);
args.add(arg.getValue());
}
next(TokenType.Keyword, "end"); //skip end
structures.put(type.getValue(), new StructureDefinition(type.getValue(), new ArrayList<>(args)));
return null;
}
default:
throw new SyntaxException(String.format("Statement can't start with the following lexeme `%s`", token));
}
}
...
To find and accumulate all statements we create the parse() method, which parses all expressions from the given tokens and returns a CompositeStatement instance:
...
public Statement parse() {
CompositeStatement root = new CompositeStatement();
while (position < tokens.size()) {
Statement statement = parseExpression();
root.addStatement(statement);
}
return root;
}
...
We have finished with the lexical and syntax analyzers. Now we can gather both implementations into the ToyLanguage class and finally run our language:
public class ToyLanguage {
@SneakyThrows
public void execute(Path path) {
String source = Files.readString(path);
LexicalParser lexicalParser = new LexicalParser(source);
List<Token> tokens = lexicalParser.parse();
StatementParser statementParser = new StatementParser(tokens);
Statement statement = statementParser.parse();
statement.execute();
}
}
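To try it out, save the sample program from the beginning of the article to a file and pass its path to execute(). A minimal sketch - the file name here is arbitrary:
import java.nio.file.Path;
public class Main {
public static void main(String[] args) {
// "person.toy" is just an example name for a file containing the toy-language source shown earlier
new ToyLanguage().execute(Path.of("person.toy"));
}
}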
In this tutorial, we built our own language with lexical and syntax analysis. I hope this article will be useful to someone. I highly recommend trying to write your own language, even though you have to work through a lot of implementation details. It is an educational, self-improving, and interesting experiment!