Welcome to the third article of the Let’s Build a Programming Language (LBPL) series. If this is the first article you’re reading, LBPL is a series that takes the reader from 0 to 1 in implementing a programming language. You can check the second article about compilers and interpreters here and the general introduction here.
The lexer, also called lexical analyzer or tokenizer, is a program that breaks down the input source code into a sequence of lexemes. It reads the input source code character by character, recognizes the lexemes and outputs a sequence of tokens describing the lexemes.
What’s a lexeme?
A lexeme is a single identifiable sequence of characters, for example, a keyword (such as while), a literal (such as a number or a string), an identifier, an operator, or a punctuation character (such as a parenthesis, a comma or a semicolon).
What’s a token?
A token is an object describing the lexeme. A token has a type (e.g. Keyword, Identifier, Number, or Operator) and a value (the actual characters of the described lexeme). A token can also contain other information such as the line and column numbers where the lexeme was encountered in the source code.
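A token can be modeled as a small class. The exact property names below (type, value, line, column) are illustrative, not prescribed by the article:

```javascript
// A token bundles the type of a lexeme, its characters and its position.
class Token {
  constructor(type, value, line, column) {
    this.type = type;     // e.g. "Identifier", "Number" or "Operator"
    this.value = value;   // the actual characters of the lexeme
    this.line = line;     // line where the lexeme starts in the source
    this.column = column; // column where the lexeme starts
  }
}

const token = new Token("Identifier", "count", 1, 5);
```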
The lexer in code
A lexer can be implemented as a class, whose constructor takes an input string as a parameter (representing the source code to perform lexical analysis on). It exposes a method to recognize and return the next token in the input.
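In JavaScript, for instance, such a class could look like the sketch below; nextToken() is filled in progressively through the rest of the article:

```javascript
// Skeleton of the lexer: the constructor stores the source code and
// nextToken() will recognize and return the next token in the input.
class Lexer {
  constructor(input) {
    this.input = input; // the source code to perform lexical analysis on
    this.position = 0;  // index of the next character to read
  }

  nextToken() {
    // Implemented in the following sections.
    return null;
  }
}

const lexer = new Lexer("answer = 42");
```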
How tokens are recognized
All the possible lexemes that can appear in code written in a programming language are described in the specification of that programming language as a set of rules called the lexical grammar. Rules in the lexical grammar are often transformed into automata called finite state machines (FSM). The lexer then simulates the finite state machines to recognize the tokens.
In the next sections of this article, we will study the rules making up the lexical grammar of a programming language, finite state machines, how to transform the rules in the lexical grammar into finite state machines, and how the resulting finite state machines are implemented and used in the nextToken() method of the Lexer class to recognize the next token.
The lexical grammar
The lexical grammar of a programming language is a set of formal rules that govern how valid lexemes in that programming language are constructed. For example, the rules can state that a string is any sequence of characters enclosed in double-quotes or that an identifier may not start with a digit. The rules in the lexical grammar are often expressed with a set of regular definitions.
A regular definition is of the form <element_name> = <production_rule>, where <element_name> is the name given to a symbol or a lexeme that can be encountered in the programming language and <production_rule> is a regular expression describing that symbol or lexeme.
For example, the regular definition letter = [a-z] | [A-Z] defines a letter as any lowercase or uppercase alphabet character.
A regular definition can make use, in its regular expression, of any element name defined in the same lexical grammar.
As an example, consider the regular definition identifier = (letter | _) (letter | digit | _)*. It reuses the definitions letter and digit in its production rule, as if letter and digit were symbols, to define an identifier as any string starting with a letter or an underscore, followed by zero or more occurrences of a letter, a digit or an underscore.
Finite State Machines
A Finite State Machine or FSM is an abstract machine that is in one and only one state at any point in time. The FSM can change from one state to another as a result of an event. Changing from a state to another is called a transition. To better understand this, let’s consider the following example.
A light bulb can be thought of as a FSM. A light bulb can be in only one of two states at any point in time, ON or OFF. The light bulb transitions from the state ON to the state OFF by the press of a switch and transitions from OFF back to ON by the press of the same switch.
FSMs are often represented with state diagrams.
Lexical Grammar and FSMs
To recognize a token described by a regular definition, the regular expression in the definition is often transformed into a FSM. The resulting FSM has a finite number of states comprising an initial state and a set of accepting states.
The FSM moves from one state to another by consuming one of the characters or elements in the regular expression. Following the transitions from the initial state to one of the accepting states yields a valid string described by the regular expression.
For example, the regular expression a | b can be converted into the following FSM.
The above FSM has two states labeled 1 and 2. The arrow pointing to 1 and coming out from nowhere indicates that 1 is the initial state and the inner circle on 2 indicates that 2 is an accepting state of this FSM.
From 1 to 2, we can either follow the top transition by consuming the character a, yielding the string a, or follow the bottom transition by consuming the character b, yielding the string b. a and b are effectively the two possible strings described by the regular expression a | b.
Following the transitions from the initial state 1 to the accepting state 6 on the above FSM can yield only one string, Blink.
From regular expression to FSM
We can transform any regular expression into a FSM by building on top of the following three basic rules.
A | B
The regular expression A | B is represented by a FSM with two states. From the state 1, we can either consume A and move to the accepting state 2 or consume B and also move to the accepting state 2.
AB
The concatenation AB is represented by a FSM with three states. From 1, we first move to 2 by consuming A, then we move to the accepting state 3 by consuming B.
A*
A* is represented by a FSM with only one state, which is both the initial and the accepting state, with a transition to itself, creating a loop. From 1, we can either go nowhere because 1 is also the accepting state, thus yielding the empty string, or we can follow the transition by consuming A, which leads back to 1; again, we can either stop or follow the transition once more. That will generate the strings A, AA, ..., AAA... and the empty string.
Any other regular expression can be transformed into a FSM by reusing one or any combination of the basic rules above.
Let’s take a look at some examples.
R = (a|b)c
R is a|b followed by c. First, the FSM for a|b is
Then, the concatenation to c is represented by a transition from the state 2 to a new accepting state.
The two possible strings that can be generated by simulating this FSM are ac and bc.
R = (a|b)*c
First the FSM for (a|b)* is a loop with two options a and b at each iteration.
Then we add a new transition to concatenate c.
Simulating or running the above FSM can yield such strings as c, ac, bc, abc or babc: any string of as and bs followed by a single c.
R = a(bc)*
a(bc)* is the character a followed by zero or more repetitions of the concatenation bc. The FSM for a is simple.
The FSM for (bc)* would be represented with a loop on bc.
Concatenating the above two FSMs will give us the FSM for a(bc)*.
Running this FSM, we have to consume a to move from the state 1 to the accepting state 2. At 2, we can either stop and yield the string a, or consume b to move to the state 3. From 3, we have no choice but to consume c to go back to the accepting state 2; again at 2, we can either stop or go back to 3 by consuming b, and the loop goes on. The possible strings that can be generated by this FSM are a, abc, abcbc, abcbcbc and so on.
As a final example, let’s take a look at the FSM corresponding to a regular definition that could describe identifiers in a programming language.
First, the FSM for letter | _ is a basic A | B FSM.
Then the FSM for (letter | digit | _)* will be a loop with 3 different options at each iteration.
We get the final FSM for identifier by concatenating the above 2 FSMs.
The FSM in code
A FSM is a combination of
- the set of all the possible states the FSM can be in
- the initial state the FSM is in
- a set of accepting states
- and the set of all transitions.
We can start by implementing a FSM as a class with properties for the states, the initial state and the accepting states.
The transitions of a FSM can be modeled with a function that takes a state currentState and a character or symbol input as parameters and returns the state the FSM will be in after consuming input while in the state currentState. We can call that function the transition function and name it nextState.
Most often, the transition function will be a switch statement on the parameter currentState, with each case returning the next state according to the parameter input.
To assist in the implementation of the transition function, the FSM can first be converted into a transition table. The transition table maps each state S and input I to a state S’, where S’ is the state the FSM will be in when the input I is consumed from the state S.
As an example, let’s consider the FSM below.
The corresponding transition table is
The rows of the transition table are labeled with the states of the FSM and the columns with all the characters or elements that can possibly be consumed. Each cell of the table contains the state the FSM will be in, if the character at the corresponding column is consumed while the FSM is in the state at the corresponding row. A cell containing — means that there is no transition in the FSM for the corresponding state and character.
The transition table provides a quick visual reference when writing a transition function. The transition function for the FSM above will be:
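The original diagram and table are not reproduced here, so as a stand-in, here is a transition function for a small hypothetical FSM recognizing ab* (state 1 initial, state 2 accepting). The sentinel NO_NEXT_STATE, an assumption of this sketch, plays the role of the — cells of the table:

```javascript
// Transition function for a hypothetical FSM recognizing ab*:
// state 1 --a--> state 2, state 2 --b--> state 2.
const NO_NEXT_STATE = -1; // returned when the table has no transition

function nextState(currentState, input) {
  switch (currentState) {
    case 1:
      return input === "a" ? 2 : NO_NEXT_STATE;
    case 2:
      return input === "b" ? 2 : NO_NEXT_STATE;
    default:
      return NO_NEXT_STATE;
  }
}
```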
Now, to complete the constructor of our FSM class, let’s add to it a parameter nextState of type Function, representing the transition function.
The next step in the implementation of our FSM is to add a function allowing us to run, simulate or execute the FSM on an input string. The function will return a boolean specifying whether the input string (or a prefix of the input string) matches the regular expression corresponding to the FSM.
The implementation of the run function is straightforward. The function reads the input character by character while keeping track of the current state the FSM is in. For each character read, it updates the current state with the next state the FSM will be in, by calling the transition function nextState(). At the end of the loop, if the current state is one of the accepting states of the FSM, then the input string (or a prefix of it) matches the regular expression corresponding to the FSM.
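Putting the pieces together, a minimal FSM class in that spirit might look like this; names such as acceptingStates and NO_NEXT_STATE are assumptions of the sketch:

```javascript
const NO_NEXT_STATE = -1; // sentinel returned when no transition exists

class FSM {
  constructor(states, initialState, acceptingStates, nextState) {
    this.states = states;                   // all the possible states
    this.initialState = initialState;       // state the FSM starts in
    this.acceptingStates = acceptingStates; // a Set of accepting states
    this.nextState = nextState;             // the transition function
  }

  // Runs the FSM on the input; returns true if the consumed prefix of
  // the input ends in an accepting state.
  run(input) {
    let currentState = this.initialState;
    for (const character of input) {
      const next = this.nextState(currentState, character);
      if (next === NO_NEXT_STATE) break; // no transition: stop reading
      currentState = next;
    }
    return this.acceptingStates.has(currentState);
  }
}

// FSM for the regular expression a | b.
const aOrB = new FSM([1, 2], 1, new Set([2]), (state, ch) =>
  state === 1 && (ch === "a" || ch === "b") ? 2 : NO_NEXT_STATE
);

aOrB.run("a"); // true
aOrB.run("c"); // false
```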
Usage of a FSM
To conclude this part on FSMs, let’s see how we can use our newly implemented FSM class to recognize identifiers.
Let’s reuse the following regular definitions.
Below is the corresponding FSM.
And the corresponding transition table.
Now we can create a new instance of our FSM class and configure it to recognize identifiers. The FSM instance can then be used to recognize identifiers.
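A sketch of that configuration, bundling a minimal FSM class (run returns whether the consumed prefix of the input ends in an accepting state); the helpers isLetter and isDigit are assumptions:

```javascript
const NO_NEXT_STATE = -1;

class FSM {
  constructor(states, initialState, acceptingStates, nextState) {
    this.states = states;
    this.initialState = initialState;
    this.acceptingStates = acceptingStates; // a Set of states
    this.nextState = nextState;             // (state, character) -> state
  }

  run(input) {
    let state = this.initialState;
    for (const character of input) {
      const next = this.nextState(state, character);
      if (next === NO_NEXT_STATE) break;
      state = next;
    }
    return this.acceptingStates.has(state);
  }
}

// identifier = (letter | _) (letter | digit | _)*
const isLetter = (ch) => /[a-zA-Z]/.test(ch);
const isDigit = (ch) => /[0-9]/.test(ch);

const fsm = new FSM([1, 2], 1, new Set([2]), (state, ch) => {
  switch (state) {
    case 1: // initial: must start with a letter or an underscore
      return isLetter(ch) || ch === "_" ? 2 : NO_NEXT_STATE;
    case 2: // accepting: letters, digits and underscores may follow
      return isLetter(ch) || isDigit(ch) || ch === "_" ? 2 : NO_NEXT_STATE;
    default:
      return NO_NEXT_STATE;
  }
});

fsm.run("_count1"); // true
fsm.run("1count"); // false: identifiers may not start with a digit
```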
Note that the call fsm.run("lisp-case-identifier") will return true even though lisp-case-identifier is not a valid identifier, since - is not present in the regular expression. It returns true because the prefix lisp is a valid Blink identifier. In the last part of this article, when working on your own lexer, you will have to update our FSM implementation so that the run method returns, in addition to a boolean, the prefix of the input that matched the regular expression.
Putting it all together
Armed with all the necessary tools (lexical grammar, regular expressions, FSMs, transition tables, etc.), we can now have a look at how they all come together in the implementation of a lexer.
Let’s consider a simple language for performing mathematical operations. The language supports the four basic arithmetic operators (+, -, * and /), comparison operators (>, >=, <, <= and ==), and grouping using parentheses. It also has basic support for variables and assignment using the = symbol.
Below are examples of valid instructions in our mini language.
The valid lexemes for this language (identifiers, numbers, parentheses, and operators) are described with the following regular definitions.
The first step in implementing a lexer is to add a token type for each valid lexeme.
Because there is a finite number of operators and parentheses, we will gain in clarity and granularity by adding a specific token type for each kind of operator and parenthesis. It will also be helpful to add a special token EndOfInput to be returned when all the characters in the input have been read.
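One way to declare these token types in JavaScript is a plain object of string constants; the exact names below are assumptions:

```javascript
// Token types for the mini language; one entry per kind of lexeme.
const TokenType = {
  Identifier: "Identifier",
  Number: "Number",
  // One token type per operator, for clarity and granularity.
  Plus: "Plus", Minus: "Minus", Times: "Times", Div: "Div",
  GreaterThan: "GreaterThan", GreaterThanOrEqual: "GreaterThanOrEqual",
  LessThan: "LessThan", LessThanOrEqual: "LessThanOrEqual",
  Equal: "Equal", Assign: "Assign",
  LeftParenthesis: "LeftParenthesis", RightParenthesis: "RightParenthesis",
  // Returned once every character of the input has been read.
  EndOfInput: "EndOfInput",
};
```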
The next step is to complete the implementation of the Lexer class.
The Lexer class
Let’s start by adding properties to the Lexer class to keep track of the current position in the input, as well as the current line and column.
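A sketch of that constructor; the property names are illustrative:

```javascript
class Lexer {
  constructor(input) {
    this.input = input; // the source code to tokenize
    this.position = 0;  // current position in the input
    this.line = 1;      // current line, used to annotate tokens
    this.column = 1;    // current column
  }
}

const lexer = new Lexer("a = 1");
```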
Now, let’s implement the nextToken() method.
A strategy we could use for nextToken() is to read the character at the current position. If the character matches the starting character in the production rule of one of the lexemes, we delegate the recognition of the lexeme to a helper method corresponding to that production rule.
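That dispatch can be sketched as follows. The helper method names (recognizeParenthesis and friends) are assumptions, and each helper is stubbed out here just so the sketch runs:

```javascript
class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    this.line = 1;
    this.column = 1;
  }

  nextToken() {
    if (this.position >= this.input.length) {
      return { type: "EndOfInput", value: "" };
    }
    const character = this.input[this.position];
    // Dispatch on the first character of the lexeme.
    if (character === "(" || character === ")") {
      return this.recognizeParenthesis();
    }
    if (/[0-9]/.test(character)) {
      return this.recognizeNumber();
    }
    if (/[a-zA-Z_]/.test(character)) {
      return this.recognizeIdentifier();
    }
    if ("+-*/<>=".includes(character)) {
      return this.recognizeOperator();
    }
    throw new Error(`Unexpected character "${character}" at line ${this.line}, column ${this.column}`);
  }

  // Stubs standing in for the helpers developed below.
  recognizeParenthesis() { return { type: "Parenthesis" }; }
  recognizeNumber() { return { type: "Number" }; }
  recognizeIdentifier() { return { type: "Identifier" }; }
  recognizeOperator() { return { type: "Operator" }; }
}
```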
Let’s take a look at the helper methods from the simplest to the most complex.
Recognizing parentheses is straightforward. We just check whether the current character is ( or ) and return the appropriate token. We also increment the current position in the input, as well as the current column.
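A sketch of that helper; the method name and token shape are assumptions:

```javascript
class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    this.line = 1;
    this.column = 1;
  }

  recognizeParenthesis() {
    const character = this.input[this.position];
    const token = {
      type: character === "(" ? "LeftParenthesis" : "RightParenthesis",
      value: character,
      line: this.line,
      column: this.column,
    };
    this.position += 1; // move past the parenthesis
    this.column += 1;
    return token;
  }
}

const lexer = new Lexer("(a)");
lexer.recognizeParenthesis(); // { type: "LeftParenthesis", value: "(", line: 1, column: 1 }
```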
In the definition of the operator lexeme, we can notice that we basically have 2 types of operators: arithmetic and comparison operators (technically we have a third one, the assignment operator =, but to simplify the implementation, we’ll add it to the comparison operators group here).
For readability, we can delegate the recognition of each type of operator to a specific helper function.
Lookahead. The implementation of recognizeComparisonOperator() makes use of a variable named lookahead. Because an operator can, for example, be >=, once we read the character >, we need to know what the next character is before deciding what kind of operator we have. If the next character is =, the recognized operator is >=; if it is any other character, the recognized operator is >. That is the purpose of the lookahead variable: to literally look ahead in the input.
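A sketch of recognizeComparisonOperator() using such a lookahead; the per-operator token types match the ones suggested earlier, and the mapping object is an assumption:

```javascript
class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    this.line = 1;
    this.column = 1;
  }

  recognizeComparisonOperator() {
    const character = this.input[this.position];
    // Look one character ahead: ">" may be the start of ">=", etc.
    const lookahead = this.input[this.position + 1]; // undefined at end of input
    const value = lookahead === "=" ? character + lookahead : character;
    const types = {
      ">": "GreaterThan", ">=": "GreaterThanOrEqual",
      "<": "LessThan", "<=": "LessThanOrEqual",
      "=": "Assign", "==": "Equal",
    };
    const token = { type: types[value], value, line: this.line, column: this.column };
    this.position += value.length;
    this.column += value.length;
    return token;
  }
}

const lexer = new Lexer(">= 2");
lexer.recognizeComparisonOperator(); // { type: "GreaterThanOrEqual", value: ">=", ... }
```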
We could build a FSM for this and use it to recognize identifiers, but these rules are simple enough to be implemented with a loop. We just have to keep reading characters in the input until we encounter a character that is not a letter, a digit or an underscore.
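A loop-based recognizeIdentifier() in that spirit; the method name and token shape are assumed:

```javascript
class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    this.line = 1;
    this.column = 1;
  }

  recognizeIdentifier() {
    let identifier = "";
    // Keep reading while the character is a letter, a digit or "_".
    while (this.position + identifier.length < this.input.length) {
      const character = this.input[this.position + identifier.length];
      if (!/[a-zA-Z0-9_]/.test(character)) break;
      identifier += character;
    }
    const token = {
      type: "Identifier",
      value: identifier,
      line: this.line,
      column: this.column,
    };
    this.position += identifier.length;
    this.column += identifier.length;
    return token;
  }
}

const lexer = new Lexer("count1 + 2");
lexer.recognizeIdentifier(); // { type: "Identifier", value: "count1", ... }
```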
According to the regular definitions above, a number is a succession of one or more digits. The digits can be followed by a fractional part (described by the fraction production rule), an exponent part (described by the exponent production rule), or both. Both the fractional and exponent parts are optional, as indicated by the | ε in their regular expressions.
Some examples of valid numbers are 42, 3.14 and 6.6262e-34. The regular expressions describing a number are complex enough to dissuade us from trying to recognize numbers manually like we did for identifiers. A FSM will greatly ease up the implementation here.
The regular expression for a number can be transformed into the FSM above and below is the corresponding transition table.
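The diagram and table are not reproduced here, but the FSM they describe can be sketched directly in code. The state names and the matchNumber helper below are assumptions; matchNumber returns the longest accepted prefix instead of a boolean, since that is what a number recognizer needs:

```javascript
const NO_NEXT_STATE = -1;
const isDigit = (ch) => ch >= "0" && ch <= "9";

// States for: number = digit+ fraction? exponent?
const State = {
  Initial: 1, Integer: 2, BeginFraction: 3, Fraction: 4,
  BeginExponent: 5, ExponentSign: 6, Exponent: 7,
};

function nextState(state, ch) {
  switch (state) {
    case State.Initial:
      return isDigit(ch) ? State.Integer : NO_NEXT_STATE;
    case State.Integer:
      if (isDigit(ch)) return State.Integer;
      if (ch === ".") return State.BeginFraction;
      if (ch === "e" || ch === "E") return State.BeginExponent;
      return NO_NEXT_STATE;
    case State.BeginFraction:
      return isDigit(ch) ? State.Fraction : NO_NEXT_STATE;
    case State.Fraction:
      if (isDigit(ch)) return State.Fraction;
      if (ch === "e" || ch === "E") return State.BeginExponent;
      return NO_NEXT_STATE;
    case State.BeginExponent:
      if (isDigit(ch)) return State.Exponent;
      if (ch === "+" || ch === "-") return State.ExponentSign;
      return NO_NEXT_STATE;
    case State.ExponentSign:
      return isDigit(ch) ? State.Exponent : NO_NEXT_STATE;
    case State.Exponent:
      return isDigit(ch) ? State.Exponent : NO_NEXT_STATE;
    default:
      return NO_NEXT_STATE;
  }
}

const accepting = new Set([State.Integer, State.Fraction, State.Exponent]);

// Runs the FSM and returns the longest accepted prefix, or null.
function matchNumber(input) {
  let state = State.Initial;
  let matched = null;
  for (let i = 0; i < input.length; i++) {
    state = nextState(state, input[i]);
    if (state === NO_NEXT_STATE) break;
    if (accepting.has(state)) matched = input.slice(0, i + 1);
  }
  return matched;
}

matchNumber("6.6262e-34"); // "6.6262e-34"
matchNumber("42abc");      // "42"
```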
With the transition table, we can build a FSM and use it in our recognizeNumber() method. With the completion of recognizeNumber(), the lexer for our little language for performing mathematical operations is almost complete.
For our lexer to be fully complete, we need to update our nextToken() method to ignore white spaces (so that 1 + 2, 1+ 2 or 1+2, for example, all yield the same sequence of tokens) and to handle errors.
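Ignoring white spaces can be done with a small helper called at the top of nextToken(); the name skipWhitespaces is an assumption:

```javascript
class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    this.line = 1;
    this.column = 1;
  }

  // Advances past spaces, tabs and newlines, keeping line/column in sync.
  skipWhitespaces() {
    while (this.position < this.input.length) {
      const character = this.input[this.position];
      if (character === "\n") {
        this.line += 1;
        this.column = 1;
      } else if (character === " " || character === "\t" || character === "\r") {
        this.column += 1;
      } else {
        break; // not a white space: stop skipping
      }
      this.position += 1;
    }
  }
}

const lexer = new Lexer("  \n  x");
lexer.skipWhitespaces();
```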
This completes our Lexer implementation. We can easily get all the tokens in an input by repeatedly calling nextToken() until an EndOfInput token is returned.
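That driver loop can be sketched as follows; StubLexer stands in for the full Lexer, just to make the sketch self-contained:

```javascript
// Collects every token by calling nextToken() until EndOfInput.
function allTokens(lexer) {
  const tokens = [];
  let token = lexer.nextToken();
  while (token.type !== "EndOfInput") {
    tokens.push(token);
    token = lexer.nextToken();
  }
  return tokens;
}

// Stand-in lexer yielding one token per non-space character.
class StubLexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
  }

  nextToken() {
    while (this.input[this.position] === " ") this.position += 1;
    if (this.position >= this.input.length) {
      return { type: "EndOfInput", value: "" };
    }
    return { type: "Character", value: this.input[this.position++] };
  }
}

allTokens(new StubLexer("a + b")); // tokens for "a", "+" and "b"
```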
Test Your Might!
Congratulations on making it this far in the article. It’s time for you, if you’re up for it, to complete your first challenge by writing a lexer for Blink.
Follow this link for the instructions on how to get the challenge and how to complete it.
Feel free to ping me on Twitter @ftchirou if you have any question, a suggestion, a need for a clarification (or just to say hi) while reading this article or while working through the challenge. :)
Thanks for reading.
This is a quick introduction to regular expressions, presenting the concepts necessary for getting a grasp of the lexical grammar discussed in this article.
A regular expression, simply put, is a rule that describes all the strings that can be built from a set of basic characters/symbols.
The simplest regular expression possible is the exact string being described. For example, let R be the regular expression lang, R = lang. R describes strings whose first character is l, followed by a, followed by n, followed by g. There is only one such string, and that string is lang.
To describe more complex strings, we make use of regular expression operators.
Union. The | operator is used to specify union or alternatives. For example, for R = a | b, a is a valid string described by R, and so is b. | can be called the OR operator.
Concatenation. Concatenation is implied in the absence of an operator between characters/symbols. For example, R = ab describes the unique string ab.
Zero or more occurrences. A postfix * is used to specify that the element it’s applied to can be repeated zero or multiple times. For example, R = a* describes strings such as a, aa, aaaaa... and the empty string. * is more formally known as the Kleene closure, named after Stephen Cole Kleene, who helped formalize the concept of regular expressions.
Grouping. Just like in mathematical expressions, parentheses () are used for grouping. For example, R = (a|b)*c concatenates (a|b)* to c and describes strings such as c, ac, bc, abc, aababbc and so on.
Character classes. Character classes can be used to shorten long OR regular expressions. For example, the regular expression a | b | c | d can be replaced by the character class [abcd]. If all the characters in a character class form a logical sequence, the character class can be abbreviated further using a range notation [a1-an] where a1 is the first element of the sequence and an the last. For example, 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 can be converted to the character class [0123456789], which can be abbreviated further as [0-9].
Empty string. The empty string is described by ε (epsilon). For example, R = a | ε describes the string a or the empty string.
The Kleene Closure (*) has the highest precedence, followed by the concatenation operator. The union operator (|) has the lowest precedence.
For example, in the regular expression ab*|c, b* is evaluated first, then ab*, and finally the union with c. Rewritten with parentheses, that regular expression is equivalent to ((a(b*))|c).
This is all we need to know about regular expressions for the purpose of this article. Now back to the lexical grammar.
You’ve reached the end. 🎉