**Register Now!**

35,999 reads

by faical bahsisMarch 5th, 2017

Welcome to the third article of the *Let’s Build a Programming Language (LBPL)* series. If this is the first article you’re reading in the series, LBPL is a series of articles that takes the reader from 0 to 1 in implementing a programming language. You can check the second article about compilers and interpreters here and the general introduction here.

This article will describe how to build the first phase of a compiler, **the lexer**. At the end of the article, you will get your hands dirty with a challenge: build a lexer for Blink.

The **lexer**, also called **lexical analyzer** or **tokenizer**, is a program that breaks down the input source code into a sequence of lexemes. It reads the input source code character by character, recognizes the lexemes and outputs a sequence of tokens describing the lexemes.

A **lexeme** is a single identifiable sequence of characters, for example, keywords (such as `class`

, `func`

, `var`

, and `while`

), literals (such as numbers and strings), identifiers, operators, or punctuation characters (such as `{`

, `(`

, and `.`

).

A **token** is an object describing the *lexeme*. A token has a **type** (e.g. Keyword, Identifier, Number, or Operator) and a **value** (the actual characters of the described lexeme_)_. A token can also contain other information such as the line and column numbers where the lexeme was encountered in the source code.

A **lexer** can be implemented as a class, whose constructor takes an input string in parameter (representing the source code to perform lexical analysis on). It exposes a method to recognize and return the next token in the input.

All possible lexemes that can appear in code written in a programming language, are described in the specification of that programming language as a set of rules called **lexical grammar**. Rules in the lexical grammar are often transformed into automata called **finite state machines** (*FSM*). The lexer then simulates the finite state machines to recognize the tokens.

In the next sections of this article, we will study the rules making up the lexical grammar of a programming language, finite state machines, how to transform the rules in the lexical grammar into finite state machines, and how the resulting finite state machines are implemented and used in the `nextToken()`

method of the `Lexer`

class to recognize the next token.

In the later sections, we will build a lexer for a mini language and you will be introduced to a challenge where you will have to write your own lexer for Blink.

The lexical grammar of a programming language is a set of formal rules that govern how valid lexemes in that programming language are constructed. For example, the rules can state that a string is any sequence of characters enclosed in double-quotes or that an identifier may not start with a digit. The rules in the lexical grammar are often expressed with a set of **regular definitions**.

A regular definition is of the form `<element_name> = <production_rule>`

where `<element_name>`

is the name given to a symbol or a lexeme that can be encountered in the programming language and `<production_rule>`

is a **regular expression** describing that symbol or lexeme.

For example, the regular definition above defines a *letter* as any lowercase or uppercase alphabet character.

A regular definition can make use, in its regular expression, of any element name defined in the same lexical grammar.

As an example, in the regular definitions above, the definition *identifier* reuses the definitions *letter* and *digit*, in its production rule as if *letter* and *digit* were symbols, to define an identifier as *any string starting with a letter or an underscore,* followed by *zero or more occurrences of a letter, a digit or an underscore.*

A Finite State Machine or FSM is an abstract machine that is in *one and only one* state at any point in time. The FSM can change from one state to another as a result of an event. Changing from a state to another is called a **transition**. To better understand this, let’s consider the following example.

A light bulb can be thought of as a FSM. A light bulb can be in only one of two states at any point in time, `ON`

or `OFF`

. The light bulb transitions from the state `ON`

to the state `OFF`

by the press of a switch and transitions from the state `OFF`

to `ON`

by the press of the same switch.

FSMs are often represented with state diagrams.

A FSM simulating a light bulb. States are represented with circles and transitions with labeled arrows.

To recognize a token described by a regular definition, the regular expression in the definition is often transformed into a FSM. The resulting FSM has a **finite number of states** comprising an **initial state** and a **set** of **accepting states**.

The FSM moves from one state to another by *consuming one of the characters* *or elements* in the regular expression. Following the transitions from the initial state to one of the accepting states yields a valid string described by the regular expression.

For example, the regular expression *a | b* can be converted into the following FSM.

The above FSM has two states labeled **1** and **2**. The arrow pointing to **1** and coming out from nowhere indicates that **1** is the *initial state* and the inner circle on **2** indicates that **2** is an *accepting state* of this FSM.

From **1** to **2**, we can either follow the top transition by consuming the character `a`

yielding the string *a* or follow the bottom transition by consuming the character `b`

yielding the string *b*. *a* and *b* are effectively the two possible strings described by the regular expression *a | b*.

Another example.

Following the transitions from the initial state **1** to the accepting state **6** on the above FSM can yield only one string, *Blink*.

We can transform any regular expression into a FSM by building on top of the following three basic rules.

**A | B**

The regular expression *A | B* is represented by a FSM with two states. From the state **1**, we can either consume A and move to the accepting state **2** or consume B and also move to the accepting state **2**.

**AB**

The concatenation *AB* is represented by a FSM with three states. From **1**, first we move to **2** by consuming A, then we move to the accepting state **3** by consuming B.

**A***

*A** is represented by a FSM with only one state being both the initial and the accepting state with a transition to itself, creating a loop. From **1**, we can either go nowhere because **1** is also the accepting state, thus yielding the *empty string* or we can follow the transition by consuming A which will lead back to **1**; again we can either go nowhere or follow the transition. That will generate the strings`A`

, `AA`

, `AAA`

, `AA...AAA`

and the *empty string*.

Any other regular expression can be transformed into a *FSM* by reusing one or any combination of the basic rules above.

Let’s take a look at some examples.

**R = (a|b)c**

*R* is *a|b* followed by *c*. First, the FSM for *a|b* is

Then, the concatenation to *c* is represented by a transition from the state **2** to a new accepting state.

The two possible strings that can be generated by simulating this FSM are `ac`

and `bc`

.

**R = (a|b)*c**

First the FSM for *(a|b)** is a loop with two options *a* and *b* at each iteration.

Then we add a new transition to concatenate *c.*

Simulating or running the above FSM can yield such strings as `c`

, `ac`

, `abc`

, `bac`

, `bc`

, `bbabaabbc`

, `aaaaac`

or `abbbaabbbaabbc`

.

**R = a(bc)***

*a(bc)** is the character *a* followed by zero or more repetitions of the concatenation *bc*. The FSM for *a* is simple.

The FSM for *(bc)** would be represented with a loop on *bc*.

Concatenating the above two FSMs will give us the FSM for *a(bc)**.

Running this *FSM*, we have to consume `a`

to move from the state **1** to the accepting state **2**. At **2**, we can either stop and yield the string `a`

or consume `b`

to move to the state **3**. From **3**, we have no choice but consume `c`

to go back to the accepting state **2**; again at **2**, we can either stop or go to **3** by consuming `b`

and the loop goes on. The possible strings that can be generated by this FSM are `a`

, `abc`

, `abcbc`

, `abcbc`

, `abc...bc`

.

As a final example, let’s take a look at the FSM corresponding to a regular definition that could describe identifiers in a programming language.

First, the FSM for *letter | _* is a basic *A | B* FSM.

Then the FSM for *(letter | digit | _)** will be a loop with 3 different options at each iteration.

We get the final FSM for *identifier* by concatenating the above 2 *FSMs*.

A FSM is a combination of

- the set of all the possible
**states**the FSM can be in - the
**initial state**the FSM is in - a set of
**accepting states** - and the set of all
**transitions**.

We can start by implementing a FSM as a class with properties for the states, the initial state and the accepting states.

The transitions of a FSM can be modeled with a function that takes a state `currentState`

and a character or symbol `input`

in parameters and returns the state the FSM will be in after consuming `input`

while in state `currentState`

. We can call that function the **transition function** and name it `nextState()`

.

Most often, the transition function will be a switch statement on the parameter `currentState`

, with each case returning the next state according to the parameter `input`

.

To assist in the implementation of the transition function, the FSM can first be converted into a transition table. The transition table maps each state S and input I to a state S’, where S’ is the state the FSM will be in when the input I is consumed from the state S.

As an example, let’s consider the FSM below.

(a|b)c

The corresponding transition table is

Transition table for (a|b)c

The rows of the transition table are labeled with the states of the FSM and the columns with all the characters or elements that can possibly be consumed. Each cell of the table contains the state the FSM will be in, if the character at the corresponding column is consumed while the FSM is in the state at the corresponding row. A cell containing `—`

means that there is no transition in the FSM for the corresponding state and character.

The transition table provides a quick visual reference when writing a transition function. The transition function for the FSM above will be:

Now, to complete the constructor of our FSM class, let’s add a parameter `nextState`

of type `Function`

representing the transition function to it.

The next step in the implementation of our FSM is to add a function allowing to **run**, **simulate** or **execute** the FSM on an input string. The function will return a boolean specifying whether the input string (or a subset of the input string) matches the regular expression corresponding to the FSM.

The implementation of the `run`

function is straightforward. The function will read the input character by character while keeping track of the current state the FSM is in. For each character read, it updates the current state with the next state the FSM will be in, by calling the transition function `nextState()`

. At the end of the execution of the loop, if the current state is one of the accepting states of the FSM, then the input string (or a subset of the input string) matches the regular expression corresponding to the FSM.

To conclude this part on FSMs, let’s see how we can use our newly implemented `FSM`

class to recognize identifiers.

Let’s reuse the following regular definitions.

Below is the corresponding FSM.

And the corresponding transition table.

Now we can create a new instance of our `FSM`

class and configure it to recognize identifiers.

Our `FSM`

instance can then be used to recognize identifiers.

Note that the call `fsm.**run**("lisp-case-identifier")`

will return `true`

even though `lisp-case-identifier`

is not a valid identifier since `—`

is not present in the regular expression. It returns `true`

because the subset `lisp`

in `lisp-case-identifier`

is a valid Blink identifier. In the last part of this article, when working on your own lexer, you will have to update our `FSM`

implementation so that the `run`

method returns in addition of a boolean, the subset of the input that matched the regular expression.

Armed with all the necessary tools (lexical grammar, regular expressions, FSMs, transition tables, etc.), we can now have a look at how they all come together in the implementation of a lexer.

Let’s consider a simple language for performing mathematical operations. The language supports the four basic arithmetic operators (+, -, * and /), comparison operators (>, ≥, <, ≤ and ==), and grouping using parenthesis. It also has basic support for variables and assignment using the = symbol.

Below are examples of valid instructions in our mini language.

The valid lexemes for this language, **identifiers**, **numbers**, **parenthesis**, and **operators** are described with the following regular definitions.

Let’s build a lexer for this language by completing the skeleton of the `[Lexer](#d21e)`

class we introduced at the beginning of the article.

The first step in implementing a lexer is to add a token type for each valid lexeme.

Because there is a finite number of operators and parenthesis, we will gain in clarity and granularity by adding a specific token for each type of operator and parenthesis. It will also be helpful to add a special token `EndOfInput`

to be returned when all the characters in the input have been read.

The next step is to complete the implementation of the `Lexer`

class.

Let’s start by adding properties to the `Lexer`

class to keep track of the current position in the input, as long as the current line and column.

Now, let’s implement the `nextToken()`

method.

A strategy we could use for `nextToken()`

is to read the character at the current position. If the character matches the starting character in the production rule of one of the lexemes, we delegate the recognition of the lexeme to a helper method corresponding to that production rule.

Let’s take a look at the helper methods from the simplest to the most complex.

**Recognizing parenthesis**

Recognizing parenthesis is straightforward. We just check whether the current character is `(`

or `)`

and return the appropriate token. We also increment the current position in the input, as well as the current column.

**Recognizing operators**

In the definition for the `operator`

lexeme, we can notice that we basically have 2 types of operator, *arithmetic* and *comparison* operators (technically we have a third one, the assignment operator `=`

but to simplify the implementation, we’ll add it to the comparison operators group here).

For readability, we can delegate the recognition of each type of operator to a specific helper function.

**Lookahead**. The implementation of `recognizeComparisonOperator()`

makes use of a variable named `lookahead`

. Because an operator can, for example, be `>`

or `>=`

, once we read the character `>`

, we need to know what the next character is before deciding what kind of operator we have. If the next character is `=`

, the recognized operator is `>=`

; if the next character is any other character, the recognized operator is `>`

. That is the purpose of the `lookahead`

variable, to literally look ahead in the input.

**Recognizing identifiers**

We could build an FSM for this and use it to recognize identifiers but these rules are simple enough to be implemented with a loop. We just have to keep reading characters in the input until we encounter a character that is not a letter, a digit or an underscore.

**Recognizing numbers**

Note: With these regular definitions, strings such as 00, 00.42 or 00e-00 are considered numbers.

According to the regular definitions above, a number is a succession of one or more digits. The digits can be followed by a fractional part (described by the `fraction`

production rule) or an exponent part (described by the `exponent`

production rule) or again by both. Both the fractional and exponent parts are optional as indicated by the *| ε* in their regular expressions.

Some examples of valid numbers are `42`

, `3.14`

or `6.6262e-34`

. The regular expressions describing a number are complex enough to dissuade us to try to recognize numbers manually like we did for the identifiers. A FSM will greatly ease up the implementation here.

The regular expression for a number can be transformed into the FSM above and below is the corresponding transition table.

With the transition table, we can build a *FSM* and use it in our `recognizeNumber()`

method.

With the completion of `recognizeNumber()`

, the lexer for our little language for performing mathematical operations is almost complete.

For our lexer to be fully complete, we need to update our `nextToken()`

method to ignore white spaces (so that `1+2`

, `1 + 2`

or `1+ 2`

, for example, all yield the same sequence of tokens) and to handle errors.

This completes our `Lexer`

implementation. We can easily get all the tokens in an input by repetitively calling `nextToken()`

until an `EndOfInput`

token is returned.

Congratulations on making it this far in the article. It’s time for you, if you’re up for it, to complete your first challenge by writing a lexer for Blink.

Follow this link for the instructions on how to get the challenge and how to complete it.

Feel free to ping me on Twitter @ftchirou if you have any question, a suggestion, a need for a clarification (or just to say hi) while reading this article or while working through the challenge. :)

Thanks for reading.

*This is a quick introduction to regular expressions presenting the concepts necessary for getting a grasp of lexical grammar discussed* *here* *in the article.*

A **regular expression**, simply put, is a rule that describes all the strings that can be built from a set of basic characters/symbols.

The simplest regular expression possible is the exact string being described. For example, let *R* be the regular expression *lang*, *R = lang*. *R* describes strings whose first character is `l`

, followed by `a`

, followed by `n`

and followed by `g`

. There is only one such string and that string is **lang**.

To describe more complex strings, we make use of **regular expression operators**.

**Union**. The `|`

operator is used to specify union or alternatives. For example, for *R = a | b*, `a`

is a valid string described by *R*, so is `b`

. `|`

can be called the *OR* operator.

**Concatenation**. Concatenation is implied in the absence of an operator between characters/symbols. For example *R = ab* describes the unique string `ab`

.

**Zero or more occurrences**. A postfix * is used to specify that the element it’s applied to can be repeated *zero or multiple times*. For example, *R = a** describes strings such as `a`

, `aa`

, `aaa`

, `aaaa`

, `aaaaa...`

*and the empty string*. ***** is more formally known as the *Kleene Closure* named after Stephen Cole Kleene who helped formalize the concept of regular expressions.

**Grouping**. Just like in mathematical expressions, parenthesis `()`

are used for grouping. For example, *R = (a|b)*c* concatenates *(a|b)** to *c* and describes strings such as`c`

, `ac`

, `aac`

, `bc`

, `bbc`

, `abbc`

and `abbbabababc`

.

**Character classes**. Characters classes can be used to shorten long *OR* regular expressions. For example, the regular expression *a | b | c | d* can be replaced by the character class *[abcd]*. If all the characters in a character class form a logical sequence, the character class can be abbreviated further using a range notation *[a¹–aⁿ]* where *a¹* is the first element of the sequence and *aⁿ* the last. For example *0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9* can be converted to the character class *[0123456789]* which can be abbreviated further as *[0–9]*.

**Empty string**. The empty string is described by *ε* (epsilon). For example, *R = a | ε* describes the string `a`

or the *empty string*.

The Kleene Closure (*) has the highest precedence, followed by the concatenation operator. The union operator (|) has the lowest precedence.

For example, in the regular expression *ab*|c* , *b** is evaluated first, then *ab** and finally the union with *c.* Rewritten with parenthesis, that regular expression will be equivalent to *((a(b*))|c)*.

*This is all we need to know about regular expressions for the purpose of this article. Now back to the* *lexical grammar**.*

You’ve reached the end. 🎉

L O A D I N G

. . . comments & more!

. . . comments & more!