36,476 reads

Lexical Analysis

by faical bahsisMarch 5th, 2017

Too Long; Didn't Read

Welcome to the third article of the <a href="https://hackernoon.com/lets-build-a-programming-language-2612349105c6" target="_blank"><em>Let’s Build a Programming Language (LBPL)</em> series</a>. If this is the first article you’re reading in the series, LBPL is a series of articles that takes the reader from 0 to 1 in implementing a programming language. You can check the second article about compilers and interpreters <a href="https://hackernoon.com/compilers-and-interpreters-3e354a2e41cf" target="_blank">here</a> and the general introduction <a href="https://hackernoon.com/lets-build-a-programming-language-2612349105c6" target="_blank">here</a>.

Companies Mentioned

Cypress Trees, Kanō Eitoku

Welcome to the third article of the Let’s Build a Programming Language (LBPL) series. If this is the first article you’re reading in the series, LBPL is a series of articles that takes the reader from 0 to 1 in implementing a programming language. You can check the second article about compilers and interpreters here and the general introduction here.

This article will describe how to build the first phase of a compiler, the lexer. At the end of the article, you will get your hands dirty with a challenge: build a lexer for Blink.

Getting started

The lexer, also called lexical analyzer or tokenizer, is a program that breaks down the input source code into a sequence of lexemes. It reads the input source code character by character, recognizes the lexemes and outputs a sequence of tokens describing the lexemes.

What’s a lexeme?

A lexeme is a single identifiable sequence of characters, for example, keywords (such as class, func, var, and while), literals (such as numbers and strings), identifiers, operators, or punctuation characters (such as {, (, and .).

What’s a token?

A token is an object describing the lexeme. A token has a type (e.g. Keyword, Identifier, Number, or Operator) and a value (the actual characters of the described lexeme_)_. A token can also contain other information such as the line and column numbers where the lexeme was encountered in the source code.

The lexer in code

A lexer can be implemented as a class, whose constructor takes an input string in parameter (representing the source code to perform lexical analysis on). It exposes a method to recognize and return the next token in the input.

How tokens are recognized

All possible lexemes that can appear in code written in a programming language, are described in the specification of that programming language as a set of rules called lexical grammar. Rules in the lexical grammar are often transformed into automata called finite state machines (FSM). The lexer then simulates the finite state machines to recognize the tokens.

In the next sections of this article, we will study the rules making up the lexical grammar of a programming language, finite state machines, how to transform the rules in the lexical grammar into finite state machines, and how the resulting finite state machines are implemented and used in the nextToken() method of the Lexer class to recognize the next token.

In the later sections, we will build a lexer for a mini language and you will be introduced to a challenge where you will have to write your own lexer for Blink.

Lexical Grammar

The lexical grammar of a programming language is a set of formal rules that govern how valid lexemes in that programming language are constructed. For example, the rules can state that a string is any sequence of characters enclosed in double-quotes or that an identifier may not start with a digit. The rules in the lexical grammar are often expressed with a set of regular definitions.

A regular definition is of the form <element_name> = <production_rule> where <element_name> is the name given to a symbol or a lexeme that can be encountered in the programming language and <production_rule> is a regular expression describing that symbol or lexeme.

For example, the regular definition above defines a letter as any lowercase or uppercase alphabet character.

A regular definition can make use, in its regular expression, of any element name defined in the same lexical grammar.

As an example, in the regular definitions above, the definition identifier reuses the definitions letter and digit, in its production rule as if letter and digit were symbols, to define an identifier as any string starting with a letter or an underscore, followed by zero or more occurrences of a letter, a digit or an underscore.

Finite State Machines

A Finite State Machine or FSM is an abstract machine that is in one and only one state at any point in time. The FSM can change from one state to another as a result of an event. Changing from a state to another is called a transition. To better understand this, let’s consider the following example.

A light bulb can be thought of as a FSM. A light bulb can be in only one of two states at any point in time, ON or OFF. The light bulb transitions from the state ON to the state OFF by the press of a switch and transitions from the state OFF to ON by the press of the same switch.

FSMs are often represented with state diagrams.

A FSM simulating a light bulb. States are represented with circles and transitions with labeled arrows.

Lexical Grammar and FSMs

To recognize a token described by a regular definition, the regular expression in the definition is often transformed into a FSM. The resulting FSM has a finite number of states comprising an initial state and a set of accepting states.

The FSM moves from one state to another by consuming one of the characters or elements in the regular expression. Following the transitions from the initial state to one of the accepting states yields a valid string described by the regular expression.

For example, the regular expression a | b can be converted into the following FSM.

The above FSM has two states labeled 1 and 2. The arrow pointing to 1 and coming out from nowhere indicates that 1 is the initial state and the inner circle on 2 indicates that 2 is an accepting state of this FSM.

From 1 to 2, we can either follow the top transition by consuming the character a yielding the string a or follow the bottom transition by consuming the character b yielding the string b. a and b are effectively the two possible strings described by the regular expression a | b.

Another example.

Following the transitions from the initial state 1 to the accepting state 6 on the above FSM can yield only one string, Blink.

From regular expression to FSM

We can transform any regular expression into a FSM by building on top of the following three basic rules.

A | B

The regular expression A | B is represented by a FSM with two states. From the state 1, we can either consume A and move to the accepting state 2 or consume B and also move to the accepting state 2.

The concatenation AB is represented by a FSM with three states. From 1, first we move to 2 by consuming A, then we move to the accepting state 3 by consuming B.

A* is represented by a FSM with only one state being both the initial and the accepting state with a transition to itself, creating a loop. From 1, we can either go nowhere because 1 is also the accepting state, thus yielding the empty string or we can follow the transition by consuming A which will lead back to 1; again we can either go nowhere or follow the transition. That will generate the stringsA, AA, AAA, AA...AAA and the empty string.

Any other regular expression can be transformed into a FSM by reusing one or any combination of the basic rules above.

Let’s take a look at some examples.

R = (a|b)c

R is a|b followed by c. First, the FSM for a|b is

Then, the concatenation to c is represented by a transition from the state 2 to a new accepting state.

The two possible strings that can be generated by simulating this FSM are ac and bc.

R = (a|b)*c

First the FSM for (a|b)* is a loop with two options a and b at each iteration.

Then we add a new transition to concatenate c.

Simulating or running the above FSM can yield such strings as c, ac, abc, bac, bc, bbabaabbc, aaaaac or abbbaabbbaabbc.

R = a(bc)*

a(bc)* is the character a followed by zero or more repetitions of the concatenation bc. The FSM for a is simple.

The FSM for (bc)* would be represented with a loop on bc.

Concatenating the above two FSMs will give us the FSM for a(bc)*.

Running this FSM, we have to consume a to move from the state 1 to the accepting state 2. At 2, we can either stop and yield the string a or consume b to move to the state 3. From 3, we have no choice but consume c to go back to the accepting state 2; again at 2, we can either stop or go to 3 by consuming b and the loop goes on. The possible strings that can be generated by this FSM are a, abc, abcbc, abcbc, abc...bc.

As a final example, let’s take a look at the FSM corresponding to a regular definition that could describe identifiers in a programming language.

First, the FSM for letter | _ is a basic A | B FSM.

Then the FSM for (letter | digit | _)* will be a loop with 3 different options at each iteration.

We get the final FSM for identifier by concatenating the above 2 FSMs.

The FSM in code

A FSM is a combination of

the set of all the possible states the FSM can be in
the initial state the FSM is in
a set of accepting states
and the set of all transitions.

We can start by implementing a FSM as a class with properties for the states, the initial state and the accepting states.

The transitions of a FSM can be modeled with a function that takes a state currentState and a character or symbol input in parameters and returns the state the FSM will be in after consuming input while in state currentState. We can call that function the transition function and name it nextState().

Most often, the transition function will be a switch statement on the parameter currentState, with each case returning the next state according to the parameter input.

To assist in the implementation of the transition function, the FSM can first be converted into a transition table. The transition table maps each state S and input I to a state S’, where S’ is the state the FSM will be in when the input I is consumed from the state S.

As an example, let’s consider the FSM below.

(a|b)c

The corresponding transition table is

Transition table for (a|b)c

The rows of the transition table are labeled with the states of the FSM and the columns with all the characters or elements that can possibly be consumed. Each cell of the table contains the state the FSM will be in, if the character at the corresponding column is consumed while the FSM is in the state at the corresponding row. A cell containing — means that there is no transition in the FSM for the corresponding state and character.

The transition table provides a quick visual reference when writing a transition function. The transition function for the FSM above will be:

Now, to complete the constructor of our FSM class, let’s add a parameter nextState of type Function representing the transition function to it.

The next step in the implementation of our FSM is to add a function allowing to run, simulate or execute the FSM on an input string. The function will return a boolean specifying whether the input string (or a subset of the input string) matches the regular expression corresponding to the FSM.

The implementation of the run function is straightforward. The function will read the input character by character while keeping track of the current state the FSM is in. For each character read, it updates the current state with the next state the FSM will be in, by calling the transition function nextState(). At the end of the execution of the loop, if the current state is one of the accepting states of the FSM, then the input string (or a subset of the input string) matches the regular expression corresponding to the FSM.

Usage of a FSM

To conclude this part on FSMs, let’s see how we can use our newly implemented FSM class to recognize identifiers.

Let’s reuse the following regular definitions.

Below is the corresponding FSM.

And the corresponding transition table.

Now we can create a new instance of our FSM class and configure it to recognize identifiers.

Our FSM instance can then be used to recognize identifiers.

Note that the call fsm.**run**("lisp-case-identifier") will return true even though lisp-case-identifier is not a valid identifier since — is not present in the regular expression. It returns true because the subset lisp in lisp-case-identifier is a valid Blink identifier. In the last part of this article, when working on your own lexer, you will have to update our FSM implementation so that the run method returns in addition of a boolean, the subset of the input that matched the regular expression.

Putting it all together

Armed with all the necessary tools (lexical grammar, regular expressions, FSMs, transition tables, etc.), we can now have a look at how they all come together in the implementation of a lexer.

Let’s consider a simple language for performing mathematical operations. The language supports the four basic arithmetic operators (+, -, * and /), comparison operators (>, ≥, <, ≤ and ==), and grouping using parenthesis. It also has basic support for variables and assignment using the = symbol.

Below are examples of valid instructions in our mini language.

The valid lexemes for this language, identifiers, numbers, parenthesis, and operators are described with the following regular definitions.

Let’s build a lexer for this language by completing the skeleton of the [Lexer](#d21e) class we introduced at the beginning of the article.

TokenType

The first step in implementing a lexer is to add a token type for each valid lexeme.

Because there is a finite number of operators and parenthesis, we will gain in clarity and granularity by adding a specific token for each type of operator and parenthesis. It will also be helpful to add a special token EndOfInput to be returned when all the characters in the input have been read.

The next step is to complete the implementation of the Lexer class.

The Lexer class

Let’s start by adding properties to the Lexer class to keep track of the current position in the input, as long as the current line and column.

Now, let’s implement the nextToken() method.

A strategy we could use for nextToken() is to read the character at the current position. If the character matches the starting character in the production rule of one of the lexemes, we delegate the recognition of the lexeme to a helper method corresponding to that production rule.

Let’s take a look at the helper methods from the simplest to the most complex.

Recognizing parenthesis

Recognizing parenthesis is straightforward. We just check whether the current character is ( or ) and return the appropriate token. We also increment the current position in the input, as well as the current column.

Recognizing operators

In the definition for the operator lexeme, we can notice that we basically have 2 types of operator, arithmetic and comparison operators (technically we have a third one, the assignment operator = but to simplify the implementation, we’ll add it to the comparison operators group here).

For readability, we can delegate the recognition of each type of operator to a specific helper function.

Lookahead. The implementation of recognizeComparisonOperator() makes use of a variable named lookahead. Because an operator can, for example, be > or >=, once we read the character >, we need to know what the next character is before deciding what kind of operator we have. If the next character is =, the recognized operator is >=; if the next character is any other character, the recognized operator is >. That is the purpose of the lookahead variable, to literally look ahead in the input.

Recognizing identifiers

We could build an FSM for this and use it to recognize identifiers but these rules are simple enough to be implemented with a loop. We just have to keep reading characters in the input until we encounter a character that is not a letter, a digit or an underscore.

Recognizing numbers

Note: With these regular definitions, strings such as 00, 00.42 or 00e-00 are considered numbers.

According to the regular definitions above, a number is a succession of one or more digits. The digits can be followed by a fractional part (described by the fraction production rule) or an exponent part (described by the exponent production rule) or again by both. Both the fractional and exponent parts are optional as indicated by the | ε in their regular expressions.

Some examples of valid numbers are 42, 3.14 or 6.6262e-34. The regular expressions describing a number are complex enough to dissuade us to try to recognize numbers manually like we did for the identifiers. A FSM will greatly ease up the implementation here.

The regular expression for a number can be transformed into the FSM above and below is the corresponding transition table.

With the transition table, we can build a FSM and use it in our recognizeNumber() method.

With the completion of recognizeNumber(), the lexer for our little language for performing mathematical operations is almost complete.

For our lexer to be fully complete, we need to update our nextToken() method to ignore white spaces (so that 1+2, 1 + 2 or 1+ 2, for example, all yield the same sequence of tokens) and to handle errors.

This completes our Lexer implementation. We can easily get all the tokens in an input by repetitively calling nextToken() until an EndOfInput token is returned.

Test Your Might!

Congratulations on making it this far in the article. It’s time for you, if you’re up for it, to complete your first challenge by writing a lexer for Blink.

Follow this link for the instructions on how to get the challenge and how to complete it.

Feel free to ping me on Twitter @ftchirou if you have any question, a suggestion, a need for a clarification (or just to say hi) while reading this article or while working through the challenge. :)

Thanks for reading.

Regular Expressions

This is a quick introduction to regular expressions presenting the concepts necessary for getting a grasp of lexical grammar discussed here in the article.

A regular expression, simply put, is a rule that describes all the strings that can be built from a set of basic characters/symbols.

The simplest regular expression possible is the exact string being described. For example, let R be the regular expression lang, R = lang. R describes strings whose first character is l, followed by a, followed by n and followed by g. There is only one such string and that string is lang.

To describe more complex strings, we make use of regular expression operators.

Operators

Union. The | operator is used to specify union or alternatives. For example, for R = a | b, a is a valid string described by R, so is b. | can be called the OR operator.

Concatenation. Concatenation is implied in the absence of an operator between characters/symbols. For example R = ab describes the unique string ab.

Zero or more occurrences. A postfix * is used to specify that the element it’s applied to can be repeated zero or multiple times. For example, R = a* describes strings such as a, aa, aaa, aaaa, aaaaa... and the empty string. * is more formally known as the Kleene Closure named after Stephen Cole Kleene who helped formalize the concept of regular expressions.

Grouping. Just like in mathematical expressions, parenthesis () are used for grouping. For example, R = (a|b)*c concatenates (a|b)* to c and describes strings such asc, ac, aac, bc, bbc, abbc and abbbabababc.

Character classes. Characters classes can be used to shorten long OR regular expressions. For example, the regular expression a | b | c | d can be replaced by the character class [abcd]. If all the characters in a character class form a logical sequence, the character class can be abbreviated further using a range notation [a¹–aⁿ] where a¹ is the first element of the sequence and aⁿ the last. For example 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 can be converted to the character class [0123456789] which can be abbreviated further as [0–9].

Empty string. The empty string is described by ε (epsilon). For example, R = a | ε describes the string a or the empty string.

Operator Precedence

The Kleene Closure (*) has the highest precedence, followed by the concatenation operator. The union operator (|) has the lowest precedence.

For example, in the regular expression ab*|c , b* is evaluated first, then ab* and finally the union with c. Rewritten with parenthesis, that regular expression will be equivalent to ((a(b*))|c).

This is all we need to know about regular expressions for the purpose of this article. Now back to the lexical grammar.

You’ve reached the end. 🎉