, Cypress Trees Kanō Eitoku Welcome to the third article of the . If this is the first article you’re reading in the series, LBPL is a series of articles that takes the reader from 0 to 1 in implementing a programming language. You can check the second article about compilers and interpreters and the general introduction . series Let’s Build a Programming Language (LBPL) here here This article will describe how to build the first phase of a compiler, . At the end of the article, you will get your hands dirty with a : build a lexer for . the lexer challenge Blink Getting started The , also called or , is a program that breaks down the input source code into a sequence of lexemes. It reads the input source code character by character, recognizes the lexemes and outputs a sequence of tokens describing the lexemes. lexer lexical analyzer tokenizer What’s a lexeme? A is a single identifiable sequence of characters, for example, keywords (such as , , , and ), literals (such as numbers and strings), identifiers, operators, or punctuation characters (such as , , and ). lexeme class func var while { ( . What’s a token? A is an object describing the . A token has a (e.g. Keyword, Identifier, Number, or Operator) and a (the actual characters of the described lexeme_)_. A token can also contain other information such as the line and column numbers where the lexeme was encountered in the source code. token lexeme type value The lexer in code A can be implemented as a class, whose constructor takes an input string in parameter (representing the source code to perform lexical analysis on). It exposes a method to recognize and return the next token in the input. lexer How tokens are recognized All possible lexemes that can appear in code written in a programming language, are described in the specification of that programming language as a set of rules called . Rules in the lexical grammar are often transformed into automata called ( ). The lexer then simulates the finite state machines to recognize the tokens. lexical grammar finite state machines FSM In the next sections of this article, we will study the rules making up the lexical grammar of a programming language, finite state machines, how to transform the rules in the lexical grammar into finite state machines, and how the resulting finite state machines are implemented and used in the method of the class to recognize the next token. nextToken() Lexer In the later sections, we will build a lexer for a mini language and you will be introduced to a where you will have to write your own lexer for . challenge Blink Lexical Grammar The lexical grammar of a programming language is a set of formal rules that govern how valid lexemes in that programming language are constructed. For example, the rules can state that a string is any sequence of characters enclosed in double-quotes or that an identifier may not start with a digit. The rules in the lexical grammar are often expressed with a set of . regular definitions A regular definition is of the form where is the name given to a symbol or a lexeme that can be encountered in the programming language and is a describing that symbol or lexeme. <element_name> = <production_rule> <element_name> <production_rule> regular expression For example, the regular definition above defines a as any lowercase or uppercase alphabet character. letter A regular definition can make use, in its regular expression, of any element name defined in the same lexical grammar. As an example, in the regular definitions above, the definition reuses the definitions and , in its production rule as if and were symbols, to define an identifier as followed by identifier letter digit letter digit any string starting with a letter or an underscore, zero or more occurrences of a letter, a digit or an underscore. Finite State Machines A Finite State Machine or FSM is an abstract machine that is in state at any point in time. The FSM can change from one state to another as a result of an event. Changing from a state to another is called a . To better understand this, let’s consider the following example. one and only one transition A light bulb can be thought of as a FSM. A light bulb can be in only one of two states at any point in time, or . The light bulb transitions from the state to the state by the press of a switch and transitions from the state to by the press of the same switch. ON OFF ON OFF OFF ON FSMs are often represented with . state diagrams A FSM simulating a light bulb. States are represented with circles and transitions with labeled arrows. Lexical Grammar and FSMs To recognize a token described by a regular definition, the regular expression in the definition is often transformed into a FSM. The resulting FSM has a comprising an and a of . finite number of states initial state set accepting states The FSM moves from one state to another by in the regular expression. Following the transitions from the initial state to one of the accepting states yields a valid string described by the regular expression. consuming one of the characters or elements For example, the regular expression can be converted into the following FSM. a | b The above FSM has two states labeled and . The arrow pointing to and coming out from nowhere indicates that is the and the inner circle on indicates that is an of this FSM. 1 2 1 1 initial state 2 2 accepting state From to , we can either follow the top transition by consuming the character yielding the string or follow the bottom transition by consuming the character yielding the string . and are effectively the two possible strings described by the regular expression . 1 2 a a b b a b a | b Another example. Following the transitions from the initial state to the accepting state on the above FSM can yield only one string, . 1 6 Blink From regular expression to FSM We can transform any regular expression into a FSM by building on top of the following three basic rules. A | B The regular expression is represented by a FSM with two states. From the state , we can either consume A and move to the accepting state or consume B and also move to the accepting state . A | B 1 2 2 AB The concatenation is represented by a FSM with three states. From , first we move to by consuming A, then we move to the accepting state by consuming B. AB 1 2 3 A* is represented by a FSM with only one state being both the initial and the accepting state with a transition to itself, creating a loop. From , we can either go nowhere because is also the accepting state, thus yielding the or we can follow the transition by consuming A which will lead back to ; again we can either go nowhere or follow the transition. That will generate the strings , , , and the . A* 1 1 empty string 1 A AA AAA AA...AAA empty string Any other regular expression can be transformed into a by reusing one or any combination of the basic rules above. FSM Let’s take a look at some examples. R = (a|b)c is followed by . First, the FSM for is R a|b c a|b Then, the concatenation to is represented by a transition from the state to a new accepting state. c 2 The two possible strings that can be generated by simulating this FSM are and . ac bc R = (a|b)*c First the FSM for is a loop with two options and at each iteration. (a|b)* a b Then we add a new transition to concatenate c. Simulating or running the above FSM can yield such strings as , , , , , , or . c ac abc bac bc bbabaabbc aaaaac abbbaabbbaabbc R = a(bc)* is the character followed by zero or more repetitions of the concatenation . The FSM for is simple. a(bc)* a bc a The FSM for would be represented with a loop on . (bc)* bc Concatenating the above two FSMs will give us the FSM for . a(bc)* Running this , we have to consume to move from the state to the accepting state . At , we can either stop and yield the string or consume to move to the state . From , we have no choice but consume to go back to the accepting state ; again at , we can either stop or go to by consuming and the loop goes on. The possible strings that can be generated by this FSM are , , , , . FSM a 1 2 2 a b 3 3 c 2 2 3 b a abc abcbc abcbc abc...bc As a final example, let’s take a look at the FSM corresponding to a regular definition that could describe identifiers in a programming language. First, the FSM for is a basic FSM. letter | _ A | B Then the FSM for will be a loop with 3 different options at each iteration. (letter | digit | _)* We get the final FSM for by concatenating the above 2 . identifier FSMs The FSM in code A FSM is a combination of the set of all the possible the FSM can be in states the the FSM is in initial state a set of accepting states and the set of all . transitions We can start by implementing a FSM as a class with properties for the states, the initial state and the accepting states. The transitions of a FSM can be modeled with a function that takes a state and a character or symbol in parameters and returns the state the FSM will be in after consuming while in state . We can call that function the and name it . currentState input input currentState transition function nextState() Most often, the transition function will be a switch statement on the parameter , with each case returning the next state according to the parameter . currentState input To assist in the implementation of the transition function, the FSM can first be converted into a . The transition table maps each state S and input I to a state S’, where S’ is the state the FSM will be in when the input I is consumed from the state S. transition table As an example, let’s consider the FSM below. (a|b)c The corresponding transition table is Transition table for (a|b)c The rows of the transition table are labeled with the states of the FSM and the columns with all the characters or elements that can possibly be consumed. Each cell of the table contains the state the FSM will be in, if the character at the corresponding column is consumed while the FSM is in the state at the corresponding row. A cell containing means that there is no transition in the FSM for the corresponding state and character. — The transition table provides a quick visual reference when writing a transition function. The transition function for the FSM above will be: Now, to complete the constructor of our FSM class, let’s add a parameter of type representing the transition function to it. nextState Function The next step in the implementation of our FSM is to add a function allowing to , or the FSM on an input string. The function will return a boolean specifying whether the input string (or a subset of the input string) matches the regular expression corresponding to the FSM. run simulate execute The implementation of the function is straightforward. The function will read the input character by character while keeping track of the current state the FSM is in. For each character read, it updates the current state with the next state the FSM will be in, by calling the transition function . At the end of the execution of the loop, if the current state is one of the accepting states of the FSM, then the input string (or a subset of the input string) matches the regular expression corresponding to the FSM. run nextState() Usage of a FSM To conclude this part on FSMs, let’s see how we can use our newly implemented class to recognize identifiers. FSM Let’s reuse the following regular definitions. Below is the corresponding FSM. And the corresponding transition table. Now we can create a new instance of our class and configure it to recognize identifiers. FSM Our instance can then be used to recognize identifiers. FSM Note that the call will return even though is not a valid identifier since is not present in the regular expression. It returns because the subset in is a valid Blink identifier. In the last part of this article, when working on your own lexer, you will have to update our implementation so that the method returns in addition of a boolean, the subset of the input that matched the regular expression. fsm.**run**("lisp-case-identifier") true lisp-case-identifier — true lisp lisp-case-identifier FSM run Putting it all together Armed with all the necessary tools (lexical grammar, regular expressions, FSMs, transition tables, etc.), we can now have a look at how they all come together in the implementation of a lexer. Let’s consider a simple language for performing mathematical operations. The language supports the four basic arithmetic operators (+, -, * and /), comparison operators (>, ≥, <, ≤ and ==), and grouping using parenthesis. It also has basic support for variables and assignment using the = symbol. Below are examples of valid instructions in our mini language. The valid lexemes for this language, , , , and are described with the following regular definitions. identifiers numbers parenthesis operators Let’s build a lexer for this language by completing the we introduced at the beginning of the article. skeleton of the [Lexer](#d21e) class TokenType The first step in implementing a lexer is to add a token type for each valid lexeme. Because there is a finite number of operators and parenthesis, we will gain in clarity and granularity by adding a specific token for each type of operator and parenthesis. It will also be helpful to add a special token to be returned when all the characters in the input have been read. EndOfInput The next step is to complete the implementation of the class. Lexer The Lexer class Let’s start by adding properties to the class to keep track of the current position in the input, as long as the current line and column. Lexer Now, let’s implement the method. nextToken() A strategy we could use for is to read the character at the current position. If the character matches the starting character in the production rule of one of the lexemes, we delegate the recognition of the lexeme to a helper method corresponding to that production rule. nextToken() Let’s take a look at the helper methods from the simplest to the most complex. Recognizing parenthesis Recognizing parenthesis is straightforward. We just check whether the current character is or and return the appropriate token. We also increment the current position in the input, as well as the current column. ( ) Recognizing operators In the definition for the lexeme, we can notice that we basically have 2 types of operator, and operators (technically we have a third one, the assignment operator but to simplify the implementation, we’ll add it to the comparison operators group here). operator arithmetic comparison = For readability, we can delegate the recognition of each type of operator to a specific helper function. . The implementation of makes use of a variable named . Because an operator can, for example, be or , once we read the character , we need to know what the next character is before deciding what kind of operator we have. If the next character is , the recognized operator is ; if the next character is any other character, the recognized operator is . That is the purpose of the variable, to literally look ahead in the input. Lookahead recognizeComparisonOperator() lookahead > >= > = >= > lookahead Recognizing identifiers We could build an FSM for this and use it to recognize identifiers but these rules are simple enough to be implemented with a loop. We just have to keep reading characters in the input until we encounter a character that is not a letter, a digit or an underscore. Recognizing numbers Note: With these regular definitions, strings such as 00, 00.42 or 00e-00 are considered numbers. According to the regular definitions above, a number is a succession of one or more digits. The digits can be followed by a fractional part (described by the production rule) or an exponent part (described by the production rule) or again by both. Both the fractional and exponent parts are optional as indicated by the in their regular expressions. fraction exponent | ε Some examples of valid numbers are , or . The regular expressions describing a number are complex enough to dissuade us to try to recognize numbers manually like we did for the identifiers. A FSM will greatly ease up the implementation here. 42 3.14 6.6262e-34 The regular expression for a number can be transformed into the FSM above and below is the corresponding transition table. With the transition table, we can build a and use it in our method. FSM recognizeNumber() With the completion of , the lexer for our little language for performing mathematical operations is almost complete. recognizeNumber() For our lexer to be fully complete, we need to update our method to ignore white spaces (so that , or , for example, all yield the same sequence of tokens) and to handle errors. nextToken() 1+2 1 + 2 1+ 2 This completes our implementation. We can easily get all the tokens in an input by repetitively calling until an token is returned. Lexer nextToken() EndOfInput Test Your Might! Congratulations on making it this far in the article. It’s time for you, if you’re up for it, to complete your first challenge by writing a lexer for Blink. Follow for the instructions on how to get the challenge and how to complete it. this link Feel free to ping me on Twitter if you have any question, a suggestion, a need for a clarification (or just to say hi) while reading this article or while working through the challenge. :) @ftchirou Thanks for reading. Regular Expressions This is a quick introduction to regular expressions presenting the concepts necessary for getting a grasp of lexical grammar discussed here in the article. A , simply put, is a rule that describes all the strings that can be built from a set of basic characters/symbols. regular expression The simplest regular expression possible is the exact string being described. For example, let be the regular expression , . describes strings whose first character is , followed by , followed by and followed by . There is only one such string and that string is . R lang R = lang R l a n g lang To describe more complex strings, we make use of . regular expression operators Operators . The operator is used to specify union or alternatives. For example, for , is a valid string described by , so is . can be called the operator. Union | R = a | b a R b | OR . Concatenation is implied in the absence of an operator between characters/symbols. For example describes the unique string . Concatenation R = ab ab . A postfix * is used to specify that the element it’s applied to can be repeated . For example, describes strings such as , , , , . is more formally known as the named after who helped formalize the concept of regular expressions. Zero or more occurrences zero or multiple times R = a* a aa aaa aaaa aaaaa... and the empty string * Kleene Closure Stephen Cole Kleene . Just like in mathematical expressions, parenthesis are used for grouping. For example, concatenates to and describes strings such as , , , , , and . Grouping () R = (a|b)*c (a|b)* c c ac aac bc bbc abbc abbbabababc . Characters classes can be used to shorten long regular expressions. For example, the regular expression can be replaced by the character class . If all the characters in a character class form a logical sequence, the character class can be abbreviated further using a range notation where is the first element of the sequence and the last. For example can be converted to the character class which can be abbreviated further as . Character classes OR a | b | c | d [abcd] [a¹–aⁿ] a¹ aⁿ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 [0123456789] [0–9] . The empty string is described by (epsilon). For example, describes the string or the . Empty string ε R = a | ε a empty string Operator Precedence The Kleene Closure (*) has the highest precedence, followed by the concatenation operator. The union operator (|) has the lowest precedence. For example, in the regular expression , is evaluated first, then and finally the union with Rewritten with parenthesis, that regular expression will be equivalent to . ab*|c b* ab* c. ((a(b*))|c) This is all we need to know about regular expressions for the purpose of this article. Now back to the lexical grammar . You’ve reached the end. 🎉
Share Your Thoughts