In the , we've built a simple Linux shell that prints a prompt string, reads input, then echoes the input back to the screen. This isn't very much impressive now, is it? first part of this tutorial In this part, we'll update our code to give our shell the ability to parse and execute Let's first have a look at what simple commands are. simple commands . : You can download the complete source code for Part II & III of this tutorial from . NOTE this GitHub repository What is a Simple Command? A consists of a list of words, separated by whitespace characters (space, tab, newline). The first word is the , and is mandatory (otherwise, the shell won't have a command to parse and execute!). The second and subsequent words are optional. If present, they form the we want the shell to pass to the executed command. simple command command name arguments For example, the following command: consists of two words: (the command's name), and (the first and sole argument). Similarly, the command: (which we used in to compile our shell) consists of 5 words: a command name, and a list of 4 arguments. ls -l ls -l gcc -o shell main.c prompt.c part I Now to be able to execute simple commands, our shell needs to perform the following steps: Scan input, one character at a time, to find the next token. We call this process , and the part of the shell that performs this task is known as the , or simply . lexical scanning lexical scanner the scanner Extract input tokens. We call this . tokenizing input Parse the tokens and create an , or . The part of the shell responsible for doing this is known as . Abstract Syntax Tree AST the parser Execute the . This is the job of . AST the executor The figure below shows the steps performed by the shell in order to parse and execute commands. You can see that the figure contains more steps than is shown in the above list, and that's fine. As our shell grows and become more sophisticated, we'll add the other steps as and when needed. Now let's look at each of the above four steps in detail, and see the code we'll need to implement those features in our shell. Scanning Input In order to get the next token, we need to be able to scan our input, one character at a time, so that we can identify the characters that can be part of the token, and those that are delimiter characters. A is one that marks the end of a token (and possibly the beginning of another token). Typically, delimiters are whitespace characters (space, tab, newline), but can also include other characters, such as and . delimiter character ; & In general, scanning input means we should be able to: Retrieve the next character from input. Return the last character we've read back to input. Lookahead (or peek) to check the next character, without actually retrieving it. Skip over whitespace characters. We'll define the functions to perform all of these tasks in a minute. But first, let's have a word about . abstracting input Remember the function, which we defined in ? That was the function we used to read user's input and return it as an 'd string. We could pass this string directly to our , but that would make the scanning process a bit cumbersome. In particular, it would be very difficult for the to remember the last character it gave us, so that it can move past that character and give us the character that follows. read_cmd() part I of this tutorial malloc scanner scanner To make the scanner's job easy, we'll abstract our input by passing the input string as part of a structure, which we'll define in the source file . Go ahead and create that file in your source directory, then open it in your favorite text editor and add the following code: struct source_s source.h *buffer; bufsize; curpos; }; ; ; ; ; # SOURCE_H ifndef # SOURCE_H define # EOF (-1) define # ERRCHAR ( 0) define # INIT_SRC_POS (-2) define { struct source_s char /* the input text */ long /* size of the input text */ long /* absolute char position in source */ char next_char (struct source_s *src) void unget_char (struct source_s *src) char peek_char (struct source_s *src) void skip_white_spaces (struct source_s *src) # endif Focusing on the structure's definition, you can see that our contains a pointer to the input string, in addition to a two fields that hold information about the string's length and our current position in the string (where we'll get the next character from). struct source_s long Now create another file named , to which you should add the following code: source.c { (src->curpos < ) { ; } src->curpos--; } { (!src || !src->buffer) { errno = ENODATA; ERRCHAR; } c1 = ; (src->curpos == INIT_SRC_POS) { src->curpos = ; } { c1 = src->buffer[src->curpos]; } (++src->curpos >= src->bufsize) { src->curpos = src->bufsize; EOF; } src->buffer[src->curpos]; } { (!src || !src->buffer) { errno = ENODATA; ERRCHAR; } pos = src->curpos; (pos == INIT_SRC_POS) { pos++; } pos++; (pos >= src->bufsize) { EOF; } src->buffer[pos]; } { c; (!src || !src->buffer) { ; } (((c = peek_char(src)) != EOF) && (c == || c == )) { next_char(src); } } # include <errno.h> # include "shell.h" # include "source.h" void unget_char (struct source_s *src) if 0 return char next_char (struct source_s *src) if return char 0 if -1 else if return return char peek_char (struct source_s *src) if return long if if return return void skip_white_spaces (struct source_s *src) char if return while ' ' '\t' The function returns (or ungets) the last character we've retrieved from input, back to the input source. It does this by simply manipulating the source structure's pointers. You'll see the benefit of this function later in this series. unget_char() The function returns the next character of input and updates the source pointer, so that the next call to returns the following input character. When we reach the last character in input, the function returns the special character , which we defined as -1 in above. next_char() next_char() EOF source.h The function is similar to in that it returns the next character of input. The only difference is that doesn't update the source pointer, so that the next call to returns the same input character we've just peeked at. You'll see the benefit of input peeking later in this series. peek_char() next_char() peek_char() next_char() Finally, the function skips all whitespace characters. This will help us when we've finished reading a token, and we want to skip delimiter whitespaces before we read the next token. skip_white_spaces() Tokenizing Input Now that we have our scanner's functions in place, we'll use those functions in order to extract input tokens. We'll start by defining a new structure, which we'll use to represent our tokens. Go ahead and create a file named in your source directory, then open it and add the following code: scanner.h text_len; *text; }; ; ; # SCANNER_H ifndef # SCANNER_H define { struct token_s * ; struct source_s src /* source of input */ int /* length of token text */ char /* token text */ /* the special EOF token, which indicates the end of input */ extern ; struct token_s eof_token struct token_s * tokenize (struct source_s *src) void free_token (struct token_s *tok) # endif Focusing on the structure definition, our contains a pointer to the that holds our input. The structure also contains a pointer to the token's text, and a field that tells us the length of that text (so that we don't need to repeatedly call on the token's text). struct token_s struct source_s strlen() Next, we'll write the function, which will retrieve the next token from input. We'll also write some helper functions that will help us work with input tokens. tokenize() In the source directory, create a file named , and enter the following code: scanner.c *tok_buf = ; tok_bufsize = ; tok_bufindex = ; .text_len = , }; { tok_buf[tok_bufindex++] = c; (tok_bufindex >= tok_bufsize) { *tmp = (tok_buf, tok_bufsize* ); (!tmp) { errno = ENOMEM; ; } tok_buf = tmp; tok_bufsize *= ; } } { (!tok) { ; } (tok, , (struct token_s)); tok->text_len = (str); *nstr = (tok->text_len+ ); (!nstr) { (tok); ; } (nstr, str); tok->text = nstr; tok; } { (tok->text) { (tok->text); } (tok); } { endloop = ; (!src || !src->buffer || !src->bufsize) { errno = ENODATA; &eof_token; } (!tok_buf) { tok_bufsize = ; tok_buf = (tok_bufsize); (!tok_buf) { errno = ENOMEM; &eof_token; } } tok_bufindex = ; tok_buf[ ] = ; nc = next_char(src); (nc == ERRCHAR || nc == EOF) { &eof_token; } { (nc) { : : (tok_bufindex > ) { endloop = ; } ; : (tok_bufindex > ) { unget_char(src); } { add_to_buf(nc); } endloop = ; ; : add_to_buf(nc); ; } (endloop) { ; } } ((nc = next_char(src)) != EOF); (tok_bufindex == ) { &eof_token; } (tok_bufindex >= tok_bufsize) { tok_bufindex--; } tok_buf[tok_bufindex] = ; (!tok) { ( , , strerror(errno)); &eof_token; } tok->src = src; tok; } # include <stdlib.h> # include <stdio.h> # include <string.h> # include <errno.h> # include "shell.h" # include "scanner.h" # include "source.h" char NULL int 0 int -1 /* special token to indicate end of input */ = { struct token_s eof_token 0 void add_to_buf ( c) char if char realloc 2 if return 2 struct token_s * create_token ( *str) char * = ( ( )); struct token_s tok malloc sizeof struct token_s if return NULL memset 0 sizeof strlen char malloc 1 if free return NULL strcpy return void free_token (struct token_s *tok) if free free struct token_s * tokenize (struct source_s *src) int 0 if return if 1024 malloc if return 0 0 '\0' char if return do switch case ' ' case '\t' if 0 1 break case '\n' if 0 else 1 break default break if break while if 0 return if '\0' * = ( ); struct token_s tok create_token tok_buf if fprintf stderr "error: failed to alloc buffer: %s\n" return return The global variables we defined in this file serve the following purposes: is a pointer to the buffer in which we'll store the current token. tok_buf is the number of bytes we allocate to the buffer. tok_bufsize is the current buffer index (i.e. it tells us where to add the next input character in the buffer). tok_bufindex is a special token we'll use to signal the end of file/input ( ). eof_token EOF Now let's have a look at the functions we've defined in this file. The function adds a single character to the token buffer. If the buffer is full, the function takes care of extending it. add_to_buf() The function takes a string and converts it to a structure. It takes care of allocating memory for the token's structure and text, and fills in the structure's member fields. create_token() struct token_s The function frees the memory used by a token structure, as well as the memory used to store the token's text. free_token() The function is the heart and soul of our lexical scanner, and the function's code is fairly straight-forward. It starts by allocating memory for our token buffer (if not already done), then initializes our token buffer and source pointers. It then calls to retrieve the next input character. When we reach the end of input, returns the special , which marks the end of input. tokenize() next_char() tokenize() eof_token The function then loops to read input characters, one at a time. If it encounters a whitespace character, it checks the token buffer to see if it's empty or not. If the buffer is not empty, we delimit the current token and break out of the loop. Otherwise, we skip the whitespace characters and move along to the beginning of the next token. After we've got our token, calls , passing it the token text (which we stored in the buffer). The token text is converted to a token structure, which then returns to the caller. tokenize() create_token() tokenize() Compiling the Shell We sure have made a lot of progress in this part, but our shell is still not ready to parse and execute commands. Therefore, we won't be compiling the shell now. Our next compilation will be at the end of part III, after we implement our parser and our executor. What's Next In this part, we've implemented our lexical scanner. In the next part, we'll write the parser and the executor, after which we'll have a shell that is capable of executing simple commands. Stay tuned!