Building A Linux Shell - Part II [A Step-by-Step Guide]

In the first part of this tutorial, we've built a simple Linux shell that prints a prompt string, reads input, then echoes the input back to the screen. This isn't very much impressive now, is it?

In this part, we'll update our code to give our shell the ability to parse and execute simple commands. Let's first have a look at what simple commands are.

NOTE: You can download the complete source code for Part II & III of this tutorial from this GitHub repository.

What is a Simple Command?

A simple command consists of a list of words, separated by whitespace characters (space, tab, newline). The first word is the command name, and is mandatory (otherwise, the shell won't have a command to parse and execute!). The second and subsequent words are optional. If present, they form the arguments we want the shell to pass to the executed command.

For example, the following command:

ls -l

consists of two words:

ls

(the command's name), and

-l

(the first and sole argument). Similarly, the command:

gcc -o shell main.c prompt.c

(which we used in part I to compile our shell) consists of 5 words: a command name, and a list of 4 arguments.

Now to be able to execute simple commands, our shell needs to perform the following steps:

Scan input, one character at a time, to find the next token. We call this process lexical scanning, and the part of the shell that performs this task is known as the lexical scanner, or simply the scanner.
Extract input tokens. We call this tokenizing input.
Parse the tokens and create an Abstract Syntax Tree, or AST. The part of the shell responsible for doing this is known as the parser.
Execute the AST. This is the job of the executor.

The figure below shows the steps performed by the shell in order to parse and execute commands. You can see that the figure contains more steps than is shown in the above list, and that's fine. As our shell grows and become more sophisticated, we'll add the other steps as and when needed.

Now let's look at each of the above four steps in detail, and see the code we'll need to implement those features in our shell.

Scanning Input

In order to get the next token, we need to be able to scan our input, one character at a time, so that we can identify the characters that can be part of the token, and those that are delimiter characters. A delimiter character is one that marks the end of a token (and possibly the beginning of another token). Typically, delimiters are whitespace characters (space, tab, newline), but can also include other characters, such as

and

In general, scanning input means we should be able to:

Retrieve the next character from input.
Return the last character we've read back to input.
Lookahead (or peek) to check the next character, without actually retrieving it.
Skip over whitespace characters.

We'll define the functions to perform all of these tasks in a minute. But first, let's have a word about abstracting input.

Remember the

read_cmd()

function, which we defined in part I of this tutorial? That was the function we used to read user's input and return it as an

malloc

'd string. We could pass this string directly to our scanner, but that would make the scanning process a bit cumbersome. In particular, it would be very difficult for the scanner to remember the last character it gave us, so that it can move past that character and give us the character that follows.

To make the scanner's job easy, we'll abstract our input by passing the input string as part of a

struct source_s

structure, which we'll define in the source file

source.h

. Go ahead and create that file in your source directory, then open it in your favorite text editor and add the following code:

#ifndef SOURCE_H
#define SOURCE_H

#define EOF             (-1)
#define ERRCHAR         ( 0)

#define INIT_SRC_POS    (-2)

struct source_s
{   
    char *buffer;       /* the input text */
    long bufsize;       /* size of the input text */
    long  curpos;       /* absolute char position in source */
};

char next_char(struct source_s *src);
void unget_char(struct source_s *src);
char peek_char(struct source_s *src);
void skip_white_spaces(struct source_s *src);

#endif

Focusing on the structure's definition, you can see that our

struct source_s

contains a pointer to the input string, in addition to a two

long

fields that hold information about the string's length and our current position in the string (where we'll get the next character from).

Now create another file named

source.c

, to which you should add the following code:

#include <errno.h>
#include "shell.h"
#include "source.h"

void unget_char(struct source_s *src)
{
    if(src->curpos < 0)
    {
        return;
    }

    src->curpos--;
}


char next_char(struct source_s *src)
{
    if(!src || !src->buffer)
    {
        errno = ENODATA;
        return ERRCHAR;
    }

    char c1 = 0;
    if(src->curpos == INIT_SRC_POS)
    {
        src->curpos  = -1;
    }
    else
    {
        c1 = src->buffer[src->curpos];
    }

    if(++src->curpos >= src->bufsize)
    {
        src->curpos = src->bufsize;
        return EOF;
    }

    return src->buffer[src->curpos];
}


char peek_char(struct source_s *src)
{
    if(!src || !src->buffer)
    {
        errno = ENODATA;
        return ERRCHAR;
    }

    long pos = src->curpos;

    if(pos == INIT_SRC_POS)
    {
        pos++;
    }
    pos++;

    if(pos >= src->bufsize)
    {
        return EOF;
    }

    return src->buffer[pos];
}


void skip_white_spaces(struct source_s *src)
{
    char c;

    if(!src || !src->buffer)
    {
        return;
    }

    while(((c = peek_char(src)) != EOF) && (c == ' ' || c == '\t'))
    {
        next_char(src);
    }
}

The

unget_char()

function returns (or ungets) the last character we've retrieved from input, back to the input source. It does this by simply manipulating the source structure's pointers. You'll see the benefit of this function later in this series.

The

next_char()

function returns the next character of input and updates the source pointer, so that the next call to

next_char()

returns the following input character. When we reach the last character in input, the function returns the special character

EOF

, which we defined as -1 in

source.h

above.

The

peek_char()

function is similar to

next_char()

in that it returns the next character of input. The only difference is that

peek_char()

doesn't update the source pointer, so that the next call to

next_char()

returns the same input character we've just peeked at. You'll see the benefit of input peeking later in this series.

Finally, the

skip_white_spaces()

function skips all whitespace characters. This will help us when we've finished reading a token, and we want to skip delimiter whitespaces before we read the next token.

Tokenizing Input

Now that we have our scanner's functions in place, we'll use those functions in order to extract input tokens. We'll start by defining a new structure, which we'll use to represent our tokens.

Go ahead and create a file named

scanner.h

in your source directory, then open it and add the following code:

#ifndef SCANNER_H
#define SCANNER_H

struct token_s
{
    struct source_s *src;       /* source of input */
    int    text_len;            /* length of token text */
    char   *text;               /* token text */
};

/* the special EOF token, which indicates the end of input */
extern struct token_s eof_token;

struct token_s *tokenize(struct source_s *src);
void free_token(struct token_s *tok);

#endif

Focusing on the structure definition, our

struct token_s

contains a pointer to the

struct source_s

that holds our input. The structure also contains a pointer to the token's text, and a field that tells us the length of that text (so that we don't need to repeatedly call

strlen()

on the token's text).

Next, we'll write the

tokenize()

function, which will retrieve the next token from input. We'll also write some helper functions that will help us work with input tokens.

In the source directory, create a file named

scanner.c

, and enter the following code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include "shell.h"
#include "scanner.h"
#include "source.h"

char *tok_buf = NULL;
int   tok_bufsize  = 0;
int   tok_bufindex = -1;

/* special token to indicate end of input */
struct token_s eof_token = 
{
    .text_len = 0,
};


void add_to_buf(char c)
{
    tok_buf[tok_bufindex++] = c;

    if(tok_bufindex >= tok_bufsize)
    {
        char *tmp = realloc(tok_buf, tok_bufsize*2);

        if(!tmp)
        {
            errno = ENOMEM;
            return;
        }

        tok_buf = tmp;
        tok_bufsize *= 2;
    }
}


struct token_s *create_token(char *str)
{
    struct token_s *tok = malloc(sizeof(struct token_s));
    
    if(!tok)
    {
        return NULL;
    }

    memset(tok, 0, sizeof(struct token_s));
    tok->text_len = strlen(str);
    
    char *nstr = malloc(tok->text_len+1);
    
    if(!nstr)
    {
        free(tok);
        return NULL;
    }
    
    strcpy(nstr, str);
    tok->text = nstr;
    
    return tok;
}


void free_token(struct token_s *tok)
{
    if(tok->text)
    {
        free(tok->text);
    }
    free(tok);
}


struct token_s *tokenize(struct source_s *src)
{
    int  endloop = 0;

    if(!src || !src->buffer || !src->bufsize)
    {
        errno = ENODATA;
        return &eof_token;
    }
    
    if(!tok_buf)
    {
        tok_bufsize = 1024;
        tok_buf = malloc(tok_bufsize);
        if(!tok_buf)
        {
            errno = ENOMEM;
            return &eof_token;
        }
    }

    tok_bufindex     = 0;
    tok_buf[0]       = '\0';

    char nc = next_char(src);

    if(nc == ERRCHAR || nc == EOF)
    {
        return &eof_token;
    }

    do
    {
        switch(nc)
        {
            case ' ':
            case '\t':
                if(tok_bufindex > 0)
                {
                    endloop = 1;
                }
                break;
                
            case '\n':
                if(tok_bufindex > 0)
                {
                    unget_char(src);
                }
                else
                {
                    add_to_buf(nc);
                }
                endloop = 1;
                break;
                
            default:
                add_to_buf(nc);
                break;
        }

        if(endloop)
        {
            break;
        }

    } while((nc = next_char(src)) != EOF);

    if(tok_bufindex == 0)
    {
        return &eof_token;
    }
    
    if(tok_bufindex >= tok_bufsize)
    {
        tok_bufindex--;
    }
    tok_buf[tok_bufindex] = '\0';

    struct token_s *tok = create_token(tok_buf);
    if(!tok)
    {
        fprintf(stderr, "error: failed to alloc buffer: %s\n", strerror(errno));
        return &eof_token;
    }

    tok->src = src;
    return tok;
}

The global variables we defined in this file serve the following purposes:

tok_buf
is a pointer to the buffer in which we'll store the current token.
```
tok_bufsize
```
is the number of bytes we allocate to the buffer.
```
tok_bufindex
```
is the current buffer index (i.e. it tells us where to add the next input character in the buffer).
```
eof_token
```
is a special token we'll use to signal the end of file/input (EOF).

Now let's have a look at the functions we've defined in this file.

The

add_to_buf()

function adds a single character to the token buffer. If the buffer is full, the function takes care of extending it.

The

create_token()

function takes a string and converts it to a

struct token_s

structure. It takes care of allocating memory for the token's structure and text, and fills in the structure's member fields.

The

free_token()

function frees the memory used by a token structure, as well as the memory used to store the token's text.

The

tokenize()

function is the heart and soul of our lexical scanner, and the function's code is fairly straight-forward. It starts by allocating memory for our token buffer (if not already done), then initializes our token buffer and source pointers. It then calls

next_char()

to retrieve the next input character. When we reach the end of input,

tokenize()

returns the special

eof_token

, which marks the end of input.

The function then loops to read input characters, one at a time. If it encounters a whitespace character, it checks the token buffer to see if it's empty or not. If the buffer is not empty, we delimit the current token and break out of the loop. Otherwise, we skip the whitespace characters and move along to the beginning of the next token.

After we've got our token,

tokenize()

calls

create_token()

, passing it the token text (which we stored in the buffer). The token text is converted to a token structure, which

tokenize()

then returns to the caller.

Compiling the Shell

We sure have made a lot of progress in this part, but our shell is still not ready to parse and execute commands. Therefore, we won't be compiling the shell now. Our next compilation will be at the end of part III, after we implement our parser and our executor.

What's Next

In this part, we've implemented our lexical scanner. In the next part, we'll write the parser and the executor, after which we'll have a shell that is capable of executing simple commands.

Stay tuned!

Building A Linux Shell - Part II [A Step-by-Step Guide]

Too Long; Didn't Read

What is a Simple Command?

Scanning Input

Tokenizing Input

Compiling the Shell

What's Next

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Trending Topics

Classic

Neon Noir

Minty

Newspaper

HN StartUps

Building A Linux Shell - Part II [A Step-by-Step Guide]

Too Long; Didn't Read

What is a Simple Command?

Scanning Input

Tokenizing Input

Compiling the Shell

What's Next

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES

Trending Topics

Classic

Neon Noir

Minty

Newspaper

HN StartUps