This is part V of a tutorial on how to build a Linux shell. You can read the previous parts of this tutorial from the following links: part I, part II, part III, part IV.
NOTE: You can download the complete source code for Part V from this GitHub repository.
As we've seen in the previous parts, a simple command consists of one or more arguments, also known as words. The first word contains the name of the command we want to execute, while the rest of the words contain the command's arguments.
Before the shell executes a command, it performs word expansion on the command words. Word expansion is the process by which the shell takes a command word, checks it to see if it contains variable names, pathnames, commands, and arithmetic expressions, and substitutes each name/command/expression with its value. The resultant word, which is usually (but not always) longer than the original word, might be broken down into one or more subwords (or fields), in a process known as field splitting.
In this part, we'll implement the 7 word expansions defined by POSIX, which are: tilde expansion, parameter expansion, arithmetic expansion, command substitution, field splitting, pathname expansion, and quote removal. There are other word expansions, such as brace expansion and process substitution, which are not defined by POSIX, and which we won't be discussing here. After finishing this lesson, it would be a good exercise if you extended the shell by implementing the non-POSIX word expansions.
The word expansion process is very complex, and we'll have to implement a host of functions in order to perform word expansion in our shell. As such, we won't be able to discuss each and every function in detail. Instead, we'll have a general overview of each word expansion and the functions we need to process that expansion. At the end of each section, you'll find a link to our GitHub repository, where you'll be able to read the whole function's code at your own pace.
After we finish discussing the individual word expansion functions, we'll write the main function that will tie everything together, then we'll update our existing shell's code to make our shell use the word expansion functions we've just implemented. To make the code easier to read, I've defined all our word expansion functions in the source file
wordexp.c
, which you'll find in this link (the pattern matching functions are defined in pattern.c
, which you can read in this link).Now let's get started.
When the shell performs word expansion, it checks every word in the command line to see if it contains possible word expansions. Expansions can appear anywhere in the word: at the beginning, in the middle, or at the end of the word. An expansion might also include the whole word.
Word expansions are preceded by a
$
sign. The characters following the $
sign indicate the type of expansion to be performed by the shell. These characters are interpreted by the shell as follows:@
, *
, #
, ?
, -
, $
, !
, or 0
, which indicate variable expansion of a special parameter (we'll discuss these in a later lesson in this tutorial).{
and }
.(
and )
.((
and ))
.The shell performs tilde expansion, parameter expansion, command substitution, and arithmetic expansion first, which is then followed by field splitting and pathname expansion. Lastly, the shell removes any quote characters from the expanded word(s) that have been part of the original word.
When the shell performs word expansion, the process can result in zero, one, or more words. We'll use a special structure to represent those words, which we'll define in our header file
:shell.h
struct word_s
{
char *data;
int len;
struct word_s *next;
};
The structure contains the following fields:
data
=> the string representing this word.len
=> the length of the data
field.next
=> pointer to the next word, or NULL
if this is the last word (we'll use a linked list to represent our expanded words).Of course, we'll need some functions to allocate and free our
struct word_s
structures. For this, we'll use the following functions:struct word_s *make_word(char *str);
void free_all_words(struct word_s *first);
The first function allocates memory for the structure, creates a copy of the word string, and returns the newly allocated word. The second function frees the memory used by a list of word structures. You can read the functions' code in our
wordexp.c
source file.As we've said earlier, word expansion is a complex process, for which we need to define a host of different functions. Before we dive into the details of word expansion, let's first define some helper functions.
The following list shows the function prototypes for our helper functions, all of which we'll define in the
source file:wordexp.c
char *wordlist_to_str(struct word_s *word);
void delete_char_at(char *str, size_t index);
int is_name(char *str);
size_t find_closing_quote(char *data);
size_t find_closing_brace(char *data);
char *substitute_str(char *s1, char *s2, size_t start, size_t end);
int substitute_word(char **pstart, char **p, size_t len, char *(func)(char *), int add_quotes);
Here's a breakdown of what these functions do:
wordlist_to_str()
=> converts a linked list of expanded words to a single string.delete_char_at()
=> removes the character at the given index from a string.is_name()
=> checks if a string represents a valid variable name (refer back to The Word Expansion Process section above).find_closing_quote()
=> when a word expansion contains an opening quote character ("
, '
, or `
), we need to find the matching closing quote character that encloses the quoted string (more on quoting below). This function returns the zero-based index of the closing quote character in the word.find_closing_brace()
=> similar to the above, except that it finds the matching closing brace. That is, if the opening brace is {
, (
, or [
, this function returns the zero-based index of the matching }
, )
, or ]
character, respectively. Finding pairs of quotes is important for processing parameter expansion, arithmetic expansion, and command substitution.substitute_str()
=> substitutes the substring of s1
, from the character at position start
to that at position end
(both positions are zero-based), with the s2
string. This is useful when the word expansion is part of a longer word, such as ${PATH}/ls
, in which case we need only expand ${PATH}
, then attach /ls
to the end of the expanded string.substitute_word()
=> a helper function that calls the other word expansion functions, which we'll define in the following sections.Additionally, we'll define some functions to help us work with strings. We'll define all of these functions in the
source file:strings.c
char *strchr_any(char *string, char *chars);
char *quote_val(char *val, int add_quotes);
int check_buffer_bounds(int *count, int *len, char ***buf);
void free_buffer(int len, char **buf);
Here's what these functions do:
strchr_any()
=> similar to strchr()
, except that it searches the given string for any of the given characters.quote_val()
=> does the reverse of quote removal, that is, converts a string to a quoted string (this might seem trivial at first, but quoting might become complicated when we have to escape quotes and backslash characters, for example).check_buffer_bounds()
and free_buffer()
functions will allow our backend executor to support variable number of command arguments, instead of the hard limit we've set in Part II, which was 255.Now let's write the functions to handle each type of word expansion.
During tilde expansion, the shell substitutes a tilde character (followed by an optional username) with the pathname of the user's home directory. For example,
~
and ~/
are tilde-expanded to the current user's home directory, while ~john
is tilde-expanded to user John's home directory, and so on. The tilde character, in addition to all the following characters up to the first unquoted forward slash, are known as the tilde prefix (we'll deal with quoting later on in this lession).To perform tilde expansion, we'll define the
tilde_expand()
function, which has the following prototype:char *tilde_expand(char *s);
The function accepts a single argument: the tilde prefix we want to expand. If the expansion succeeds, the function returns an malloc'd string representing the tilde-expanded prefix. Otherwise, it returns
NULL
. Here's a quick breakdown of what the function does in order to expand a tilde prefix:~
, get the value of the $HOME
shell variable. If $HOME
is defined and is not NULL
, return its value. Otherwise, get the current User ID (UID) by calling getuid()
, and pass the UID to getpwuid()
to get the password database entry corresponding to the current user. The pw_dir
field of the password database entry contains the pathname of the home directory, which the function returns.~
), we take these letters as the name of the user whose home directory we want to get. We call getpwnam()
, passing it the username, and return the value of the pw_dir
field.NULL
. Otherwise, we return an malloc'd copy of the home directory path.Take a minute to read the
tilde_expand()
function's code in our GitHub repository.In parameter expansion, the shell replaces the name of a shell variable with the variable's value (hence the other name, variable expansion). Parameter Expansion is what allows the shell to execute a command such as
echo $PATH
. In this example, the shell performs parameter expansion on the $PATH
variable, replacing it with the actual executable path (something like /bin:/sbin:/usr/bin:/usr/sbin
). What the echo
command sees is not the $PATH
word, but its expanded value (which can be NULL
, of course).To signal to the shell that we want to expand a shell variable, we precede the variable name with a
$
sign. That is, to expand the PATH
, USER
, and SHELL
variables, we need to pass the $PATH
, $USER
, and $SHELL
words to the shell, respectively (alternatively, we can pass these variable expansions to shell shell as ${PATH}
, ${USER}
, and ${SHELL}
). Shell variable names can contain any combination of letters, digits, and underscores. Names can contain capital or small letters although, by convention, capitalized names are reserved for standard shell variables, such as those defined by POSIX.We can control how the shell performs parameter expansion by using a parameter expansion modifier, which tells the shell which part of the value we want expanded, as well as what to do if there is no shell variable with the given name. The table below summarizes the parameter expansion modifiers (the ones defined by POSIX are marked by the POSIX word in the Description column). Most shells support other modifiers (found in the bottom half of the table), which we won't discuss here. Refer to your shell's manual page for more information on the non-POSIX modifiers.
To perform parameter (or variable) expansion, we'll define the
var_expand()
function, which has the following prototype:char *var_expand(char *orig_var_name);
The function accepts a single argument: the parameter (i.e. the variable name) we want to expand. If the expansion succeeds, the function returns an malloc'd string containing the expanded value. Otherwise, it returns
NULL
. Here's a quick breakdown of what the function does in order to expand a variable name to get its value:${PATH}
), remove the braces, as they are not part of the variable name itself.#
, we need to get the variable name's length (the ${#parameter}
expansion, which is seen in the 5th row in the table above).${parameter#word}
and ${parameter%word}
expansions), we'll need two helper functions: match_suffix()
and match_prefix()
. We won't discuss these functions here, but you can read their code from this link.${parameter:=word}
, we need to set the value of the symbol table entry to the value we've just expanded.#
, get the length of the expanded value and use it as the final result.Take a minute to read the
var_expand()
function's code in our GitHub repository.In command substitution, the shell forks a process to run a command, then replaces the command substitution expansion with the command's output. For example, in the following loop:
for i in $(ls); do echo $i; done
the shell forks a process, in which the
ls
command is run. The output of this command is the list of files in the current directory. The shell takes that output, splits it into a list of words, then feeds those words, one at a time, to the loop. In each iteration of the loop, variable $i
is assigned the name of the next file from the list. This name is passed to the
echo
command, which outputs the name on a separate line (Practically, it would be better to execute ls
directly, but this example is just to show how command substitution can be used in the shell).Command substitution can be written as
$(command)
, or `command`
. To perform command substitution, we'll define the command_substitute()
function, which has the following prototype:char *command_substitute(char *orig_cmd);
The function accepts a single argument: the command we want to execute. If the expansion succeeds, the function returns an malloc'd string representing the command's output. If the expansion fails, or if the command outputs nothing, the function returns
NULL
. Here's a quick breakdown of what the function does in order to expand a command substitution:
$()
or the backquotes ``
. This leaves us with the command we need to execute.popen()
to create a pipe. We pass the command to be executed to popen()
, and we get a pointer to a FILE
stream from which we'll read the command's output.fread()
to read the command's output from the pipe. Store the string read in a buffer.Take a minute to read the
command_substitute()
function's code in our GitHub repository.Using arithmetic expansion, we can let the shell perform different arithmetic operations and use the result in performing other commands. While POSIX requires the shell to only support signed long integer arithmetic, many shells (e.g. ksh and zsh) support floating-point arithmetic.
Also, the shell is not required to support any mathematical functions, although most shells do. For our simple shell, we'll only support signed long integer arithmetic with no math function support.
Arithmetic expansion is written as
$((expression))
. To perform the expansion, we'll define the
arithm_expand()
function, which has the following prototype:char *arithm_expand(char *expr);
The
arithm_expand()
function receives a string containing the arithmetic expression, performs the necessary calculations, and returns the result in the form of an malloc'd string (or NULL
in case of error, e.g. an invalid arithmetic operation). The function and its associated helper functions are complex and lengthy, but here's the main highlights:x - y
is x y -
, and that of 3 + 4 × (2 − 1)
is 3 4 2 1 − × +
.Take a minute to read the
arithm_expand()
function's code in our GitHub repository.During field splitting, the shell takes the result(s) of parameter expansion, command substitution, and arithmetic expansion, and splits them into one or more parts, which we call fields (hence the name, field splitting). The process depends on the value of the
$IFS
shell variable. IFS is a historical term that stands for Internal (or Input) Field Separators, and it originates from the time when Unix shells didn't have a builtin array type. As a workaround, early Unix shells had to find another way of representing multi-member arrays. The shell would join the array members in a single string, separated by spaces. When the shell needs to retrieve the array members, it breaks down the string into one or more fields. The
$IFS
variable tells the shell where exactly to break that string. The shell interprets
$IFS
characters as follows (this description is taken from POSIX):$IFS
is a space, tab, and newline, or if the variable is not set, any sequence of space, tab, or newline characters at the beginning or end of the input shall be ignored and any sequence of those characters within the input shall delimit a field.$IFS
is null, no field splitting shall be performed.$IFS
white space shall be ignored at the beginning and end of the input. (b) Each occurrence in the input of an $IFS
character that is not $IFS
white space, along with any adjacent $IFS
white space, shall delimit a field, as described previously. (c) Non-zero-length $IFS
white space shall delimit a field.To perform the expansion, we'll define the
field_split()
function, which has the following prototype:struct word_s *field_split(char *str);
Take a minute to read the
field_split()
function's code in our GitHub repository.During pathname expansion (also known as filename globbing), the shell matches one or more file names with a given pattern. The pattern can contain normal characters (which are matched to themselves), in addition to the special characters
*
, ?
, and []
, which are also known as the globbing characters. The asterisk
*
matches any length of characters (including zero characters), the ?
matches one character, and the brackets introduce a regular expression (RE) bracket expression. The result of the expansion is the list of files whose names match the pattern.To perform the expansion, we'll define the
pathnames_expand()
function, which has the following prototype:struct word_s *pathnames_expand(struct word_s *words);
This function accepts a single argument: pointer to the first word in the linked list of words we want to pathname-expand. For each word, the function does the following:
*
, ?
, and []
), by calling the helper function has_glob_chars()
, which we'll define in the source file pattern.c
. If the word contains globbing characters, we treat it as the pattern we need to match; otherwise, we move to the next word (if any)..
and ..
(which point to the current directory and the parent directory, respectively). We delegate pattern matching to another helper function, get_filename_matches()
, which we'll define in the same source file pattern.c
.Take a minute to read the
pathnames_expand()
function's code in our GitHub repository.The final step in the word expansion process is quote removal. Quoting is used to remove the special meaning of certain characters to the shell. The shell treats some characters, such as the backslash and the quotes in a special way. To inhibit this behavior, we need to quote those characters to force the shell to treat them as normal characters (you can read more about quoting in this link).
We can quote characters in one of three way: using the backslash, single quotes, or double quotes. The backslash character is used to preserve the literal meaning (that is, to escape) the character following the backslash. This is similar to the way we escape characters in the C language.
Single quotes preserve the literal meaning of all characters inside the quotes, that is, the shell doesn't attempt word expansion on single-quoted strings.
Double quotes as similar to single quotes, except that the shell recognizes the special meaning of the backquote, backslash, and
$
sign. That is, the shell can perform word expansion inside double-quoted strings.To perform quote removal, we'll define the
remove_quotes()
function, which has the following prototype:void remove_quotes(struct word_s *wordlist);
Take a minute to read the
remove_quotes()
function's code in our GitHub repository.Now that we have our word expansion functions, its time to tie it all together. In this section, we'll write our main word expansion function, the one we'll call to perform word expansion. This function, in turn, will call the other functions to perform the various steps of word expansion.
Our main function is
, which we'll define in the source file word_expand()
:wordexp.c
struct word_s *word_expand(char *orig_word);
Here's what the function does in order to perform word expansion on the word we pass as its sole argument:
~
, "
, '
, `
, =
, \
, and $
.substitute_word()
, which in turn will call the appropriate word expansion function.field_split()
.pathnames_expand()
.remove_quotes()
.Take a minute to read the
function's code in our GitHub repository.word_expand()
In Part II of this tutorial, we wrote our
function, which we used to get input tokens. So far, our tokenize()
function doesn't know how to handle quoted strings and escaped characters. To add this functionality, we need to update our code. Open the tokenize()
file and add the following code to the scanner.c
function, right after the tokenize()
switch
statement's opening brace: case '"':
case '\'':
case '`':
add_to_buf(nc);
i = find_closing_quote(src->buffer+src->curpos);
if(!i)
{
src->curpos = src->bufsize;
fprintf(stderr, "error: missing closing quote '%c'\n", nc);
return &eof_token;
}
while(i--)
{
add_to_buf(next_char(src));
}
break;
case '\\':
nc2 = next_char(src);
if(nc2 == '\n')
{
break;
}
add_to_buf(nc);
if(nc2 > 0)
{
add_to_buf(nc2);
}
break;
case '$':
add_to_buf(nc);
nc = peek_char(src);
if(nc == '{' || nc == '(')
{
i = find_closing_brace(src->buffer+src->curpos+1);
if(!i)
{
src->curpos = src->bufsize;
fprintf(stderr, "error: missing closing brace '%c'\n", nc);
return &eof_token;
}
while(i--)
{
add_to_buf(next_char(src));
}
}
else if(isalnum(nc) || nc == '*' || nc == '@' || nc == '#' ||
nc == '!' || nc == '?' || nc == '$')
{
add_to_buf(next_char(src));
}
break;
Now our lexical scanner knows how to recognize and skip over quoted strings, escaped characters, and other word expansion constructs. Check out the updated lexical scanner code in this link.
Lastly, we need to update our backend executor so that it can:
Open the
file, navigate to the beginning of the executor.c
function and find the following lines (I reduced the loop to do_simple_command()
...
to preserve space): int argc = 0;
long max_args = 255;
char *argv[max_args+1];
char *str;
while(child)
{
...
}
argv[argc] = NULL;
and replace them with the following code:
int argc = 0;
int targc = 0;
char **argv = NULL;
char *str;
while(child)
{
str = child->val.str;
struct word_s *w = word_expand(str);
if(!w)
{
child = child->next_sibling;
continue;
}
struct word_s *w2 = w;
while(w2)
{
if(check_buffer_bounds(&argc, &targc, &argv))
{
str = malloc(strlen(w2->data)+1);
if(str)
{
strcpy(str, w2->data);
argv[argc++] = str;
}
}
w2 = w2->next;
}
free_all_words(w);
child = child->next_sibling;
}
if(check_buffer_bounds(&argc, &targc, &argv))
{
argv[argc] = NULL;
}
With this code, the executor calls
on each command argument and adds the expanded word(s) to the arguments list, which we'll eventually pass on to the command. The list can grow as much as needed, thanks to our word_expand()
function, which allocates memory to the buffer as needed.check_buffer_bounds()
All that remains now is to free our arguments list once we've finished executing the command, before we return to the caller. We do this by calling:
free_buffer(argc, argv);
in three different places: after executing a builtin utility, if
fork()
returns error status (which means we can't execute an external command), and after waitpid()
has returned. Check out the updated executor code in this link.Let’s compile our shell. Open your favorite terminal emulator, navigate to your source directory and make sure you have 19 files and 2 subdirectories in there:
Now compile the shell using the following command (you need to download the source project from our GitHub repository):
make
If everything goes well,
should not output anything (apart from a few harmless warnings), and there should be an executable file named gcc
shell
in the current directory:Now invoke the shell by running
./shell
, and try using our word expansion functions and checking the results:$ echo *
Makefile build builtins executor.c executor.h initsh.c main.c node.c node.h parser.c parser.h pattern.c prompt.c scanner.c scanner.h shell shell.h shunt.c source.c source.h strings.c symtab wordexp.c
$ echo '*'
*
$ echo ~
/home/user
$ echo ~/Downloads
/home/user/Downloads
$ echo ${A=value}
value
$ echo $A
value
$ echo $((2+7))
9
And that's it for today. Our shell can now handle all kinds of word expansions (the ones defined by POSIX, to be specific). Toy around with the shell and see what results you can get from the different types of word expansion. Compare the results with the ones you get from your default shell.
We've taken a huge leap in this lesson, with lots of code, most of which we didn't have the time or space to examine in detail. You might want to spend some time reading through the code in our GitHub repository to get yourself familiar with the word expansion process.
In the next part, we’ll talk about positional parameters and special parameters, what they are, why we need them, and how to implement them in our shell.
Stay tuned!
Previously published at https://medium.com/@mohammedisam2000/building-a-linux-shell-part-v-9cf3c0e31269