Parsing is a process of converting formatted text into a data structure. A data structure type can be any suitable representation of the information engraved in the source text.
List
of List of values
or a List of Record objects
.A piece of program that does parsing is called Parser.
Parser analyses source text against the format* prescribed. If source text does not match against format error is thrown or returned.
*format is coded inside the parser.
Format is the DNA of a parser.
Small Case Study
Consider an example of Date parsing from a string (source) in format
DD-MM-YYYY
to Date object:class Date {
int day;
int month;
int year;
}
For Date parsing, I would be using Regular Expression¹ (regex for short). Regex can be matched against a string. It also helps in extracting part of the source text if it is matched.
Note: This is going to be a small-scale illustration of parsing with “Regex.” This might be a fair approach for one-two lines input text but not in every case. You may have to write a parser by hand (most complex) or use Parser Generator tools (moderately complex).
Code
The parsing and extracting the Date element is as:
String date = "20-05-2012";
// 1. defining a regular expression
Pattern dateMatcher = Pattern.compile("(\\d{2})-(\\d{2})-(\\d{4})");
// 2. matching a regular expression
Matcher matcher = dateMatcher.matcher(date);
// 3. if pattern matches
if (matcher.find()) {
// extract day
int day = Integer.parseInt(matcher.group(1));
int month = Integer.parseInt(matcher.group(2));
int year = Integer.parseInt(matcher.group(3));
// 4. validate date : skipped code
return new Date(day, month, year);
} else {
// throw Error
throw new DateParseException(date +" is not valid")
}
The code is written in Java.
How does this code work:
1. Date formatted as DD-MM-YYYY can be matched using regex:
(\d{2})-(\d{2})-(\d{4})
# where \d matches any digit between 0 to 9,
# {n} means how many times the last character type is expected
# () is called capturing group and enclosed part of string that would be extracted.
# Each group is numbered, the first group is assigned "1", the second one is given number "2", and so on
2. The given date as the string is matched against the defined regex.
3. If the match is successful, then the day, month, and year are extracted from the source string using Group Construct provided by Regular Expression API. It is a standard construct in any Regular Expression API in any language.
4. The extracted date parts can be validated; related code is skipped for you to exercise.
This is an illustration of Regular Expression based parsing, which is limited to format, which can be defined by regular expressions. It is a basic example of parsing. A programming language or format like XML parsing is complex in nature. You can refer to the book named “Crafting Interpreters” to get an idea about a full-scaled parsing.
You can also read about “Lezer: Code Mirror’s Parsing System”
Parsing can be scoped as a composition of Scanning and Syntactic Analysis or just Syntactic Analysis.
Scanning
Scanning is a process of converting a stream of characters into tokens.
A token represents a “concept” introduced by format. Logically, a token can be considered as a label assigned to one or more characters. From a processing point of view: A token is an object, can contain lexeme³, location information, and more.
In Java language: if, while, int are examples of tokens. In date parsing, tokens are defined by Regex;
\d{2}
(day, month), —
(separator), \d{4}
(year) are tokens. Note: Day and month are the same token type. A token is defined by “the pattern of characters” not by locations of characters.Scanning is also called Tokenization or Lexical Analysis. A part of the program which does scanning is called Lexer or Scanner or Tokenizer.
Syntactic Analysis
Syntactic Analysis analyses the structure formed as keeping tokens in order as their positions. It also validates and extracts engraved data to create the preferred Data Structure.
In date parsing example: “day is followed by month and year.” The order is checked by Regex Engine. Extraction is done based on order and matches.
Errors reported by this phase are called Syntactic Errors. Going back to Date parsing example,
99-JAN-2021
is an invalid Date; however, 99–99–9999
is a valid Date because rule (\d{2})-(\d{2})-(\d{4})
says so. It may seem absurd, but parsers generally validate Syntactic Correctness.The scale of parsing determines the inclusion or exclusion of Scanning as part of it:
Many times, parsing tasks are delegated to parser code generators like Antlr or Lex. These tools require a set of rules or grammar and generate parser code.
The generated parsers produce a tree upon parsing, which may not be the desired data structure; however, these libraries provide sufficient APIs to convert the constructed tree to the desired data structure.
There is more in Parser world; the type of parsing styles: top-down or bottom-up, Leftmost or Rightmost derivation. These design styles limit parsing capabilities of the parser. You may visit the following link to get a synoptic view of these design choices and their comparison:
https://tomassetti.me/guide-parsing-algorithms-terminology/#tablesParsingAlgorithms
Previously published at https://themightyprogrammer.dev/article/parsing