On a Friday morning, slacking, you are thinking about the new Netflix shows to watch. Your boss comes over and asks you to write a parser for a Systemd unit file. She needs it by Monday.

You're nervous. The last time you were asked to write a parser, you went down the rabbit hole of the web, Copying and Pasting Regex Formulas Until It Worked™. You take a sip of your boba tea to calm yourself down. You google Systemd and... no, it's not as simple as you thought. Your good ol' regex copy-paste trick flies out the window. Chances of spending a non-disruptive weekend binging on shows are quickly going toward null.

## PEG To the Rescue

Don't lose hope just yet. Let me introduce you to Parsing Expression Grammar (PEG), an easy way to wrap your head around parsers and save your valuable weekend.

PEG is a readable way to write syntax rules and is quite similar to regular expressions. That sets it apart from a counterpart such as the Backus-Naur Form (BNF) used for Context-free Grammars, in which expressions must be reduced to smaller and smaller symbols. Sorry, Noam Chomsky, maybe some other workday.

I will be using a Rust PEG parsing library called Pest, which is pretty awesome. If you haven't yet, I'll wait up while you go install Rust.

## Getting Started

Let's kick off with a simple grammar for parsing a JavaScript-like variable declaration statement. We will start by thinking about a set of rules for valid inputs.

A declaration statement:

- begins with the keyword `var`, followed by one or more identifiers
- is white space insensitive
- ends with a semicolon (`;`)

A group of identifiers:

- is preceded by the keyword `var`
- is delimited by commas
- is white space insensitive

An identifier:

- can contain any number of digits, characters, and underscores, in both lower and upper case
- cannot start with a digit
- cannot contain any space

An identifier is a *term*, meaning it is an unbreakable piece of the token. So are the keyword `var` and the semicolon.

Circling back to the BNF side for a bit, using Extended Backus-Naur Form (EBNF) you could formally define the above grammar like this:

```
<alpha>  := 'a' | 'b' | 'c' | 'd' | 'e' | /* ... */ 'z' |
            'A' | 'B' | 'C' | 'D' | 'E' | /* ... */ 'Z'
<digit>  := 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<Decl>   := 'var' <Idents> '\n'? ';'
<Idents> := <Ident> ('\n'? ',' <Ident>)*
<Ident>  := <alpha>+ (<alpha> | <digit> | '_')*
```

An uppercase rule name represents a non-terminating symbol (i.e. one that can be broken down into smaller terms). A lowercase name represents a term.

For brevity's sake, we leave implicit white space out of the rules. Basically, one or more white spaces can exist between every symbol you see.

Let's dig into it! The `<alpha>` and `<digit>` rules are self-explanatory, so we'll leave those to your guess.

`<Decl>` is the most complex rule for a declaration statement, featuring the `var` keyword, an `<Idents>` symbol, an optional new line, and a closing semicolon. Basically, this rule says, "Any string input beginning with `var`, followed by one or more spaces, then a sub-rule `<Idents>`, followed by one or more spaces and finally ending with a single semicolon is valid, and I'll happily chomp on it."

`<Idents>` is a single `<Ident>` symbol, followed by zero or more pairs of a comma and another `<Ident>`.

Finally, an `<Ident>` must begin with one or more characters, followed by zero or more characters, digits, or underscores.

## Ok, Code please

Hold on, young Anakin! Hot-headed, you are.
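Before any code, it can help to trace the grammar by hand. Here is a rough derivation of my own (an illustration, not part of the grammar above) showing how the rules expand to match the input `var x, y_1;`, with the implicit white space left out:

```
<Decl>   -> 'var' <Idents> ';'
<Idents> -> <Ident> ',' <Ident>
<Ident>  -> 'x'                  (a single <alpha>)
<Ident>  -> 'y' '_' '1'          (<alpha>+ then '_' and a <digit>)
```

Every character of the input is consumed by some rule along the way, so the statement is valid.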
Lo and behold, with the grammar defined, we will be able to parse these statements:

```
var foo, bar, baz;
var foo_1, baRamYu,baz99;
var foo_x  ,  baroo  ,  bazoo  ;
```

No, I didn't forget to format my code. It's valid JS, and it is for our parser too!

Here is something our rules won't stand:

```
var 99Problems
```

Does anyone care to guess what's wrong here? ✋ Leave your answer in the comment. (psst, if your answer is right, I'll follow you + 3 likes on your post 👍)

## Enter Rust and Pest

Ok, I hope you already have Rust installed (in your terminal, try typing `which cargo` and see if it comes up). Start by creating a new Rust binary project:

```
$ cargo new --bin maybe-js; cd maybe-js
```

Within the project folder, open up the `Cargo.toml` file, add the following under `[dependencies]`, and run `cargo update` to install them.

```toml
[dependencies]
pest = "2.0"
pest_derive = "2.0"
```

Once that's done, `cd` into `src`, create a file called `grammar.pest`, and paste the following in it:

```
alpha = { 'a'..'z' | 'A'..'Z' }
digit = { '0'..'9' }
underscore = { "_" }
newline = _{ "\n" | "\r" }
WHITESPACE = _{ " " }

declaration = { "var" ~ !newline ~ idents ~ newline? ~ ";" }
idents = { ident ~ (newline? ~ "," ~ ident)* }
ident = @{ !digit ~ (alpha | digit | underscore)+ }
```

Now, if I have had your attention for the last few minutes, it shouldn't be too hard to take a guess at what's happening here. (Oh, I haven't? Anyway... here we go.)

The first five rules are all terms. They are sets of valid values.

The `|` is called the Choice Operator, which is like "or-else":

```
first | or_else
```

When matching a choice expression, `first` is attempted. If `first` matches successfully, the entire expression succeeds immediately. However, if `first` fails, `or_else` is attempted next.

The `newline` rule has a quirky `_` before the bracket, which in Pest-speak means "silent" -- we simply don't want it as part of our parsed tokens, but it is nonetheless part of a valid syntax.

The `WHITESPACE` rule has a special place in Pest. If you define it, Pest will automatically insert implicit, optional white space (matching the `WHITESPACE` rule you define) between symbols. Again, the `_` says we want to silence it, since we don't want tons of white space as part of our syntax tree.

The `declaration` rule is very similar to the EBNF counterpart we learned before. The tildes (`~`) simply mean "and then". The rule starts with the `var` keyword, followed by anything that isn't a newline (the `!` does what you intuitively would have guessed -- negating a rule), followed by the sub-rule `idents`, an optional newline, and a closing semicolon.

The `idents` rule is again similar to the EBNF example. It is a single `ident`, followed by zero or more comma-separated `ident`s.

The `ident` rule is a little special. The `@` in front of the bracket, known as the Atomic sigil, is there to say, "I don't want implicit white space to apply to this rule." We sure don't want any spaces inside our variable names. Moreover, marking a rule as atomic this way treats the rule as a term, silencing the internal matching rules; any inner rules are discarded. Consider this pair of rules, first without the `@`:

```
string_lit = { "\"" ~ inner ~ "\"" }
inner = { ASCII_ALPHANUMERIC* }
```

Note that `ASCII_ALPHANUMERIC` is a convenient built-in Pest rule for any ASCII character or digit.

If we parsed the string `"hello"` with this rule, it would yield a `string_lit` node, which in turn has an `inner` node containing the string `hello`, without quotation marks.
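In code, that nesting would look roughly like the sketch below. This is my own throwaway example, not part of the tutorial project: the `StrParser` name is made up, and it uses `pest_derive`'s `grammar_inline` attribute (available in recent 2.x releases) instead of a separate `.pest` file.

```rust
extern crate pest;
#[macro_use]
extern crate pest_derive;

use pest::Parser;

// A throwaway parser just for the string_lit example.
#[derive(Parser)]
#[grammar_inline = r#"
string_lit = { "\"" ~ inner ~ "\"" }
inner = { ASCII_ALPHANUMERIC* }
"#]
struct StrParser;

fn main() {
    let pair = StrParser::parse(Rule::string_lit, "\"hello\"")
        .expect("unsuccessful parse")
        .next()
        .unwrap();

    // The outer pair spans the whole input, quotes included...
    assert_eq!(pair.as_str(), "\"hello\"");

    // ...and it carries a nested `inner` pair holding just `hello`.
    let inner = pair.into_inner().next().unwrap();
    assert_eq!(inner.as_str(), "hello");
}
```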
Adding a `@` sigil in front of the `string_lit` bracket:

```
string_lit = @{ "\"" ~ inner ~ "\"" }
inner = { ASCII_ALPHANUMERIC* }
```

we will end up with a single, flat `string_lit` node containing `"hello"`, quotation marks and all.

A similar sigil, `$`, known as the Compound Atomic, also keeps implicit white space out of a rule. The difference is that it allows the internal matching rules to produce tokens as normal.

The `!digit` part prevents the `ident` rule from going forward if the input starts with a number. If it doesn't, any combination of one or more characters, digits, and underscores is fine.

## But wait, where's the code?

Dang, my clever explorer! You seem to corner me at every move. Yes, that wasn't code at all but a Pest grammar definition. Now we have to write some Rust code to parse some text.

Fire up `src/main.rs` and add the following:

```rust
// You need to do this to use the derive macro.
extern crate pest;
#[macro_use]
extern crate pest_derive;

// 1. Import modules.
use std::fs;
use pest::Parser;
use pest::iterators::Pair;

// 2. Define a "marker" struct and add a path to our grammar file.
#[derive(Parser)]
#[grammar = "grammar.pest"]
struct IdentParser;

// 3. Print the details of the current Pair and an optional divider.
fn print_pair(pair: &Pair<Rule>, hard_divider: bool) {
    println!("Rule: {:?}", pair.as_rule());
    println!("Span: {:?}", pair.as_span());
    println!("Text: {:?}", pair.as_str());
    if hard_divider {
        println!("{:=>60}", "");
    } else {
        println!("{:->60}", "");
    }
}

fn main() {
    // 4. Parse a sample string input.
    let pair = IdentParser::parse(Rule::declaration, "var foo1, bar_99, fooBar;")
        .expect("unsuccessful parse")
        .next()
        .unwrap();

    print_pair(&pair, true);

    // 5. Iterate over the "inner" Pairs.
    for inner_pair in pair.into_inner() {
        print_pair(&inner_pair, true);
        match inner_pair.as_rule() {
            // 6. If we match an idents rule...
            Rule::idents => {
                // 7. ...iterate over another level of inner Pairs.
                for inner_inner_pair in inner_pair.into_inner() {
                    match inner_inner_pair.as_rule() {
                        // 8. The term ident is the last level.
                        Rule::ident => {
                            print_pair(&inner_inner_pair, false);
                        }
                        _ => unreachable!(),
                    }
                }
            }
            _ => unreachable!(),
        }
    }
}
```

It's ok if you don't understand most of the things here. Run it with `cargo run` in your project directory and look at the printed output:

```
Rule: declaration
Span: Span { str: "var foo1, bar_99, fooBar;", start: 0, end: 25 }
Text: "var foo1, bar_99, fooBar;"
============================================================
Rule: idents
Span: Span { str: "foo1, bar_99, fooBar", start: 4, end: 24 }
Text: "foo1, bar_99, fooBar"
============================================================
Rule: ident
Span: Span { str: "foo1", start: 4, end: 8 }
Text: "foo1"
------------------------------------------------------------
Rule: ident
Span: Span { str: "bar_99", start: 10, end: 16 }
Text: "bar_99"
------------------------------------------------------------
Rule: ident
Span: Span { str: "fooBar", start: 18, end: 24 }
Text: "fooBar"
------------------------------------------------------------
```

The most important concept here is a `Pair`. It represents a matching pair of tokens, or equivalently, the spanned text that a named rule successfully matched.

We often use `Pair`s to:

- determine which rule produced it
- use it as a raw `&str`
- inspect the inner named sub-rules that produced it
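The next snippet exercises those three uses on two rules, `enclosed` and `number`, which are not part of our `maybe-js` grammar. They come from the example grammar in the Pest book, which looks roughly like this (my paraphrase):

```
number = { ASCII_DIGIT+ }               // one or more decimal digits
enclosed = { "(.." ~ number ~ "..)" }   // e.g. "(..6472..)"
```

With rules like those in a grammar, a `Pair` can be put to work as follows (the bare `Parser::parse` below stands in for whatever struct you derived `Parser` on):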
```rust
let pair = Parser::parse(Rule::enclosed, "(..6472..) and more text")
    .unwrap().next().unwrap();

assert_eq!(pair.as_rule(), Rule::enclosed);
assert_eq!(pair.as_str(), "(..6472..)");

let inner_rules = pair.into_inner();
println!("{}", inner_rules); // --> [number(3, 7)]
```

A `Pair` can have zero, one, or more inner rules. For maximum flexibility, `Pair::into_inner()` returns `Pairs`, an iterator type over each inner pair.

💡 `Pair::into_inner()` is a very common idiom when working with Pest. Make sure you understand what a `Pair` is.

## Let's parse Systemd

Now it's time to put in the work. Here is an example of a Systemd unit file:

```
[Unit]
Description=Nginx
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/home/ec2-user/.local/bin
Environment=LD_LIBRARY_PATH=/usr/local/lib
Environment=PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
ExecStart=/usr/local/sbin/nginx-runner.sh
Restart=on-failure
RestartSec=0
KillMode=process

[Install]
WantedBy=multi-user.target
```

The file is grouped into sections, each with a name enclosed in a pair of square brackets. Each section contains zero or more pairs of property name and value, separated by an equal sign (`=`).

Let's try to flesh out a set of rules. Create a new Rust project with `cargo new --bin systemd-parser`, then create a file named `src/grammar.pest` with the following rules:

```
// Implicit white spaces are ok.
WHITESPACE = _{ " " }

// Set of characters permitted.
char = { ASCII_ALPHANUMERIC | "." | "_" | "/" | "-" }

// name is one or more chars. Note that white spaces are allowed.
name = { char+ }

// value can be zero or more chars, plus "=" and ":" for path variables.
value = { (char | "=" | ":" )* }

// section is a name enclosed in square brackets.
section = { "[" ~ name ~ "]" }

// property pair is a name and a value, separated by an equal sign.
property = { name ~ "=" ~ value }

// A Systemd unit file structure.
file = { SOI ~ ((section | property)? ~ NEWLINE)* ~ EOI }
```

In the `main.rs` file, start with the following:

```rust
extern crate pest;
#[macro_use]
extern crate pest_derive;

use std::fs;
use std::collections::HashMap;

use pest::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"]
struct SystemdParser;

/// Implement a simple AST representation.
#[derive(Debug, Clone)]
pub enum SystemdValue {
    List(Vec<String>),
    Str(String),
}

// ...
```

As the first step, after the initial imports and setup, we define a `SystemdValue` enum as a simple representation of the data types in a Systemd file. `SystemdValue::Str(String)` will capture a single property value, and `SystemdValue::List(Vec<String>)` will capture the values of a duplicated property key. For example, in the previous `nginx.service` file, there are several `Environment` properties.
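To make the target shape concrete before we write the parsing loop, here is a small hand-built sketch (mine, not output from the parser) of what the `[Service]` section of `nginx.service` could end up looking like once it is mapped into `SystemdValue`s:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub enum SystemdValue {
    List(Vec<String>),
    Str(String),
}

fn main() {
    let mut service: HashMap<String, SystemdValue> = HashMap::new();

    // A property that appears once keeps a single value...
    service.insert(
        "Type".to_string(),
        SystemdValue::Str("simple".to_string()),
    );

    // ...while a repeated key like Environment collects all of its values.
    service.insert(
        "Environment".to_string(),
        SystemdValue::List(vec![
            "LD_LIBRARY_PATH=/usr/local/lib".to_string(),
            "PKG_CONFIG_PATH=/usr/local/lib/pkgconfig".to_string(),
        ]),
    );

    println!("{:#?}", service);
}
```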
Here is the `main` function:

```rust
fn main() {
    // Read and parse the unit file.
    let unparsed_file = fs::read_to_string("nginx.service")
        .expect("cannot read file");
    let file = SystemdParser::parse(Rule::file, &unparsed_file)
        .expect("fail to parse")
        .next()
        .unwrap();

    // Create a fresh HashMap to store the data.
    let mut properties: HashMap<String, HashMap<String, SystemdValue>> = HashMap::new();

    // These two mutable variables will hold the current section name
    // and property key name while we iterate.
    let mut current_section_name = String::new();
    let mut current_key_name = String::new();

    // Iterate over the file line by line.
    for line in file.into_inner() {
        match line.as_rule() {
            Rule::section => {
                // Update current_section_name.
                let mut inner_rules = line.into_inner();
                current_section_name = inner_rules.next().unwrap().as_str().to_string();
            }
            Rule::property => {
                let mut inner_rules = line.into_inner();

                // Get the sub-map of properties for current_section_name, or create a new one.
                let section = properties.entry(current_section_name.clone()).or_default();

                // Get the current property name and value.
                let name = inner_rules.next().unwrap().as_str().to_string();
                let value = inner_rules.next().unwrap().as_str().to_string();

                // If the property name already exists...
                if name == current_key_name {
                    // Get the entry with the key name, or insert a new SystemdValue::List.
                    let entry = section
                        .entry(current_key_name.clone())
                        .or_insert(SystemdValue::List(vec![]));
                    // Push the value onto the inner vector of the SystemdValue::List.
                    if let SystemdValue::List(ent) = entry {
                        ent.push(value);
                    }
                } else {
                    // Create a new SystemdValue::List and save it under the name key.
                    let entry = section
                        .entry(name.clone())
                        .or_insert(SystemdValue::List(vec![]));
                    // Push the current value onto the vector, then set
                    // current_key_name to the current name.
                    if let SystemdValue::List(ent) = entry {
                        ent.push(value);
                    }
                    current_key_name = name;
                }
            }
            Rule::EOI => (),
            _ => unreachable!(),
        }
    }
}
```

This is all good, but we didn't use `SystemdValue::Str` anywhere in the code. To keep the code clean, we decided to first treat every property value as a `SystemdValue::List(Vec<String>)` keyed by the property name, where the String vector stores the list of property values. If there is no value, the vector is empty. If there is one value, the vector contains that single value, and so on.

To make the API a little bit easier to use, we will write a small helper function to process all the single-valued `SystemdValue::List(Vec<String>)`s and convert them into `SystemdValue::Str(String)`s.

```rust
// Iterate over the nested maps, and convert empty and
// single-element `SystemdValue::List(Vec<String>)` into
// `SystemdValue::Str(String)`.
fn pre_process_map(map: &mut HashMap<String, HashMap<String, SystemdValue>>) {
    for (_, value) in map.into_iter() {
        for (_, v) in value.into_iter() {
            if let SystemdValue::List(vs) = v {
                if vs.len() == 0 {
                    let v_ = SystemdValue::Str(String::new());
                    *v = v_.clone();
                } else if vs.len() == 1 {
                    let v_ = SystemdValue::Str((vs[0]).clone());
                    *v = v_.clone();
                }
            }
        }
    }
}
```

Now we are ready to print it out for the world to see!

```rust
fn main() {
    // ... our main code from above ...

    pre_process_map(&mut properties);

    println!("{:#?}", properties);
}
```

Boom! Congratulations! You have just written a Systemd file parser, plus a mini JS one. 🤯 Now you will have some spare time to party away on your Friday night. With another afternoon, you might even figure out how to serialize your Systemd unit file into JSON to impress your boss on Monday.

You can check out the parser code implemented as a library in [this repo](https://github.com/jochasinga/systemd-parser).

Follow me to get more meme-joy hackery like this one. I like to brew and distill hard ciders into smooth ones.

Also published on: https://dev.to/pancy/learn-to-build-a-parser-in-rust-for-fun-and-profit-2id5