Introduction Sooner or later every software engineer will come across a situation that requires some kind of text parsing. The text might contain information in semi-structured form that need to get parsed and saved into data structures. Different methods can be used to tackle the text parsing problem. One approach is to develop custom code using a line split approach and then parse the separate elements. Another one can be to use a built-in scanner library function available in many programming languages. If you are interested to dive deep into the topic you can find a , covering everything from regular expressions, grammars and lexers to parsing trees and algorithms. very detailed guide here Deterministic vs Non-deterministic, what's the difference? In this article, I compare two parsing methods with a focus on their performance. The first method uses for parsing and data extraction. You can find a regular expression engine in any programming language. Great examples of regular expression library capabilities are included in perl, python, php and java. regular expressions A regular expression can be modeled by a . FSMs are quite flexible and can be used for any task that has distinct states of processing or operation. In parsing text using regular expressions, a class of FSMs is of particular interest, the Non-deterministic FSM or as they are also known the , You can easily and then use the NDFA to parse the text input. Finite State Machine (FSM) Non-deterministic Finite Automata (NDFA) convert a regular expression into a NDFA Using an NDFA in practice usually means a lot of backtracking. The algorithm has to follow every possible transition from a state to another and when a possible path fails it goes back and continues with another path. In theory, there are methods available that convert a NDFA to a Deterministic Finite Automaton (DFA). This conversion can be quite tricky when your parser is designed using regular expressions. In this article, I opted for a DFA designed from scratch as the second method to parse the text and extract the data. This allowed me to compare the two methods, the NDFA and the DFA, in terms of performance. Case study - An .ini file parser To compare the two methods, the case study of an configuration file parser and reader was used. The input file can have configuration sections that are denoted in a single line using brackets and section names, for example: ini-like [sectionA] The section line can also have a comment part: [sectionA] ; is a comment this Comments can also exist on their own in a line: ; line contains only comment this Each section can contain multiple lines with settings in a key-value fashion. The key name must be a string and the value can take various types. The supported types are string, integer, path, array and boolean. Some examples are given below: keyname1 = keyname2 = keyname3 = .ssh/file.conf keyname4 = value1,value2,value3 keyname5 = "A string value" 35363262 /root/ true Each setting can have overrides that are loaded instead of the main value when the user enters the flag during loading. For example, you can have multiple values for key name config_path depending on the deployment system production, staging, testing: config_path<production> = lib/config/prod.cfg config_path<testing> = lib/config/prod.cfg config_path<staging> = lib/config/prod.cfg /var/ /var/ /var/ For a more comprehensive example of the accepted input, you can check the github repository containing the code used in this article. A sample input file can be found here. Non-deterministic parser using Regular Expressions Firstly, the regular expression parser was put in action and its performance measured. The graph detailing the NDFA that was created from the regular expression is shown in graph 1. This graph implies backtracking when a path fails until all possible paths in the graph from Start to End are exhausted. In each one of the nodes, a regular expression match is tried and if it succeeds the corresponding data is extracted if applicable. Comment node expects the following regular expression to match \s*;.* Section node has a data group and expects \s*\[([\\w_]+)\]\s* For a full definition of the regular expressions used in this NDFA you can check the class that implements it in the source code. Deterministic parser using Finite State Machine Next, a parser based on an FSM implementation was build and tested on the same input data. The FSM graph is shown in graph 2. In this case, transitions between states are deterministic and depend only on the current input character. There is no backtracking and each text line is processed character by character using the FSM. States that are tagged with the isFinal flag are valid terminal states. If the input text is fully processed while the FSM is in a final state, the parsing is completed successfully. The color legend shows the states extract different data types from the input text. For example, light-green state 3 extracts the group section name. For a full definition of the FSM and its implementation in Java you can check the . source code here Performance results So what are the results of the performance comparison between the two methods? To find out, a set of text parsing operations on a large synthetic file was done. The parsing was repeated multiple times for each method and the average time was measured. The resulting data from text parsing was put onto memory in HashMap. You can see the results of the comparison for different file sizes in graph 3. The blue line shows the parsing time of NDFA for different sizes of the input file. The green line shows the parsing time of DFA for the same input size. The input size is counted in total configuration lines and the x axis scale is logarithmic. The DFA scales well and approximates a O(logN) complexity, whereas NDFA approximates a O(N) complexity. At N = 2.4M lines DFA is 4-5x faster than NDFA. This makes DFA parser an excellent parser candidate when text parsing speed is important. Conclusion In this article, I have reviewed and compared the performance of different types of Finite Automata for the task of text parsing into a data structure in memory. A Non-Deterministic Finite Automaton based on Java's Regular Expression engine was and tested. built Next, a Deterministic Finite Automaton was using a custom-built Java code FSM implementation. The DFA was found at least 4 times faster than the NDFA and having logarithmic time complexity as the input size increases. You can find the . built full code implementation used in the article in my Github repository About the author currently works as a Senior Software Architect / Technical Lead at . He likes to design and implement large-scale distributed application backends and single-page web applications, embedded with Machine Learning (ML) components. From time to time he might also get his hands dirty with Infrastructure as Code (IaC) and DevOps tasks or mini IoT projects. Spiros Dimopoulos Behavioral Signals