Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing SQL for data observability.
Query history of a data warehouse is a rich source of information to glean how data is used in your organization. Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can extract:
These statistics also help to automate common data engineering tasks like:
SQL language is an ISO/IEC standard and the latest version is SQL2016. However, every database implements the standard differently, uses different function names for the same operation, and has extensions to access specific custom features. Therefore, there isn’t one SQL parser for dialects of all popular databases and data warehouses.
Regular expressions is a popular approach to extract information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, windows clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
An alternate approach is to implement the SQL grammar using parser generators like ANTLR. There are similar open-source parser generators in other popular languages. There are multiple projects that maintain parsers for popular open-source databases like MySQL and Postgres. For other open-source databases, the grammar can be extracted from the open-source project. For commercial databases, the only option is to reverse engineer the complete grammar. There are SQL parser/optimizer platforms like Apache Calcite that help to reduce the effort to implement the SQL dialect of your choice.
Some popular open-source databases and data warehouses are:
libpg_query extracts the parser (written in C) from the Postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:
Parser/Optimizer platforms implement the common SQL language features and allow customization as a first-class feature of the platform. Two popular open-source projects are:
Apache Calcite allows customizations at various points of the parsing process.
Apache Calcite also provides visitors for traversing the SQL execution plan. Visitor pattern is an algorithm to traverse a SQL plan.
There are many abandoned open-source SQL parsers. The first filter is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.
What if there is no parser for your database?
Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries.
Many queries will fail to parse such as UNLOAD in AWS Redshift. If it is important to also parse the variants, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.
There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good open-source projects. However, there is a steep learning curve to use these projects and in many cases, a project may not fit your specific requirements.
Struggling with parsing query history? Get in touch
Also published on Tokern.