Introduction

In this tutorial, I will show you how Materialize works by using it to run SQL queries on continuously produced nginx logs. By the end of the tutorial, you will have a better idea of what Materialize is, how it differs from other SQL engines, and how to use it.

Prerequisites

For the sake of simplicity, I will use a brand new Ubuntu 21.04 server where I will install nginx, Materialize and mzcli, a CLI tool similar to psql that is used to connect to Materialize and execute SQL on it.

If you want to follow along, you could spin up a new Ubuntu 21.04 server on your favorite cloud provider.

If you prefer running Materialize on a different operating system, you can follow the installation steps here: How to install Materialize

What is Materialize?

Materialize is a streaming database for real-time analytics.

It is not a substitute for your transactional database. Instead, it accepts input data from a variety of sources like:

- Messages from streaming sources like Kafka
- Archived data from object stores like S3
- Change feeds from databases like PostgreSQL
- Data in files: CSV, JSON and even unstructured files like logs (what we'll be using today)

It then maintains the answers to your SQL queries over time, keeping them up-to-date as new data flows in (using materialized views), instead of running them against a static snapshot at a point in time.

If you want to learn more about Materialize, make sure to check out the official documentation here: Materialize Documentation

Installing Materialize

Materialize runs as a single binary called materialized. Since we're running on Linux, we'll just install Materialize directly.
To install it, run the following command (the d in materialized stands for daemon, following Unix conventions):

    sudo apt install materialized

Once it's installed, start Materialize (with sudo so it has access to the nginx logs):

    sudo materialized

Now that we have materialized running, we need to open a new terminal to install and run a CLI tool that we use to interact with our Materialize instance!

There are other ways to run Materialize, as described here. For a production-ready Materialize instance, I would recommend giving Materialize Cloud a try!

Installing mzcli

The mzcli tool lets us connect to Materialize similar to how we would use a SQL client to connect to any other database.

Materialize is wire-compatible with PostgreSQL, so if you already have psql installed you could use it instead of mzcli, but with mzcli you get nice syntax highlighting and autocomplete when writing your queries.

To learn the main differences between the two, make sure to check out the official documentation here: Materialize CLI Connections

The easiest way to install mzcli is via pipx, so first run:

    sudo apt install pipx

and, once pipx is installed, install mzcli with:

    pipx install mzcli

Now that we have mzcli, we can connect to materialized with:

    mzcli -U materialize -h localhost -p 6875 materialize

For this demo, let's quickly install nginx and use a regex to parse the log and create materialized views.

Installing nginx

If you don't already have nginx installed, install it with the following command:

    sudo apt install nginx

Next, let's populate the access log with some entries using a Bash loop:

    for i in {1..200} ; do curl -s 'localhost/materialize' > /dev/null ; echo $i ; done

If you have an actual nginx access.log, you can skip the step above.

Now we'll have some entries in the access log file (/var/log/nginx/access.log) that we can feed into Materialize.

Adding a Materialize Source

By creating a source, you are essentially telling Materialize to connect to some external data source. As described in the introduction, you could connect a wide variety of sources to Materialize.

For the full list of source types, make sure to check out the official documentation here: Materialize source types

Let's start by creating a text file source from our nginx access log at /var/log/nginx/access.log.

First, access the Materialize instance with the mzcli command:

    mzcli -U materialize -h localhost -p 6875 materialize

Then run the following statement to create the source:

    CREATE SOURCE nginx_log
    FROM FILE '/var/log/nginx/access.log'
    WITH (tail = true)
    FORMAT REGEX '(?P<ipaddress>[^ ]+) - - \[(?P<time>[^\]]+)\] "(?P<request>[^ ]+) (?P<url>[^ ]+)[^"]+" (?P<statuscode>\d{3})';

A quick rundown:

- CREATE SOURCE: First we specify that we want to create a source.
- FROM FILE: Then we specify that this source will read from a local file, and we provide the path to that file.
- WITH (tail = true): Continually check the file for new content.
- FORMAT REGEX: As this is an unstructured file, we need to specify regex as the format so that we extract only the specific parts of the log that we need.

Let's quickly review the regex itself as well.

The Materialize-specific behavior to note here is that the ?P<NAME_HERE> pattern extracts the matched text into a column named NAME_HERE.

To make this a bit more clear, a standard entry in your nginx access log file would look like this:

    123.123.123.119 - - [13/Oct/2021:10:54:22 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 zgrab/0.x"

- (?P<ipaddress>[^ ]+): With this pattern we match the IP address for each line of the nginx log, e.g. 123.123.123.119.
- \[(?P<time>[^\]]+)\]: the timestamp string from inside the square brackets, e.g. [13/Oct/2021:10:54:22 +0000].
- "(?P<request>[^ ]+): the type of request, like GET, POST, etc.
- (?P<url>[^ ]+): the relative URL, e.g. /favicon.ico.
- (?P<statuscode>\d{3}): the three-digit HTTP status code.

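To see what each capture group picks out of the sample log entry, here is a rough shell sketch that applies an equivalent pattern with sed. It is only an approximation for experimenting outside Materialize, not how Materialize itself evaluates the regex: sed -E uses POSIX extended regular expressions, which have no named groups and no \d, so the sketch assumes positional groups and [0-9] instead, and adds ^ and .*$ anchors so the whole line is replaced. The variable names (line, re, parsed, logtime) are made up for this illustration.

```shell
# Sample entry copied from the tutorial text above.
line='123.123.123.119 - - [13/Oct/2021:10:54:22 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 zgrab/0.x"'

# POSIX ERE translation of the source regex (assumption: positional groups
# \1..\5 stand in for ipaddress, time, request, url and statuscode).
re='^([^ ]+) - - \[([^]]+)] "([^ ]+) ([^ ]+)[^"]+" ([0-9]{3}).*$'

# -n plus the p flag prints only lines that match; fields are joined with "|".
parsed=$(printf '%s\n' "$line" | sed -nE "s/$re/\1|\2|\3|\4|\5/p")

# Split the joined fields back into one shell variable per column.
IFS='|' read -r ipaddress logtime request url statuscode <<EOF
$parsed
EOF

printf 'ipaddress=%s\ntime=%s\nrequest=%s\nurl=%s\nstatuscode=%s\n' \
  "$ipaddress" "$logtime" "$request" "$url" "$statuscode"
```

For the sample entry this yields 123.123.123.119, 13/Oct/2021:10:54:22 +0000, GET, / and 200, which are the values the nginx_log source would expose as columns for that line.
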
Once you execute the CREATE SOURCE statement, you can confirm the source was created successfully by running the following:

    mz> SHOW SOURCES;
    // Output
    +-----------+
    | name      |
    |-----------|
    | nginx_log |
    +-----------+
    SELECT 1
    Time: 0.021s

Now that we have our source in place, let's go ahead and create a view!

Creating a Materialized View

You may be familiar with materialized views from the world of traditional databases like PostgreSQL, where they are essentially cached queries. The unique feature here is that the materialized view we are about to create is automatically kept up-to-date.

In order to create a materialized view, we will use the following statement:

    CREATE MATERIALIZED VIEW aggregated_logs AS
      SELECT
        ipaddress,
        request,
        url,
        statuscode::int,
        COUNT(*) as count
      FROM nginx_log
      GROUP BY 1,2,3,4;

The important things to note are:

- Materialize will keep the results of the embedded query in memory, so you'll always get a fast and up-to-date answer.
- The results are incrementally updated as new log events arrive.

Under the hood, Materialize compiles your SQL query into a dataflow and then takes care of all the heavy lifting for you. This is incredibly powerful, as it allows you to process data in real-time using just SQL.

A quick rundown of the statement itself:

- First we start with CREATE MATERIALIZED VIEW aggregated_logs, which identifies that we want to create a new materialized view named aggregated_logs.
- Then we specify the SELECT statement that we are interested in keeping track of over time. In this case we are aggregating the data in our log file by ipaddress, request, url and statuscode, and we are counting the total instances of each combination with COUNT(*).
- The statuscode::int cast converts the three digits captured by (?P<statuscode>\d{3}) from text to an integer.

A materialized view can be based on multiple sources, such as a stream from Kafka, a raw data file that you have in an S3 bucket, or your PostgreSQL database. This single statement gives you the power to analyze your data in real-time.

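For contrast with the incrementally maintained view, here is a rough sketch of the same grouping computed the traditional way: as a one-shot pass over a static snapshot of the log with awk. The sample lines, the temporary file and the variable names are made up for illustration, and the field positions assume nginx's default combined log format. Unlike aggregated_logs, this result goes stale as soon as a new line is appended, and refreshing it means rescanning the whole file.

```shell
# Hypothetical sample data standing in for /var/log/nginx/access.log.
log=$(mktemp)
cat > "$log" <<'EOF'
127.0.0.1 - - [13/Oct/2021:10:54:22 +0000] "GET /materialize HTTP/1.1" 404 162 "-" "curl/7.74.0"
127.0.0.1 - - [13/Oct/2021:10:54:23 +0000] "GET /materialize HTTP/1.1" 404 162 "-" "curl/7.74.0"
127.0.0.1 - - [13/Oct/2021:10:54:24 +0000] "GET /materialize/demo-page HTTP/1.1" 404 162 "-" "curl/7.74.0"
EOF

# One-shot counterpart of:
#   SELECT ipaddress, request, url, statuscode, COUNT(*) ... GROUP BY 1,2,3,4
# With whitespace splitting on the combined format:
#   $1 = client IP, $6 = method with a leading quote, $7 = URL, $9 = status.
summary=$(awk '{
  method = substr($6, 2)                       # drop the leading double quote
  counts[$1 " | " method " | " $7 " | " $9]++  # one counter per group key
}
END {
  for (key in counts) print key " | " counts[key]
}' "$log" | LC_ALL=C sort)

printf '%s\n' "$summary"
rm -f "$log"
```

Each output row has the same shape as a row of the view (ipaddress, request, url, statuscode, count), which makes it easy to compare against what SELECT * FROM aggregated_logs returns for the same log.
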
We specified a simple SELECT for the view to be based on, but this could include complex operations like JOINs. For the sake of this tutorial, we are keeping things simple.

For more information about materialized views, check out the official documentation here: Creating Materialized views

Now you can use this new view and interact with the data from the nginx log with pure SQL!

Reading from the view

If we do a SELECT on this materialized view, we get a nice aggregated summary of stats:

    SELECT * FROM aggregated_logs;

     ipaddress | request | url          | statuscode | count
    -----------+---------+--------------+------------+-------
     127.0.0.1 | GET     | /materialize |        404 |   200

As more requests come in to the nginx server, the aggregated stats in the view are kept up-to-date.

We could also write queries that do further aggregation and filtering on top of the materialized view, for example, counting requests by route only:

    SELECT url, SUM(count) as total FROM aggregated_logs GROUP BY 1 ORDER BY 2 DESC;

If we were to re-run the query over and over again, we would see the numbers change as soon as new data lands in the log, because Materialize processes each line of the log and keeps listening for new lines:

    +--------------------------+-------+
    | url                      | total |
    |--------------------------+-------|
    | /materialize/demo-page-2 | 1255  |
    | /materialize/demo-page   | 1957  |
    | /materialize             | 400   |
    +--------------------------+-------+

As another example, let's use psql together with the watch command to see this in action.
If you don't already have psql installed, you can install it with the following command:

    sudo apt install postgresql-client

After that, let's run the SELECT * FROM aggregated_logs statement every second using the watch command:

    watch -n1 "psql -c 'select * from aggregated_logs' -U materialize -h localhost -p 6875 materialize"

In another terminal window, you could run another for loop to generate some new nginx logs and see how the results change:

    for i in {1..2000} ; do curl -s 'localhost/materialize/demo-page-2' > /dev/null ; echo $i ; done

The output of the watch command refreshes every second, so you can see the counts change as the new requests are logged.

Feel free to experiment with more complex queries and analyze your nginx access log for suspicious activity using pure SQL, keeping track of the results in real-time!

Conclusion

By now, hopefully, you have a hands-on understanding of how incrementally maintained materialized views work in Materialize.

If you like the project, make sure to star it on GitHub: https://github.com/MaterializeInc/materialize

If you are totally new to SQL, make sure to check out this free eBook here: Free introduction to SQL basics eBook

Also published here: https://devdojo.com/bobbyiliev/learn-materialize-by-running-streaming-sql-on-your-nginx-logs