I was recently browsing Anton Chuvakin’s posts on SIEMs and became restless about one particular requirement “tearing them apart”: real-time vs historical analysis. His post from 2014 on the subject gives an excellent overview of the antagonism between these two (scroll down to the table!).
As I’ve been thinking about this a lot during my infosec career and when building SpectX, I decided to elaborate on that list. What’s truly nagging me is this: has nothing improved since 2014 with regard to these differences, even though data volumes have grown a lot? Is the antagonistic nature of historical and near-term analysis technically difficult for SIEMs? I think it is, and the best way to explain why is to elaborate on the items in Chuvakin’s comparison list and add a few of my own.
The primary mission of SIEMs has remained the same since their dawn: detection of known threats in (near) real time (although there has been considerable development in other functionalities). The real-time nature also implies a relatively constrained data scope: you need real-time incoming events and a certain amount of recent data for detection rules/models to work. However, the system must scale across a large number of input data sources.
From a technical standpoint, the key for good performance is indexing. Computing metadata allows speeding up search and meeting the performance requirements in seconds (or minutes). However, all the data within the scope has to be indexed in order to guarantee timely responses.
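To make the trade-off concrete, here is a minimal sketch of the idea behind indexing, assuming a toy inverted index over whitespace-separated tokens. The function names and sample log lines are hypothetical and are not taken from any particular SIEM; real products use far more elaborate structures.

```python
from collections import defaultdict

def build_index(lines):
    """Precompute metadata: map each token to the set of line numbers containing it."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for token in line.split():
            index[token].add(lineno)
    return index

def search_scan(lines, token):
    """Unindexed search: every line has to be read and tokenized for each query."""
    return [i for i, line in enumerate(lines) if token in line.split()]

def search_index(index, token):
    """Indexed search: a single dictionary lookup, independent of the data size."""
    return sorted(index.get(token, ()))

# Hypothetical sample events
logs = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /login 401",
    "10.0.0.1 POST /login 200",
]
index = build_index(logs)
```

Both searches return the same line numbers, but the indexed one is fast only for data that has already passed through build_index, which is why everything within the scope has to be indexed up front.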
Indexing comes with limitations and a cost on complexity:
The scope of historical log analysis (pattern discovery, threat hunting, forensic investigation, etc.) is different. Individual tasks usually involve data from a few selected sources over much longer periods, stretching well beyond the visibility of a SIEM. At the same time, the SIEM’s own data sources must also be available for historical analysis.
However, query performance expectations are lower than for near real-time processing and getting results within minutes or tens of minutes is quite acceptable. Also, the amount of data included in individual analysis tasks is often not very different from the total amount of SIEM-indexed data for short-term analysis. The key point to be aware of is that the total amount of data that has to be available for historical analysis is considerably larger.
It is not optimal to accomplish this scalability with indexing. In addition to the complexity costs, indexes eat up a lot of space, often as much as the data itself and sometimes even more. This is not justified for more time-tolerant historical analysis.
Of course, all the discussion above is just theoretical. In practice, scaling SIEMs to include a multitude of data sources is achievable, depending on data volumes and their retention period. However, as always, there are limitations. Organisations looking to scale a free, open source solution often only look at the implementation side (cost, know-how, effort) but fail to estimate the maintenance burden and the impact of edge cases. Once they hit the limits of the current setup, it is extremely difficult to find people with the specific skills to dig into historical data, as the technical complexity grows exponentially. There are many discussions on the web on this topic, see for instance here, here and here. Many organisations have also failed.
There is one particular issue faced by those attempting to scale Elasticsearch-based systems: the effect of data inflation. If log events are not already in JSON format, they will be converted to it during the Elastic ETL process. Unfortunately, JSON carries a significant overhead compared to tabular or CSV formats. For instance, 1 GB of standard Apache web server access log will reach a footprint of approximately 2.5 GB after conversion to JSON. This is a problem because the inflated footprint directly shortens the storage period you can sustain, given the complexity of scaling.
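The overhead of repeating field names in every event can be illustrated with a rough sketch. It assumes the Apache common log format and a hypothetical set of field names; it measures only the plain JSON encoding of one event, not the additional metadata or index structures an actual Elastic pipeline adds, so the ratio it yields will differ from the figure above.

```python
import json
import re

# Apache common log format; the field names are illustrative assumptions.
LINE_RE = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def inflation_ratio(line):
    """Return the size of the JSON-encoded event divided by the size of the raw line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("line does not match the common log format")
    document = json.dumps(match.groupdict())
    return len(document.encode("utf-8")) / len(line.encode("utf-8"))

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(f"JSON is {inflation_ratio(sample):.2f}x the size of the raw line")
```

Running the same function over a sample of your own logs gives a quick first estimate of how much the conversion alone will inflate them.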
Many commercial SIEMs compress data for internal storage. This is a favorable approach, as the same 1 GB of Apache access log will shrink to roughly 200 to 250 MB after compressing with gzip. Also, the complexity of scaling is better documented and support is available, provided you have the funds to cover the costs. Pricing is a major limiting factor for many commercial SIEMs.
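The effect of compression is easy to check with Python’s standard gzip module. The sketch below generates a synthetic access log (the generator, its paths and its value ranges are made up for illustration), so the ratio it prints will not match real traffic; pointing the same two lines of measurement at a real log file gives a more meaningful number.

```python
import gzip
import random

def synthetic_access_log(n_lines, seed=0):
    """Generate a made-up access log as bytes; deterministic for a given seed."""
    rng = random.Random(seed)
    paths = ["/", "/login", "/api/items", "/static/app.js"]
    lines = []
    for i in range(n_lines):
        ip = f"10.0.{rng.randrange(256)}.{rng.randrange(256)}"
        path = rng.choice(paths)
        status = rng.choice([200, 200, 200, 404, 500])
        size = rng.randrange(100, 50000)
        lines.append(
            f'{ip} - - [10/Oct/2000:13:{i // 60 % 60:02d}:{i % 60:02d} -0700] '
            f'"GET {path} HTTP/1.1" {status} {size}'
        )
    return ("\n".join(lines) + "\n").encode("utf-8")

raw = synthetic_access_log(10_000)
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.2f}")
```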
Historical analysis encompasses many practical tasks, such as forensic analysis, data discovery, threat hunting, etc. These are the basic means of establishing ground truth when identifying malicious behaviour from logs. The focus is on discovering threats, patterns, indicators, etc. and the way it’s being done is usually exploratory, branching on different paths with intermediate results. In practice it means proceeding through multiple steps, combining different source data and other intermediate results.
All of this is quite different from a SIEM’s main goal of detecting threats based on known patterns. Hence SIEMs provide only the minimum necessary for data discovery: simple search, filtering, and basic descriptive statistics of extracted fields.
To support the exploratory process efficiently, the analytics tool should provide the following critical features:
So far we’ve talked about differences in scalability and in the nature of the analysis process. Clearly there are many details setting short-term and long-term analysis apart. However, the most crucial controversies emerge from the way data is treated. This is what I’m going to discuss in the next part of the article.