I was recently browsing Anton Chuvakin’s posts on SIEMs and became restless about one particular requirement that is “tearing them apart”: real-time vs historical analysis. His 2014 post on the subject gives an excellent overview of the antagonism between the two (scroll down to the table!).
As I’ve been thinking about this a lot during my infosec career and while building SpectX, I decided to elaborate on that list. What’s truly nagging me is this: has nothing improved since 2014 with regard to these differences, even though data volumes have grown considerably? Is the antagonistic nature of historical and near-term analysis technically difficult for SIEMs? I think it is, and the best way to explain why is to elaborate on the items in Chuvakin’s comparison list and add a few of my own.
SIEMs and the Scalability Model of Near Real-Time Analysis
The primary mission of SIEMs has remained the same since their dawn: detection of known threats in (near) real time (although other functionality has developed considerably). The real-time nature also implies a relatively constrained data scope: you need real-time incoming events and a certain amount of recent data for detection rules/models to work. However, the system must scale across a large number of input data sources.
From a technical standpoint, the key to good performance is indexing. Precomputed metadata speeds up search so that queries can meet response-time requirements measured in seconds (or minutes). However, all the data within the scope has to be indexed in order to guarantee timely responses.
Indexing comes with limitations and a cost in complexity:
- Data has to be indexed before it can be queried, which inherently means that the architecture has to scale both in terms of processing AND storage.
- Index computation has to handle fluctuations (peaks) of incoming events.
- Query processing has to compete with the computation of indexes.
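To make the trade-off concrete, here is a minimal sketch of a toy inverted index over a handful of made-up log lines. It is not how any particular SIEM implements indexing; the function names and sample events are assumptions for illustration only. It shows the two sides of the coin: a lookup no longer has to read every event, but the index has to be built before the first query and it occupies space of its own.

```python
from collections import defaultdict


def build_index(events):
    """Map each whitespace-separated token to the positions of events containing it."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for token in event.split():
            index[token].add(pos)
    return index


def search_indexed(index, events, token):
    """Answer a query from the index: only matching events are touched."""
    return [events[pos] for pos in sorted(index.get(token, ()))]


def search_scan(events, token):
    """Answer the same query without an index: every event is read."""
    return [event for event in events if token in event.split()]


# Hypothetical sample events, invented for this sketch.
events = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /login 401",
    "10.0.0.1 POST /login 200",
]

# The index must exist before the first query can use it.
index = build_index(events)
hits = search_indexed(index, events, "/login")

# The index is extra data that has to be computed and stored next to the events.
posting_count = sum(len(positions) for positions in index.values())
print(f"{len(hits)} hits, {len(index)} distinct tokens, {posting_count} postings")
```

Both search functions return the same events; the difference is where the work happens: up front and continuously for the indexed variant, at query time for the scan.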
Why SIEMs Are Not Optimal for Historical Analysis
The scope of historical log analysis (pattern discovery, threat hunting, forensic investigation, etc.) is different. Individual tasks usually involve data from a few selected sources over much longer periods, stretching well beyond the visibility of a SIEM. At the same time, the SIEM’s own data sources must also be available for historical analysis.
However, query performance expectations are lower than for near real-time processing and getting results within minutes or tens of minutes is quite acceptable. Also, the amount of data included in individual analysis tasks is often not very different from the total amount of SIEM-indexed data for short-term analysis. The key point to be aware of is that the total amount of data that has to be available for historical analysis is considerably larger.
It is not optimal to accomplish this scalability with indexing. In addition to the complexity costs, indexes eat up a lot of space, often as much as the data itself and sometimes even more. This is not justified for more time-tolerant historical analysis.
Practical Limitations of Scalability
Of course, all the discussion above is just theoretical. In practice, scaling SIEMs to include a multitude of data sources is achievable, depending on data volumes and their retention period. However, as always, there are limitations. Organisations looking to scale a free, open-source solution often only look at the implementation side (cost, know-how, effort) but fail to estimate the maintenance burden and the impact of edge cases. Once they hit the limits of the current setup, it is extremely difficult to find people with the specific skills to dig into historical data, as the technical complexity grows rapidly. There are many discussions on the web on this topic, see for instance here, here and here. Many organisations have also failed.
There is one particular issue faced by those attempting to scale Elasticsearch-based systems: the effect of data inflation. If log events are not already in JSON format, they will be converted to it during the Elastic ETL process. Unfortunately, JSON carries a significant overhead compared to tabular or CSV formats. For instance, 1 GB of standard Apache web server access log will reach a footprint of approximately 2.5 GB after conversion to JSON. This is a problem because, given the complexity of scaling, the larger footprint directly reduces how long data can be retained.
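The source of the overhead is easy to see on a single line. The sketch below parses one made-up Apache access log line with a regular expression and serialises the fields as JSON. The field names are hypothetical and this is not the actual Elastic ETL pipeline, so the ratio it prints should not be expected to match the 2.5x figure above; it only illustrates that every event now carries its field names along with its values.

```python
import json
import re

# A made-up access log line in Apache Common Log Format.
LINE = '192.0.2.10 - - [10/Oct/2019:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Field names are assumptions chosen for this sketch.
PATTERN = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)


def to_json(line):
    """Convert one access log line to a JSON document of named string fields."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable line: {line!r}")
    return json.dumps(match.groupdict())


raw_size = len(LINE.encode("utf-8"))
json_size = len(to_json(LINE).encode("utf-8"))
ratio = json_size / raw_size

print(f"raw: {raw_size} bytes, JSON: {json_size} bytes, ratio: {ratio:.2f}")
```

In a real deployment the footprint also depends on field mappings, added metadata and the index structures themselves, none of which this sketch models.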
Many commercial SIEMs compress data for internal storage. This is a favorable approach, as the same 1 GB of Apache access log will shrink to ~200…~250 MB after compressing with gzip. Also, the complexity of scaling is better documented and support is available, provided you have the funds to cover the costs. Pricing is a major limiting factor for many commercial SIEMs.
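The effect of compression can be tried out with Python's standard gzip module. The sketch below generates synthetic access log lines (the paths, address range and value distributions are invented) and compresses them. Synthetic data is far more repetitive than real traffic, so the ratio it prints will likely look better than the figures quoted above; measuring on a sample of your own logs is the only reliable estimate.

```python
import gzip
import random

# Invented request paths for the synthetic log.
PATHS = ["/", "/index.html", "/login", "/api/items", "/static/app.js"]


def synthetic_line(rng):
    """Build one made-up access log line in Apache Common Log Format."""
    ip = f"192.0.2.{rng.randint(1, 254)}"
    path = rng.choice(PATHS)
    status = rng.choice([200, 200, 200, 304, 404, 500])
    size = rng.randint(200, 50000)
    second = rng.randint(0, 59)
    return (
        f'{ip} - - [10/Oct/2019:13:55:{second:02d} +0000] '
        f'"GET {path} HTTP/1.1" {status} {size}'
    )


rng = random.Random(0)  # fixed seed so repeated runs produce the same log
raw = "\n".join(synthetic_line(rng) for _ in range(10_000)).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, "
      f"ratio: {len(compressed) / len(raw):.2f}")
```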
What To Expect From a Decent Historical Analysis Tool?
Historical analysis encompasses many practical tasks, such as forensic analysis, data discovery, threat hunting, etc. These are the basic means of establishing ground truth when identifying malicious behaviour from logs. The focus is on discovering threats, patterns, indicators, etc. and the way it’s being done is usually exploratory, branching on different paths with intermediate results. In practice it means proceeding through multiple steps, combining different source data and other intermediate results.
All of this is quite different from a SIEM’s main goal of detecting threats based on known patterns. Hence SIEMs provide only the minimum necessary for data discovery: simple search, filtering, and basic descriptive statistics of extracted fields.
To support an exploratory process efficiently, the analytics tool should provide the following critical features:
- Flexible search/analytics language to retrieve and manipulate data. Since the subjects of analysis (threats, patterns, malicious acts, etc.) are complex processes, step-by-step exploration should be backed by flexibility and power similar to SQL. The simple search provided by most SIEMs, built on a graphical user interface with drop-down lists and checkboxes, is not enough.
- Rich library of functions to analyse data: aggregation, maths, basic descriptive statistics, string manipulation, basic encoding-decoding, basic cryptography, etc. NB! Aggregation, statistical and maths functions must be big-data scalable (i.e. provide mathematically correct results over the entire dataset being queried).
- Provide means for extending the function library (allow user-defined functions). No built-in library will cover every case life throws at you. When a gap appears, there’s no time to wait for the vendor to update the library.
- Provide an API for using external 3rd party tools. Again, real life is far richer than any single vendor can cover. Flexibility to address the unexpected in your data often marks the line between success and failure.
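As a rough illustration of the first three points, here is a sketch that uses SQLite from Python's standard library as a stand-in for an analytics engine; it is not SpectX or any SIEM query language. The table, the sample DNS queries and the entropy function are all invented for the example. It registers a user-defined function, stores an intermediate result, and then branches from that result with a second query.

```python
import math
import sqlite3
from collections import Counter


def shannon_entropy(text):
    """Bits per character; high values can hint at random-looking strings."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


conn = sqlite3.connect(":memory:")

# Extend the built-in function library with a user-defined function.
conn.create_function("entropy", 1, shannon_entropy)

# Hypothetical DNS log, invented for this sketch.
conn.execute("CREATE TABLE dns_log (client TEXT, query TEXT)")
conn.executemany(
    "INSERT INTO dns_log VALUES (?, ?)",
    [
        ("10.0.0.5", "www.example.com"),
        ("10.0.0.5", "mail.example.com"),
        ("10.0.0.9", "x7f3k9q2zl8v1m4b.example.net"),
        ("10.0.0.9", "p0w8e6r2t5y1u3i9.example.net"),
    ],
)

# Step 1: keep an intermediate result that later steps can build on.
conn.execute(
    "CREATE TEMP TABLE scored AS "
    "SELECT client, query, entropy(query) AS score FROM dns_log"
)

# Step 2: branch from the intermediate result with a different question.
rows = conn.execute(
    "SELECT client, COUNT(*) AS queries, AVG(score) AS avg_score "
    "FROM scored GROUP BY client ORDER BY avg_score DESC"
).fetchall()

for client, queries, avg_score in rows:
    print(client, queries, round(avg_score, 2))
```

The same pattern (score, store, branch) is hard to express with drop-down lists and checkboxes, which is the point the list above is making.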
So far we’ve talked about differences in scalability and in the nature of the analysis process. Clearly there are many details setting short-term and long-term analysis apart. However, the most crucial differences emerge from the way data is treated. This is what I’m going to discuss in the next part of the article.