A glimpse of history. Photo by Sergiu Vălenaș on Unsplash

I was recently surfing around Anton Chuvakin’s posts on SIEMs and became particularly restless about one particular requirement “tearing them apart”: real time vs historical analysis. His post from 2014 on the subject gives an excellent overview of the antagonism between these two (scroll down to the table!).

As I’ve been thinking about this a lot during my infosec career and when building SpectX, I decided to elaborate on that list. What’s truly nagging me is this: has nothing improved since 2014 with regard to these differences, while the volumes of data have been growing a lot? Is the antagonistic nature of historical and near-term analysis technically difficult for SIEMs? I think it is, and the best way to explain why is to elaborate on the items in Chuvakin’s comparison list and add a few of my own.

SIEMs and Scalability Model of Near Real-Time Analysis

The primary mission of a SIEM has remained the same since their dawn: detection of known threats in (near) real-time (although there has been considerable development in other functionalities). The real-time nature also implies a relatively constrained data scope: you need real-time incoming events and a certain amount of recent data for detection rules/models to work. However, the system must scale over a large number of input data sources.

Figure 1: Analysis scope of near real-time analysis

From a technical standpoint, the key to good performance is indexing. Computing metadata speeds up search and makes it possible to meet performance requirements measured in seconds (or minutes). However, all the data within the scope has to be indexed in order to guarantee timely responses.

Indexing comes with limitations and a cost in complexity:
Data has to be indexed before it can be submitted to queries, which inherently means that the architecture has to scale both in terms of processing AND storage.
Index computation has to handle fluctuations (peaks) of incoming events.
Query processing has to compete with the computation of indexes.

Why SIEMs are not optimal for historical analysis

The scope of historical log analysis (pattern discovery, threat hunting, forensic investigation, etc.) is different. Individual tasks usually involve data from a few selected sources over much longer periods, stretching well beyond the visibility of a SIEM. At the same time, the set of SIEM’s data sources must also be available for historical analysis.

Figure 2: Analysis scope of a historical analysis. Different analysis tasks involve different source data scopes.

However, query performance expectations are lower than for near real-time processing, and getting results within minutes or tens of minutes is quite acceptable. Also, the amount of data included in individual analysis tasks is often not very different from the total amount of SIEM-indexed data for short-term analysis. The key point to be aware of is that the total amount of data that has to be available for historical analysis is considerably larger.

It is not optimal to accomplish this scalability with indexing. In addition to the complexity costs, indexes eat up a lot of space, often as much as the data itself and sometimes even more. This is not justified for more time-tolerant historical analysis.

Practical Limitations of Scalability

Of course, all the discussion above is just theoretical. In practice, scaling SIEMs to include a multitude of data sources is achievable, depending on data volumes and their retention period. However, as always, there are limitations.

Organisations looking to scale a free, open source solution often only look at the implementation side (cost, know-how, effort) but fail to estimate the maintenance and the impact of edge cases. Once they meet the limits of the current setup, it is extremely difficult to find people with the specific skills to dig into historical data, as the technical complexity grows exponentially.
There are many discussions on the web on this topic, see for instance here, here and here. Many organizations have also failed.

There is one particular issue faced by those attempting to scale Elasticsearch-based systems: the effect of data inflation. If log events are not already in JSON format, they will be converted to it during the Elastic ETL process. Unfortunately, JSON carries a significant overhead compared to tabulated or CSV formats. For instance, 1 GB of standard Apache web server access log will reach a footprint of approximately 2.5 GB after conversion to JSON. This is a problem because, given the complexity of scaling, the larger footprint translates directly into a shorter affordable storage period.

Many commercial SIEMs compress data for internal storage. This is a favorable approach, as the same 1 GB of Apache access log will shrink to roughly 200 to 250 MB after compressing with gzip. Also, the complexity of scaling is better documented and support is available, provided you have the funds to cover the costs. Pricing is a major limiting factor for many commercial SIEMs.

What To Expect From a Decent Historical Analysis Tool?

Historical analysis encompasses many practical tasks, such as forensic analysis, data discovery, threat hunting, etc. These are the basic means of establishing ground truth when identifying malicious behaviour from logs. The focus is on discovering threats, patterns, indicators, etc., and the way it’s being done is usually exploratory, branching on different paths with intermediate results. In practice it means proceeding through multiple steps, combining different source data and other intermediate results.

All of this is quite different from SIEM’s main goal of detecting threats based on known patterns. Hence SIEMs provide only the minimum necessary for data discovery: simple search, filtering, basic descriptive statistics of extracted fields.
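The exploratory, multi-step process described above can be illustrated with a small sketch. It is not taken from any particular SIEM or analysis product: the table names (`web_access`, `vpn_logins`, `suspects`), the sample records and the threshold are all hypothetical, and Python's built-in `sqlite3` module merely stands in for whatever SQL-like engine is at hand. Step 1 stores an intermediate result, and step 2 combines it with a second data source.

```python
import sqlite3

# Hypothetical, hand-made sample data standing in for two log sources.
WEB_ACCESS = [
    # (timestamp, client_ip, path, status)
    ("2019-03-01T10:00:01", "10.0.0.5", "/index.html", 200),
    ("2019-03-01T10:00:02", "203.0.113.7", "/admin.php", 404),
    ("2019-03-01T10:00:03", "203.0.113.7", "/wp-login.php", 404),
    ("2019-03-01T10:00:04", "203.0.113.7", "/phpmyadmin/", 404),
    ("2019-03-01T10:00:05", "198.51.100.9", "/missing.png", 404),
    ("2019-03-01T10:00:06", "10.0.0.5", "/about.html", 200),
]
VPN_LOGINS = [
    # (timestamp, client_ip, username)
    ("2019-02-27T08:15:00", "203.0.113.7", "alice"),
    ("2019-02-28T09:30:00", "10.0.0.5", "bob"),
]


def explore(min_404s=3):
    """Two-step exploration: find probing clients, then look them up elsewhere."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE web_access (ts TEXT, client_ip TEXT, path TEXT, status INTEGER)"
    )
    con.execute("CREATE TABLE vpn_logins (ts TEXT, client_ip TEXT, username TEXT)")
    con.executemany("INSERT INTO web_access VALUES (?, ?, ?, ?)", WEB_ACCESS)
    con.executemany("INSERT INTO vpn_logins VALUES (?, ?, ?)", VPN_LOGINS)

    # Step 1: keep an intermediate result, the clients requesting many
    # non-existent pages (at least min_404s responses with status 404).
    con.execute("CREATE TEMP TABLE suspects (client_ip TEXT, not_found INTEGER)")
    con.execute(
        """
        INSERT INTO suspects
        SELECT client_ip, COUNT(*)
        FROM web_access
        WHERE status = 404
        GROUP BY client_ip
        HAVING COUNT(*) >= ?
        """,
        (min_404s,),
    )

    # Step 2: combine the intermediate result with a different source to see
    # whether any known account has used the same address before.
    rows = con.execute(
        """
        SELECT s.client_ip, s.not_found, v.username
        FROM suspects AS s
        LEFT JOIN vpn_logins AS v ON v.client_ip = s.client_ip
        ORDER BY s.client_ip
        """
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in explore():
        print(row)
```

Lowering `min_404s` and re-running is the kind of branching the text refers to: each step leaves behind a result that the next question can build on, which is hard to do with a search box and drop-down filters alone.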
To support the exploratory process efficiently, the analytics tool should provide the following critical features:

Flexible search/analytics language to retrieve and manipulate data. Since analysis subjects (threats, patterns, malicious acts, etc.) are complex processes, step-by-step exploration should be backed by flexibility and power similar to SQL. The simple search provided by most SIEMs, using a graphical user interface, drop-down lists and checkboxes, is not enough.

Rich library of functions to analyse data: aggregation, maths, basic descriptive statistics, string manipulations, basic encoding-decoding, basic cryptography, etc. NB! Aggregation, statistical and maths functions must be big-data scalable (i.e. provide mathematically correct results over the entire dataset being queried).

Provide means for extending function library (allow user defined functions). No built-in library will cover the need for every case life throws at you. When an uncovered case comes up, there’s no time to wait for the library to be updated by the vendor.

Provide an API for using external 3rd party tools. Again, real life is far richer than any single vendor can cover. Flexibility to address the unexpected in your data often marks the line between success and failure.

So far we’ve talked about differences in scalability and in the nature of the analysis process. Clearly there are many details setting short and long term analysis apart. However, the most crucial controversies emerge from the way data is treated. This is what I’m going to discuss in the next part of the article.
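One requirement from the feature list above is easy to get wrong in practice: aggregation functions must return mathematically correct results over the entire dataset, even when the data is processed in chunks (for example, one log file at a time). The following is a minimal sketch under assumed conditions, not a description of any product's internals. The names `Partial`, `global_mean` and `naive_mean_of_means`, as well as the sample values, are hypothetical. It contrasts averaging per-chunk averages with merging per-chunk (count, sum) pairs.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable partial aggregate: enough state to compute an exact mean."""
    count: int = 0
    total: float = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    def merge(self, other):
        # Combining two partials gives the same state as processing
        # both chunks in a single pass.
        return Partial(self.count + other.count, self.total + other.total)

    def mean(self):
        return self.total / self.count


def chunk_partial(values):
    """Aggregate one chunk (e.g. one log file) independently of the others."""
    partial = Partial()
    for value in values:
        partial.add(value)
    return partial


def global_mean(chunks):
    """Correct mean over all chunks, obtained by merging per-chunk partials."""
    merged = Partial()
    for chunk in chunks:
        merged = merged.merge(chunk_partial(chunk))
    return merged.mean()


def naive_mean_of_means(chunks):
    """Incorrect shortcut: averages the per-chunk means, ignoring chunk sizes."""
    means = [sum(chunk) / len(chunk) for chunk in chunks]
    return sum(means) / len(means)


# Hypothetical response sizes in bytes, split unevenly across two chunks.
CHUNKS = [[100, 200, 300], [1000]]

if __name__ == "__main__":
    print("merged partials:", global_mean(CHUNKS))
    print("mean of means:  ", naive_mean_of_means(CHUNKS))
```

The two results differ whenever chunks have different sizes, which is the normal situation with log files. Keeping mergeable state per chunk is one common way to stay correct while still processing chunks independently.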

Historical Log Analysis and SIEM Limitations: Part II
