What You Must Know About Machine Learning Malware Analysis

This story was originally published on CCSI’s blog. Enjoy!

We are in the post-signature era of antimalware software. Attackers are driven by profit, and by a lust for power. About a decade ago, malware researchers determined that the number of malicious files in the computing collective doubled every two years, a pattern reminiscent of Moore’s Law. Now, the rate of malware growth is probably even greater. Malware deployers aren’t only script kiddies who buy executables and crypters on the Dark Web. They’re also national militaries… Stuxnet anyone?

Antivirus Signatures

I believe that antivirus signatures still have an important role to play. I’m reminded of my days as a teenage retail cashier in Canada. Organized crime has the infrastructure and know-how to make counterfeit $50 and $100 bills that’d fool most trained cashiers and bank tellers. It’s profitable to counterfeit a $50 bill even if it costs you $40 to do so. But there’s no way for that sort of effort to make counterfeiting $5 bills profitable. So organized crime usually won’t bother counterfeiting $5 bills. Counterfeit $5 bills are more obvious because they’re cheaply done, possibly by desperate people who aren’t connected to organized crime. I saw a number of counterfeit $5 bills as a teenager. Counterfeiting a $5 bill is a cheap and easy crime, and it’s still commonly done. The same goes for the sort of malware that can be easily detected by signatures. Client PCs, mobile devices, and datacenters alike probably encounter signature-detectable malware constantly. It’s important for antimalware software to employ signatures just to keep a lot of the cheap rubbish out.

But more and more malware comes from sophisticated developers who know how to evade signature detection. All good antimalware software these days must employ some sort of heuristic algorithm. Good heuristics can help stop zero-day attacks, which by definition have no signature yet. One promising sort of heuristic technology is machine learning malware analysis, which is becoming something of a buzzword in the cybersecurity realm. What is it?

Machine Learning Malware Analysis

The sort of machine learning that’s found in a lot of antimalware software tries to learn which files are malicious and which are benign based on databases of both malicious and benign code. The AI involved tries to decide whether or not analyzed code is harmful based on a series of traits. Some traits may carry more weight than others. So code that’s ultimately judged to be benign might still have some traits that the software considers to be a possible indication of malware. Malware is evolving rapidly, so the algorithms must evolve rapidly as well. It’s a constant, ongoing process.
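
To make the idea of weighted traits a bit more concrete, here’s a minimal sketch in Python. Everything in it is an assumption for illustration: the trait names, the weights, the bias, and the 0.5 threshold are all made up, and a real product would learn its weights from huge labeled sample sets rather than have them typed in by hand. It isn’t any vendor’s actual method, just one simple way (a logistic score) that traits of different importance could be combined into a single verdict.

```python
import math

# Hypothetical trait weights, invented for this example.
# Positive values push a file toward "malicious", negative toward "benign".
# In a real system these would be learned from labeled samples.
TRAIT_WEIGHTS = {
    "packed_executable": 1.5,
    "writes_to_startup_folder": 2.0,
    "disables_security_tools": 3.0,
    "valid_code_signature": -2.5,
}

# Hypothetical baseline: with no traits observed, lean toward "benign".
BIAS = -2.0


def malicious_probability(traits):
    """Combine the observed traits into a score between 0 and 1."""
    score = BIAS + sum(TRAIT_WEIGHTS.get(trait, 0.0) for trait in traits)
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


def classify(traits, threshold=0.5):
    """Label a file based on its traits and an assumed decision threshold."""
    if malicious_probability(traits) >= threshold:
        return "malicious"
    return "benign"


if __name__ == "__main__":
    # A signed installer that happens to be packed: one suspicious trait,
    # but the overall verdict is still benign.
    print(classify({"packed_executable", "valid_code_signature"}))

    # An unsigned file that persists at startup and disables security tools.
    print(classify({"writes_to_startup_folder", "disables_security_tools"}))
```

Notice that the first file carries a trait the model treats as suspicious, yet it’s still classified as benign because a stronger trait outweighs it. Keeping up with evolving malware, in this picture, means continually relearning the weights as new samples arrive.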

Machine learning malware analysis has been researched and deployed for many years now. We hear about machine learning a lot more frequently these days because effective technology is much cheaper now. Machine learning antimalware software can’t be client-driven, because a single client PC or mobile device is exposed to a much smaller, more limited sample of malware. Proper machine learning requires Big Data processing and cloud-based systems. Cloud servers are much cheaper and more available now, so machine learning malware analysis is more accessible than ever, from the enterprise to your mother’s smartphone.

Some antimalware software vendors claim that their heuristic technology is superior to machine learning techniques at detecting zero-day attacks and signature-evading malware. For example, SIEM vendor TaaSera’s NetTrust is advertised to use their proprietary network behavioral analytics instead of machine learning. How to deploy machine learning effectively, and whether or not machine learning is the best sort of heuristic malware detection technology, are controversial topics these days. It may take years of running different types of heuristic malware detection technology “in the wild” before there’s greater clarity in cybersecurity research.