MachineLearning/Anomalous Detections and Explanation


miekiemoes

The post below gives some more insight into our MachineLearning/Anomalous detections.

If the background info doesn't interest you and you (as a developer) just want to know how to avoid false positives, please skip to the last section: False positives

 

Machine Learning Demystified:  Anomaly Detection at Malwarebytes

Machine learning and artificial intelligence (AI) are buzzwords you hear all the time now in technology, media, and the news.  They’ve been applied to tackle problems ranging from voice recognition to cancer diagnosis to, of course, malware detection.  Companies that do machine learning often make it sound perfect, like magic.  But in truth, there’s no magic involved:  it's just a tool like any other, with both strengths and weaknesses.

In this post we’ll demystify machine learning.  We’ll explain what it is, how it’s applied in anti-malware, why we at Malwarebytes use a different approach than some other companies, and what we do instead.

 

What is machine learning?

Machine learning recognizes patterns in new data based upon learning those patterns from existing “training” data.  For example, you can “teach” a machine learning model to recognize pictures of dogs by “training” it with lots of pictures both of dogs and things-that-aren’t-dogs.  The model “learns” the difference between a dog and a not-dog by “learning” the subtle patterns of identifying markers or “features” that generally characterize pictures of dogs.
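
To make that concrete, here is a minimal, purely illustrative sketch in Python using scikit-learn.  The two numeric "features" and the tiny dataset are invented for the example; a real image model would learn far richer features:

```python
# Purely illustrative: a toy "dog vs. not-dog" classifier.
# The features (say, ear length and snout length) are made up for the example.
from sklearn.ensemble import RandomForestClassifier

# Training data: each row is one picture, reduced to two numeric features.
X_train = [
    [7.0, 9.5],  # dog
    [6.5, 8.0],  # dog
    [1.0, 2.0],  # not a dog
    [0.5, 1.5],  # not a dog
]
y_train = [1, 1, 0, 0]  # labels: 1 = dog, 0 = not-dog

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The model applies the learned patterns to a new, unseen example.
print(model.predict([[6.8, 9.0]]))  # -> [1], i.e. "dog"
```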

 

How is machine learning used in anti-malware?

Our job as an anti-malware company is to distinguish between “goodware” and malware files (typically apps or programs).  We want to allow the goodware to run freely on your computer, and block the malware.  So it’s natural to imagine using a machine learning model to do this:  just like it can learn to classify pictures as dogs or not, you might imagine training a machine learning model to classify an app as malware or not.  This is not a new idea in the security industry:  every antivirus company has used it to varying degrees since the early 1990s.
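
As a sketch of how that classification idea might look in code (the two features below, file size and byte entropy, are simple stand-ins we chose for illustration; they are not the features any real engine uses, and the file paths are hypothetical):

```python
# Purely illustrative: supervised classification applied to files.
# File size and byte entropy are stand-in features chosen for the example;
# they are NOT the features a real anti-malware engine uses.
import math
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0-8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def features(path: str) -> list:
    with open(path, "rb") as f:
        data = f.read()
    return [float(len(data)), byte_entropy(data)]

# Hypothetical labeled corpus: (path, label) with 0 = goodware, 1 = malware.
corpus = [("good1.exe", 0), ("good2.exe", 0), ("mal1.bin", 1), ("mal2.bin", 1)]
X = [features(path) for path, label in corpus]
y = [label for path, label in corpus]

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict([features("unknown.exe")]))  # -> [0] or [1]
```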

 

So why hasn’t machine learning put malware authors out of business yet?

In order to make a traditional classification-based machine learning model, you need two things.  First, you need new data to “look like” existing data, so that any patterns found in existing data will still apply in the same way to new data.  If new dogs look completely different from previous dogs, then it’ll be pretty tough to teach a model (or a human being!) to recognize what future dogs will look like.  Second, you need to have seen enough of the possible patterns in all possible data to be able to recognize and generalize well to new cases.  If there are particular kinds of dogs out there that you’ve never seen before, it might be tough to teach a model (or a human being!) to recognize what a dog looks like in general.

Both of these requirements are problems for anti-malware machine learning.  New files often look very different from existing files, and the number of possible files there could ever be is so giant that it’s just not possible, even in theory, to have seen a representative set of them.

The result is that companies are forced to make machine learning models based on a truly tiny fraction of the possible files that could be out there.  That means we shouldn’t have a lot of confidence that the models will accurately classify new files that are different from the ones they have seen before.  When you combine that with the fact that the files we want to classify are malware, written by human authors who are trying actively to evade detection by the models, things look pretty grim for the classification-based approach to machine learning in anti-malware.

Many companies still try it anyway.  And they have many of the problems you would expect:  they can have a lot of false positives on goodware files that are very different from the files in their training sets, and a lot of false negatives on malware files that are very different from the files in their training sets.

 

What does Malwarebytes do instead?

We at Malwarebytes chose a very different approach to machine learning, called “anomaly detection”.  Instead of trying to “learn the differences” in general between all goodware and all malware, anomaly detection tries to quantify “how similar to a training set of goodware” a particular file looks.  Instead of trying to classify a picture as dog-or-not-dog, an anomaly detection model would score each picture as “this looks 85% similar to the dogs I’ve seen before”.  If the score is small enough, perhaps 1-3%, the model would say “there is only a 1-3% chance this is a dog, based on my prior knowledge of the dogs I’ve seen before.  Therefore, this is either a really weird-looking new kind of dog, or not a dog at all.”

At first glance this approach seems similar to the classification approach described above.  But it has a crucial difference:  it only has to be trained using goodware files; it doesn’t need to be trained using malware.  Unsurprisingly, goodware ends up being more self-similar than malware and changes more slowly over time, so anomaly detection models end up being both more robust and longer-lived than classification models.
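
A minimal sketch of the difference, again in Python with scikit-learn (IsolationForest here is a stand-in anomaly detector, and the size/entropy feature vectors are made up, as in the earlier sketch; our production models are of course different):

```python
# Purely illustrative: anomaly detection trained on goodware ONLY.
# IsolationForest is a stand-in; the feature vectors are made up.
from sklearn.ensemble import IsolationForest

# Feature vectors (file size, byte entropy) from known-good files only.
X_goodware = [
    [120_000.0, 6.1],
    [310_000.0, 5.8],
    [95_000.0, 6.3],
    [500_000.0, 5.9],
]

detector = IsolationForest(random_state=0).fit(X_goodware)

# A packed file tends toward the 8.0 bits/byte entropy maximum, making it
# look unlike anything in the goodware training set.
suspect = [[240_000.0, 7.97]]
print(detector.predict(suspect))        # -1 = anomalous, 1 = looks like goodware
print(detector.score_samples(suspect))  # lower score = more anomalous
```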

Most crucially of all, we find that malware files tend to be anomalous according to our model, because of their use of obfuscation techniques designed to evade traditional antiviruses, meaning that we can use our anomaly detector as a malware detector.  That is exactly the technology in use in our products today.

 

What are the weaknesses?

We said right up front that machine learning isn’t magic, it’s a tool like any other with both strengths and weaknesses.  The strengths of anomaly detection are clear:  true 0-hour generic malware detection that is robust and long-lived, and detects a large swath of the 0-hour malware landscape.  But what are the weaknesses?

The primary weakness of anomaly detection is that it can only be used to detect “anomalous-looking” malware.  If you write the world’s cleanest, simplest, non-obfuscated keylogger, it is likely not to be detected by our anomaly detector.  Our researchers typically find that about 50-80% of zero-hour malware is anomalous-looking, depending on the specific malware tested and the specific anomaly model used.  That’s not bad, and it’s better than any classification model we’ve tested, but it’s certainly not a silver bullet either, and would not be sufficient protection on its own.

However, that’s OK with us because we believe strongly that no single layer of protection is ever sufficient on its own, and so Malwarebytes layers together vector defense (our anti-exploit layers), behavioral protection (our anti-ransomware and application behavioral protection layers), website protection, and our anti-malware heuristic engine, alongside this new anomaly detection layer.  This combination of layers provides more comprehensive protection against the 0-hour malware landscape than if we had one layer alone, as so many machine-learning-based anti-malware vendors do.

 

False positives

False positives are a reality in the anti-malware industry, and our anomaly detection models are no exception.  We do our best to keep them to a minimum.  Malwarebytes’ anomaly detection engine scans more than 3 million unique files per day, and we receive false positive reports on about 0.0001% of them (roughly 3 files out of every 3 million scanned).  If your software is one of the unlucky few, we apologize for the inconvenience.

Detections by our anomaly detection engine are identified as "anomalous" files, not as "malware".  Typically a false positive arises when a piece of legitimate software has never been seen before across our entire userbase of tens of millions of users, and was written using techniques or tools commonly used by malware, such as very old versions of Visual Basic or executable packers and obfuscators, or lacks a valid digital signature.  It is not surprising that our models often consider such files to look anomalous.
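
If you are curious whether your own binary might look packed, a rough self-check is to measure its byte entropy.  This is an illustrative heuristic, not the actual test our engine performs, and the 7.2 cutoff below is a guess for the example:

```python
# Rough self-check, not our engine's actual logic: packed or encrypted
# executables usually have byte entropy close to the 8.0 bits/byte maximum.
import math
import sys
from collections import Counter

def byte_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

path = sys.argv[1]  # usage: python entropy_check.py your_app.exe
with open(path, "rb") as f:
    entropy = byte_entropy(f.read())

print(f"{path}: {entropy:.2f} bits/byte")
if entropy > 7.2:  # illustrative threshold, not an official cutoff
    print("High entropy: this file may look packed/obfuscated to heuristic scanners.")
```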

We would encourage all software developers to avoid packing or obfuscating their code after compilation, to use consistent Version Information, and to digitally sign their code to guarantee its integrity.  Signing in particular has been a best practice in the software industry for decades, and offers users a guarantee that an app has not been tampered with.  There are high-profile examples of apps that have been tampered with to incorporate malware.  We would not want unsigned software on our machines in 2018, and we suspect most of our users wouldn’t either.

As a last resort, if you are unable or unwilling to take these steps, please contact us through our forums here: https://forums.malwarebytes.com/forum/42-file-detections/ (start a new thread) with examples of the files being detected as anomalous, and we will add them to our database of known good apps.

Also, if you are a developer, we suggest excluding your working/build directory from scanning via the exclusion settings in Malwarebytes while you are building your application, since our anomaly detection might otherwise flag some of the intermediate files you are building.
Once the application/project is final and ready to be shared with others, in most cases it will no longer be detected, since it will no longer look "anomalous" either.

In case a "final project" is still detected, please let us know (and include the sample) so we can add it to our database of known good apps as well and prevent this in the future.
