Here’s a transcript of the podcast for your convenience.
I am Ankur Tyagi from Qualys. I am the senior malware research engineer over here, and in this podcast for Help Net Security, I will be talking about visual network and file forensics. In recent times, as you might know, there has been an influx of a large volume of samples, and people really need an effective and fast way to classify them.
The typical approach is to use a machine learning system to classify malware, but the problem is that you first need to spend a lot of time training that system and creating a good model out of clean traffic. And if you are building the model from bad traffic, you first need to feed a lot of bad samples into it. Going forward, the model can then be used to filter what is bad from what is good.
The problem here is that, if something has not been seen by the model before, the model will not be good at detecting the new variants or the new attacks that are coming in. The best approach would be to use a quick filter to reduce the volume and then use a typical machine learning model. To use such a filter, you would first of all need to focus on the structural patterns of the file, so the visual approach that we are focusing on here is about highlighting the patterns within a file. We would like to classify a file without actually opening it up, without submitting it to a CPU-intensive or performance-intensive analysis tool chain.
Structural analysis helps you to be immune to byte-level modifications. For example, say you have malware which has been packed with an unknown packer. Unless and until you unpack the file, you will not be able to understand what exactly the file is trying to do.
The first problem here is to unpack the file, and for new packers that are out there, you would need to invest some manual hours into understanding how exactly the unpacking has to be done. Rather than doing that, if you can visually highlight the patterns within the file, even the packed file, you would be able to understand that the content looks similar to a packed file. This insight will help you to use manual analysis rather than submitting this file to a CPU-intensive tool chain. Now this might save you only a few CPU cycles per file, but over a large number of files it can save you a lot of hours, a lot of time that might otherwise be wasted analyzing files that will eventually not give you any insight.
The primary focus of visual structuring is to use techniques like entropy and n-gram visualization, compression ratio statistics and theoretical minimum size, and to use these statistical metrics to classify a file as good or bad. The primary emphasis again is on quickly correlating attributes and identifying patterns. The conclusion might not be as accurate as that of a machine learning system, but it will definitely be orders of magnitude faster. So this can be a first-level pre-filter for sure. It can also help you to reduce the noise and highlight the significant behavior. You can probably use some heuristics after the analysis is complete, and the structural pattern identification logic can then complement the overall analysis process.
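As a rough illustration of the statistical metrics named above, here is a minimal Python sketch. It is not taken from the framework discussed in this podcast. The function names, the use of zlib as the compressor, the definition of compression ratio as compressed size divided by original size, and the definition of theoretical minimum size as entropy times length divided by eight are all assumptions made for this example.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over every byte value that occurs in the data.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def file_metrics(data: bytes) -> dict:
    """Compute a few cheap, purely static metrics for a blob of bytes.

    Assumed definitions (illustrative only):
      compression_ratio    = zlib-compressed size / original size
      theoretical_min_size = entropy * size / 8, in bytes
    """
    entropy = shannon_entropy(data)
    compressed_size = len(zlib.compress(data, 9))
    return {
        "size": len(data),
        "entropy": entropy,
        "compression_ratio": compressed_size / len(data) if data else 0.0,
        "theoretical_min_size": entropy * len(data) / 8,
    }
```

Packed or encrypted content tends to show entropy close to 8 bits per byte and a compression ratio close to (or above) 1, while plain text and sparse data score much lower on both. Any thresholds built on these numbers are a judgment call for the person using them.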
If you can identify a file structurally, you can then use appropriate analysis engines to inspect the content of the file. So rather than using a blind approach towards analyzing any and every kind of file, you can now have an intelligent way of understanding and unpacking the file via this structural pattern identification.
Users can also use the visual analysis approach to analyze intrusion artifacts. For example, say you come across an intrusion attempt and you have some residue, some remnant files that you would like to analyze. You can submit those, because you don’t know what these files are: they might be unknown file types, they might be packed in some way, or they might be encrypted in some way and you don’t have the keys.
Now, you can first of all verify what kind of compression or encryption mechanism was used, whether the file is compressed or encrypted in the first place, or whether it is just a file format that you are seeing for the first time. So, if a visual technique is used, it might give you insight into what the file content actually is. We will be talking about a framework that we have created to help users apply these visualization techniques to unknown malware variants. This framework lets you use all of the techniques I mentioned earlier (entropy, n-gram visualization, theoretical minimum size) on unknown malware variants or unknown file formats that you are interested in, or in your forensic activities.
The framework is Python-based. It gives you an API and it is also a command-line tool, so you can use it in your in-house automation tools or, if you prefer, as a standalone tool. Once you feed it some files, it will give you a report. It’s a JSON report, and you can feed that report into your log analytics engine or into your own tool chain to gather whatever insights the tool has generated for you.
The output would basically be the classification metrics. The framework will not tell you whether the file is good or bad. That decision has to be the user’s. The framework will not make it for you, because it has seen the file as a single entity and it is not using a model to compare it against good or bad samples. It will look at the file and, based on the visual heuristics that it has, it will give you some numbers. These numbers can be used in your own analysis engine, or they can be used as an input to your heuristics engine or your scoring engine. Based on the numbers, you can decide whether they are sufficient to classify the file as bad, or if they are not enough, you can use additional tools to increase or decrease the score and to understand what exactly the score means.
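To make the idea of feeding those numbers into your own scoring engine concrete, here is a small hypothetical sketch in Python. The JSON field names (`entropy`, `compression_ratio`), the thresholds and the weights are invented for illustration. They are not the framework's actual report schema and would need to be adapted to whatever your report really contains.

```python
import json

# Hypothetical thresholds and weights, chosen only for illustration.
HIGH_ENTROPY = 7.0          # bits per byte
POOR_COMPRESSION = 0.9      # compressed size / original size
ESCALATE_AT = 50            # score at which a file goes to deeper analysis


def score_report(report_json: str) -> int:
    """Turn per-file metrics from a JSON report into a simple suspicion score.

    The field names used here are assumptions, not a documented schema.
    """
    report = json.loads(report_json)
    score = 0
    if report.get("entropy", 0.0) > HIGH_ENTROPY:
        score += 40  # content looks packed, compressed or encrypted
    if report.get("compression_ratio", 0.0) > POOR_COMPRESSION:
        score += 30  # content barely compresses any further
    return score


def needs_deeper_analysis(report_json: str) -> bool:
    """Decide whether the file should be passed to heavier tooling."""
    return score_report(report_json) >= ESCALATE_AT
```

The point of the design is that the metrics stay neutral and the policy lives on the user's side: the same numbers can drive a strict threshold in one environment and a lenient one in another.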
The most important fact here is that visual tooling also helps you to classify and identify malware variants. Most of the time, as you might have seen, the highest percentage of malware that we see in recent times is a variant of existing or known malware. The authors are not creating the code for a new malware from scratch. They are basing the new malware upon existing, known malware variants. This basically means that there are always blocks of code which will be repeated, which will be common across a lot of samples, and visual tooling can help you to identify and pinpoint what part of the code is repeated and what part of the code is newly added.
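One simple way to approximate this kind of comparison is to look at byte n-grams shared between two samples. The sketch below is a hedged illustration in Python, not the method the framework uses. The function names and the choice of 4-byte n-grams are assumptions made for this example.

```python
def ngrams(data: bytes, n: int = 4) -> set:
    """Set of all byte n-grams that occur in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Fraction of n-grams shared between two samples (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def new_regions(old: bytes, new: bytes, n: int = 4) -> list:
    """Byte ranges (start, end) of `new` not covered by any n-gram seen in `old`."""
    seen = ngrams(old, n)
    covered = [False] * len(new)
    for i in range(len(new) - n + 1):
        if new[i:i + n] in seen:
            # Every byte inside a known n-gram counts as repeated content.
            for j in range(i, i + n):
                covered[j] = True
    regions, start = [], None
    for i, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = i
        elif is_covered and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(new)))
    return regions
```

For example, `new_regions(b"HEADERCODEBLOCK", b"HEADERXYZWCODEBLOCK")` returns `[(6, 10)]`, which is the range holding the inserted `XYZW` bytes. On real samples a larger `n` would cut down on accidental matches.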
Now, this is a very useful insight because it can help you understand how the malware family is evolving, and it can also tell you what kind of features you can expect in the future. Based on the visual tooling, you can tell whether a particular malware family is adding a specific set of crypto functions. Or, if you find that a malware variant or sibling has removed a certain set of crypto implementations, that can probably give you an insight into what exactly the authors are thinking and what their plan of action might be in the near future.
Visual tooling can also help you identify and highlight the prominent sections of a file. This can be useful in case you come across an unknown file format. The most common file format for malware would be PE files, Windows executables, but in case you come across an unknown malware variant and it is something other than a Windows executable, you can use the visual tooling approach to find patterns within the file. This is useful because you don’t have to open up the file and you don’t have to set up an environment to analyze it. The visual tooling approach is completely static: it consumes a file and then visually highlights it. So it is safer, in the sense that there is no dynamic execution of the malware, and you are safe from all the side effects.
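A very small example of statically highlighting sections of a file is a block-by-block entropy map. The Python sketch below is an assumption-laden illustration rather than the framework's own visualization: the block size, the character ramp and the function names are all chosen just for this example.

```python
import math
from collections import Counter


def block_entropies(data: bytes, block_size: int = 256) -> list:
    """Shannon entropy (bits per byte) of each consecutive block of the data."""
    entropies = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        counts = Counter(block)
        entropy = sum(
            (c / len(block)) * math.log2(len(block) / c) for c in counts.values()
        )
        entropies.append(entropy)
    return entropies


def render_entropy_map(entropies: list, ramp: str = " .:-=+*#%@") -> str:
    """Map each block's entropy (0 to 8 bits) to one character, low to high."""
    chars = []
    for entropy in entropies:
        index = min(int(entropy / 8 * len(ramp)), len(ramp) - 1)
        chars.append(ramp[index])
    return "".join(chars)
```

Reading a file with `open(path, "rb").read()` and printing `render_entropy_map(block_entropies(data))` gives a one-line picture in which padding and sparse regions show up as blanks and dense, possibly packed or encrypted regions show up as `@`. The file is only read, never executed.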
That’s the primary goal of visual tooling. It can definitely be useful in cases where you have unknown malware or unknown file formats to work with, and you need a quick filter to understand what exactly the file format is. If you are looking for something which is faster than a typical machine learning approach, if you don’t have the time to invest in creating a full-fledged machine learning model, and you need a quick filter, visual tooling can be a very good option.
I definitely recommend looking at the framework. It is called Rudra and it is available on GitHub. It can be used as a standalone script, like I said, and the API can also be used to complement your in-house automation tools.