Signature detection and generation
Usually when dealing with unknown malware it’s interesting to know whether any packers / protectors were used on it. A seasoned malware analyst can easily spot whether one is present. But even the best analyst can’t name the actual packer / protector for every sample outright. There are some publicly available signature scanners, with PEiD being the most widely known.
But even with PEiD the hit rate lingers somewhere around 50%. I could point you to any number of places where people share signatures for PEiD’s external database; the PEiD forums are one such place. But as the old saying goes:
Give a starving man a fish, and he’ll be back for more.
Teach him to fish on his own, and he won’t bother you again.
With that in mind, let’s get to business. One problem with signatures is telling the scanner where it should look for matches. With PE executables, PEiD offers two possibilities: directly at the entrypoint, or anywhere in the file. It goes without saying that the entrypoint version should be favoured: when scanning a big file for hundreds or even thousands of signatures, you’re going to feel the overhead big time. Sticking to “entrypoint only” signatures makes sense, since the scanner has one specific place to look instead of shuffling through the whole file. I’ll try to discuss both options here, and give you a little glimpse of how I usually generate signatures.
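To make the difference between the two modes concrete, here is a minimal Python sketch of a matcher. It is not PEiD’s code: `parse_signature`, `matches_at` and `scan` are names made up for illustration, `??` is assumed to mark a wildcard byte, and `ep_offset` is assumed to be the file offset of the entrypoint, worked out elsewhere.

```python
from typing import List, Optional

Pattern = List[Optional[int]]  # None stands for a wildcard byte


def parse_signature(text: str) -> Pattern:
    """Turn '70 07 ?? 52' into [0x70, 0x07, None, 0x52]."""
    return [None if tok == "??" else int(tok, 16) for tok in text.split()]


def matches_at(data: bytes, offset: int, pattern: Pattern) -> bool:
    """Compare the pattern against data at one fixed offset."""
    if offset < 0 or offset + len(pattern) > len(data):
        return False
    return all(p is None or data[offset + i] == p
               for i, p in enumerate(pattern))


def scan(data: bytes, pattern: Pattern, ep_offset: int,
         ep_only: bool = True) -> Optional[int]:
    """Return the offset of the match, or None if there is none."""
    if ep_only:
        # One comparison, at one known place.
        return ep_offset if matches_at(data, ep_offset, pattern) else None
    # "Anywhere in file": try every possible offset in turn.
    for offset in range(len(data) - len(pattern) + 1):
        if matches_at(data, offset, pattern):
            return offset
    return None
```

The entrypoint-only branch does a single comparison per signature, while the anywhere branch repeats that comparison for every offset in the file. Multiply that by a few thousand signatures and the overhead becomes obvious.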
Entrypoint only:
Since I’ve got some thousands of samples sitting around waiting for detection, there’s no point in going through them manually. Some time ago I created a little program to help me find similarities in a set of files. All the files I scan are ones that have no signature yet, but are deemed packed by an entropy calculation. Basically the tool takes a snippet of bytes from each entrypoint and collates the snippets into HashMaps. After the scan is finished, it sorts the HashMaps and prints out top-5 lists of entrypoint bytes. One of the results I stumbled onto a few days ago looked a bit like this:
The second biggest entrypointset contains 17 hits:
00a1964730e1629cf1066c119a6f6125: 700755527B01FC5A5DE92D0000007A6852D69210…
023bf53ac04d0883acdff77270bc6eaa: 700755527B01FC5A5DE92D0000007A6852D69210…
03592b7e337105febbb94d9596d41b84: 700755527B01FC5A5DE92D0000007A6852D69210…
… and so on. All 17 hits were exactly the same, which allows a signature to be made. The packer in question is currently unknown, but every sample I collect from now on will be linked to it. I periodically run through various forums to see whether someone knows the packer and has made a signature for it, allowing me to easily change the detection name. I also search forums and such for packers on offer that I haven’t seen before. I download them, pack a few distinctly different files with each and generate a signature based on the results. I also cross-reference the new signature with old ones to see whether I already detect it under another name.
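The tool itself isn’t shown here, so what follows is only a rough Python sketch of the idea, not the actual program. All function names are hypothetical. It assumes the entrypoint bytes have already been pulled out of each sample and keyed by hash. Wildcarding positions that differ between samples is my guess at a reasonable way to build a signature from “distinctly different files”. The entry layout in `format_entry` mirrors the text format commonly seen in shared PEiD signature databases; treat that as an assumption too and check it against a real database.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def shannon_entropy(data: bytes) -> float:
    """Bits per byte, from 0.0 to 8.0; packed data tends toward 8."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())


def group_by_entrypoint(samples: Dict[str, bytes],
                        length: int = 20) -> List[Tuple[bytes, List[str]]]:
    """Bucket sample ids by their first `length` entrypoint bytes,
    biggest bucket first."""
    groups = defaultdict(list)
    for sample_id, ep_bytes in samples.items():
        groups[ep_bytes[:length]].append(sample_id)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


def make_signature(snippets: List[bytes]) -> str:
    """Keep bytes that agree across all snippets, wildcard the rest."""
    if not snippets:
        raise ValueError("need at least one snippet")
    length = min(len(s) for s in snippets)
    tokens = []
    for i in range(length):
        column = {s[i] for s in snippets}
        tokens.append(f"{column.pop():02X}" if len(column) == 1 else "??")
    return " ".join(tokens)


def format_entry(name: str, signature: str, ep_only: bool = True) -> str:
    """Render one database entry (layout is an assumption, see above)."""
    flag = "true" if ep_only else "false"
    return f"[{name}]\nsignature = {signature}\nep_only = {flag}\n"
```

Calling `group_by_entrypoint(samples)[:5]` gives something like the top-5 list above. When every snippet in a bucket is identical, as with the 17 hits, `make_signature` simply returns the bytes with no wildcards.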
The others:
For samples with no entrypoint similarities, most of the work has to be done by hand. The tool I made is of some use, and I can use it to search for similarities in section names and such; I also have more ideas on how to expand it to cut down on manual labour. But to get to the point, there are several places where you can dig for gold when looking for similarities:
- Section characteristics (names, sizes, offsets)
- Beginnings and endings of sections in the file
- Around the IAT (Import Address Table)
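For the first two places on that list, the raw material comes straight out of the PE section table. The sketch below reads it with nothing but Python’s `struct` module. It is a bare-bones illustration rather than a robust parser: the helper names are made up, there is no handling of malformed headers beyond two signature checks, and the IAT is not covered.

```python
import struct
from typing import List, NamedTuple, Optional, Tuple


class Section(NamedTuple):
    name: str
    virtual_address: int
    virtual_size: int
    raw_offset: int
    raw_size: int


def _pe_offset(data: bytes) -> int:
    """Locate the 'PE\\0\\0' signature via e_lfanew at 0x3C."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    (offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[offset:offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return offset


def read_sections(data: bytes) -> List[Section]:
    """Return name, addresses and sizes for every section header."""
    pe = _pe_offset(data)
    (count,) = struct.unpack_from("<H", data, pe + 6)      # NumberOfSections
    (opt_size,) = struct.unpack_from("<H", data, pe + 20)  # SizeOfOptionalHeader
    table = pe + 24 + opt_size
    sections = []
    for i in range(count):
        # Each header is 40 bytes; only the first 24 are needed here.
        name, vsize, vaddr, rsize, roff = struct.unpack_from(
            "<8sIIII", data, table + 40 * i)
        sections.append(Section(
            name.rstrip(b"\0").decode("ascii", "replace"),
            vaddr, vsize, roff, rsize))
    return sections


def entrypoint_offset(data: bytes) -> Optional[int]:
    """Map AddressOfEntryPoint (an RVA) to a file offset, if possible."""
    pe = _pe_offset(data)
    (ep_rva,) = struct.unpack_from("<I", data, pe + 40)
    for s in read_sections(data):
        span = max(s.virtual_size, s.raw_size)
        if s.virtual_address <= ep_rva < s.virtual_address + span:
            return ep_rva - s.virtual_address + s.raw_offset
    return None


def section_edges(data: bytes, section: Section,
                  n: int = 16) -> Tuple[bytes, bytes]:
    """First and last n raw bytes of a section, for comparison."""
    raw = data[section.raw_offset:section.raw_offset + section.raw_size]
    return raw[:n], raw[-n:]
```

Tuples such as `tuple(s.name for s in read_sections(data))` or the output of `section_edges` can then be counted across samples the same way entrypoint bytes are.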
One nice thing about signatures is that when I fingerprint a packer that seems to be some sort of private version, every sample packed with it ends up in one nice directory. Later on, if the person behind the samples is caught, I can easily dig up all the samples I have and ship them to law enforcement, where hopefully they’ll add some extra value to the case.