De-anonymizing code authors by analyzing executable binaries


A group of researchers who previously showed that programmers can be de-anonymized by analyzing the source code of programs they have created have now demonstrated that good results can also be achieved by analyzing the executable binaries of those programs.

In both experiments they used a dataset collected from Google Code Jam, an annual programming competition: source code samples from 600 programmers, gathered from 2008 to 2014.

“We cast programmer de-anonymization as a machine learning problem,” explained Aylin Caliskan-Islam, a Postdoctoral Research Associate & CITP Fellow at Princeton University, and one of the researchers involved in the experimentation.

The analysis of the binaries followed a particular workflow, illustrated in the figure accompanying the article.

To achieve a good result, the random forest classifier first has to be trained on sample executable binaries from each of the code authors.
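The training step described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the researchers' actual pipeline: the feature vectors here are synthetic stand-ins (random "style" centres plus noise) for whatever features are extracted from real binaries, and the names `make_synthetic_dataset`, `evaluate`, and `N_FEATURES` are hypothetical. The author and sample counts simply mirror figures quoted in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumed sizes: 20 authors and 14 samples each echo the article's numbers;
# the number of features is arbitrary and chosen only for illustration.
N_AUTHORS = 20
SAMPLES_PER_AUTHOR = 14
N_FEATURES = 50


def make_synthetic_dataset(seed=0):
    """Build fake feature vectors; real ones would come from the binaries."""
    rng = np.random.default_rng(seed)
    # Each author gets a random "style" centre in feature space.
    centres = rng.normal(0.0, 1.0, size=(N_AUTHORS, N_FEATURES))
    # Every sample is a noisy copy of its author's centre.
    X = np.repeat(centres, SAMPLES_PER_AUTHOR, axis=0)
    X = X + rng.normal(0.0, 0.5, size=X.shape)
    # Labels: author index 0..19, repeated once per sample.
    y = np.repeat(np.arange(N_AUTHORS), SAMPLES_PER_AUTHOR)
    return X, y


def evaluate(X, y, seed=0):
    """Train and test a random forest with stratified cross-validation."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # One fold per sample-per-author, so each fold holds out one
    # sample from every author and trains on the remaining thirteen.
    folds = StratifiedKFold(
        n_splits=SAMPLES_PER_AUTHOR, shuffle=True, random_state=seed
    )
    return cross_val_score(clf, X, y, cv=folds)


X, y = make_synthetic_dataset()
scores = evaluate(X, y)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Because the data is synthetic and cleanly separated, the accuracy printed here says nothing about real binaries; the sketch only shows the shape of the train-then-classify loop.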

The system does not work as well if there are too few samples for each coder or if the number of coders is too large, but the difference is not that big. In fact, after being fed only one executable binary sample created by a single author, the classifier was capable of correctly identifying the author (out of 20 programmers) with 75.4% accuracy.

After the number of executable binary samples was raised to 14 for each programmer, the classifier classified 280 test instances with 96% accuracy.
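To put those percentages in context, the short calculation below compares them with random guessing. It assumes, as the figures suggest but the article does not state outright, that the 280 test instances correspond to 20 programmers with 14 samples each. The count of correct classifications is only approximate, since the reported accuracy is rounded.

```python
# Figures quoted in the article.
N_AUTHORS = 20
SAMPLES_PER_AUTHOR = 14
ACC_ONE_SAMPLE = 0.754
ACC_FOURTEEN_SAMPLES = 0.96

# Random guessing among 20 equally likely authors.
chance = 1 / N_AUTHORS

# Assumption: every sample from every author is tested once.
test_instances = N_AUTHORS * SAMPLES_PER_AUTHOR

# Approximate number of correct classifications at 96% accuracy.
approx_correct = round(ACC_FOURTEEN_SAMPLES * test_instances)

print(f"chance baseline: {chance:.0%}")
print(f"one-sample accuracy vs chance: {ACC_ONE_SAMPLE / chance:.1f}x")
print(f"test instances: {test_instances}")
print(f"approx. correct at 96%: {approx_correct}")
```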

The researchers also tested their approach with samples collected from open source projects hosted on GitHub.

“Open source projects do not guarantee ground truth on authorship. The feature vectors might capture topics of the project instead of programming style. As a result, open source code does not constitute the ideal data for authorship analysis; however, it allows us to better assess the applicability of programmer de-anonymization in the wild,” the researchers explained in the paper.

The results? An accuracy of 62% in classifying the programmers’ executable binaries.

“Our results present a clear concern for people who would like to release binaries anonymously,” the researchers noted.

Still, malware authors who use anti-forensic techniques such as code obfuscation, data compression and encryption, or specialized compilers that lack decompilers can breathe easy for now, as further work is required to overcome those obstacles to analyzing their malicious executable binaries.