Home‎ > ‎Current Projects‎ > ‎

Polymorphic Malcode Analysis


Polymorphic malcode remains a troubling threat to the security community. The ability for malcode to be automatically transformed into semantically equivalent variants frustrates attempts to rapidly construct a single, simple, easily veriable representation. We present a quantitative analysis of the strengths and limitations of shellcode polymorphism and consider its impact on current intrusion detection practice.

We focus on the nature of shellcode decoding routines; the empirical evidence we gather helps show that modeling the class of self-modifying code is likely intractable by known methods, including both statistical constructs and string signatures. In addition, we develop and present metrics that provide insight into the capabilities, strengths, and weaknesses of these polymorphic encryption engines. In order to explore countermeasures to these future threats, we show how to improve polymorphic techniques and create a proof-of-concept engine that encompass some of these additional potential.

Our analysis supply novel ways to understand the limitations of current signature-based techniques.

We focuse on what we consider the most constrained section of malcode, the decoder portion, such as the five cipher-op layer loop shown above. Since this section of shellcode must contain executable code, it cannot easily be disguised. We derive our motivation from the challenge of modeling this particular type of malcode data. We wondered whether, given unlimited samples of polymorphic code, it is possible to compute and store a set of signatures or a statistical model that could represent this class of code. If so, how costly would such a task be in terms of memory and processing time? In the span of the n-byte space that these samples of code populate, how much overlap is there with the class of benign network traffic? Unlike current research on polymorphic engines, our work focuses on the general class of code that performs decryption independent of the payload.

In our proof-of-concept engine, we demonstrate that not only can self-decryption routines be crafted in so many ways that it is infeasible to write on single class of signatures to detect all potential instances, but that it is also possible to simultaneously blend the entire shellcode sample (NOP sled, return address included) to look like "normal" network traffic. The above image (a) shows a potential statistical distribution of bytes on a particular port on a target host. Image (b) shows a shellcode instance being crafted to "appear" normal by our engine, after a few iterations, the shellcode looks statistically similar to any other typical packet arriving at that port.

We generated millions of fully working encrypted shellcode samples using these advanced obfuscation and blending techniques, and used them analyze the efficacy of these modern polymorphic engines. The image above shows a three-gram scatter plot of the rather small (often 30-40 byte) self-decryption routines we've generated. The image helps to illustrate the magnitude of challenge we face in attempting to model such a large space of potential malicious instruction sequences.

Our study reinforces the a previously held conventional wisdom that it is no longer practical to continue modeling malcode using signatures in IDS systems. We argue that a more safer approach lay in developing higher order statistical models for normal traffic, ones which can resist against these blending attacks. This remains the subject of our on-going work.

  • Yingbo Song, Michael E. Locasto, Angelos Stavrou, Angelos D. Keromytis and Salvatore J. Stolfo "On the Infeasibility of Modeling Polymorphic Shellcode " In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). October 2007, Alexandria, Virginia, USA.