Purple teaming and the role of threat categorization

Organizations constantly work to ensure optimal threat detection and prevention across their systems. One question gets asked repeatedly: “Can we detect the threats we’re supposed to be able to detect?”

Red team assessments, penetration testing, and even purple team assessments (in their current form) are all designed to answer this question. Unfortunately, as attacks grow more complex, these assessments struggle to provide comprehensive answers. Why is that?

The answer is: variation. These assessment services typically test defenses against ten to twenty attack techniques, using only one (or, in rare cases, a few) variations of each. But a single technique can have thousands or even millions of variants; one technique I examined had 39,000 variations, and another had 2.4 million.
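To see where numbers like these come from, consider how choices multiply: if each step of a technique's implementation can be performed in several interchangeable ways, the total variant count is the product of those choices. A minimal sketch, using hypothetical per-step counts:

```python
from math import prod

# Hypothetical counts of interchangeable choices at each implementation
# step of a single technique: which API family, which operation chain,
# which specific function, which argument encoding, and so on.
choices_per_step = [4, 6, 5, 8, 5]

variants = prod(choices_per_step)
print(variants)  # 4800 technically unique variants from one technique
```

Five steps with a handful of options each already yields thousands of variants; real techniques have more steps and more options per step.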

It’s hard to know whether an organization is truly protected or was simply prepared for the specific variant the red team happened to use. Will an attacker use the same one? With thousands of options at their disposal, it isn’t likely.

As a result, many organizations have begun to embrace purple teaming, where red and blue teams work together to take a more comprehensive and collaborative approach to security assessments. But how can teams defend against the huge cloud of possible variations of each attack technique when they don’t account for (or understand) all those variations? This is why I believe purple team assessments must evolve.

Cataloguing attack variants

In my mind, a more comprehensive way to evaluate defenses is to test them against a representative sample of attack technique variants.

Obviously testing each variant of an attack technique – like that one where I found 2.4 million variants – is not practical. First, teams should decide which techniques they want to test for, then catalog the variants of those attacks to the best of their ability, and finally, pick a representative sample of those variants.

This sampling should be diverse. We can reasonably assume that defenders will detect variants “in between” their samples but are less likely to detect ones outside the scope of their sample. A narrow sample leaves out more variants – and provides less information about how defenses will fare against the range of techniques an adversary might use. But picking a diverse set of samples is easier said than done.
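One way to make a sample diverse, rather than hoping random selection covers the space, is to stratify: group the cataloged variants by some structural feature and draw from every group in turn. A minimal sketch in Python (the grouping key and the variant encoding here are hypothetical):

```python
import random
from collections import defaultdict

def diverse_sample(variants, key, k, seed=0):
    """Pick up to k variants, spreading picks across the groups
    produced by key() instead of sampling uniformly at random."""
    groups = defaultdict(list)
    for v in variants:
        groups[key(v)].append(v)
    rng = random.Random(seed)
    pools = [rng.sample(group, len(group)) for group in groups.values()]
    picked = []
    # Round-robin across groups so no single cluster dominates the sample.
    while pools and len(picked) < k:
        for pool in list(pools):
            if len(picked) == k:
                break
            picked.append(pool.pop())
            if not pool:
                pools.remove(pool)
    return picked

# Toy catalog: variants identified by (operation-chain id, function-choice id).
variants = [(chain, fn) for chain in range(3) for fn in range(10)]
sample = diverse_sample(variants, key=lambda v: v[0], k=6)
print(sorted({v[0] for v in sample}))  # all 3 operation chains represented
```

In practice the stratification key could combine sub-technique, operation chain, and API family, so the sample spans the catalog along several axes at once.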

Picking a representative sample of attack techniques is difficult because there’s no good system in cybersecurity for cataloguing the variants of an attack. The system we do have glosses over too much detail.

Traditionally in cybersecurity, attack techniques are broken down into three levels – tactics (like Persistence), techniques (like Kerberoasting), and procedures (the specific tools or steps to execute a technique, like the Invoke-Kerberoast tool created by Will Schroeder). But this model loses too much detail, particularly in the “procedures” category. For example, a technique like Credential Dumping can be accomplished with many different procedures, such as Mimikatz or Dumpert, and each procedure can use a variety of different sequences of function calls. Defining what a procedure is gets difficult very quickly!

How can the industry solve this problem? I believe assessments should look at five or six levels when evaluating attack techniques: tactics, techniques, sub-techniques, procedures (maybe), operations, and functions. The last four allow teams to account for this mass variation and expose the reality that a single technique can have tens or hundreds of thousands of technically unique variants.

To better understand this, let’s dive into more specifics around each of these levels.

  • Tactics – Short-term, tactical adversary goals during an attack. Examples of tactics include Defense Evasion and Lateral Movement (definitions and examples are from the MITRE ATT&CK framework).
  • Techniques – The means by which adversaries achieve tactical goals. For example, Process Injection and Rootkit are both techniques for accomplishing the Defense Evasion tactic mentioned above.
  • Sub-techniques – More specific means by which adversaries achieve tactical goals at a lower level than techniques. For example, Dynamic-link Library Injection and Asynchronous Procedure Call are two different types of Process Injection.
  • Operations – An abstraction that represents the specific actions that must be taken against resources on the target system/environment to implement the (sub-)technique. Often a single operation (like Process Open) will serve as a category for multiple API functions as demonstrated below.
    • Ex. 1 – Process Open -> Memory Allocate -> Process Write -> Thread Create
    • Ex. 2 – Process Open -> Section Create -> Section Map (local) -> Section Map (remote) -> Thread Create
  • Functions – The literal API functions that a specific tool calls to implement the operations in the chain. The operating system often provides numerous nearly identical functions for developers to choose from. However, the difference from one function to another may be enough to bypass a detection. Here are two different strings of functions for the first Operation example I shared above:
    • Ex. 1 – OpenProcess -> VirtualAllocEx -> WriteProcessMemory -> CreateRemoteThread
    • Ex. 2 – NtOpenProcess -> NtAllocateVirtualMemory -> NtWriteVirtualMemory -> NtCreateThreadEx
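The explosion at the function level can be made concrete by enumerating the cross-product of interchangeable functions for each operation. The sketch below uses a small, illustrative subset of Windows API function names per operation; a real catalog would include far more options (direct syscalls, wrapper libraries, and so on):

```python
from itertools import product

# Each operation maps to several near-equivalent Windows API functions.
# This is an illustrative subset, not an exhaustive catalog.
operations = {
    "Process Open":    ["OpenProcess", "NtOpenProcess"],
    "Memory Allocate": ["VirtualAllocEx", "NtAllocateVirtualMemory"],
    "Process Write":   ["WriteProcessMemory", "NtWriteVirtualMemory"],
    "Thread Create":   ["CreateRemoteThread", "NtCreateThreadEx",
                        "RtlCreateUserThread"],
}

# Every combination of function choices is a technically unique variant
# of the same operation chain.
chains = list(product(*operations.values()))
print(len(chains))  # 2 * 2 * 2 * 3 = 24 variants of a single chain
```

Even this toy catalog – four operations, two or three functions each – produces 24 distinct function chains, and that is before counting alternative operation chains like the section-mapping sequence above.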

In this approach, I define a “procedure” as a chain of operations. That definition is fuzzier than the others, which is why it’s sometimes useful to include in the breakdown and sometimes not. Either way, this five- or six-layered model captures the complexity of attack techniques better (in my opinion) and can help defenders select more diverse test cases.

Assessment services offer a powerful way to validate security posture, but they must address the challenge of attack technique variation if we expect them to be comprehensive. By constantly evolving how we build those models, we can improve efficacy and drive better results.
