How Recursive Malware Scanning Navigates Compressed Archives

6/6/2025 - Brian O'Neill

Addressing the Concept of Threat Obfuscation

In cybersecurity, threat obfuscation is an expansive game bad actors play to throw off antivirus (AV) software scanning policies. The idea of threat obfuscation is rational: to successfully smuggle malicious files past robust network defenses, one must successfully convince the “sentries” of those defenses (the AV policies) to look no further than the outer shell of the vehicle the files hide within.

Archive file types like ZIP, RAR, 7Z and others capable of compressing (and sometimes encrypting) immense volumes of content are ideally suited for the purposes of malware obfuscation. Recursion – a complex but extremely important concept in computing – is the methodology best suited to counteract archive-based malware obfuscation strategies.

Invoking a Familiar Metaphor for Malware Obfuscation

If the concept of layering malware within an innocuous container brings Odysseus’ Trojan Horse to mind, we’re thinking along the right lines. In cybersecurity, malicious code smuggled into a system among legitimate software components has been called a “Trojan” since the 1970s, a tribute to the storied (albeit likely mythological) sneak attack which allegedly spelled doom for the ancient city of Troy.

While archive-based attacks aren’t necessarily Trojans by the strictest security definitions, they’re theoretically similar enough to invoke the same metaphor. This metaphor has clear shortcomings, however, when we face the reality of compressed archive-based threat obfuscation – and it’s worth nitpicking these shortcomings to better appreciate the role recursion plays in detecting threats nested deeply within hierarchical structures.

Dispelling the Illusion of a 2-Dimensional Attack Surface

The Trojan Horse of legend was a large, mobile statue with two basic layers. The external layer was a wooden structure shaped to look like a horse, and the internal layer was an open cavity just large enough for Odysseus and his comrades to fit within. The obfuscation of the Greek attack group only went one layer deep – and it’s safe to say that, in hindsight, the Trojans might’ve benefitted from briefly investigating that interior layer before celebrating their supposed victory.

Against threats obfuscated in compressed archives, a second look wouldn’t be enough. This idea is easiest to understand if we briefly suspend disbelief (even further) by re-envisioning the Trojan horse metaphor.

A Horse, Within a Horse, Within a Horse…

Let’s imagine that a Trojan guard was proactive and cunning enough to check inside the suspicious horse structure before allowing it to enter the city of Troy. Imagine that within the structure, rather than uncovering a group of Greek soldiers, this guard instead found boxes of gifts and supplies – along with another, smaller Trojan Horse built to scale. Next, imagine that upon opening the second horse, the guard found yet another assortment of gifts, and yet another scale model of the original horse. Still, no hidden soldiers to be found.

Puzzled as we might expect this guard to be, we’d likely assume they were justified in declaring the horse safe to enter the city walls after their search. We’d also likely empathize with this guard’s feeling of complete bewilderment when, later that night, Odysseus and his troops still emerged from the horse and caught the sleeping city by surprise. They were, by powers unknown to man, hidden within the nth iteration of the scale horse. Perhaps the guard would’ve found them if they’d looked a few more layers deep; perhaps not. It’s impossible to say up front how deep that nested structure would’ve gone.

Addressing the Reality of Nested Threats in Compressed Archives

This absurd multi-layered attack vehicle concept more accurately represents the level of obfuscation compressed archive formats can provide. It’s closer to a Russian nesting doll concept than a Trojan horse. Formats like ZIP can hold countless layers of files – including additional ZIP archives – because those archives are treated just like any additional files by the parent ZIP they live within. It’s not enough to look past one, two, or even three layers of a compressed archive to mitigate an obfuscated threat; it’s essential to look at each file within each archive layer before declaring the full archive “safe”.

Recursion is the key concept which makes deep-archive spelunking possible without knowing exactly how deep the archive goes.

Understanding Recursion and its Utility in Security Workflows

A Brief Overview of Recursion in Computing

Recursion is a powerful concept in mathematics and computing. In computing, it specifically refers to a function or method calling itself to solve smaller pieces of a larger problem.

A method’s ability to accomplish a recursive task depends on the existence of a base case. This base case gives the method call solid ground to work from, preventing it from endlessly calling itself. Examples of recursive problem-solving range greatly in complexity, from computing the factorial of an integer n to solving the famous Tower of Hanoi puzzle for n disks.

Below is a simple example of a recursive method that returns the factorial value for an integer n in Java:

//Recursion Example

public class FactorialExample {
    // Note: int overflows for n > 12; use long or BigInteger for larger inputs
    public static int factorial(int n) {
        if (n <= 1) {
            return 1; // Base case
        }
        return n * factorial(n - 1); // Recursive call
    }

    public static void main(String[] args) {
        int result = factorial(5); // 5! = 120
        System.out.println("Factorial of 5 is: " + result); // result is 120
    }
}

As shown in the above code, the method factorial(int n) is passed an integer value n. The method calls itself with n - 1 until the argument reaches the base case value of 1. Each call made before the base case is reached is suspended on the call stack, holding its own value of n while it waits for the result of the call beneath it.

Once n = 1, the base case returns 1, and that value is multiplied by the n held in the call one step above (in this case, 2). The product is then returned and multiplied by the next stored value (in this case, 3). These multiplications continue up the call stack until the result of factorial(n - 1) is multiplied by the original input n. The return from this method is the factorial value of n. If n = 5, the return is 120, which is the product 1 * 2 * 3 * 4 * 5.

It's important to note that recursion, while powerful, can be more resource-intensive than an equivalent loop. Each pending recursive call holds a stack frame in memory, so very deep recursion can exhaust the call stack (in Java, this surfaces as a StackOverflowError) if the depth is not bounded or memory consumption is not handled carefully.
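For comparison, the same factorial can be written as a loop, which keeps stack usage constant no matter how large n gets. The following is a minimal sketch rather than a recommended replacement; the class name IterativeFactorial is illustrative, and BigInteger is used only so that larger inputs do not overflow.

```java
import java.math.BigInteger;

// Illustrative sketch: a loop-based factorial that needs only one stack frame.
class IterativeFactorial {
    static BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        BigInteger result = BigInteger.ONE;
        // Multiply 2 * 3 * ... * n; for n <= 1 the loop body never runs.
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(factorial(5));  // 120
        System.out.println(factorial(20)); // 2432902008176640000
    }
}
```

The recursive version remains easier to map onto hierarchical problems, which is why it reappears in the directory and archive examples that follow.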

Recursion in File Directory Traversal

File directory traversal is a natural fit for recursive computing. File directories are hierarchically structured: every folder is itself a smaller directory tree, and a folder with no subfolders serves as a natural base case, whether we’re starting from a root folder or 20 layers deep.

When we search for content in Windows File Explorer, for example, the search behaves conceptually like a recursive method (in practice, the OS may also consult a prebuilt index to speed things up). If, starting from our root folder, we search for files containing the word “report”, our device will slowly but surely work through each successive folder in our system – checking each file within each folder – for content with “report” in the title or body. It’ll do this until it reaches the very last set of folders in the file directory hierarchy, and from there, it’ll work its way back up to the root folder.

This is ultimately very similar to our earlier factorial example. Instead of successively multiplying new values, our operating system (OS) will employ its own logic in each recursive call to collect and display the files which match our search string.
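The traversal described above can be sketched in a few lines of Java. This is a simplified illustration, not how any particular operating system implements search: the class and method names are hypothetical, it matches on file names only (not file contents), and it does not guard against symbolic-link cycles.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: recursive file-name search through a directory tree.
class RecursiveFileSearch {
    // Adds every file under dir whose name contains term (case-insensitive).
    static void search(File dir, String term, List<File> matches) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // Not a directory, or not readable: nothing to descend into
        }
        for (File child : children) {
            if (child.isDirectory()) {
                search(child, term, matches); // Recursive call on the subfolder
            } else if (child.getName().toLowerCase().contains(term.toLowerCase())) {
                matches.add(child);
            }
        }
        // Base case: a folder with no subfolders triggers no further calls.
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> matches = new ArrayList<>();
        search(root, "report", matches);
        matches.forEach(f -> System.out.println(f.getPath()));
    }
}
```

Each call handles one folder and hands its subfolders to further calls, so the code never needs to know in advance how deep the hierarchy goes.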

Recursion in Compressed Archive Threat Scanning

Archives like ZIP, TAR, 7Z, and others are little more than portable file directories, usually paired with a compression algorithm (TAR, for instance, only bundles files and is typically compressed with a separate tool such as gzip). This makes them powerful, ubiquitous tools for sharing a multitude of large files and/or folders at once, and it’s also what makes them naturally suited for recursive threat scanning techniques.

No matter how deep a threat actor chooses to bury malware within the hierarchical structure of a compressed archive file, AV software equipped with recursive archive-scanning methods will eventually retrieve the file in question. Whether or not the threat is identified depends, at that point, on the threat detection policies themselves.

After being recursively identified, files (e.g., .pdf, .docx, .jpg) within the archive will be processed with AV scanning logic, while each nested archive will be decompressed and scanned by a further recursive call, which ends once an archive contains no more nested archives. Provided memory resources scale appropriately to accommodate extremely deep archive scans, recursive scanning methods will consistently put AV workflows in the best possible position to uncover nested, obfuscated threats.
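As a rough illustration of the process described above, the sketch below walks a ZIP held in memory and recurses into any nested ZIP it finds. It is not Cloudmersive’s implementation: the class name, the MAX_DEPTH limit, and the choice to detect nested archives by their .zip extension are all assumptions made for brevity, and it relies on InputStream.readAllBytes (Java 9 or later).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative sketch: list every file in a ZIP, descending into nested ZIPs.
class RecursiveZipWalker {
    static final int MAX_DEPTH = 32; // Assumed safety limit, not a standard value

    static void walk(InputStream in, String prefix, int depth, List<String> found)
            throws IOException {
        if (depth > MAX_DEPTH) {
            throw new IOException("Archive nesting exceeds " + MAX_DEPTH + " levels");
        }
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            byte[] data = zip.readAllBytes(); // Reads the current entry only
            if (entry.getName().toLowerCase().endsWith(".zip")) {
                // Nested archive: recurse with the decompressed bytes
                walk(new ByteArrayInputStream(data), path + "!/", depth + 1, found);
            } else {
                found.add(path); // A real scanner would inspect `data` here
            }
        }
        // Base case: an archive with no nested archives makes no further calls.
    }

    // Helper for the demo: builds a one-entry ZIP in memory.
    static byte[] zipOf(String name, byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("payload.txt", "hello".getBytes(StandardCharsets.UTF_8));
        byte[] outer = zipOf("middle.zip", zipOf("inner.zip", inner));
        List<String> found = new ArrayList<>();
        walk(new ByteArrayInputStream(outer), "", 0, found);
        System.out.println(found); // [middle.zip!/inner.zip!/payload.txt]
    }
}
```

Loading each entry fully into memory keeps the sketch short but would not suit very large files; a production scanner would more likely stream entries and enforce size limits as well as a depth limit.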

Recursive Scanning with Cloudmersive

Cloudmersive’s Advanced Virus Scan API utilizes recursion as a core mechanic in its deep content verification process.

Assuming archives are treated as trusted file types in a file upload process, the Advanced Scan API will recursively unpack each archive – and any nested archives within it – to scour each individual file for a wide range of potential threats. This includes executable content, scripts, macros, invalid content, and more, along with known virus and malware signatures via signature-based scanning techniques.

Archives with unsafe extraction outcomes – such as ZIP or RAR “bombs” packed with immense volumes of data intended to crash a vulnerable system – are identified and distinguished from other archive threats. These archives are rendered incapable of harming the target system, as they are detected and processed on dedicated API infrastructure.
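One common, generic defense against archive bombs is to cap the number of bytes an extraction is allowed to produce. The sketch below shows that idea only; it does not represent how the Advanced Scan API detects bombs, and the class name, method names, and byte limits are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative sketch: abort extraction once a decompressed-size budget is exceeded.
class ExtractionBudget {
    // Returns the total decompressed size, or throws when it passes maxBytes.
    static long measure(InputStream in, long maxBytes) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read; // Count bytes as they are inflated
                    if (total > maxBytes) {
                        throw new IOException(
                                "Decompressed size exceeds " + maxBytes + " bytes");
                    }
                }
            }
        }
        return total;
    }

    // Helper for the demo: a ZIP whose single entry expands to `size` zero bytes.
    static byte[] zipOfZeros(int size) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("zeros.bin"));
            zip.write(new byte[size]);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zipOfZeros(1 << 20); // Expands to 1 MiB
        System.out.println("Compressed size: " + archive.length + " bytes");
        System.out.println("Decompressed size: "
                + measure(new ByteArrayInputStream(archive), 10L << 20)); // 1048576
        try {
            measure(new ByteArrayInputStream(archive), 64 * 1024);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the count is taken while bytes are being inflated, the check does not depend on the size fields declared in the archive’s headers, which an attacker controls. In a recursive scanner, the same budget would need to be shared across all nested levels.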

Nested archives are identified with content verification capabilities which look past the given file extension. This roots out extension-based obfuscation (e.g., disguising a .zip as a .jpg). File types are detected by matching their structure against the strict formatting specifications defined for each file format.
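The simplest form of looking past a file extension is checking a file’s leading “magic” bytes. The sketch below is a deliberately minimal illustration with hypothetical class and method names; it is far weaker than the full structural validation described above, since it inspects only the first few bytes and knows just three formats.

```java
// Illustrative sketch: guess a file type from its leading bytes, ignoring its name.
class SignatureSniffer {
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) { // Compare as unsigned bytes
                return false;
            }
        }
        return true;
    }

    static String detect(byte[] data) {
        // "PK\3\4" opens a ZIP local file header; "PK\5\6" opens an empty ZIP
        if (startsWith(data, 0x50, 0x4B, 0x03, 0x04)
                || startsWith(data, 0x50, 0x4B, 0x05, 0x06)) {
            return "zip";
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpeg";
        }
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46)) { // "%PDF"
            return "pdf";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // First bytes of a ZIP that has been renamed to holiday.jpg
        byte[] disguised = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println("holiday.jpg looks like: " + detect(disguised)); // zip
    }
}
```

Note that ZIP-based document formats such as .docx begin with the same signature, which is one reason real content verification has to examine structure beyond the first few bytes.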

The Advanced Scan API can be deployed in defense of individual web applications (with minor code changes), and it can be deployed with zero code changes at the network perimeter in a forward proxy, reverse proxy, or fully-fledged Web Application Firewall (WAF). It can also be deployed adjacent to AWS, Azure, GCP, and other cloud object storage instances to perform in-storage scanning after files are uploaded.

Conclusion

Recursive techniques aren’t just an intriguing concept – they’re a practical necessity in modern threat detection. As efforts to obfuscate threats within compressed archive formats grow more sophisticated, simply inspecting the first few layers of an archive becomes less and less sufficient. Just like basic recursive methods which solve for n, advanced antivirus systems must recursively solve for each potential threat in an archive until no new layers remain. Without recursive capabilities, the guard at the gate might miss the real threat hidden several horses deep.

To learn more about Cloudmersive’s Advanced Virus Scan API recursive scanning capabilities, please feel free to contact a member of our team.
