Understanding HTML Format and its Threat Potential
6/2/2025 - Brian O'Neill
Introduction

HyperText Markup Language (HTML) provides the foundation of the structured web content we interact with each day. It’s an essential language used to display static web pages and rich, interactive applications alike. HTML’s ubiquity across countless platforms and devices is a direct result of its flexibility – and this same characteristic makes it a potent security threat. HTML can be packed with threats and manipulated in subtle, dangerous ways to evade detection by weakly configured threat scanning software.

In this article, we’ll learn how documents written in HTML are structured, why HTML is so widely adopted, and how bad actors deliver malware, execute phishing attacks, and exfiltrate enterprise data with specially crafted HTML content. In that effort, we’ll examine a real-world critical vulnerability (CVE) which highlights how easily HTML rendering engines can be abused. At the end, we’ll discuss how Cloudmersive’s Advanced Virus Scan API helps uncover hidden threats in HTML files – even when they’re carefully obfuscated to avoid detection.

What is HTML?

Unifying Presentation on the World Wide Web

HTML is a standardized markup language used to structure and present content on the web. It was first introduced in 1991 by Tim Berners-Lee, and it quickly became the primary language of the budding web. It evolved over time from a document formatting system (HTML 2.0) to the robust application platform (HTML5) it is today. Since 2004, the HTML standard has been maintained by the Web Hypertext Application Technology Working Group (WHATWG) – a community of individuals from various tech backgrounds motivated to evolve HTML and related technologies.

Storing HTML in a Dedicated File Format

The files that hold HTML content – typically designated with

SGML Roots (and XML Overlap)

HTML was originally derived from Standard Generalized Markup Language (SGML) – a meta-language created in the 1980s to define markup languages in a standardized way.
It’s worth noting that Extensible Markup Language (XML) shares this derivation. XML was released toward the end of the 1990s as a structured data exchange format with content display capabilities. Unlike its distant cousin, HTML was created explicitly with content layout and presentation in mind. That said, XML and HTML share a similar tag-based, hierarchical, and flexible syntax, and this syntax can hold complex references to external content. Modern HTML (specifically HTML5) isn’t limited to embedding text, images, and links; it can embed scripts, video, audio, and even entire web applications (thanks to technologies like

Understanding HTML File Structure

An HTML file generally consists of a doctype declaration, a root page element, a head section, and a body section. Attributes can be added to tags to provide functionality or styling, and tags can be nested inside one another hierarchically. Below is a basic code example of this structure:
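A minimal sketch of that structure follows. The page content is hypothetical, and it is embedded as a string in a short Python script so the nesting can be printed with the standard library's `html.parser`. The names `SAMPLE_HTML`, `OutlineParser`, and `outline_of` are invented here for illustration.

```python
from html.parser import HTMLParser

# Hypothetical minimal page: doctype, root <html> element, <head>, <body>,
# two tags carrying attributes, and one tag nested inside another.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <h1 class="headline">Hello, world</h1>
    <p>A paragraph with a <a href="https://example.com">nested link</a>.</p>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Records the doctype and an indented outline of opening tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []
        self.doctype = None

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text inside <! and >.
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.outline.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # The sample has no void elements, so every start tag has an end tag.
        self.depth -= 1


def outline_of(html_text):
    """Return (doctype, list of indented opening tags) for html_text."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.doctype, parser.outline


if __name__ == "__main__":
    doctype, outline = outline_of(SAMPLE_HTML)
    print("doctype:", doctype)
    print("\n".join(outline))
```

Running the script prints the doctype followed by one line per opening tag, indented by nesting depth, which makes the head/body split and the link nested inside the paragraph easy to see.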
In contrast to more esoteric languages and formats, HTML is quite human-readable – a major advantage which aids in the democratization of custom web-designed content. It’s supported by all major browsers and operating system (OS) platforms, and it’s even usable in emails, PDFs, CHM help files, and other common documents.

On the surface, HTML structure might seem straightforward enough to get a quick handle on. However, the full capabilities of HTML code aren’t entirely obvious up front. HTML can load and execute JavaScript from external sources (on the web), launch invisible forms (or

Real-World Threat Vectors in HTML Files

The scripting and external linking features that HTML supports are powerful tools for threat actors. Attackers frequently weaponize HTML files using several common strategies, some of which we’ve outlined below.

Phishing Pages & Credential Harvesters

It’s easy for threat actors to build fake login pages or forms in HTML which visually mimic real, trusted services. It’s just as easy for threat actors to embed those pages in email attachments or upload them to misconfigured web servers. When victims are tricked by these pages, threat actors can capture the information they enter on a command-and-control (C2) server.

These phishing and credential harvesting attacks are often the first steps in large-scale enterprise account takeovers. They can lead to unauthorized access to internal systems, data exfiltration (theft), financial fraud, or even larger, more widespread phishing campaigns directed from within the compromised organization. It’s far easier to trick subsequent victims when the account they’re receiving malicious content from is verifiably “safe”.

JavaScript Payloads

JavaScript code enables dynamic actions on web pages. HTML files which include JavaScript can execute that code automatically in a browser (or any HTML reader) once the file is opened.
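As a rough illustration of where JavaScript can be attached inside an HTML file, the sketch below walks a hypothetical document with Python's standard `html.parser` and lists four kinds of location: external scripts, inline scripts, `on*` event-handler attributes, and `javascript:` URLs. The names `ScriptLocator`, `find_script_locations`, and `SAMPLE` are invented for this example, the `on*` check is a simple heuristic, and none of this is meant to describe how any particular scanning product works.

```python
from html.parser import HTMLParser


class ScriptLocator(HTMLParser):
    """Collects places in an HTML document where JavaScript can be attached."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.findings.append(("external-script", src))
            else:
                self.findings.append(("inline-script", ""))
        for name, value in attrs:
            value = value or ""  # attributes without a value arrive as None
            if name.startswith("on"):
                # Heuristic: treats any attribute starting with "on"
                # (onload, onclick, ...) as an event handler.
                self.findings.append(("event-handler", f"{tag}.{name}"))
            elif value.strip().lower().startswith("javascript:"):
                self.findings.append(("javascript-url", f"{tag}.{name}"))


def find_script_locations(html_text):
    """Return a list of (kind, detail) tuples in document order."""
    locator = ScriptLocator()
    locator.feed(html_text)
    locator.close()
    return locator.findings


# Hypothetical document used only to exercise the sketch.
SAMPLE = """<html>
<body onload="init()">
<script src="https://cdn.example.test/lib.js"></script>
<script>document.title = "hi";</script>
<a href="javascript:void(0)">click</a>
<p>No script here.</p>
</body>
</html>"""

if __name__ == "__main__":
    for kind, detail in find_script_locations(SAMPLE):
        print(kind, detail)
```

Listing these locations says nothing about intent. It only shows where the JavaScript embedded in an HTML file can run from, some of it on open (script tags, `onload`) and some of it on user interaction (click handlers, `javascript:` links).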
Embedded JavaScript can include keyloggers, downloads, and other malicious content – all of which are very difficult for simple antivirus (AV) solutions to identify. Threat actors can use JavaScript to steal browser data, achieve privilege escalation in the victim’s environment via browser exploits, or silently download a secondary payload using the code they slipped into the victim’s environment.

Data Exfiltration via