Understanding HTML Format and its Threat Potential
6/2/2025 - Brian O'Neill


Introduction

HyperText Markup Language (HTML) provides the foundation of structured web content we interact with each day. It’s an essential language used to display both static web pages and rich, interactive applications alike.

HTML’s ubiquity across countless platforms and devices is a direct result of its flexibility – and this same characteristic makes it a potent security threat. HTML can be packed with malicious content and manipulated in subtle, dangerous ways to evade detection by weakly configured threat scanning software.

In this article, we’ll learn how documents written in HTML are structured, why HTML is so widely adopted, and how bad actors deliver malware, execute phishing attacks, and exfiltrate enterprise data with specially crafted HTML content. In that effort, we’ll examine a real-world critical vulnerability (CVE) which effectively highlights how HTML rendering engines can be easily abused. At the end, we’ll discuss how Cloudmersive’s Advanced Virus Scan API helps uncover hidden threats in HTML files – even when they’re carefully obfuscated to avoid detection.

What is HTML?

Unifying Presentation on the World Wide Web

HTML is a standardized markup language used to structure and present content on the web. It was first introduced in 1991 by Tim Berners-Lee, and it quickly became the primary language of the budding web. It evolved over time from a document formatting system (HTML 2.0) to the robust application platform (HTML5) it is today.

Since 2004, the HTML standard has been maintained by the Web Hypertext Application Technology Working Group (WHATWG) – a community of individuals from various tech backgrounds motivated to evolve HTML and related technologies.

Storing HTML in a Dedicated File Format

The files that hold HTML content – typically designated with .html or .htm extensions – are simple, plain text documents consisting of nested tags. These tags carry the content we expect to see on any given web page, like paragraphs, links, images, forms, etc. They also carry the content we don’t see – like scripts, which take automated actions in web browsers and other HTML readers.

SGML Roots (and XML Overlap)

HTML was originally derived from Standard Generalized Markup Language (SGML) – a meta-language created in the 1980s to define markup languages in a standardized way. It’s worth noting that this derivation is also true of Extensible Markup Language (XML), which was released toward the end of the 1990s as a structured data exchange format with content display capabilities.

Unlike its distant cousin, HTML was created explicitly with content layout and presentation in mind. That said, both XML and HTML share a similarly tag-based, hierarchical and flexible syntax, and this syntax can hold complex references to external content. Modern HTML (specifically HTML5) isn’t limited to embedding text, images, and links; it can embed scripts, video, audio, and even entire web applications (thanks to technologies like <canvas>, <iframe>, and <script>).

Understanding HTML File Structure

HTML file structure generally consists of a doctype declaration, a root page element, a head section, and a body section. Attributes can be included to add functionality or styling, and tags can be nested inside one another hierarchically.

Below is a basic code example of this structure:

<!DOCTYPE html> <!-- Declares HTML5 document type -->
<html lang="en"> <!-- Root element of the page -->
<head>
  <meta charset="UTF-8"> <!-- Page encoding -->
  <meta name="viewport" content="width=device-width, initial-scale=1.0"> <!-- Mobile scaling -->
  <title>Sample HTML Page</title> <!-- Browser tab title -->
  <link rel="stylesheet" href="styles.css"> <!-- External stylesheet -->
  <script src="script.js"></script> <!-- External JavaScript -->
</head>
<body>
  <h1>Welcome to My Page</h1> <!-- Main heading -->
  <p>This is a paragraph of text that gives information to the reader.</p>

  <a href="https://example.com" target="_blank">Visit Example.com</a> <!-- Link with attributes -->

  <img src="image.jpg" alt="A description of the image"> <!-- Embedded image -->

  <ul>
    <li>First item</li>
    <li>Second item</li>
    <li>Third item</li>
  </ul> <!-- Unordered list -->

  <div class="content-section">
    <h2>Section Title</h2>
    <p>More detailed content goes here.</p>
  </div> <!-- Layout using a div -->
</body>
</html>
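
The nesting in the example above can be made visible programmatically. Below is a minimal sketch in Python that uses the standard library’s html.parser module to print an indented outline of a page’s tags. The names OutlineParser, outline, and SAMPLE are illustrative and not part of any product; the sample is a shortened copy of the page above.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they must not deepen the outline.
VOID_ELEMENTS = {"meta", "link", "img", "br", "hr", "input", "area", "base",
                 "col", "embed", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collects (depth, tag) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def outline(html_text):
    """Return the tag outline of html_text as a list of (depth, tag) pairs."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.outline


SAMPLE = """<!DOCTYPE html>
<html lang="en">
<head>
  <title>Sample HTML Page</title>
</head>
<body>
  <h1>Welcome to My Page</h1>
  <ul>
    <li>First item</li>
  </ul>
</body>
</html>"""

if __name__ == "__main__":
    for depth, tag in outline(SAMPLE):
        print("  " * depth + tag)
```

This parser is lenient and does not validate the document, so the outline is only as trustworthy as the markup is well-formed.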

In contrast to more esoteric languages and formats, HTML is quite human-readable – a major advantage which aids in the democratization of custom web-designed content. It’s supported by all major browsers and operating system (OS) platforms, and it’s even usable in emails, PDFs, CHM help files, and other common documents.

On the surface, HTML structure might seem straightforward enough to get a quick handle on. However, the full capabilities of HTML code aren’t entirely obvious up front. HTML can load and execute JavaScript from external sources on the web, submit hidden forms or load invisible <iframe> elements that take specific actions without the user’s awareness, store Base64-encoded payloads inside tags like <script>, <style>, and <iframe>, and even directly reference remote files or endpoints which serve dynamic content. That’s a very wide potential attack surface for a single markup format.
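
To make that attack surface concrete, the sketch below inventories the features just listed: externally hosted scripts, iframes, Base64 data: URIs, and other remote references. It is an illustrative Python example built on the standard library; the class name FeatureInventory, the attribute list, and the example.net URLs are assumptions made for the example. It lists features only and makes no judgment about whether they are malicious.

```python
from html.parser import HTMLParser

URL_ATTRS = ("src", "href", "action")
REMOTE_PREFIXES = ("http://", "https://", "//")


class FeatureInventory(HTMLParser):
    """Records where a page uses the features named above."""

    def __init__(self):
        super().__init__()
        self.external_scripts = []  # <script src=...> pointing at a remote URL
        self.iframes = []           # src of every <iframe>, remote or not
        self.data_uris = []         # (tag, attribute) pairs holding a Base64 data: URI
        self.remote_refs = []       # (tag, url) for other remote references

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "").strip() for name, value in attrs}
        if tag == "iframe":
            self.iframes.append(attr.get("src", ""))
        for name, value in attr.items():
            lowered = value.lower()
            if lowered.startswith("data:") and ";base64," in lowered:
                self.data_uris.append((tag, name))
            elif name in URL_ATTRS and lowered.startswith(REMOTE_PREFIXES):
                if tag == "script":
                    self.external_scripts.append(value)
                else:
                    # A remote iframe appears here and in self.iframes.
                    self.remote_refs.append((tag, value))


def inventory(html_text):
    parser = FeatureInventory()
    parser.feed(html_text)
    parser.close()
    return parser


SAMPLE = """
<script src="https://cdn.example.net/app.js"></script>
<iframe src="https://tracker.example.net/frame" width="0" height="0"></iframe>
<img src="data:image/png;base64,iVBORw0KGgo=">
<a href="/about.html">About</a>
"""

if __name__ == "__main__":
    found = inventory(SAMPLE)
    print("external scripts:", found.external_scripts)
    print("iframes:", found.iframes)
    print("base64 data URIs:", found.data_uris)
    print("other remote references:", found.remote_refs)
```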

Real World Threat Vectors in HTML Files

The scripting and external linking features that HTML supports are powerful tools for threat actors. Attackers frequently weaponize HTML files using several common strategies, some of which we’ve outlined below.

Phishing Pages & Credential Harvesters

It’s easy for threat actors to build fake login pages or forms in HTML which visually mimic real, trusted services. It’s just as easy for threat actors to embed those pages in email attachments or upload them to misconfigured web servers. When victims are tricked by these pages, threat actors can capture their information inputs on a command-and-control (C2) server.

These phishing and credential harvesting attacks are often the first steps in large-scale enterprise account takeovers. They can lead to unauthorized access to internal systems, data exfiltration (theft), financial fraud, or even larger, more widespread phishing campaigns directed from within the compromised organization. It’s far easier to trick subsequent victims when the account they’re receiving malicious content from is verifiably “safe”.
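
One simple signal of a credential harvester is a form that collects a password and posts it to a host the organization doesn’t recognize. The Python sketch below illustrates that check under stated assumptions: the function name, the trusted_hosts parameter, and the example hostnames are hypothetical, and real phishing pages often submit data through JavaScript instead of a form action, which this check would miss.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CredentialFormFinder(HTMLParser):
    """Tracks each <form> and whether it contains a password input."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form, in document order
        self._current = None  # the form currently open, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {"action": (attr.get("action") or "").strip(),
                             "has_password": False}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            if (attr.get("type") or "").lower() == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def credential_forms_posting_off_site(html_text, trusted_hosts):
    """Return the action URLs of password forms that post to untrusted hosts."""
    finder = CredentialFormFinder()
    finder.feed(html_text)
    finder.close()
    flagged = []
    for form in finder.forms:
        host = urlparse(form["action"]).hostname  # None for relative actions
        if form["has_password"] and host is not None and host not in trusted_hosts:
            flagged.append(form["action"])
    return flagged


SAMPLE = """
<form action="https://collector.example.net/submit" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
</form>
<form action="/search"><input type="text" name="q"></form>
"""

if __name__ == "__main__":
    print(credential_forms_posting_off_site(SAMPLE, {"login.example.com"}))
```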

JavaScript Payloads

JavaScript code enables dynamic actions on web pages. HTML files which include JavaScript can execute that code automatically in a browser (or any HTML reader) once the file is opened. That code can include keyloggers, downloads, and other malicious content – all of which are very difficult for simple antivirus (AV) solutions to identify.

Threat actors can use JavaScript to steal browser data, achieve privilege escalation in the victim’s environment via browser exploits, or silently download a secondary payload with the code they snuck into the victim’s environment.
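
A scanner can’t decide from syntax alone whether a script is malicious, but it can count constructs that commonly appear in obfuscated payloads. The Python sketch below is a rough illustration: the indicator list is an assumption chosen for the example, every one of these functions also has legitimate uses, and a real scanner would rely on much deeper analysis.

```python
import re
from html.parser import HTMLParser

# Assumed, non-exhaustive list of constructs often seen in obfuscated scripts.
INDICATORS = {
    "eval": re.compile(r"\beval\s*\("),
    "atob": re.compile(r"\batob\s*\("),
    "unescape": re.compile(r"\bunescape\s*\("),
    "fromCharCode": re.compile(r"\bString\.fromCharCode\b"),
    "document.write": re.compile(r"\bdocument\.write\s*\("),
}


class InlineScriptCollector(HTMLParser):
    """Gathers the text of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def script_indicators(html_text):
    """Count indicator matches across all inline scripts in html_text."""
    collector = InlineScriptCollector()
    collector.feed(html_text)
    collector.close()
    joined = "\n".join(collector.scripts)
    return {name: len(pattern.findall(joined))
            for name, pattern in INDICATORS.items()}


SAMPLE = (
    '<script>var p = atob("YWxlcnQoMSk="); eval(p);</script>\n'
    "<p>Text that mentions eval( outside a script is ignored.</p>"
)

if __name__ == "__main__":
    print(script_indicators(SAMPLE))
```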

Data Exfiltration via <iframe> and <form> Elements

Threat actors can craft HTML files with embedded invisible iframes or self-submitting forms. These can silently load internal pages within a target environment, directly manipulate web browser behaviors, or siphon sensitive data to a remote server. These elements are typically invisible to the user.

Such attacks often result in the extraction of valuable internal details like network topology, active sessions, or authentication tokens – all of which can be weaponized in follow-up attacks on the same system. These follow-ups are likely to succeed because the user isn’t made aware that anything was wrong in the first place.
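
The two techniques above leave recognizable traces in markup. The Python sketch below flags iframes that are styled or sized to be invisible, along with inline event handlers that call submit(). The hiding heuristics, the names, and the example URLs are assumptions made for illustration; content hidden through an external stylesheet or submitted from a separate script would not be caught.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden",
                          re.IGNORECASE)
TINY_SIZES = {"0", "1", "0px", "1px"}


def is_visually_hidden(attr):
    """Heuristic: hidden attribute, hiding inline style, or 0/1 pixel size."""
    if "hidden" in attr:
        return True
    if HIDDEN_STYLE.search(attr.get("style") or ""):
        return True
    width = (attr.get("width") or "").strip().lower()
    height = (attr.get("height") or "").strip().lower()
    return width in TINY_SIZES or height in TINY_SIZES


class HiddenElementFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_iframes = []        # src values of invisible iframes
        self.auto_submit_handlers = []  # (tag, attribute) pairs calling submit()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "iframe" and is_visually_hidden(attr):
            self.hidden_iframes.append(attr.get("src") or "")
        for name, value in attrs:
            if name.startswith("on") and value and ".submit(" in value:
                self.auto_submit_handlers.append((tag, name))


def find_hidden_elements(html_text):
    finder = HiddenElementFinder()
    finder.feed(html_text)
    finder.close()
    return finder


SAMPLE = """
<body onload="document.forms[0].submit()">
<iframe src="https://intranet.example.local/status" style="display:none"></iframe>
<iframe src="/widget" width="600" height="400"></iframe>
<form action="https://collector.example.net/in" method="post">
  <input type="hidden" name="token" value="abc">
</form>
</body>
"""

if __name__ == "__main__":
    result = find_hidden_elements(SAMPLE)
    print("hidden iframes:", result.hidden_iframes)
    print("auto-submit handlers:", result.auto_submit_handlers)
```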

HTML Smuggling

Sophisticated threat actors can embed encoded payloads – like .exe, .docm (macro-enabled Word document), or .lnk files – inside an HTML file using JavaScript code. If a victim opens this file in their web browser, the JavaScript can decode and construct a malicious file entirely within the context of the user’s device. The attacker can store this file in memory on the victim’s device or download it using the victim browser’s own save functions.

These payloads are typically Base64-encoded within a <script> tag – or in some cases referenced externally. Because the files are assembled locally in the victim’s browser, they tend to bypass traditional perimeter defenses like email scanners, proxies, and web gateways.
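
Because a smuggled payload has to sit inside the HTML file in some encoded form, one basic check is to decode the start of each long Base64 run and compare it against known file signatures. The Python sketch below illustrates the idea for the three file types mentioned above. The function name and the 16-character threshold are assumptions, the two-byte executable signature is weak enough to produce false positives, and payloads that are split, reversed, or encrypted would evade a check this simple.

```python
import base64
import re

# Runs of Base64 characters long enough to hold a file header.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

# Leading bytes ("magic numbers") of the file types discussed above.
MAGIC_PREFIXES = {
    b"MZ": "Windows executable (PE)",
    b"PK\x03\x04": "ZIP container (includes .docm)",
    b"L\x00\x00\x00\x01\x14\x02\x00": "Windows shortcut (.lnk)",
}


def find_embedded_file_headers(text):
    """Return (offset, label) for Base64 runs that decode to a known header."""
    findings = []
    for match in BASE64_RUN.finditer(text):
        # The first 16 Base64 characters always decode cleanly to 12 bytes.
        head = base64.b64decode(match.group()[:16])
        for magic, label in MAGIC_PREFIXES.items():
            if head.startswith(magic):
                findings.append((match.start(), label))
    return findings


# Harmless stand-in for a smuggled executable: an "MZ" header plus zero bytes.
FAKE_PAYLOAD = base64.b64encode(b"MZ\x90\x00" + bytes(32)).decode("ascii")
SAMPLE = '<script>var f = atob("' + FAKE_PAYLOAD + '");</script>'

if __name__ == "__main__":
    for offset, label in find_embedded_file_headers(SAMPLE):
        print(f"offset {offset}: {label}")
```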

HTML Vulnerability in the Wild: CVE-2023-4863

The CVE we’ve chosen to highlight in this article – CVE-2023-4863 – is a bit different from the vectors outlined above, but it’s one of the more high-profile HTML-adjacent vulnerabilities in recent years. It was a critical heap buffer overflow vulnerability affecting Google Chrome’s rendering engine (Blink).

The issue lay in how libwebp – the WebP image decoding library used by Blink – handled specially crafted WebP images referenced from HTML content. Threat actors could use an <img> tag pointing to a malformed WebP file to corrupt the browser’s memory and potentially execute arbitrary code.

This vulnerability remains a good reminder of how even the most basic HTML files can carry hidden content capable of hijacking a system. Many applications render HTML, and maliciously crafted content referenced from that HTML can initiate devastating attacks.
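
A file’s extension says nothing reliable about its contents, so a scanning workflow typically identifies images by their leading bytes before deciding how to handle them. The Python sketch below recognizes the WebP container (a RIFF header carrying the WEBP form type) and reports the FourCC of its first chunk. The function name is illustrative, and the check only identifies the file type: it cannot tell whether an image is crafted to exploit a decoder, which is addressed by patching the decoder and by deeper content inspection.

```python
import struct


def webp_first_chunk(data):
    """Return the first chunk FourCC of a WebP file, or None if not WebP.

    Layout: b"RIFF", a 4-byte little-endian size, b"WEBP", then a chunk
    FourCC such as "VP8 ", "VP8L", or "VP8X".
    """
    if len(data) < 16 or data[0:4] != b"RIFF" or data[8:12] != b"WEBP":
        return None
    return data[12:16].decode("ascii", errors="replace")


if __name__ == "__main__":
    # Minimal header for demonstration only; it is not a decodable image.
    sample = (b"RIFF" + struct.pack("<I", 12) + b"WEBP"
              + b"VP8L" + struct.pack("<I", 0))
    print(webp_first_chunk(sample))
    print(webp_first_chunk(b"\x89PNG\r\n\x1a\n" + bytes(8)))
```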

Advanced Scanning for HTML-Based Threats with Cloudmersive

Cloudmersive’s Advanced Virus Scan API offers enhanced protection against HTML-based threats by digging deep into the structure of .htm and .html files. It additionally identifies HTML content embedded in email message bodies and other sensitive locations.

The Advanced Scan API inspects embedded scripts in HTML files for obfuscated or suspicious logic, and it follows embedded links to identify suspicious domains. When content is Base64-encoded or compressed inline, it deconstructs that content to ensure threats aren’t hidden within. Additionally, it validates against HTML smuggling patterns and known phishing payloads to protect unsuspecting enterprise users from well-disguised attacks.

Whether HTML is part of a user-uploaded archive, an email attachment, or a link preview workflow, Cloudmersive scans the file at the content level to identify hidden and emerging threats.
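
As a rough idea of how an upload workflow might hand an HTML file to the scanner, the Python sketch below builds a multipart POST request using only the standard library. The endpoint URL, the Apikey header, and the inputFile form field are written from memory of Cloudmersive’s public documentation and should be treated as assumptions to confirm against the current API reference. The function only constructs the request and does not send it.

```python
import urllib.request
import uuid

# Assumed endpoint; confirm against Cloudmersive's API reference before use.
ADVANCED_SCAN_URL = "https://api.cloudmersive.com/virus/scan/file/advanced"


def build_scan_request(file_name, file_bytes, api_key):
    """Build (but do not send) a multipart/form-data scan request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        # "inputFile" is the assumed form field name for the uploaded file.
        ('Content-Disposition: form-data; name="inputFile"; '
         f'filename="{file_name}"\r\n').encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    headers = {
        "Apikey": api_key,  # assumed name of the authentication header
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(ADVANCED_SCAN_URL, data=body,
                                  headers=headers, method="POST")


if __name__ == "__main__":
    request = build_scan_request("page.html", b"<html></html>", "YOUR-API-KEY")
    print(request.get_method(), request.full_url)
    print(len(request.data), "bytes in body")
```

Sending the request would be a call to urllib.request.urlopen(request) followed by parsing the JSON response, whose fields are described in Cloudmersive’s documentation.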

Conclusion

While HTML appears simple at first glance, it hides a powerful, flexible structure that has long been a favorite tool of threat actors. From HTML smuggling to phishing campaigns, attackers have routinely exploited HTML features to bypass weakly configured security policies. Understanding how HTML works – and the ways it can be abused – is a critical part of building a secure file upload or file transfer workflow.

To learn more about scanning HTML files for threats with Cloudmersive’s Advanced Virus Scan API, please contact a member of our team.
