“Even if you do the redaction, supposedly correctly, even if you remove the text, there’s a lot of latent information that is dependent on the content that was redacted, and even that can leak information,” Levchenko says. “If you redact a name in a PDF, if the attacker has any context—they know this is an American—they will be able to, with high probability, either recover that name or narrow it down to a very small list of candidates.”
Edact-Ray focuses on the size of glyphs (broadly, characters or letters) and their positioning. “It’s pretty clear to a lot of people that the letter ‘L’ is skinnier than a letter ‘M,’ and that if you redacted just the letter ‘L,’ then you might be able to tell it is different from a redaction with just the letter ‘M,’” Bland says. The tool is essentially able to automatically compare the size of the redaction and the position of the letters with a predefined “dictionary” of words to estimate what has been replaced.
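The width comparison Bland describes can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not the team's tool: the glyph widths below are made-up numbers rather than real font metrics, the names `GLYPH_WIDTHS`, `text_width`, and `candidates_for_width` are hypothetical, and the sketch ignores the glyph-positioning information that the researchers also exploit.

```python
# Illustrative advance widths in arbitrary units.
# These are invented values for the sketch, NOT real metrics for any font.
GLYPH_WIDTHS = {
    "S": 460, "J": 320, "L": 420, "M": 855, "B": 545,
    "m": 800, "i": 230, "t": 335, "h": 525, "o": 525,
    "n": 525, "e": 500, "s": 390, "l": 230, "r": 350, "w": 715,
}
DEFAULT_WIDTH = 500  # assumed fallback for characters missing from the table


def text_width(word, widths=GLYPH_WIDTHS):
    """Sum the per-glyph widths to estimate how wide a word is when rendered."""
    return sum(widths.get(ch, DEFAULT_WIDTH) for ch in word)


def candidates_for_width(redaction_width, dictionary, tolerance=0):
    """Keep only the dictionary words whose width matches the redacted gap."""
    return [
        word for word in dictionary
        if abs(text_width(word) - redaction_width) <= tolerance
    ]


# Hypothetical dictionary of surnames an attacker might try.
surnames = ["Smith", "Jones", "Lee", "Miller", "Brown"]

# With an exact width measurement, only one candidate survives.
exact = candidates_for_width(2350, surnames)

# With a looser measurement, the list narrows but does not collapse to one name.
loose = candidates_for_width(2350, surnames, tolerance=50)
```

In this toy dictionary an exact measurement leaves a single surname, while a tolerance of 50 units leaves two, which mirrors the article's point that a redaction can either reveal a name outright or narrow it to a short list.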
The software is constructed by inferring how the original document was produced—for instance, in Microsoft Word—and then reverse engineering the specifics of the document. “That tells us about how the text was laid out,” Levchenko says. “Once we know that, we have a model for how that tool laid out the text and how and what information it deposited throughout the rest of the document.” From here, it is ultimately possible to simulate what the original text may have been and produce a series of potential, or likely, matches. During testing, the team was able to eliminate 80,000 guesses per second.
“We found, for example, that redacting a surname from a PDF generated by Microsoft Word set using 10-point Calibri leaves enough residual information to uniquely identify the name in 14 percent of all cases,” the team’s research paper concludes, adding that this is likely to be a “lower bound on the extent of vulnerable redactions.”
Daniel Lopresti, a professor of computer science at Lehigh University who has studied redaction techniques, says the research is impressive. It “presents a comprehensive study of redaction tools and the ways in which they can be broken, including exploiting nearly invisible aspects of a document typography,” says Lopresti, who was not involved with the research. “The picture it paints is scary; too often redaction is done badly.”
The vast majority of the organizations impacted by real-world redaction failures highlighted in the research (including the US Department of Justice, the US courts system, the Office of Inspector General, and Adobe) did not respond to WIRED’s request for comment. Bland and the research paper say that many of the organizations have engaged with the team’s research.
Microsoft did not address data being leaked from Word documents that are converted to PDFs. “Customers can save a document as a PDF, but it is the role of the redaction tool to censor or obscure information,” says Jeff Jones, a senior director at Microsoft. Jones adds that people should “review” data and their files before converting them to a format that is going to be shared.