Include support for tagged PDFs

While working on a feature to show outlines for documents without outlines, I found that the PDF format supports a standard way to attach semantics to the structure of the PDF (sections 14.6, 14.7 and 14.8 of the PDF spec). This could be used to improve text selection, searching and accessibility.

This is a complex feature, and probably not going to be resolved soon. However, we can incrementally add support for smaller features that are under the umbrella of tagged PDFs. I’m now developing the minimal internal data structures and parsers (NumTree, StructTree, StructElem) for the use case of extracting outlines from PDFs, which could be used as a basis for further improvements related to tagged PDFs.
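
For orientation, the lookup that a number tree parser has to support can be sketched as follows. This is a minimal TypeScript sketch and not the data structures from the actual patch: `NumTreeNode` and `lookupNumTree` are hypothetical names, and the node shape assumes the `/Limits`, `/Kids` and `/Nums` entries described in the PDF spec have already been parsed into plain objects.

```typescript
// Hypothetical in-memory shape of a PDF number tree node (not pdf.js classes).
interface NumTreeNode<T> {
  limits?: [number, number]; // /Limits: smallest and largest key in this subtree
  kids?: NumTreeNode<T>[]; // /Kids: child nodes (root or intermediate node)
  nums?: Array<[number, T]>; // /Nums: key/value pairs sorted by key (leaf node)
}

// Returns the value stored under `key`, or undefined if the key is absent.
function lookupNumTree<T>(root: NumTreeNode<T>, key: number): T | undefined {
  let node: NumTreeNode<T> | undefined = root;
  while (node !== undefined) {
    const nums = node.nums;
    if (nums !== undefined) {
      // Leaf: binary search over the sorted key/value pairs.
      let lo = 0;
      let hi = nums.length - 1;
      while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        const [k, value] = nums[mid];
        if (k === key) return value;
        if (k < key) lo = mid + 1;
        else hi = mid - 1;
      }
      return undefined;
    }
    // Intermediate node: descend into the kid whose limits cover the key.
    const kids: NumTreeNode<T>[] = node.kids ?? [];
    node = kids.find(
      kid => kid.limits !== undefined && kid.limits[0] <= key && key <= kid.limits[1]
    );
  }
  return undefined;
}
```

In a tagged PDF, the structure tree root's `/ParentTree` is such a number tree, so a lookup of this kind is what would take a page's or annotation's structural parent key to its structure elements.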

Relevant bugzilla bugs:

https://bugzilla.mozilla.org/show_bug.cgi?id=727819 “Make PDF.js accessible”
https://bugzilla.mozilla.org/show_bug.cgi?id=861157 “Support tagged PDFs in pdf.js”

External resources:

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf (section 14.8 Tagged PDF, but also 14.6 Marked Content and 14.7 Logical Structure)
http://www.aiim.org/Research-and-Publications/standards/committees/PDFUA/Technical-Implementation-Guide-32000-1 “PDF/UA Technical Implementation Guide: Understanding ISO 32000-1 (PDF 1.7)”

Issue Analytics

Created 8 years ago
Reactions: 1
Comments: 16 (6 by maintainers)

Top GitHub Comments

3 reactions

aardrian commented, Aug 27, 2020

Edge has touted native support for tagged PDFs. Chrome also now supports it, and has also touted its coming ability to export tagged PDFs from web pages.

Today, Firefox does not expose the tagging in PDFs to the accessibility tree / accessibility APIs. However, this text is on the features list for Firefox 80:

Firefox can now be set as the default system PDF viewer.

If a user who relies on AT does this, or a system administrator who does not know the make-up of users does this, it can be problematic for those users who otherwise relied on Edge, Chrome, or Adobe’s Reader to parse tagged PDFs for them.

I strongly suggest that advice be stricken from the release notes for 80, and that this bug priority be bumped up. I understand Mozilla is resource constrained now, but promoting an inaccessible feature that is better served in competing browsers is not a good look.

2 reactions

jcsteh commented, Oct 7, 2020

Thanks @trjohnst for your work on this.

I started manually rebasing @trjohnst’s branch on pdf.js master. This approach works well for tags which only need a single level; e.g. headings or images with alt text. When walking the content stream, the code looks up the structure element associated with each marked-content sequence it encounters and places the appropriate ARIA role on the text span in the HTML output by the pdf.js text layer.
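
To make the single-level case concrete, here is a minimal TypeScript sketch of mapping a structure type to ARIA attributes for a text span. It is not the mapping used in @trjohnst’s branch: `AriaAttrs` and `ariaForStructType` are hypothetical names, and only the standard structure types `H1` to `H6` and `Figure` (with its `/Alt` text) are covered.

```typescript
// Attributes that would be set on the span for a marked-content sequence.
interface AriaAttrs {
  role: string;
  "aria-level"?: string;
  "aria-label"?: string;
}

// Maps a standard structure type to ARIA attributes, or null if this sketch
// has no mapping for it. `alt` is the structure element's /Alt text, if any.
function ariaForStructType(structType: string, alt?: string): AriaAttrs | null {
  const heading = /^H([1-6])$/.exec(structType);
  if (heading) {
    // H1..H6 become headings with the matching level.
    return { role: "heading", "aria-level": heading[1] };
  }
  if (structType === "Figure") {
    // Figures become images; the alt text, when present, is the label.
    return alt !== undefined ? { role: "img", "aria-label": alt } : { role: "img" };
  }
  return null;
}
```

A flat lookup like this has nowhere to express parent/child relationships, which is why it stops working once lists or tables are involved.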

Unfortunately, this isn’t sufficient for anything that needs nested tags; e.g. lists or tables. I don’t think the approach can be extended to cover those, at least not without a lot of tricky edge cases. Furthermore, in order to properly support links and form fields (and note that form fields weren’t supported by pdf.js at the time of @trjohnst’s contribution), we need to be able to consider the annotation layer, not just the text layer. Thinking even further forward, it’d be good to be able to implement heuristics to try to detect (and correctly position) headings, links, tables, form fields, etc. in untagged PDFs.

Rather than trying to do this in the text layer, I think we’re going to need to walk the structure tree and render nodes based on that, setting ARIA properties on the elements we output. The structure tree can reference data in both the text and annotation layers. We can either reorder the text and annotation layer DOM nodes based on the structure tree (might be tricky without breaking the visual rendering?) or use aria-owns to reorder just the a11y tree without reordering the DOM.
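
The aria-owns variant could be sketched roughly like this in TypeScript. Everything here is an assumption for illustration: `StructNode`, `A11yElement` and `buildA11yElements` are hypothetical names, the structure tree is assumed to be already parsed, and each content leaf is assumed to carry the id of a node that the text or annotation layer has already rendered.

```typescript
// A parsed structure tree: elements with children, or leaves that point at
// an existing DOM node in the text or annotation layer by its id.
type StructNode =
  | { kind: "element"; role: string; children: StructNode[] }
  | { kind: "content"; targetId: string };

// One output element per structure element. `ariaOwns` is the space-separated
// id list that would be written to the element's aria-owns attribute.
interface A11yElement {
  id: string;
  role: string;
  ariaOwns: string;
}

function buildA11yElements(root: StructNode, prefix = "struct"): A11yElement[] {
  const out: A11yElement[] = [];
  let counter = 0;
  // Returns the id by which the parent should own this node.
  const visit = (node: StructNode): string => {
    if (node.kind === "content") {
      return node.targetId; // owned directly; the DOM node itself is not moved
    }
    const entry: A11yElement = { id: `${prefix}_${counter++}`, role: node.role, ariaOwns: "" };
    out.push(entry);
    entry.ariaOwns = node.children.map(visit).join(" ");
    return entry.id;
  };
  visit(root);
  return out;
}
```

Each returned entry would become an empty element in a separate structure layer, so nested roles such as list and listitem can be expressed without reordering the text layer DOM.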

Architecturally, this is tricky because the text and annotation layers are already rendered separately, and now we need to look at a third layer (or at least source of truth), the structure tree, which can move (or reference) nodes in both of the other layers. The simplest way to do this is probably to attach an id to every marked-content sequence (in the text layer) and link/form field (in the annotation layer). I see form fields already have a data attribute specifying an id. If we’re going to use aria-owns, we need to set the id attribute anyway, so this might feed two birds with one scone. The id would need to be something we can calculate from outside of the text and annotation layers, from within our new structure layer. When we’re handling the structure tree, we’d then output elements for the structure elements, moving/owning elements from the text/annotation layers based on their ids.
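
One possible shape for such an id scheme, as a TypeScript sketch: `ContentRef` and `domIdFor` are hypothetical names, and the id formats are made up for illustration. The page index is part of the marked-content id because an MCID only identifies a sequence within its own content stream; the annotation id is assumed to be whatever value the annotation layer already exposes in its data attribute.

```typescript
// What a structure tree leaf can point at: a marked-content sequence on a
// page, or an annotation (link or form field).
type ContentRef =
  | { kind: "markedContent"; pageIndex: number; mcid: number }
  | { kind: "annotation"; annotationId: string };

// Computes the DOM id purely from the reference, so the text layer, the
// annotation layer and the structure layer can all derive it independently.
function domIdFor(ref: ContentRef): string {
  switch (ref.kind) {
    case "markedContent":
      return `page${ref.pageIndex}_mc${ref.mcid}`;
    case "annotation":
      return `annot_${ref.annotationId}`;
  }
}
```

Because the function needs no state from either layer, the structure layer could emit aria-owns references before or after the other layers have rendered.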

Going beyond tagged PDF to heuristics, we’d need to be able to do things like: given a link or form field annotation, does its rectangle encompass something in the text layer? If it does, the annotation should be associated with its text (aria-owns or DOM move). Again, that’s architecturally tricky because the text and annotation layers (and their inputs) are separate and I don’t think we have any cached state from those layers we can use. However, we can potentially look at the bounds of the nodes rendered by the text and annotation layers, though that starts to blur the architectural boundaries between content and presentation processing.
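
The rectangle check itself is simple; a minimal TypeScript sketch follows. `Rect`, `TextItem`, `contains` and `textIdsInside` are hypothetical names, rectangles are assumed to be normalized to `[x1, y1, x2, y2]` in one shared coordinate space, and the tolerance value is an arbitrary assumption to absorb rounding differences between the two layers.

```typescript
// [x1, y1, x2, y2] with x1 <= x2 and y1 <= y2, in a shared coordinate space.
type Rect = [number, number, number, number];

interface TextItem {
  id: string; // id of the rendered text layer node
  rect: Rect; // its bounds
}

// True if `inner` lies within `outer`, allowing `tolerance` units of slack.
function contains(outer: Rect, inner: Rect, tolerance = 0): boolean {
  return (
    inner[0] >= outer[0] - tolerance &&
    inner[1] >= outer[1] - tolerance &&
    inner[2] <= outer[2] + tolerance &&
    inner[3] <= outer[3] + tolerance
  );
}

// Ids of the text items that an annotation's rectangle encompasses, in input order.
function textIdsInside(annotationRect: Rect, items: TextItem[], tolerance = 1): string[] {
  return items
    .filter(item => contains(annotationRect, item.rect, tolerance))
    .map(item => item.id);
}
```

The returned ids could then be attached to the annotation via aria-owns or a DOM move, as described above.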

While an initial implementation of tagged PDF doesn’t necessarily need to support heuristics, I’d strongly encourage this to be considered as part of the architectural design. The reality is that untagged PDFs are unfortunately very prevalent and it’d be sad to be locked into an architecture which doesn’t allow these to be made more accessible. (Note that Acrobat Reader, and to a much lesser extent Chromium, use heuristics to try to make untagged PDFs more accessible.)
