getTextContent() text items have wrong height

pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf

TL;DR: textContent.height is way off compared to the rendered PDF. I’m not sure if this is a bug, an invalid PDF file, or intended behaviour.

On some PDF files (see attached file for example) the textContent items seem to have a wrong value for the ‘height’ property.

Consider the following example, which is the text ‘Uw rekening’ just below the top-right logo:

{
  "str": "Uw rekening",
  "dir": "ltr",
  "width": 98.928,
  "height": 0.54,
  "transform": [
    18,
    0,
    0,
    18,
    441.81,
    708.4499999999999
  ],
  "fontName": "g_d9_f21"
}

Here the ‘height’ property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected. When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.

I traced the code backwards in the PDFJS source; this is what I found:

In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.

If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:

textAdvanceScale = Math.sqrt(ctm[0]*ctm[0] + ctm[1]*ctm[1]) * Math.sqrt(tlm[0]*tlm[0] + tlm[1]*tlm[1])

Where ctm is the current transformation matrix, and tlm is the text line matrix.
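
To make the effect of this formula concrete, here is a minimal sketch that evaluates it. The function name textAdvanceScale and the sample matrices are assumptions for illustration, not a copy of the PDFJS source: the ctm uses a uniform scale of 0.03 to match the value reported above, and the text line matrix is assumed to be unscaled.

```javascript
// Hypothetical helper that mirrors the formula quoted above.
// It is an illustration, not the actual PDFJS implementation.
function textAdvanceScale(ctm, tlm) {
  // Length of the first column vector of each matrix (the x-axis scale).
  const scaleCtmX = Math.hypot(ctm[0], ctm[1]);
  const scaleLineX = Math.hypot(tlm[0], tlm[1]);
  return scaleCtmX * scaleLineX;
}

// Assumed sample values: a ctm scaled by 0.03 and an unscaled text line matrix.
const ctm = [0.03, 0, 0, 0.03, 0, 0];
const tlm = [1, 0, 0, 1, 0, 0];

const scale = textAdvanceScale(ctm, tlm);
// A height of 18 multiplied by this scale comes out at roughly 0.54.
const reportedHeight = 18 * scale;
console.log(scale, reportedHeight);
```

With these inputs the scale is roughly 0.03, so an 18-unit height shrinks to about 0.54, which is the value seen in the text item.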

The text line matrix looks just fine, but (in the case of this PDF example) the ctm looks suspicious:

[
  0.03,
  0,
  0,
  0.03,
  0,
  0
]

Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].

But this is where my debugger threw in the towel as it crashes when trying to navigate through the massive 57k LOC PDFJS library.

When inspecting the PDF, I find this part:

q
0.03 0 0 0.03 0 0 cm
BT
/F9 600.00 Tf
0.89 0.00 0.10 rg
14727 23615 TD
(Uw rekening) Tj
*snip*
ET Q

So yes, the graphics state is modified right before the text portion, but that’s about where my knowledge of the PDF format ends. I don’t know if the ‘graphics state’ is supposed to influence text size?
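
As far as the PDF specification describes it, the matrix used to render text is the product of the font-size scaling, the text matrix, and the current transformation matrix, so a cm operator does affect the rendered text size. The sketch below reproduces the transform of the ‘Uw rekening’ item from the content stream above. The multiply helper and the variable names are assumptions made for illustration, and the font matrix assumes no horizontal scaling or text rise.

```javascript
// Multiply two PDF-style affine matrices [a, b, c, d, e, f].
// The result applies m2 first, then m1. Hypothetical helper for illustration.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// "0.03 0 0 0.03 0 0 cm" applied to an identity ctm.
const ctm = [0.03, 0, 0, 0.03, 0, 0];
// BT resets the text matrix to identity; "14727 23615 TD" then translates it.
const textMatrix = [1, 0, 0, 1, 14727, 23615];
// "/F9 600.00 Tf": font size 600, assuming no horizontal scaling or rise.
const fontMatrix = [600, 0, 0, 600, 0, 0];

const trm = multiply(ctm, multiply(textMatrix, fontMatrix));
// Components come out at roughly [18, 0, 0, 18, 441.81, 708.45].
console.log(trm);
```

Up to floating-point rounding, the result matches the ‘transform’ array of the text item, which suggests the 600-unit font size and the 0.03 scale are meant to combine into an 18-unit glyph height.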

So, in conclusion: I don’t know if this is a bug, an invalid PDF document or intended behaviour. But I do know that height 0.54 is not how the document is actually rendered.

To get the actual rendered height of a text item, can I safely assume that the ‘real’ height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3]) ?
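
The thread does not settle whether this assumption is safe in general, but the workaround itself is easy to sketch. The helper name heightFromTransform is hypothetical; it reads only the ‘transform’ array that getTextContent() items carry, as in the example item above.

```javascript
// Hypothetical workaround: derive the height from the item's transform
// instead of trusting its 'height' property.
function heightFromTransform(item) {
  const t = item.transform;
  // Length of the second column vector (c, d) of the matrix.
  return Math.hypot(t[2], t[3]);
}

// The text item quoted earlier in the issue.
const item = {
  str: "Uw rekening",
  dir: "ltr",
  width: 98.928,
  height: 0.54,
  transform: [18, 0, 0, 18, 441.81, 708.4499999999999],
  fontName: "g_d9_f21",
};

console.log(heightFromTransform(item));
```

For this item the helper yields 18 rather than the reported 0.54. For rotated or skewed text the value is the length of the transformed vertical axis, which may or may not be what a caller wants.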

Issue Analytics

Created 6 years ago
Comments: 14 (3 by maintainers)

Top GitHub Comments

10 reactions

aberkovsky commented, Dec 15, 2018

I found the problem in commit https://github.com/mozilla/pdf.js/commit/4537590033169915e68f6480e2463bc4b2175f78. Before this commit, the height was multiplied by textAdvanceScale only for vertical fonts; after it, the height is multiplied in all cases.

3 reactions

Saltallica commented, Jun 3, 2017

If it helps, this worked correctly in 1.6.210, which I have reverted back to.
