getTextContent() text items have wrong height

pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf

TL;DR: textContent.height is way off compared to the rendered PDF. I’m not sure if this is a bug, an invalid PDF file, or intended behaviour.

On some PDF files (see the attached file for an example), the textContent items seem to have a wrong value for the ‘height’ property.

Consider the following example, which is the text ‘Uw rekening’ just below the top-right logo:

{
  "str": "Uw rekening",
  "dir": "ltr",
  "width": 98.928,
  "height": 0.54,
  "transform": [
    18,
    0,
    0,
    18,
    441.81,
    708.4499999999999
  ],
  "fontName": "g_d9_f21"
}

Here the ‘height’ property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected. When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.

I traced the code backwards through the PDFJS source, and this is what I found:

In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.

If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:

textAdvanceScale = Math.sqrt(ctm[0]*ctm[0] + ctm[1]*ctm[1]) * Math.sqrt(tlm[0]*tlm[0] + tlm[1]*tlm[1])

Where ctm is the current transformation matrix, and tlm the text line matrix.

The text line matrix looks just fine, but (in the case of this example PDF) the ctm looks suspicious:

[
  0.03,
  0,
  0,
  0.03,
  0,
  0
]
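Plugging these values into the formula reproduces the reported height. This is a minimal sketch rather than the PDFJS source: the function name textAdvanceScale() is only illustrative, and the tlm values are an assumption (unit scale, no translation), since the issue only says the text line matrix "looks just fine".

```javascript
// Sketch of the scale computation described above; not the actual PDFJS code.
function textAdvanceScale(ctm, tlm) {
  const scaleCtmX = Math.hypot(ctm[0], ctm[1]);  // horizontal scale of the ctm
  const scaleLineX = Math.hypot(tlm[0], tlm[1]); // horizontal scale of the text line matrix
  return scaleCtmX * scaleLineX;
}

const ctm = [0.03, 0, 0, 0.03, 0, 0]; // value observed in the issue
const tlm = [1, 0, 0, 1, 0, 0];       // assumption: unit scale

const scale = textAdvanceScale(ctm, tlm);
const reportedHeight = 18 * scale;    // 18 is the height before scaling

console.log(scale, reportedHeight);   // roughly 0.03 and 0.54
```

With a unit-scale text line matrix, the whole factor of 0.03 comes from the ctm, which is why the height ends up at about 0.54 instead of 18.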

Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].

But this is where my debugger threw in the towel as it crashes when trying to navigate through the massive 57k LOC PDFJS library.

When inspecting the PDF, I find this part:

q
0.03 0 0 0.03 0 0 cm
BT
/F9 600.00 Tf
0.89 0.00 0.10 rg
14727 23615 TD
(Uw rekening) Tj
*snip*
ET Q

So yes, the graphics state is modified right before the text portion, but that’s about where my knowledge of the PDF format ends. I don’t know if the ‘graphics state’ is supposed to influence text size.
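For what it is worth, the PDF specification defines the text rendering matrix as the font-size matrix multiplied by the text matrix and then by the current transformation matrix, so a cm operator does scale text. The sketch below applies that rule to the stream above. It assumes the ctm is the identity before the q, that TD starts from an identity text matrix, and that horizontal scaling and text rise are at their defaults; multiply() is a hypothetical helper, not a PDFJS function.

```javascript
// PDF matrices [a, b, c, d, e, f] use row vectors: (m1 x m2) applies m1 first, then m2.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[1] * m2[2],
    m1[0] * m2[1] + m1[1] * m2[3],
    m1[2] * m2[0] + m1[3] * m2[2],
    m1[2] * m2[1] + m1[3] * m2[3],
    m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
    m1[4] * m2[1] + m1[5] * m2[3] + m2[5],
  ];
}

const fontSizeMatrix = [600, 0, 0, 600, 0, 0];   // "/F9 600.00 Tf"
const textMatrix = [1, 0, 0, 1, 14727, 23615];   // "14727 23615 TD" from identity (assumption)
const cmMatrix = [0.03, 0, 0, 0.03, 0, 0];       // "0.03 0 0 0.03 0 0 cm"

// Text rendering matrix = font size matrix x text matrix x ctm
const rendered = multiply(multiply(fontSizeMatrix, textMatrix), cmMatrix);

console.log(rendered); // close to [18, 0, 0, 18, 441.81, 708.45]
```

Up to floating-point rounding, the result matches the "transform" array of the ‘Uw rekening’ item: the 600-unit font is scaled down by 0.03 to 18, and the position 14727, 23615 becomes 441.81, 708.45.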

So, in conclusion: I don’t know if this is a bug, an invalid PDF document or an intended behaviour. But I do know that height 0.54 is not how the document is actually rendered.

To get the actual rendered height of a text item, can I safely assume that the ‘real’ height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3]) ?
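The workaround proposed in this question can be sketched as a small helper. Whether it is safe for every PDF is not confirmed in this thread; renderedHeight() is a hypothetical name, and the item is the one shown earlier in this issue.

```javascript
// Derive the height from the item's transform instead of trusting item.height.
// transform is [a, b, c, d, e, f]; c and d describe the vertical axis of the glyph space.
function renderedHeight(item) {
  const [, , c, d] = item.transform;
  return Math.hypot(c, d);
}

const item = {
  str: "Uw rekening",
  dir: "ltr",
  width: 98.928,
  height: 0.54,
  transform: [18, 0, 0, 18, 441.81, 708.4499999999999],
  fontName: "g_d9_f21",
};

const height = renderedHeight(item);
console.log(item.height, height); // reported height vs. height derived from the transform
```

For rotated text, c and d are both non-zero, which is why the helper takes the length of the (c, d) vector rather than reading t[3] alone.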

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 14 (3 by maintainers)

Top GitHub Comments

10 reactions
aberkovsky commented, Dec 15, 2018

I found the problem in commit https://github.com/mozilla/pdf.js/commit/4537590033169915e68f6480e2463bc4b2175f78. Before this commit, the height was multiplied by textAdvanceScale only for vertical fonts; after it, the height is multiplied in all cases.

3 reactions
Saltallica commented, Jun 3, 2017

If it helps, this worked correctly in 1.6.210, which I have reverted back to.
