getTextContent() text items have wrong height
pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf
TL;DR: textContent.height is way off compared to rendered PDF, I’m not sure if this is a bug, an invalid PDF file or if this is intended behaviour.
On some PDF files (see the attached file for an example), the textContent items seem to have a wrong value for the ‘height’ property.
Consider the following example, which is the text ‘Uw rekening’ just below the top-right logo:
{
"str": "Uw rekening",
"dir": "ltr",
"width": 98.928,
"height": 0.54,
"transform": [
18,
0,
0,
18,
441.81,
708.4499999999999
],
"fontName": "g_d9_f21"
}
Here the ‘height’ property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected (t being the item's transform).
When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.
I traced the code backwards in the PDFJS source, and this is what I found:
In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.
If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:
textAdvanceScale = Math.sqrt(ctm[0]*ctm[0] + ctm[1]*ctm[1]) * Math.sqrt(tlm[0]*tlm[0] + tlm[1]*tlm[1])
Where ctm is the current transformation matrix, and tlm the text line matrix.
The text line matrix looks just fine, but (in case of this PDF example), the ctm seems very unlikely:
[
0.03,
0,
0,
0.03,
0,
0
]
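Plugging this ctm into the formula reproduces the 0.54 value. The snippet below is a standalone sketch, not the pdf.js code itself: the helper name textAdvanceScale is made up for illustration, and the text line matrix is assumed to carry no scaling, since its actual value is not listed here.

```javascript
// Standalone sketch of the scale calculation described above (not pdf.js code).
// Matrices use the PDF [a, b, c, d, e, f] layout.
function textAdvanceScale(ctm, tlm) {
  const scaleCtmX = Math.hypot(ctm[0], ctm[1]);
  const scaleLineX = Math.hypot(tlm[0], tlm[1]);
  return scaleCtmX * scaleLineX;
}

const ctm = [0.03, 0, 0, 0.03, 0, 0]; // the ctm observed for the test PDF
const tlm = [1, 0, 0, 1, 0, 0];       // ASSUMPTION: unscaled text line matrix

const scale = textAdvanceScale(ctm, tlm);
const reportedHeight = 18 * scale;    // 18 is the height before the multiplication

console.log(scale, reportedHeight);
```

Under that assumption the scale comes out at about 0.03 and the height at about 0.54, which is the value reported for the text item.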
Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].
But this is where my debugger threw in the towel as it crashes when trying to navigate through the massive 57k LOC PDFJS library.
When inspecting the PDF, I find this part:
q
0.03 0 0 0.03 0 0 cm
BT
/F9 600.00 Tf
0.89 0.00 0.10 rg
14727 23615 TD
(Uw rekening) Tj
*snip*
ET Q
So yes, the graphics state is modified right before the text portion, but that’s about where my knowledge of the PDF format ends. I don’t know whether the ‘graphics state’ is supposed to influence text size.
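The numbers in the content stream do line up with the reported transform if the cm matrix is applied to the text. The sketch below multiplies the matrices by hand. Here multiply is a local helper written for this example, the text matrix is assumed to be a plain translation by the TD operands, and the font-size matrix is assumed to come only from the 600 in the Tf operator (no rise, no horizontal scaling).

```javascript
// Multiply two PDF-style matrices [a, b, c, d, e, f]; m2 is applied first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

const ctm = [0.03, 0, 0, 0.03, 0, 0];          // 0.03 0 0 0.03 0 0 cm
const textMatrix = [1, 0, 0, 1, 14727, 23615]; // ASSUMPTION: TD only translates
const fontMatrix = [600, 0, 0, 600, 0, 0];     // ASSUMPTION: from /F9 600.00 Tf

const trm = multiply(ctm, multiply(textMatrix, fontMatrix));
const height = Math.hypot(trm[2], trm[3]);

console.log(trm, height);
```

Under these assumptions the result is approximately [18, 0, 0, 18, 441.81, 708.45], which matches the transform of the ‘Uw rekening’ item, so the 0.03 scale appears to be included in the transform already.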
So, in conclusion: I don’t know if this is a bug, an invalid PDF document or an intended behaviour. But I do know that height 0.54 is not how the document is actually rendered.
To get the actual rendered height of a text item, can I safely assume that the ‘real’ height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3])?
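As a workaround, that calculation could be wrapped in a small helper. This is only a sketch: heightFromTransform is a hypothetical name, not part of the pdf.js API, and whether the result is correct for rotated or vertical text is not confirmed here.

```javascript
// Hypothetical helper: derive a text item's height from its transform
// instead of trusting the reported `height` property.
function heightFromTransform(item) {
  const t = item.transform;
  return Math.hypot(t[2], t[3]); // same as Math.sqrt(t[2]*t[2] + t[3]*t[3])
}

// The text item from the example above.
const item = {
  str: "Uw rekening",
  width: 98.928,
  height: 0.54,
  transform: [18, 0, 0, 18, 441.81, 708.4499999999999],
};

console.log(heightFromTransform(item));
```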
Top GitHub Comments
I found the problem in commit https://github.com/mozilla/pdf.js/commit/4537590033169915e68f6480e2463bc4b2175f78. Before this commit, the height was multiplied by textAdvanceScale only for vertical fonts; after it, the height is multiplied in all cases.
If it helps, this worked correctly in 1.6.210, which I have reverted back to.