getTextContent doesn't always return the right fontRef
Attach (recommended) or Link to PDF file here: test8.pdf
Configuration:
- Web browser and its version: Chrome Version 98.0.4758.102 (64 bits)
- Operating system and its version: Windows 10 (64bits)
- PDF.js version: v2.13.216
- Is a browser extension: No
Steps to reproduce the problem:
- getTextContent on the only page of the document
- check the fontRef of the item that contains “ou une liturgie”
let textContent = await page.getTextContent();
textContent.items.forEach((item) => {
  if (item.str.includes('ou une liturgie')) console.log(item);
});
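For anyone reproducing this without the PDF at hand, here is a minimal sketch of the same check as a pure function. It assumes that what this report calls the fontRef is the item's `fontName` property; `findItemsContaining` and the mock data are hypothetical and only shaped like the item described in this report.

```javascript
// Hypothetical helper (not part of pdf.js): collect the text items whose
// string contains `needle`, keeping only the fields relevant to this report.
function findItemsContaining(textContent, needle) {
  return textContent.items
    // Skip any item that has no `str` so that includes() cannot throw.
    .filter((item) => typeof item.str === "string" && item.str.includes(needle))
    .map((item) => ({ str: item.str, fontName: item.fontName }));
}

// Mock data shaped like the output observed in this report (assumption).
const mockTextContent = {
  items: [
    { str: ") ou une liturgie de clairvoyance (par exemple", fontName: "g_d0_f4" },
    { str: "unrelated text", fontName: "g_d0_f3" },
  ],
};

const matches = findItemsContaining(mockTextContent, "ou une liturgie");
console.log(matches);
```

With a real document, the object returned by `page.getTextContent()` would be passed in place of `mockTextContent`.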
What is the expected behavior?
In this particular example, using the Chrome debugger, I can see that the fontRef used to render the glyphs of (part of) this item is d_d0_f3.
What went wrong?
The item that gets logged has a fontRef of g_d0_f4.
I think I isolated the problem to this particular pattern of OPS (ignoring irrelevant OPS here):
- setFont d_d0_f3
- [...do stuff, showText...]
- save
- setFont d_d0_f4
- showText ") "
- restore
- showText "ou une liturgie de clairvoyance (par exemple"
In my opinion this should return two different items with different fonts (or is this PDF file broken in some way? It renders fine, though). What I get instead is a single item with the str value ") ou une liturgie de clairvoyance (par exemple" and the fontRef g_d0_f4.
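To make the expectation concrete, the OPS sequence listed above can be replayed in a toy model. This is only a sketch, not pdf.js code: it assumes that the font is part of the state captured by save and put back by restore, and the labels "F3" and "F4" are placeholders for the two fonts in this report.

```javascript
// Toy model, not pdf.js code: replay a list of OPS with a save/restore
// stack and record which font is active at each showText.
function fontsPerShowText(ops) {
  let state = { font: null };
  const stack = [];
  const shown = [];
  for (const [op, arg] of ops) {
    switch (op) {
      case "setFont":
        state.font = arg;
        break;
      case "save":
        stack.push({ ...state }); // snapshot the current state
        break;
      case "restore":
        if (stack.length > 0) state = stack.pop(); // back to the snapshot
        break;
      case "showText":
        shown.push({ str: arg, font: state.font });
        break;
    }
  }
  return shown;
}

// "F3" and "F4" are placeholder labels for the two fonts in the report.
const reportedOps = [
  ["setFont", "F3"],
  ["save"],
  ["setFont", "F4"],
  ["showText", ") "],
  ["restore"],
  ["showText", "ou une liturgie de clairvoyance (par exemple"],
];

const shown = fontsPerShowText(reportedOps);
console.log(shown);
```

In this model the first showText runs with F4 and the second with F3 again, which is why two items with different fonts seem like the expected result.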
I found a fix, but I’m not sure the way I did it is very elegant. In the function buildTextContentItem in the file src/core/evaluator.js:2633, I added at the top:
if (textContentItem.initialized && textContentItem.fontName != textState.font.loadedName) {
  flushTextContentItem();
}
With this, I get two separate items, each with its own correct fontRef.
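The effect of that check can be illustrated with a toy model of item merging. This is a sketch under assumptions, not the pdf.js evaluator: `buildItems` and its `flushOnFontChange` option are hypothetical, and "F3" and "F4" are placeholder labels for the two fonts.

```javascript
// Toy model, not the pdf.js evaluator: merge consecutive showText chunks
// into items, optionally flushing when the active font differs from the
// font recorded on the pending item.
function buildItems(chunks, { flushOnFontChange }) {
  const items = [];
  let current = null;
  const flush = () => {
    if (current) {
      items.push(current);
      current = null;
    }
  };
  for (const { str, font } of chunks) {
    if (flushOnFontChange && current && current.fontName !== font) {
      flush(); // plays the role of the check described above
    }
    if (!current) {
      current = { str: "", fontName: font }; // font is fixed at item creation
    }
    current.str += str;
  }
  flush();
  return items;
}

// The two chunks from the report, with the font active at each showText.
const chunks = [
  { str: ") ", font: "F4" },
  { str: "ou une liturgie de clairvoyance (par exemple", font: "F3" },
];

const merged = buildItems(chunks, { flushOnFontChange: false });
const split = buildItems(chunks, { flushOnFontChange: true });
console.log(merged.length, split.length); // 1 2
```

Without the check, the model yields one item that keeps the font of its first chunk (F4), matching the symptom in this report. With the check, it yields two items with F4 and F3.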
I think handling the restore case in the switch (same file, line 2861) would be a lot cleaner, but I don’t know enough of the specifics of this library to be sure. Is it valid to always call flushTextContentItem() when reading a restore OPS?
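To explore that alternative, here is a toy model that flushes the pending item on every restore. Again this is only a sketch, not the pdf.js switch: `buildItemsFlushingOnRestore` is hypothetical, it tracks nothing but the font, and "F3" and "F4" are placeholder labels.

```javascript
// Toy model, not the pdf.js evaluator: flush the pending item whenever a
// restore is read, then put back the font saved by the matching save.
function buildItemsFlushingOnRestore(ops) {
  const items = [];
  const stack = [];
  let font = null;
  let current = null;
  const flush = () => {
    if (current) {
      items.push(current);
      current = null;
    }
  };
  for (const [op, arg] of ops) {
    if (op === "setFont") {
      font = arg;
    } else if (op === "save") {
      stack.push(font);
    } else if (op === "restore") {
      flush(); // unconditional flush on restore
      if (stack.length > 0) font = stack.pop();
    } else if (op === "showText") {
      if (!current) current = { str: "", fontName: font };
      current.str += arg;
    }
  }
  flush();
  return items;
}

// Pattern from the report: the font changes inside save/restore.
const fromReport = buildItemsFlushingOnRestore([
  ["setFont", "F3"], ["save"], ["setFont", "F4"], ["showText", ") "],
  ["restore"], ["showText", "ou une liturgie"],
]);

// Counter-case: save/restore around nothing that changes the font.
const sameFont = buildItemsFlushingOnRestore([
  ["setFont", "F3"], ["showText", "a"], ["save"], ["restore"], ["showText", "b"],
]);
console.log(fromReport.length, sameFont.length); // 2 2
```

In this model the unconditional flush fixes the reported pattern, but it also splits "a" and "b" even though both use F3. Whether that extra splitting matters in practice is something I can't judge from the outside.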
Anyway, thanks a lot for your time !
Top GitHub Comments
Ok here are my results: Possible wanted result ✅
pdfjs-2.13.216 ❌ (reason of this post)
pdfjs-2.12.313 ❌ (not exactly the same results, the item’s string is even longer)
pdfjs-2.11.338 ✅ (the text does not split in the same way as the last release but there’s no obvious problem in the fontRefs)
Do we know which PR caused this?