getTextContent doesn't always return the right fontRef

Attach (recommended) or Link to PDF file here: test8.pdf

Configuration:

  • Web browser and its version: Chrome Version 98.0.4758.102 (64 bits)
  • Operating system and its version: Windows 10 (64bits)
  • PDF.js version: v2.13.216
  • Is a browser extension: No

Steps to reproduce the problem:

  1. getTextContent on the only page of the document
  2. check the fontRef of the item that contains “ou une liturgie”

const textContent = await page.getTextContent();
textContent.items.forEach((item) => {
  // Log the item whose text contains the phrase of interest.
  if (item.str.includes('ou une liturgie')) console.log(item);
});
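
A slightly fuller sketch of the same check, in case it helps with reproducing. It only keeps the two fields that matter here (the text and the font name reported on the item). `findTextItems` is a hypothetical helper name, not part of PDF.js, and the usage comment assumes `page` is an already loaded page object.

```javascript
// Hypothetical helper (not a PDF.js API): keep only the text items that
// contain `needle`, and report the font name attached to each of them.
function findTextItems(items, needle) {
  return items
    // Skip anything without a string `str` (defensive; assumption that
    // non-text entries may be present in the list).
    .filter((item) => typeof item.str === "string" && item.str.includes(needle))
    .map((item) => ({ str: item.str, fontName: item.fontName }));
}

// Usage against a real page (assumes `page` is already loaded):
//   const { items } = await page.getTextContent();
//   console.log(findTextItems(items, "ou une liturgie"));
```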

What is the expected behavior? In this particular example, the Chrome debugger shows that the fontRef used to render the glyphs of (part of) this item is g_d0_f3.

What went wrong? The item that gets logged has fontRef g_d0_f4.

I think I have isolated the problem to this particular pattern of OPS (irrelevant OPS omitted):

- setFont g_d0_f3
- [...do stuff, showText...]
- save
- setFont g_d0_f4
- showText ") "
- restore
- showText "ou une liturgie de clairvoyance (par exemple"

In my opinion this should return two different items with different fonts (or is this PDF file broken in some way? It renders fine, though). What I get instead is a single item with the str value ") ou une liturgie de clairvoyance (par exemple" and fontRef g_d0_f4.


I found a fix, but I’m not sure the way I did it is very elegant. At the top of the function buildTextContentItem in src/core/evaluator.js:2633, I added:

if (textContentItem.initialized && textContentItem.fontName != textState.font.loadedName) {
  flushTextContentItem();
}

With this, I get two separate items, each with its own correct fontRef.

I think handling the restore case in the switch (same file, line 2861) would be a lot cleaner, but I don’t know enough about the specifics of this library to be sure. Is it valid to always call flushTextContentItem() when reading a restore OPS?
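
To make the pattern and the two possible fixes easier to discuss, here is a toy model of the operator sequence. It is not the PDF.js evaluator code, just a minimal sketch under my own assumptions: `extractItems`, `flushOnFontMismatch` and `flushOnRestore` are hypothetical names, and the model assumes that an explicit setFont always starts a new item while restore does not.

```javascript
// Toy model of text-item accumulation; NOT the actual PDF.js evaluator.
// All names and options here are hypothetical and for illustration only.
function extractItems(ops, { flushOnFontMismatch = false, flushOnRestore = false } = {}) {
  const items = [];
  const stack = [];           // text states pushed by "save"
  let state = { font: null }; // current text state
  let current = null;         // text item being accumulated

  const flush = () => {
    if (current) {
      items.push(current);
      current = null;
    }
  };

  for (const [op, arg] of ops) {
    switch (op) {
      case "setFont":
        flush(); // assumption: an explicit font change starts a new item
        state.font = arg;
        break;
      case "save":
        stack.push({ ...state });
        break;
      case "restore":
        // The font can change here without any setFont being seen.
        if (flushOnRestore) flush();
        if (stack.length > 0) state = stack.pop();
        break;
      case "showText":
        // Equivalent of the font check added to buildTextContentItem above.
        if (flushOnFontMismatch && current && current.fontName !== state.font) {
          flush();
        }
        if (!current) current = { fontName: state.font, str: "" };
        current.str += arg;
        break;
    }
  }
  flush();
  return items;
}

const ops = [
  ["setFont", "g_d0_f3"],
  ["save"],
  ["setFont", "g_d0_f4"],
  ["showText", ") "],
  ["restore"],
  ["showText", "ou une liturgie de clairvoyance (par exemple"],
];

console.log(extractItems(ops));                                // one merged item, font g_d0_f4
console.log(extractItems(ops, { flushOnFontMismatch: true })); // two items: g_d0_f4, then g_d0_f3
console.log(extractItems(ops, { flushOnRestore: true }));      // two items: g_d0_f4, then g_d0_f3
```

In this model both options give the same result for the sequence above. The difference is that flushing on every restore would also split items when the restored font is the same as the current one, whereas the font comparison only splits when the font actually changed.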

Anyway, thanks a lot for your time!

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

karlak commented, Apr 6, 2022

OK, here are my results:

Possible wanted result ✅

[...,
{fontName: 'g_d0_f4', str: 'odem'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ')'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.13.216 ❌ (the reason for this post)

[...,
{fontName: 'g_d0_f4', str: 'odem'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ') ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.12.313 ❌ (not exactly the same results, the item’s string is even longer)

[...,
{fontName: 'g_d0_f4', str: 'odem) ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.11.338 ✅ (the text does not split in the same way as the last release but there’s no obvious problem in the fontRefs)

[...,
{fontName: 'g_d0_f4', str: ''},
{fontName: 'g_d0_f4', str: 'odem) '},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou  une  liturgie  de  clairvoyance  (par  exemple  '},
{fontName: 'g_d0_f3', str: ''},
{fontName: 'g_d0_f3', str: ' '},
...]

marco-c commented, Apr 19, 2022

Do we know which PR caused this?
