Retrieve bounding box of text on a page


I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of its div children (which represent rows of text). That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

Issue Analytics

  • State: closed
  • Created: 9 years ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

19 reactions
jmlsf commented, Aug 16, 2016

Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn’t make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don’t know why you’d ever want to do canvas operations in those units. But that’s beside the point.)

The numbers that matter are:

const item = textContent.items[0];
const transform = item.transform;
const x = transform[4];
const y = transform[5];
const width = item.width;
const height = item.height;

In order to do any operations on the canvas using these values, you have to (1) convert the y coordinate from the PDF origin (bottom left) to the canvas origin (top left) and (2) scale the whole thing by whatever you’ve scaled the viewport (i.e. by whatever you passed to getViewport). So I do this:

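// Method on the object that owns the canvas; this.scale is the value passed to getViewport.
convertToCanvasCoords([x, y, width, height]) {
  const { scale } = this;
  // strokeRect wants the top-left corner, so add the height before flipping y.
  return [x * scale, this.canvas.height - ((y + height) * scale), width * scale, height * scale];
}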

Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:

ctx.strokeRect(...this.convertToCanvasCoords([x, y, width, height]));

Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There’s probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what’s going on. 😃 Hope that helps.
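
To get back to the original question (one bounding box for all text on a page), the per-item rectangles can be merged into a single rectangle. The following is only a sketch under a few assumptions: each item carries the transform, width and height fields shown above, the text is not rotated, and transform[5] is treated as the bottom edge of the item. The names textBoundingBox and toCanvasRect are hypothetical helpers, not part of pdf.js.

```javascript
// Hypothetical helpers, not part of pdf.js.
// Assumes unrotated text and items shaped like the entries of textContent.items.

// Union of all item rectangles, in PDF units (origin at the bottom left).
function textBoundingBox(items) {
  let left = Infinity;
  let bottom = Infinity;
  let right = -Infinity;
  let top = -Infinity;
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    left = Math.min(left, x);
    bottom = Math.min(bottom, y);
    right = Math.max(right, x + item.width);
    top = Math.max(top, y + item.height);
  }
  if (left === Infinity) return null; // no text items on the page
  return { x: left, y: bottom, width: right - left, height: top - bottom };
}

// Same conversion as convertToCanvasCoords, written as a pure function.
// scale is the value passed to getViewport; canvasHeight is canvas.height.
function toCanvasRect(box, scale, canvasHeight) {
  return [
    box.x * scale,
    canvasHeight - (box.y + box.height) * scale,
    box.width * scale,
    box.height * scale,
  ];
}
```

With a box in hand, something like ctx.strokeRect(...toCanvasRect(box, scale, canvas.height)) should outline all the text at once. If the page origin is at (0, 0), box.x and box.y can be read as the left and bottom margins in PDF units.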

12 reactions
knowtheory commented, May 28, 2019

For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two dimensional coordinate space.

Specifically, the components of a transformation matrix are described on page 142:

  • Translations are specified as [ 1 0 0 1 tx ty ], where tx and ty are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.
  • Scaling is obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.
  • Rotations are produced by [cos θ sin θ −sin θ cos θ 0 0], which has the effect of rotating the coordinate system axes by an angle θ counterclockwise.
  • Skew is specified by [1 tan α tan β 1 0 0], which skews the x axis by an angle α and the y axis by an angle β.

(there’s an accompanying chart in the reference as well)
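
The four building blocks listed above can be written out directly as six-element arrays. This is a small illustrative sketch; the function names are made up here and are not pdf.js APIs. The rotationOf helper additionally assumes the matrix is a rotation combined with a uniform positive scale and no skew.

```javascript
// Illustrative constructors for the six-element form [a, b, c, d, e, f].
// Angles are in radians.
const translate = (tx, ty) => [1, 0, 0, 1, tx, ty];
const scale = (sx, sy) => [sx, 0, 0, sy, 0, 0];
const rotate = (theta) => [
  Math.cos(theta),
  Math.sin(theta),
  -Math.sin(theta),
  Math.cos(theta),
  0,
  0,
];
const skew = (alpha, beta) => [1, Math.tan(alpha), Math.tan(beta), 1, 0, 0];

// Recover the counterclockwise rotation angle from a matrix.
// Assumes rotation plus uniform positive scale, with no skew.
const rotationOf = (m) => Math.atan2(m[1], m[0]);
```

Under these assumptions, an unrotated text item such as [12, 0, 0, 12, 72, 700] gives a rotation of 0, which is the case the coordinate conversion in the earlier comment relies on.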

And the vector itself is defined thus:

PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below. The transformation between two coordinate systems is represented by a 3-by-3 transformation matrix written as

[ed: pretend this is a matrix]

a b 0
c d 0
e f 1

Because a transformation matrix has only six elements that can be changed, it is usually specified in PDF as the six-element array [a b c d e f].
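
To make the row-vector convention concrete, here is a sketch of applying a six-element array to a point and of combining two arrays. Multiplying [x y 1] by the 3-by-3 matrix above gives x' = a·x + c·y + e and y' = b·x + d·y + f. The names applyToPoint and concat are hypothetical, chosen for this example only.

```javascript
// Apply [a, b, c, d, e, f] to the point (x, y), i.e. [x y 1] times the matrix.
function applyToPoint([a, b, c, d, e, f], x, y) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Combine two matrices so that `first` is applied before `second`
// (the 3-by-3 product first × second, in the row-vector convention).
function concat(first, second) {
  const [a1, b1, c1, d1, e1, f1] = first;
  const [a2, b2, c2, d2, e2, f2] = second;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

// A scale by the font size followed by a translation to the text position
// yields the kind of array seen in a text item's transform.
const textMatrix = concat([12, 0, 0, 12, 0, 0], [1, 0, 0, 1, 72, 700]);
```

Here textMatrix works out to [12, 0, 0, 12, 72, 700], which matches the shape described in the first comment: the font height in positions a and d, and the text position in e and f.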
