Retrieve bounding box of text on a page
I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of its div children (which represent rows of text).
That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?
Issue Analytics
- State:
- Created 9 years ago
- Comments:16 (12 by maintainers)
Top GitHub Comments
Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn’t make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don’t know why you’d ever want to do canvas operations in those units. But that’s beside the point.)
The numbers that matter are:
In order to do any operations on the canvas using these values, you have to (1) fix the y coordinate from the PDF origin to the canvas origin and (2) scale the whole thing by whatever you’ve scaled the viewport (i.e. by whatever you passed getViewport). So I do this:
Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:
Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner, and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There’s probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what’s going on. 😃 Hope that helps.
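The steps described in this comment (add the height to get the top, scale, then fix for origin) can be sketched roughly as below. This is not the commenter's code. It assumes an unrotated page whose PDF origin is the bottom-left corner, and plain objects shaped like pdf.js text items, with the position taken from transform[4] and transform[5]. The names itemToCanvasRect and unionRects and the sample values are hypothetical. The union at the end addresses the original question about the bounding box of all text on a page.

```javascript
// Sketch only: assumes an unrotated page with the PDF origin at the
// bottom-left, and items shaped like pdf.js textContent.items entries.

// Convert one text item to a canvas-space rectangle (top-left origin).
function itemToCanvasRect(item, pageHeightPts, scale) {
  const x = item.transform[4];   // left edge in PDF units
  const y = item.transform[5];   // bottom edge of the box in PDF units
  const top = y + item.height;   // add the height to get the top edge
  return {
    x: x * scale,
    y: (pageHeightPts - top) * scale, // flip y: canvas origin is top-left
    width: item.width * scale,
    height: item.height * scale,
  };
}

// Union of many rectangles: the bounding box of all text on the page.
function unionRects(rects) {
  if (rects.length === 0) return null;
  let left = Infinity, top = Infinity, right = -Infinity, bottom = -Infinity;
  for (const r of rects) {
    left = Math.min(left, r.x);
    top = Math.min(top, r.y);
    right = Math.max(right, r.x + r.width);
    bottom = Math.max(bottom, r.y + r.height);
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Hypothetical sample data standing in for textContent.items.
const items = [
  { transform: [12, 0, 0, 12, 72, 700], width: 200, height: 12 },
  { transform: [12, 0, 0, 12, 72, 100], width: 150, height: 12 },
];
const pageHeightPts = 792; // assumed US Letter page height in points
const scale = 2;           // the same number passed to getViewport

const rects = items.map((it) => itemToCanvasRect(it, pageHeightPts, scale));
const box = unionRects(rects);
```

With a real page, the items would come from page.getTextContent(), and each rectangle could be drawn with ctx.strokeRect(r.x, r.y, r.width, r.height). The distances from box to the canvas edges then give the text margins in canvas pixels. Rotated pages or rotated text would need the full transform rather than just its translation part.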
For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two-dimensional coordinate space. Specifically, the components of a transformation matrix are described on page 142:
(there’s an accompanying chart in the reference as well)
And the vector itself is defined thus:
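As a rough illustration of how such a vector is used, the sketch below assumes the standard PDF convention of six numbers [a, b, c, d, e, f], where a point (x, y) maps to (a·x + c·y + e, b·x + d·y + f). The function name applyMatrix and the sample matrix are hypothetical, not part of pdf.js.

```javascript
// Sketch only: assumes the PDF six-number matrix layout [a, b, c, d, e, f].
// a and d scale, b and c rotate or skew, e and f translate.
function applyMatrix([a, b, c, d, e, f], [x, y]) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Sample matrix: scale by 12 in both directions, then translate to (72, 700).
const m = [12, 0, 0, 12, 72, 700];

const origin = applyMatrix(m, [0, 0]); // where the local origin lands
const unit = applyMatrix(m, [1, 1]);   // where the local point (1, 1) lands
```

Here the last two entries are the translation, which is why the position of a text item can be read directly from transform[4] and transform[5] when the text is not rotated.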