Apply detect() on readable PDF files

Hi there, from the docs I infer that detect() operates on, for example, PIL.Image objects. Is there a way to operate directly on PDF files that are already readable (which would also obviate the need to apply OCR)? Greetings

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

2 reactions
gevezex commented, Jul 14, 2021

Solved it like this with PyMuPDF (pip install pymupdf). I hope it helps someone with the same issue. See also the PyMuPDF utility for retrieving text from within given box coordinates.


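# Rescale a TextBlock's coordinates from image pixels to PDF points (72 per inch).
# The default scale of 72/200 presumably matches a page image rendered at 200 DPI.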
def scale_xy(textblock, scale=72/200):
    x1 = textblock.block.x_1 * scale
    y1 = textblock.block.y_1 * scale
    x2 = textblock.block.x_2 * scale
    y2 = textblock.block.y_2 * scale
    return (x1,y1,x2,y2)

# Using PyMuPdf for retrieving text in a bounding box
import fitz  # this is pymupdf

# Function for retrieving the tokens (words). See pymupdf utilities
def make_text(words):
    """Return textstring output of get_text("words").
    Word items are sorted for reading sequence left to right,
    top to bottom.
    """
    line_dict = {}  # key: vertical coordinate, value: list of words
    words.sort(key=lambda w: w[0])  # sort by horizontal coordinate
    for w in words:  # fill the line dictionary
        y1 = round(w[3], 1)  # bottom of a word: don't be too picky!
        word = w[4]  # the text of the word
        line = line_dict.get(y1, [])  # read current line content
        line.append(word)  # append new word
        line_dict[y1] = line  # write back to dict
    lines = list(line_dict.items())
    lines.sort()  # sort vertically
    return "\n".join([" ".join(line[1]) for line in lines])

# Open your pdf in pymupdf
pdf_doc = fitz.open('/location/to/your/file.pdf')
pdf_page4 = pdf_doc[3]  # pages are 0-indexed, so this retrieves page 4
words = pdf_page4.get_text("words")

# Take one of the TextBlocks detected by your LayoutParser model.
# The docs call the detection result "layout", so that name is used here.
# First detected bounding box: layout[0]
# For my PDF, printing layout[0] gives:
# TextBlock(block=Rectangle(x_1=104.882, y_1=133.696, x_2=124.79, y_2=147.696), text=Het, id=0, type=None, parent=None, next=None, score=None)

# Rescale the coordinates
new_coordinates = scale_xy(layout[0])

# Create a Rect object for fitz (similar to TextBlock for the bounding box coordinates)
rect = fitz.Rect(*new_coordinates)

# Now we can find and print all the tokens in the bounding box:
mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]

print("\nSelect the words intersecting the rectangle")
print("-------------------------------------------")
print(make_text(mywords))

Sorry for any confusion in terminology. I am still learning PDF-related stuff.
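
The rescale-then-filter step above can be tried without a PDF or a model. The following is a minimal sketch and not part of the original comment: plain tuples stand in for the LayoutParser block and for the word tuples from get_text("words") (which really carry extra block, line and word numbers). The names scale_box, intersects and words_in_box are hypothetical, the word data is made up, and the strict overlap test is a simplification of fitz.Rect.intersects. The 200 DPI default is an assumption read off the 72/200 factor.

```python
# Hypothetical stand-alone sketch: plain tuples replace LayoutParser blocks
# and PyMuPDF word tuples so the logic can be checked without either library.

def scale_box(box, dpi=200):
    """Convert (x1, y1, x2, y2) from image pixels to PDF points (72 per inch)."""
    factor = 72 / dpi
    return tuple(v * factor for v in box)

def intersects(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), overlap with nonzero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def words_in_box(words, box):
    """Keep words whose rectangle overlaps box; each word is (x0, y0, x1, y1, text)."""
    return [w for w in words if intersects(w[:4], box)]

# Made-up word tuples in PDF points, for illustration only
words = [
    (36.0, 48.0, 46.0, 54.0, "Het"),
    (50.0, 48.0, 70.0, 54.0, "huis"),
    (36.0, 200.0, 60.0, 206.0, "elders"),
]

# Pixel coordinates copied from the TextBlock printed above (assumed 200 DPI render)
pixel_box = (104.882, 133.696, 124.79, 147.696)
pdf_box = scale_box(pixel_box, dpi=200)

selected = words_in_box(words, pdf_box)
print([w[4] for w in selected])  # prints ['Het']
```

If the page image was rendered at a different resolution, pass that value as dpi. With the wrong factor the rescaled box lands in the wrong place on the page and the filter silently returns the wrong words.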

0 reactions
lolipopshock commented, Sep 13, 2021

See #71 and #72
