Improve efficiency on PDFs which contain large amounts of text
If a PDF contains a large amount of text and only a few pictures, we only want to OCR the pictures. The script currently OCRs whole pages, including any existing text, which is undesirable because of the CPU consumption and the degradation of the existing text.
I want to implement a change (probably optional, enabled by a flag) to run OCR only on the images, not on any existing text. I would split the images away from the PDF using `pdfimages`, and then somehow re-create the layer sandwich using only the OCR generated for those images. The original text inside the files should be left untouched.
Do you have any pointers on doing this? I have a couple of ideas I want to investigate:
- process everything page by page
  - edit the PDF to make all text invisible (same color as background)
  - run the pipeline as it is; OCR should be faster, since most of the page is blank
  - recombine the original PDF text & image layers with the new OCR layer overlay (still page by page)
  - still inefficient: OCR needs to scan through a lot of empty pages
- process everything image by image
  - run `pdfimages` to extract images from the PDF (along with page number, image size and coordinates); maybe use pdf2html to get image location & position
  - create PDF sandwiches for each image separately (using pdf2pdfocr, of course)
  - re-combine them in the original PDF using pdfjam and pdftk
  - more efficient: we don't give blank images to the OCR engine
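
As a rough starting point for the image-by-image idea, here is a minimal Python sketch that finds which pages contain images at all, so that only those pages (or images) need to go to the OCR engine. It assumes poppler's `pdfimages` is installed and that `pdfimages -list` prints its usual table with `page`, `num`, `type`, `width` and `height` as the first five columns. The function names are hypothetical and not part of pdf2pdfocr. As far as I can tell the listing does not include where an image sits on the page, so the coordinates would still have to come from another tool.

```python
from __future__ import annotations

import subprocess
from collections import defaultdict


def parse_pdfimages_list(listing: str) -> dict[int, list[tuple[int, int]]]:
    """Map page number -> [(width, height), ...] from `pdfimages -list` text.

    Assumes the first five columns are: page, num, type, width, height.
    """
    pages: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric page number; the header row and
        # the dashed separator do not, so they are skipped here.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # Keep only real images; skip soft masks, stencils and similar rows.
        if fields[2] != "image":
            continue
        pages[int(fields[0])].append((int(fields[3]), int(fields[4])))
    return dict(pages)


def list_images(pdf_path: str) -> dict[int, list[tuple[int, int]]]:
    """Run `pdfimages -list` on a file (hypothetical helper, needs poppler)."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pdfimages_list(result.stdout)
```

Pages missing from the returned mapping have no embedded images, so under this approach they could be copied through untouched instead of being rasterized and OCRed.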
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
Yes, I see a 9x speed improvement with the new flag for documents of moderate text size and low image count (30 pages).
Thank you for the feature!
Testing on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf
Thanks again!
[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!