Output PDF is getting distorted on each ocrmypdf command.

Hi,

Please see the attached image, which shows the output PDF becoming more distorted with each ocrmypdf run.

[image: distorted_from_v1.0_to_v1.4]

FYI, we use the auto-rotate options (--rotate-pages --rotate-pages-threshold 1) only for the first version; for the remaining versions of the PDF we do not use auto-rotate.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.2.pdf v_1.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.3.pdf v_1.4.pdf

NOTE: OCRmyPDF version: 7.0.0

Could you please help me with this?

Also, if I add the --oversample 600 option to the command for each version, it works fine, but the output PDF size increases.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 --rotate-pages --rotate-pages-threshold 1 v_2.0.pdf v_2.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.1.pdf v_2.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.2.pdf v_2.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.3.pdf v_2.4.pdf
 

Thanks.
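To put a number on the size increase mentioned for the --oversample 600 runs, a small helper can report the size of each generation relative to the first file. This is only a sketch: the function name `size_report` is hypothetical, and the file names are taken from the commands in the post and assumed to sit in the current directory.

```python
import os


def size_report(paths):
    """Return (path, size_in_bytes, ratio_to_first_file) for each path."""
    sizes = [(p, os.path.getsize(p)) for p in paths]
    if not sizes:
        return []
    base = sizes[0][1]
    # Guard against an empty first file to avoid dividing by zero.
    return [(p, s, s / base if base else float("nan")) for p, s in sizes]


if __name__ == "__main__":
    # File names from the report above; adjust them to your own layout.
    versions = ["v_2.0.pdf", "v_2.1.pdf", "v_2.2.pdf", "v_2.3.pdf", "v_2.4.pdf"]
    existing = [p for p in versions if os.path.exists(p)]
    for path, size, ratio in size_report(existing):
        print(f"{path}: {size} bytes ({ratio:.2f}x the first file)")
```

Running the same report on the v_1.x files would show whether the distorted chain shrinks while the oversampled chain grows.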

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 15

Top GitHub Comments

1 reaction
jbarlow83 commented, Apr 26, 2020

Use --optimize 0 and --output-type pdf to disable any recompression.

Image resolution never changes by default, but recompression can occur.
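For anyone calling ocrmypdf from Python, the advice above could be applied roughly as follows. This is a minimal sketch, not the maintainer's code: the helper names `build_ocrmypdf_command` and `run_ocr` are hypothetical, and only the flags `-l`, `--optimize 0` and `--output-type pdf` come from the thread.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",), extra_args=()):
    """Build an ocrmypdf argument list with optimization disabled.

    --optimize 0 and --output-type pdf are the two flags suggested above.
    """
    return [
        "ocrmypdf",
        "-l", "+".join(languages),
        "--optimize", "0",
        "--output-type", "pdf",
        *extra_args,
        str(input_pdf),
        str(output_pdf),
    ]


def run_ocr(input_pdf, output_pdf, **kwargs):
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, **kwargs)
    return subprocess.run(cmd, check=True)
```

For example, `run_ocr("yourfile.pdf", "mvp.pdf", languages=("eng", "deu", "fra"))` would run one pass with these flags. Whether they also remove the distortion in the repeated --force-ocr chain from the original report is not confirmed in the thread.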

0 reactions
lolobosse commented, Apr 26, 2020

Good evening,

I’m experiencing a similar problem, but I have a conceptual question: why is OCRmyPDF changing the image output at all? I thought it would not, based on what I read in the readme:

Keeps the exact resolution of the original embedded images

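My case is the following: I have a long screenshot (of a webpage) that I cut into many pieces (via Pillow, lossless). After this operation the PNG looks like this:

[image: image] https://user-images.githubusercontent.com/5024077/80318713-511c0f80-880c-11ea-8d0b-c30c1c887bde.png

After that, I convert it to PDF and the output looks like this:

[image: image] https://user-images.githubusercontent.com/5024077/80318734-698c2a00-880c-11ea-8a6e-440c6593b79b.png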

And then I OCRmyPDF the file:

subprocess.run(["ocrmypdf", "-l", "eng+deu+fra", "--threshold", "../pdfs/yourfile.pdf", "../pdfs/mvp.pdf"])

and I get some noise around the letters (it does the same without threshold):

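[image: image] https://user-images.githubusercontent.com/5024077/80318756-97716e80-880c-11ea-9029-bb357fb3e672.png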

Also, the size of the PDF went from 2.3 MB to 812 KB, but I would have preferred no compression at all…

Am I missing something?
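The slicing step described in this comment (cutting a long screenshot into pieces with Pillow) is not shown in the thread. One way it might look is sketched below. The names `tile_boxes` and `slice_image`, the tile height and the output file pattern are all assumptions made for illustration. Saving the pieces as PNG keeps that step lossless, so any noise would have to be introduced later in the pipeline.

```python
def tile_boxes(width, height, tile_height):
    """Return Pillow-style (left, upper, right, lower) boxes, top to bottom."""
    if tile_height <= 0:
        raise ValueError("tile_height must be positive")
    boxes = []
    for upper in range(0, height, tile_height):
        # The last tile may be shorter than tile_height.
        boxes.append((0, upper, width, min(upper + tile_height, height)))
    return boxes


def slice_image(path, tile_height, out_pattern="piece_{:03d}.png"):
    """Crop the image at `path` into horizontal strips and save each as PNG."""
    # Imported here so tile_boxes can be used without Pillow installed.
    from PIL import Image

    written = []
    with Image.open(path) as img:
        for i, box in enumerate(tile_boxes(img.width, img.height, tile_height)):
            out_path = out_pattern.format(i)
            img.crop(box).save(out_path)  # PNG is a lossless format
            written.append(out_path)
    return written
```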
