file size increase for pdf/a

OCRmyPDF is really marvelous! Thanks!

I have one question regarding output file size: unless I explicitly select pdf as the output type, the output files are quite large (~4x the input) after “ocrmypdf in.pdf out.pdf”. The pages are scanned text, i.e. there are no gray pixels, only black or white ones. Only “--output-type pdf” keeps the file size similar.
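For anyone who wants to reproduce the comparison, here is a minimal Python sketch that builds the three invocations discussed in this issue and, if `ocrmypdf` and `in.pdf` are available, runs them and prints the resulting sizes. The helper name `ocrmypdf_command` and the output file names are assumptions made for illustration; only the `--output-type` and `--optimize` flags come from the thread.

```python
import os
import shutil
import subprocess


def ocrmypdf_command(src, dst, output_type=None, optimize=None):
    """Build an ocrmypdf argument list; None leaves the tool's default in place."""
    cmd = ["ocrmypdf"]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    if optimize is not None:
        cmd += ["--optimize", str(optimize)]
    return cmd + [src, dst]


# The three runs compared in this issue (output names are hypothetical).
VARIANTS = {
    "out_default.pdf": ocrmypdf_command("in.pdf", "out_default.pdf"),
    "out_pdf.pdf": ocrmypdf_command("in.pdf", "out_pdf.pdf", output_type="pdf"),
    "out_opt3.pdf": ocrmypdf_command("in.pdf", "out_opt3.pdf", optimize=3),
}

if __name__ == "__main__" and shutil.which("ocrmypdf") and os.path.exists("in.pdf"):
    for name, cmd in VARIANTS.items():
        subprocess.run(cmd, check=True)  # raises if ocrmypdf reports an error
        print(f"{name}: {os.path.getsize(name) / 1024:.0f} KiB")
```

Running “pdfimages -list” on each output afterwards shows which image encoding each variant ended up with.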

For the first page (the others are similar) “pdfimages -list in.pdf” gives:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  icc     1   1  ccitt  no        17  0   600   601 88.3K 5.9%

out.pdf results in:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  rgb     3   8  image  no        12  0   600   601  385K 1.1%

Even --optimize 3 results in double the file size for out.pdf (saved as PDF/A):

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  gray    1   1  image  no        34  0   600   601  203K  14%

Is a conversion obligatory for pdf/a? Or is there a way to keep the original image type AND generate pdf/a?
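The “ratio” column in the listings above can be checked by hand, which helps explain why a better-looking ratio can still mean a bigger file. The sketch below assumes that the ratio is stored size divided by uncompressed size, that rows are padded to whole bytes, and that “K” means 1024 bytes. These are assumptions about how pdfimages reports its numbers, but they do reproduce the figures shown.

```python
def raw_size_bytes(width, height, components, bits_per_component):
    """Uncompressed image size, with each row padded to a whole byte."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


def ratio_percent(stored_kib, width, height, components, bits_per_component):
    """Stored size as a percentage of the uncompressed size."""
    raw = raw_size_bytes(width, height, components, bits_per_component)
    return 100 * stored_kib * 1024 / raw


# (label, stored size in K, components, bits per component) from the listings
pages = [
    ("in.pdf, ccitt 1-bit", 88.3, 1, 1),
    ("out.pdf, rgb 8-bit", 385, 3, 8),
    ("out.pdf --optimize 3, gray 1-bit", 203, 1, 1),
]
for label, stored, comp, bpc in pages:
    print(f"{label}: {ratio_percent(stored, 2697, 4533, comp, bpc):.1f}%")
```

The RGB image compresses to about 1% of its raw size, yet its raw size is roughly 24 times that of the 1-bit original, so the stored image still comes out more than four times larger.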

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
femifrak commented, Sep 13, 2018

Thank you for the explanation. But one thing I don’t understand:

with --output-type pdf (unless the “Image processing” settings are turned on), the raster image will never make it into the final PDF.

Did you mean “without” instead of “with”? With “--output-type pdf” I get small output file sizes. Therefore, I think this option carries the original PDF images over into the final PDF.

I would think that most scanned PDFs contain old documents, many of which are only b/w. That is, I don’t think b/w is such a rare special case, do you? What do you think of an option to convert gray images to b/w for output? Gray images would be better for tesseract (?) and b/w output would be better for reading and for file size. This would be the same reasoning as for --clean-final, but --clean-final doesn’t convert to monochrome. (Apart from the programming effort, of course …).

change compression after quantizing. That is something I can add eventually.

very good idea 😃
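To make the gray-to-b/w idea concrete, here is a small pure-Python sketch of what such a conversion amounts to: threshold each 8-bit gray sample, then pack eight pixels into one byte. This is not how OCRmyPDF works internally. The function names and the fixed cutoff of 128 are assumptions for illustration, and a real tool would more likely use an adaptive threshold followed by CCITT or JBIG2 compression.

```python
def threshold_to_bilevel(gray_rows, cutoff=128):
    """Map 8-bit gray samples to 1 (white) or 0 (black) using a fixed cutoff."""
    return [[1 if value >= cutoff else 0 for value in row] for row in gray_rows]


def pack_rows(bit_rows):
    """Pack each row of 0/1 pixels MSB-first, padding rows to a byte boundary."""
    packed = bytearray()
    for row in bit_rows:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)  # left-align a short final chunk
            packed.append(byte)
    return bytes(packed)


# One 9-pixel row of gray samples: 9 bytes in, 2 bytes out.
gray = [[0, 255, 10, 200, 130, 127, 128, 0, 255]]
packed = pack_rows(threshold_to_bilevel(gray))
print(len(gray[0]), "->", len(packed), "bytes")
```

Before any compression is applied, the packed form is already about one eighth the size of the 8-bit gray data, which is the saving the suggestion is after.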

0 reactions
aspik commented, Feb 6, 2020

I know this might not be the right place, but I didn’t want to create a new “issue”.

I don’t intend to offer save as b/w because some workflows (mine for example) is mostly b/w with a few color images per file, and I’d rather not introduce an option that effectively ruins the output if you use it on the wrong file.

Could you please tell me a bit more about your workflow? I have documents which contain just black text with a small color graphic (corporate logo). Additionally, one page has signatures made with a blue pencil. What would be the best way to scan and process this kind of document (contracts)? I don’t want to lose the color information.

Currently I’m scanning it as color text at 600dpi (using VueScan). After passing it through ocrmypdf, the size was not reduced much (~700KB). When using the jbig2-lossy compression 3, the size was halved (14MB -> 7MB for an 8-page document). I’m perfectly fine with that size when storing it locally.

However, sometimes I want to send this kind of document by e-mail, and in that case even 7MB is not optimal. I would like to convert it to b/w, as it does not matter whether the corporate logo is red-blue or just black. How would it be harmful if there were a save/convert as b/w option?

Maybe I’m missing something here and there is a different way to reduce the file size?

Thanks for the excellent app!
