file size increase for pdf/a

OCRmyPDF is really marvelous! Thanks!

I have one question regarding output file size: unless I explicitly select pdf as the output type, the output files are quite large (~4x) after “ocrmypdf in.pdf out.pdf”. The pages are scanned text, i.e. there are no gray pixels, only black or white ones. Only “--output-type pdf” keeps the file size similar.

For the first page (the others are similar) “pdfimages -list in.pdf” gives:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  icc     1   1  ccitt  no        17  0   600   601 88.3K 5.9%

out.pdf results in:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  rgb     3   8  image  no        12  0   600   601  385K 1.1%

Even --optimize 3 results in double the file size for out.pdf (saved as pdf/a):

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2697  4533  gray    1   1  image  no        34  0   600   601  203K  14%
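As a sanity check on the three listings above, the sketch below recomputes the ratio column and the growth factors from the listed width, height, comp, bpc and size values. It assumes that pdfimages reports sizes in units of 1024 bytes and that "ratio" means stored size divided by uncompressed raster size; neither is confirmed in this thread. The function names are made up for illustration.

```python
# Rows copied from the pdfimages -list output above:
# (label, width, height, components, bits per component, stored size in KiB)
ROWS = [
    ("in.pdf (ccitt)", 2697, 4533, 1, 1, 88.3),
    ("out.pdf (default)", 2697, 4533, 3, 8, 385.0),
    ("out.pdf (--optimize 3)", 2697, 4533, 1, 1, 203.0),
]


def raw_kib(width, height, comp, bpc):
    """Uncompressed raster size in KiB.

    Assumption: bits are tightly packed, with no per-row padding.
    """
    return width * height * comp * bpc / 8 / 1024


def summarize(rows):
    """Return ratio (stored/raw, in percent) and growth relative to the first row."""
    base_stored = rows[0][5]
    summary = []
    for label, width, height, comp, bpc, stored in rows:
        raw = raw_kib(width, height, comp, bpc)
        summary.append({
            "label": label,
            "raw_kib": raw,
            "ratio_pct": 100 * stored / raw,
            "growth": stored / base_stored,
        })
    return summary


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(f'{row["label"]:<24} raw {row["raw_kib"]:9.1f} KiB  '
              f'ratio {row["ratio_pct"]:5.1f}%  growth {row["growth"]:.2f}x')
```

Under those assumptions the figures line up with the listings: roughly 5.9%, 1.1% and 14%, with the default output about 4.4 times and the --optimize 3 output about 2.3 times the size of the original CCITT image.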

Is a conversion obligatory for pdf/a? Or is there a way to keep the original image type AND generate pdf/a?

Top GitHub Comments

femifrak commented, Sep 13, 2018

Thank you for the explanation. But one thing I don’t understand:

“with --output-type pdf (unless the ‘Image processing’ settings are turned on), the raster image will never make it into the final PDF.”

Did you mean “without” instead of “with”? With “--output-type pdf” I get small output file sizes. Therefore, I think this option passes the original PDF images through to the final PDF.
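One way to check this on your own file is to run the same input through several settings and compare the resulting sizes. The sketch below is only an illustration: it assumes ocrmypdf is on the PATH and that in.pdf exists, the helper names and output file names are hypothetical, and higher --optimize levels may depend on optional tools being installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_cmd(src, dst, output_type, optimize=1):
    """Build an ocrmypdf command line for one output type and optimization level."""
    return [
        "ocrmypdf",
        "--output-type", output_type,   # e.g. "pdf" or "pdfa"
        "--optimize", str(optimize),    # 0 to 3
        str(src),
        str(dst),
    ]


def compare(src, variants):
    """Run ocrmypdf once per (output_type, optimize) pair and collect output sizes in bytes."""
    sizes = {}
    for output_type, optimize in variants:
        # Hypothetical naming scheme for the output files.
        dst = Path(f"out_{output_type}_O{optimize}.pdf")
        subprocess.run(build_cmd(src, dst, output_type, optimize), check=True)
        sizes[(output_type, optimize)] = dst.stat().st_size
    return sizes


if __name__ == "__main__":
    src = Path("in.pdf")
    # Only run when the input file and the ocrmypdf executable are available.
    if src.exists() and shutil.which("ocrmypdf"):
        original = src.stat().st_size
        variants = [("pdf", 1), ("pdfa", 1), ("pdfa", 3)]
        for (output_type, optimize), size in compare(src, variants).items():
            print(f"{output_type} -O{optimize}: {size} bytes ({size / original:.2f}x)")
```

Comparing the "pdf" and "pdfa" rows for the same optimization level shows how much of the growth comes from the PDF/A conversion itself rather than from OCR.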

I would think that most scanned PDFs contain old documents, many of which are only b/w. That is, I don’t think b/w is such a rare special case, do you? What do you think of an option to convert gray images to b/w for output? Gray images would be better for tesseract (?) and b/w output would be better for reading and for file size. This would be the same reasoning as for --clean-final, but --clean-final doesn’t convert to monochrome. (Apart from the programming effort, of course …).

“change compression after quantizing. That is something I can add eventually.”

very good idea 😃

aspik commented, Feb 6, 2020

I know this might not be the right place, but I didn’t want to create a new “issue”.

“I don’t intend to offer save as b/w because some workflows (mine for example) is mostly b/w with a few color images per file, and I’d rather not introduce an option that effectively ruins the output if you use it on the wrong file.”

Could you please tell me a bit more about your workflow? I have documents which contain just black text with a small color graphic (corporate logo). Additionally, one page has signatures which are made with a blue pencil. What would be the best way to scan and process this kind of document (contracts)? I don’t want to lose the color information.

Currently I’m scanning them as color text at 600dpi (using VueScan). After passing a document through ocrmypdf the size was not reduced much (~700KB). When using the jbig2-lossy compression 3 the size was halved (14MB -> 7MB for an 8-page document). I’m perfectly fine with that size for storing it locally.

However, sometimes I want to send this kind of document by e-mail, and in this case even 7MB is not optimal. I would like to convert it to b/w, as it does not matter whether the corporate logo is red-blue or just black. How would it be harmful if there were a save/convert as b/w option?

Maybe I’m missing something here and there is a different way to reduce the file size?

Thanks for the excellent app!