WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

Describe the issue

When I get this warning, the output file is usually larger than I expected. The output also contains INFO - Optimize ratio: 3.60 savings: 72.2%, which is close to what I expected. So I think the problem is the failure to convert to PDF/A.
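The two numbers in that INFO line are consistent with each other. A minimal sketch of the arithmetic, assuming the ratio means size before divided by size after (the function name is hypothetical and not part of OCRmyPDF):

```python
def savings_from_ratio(ratio: float) -> float:
    """Return the percentage saved when a size shrinks by `ratio` (before / after).

    Assumption: `ratio` is size-before divided by size-after.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


print(f"{savings_from_ratio(3.60):.1f}%")  # prints 72.2%
```

So a ratio of 3.60 and savings of 72.2% describe the same reduction; neither number says anything about whether the PDF/A conversion succeeded.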

To Reproduce

ocrmypdf --tesseract-timeout=0 --optimize 3 --skip-text --jbig2-lossy kolmbook-eng-scan.pdf kolmbook-eng-scan.o3.pdf 

Console output:

Scan: 100%|█████████████████████████████████| 519/519 [00:24<00:00, 21.52page/s]
   INFO - Start processing 4 pages concurrent
OCR: 100%|██████████████████████████████| 519.0/519.0 [00:21<00:00, 24.00page/s]
JPEGs: 100%|███████████████████████████████████| 3/3 [00:01<00:00,  2.59image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 100%|██████████████████████████████████| 52/52 [00:16<00:00,  3.15item/s]
   INFO - Optimize ratio: 3.60 savings: 72.2%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)
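The warning refers to the PDF/A identification entries (pdfaid:part and pdfaid:conformance) that a PDF/A file carries in its XMP metadata. A rough way to see whether a file claims PDF/A is to look for those entries directly. The sketch below is not how OCRmyPDF performs its check; it assumes the XMP packet is stored uncompressed in the file, and the function names are hypothetical:

```python
import re
from pathlib import Path
from typing import Optional

# Matches both attribute form (pdfaid:part="2") and element form (<pdfaid:part>2</pdfaid:part>).
_PART_RE = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF_RE = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes) -> Optional[str]:
    """Return the claimed PDF/A level such as '2B', or None if no claim is found.

    Assumption: the XMP packet is not compressed, so a raw byte scan can see it.
    This only detects a claim; it does not validate conformance.
    """
    part = _PART_RE.search(data)
    if part is None:
        return None
    conformance = _CONF_RE.search(data)
    level = part.group(1).decode("ascii")
    if conformance is not None:
        level += conformance.group(1).decode("ascii").upper()
    return level


def file_claimed_pdfa_level(path: str) -> Optional[str]:
    """Read a PDF from disk and report the PDF/A level it claims, if any."""
    return claimed_pdfa_level(Path(path).read_bytes())
```

For example, file_claimed_pdfa_level("kolmbook-eng-scan.o3.pdf") returning None would match the "No PDF/A metadata in XMP" message, under the assumption stated above.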

Example file http://www.lirmm.fr/~ashen/kolmbook-eng-scan.pdf

  • This is the input file
  • The file contains no personal or confidential information
  • I am the copyright holder for this file
  • I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
  • I am not the copyright holder, but this file is available under a free software license

System:

  • OS: macOS 10.14.5
  • OCRmyPDF Version: 9.5.0

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
jbarlow83 commented, Mar 2, 2020

PDF/A is supposed to make a file that is reproducible when extraterrestrial archaeologists try to digest the remains of human civilization a million years from now. Because of that, substituted fonts are not acceptable in PDF/A, and there is no way to guarantee that the PDF will appear the same to all users. If you do --force-ocr then you accept the larger file size and accept the font substitutions performed by Ghostscript (may be wrong), and the result is PDF/A.

You could try Acrobat DC to see if it can embed the missing fonts into the PDF. It does have features like this.
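As a sketch of the --force-ocr suggestion, the command from the report can be rebuilt with --skip-text swapped for --force-ocr (OCRmyPDF does not accept both together). The helper name is hypothetical, the other flags are copied from the original command, and whether the result is PDF/A for this particular file is taken from the comment above, not verified here:

```python
import shlex
from typing import List


def build_ocrmypdf_args(input_pdf: str, output_pdf: str, force_ocr: bool = False) -> List[str]:
    """Build the ocrmypdf command line from the report, optionally with --force-ocr.

    --force-ocr replaces --skip-text rather than being added next to it,
    since the two options are mutually exclusive.
    """
    mode_flag = "--force-ocr" if force_ocr else "--skip-text"
    return [
        "ocrmypdf",
        "--tesseract-timeout=0",
        "--optimize", "3",
        mode_flag,
        "--jbig2-lossy",
        input_pdf,
        output_pdf,
    ]


args = build_ocrmypdf_args("kolmbook-eng-scan.pdf", "kolmbook-eng-scan.o3.pdf", force_ocr=True)
print(shlex.join(args))  # copy into a shell, or pass `args` to subprocess.run
```

The sketch only prints the command. As the comment notes, the trade-off is a larger file and Ghostscript font substitutions that may be wrong.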

0 reactions
jbarlow83 commented, Mar 23, 2020

I’ll close the issue now as I do not believe there is any way to fix it. If you have further related questions feel free to reopen it.
