Extract single pages from a double-page layout and more
Many thanks for your sophisticated tool!
I’m hoping it will improve my academic reading workflow by turning scanned PDFs into annotatable documents that I can then process with Zotero and the ZotFile plugin. ZotFile automatically extracts annotations from PDFs and appends the corresponding page number. Nearly all of the PDFs I have to read use a double-page layout (two actual pages on one PDF page) and usually don’t start with page one, since they typically contain only one chapter of a book. I therefore have to create one page per sheet (as depicted here) and adjust the page number manually so that the page numbers are correct after extracting the annotations with ZotFile. You may have noticed that the screenshot is taken from unpaper. Because your tool already uses unpaper and does some preprocessing magic, I would appreciate your professional opinion on the following questions:
- Can OCRmyPDF automatically extract pages from a double-page layout and create one page per sheet in the output PDF? Ideally, those pages should be aligned, the blank pages removed and the page margins cropped slightly.
- Is the automatic extraction of pages reliable in your experience? Or are graphical tools like krop or briss more suitable for this?
- What is the most convenient way to change the page number of a PDF to the actual page number of the scanned document? This is essential so that ZotFile can append the corresponding page number to the annotations.
As a humanities student I’m not very tech-savvy and appreciate any help, especially as the deadline for my master’s thesis is getting closer and closer.
Using OCRmyPDF 7.4.0 on Ubuntu 18.10 with unpaper 6.1 and tesseract 4.0.0-115-ge3a3.
Best wishes!
Issue Analytics
- Created 5 years ago
- Comments:14
I don’t think it would be too difficult to create a pathway for using more of the unpaper options - I will see about adding that.
I don’t think I will add mutool poster, but will instead mention it as a procedure in the documentation. It sounds like mutool already does a good job at that (and would cover other, more general cases). You could do
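A minimal sketch of the mutool procedure mentioned above, not the maintainer’s own recipe. It assumes the MuPDF command-line tools are installed and that each scanned sheet holds exactly two pages side by side, so that mutool poster with -x 2 cuts every sheet into a left and a right half. The file names, the helper names poster_command and printed_page, and the starting page 45 are hypothetical and only serve as illustration.

```python
import shutil
import subprocess
import sys


def poster_command(src, dst, columns=2):
    """Build the mutool call that cuts every page into `columns` vertical strips."""
    return ["mutool", "poster", "-x", str(columns), src, dst]


def printed_page(pdf_index, first_printed_page):
    """Map a 0-based page index in the split PDF to its printed page number.

    Assumption: every split page carries exactly one printed page, in order,
    with no blank or unnumbered halves in between.
    """
    return first_printed_page + pdf_index


# Usage (hypothetical file names): python split_scan.py scan.pdf split.pdf 45
if __name__ == "__main__" and len(sys.argv) == 4:
    src, dst, first = sys.argv[1], sys.argv[2], int(sys.argv[3])
    if shutil.which("mutool") is None:
        sys.exit("mutool not found; install the MuPDF tools first")
    # Split the double-page scan into single pages.
    subprocess.run(poster_command(src, dst), check=True)
    print("PDF page 1 corresponds to printed page", printed_page(0, first))
```

The split file could then be passed to OCRmyPDF as usual (ocrmypdf split.pdf output.pdf). The page mapping only holds as long as no halves are blank or removed; otherwise the offset has to be adjusted by hand.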