Extract single pages from a double-page layout and more

Many thanks for your sophisticated tool!

I’m hoping that it will improve my academic reading workflow by turning scanned PDFs into annotatable documents that I can then process with Zotero and the ZotFile plugin. ZotFile automatically extracts annotations from PDFs and appends the corresponding page number. Nearly all of the PDFs I have to read have a double-page layout (two actual pages on one PDF page) and usually don’t start with page one, since they typically contain only one chapter of a book. So, to get the correct page number after extracting the annotations with ZotFile, I have to create one page per sheet (as depicted here) and adjust the page number manually. You may have noticed that the screenshot is taken from unpaper. Because your tool already uses unpaper and does some preprocessing magic, I would appreciate your professional opinion on the following questions:

Can OCRmyPDF automatically extract pages from a double-page layout and create one page per sheet in the output PDF? Ideally, those pages should be aligned, the blank pages removed and the page margins cropped slightly.
Is the automatic extraction of pages reliable in your experience? Or are graphical tools like krop or briss more suitable for this?
What is the most convenient way to change the page numbers of a PDF to the actual page numbers of the scanned document? This is essential so that ZotFile can append the corresponding page number to the annotations.

As a humanities student I’m not very tech-savvy and would appreciate any help, especially as the deadline for my master’s thesis is getting closer and closer.

Using OCRmyPDF 7.4.0 on Ubuntu 18.10 with unpaper 6.1 and tesseract 4.0.0-115-ge3a3.

Best wishes!

Top GitHub Comments

jbarlow83 commented, Jan 9, 2019

I don’t think it would be too difficult to create a pathway for using more of the unpaper options - I will see about adding that.

I don’t think I will add mutool poster but instead mention it as a procedure in the documentation. It sounds like mutool already does a good job at that (and would cover other, more general cases).
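The `mutool poster` procedure mentioned above can be sketched as a small shell helper. This is a minimal sketch rather than a documented OCRmyPDF workflow: the function name `split_spreads`, the `DRY_RUN` switch, and the file names are made up for illustration, and it assumes MuPDF’s `mutool` is installed. `mutool poster -x 2` cuts every page into two tiles side by side, which should turn each scanned spread into two single pages.

```shell
# Hypothetical helper (not part of OCRmyPDF or MuPDF).
# Splits every page of a scan into a left and a right half with mutool poster.
# Set DRY_RUN=1 to print the command instead of running it.
split_spreads() {
  # $1 = input PDF with two book pages per PDF page, $2 = output PDF
  # -x 2 divides each page into two tiles horizontally.
  ${DRY_RUN:+echo} mutool poster -x 2 "$1" "$2"
}
```

For example, `split_spreads chapter.pdf chapter-single.pdf` followed by `ocrmypdf chapter-single.pdf chapter-ocr.pdf` would split first and run OCR afterwards. Blank pages, cropping, and page numbering are not handled here, and the result is worth checking by eye, since, as far as I know, the cut is made at the exact middle of each page.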

jbarlow83 commented, Jan 20, 2019

You could install the feature branch like this:

git clone https://github.com/jbarlow83/OCRmyPDF.git ocrmypdf
cd ocrmypdf
git checkout feature/unpaper-args
pip install --user .
