Extract single pages from a double-page layout and more #330

Many thanks for your sophisticated tool!

I’m hoping that it will improve my academic reading workflow by turning scanned PDFs into annotatable documents that I can then process with Zotero and the ZotFile plugin. ZotFile automatically extracts annotations from PDFs and appends the corresponding page number. Nearly all of the PDFs I have to read have a double-page layout (two actual pages on one PDF page) and usually don’t start with page one, because they typically contain only one chapter of a book. I therefore have to create one page per sheet (as depicted here) and adjust the page number manually in order to get the correct page number when extracting the annotations with ZotFile. You may have noticed that the screenshot is taken from unpaper. Because your tool already uses unpaper and does some preprocessing magic, I would appreciate your professional opinion on the following questions:

  1. Can OCRmyPDF automatically extract pages from a double-page layout and create one page per sheet in the output PDF? Ideally, those pages should be aligned, the blank pages removed and the page margins cropped slightly.

  2. Is the automatic extraction of pages reliable in your experience? Or are graphical tools like krop or briss more suitable for this?

  3. What is the most convenient way to change the page number of a PDF to the actual page number of the scanned document? This is essential so that ZotFile can append the corresponding page number to the annotations.

As a humanities student I’m not very tech-savvy and would appreciate any help, especially as the deadline for my master’s thesis is getting closer and closer.

Using OCRmyPDF 7.4.0 on Ubuntu 18.10 with unpaper 6.1 and tesseract 4.0.0-115-ge3a3.

Best wishes!
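
The page-number adjustment described in the post comes down to simple arithmetic. The following is a minimal sketch, assuming every scan holds exactly two consecutive printed pages (left, then right) and that no blank pages have been removed; the function names and the example starting page are hypothetical and come from neither OCRmyPDF nor ZotFile.

```python
from __future__ import annotations


def printed_pages(sheet_index: int, first_printed_page: int) -> tuple[int, int]:
    """Return the (left, right) printed page numbers shown on one double-page scan.

    sheet_index: 0-based index of the PDF page in the scanned file.
    first_printed_page: printed number of the left-hand page on the first scan.
    Assumes two consecutive printed pages per scan and no gaps.
    """
    left = first_printed_page + 2 * sheet_index
    return left, left + 1


def printed_page_after_split(split_index: int, first_printed_page: int) -> int:
    """Return the printed page number of a page in the split (one-per-sheet) PDF.

    split_index: 0-based index of the page after splitting.
    Assumes the split keeps every page, in left-to-right order.
    """
    return first_printed_page + split_index
```

For a chapter whose first scan begins at printed page 36, `printed_pages(2, 36)` gives the numbers on the third scan, and `printed_page_after_split(5, 36)` gives the number of the sixth page after splitting. If blank pages are dropped during preprocessing, this fixed offset no longer holds and the mapping has to be tracked per page.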

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 14

Top GitHub Comments

3 reactions
jbarlow83 commented, Jan 9, 2019

I don’t think it would be too difficult to create a pathway for using more of the unpaper options - I will see about adding that.

I don’t think I will add mutool poster but instead mention it as a procedure in the documentation. It sounds like mutool already does a good job at that (and would cover other, more general cases).
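
The comment names mutool poster but does not give an invocation. A minimal sketch of how that step might be scripted before running OCRmyPDF, assuming MuPDF's `mutool` is installed and that each scan should be cut into two side-by-side tiles with the `-x 2` option; the helper names and file names are hypothetical.

```python
from __future__ import annotations

import shutil
import subprocess


def poster_command(src: str, dst: str, columns: int = 2) -> list[str]:
    """Build a `mutool poster` command that cuts each page into `columns` side-by-side tiles."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # -x sets the horizontal split factor; 2 turns one spread into two pages.
    return ["mutool", "poster", "-x", str(columns), src, dst]


def split_spreads(src: str, dst: str) -> None:
    """Run mutool poster on `src`, writing the one-page-per-sheet result to `dst`."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool (MuPDF command-line tools) was not found on PATH")
    # check=True raises CalledProcessError if mutool exits with an error.
    subprocess.run(poster_command(src, dst), check=True)
```

Calling `split_spreads("scan.pdf", "split.pdf")` would produce a file that can then be passed to `ocrmypdf split.pdf output.pdf`. This split is purely geometric: it does not detect the gutter, deskew, or remove blank pages, so cropping and alignment would still be left to unpaper or a manual tool.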

1 reaction
jbarlow83 commented, Jan 20, 2019

You could do

git clone https://github.com/jbarlow83/OCRmyPDF.git ocrmypdf
cd ocrmypdf
git checkout feature/unpaper-args
pip install --user .