Is there any plan to remove dependency of PyPDF2?

Thank you for providing a very powerful, useful library. I want to extract tables from various PDFs, but when I read a PDF with the camelot.read_pdf function, I often get errors from PyPDF2 like the ones below.

  • PyPDF2.utils.PdfReadError: Could not find object.
  • RecursionError: maximum recursion depth exceeded while calling a Python object
  • NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5
  • PyPDF2.utils.PdfReadError: file has not been decrypted (this occurs even though the file is not encrypted…)

This library seems to use two PDF libraries, PyPDF2 and pdfminer.six, and the main functionality for extracting text from PDFs seems to depend on pdfminer.six. I think this library could work without PyPDF2, considering that PyPDF2 has not been maintained since 2018.

Regards.
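
As a stopgap while the dependency remains, the failures listed above can at least be separated from the PDFs that parse cleanly. The following is a minimal sketch, not part of Camelot: `triage_pdfs` and its `reader` parameter are hypothetical names, and `PdfReadError` is matched by class name as an assumption so that the sketch runs without importing PyPDF2.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def triage_pdfs(
    paths: Iterable[str],
    reader: Callable[[str], Any],
) -> Tuple[Dict[str, Any], List[Tuple[str, str, str]]]:
    """Call reader(path) for each path and separate results from known failures.

    Hypothetical helper: `reader` is any callable that takes a file path.
    """
    results: Dict[str, Any] = {}
    failures: List[Tuple[str, str, str]] = []
    for path in paths:
        try:
            results[path] = reader(path)
        except (RecursionError, NotImplementedError) as exc:
            # Built-in exception types reported in the issue.
            failures.append((path, type(exc).__name__, str(exc)))
        except Exception as exc:
            # Assumption: PyPDF2's read error is recognized by its class name,
            # which avoids importing PyPDF2 here. Anything else is re-raised.
            if type(exc).__name__ != "PdfReadError":
                raise
            failures.append((path, "PdfReadError", str(exc)))
    return results, failures
```

With Camelot installed, `reader` could be `camelot.read_pdf`; the failure list then shows which files need a different handler or a sample to attach to an upstream bug report.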

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
MartinThoma commented, Jun 27, 2022

PyPDF2 has been maintained again since April 2022. I’m the new maintainer. Since then, we have fixed a lot of things. I’m currently downloading 800,000 PDF files from Tika’s test dataset to ensure we can parse them.

Technical nitpick: Most of those issues are actually not bugs in PyPDF2, but robustness issues. The files don’t conform to the PDF standard. PyPDF2 still tries to support them, but files that do not follow the standard are more difficult to handle.

NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5

See https://github.com/py-pdf/PyPDF2/pull/749 - was merged 🎉

PyPDF2.utils.PdfReadError: file has not been decrypted

That might be https://github.com/py-pdf/PyPDF2/issues/416 - I was not able to reproduce the issue. Do you have a PDF / sample code to help me reproduce it?

RecursionError: maximum recursion depth exceeded while calling a Python object

That might be https://github.com/py-pdf/PyPDF2/issues/520 - again, I cannot reproduce it. If you have a PDF / code to show it, please let me know 😃
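
For the RecursionError case, one workaround that is sometimes tried with deeply nested PDFs is to raise Python's recursion limit only around the parsing call. The thread does not confirm that this resolves the linked issue, so treat the following as a sketch under that assumption; `recursion_limit` is a hypothetical helper name. Raising the limit too far can crash the interpreter with a stack overflow instead of a clean exception.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int):
    """Temporarily raise the interpreter's recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit even if parsing raised.
        sys.setrecursionlimit(previous)
```

The parsing call would go inside a `with recursion_limit(...)` block, so the rest of the program keeps the default limit.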

1 reaction
Arnie97 commented, Dec 18, 2020

Thanks for your great contributions! I also encountered the PyPDF2 problems occasionally, and your fork fixed them for me. The code needs to be adjusted slightly to work with PyMuPDF v1.18.5 though:

diff --git i/camelot/handlers.py w/camelot/handlers.py
--- i/camelot/handlers.py
+++ w/camelot/handlers.py
@@ -114,7 +114,7 @@ class PDFHandler(object):
             outfile = fitz.open()
             outpage = outfile.newPage(-1, width=p.rect.width,
                                       height=p.rect.height)
-            outpage.showPDFpage(outpage.rect, infile, page - 1)
+            outpage.showPDFpage(outpage.rect, infile, pno=page-1)
             outfile.save(fpath)

             layout, dim = get_page_layout(fpath)

According to the PyMuPDF docs,

The major and minor versions of PyMuPDF and MuPDF will always be the same. Only the third qualifier (patch level) may deviate from that of MuPDF.

A strict == pin is probably needed in setup.py, since PyMuPDF does not conform to the semantic versioning scheme and introduced breaking changes between v1.18.4 and v1.18.5…
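
To illustrate what a strict pin means in practice, here is a minimal sketch. The requirement string uses v1.18.5 only because that version is mentioned above, which version to actually pin is an assumption, and the helper names are hypothetical. In setup.py the string would go into the `install_requires` list; the functions show that an exact pin rejects even a neighbouring patch release.

```python
from typing import Tuple

# Assumption for illustration: pin the version mentioned in this comment.
PYMUPDF_REQUIREMENT = "PyMuPDF==1.18.5"


def split_exact_pin(requirement: str) -> Tuple[str, str]:
    """Split 'name==version' into (name, version); reject looser specifiers."""
    name, sep, version = requirement.partition("==")
    name, version = name.strip(), version.strip()
    if not sep or not name or not version or any(c in version for c in "<>~!*,="):
        raise ValueError(f"not an exact '==' pin: {requirement!r}")
    return name, version


def matches_pin(installed_version: str, requirement: str) -> bool:
    """Return True only if the installed version equals the pinned one exactly."""
    _, pinned = split_exact_pin(requirement)
    return installed_version.strip() == pinned
```

A range such as `>=1.18.4` would accept v1.18.5 and let the breaking change through, which is why the sketch refuses anything other than `==`.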
