Is there any plan to remove dependency of PyPDF2?

Thank you for providing a very powerful, useful library. I want to extract tables from various PDFs, but when I read a PDF with the camelot.read_pdf function, it often raises errors from PyPDF2 such as the following.

PyPDF2.utils.PdfReadError: Could not find object.
RecursionError: maximum recursion depth exceeded while calling a Python object
NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5
PyPDF2.utils.PdfReadError: file has not been decrypted (this occurs although the file is not encrypted…)

This library seems to use two PDF libraries, PyPDF2 and pdfminer.six, and the main functionality for extracting text from PDFs seems to depend on pdfminer.six. I think this library could work without PyPDF2, considering that PyPDF2 has not been maintained since 2018.

Regards.

Issue Analytics

Created 3 years ago
Reactions: 1
Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction

MartinThoma commented, Jun 27, 2022

PyPDF2 has been maintained again since April 2022. I’m the new maintainer. Since then, we have fixed a lot of things. I’m currently downloading 800,000 PDF files from Tika’s test dataset to ensure we can parse them.

Technical nitpick: Most of those issues are actually not bugs in PyPDF2, but robustness issues. The files don’t conform to the PDF standard. PyPDF2 still tries to support them, but files that don’t follow the standard make that more difficult.

NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5

See https://github.com/py-pdf/PyPDF2/pull/749 - was merged 🎉

PyPDF2.utils.PdfReadError: file has not been decrypted

That might be https://github.com/py-pdf/PyPDF2/issues/416 - I was not able to reproduce the issue. Do you have a PDF / sample code to help me reproduce it?

RecursionError: maximum recursion depth exceeded while calling a Python object

That might be https://github.com/py-pdf/PyPDF2/issues/520 - again, I cannot reproduce it. If you have a PDF / code to show it, please let me know 😃
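For anyone wanting to narrow down which files trigger which error before reporting them, a small harness along these lines may help. It is only a sketch: `collect_failures` and its `parser` argument are hypothetical names introduced here, and `parser` is assumed to be any callable that takes a file path and raises on failure.

```python
from typing import Callable, Dict, Iterable


def collect_failures(paths: Iterable[str],
                     parser: Callable[[str], object]) -> Dict[str, str]:
    """Run `parser` on each path and record a one-line summary of any exception.

    `parser` is a hypothetical stand-in for whatever call fails in practice.
    """
    failures = {}
    for path in paths:
        try:
            parser(path)
        except Exception as exc:
            # RecursionError and NotImplementedError are both subclasses of
            # Exception, so they are caught here alongside library errors.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

In practice, `parser` could be something like `lambda p: camelot.read_pdf(p)`, and the resulting dictionary shows which PDFs are worth attaching to a bug report.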

1 reaction

Arnie97 commented, Dec 18, 2020

Thanks for your great contributions! I also encountered the PyPDF2 problems occasionally, and your fork fixed them for me. The code needs to be adjusted slightly to work with PyMuPDF v1.18.5, though:

diff --git i/camelot/handlers.py w/camelot/handlers.py
--- i/camelot/handlers.py
+++ w/camelot/handlers.py
@@ -114,7 +114,7 @@ class PDFHandler(object):
             outfile = fitz.open()
             outpage = outfile.newPage(-1, width=p.rect.width,
                                       height=p.rect.height)
-            outpage.showPDFpage(outpage.rect, infile, page - 1)
+            outpage.showPDFpage(outpage.rect, infile, pno=page-1)
             outfile.save(fpath)

             layout, dim = get_page_layout(fpath)
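The diff above passes the page number by keyword instead of by position. A minimal sketch of why that can matter follows. The two functions are hypothetical stand-ins for a library call before and after a release that inserts a new parameter; they are not PyMuPDF's real signatures.

```python
# Hypothetical stand-ins for a library function before and after a release
# that inserts a new parameter; these are NOT PyMuPDF's real signatures.
def show_page_old(rect, src, pno=0):
    return {"rect": rect, "src": src, "pno": pno}


def show_page_new(rect, src, keep_proportion=True, pno=0):
    return {"rect": rect, "src": src,
            "keep_proportion": keep_proportion, "pno": pno}


# Positional call: the third argument silently lands in a different parameter.
positional = show_page_new("rect", "doc", 3)

# Keyword call: the page number reaches `pno` under either signature.
keyword = show_page_new("rect", "doc", pno=3)
```

With the positional call, the page number ends up in `keep_proportion` and `pno` keeps its default, which is the kind of silent breakage a keyword argument avoids.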

According to the PyMuPDF docs,

The major and minor versions of PyMuPDF and MuPDF will always be the same. Only the third qualifier (patch level) may deviate from that of MuPDF.

A strict == pin is probably needed in setup.py, since PyMuPDF does not conform to the semantic versioning scheme and introduced breaking changes between v1.18.4 and v1.18.5…
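As a rough illustration of what such a strict pin could look like, the sketch below defines a requirement list of the kind passed to `install_requires` in setup.py, plus a small check that each entry is an exact pin. The version number is an example only, and `is_exact_pin` is a hypothetical helper, not part of setuptools.

```python
import re

# Hypothetical requirement list of the kind passed to `install_requires`;
# the version number is an example, not a tested recommendation.
INSTALL_REQUIRES = ["PyMuPDF==1.18.4"]


def is_exact_pin(requirement: str) -> bool:
    """Return True if the requirement is a strict `==` pin without wildcards."""
    pattern = r"[A-Za-z0-9_.\-]+==[0-9]+(\.[0-9]+)*"
    return re.fullmatch(pattern, requirement) is not None
```

A check like this could run in a test suite to flag any loosened constraint such as `>=` or a wildcard patch level.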