Is there any plan to remove the dependency on PyPDF2?
Thank you for providing a very powerful, useful library.
I want to extract tables from various PDFs, but often, when I read a PDF with the camelot.read_pdf
function, I get errors from PyPDF2 like the ones below.
PyPDF2.utils.PdfReadError: Could not find object.
RecursionError: maximum recursion depth exceeded while calling a Python object
NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5
PyPDF2.utils.PdfReadError: file has not been decrypted
(This occurs even though the file is not encrypted…)
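When many PDFs are processed in one run, a failure of this kind on one file can be recorded so that the remaining files are still read. The following is only a minimal sketch: read_all is a hypothetical helper, not part of camelot or PyPDF2, and reader stands for whatever function is used to read a single PDF.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def read_all(
    paths: Iterable[str], reader: Callable[[str], Any]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call reader(path) for every path; keep results and failures apart.

    Hypothetical helper for illustration only.
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = reader(path)
        except Exception as exc:
            # RecursionError and NotImplementedError are both subclasses of
            # Exception, so they are caught here together with
            # library-specific errors such as PdfReadError.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Assuming camelot is installed, a call could look like read_all(pdf_paths, camelot.read_pdf). The failures dictionary then shows which files triggered which error, which may help when reporting reproducible cases upstream.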
This library seems to use two PDF libraries, PyPDF2 and pdfminer.six, and the main functionality for extracting text from PDFs seems to depend on pdfminer.six. I think this library could work without PyPDF2, considering that PyPDF2 has not been maintained since 2018.
Regards.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:6 (1 by maintainers)
Top GitHub Comments
PyPDF2 is maintained again since April 2022. I’m the new maintainer. Since then, we fixed a lot of things. I’m currently downloading 800,000 PDF files from Tika’s test dataset to ensure we can parse them.
Technical nitpick: Most of those issues are actually not bugs in PyPDF2, but robustness issues. The files don’t conform to the PDF standard. PyPDF2 still tries to support them, but not following the standard makes it more difficult.
See https://github.com/py-pdf/PyPDF2/pull/749 - was merged 🎉
That might be https://github.com/py-pdf/PyPDF2/issues/416 - I was not able to reproduce the issue. Do you have a PDF / sample code to help me reproduce it?
That might be https://github.com/py-pdf/PyPDF2/issues/520 - again, I cannot reproduce it. If you have a PDF / code to show it, please let me know 😃
Thanks for your great contributions! I also encountered the PyPDF2 problems occasionally, and your fork fixed them for me. The code needs to be adjusted slightly to work with PyMuPDF v1.18.5 though:
According to the PyMuPDF docs, a strict == pin is probably needed in setup.py, since PyMuPDF does not conform to the semantic versioning scheme and introduced breaking changes between v1.18.4 and v1.18.5…
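An exact pin of this kind could be sketched as below. This is an illustrative fragment, not the fork's actual setup.py: the name INSTALL_REQUIRES and the choice of 1.18.5 as the pinned version are assumptions.

```python
# Hypothetical setup.py fragment (assumed names and version, for illustration).
# "==" accepts exactly one release, so a later patch release with breaking
# changes cannot be picked up silently by the installer.
INSTALL_REQUIRES = [
    "PyMuPDF==1.18.5",
]

# In setup.py the list would typically be passed on like this:
#   setup(..., install_requires=INSTALL_REQUIRES)
```

A looser specifier such as >= would also accept later releases, which is what the strict pin is meant to avoid here.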