It would be nice to be able to include a template (json file tabula-template) directly as an option of tabula-py
(This is a reopening of #110, made to fit the issue template. It was originally submitted by @papagala, but I have the same need, AFAICT.)
Is your feature request related to a problem? Please describe.
I have PDFs with many pages, each presenting a table with the same (lined) structure but filled in differently: some boxes are empty, some are filled, and the length of the entries varies from page to page. The empty cells and varying entry lengths disturb the table structure detection enough that each table is understood to have a slightly different structure. As a result, the structure of the JSON output differs from page to page.
Note that, in my case, I am actually trying to devise a solution that will assist each of my users in analysing their own PDF file, which consists of many pages, many tables, etc. The actual content is personal data. To fix ideas, think of me pre-programming a small Python script to help them scrape a PDF printout of their bank statement.
This is my own problem, but maybe @papagala’s is different.
Describe the solution you’d like
My problem can already be partially managed in the following way: from the structure extracted from each page (through the method described in #102, for instance), I can work out programmatically, with some basic logic, which single structure to use across all the pages. The problem is that I don’t have a way to then tell tabula-py to use this structure to reanalyse the whole document.
I imagine I could pass that structure as a tabula template to read_pdf, but AFAICT this functionality is not available right now (but is available in tabula java?).
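To make the request concrete, here is a minimal sketch of how template selections could be translated into per-page read_pdf options. It assumes the field names that, as far as I know, the Tabula app uses when exporting a template (page, extraction_method, x1, y1, x2, y2). The helper template_to_options and the sample coordinates are hypothetical and not part of tabula-py.

```python
import json


def template_to_options(template_json):
    """Translate template selections into read_pdf-style keyword options.

    Hypothetical helper: the field names are assumed to match the JSON
    exported by the Tabula app.
    """
    options = []
    for sel in json.loads(template_json):
        opt = {
            "pages": sel["page"],
            # tabula orders an area as (top, left, bottom, right)
            "area": [sel["y1"], sel["x1"], sel["y2"], sel["x2"]],
        }
        method = sel.get("extraction_method")
        if method in ("lattice", "stream"):
            opt[method] = True
        options.append(opt)
    return options


# Made-up selection: one lined table on page 1, coordinates in PDF points.
example = (
    '[{"page": 1, "extraction_method": "lattice", '
    '"x1": 30.0, "y1": 100.0, "x2": 560.0, "y2": 700.0}]'
)
options = template_to_options(example)
```

Each resulting dict could then be passed on as tabula.read_pdf(path, **opt), one call per selection.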
Describe alternatives you’ve considered
I could:
1. use tabula on each page;
2. output some JSON;
3. look at the position of the cells (see #102);
4. try to reconcile these cell positions across the pages into a global template;
5. go back, page by page, to transform the JSON output to fit the global template.
I assess step 4 to be of medium difficulty and step 5 to be hard. Instead of step 5, it would be much better to re-run tabula based on the global template.
In my particular situation, where I have to prepare this for others to use, this alternative would be messy, error-prone, and very unlikely to be fixed by end-users.
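The reconciliation step (merging cell positions from all pages into one global structure) could be sketched roughly as follows. This is only an illustration under assumptions: it takes the left edges of the detected cells on each page as plain lists of x coordinates, clusters edges that lie within a small tolerance, and keeps the clusters seen on a sufficient share of pages. The function name reconcile_columns, the tolerance and the sample coordinates are all made up.

```python
def reconcile_columns(pages, tol=3.0, min_share=0.5):
    """Merge per-page column edges into one global list of x positions.

    pages: list of lists, each holding the left edges found on one page.
    tol: edges closer than this (in PDF points) are treated as the same edge.
    min_share: keep an edge only if it appears on this fraction of pages.
    """
    # Tag every edge with its page index, then sort by x position.
    tagged = sorted((x, i) for i, xs in enumerate(pages) for x in xs)
    clusters = []
    for x, i in tagged:
        if clusters and x - clusters[-1][-1][0] <= tol:
            clusters[-1].append((x, i))
        else:
            clusters.append([(x, i)])
    kept = []
    for cluster in clusters:
        share = len({i for _, i in cluster}) / len(pages)
        if share >= min_share:
            mean_x = sum(x for x, _ in cluster) / len(cluster)
            kept.append(round(mean_x, 1))
    return kept


pages = [
    [50.0, 150.2, 300.0],  # page 1
    [50.5, 149.8, 300.4],  # page 2
    [49.5, 300.2],         # page 3: middle column empty, edge not detected
]
columns = reconcile_columns(pages)
```

The resulting positions could then feed the columns option of read_pdf, or be used to build a global template, so that every page is re-read with the same structure.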
Issue Analytics
- Created 5 years ago
- Reactions: 1
- Comments: 21 (10 by maintainers)
I implemented the read_pdf_with_template() function. You can try it on the master branch.
Installation:
Example code:
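The snippets originally posted here are not preserved. As a stand-in, the following is a minimal sketch of how read_pdf_with_template() might be used, assuming a tabula-py version that includes the function and a working Java installation. The file names, the selection coordinates and the helpers write_template and extract_with_template are hypothetical.

```python
import json
import os
import tempfile


def write_template(selections, path):
    """Save selections in the JSON layout assumed for Tabula templates."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(selections, fh, indent=2)
    return path


def extract_with_template(pdf_path, template_path):
    """Apply one template to a PDF (needs tabula-py and Java)."""
    import tabula  # imported lazily so write_template works without tabula-py

    # Returns a list of DataFrames, one per selection in the template.
    return tabula.read_pdf_with_template(pdf_path, template_path)


# Hypothetical selection: one lined table on page 1, coordinates in PDF points.
selections = [
    {
        "page": 1,
        "extraction_method": "lattice",
        "x1": 30.0,
        "y1": 100.0,
        "x2": 560.0,
        "y2": 700.0,
        "width": 530.0,
        "height": 600.0,
    }
]
template_path = write_template(
    selections,
    os.path.join(tempfile.gettempdir(), "statement.tabula-template.json"),
)
# tables = extract_with_template("statement.pdf", template_path)
```

Since each selection names a single page, a document whose pages all share the same layout would presumably need one selection per page, which is easy to generate in a loop.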
It did work 👍 I sometimes believe that developers who are this good come from the future. Thanks a lot!