It would be nice to be able to include a template (json file tabula-template) directly as an option of tabula-py


(This is a reopening of #110, rewritten to fit the issue template. It was originally submitted by @papagala, but I have the same need, AFAICT.)

Is your feature request related to a problem? Please describe.

I have PDFs with many pages, each presenting a table with the same (lined) structure but filled in differently: some boxes are empty, some are filled, and the length of the entries varies from page to page. The empty cells and varying entry lengths confuse the table structure detection enough that each table is understood to have a slightly different structure. As a result, the structure of the JSON output differs from page to page.

Note that, in my case, I am trying to devise a solution that will help each of my users analyse their own PDF file, which consists of many pages, many tables, etc. The actual content is personal data. To fix ideas, think of me pre-programming a little Python script to help scrape a PDF printout of their bank statement.

This is my own problem, but maybe @papagala’s is different.

Describe the solution you’d like

My problem can already be partially managed in the following way: from the structure extracted from each page (through the method described in #102, for instance), I can figure out programmatically, with some basic logic, which single structure to use across all the pages. The problem is that I then have no way to tell tabula-py to use this structure to reanalyse the whole document.

I imagine I could pass that structure to read_pdf as a tabula template, but AFAICT this functionality is not available right now (though it is available in tabula-java?).

Describe alternatives you’ve considered

I could:

  1. use tabula on each page;
  2. output some JSON;
  3. look at the position of the cells (see #102);
  4. try to reconcile these cell positions across the pages into a global template;
  5. go back, page by page, to transform the JSON output to fit the global template.

I assess step 4 to be of medium difficulty and step 5 to be hard. It would be much better to replace step 5 with re-running tabula using the global template.

In my particular situation, where I have to prepare this for others to use, this alternative would be messy, error-prone, and very unlikely to be fixed by end-users.
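To make step 4 a bit more concrete, here is a minimal sketch of one way to reconcile per-page column positions into a single global list. It is not part of tabula-py: the function name reconcile_edges, the tolerance value, and the sample coordinates are all hypothetical, and it assumes you have already pulled the left edge of each column (in PDF points) out of the per-page JSON output.

```python
from statistics import mean


def reconcile_edges(per_page_edges, tolerance=3.0):
    """Merge per-page column edges (in PDF points) into one global list.

    Hypothetical helper: edges that lie within `tolerance` points of the
    previous edge in sorted order are treated as the same column boundary
    and averaged.
    """
    # Pool the edges from every page and sort them left to right.
    all_edges = sorted(edge for page in per_page_edges for edge in page)

    clusters = []
    for edge in all_edges:
        # Extend the current cluster if this edge is close to its last member,
        # otherwise start a new cluster.
        if clusters and edge - clusters[-1][-1] <= tolerance:
            clusters[-1].append(edge)
        else:
            clusters.append([edge])

    # One representative edge per cluster.
    return [round(mean(cluster), 1) for cluster in clusters]


# Made-up edges for three pages; page 2 detected one extra column at x=210.
pages = [
    [50.0, 120.4, 300.1],
    [50.2, 119.8, 210.0, 300.0],
    [49.9, 120.0, 299.8],
]
global_edges = reconcile_edges(pages)
print(global_edges)
```

Because the edges from all pages are pooled, a column that was missed on pages where its cells were empty still ends up in the global list. Note that chaining on the previous edge can merge distinct columns if they are closer together than the tolerance, so the tolerance has to be chosen with the actual table in mind.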

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 21 (10 by maintainers)

Top GitHub Comments

2 reactions
chezou commented, Sep 23, 2018

I implemented a read_pdf_with_template() function. You can try it on the master branch.

Installation:

$ pip install git+https://github.com/chezou/tabula-py

Example code:

import tabula

dfs = tabula.read_pdf_with_template('./examples/data.pdf', './examples/data.tabula-template.json', pandas_options={'header': 0})

1 reaction
papagala commented, Feb 13, 2019

It did work 👍 I sometimes believe that developers that are too good come from the future. Thanks a lot
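
For readers who want to generate the template file from code rather than export it from the Tabula app, the sketch below shows one possible way to do it. It rests on an assumption that is not confirmed in this thread: that a template is a JSON list with one entry per selection, using the keys page, extraction_method, x1, x2, y1, y2, width and height, as in files exported by the Tabula app. The helper names build_template and write_template, the area coordinates, and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_template(areas, extraction_method="lattice"):
    """Build template entries from (page, (top, left, bottom, right)) tuples.

    Assumption: the key names mirror what the Tabula app exports; check an
    exported template from your own setup before relying on them.
    """
    entries = []
    for page, (top, left, bottom, right) in areas:
        entries.append({
            "page": page,
            "extraction_method": extraction_method,
            "x1": left,
            "x2": right,
            "y1": top,
            "y2": bottom,
            "width": right - left,
            "height": bottom - top,
        })
    return entries


def write_template(entries, path):
    """Write the entries to `path` as JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(entries, indent=2))
    return path


# Made-up area: the same table region reused on pages 1 to 3.
area = (100.0, 50.0, 700.0, 550.0)
template = build_template([(page, area) for page in (1, 2, 3)])
template_path = write_template(
    template, Path(tempfile.gettempdir()) / "global.tabula-template.json"
)

# With tabula-py and Java installed, the file could then be passed on
# (hypothetical PDF name):
# import tabula
# dfs = tabula.read_pdf_with_template("statement.pdf", str(template_path))
```

The call to read_pdf_with_template is left commented out so that the sketch runs without Java; the "lattice" default is only a guess suited to lined tables like the ones described in the issue.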
