It would be nice to be able to include a template (json file tabula-template) directly as an option of tabula-py

(This is a reopening of #110, made to fit the issue template. It was originally submitted by @papagala, but I have the same need, AFAICT.)

Is your feature request related to a problem? Please describe.

I have PDFs with many pages, each presenting a table with the same (lined) structure but filled differently: some boxes are empty, some are filled, and the length of the entries varies from page to page. The empty cells and varying entry lengths disrupt table structure detection enough that each table is read as having a slightly different structure. As a result, the structure of the JSON output differs from page to page.

Note that, in my case, I am trying to devise a solution that will help each of my users analyse their own PDF file, which consists of many pages, many tables, etc. The actual content is personal data. To fix ideas, think of me pre-programming a small Python script to help scrape a PDF printout of their bank statement.

This is my own problem, but maybe @papagala’s is different.

Describe the solution you’d like

My problem can already be partially managed as follows: from the structure extracted from each page (through the method described in #102, for instance), I can work out programmatically, with some basic logic, which single structure to use across all the pages. The problem is that I then have no way to tell tabula-py to use this structure to re-analyse the whole document.

I imagine I could pass that structure to read_pdf as a tabula template, but AFAICT this functionality is not available right now (though it may be available in tabula-java?).

Describe alternatives you’ve considered

I could:

1. use tabula on each page;
2. output some JSON;
3. look at the position of the cells (see #102);
4. try to reconcile these cell positions across the pages into a global template;
5. go back, page by page, to transform the JSON output to fit the global template.

I assess step 4 to be of medium difficulty and step 5 to be hard. It would be much better to replace step 5 with re-running tabula based on the global template.

In my particular situation, where I have to prepare this for others to use, this alternative would be messy, error-prone, and very unlikely to be fixed by end-users.
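The reconciliation in step 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the column boundaries of each page have already been extracted as lists of x positions (in PDF points), and the names reconcile_boundaries, tol, and the sample numbers are hypothetical, not part of tabula-py.

```python
from statistics import median


def reconcile_boundaries(pages, tol=3.0):
    """Merge per-page column boundaries into one shared list.

    pages: list of lists of x positions, one inner list per page.
    tol:   values closer than this to the previous value are treated
           as the same boundary (hypothetical default).
    """
    # Pool every boundary from every page and sort left to right.
    pooled = sorted(x for page in pages for x in page)

    clusters = []
    for x in pooled:
        # Extend the current cluster while x stays within tol of the
        # previous value; otherwise start a new cluster.
        if clusters and x - clusters[-1][-1] <= tol:
            clusters[-1].append(x)
        else:
            clusters.append([x])

    # Keep boundaries supported by at least half of the pages, so a
    # stray line detected on a single page is dropped.
    min_support = len(pages) / 2
    return [median(c) for c in clusters if len(c) >= min_support]


# Hypothetical boundaries from three pages; the second page missed one
# boundary because of empty cells.
pages = [
    [50.0, 120.5, 300.0, 540.0],
    [50.4, 121.0, 540.2],
    [49.8, 120.0, 299.5, 539.9],
]
print(reconcile_boundaries(pages))
```

The median is used rather than the mean so that one badly detected page shifts the shared boundary less. The harder part remains what the request is about: feeding the resulting structure back into tabula.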

Issue Analytics

Created 5 years ago
Reactions: 1
Comments: 21 (10 by maintainers)

Top GitHub Comments

2 reactions

chezou commented, Sep 23, 2018

I implemented a read_pdf_with_template() function. You can try it on the master branch.

Installation:

$ pip install git+https://github.com/chezou/tabula-py

Example code:

import tabula
dfs = tabula.read_pdf_with_template('./examples/data.pdf', './examples/data.tabula-template.json', pandas_options={'header': 0})
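For the use case in the issue (one shared structure applied to every page), the template file could also be generated from Python rather than exported by hand. The sketch below is a guess at how that might look: the helper names build_template and read_with_shared_template are hypothetical, and the JSON field names (page, extraction_method, x1, x2, y1, y2, width, height) are assumed to match what the Tabula app exports, so compare them against a real exported template before relying on this.

```python
import json


def build_template(n_pages, area, method="lattice"):
    """Return one template entry per page, all sharing the same area.

    area is (x1, y1, x2, y2) in PDF points. The field names are an
    assumption about the Tabula template format, not confirmed here.
    """
    x1, y1, x2, y2 = area
    return [
        {
            "page": page,
            "extraction_method": method,
            "x1": x1,
            "x2": x2,
            "y1": y1,
            "y2": y2,
            "width": x2 - x1,
            "height": y2 - y1,
        }
        for page in range(1, n_pages + 1)
    ]


def read_with_shared_template(pdf_path, template_path, n_pages, area):
    """Write the shared template to disk, then read the PDF with it."""
    # Imported here so build_template() can be used without tabula-py
    # (and Java) being installed.
    import tabula

    with open(template_path, "w") as fh:
        json.dump(build_template(n_pages, area), fh, indent=2)

    # Returns a list of DataFrames, one per template entry.
    return tabula.read_pdf_with_template(
        pdf_path, template_path, pandas_options={"header": 0}
    )
```

A call such as read_with_shared_template('statement.pdf', 'shared.tabula-template.json', 12, (50.0, 100.0, 540.0, 700.0)) would then apply the same area to all twelve pages; the file names, page count, and coordinates are placeholders.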

1 reaction

papagala commented, Feb 13, 2019

It did work 👍 I sometimes believe that developers that are too good come from the future. Thanks a lot

