Given a document, how do I ignore the header and set the columns of a table?

I am working with a PDF very similar to this document:

[Screenshot taken 2017-01-17 at 11:43:37 AM]

As you can see, the above document has a header. When I try to use tabula-py to extract the table, I get everything merged into a single column:

In:

df = read_pdf_table('file.pdf')

Out:

[Screenshot taken 2017-01-17 at 12:44:00 PM]

Thus, my question is: how can I ignore the header and get the content of the table? I also tried the columns option:

In:

df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

Out:


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
      6 
      7 df = read_pdf_table('file.pdf',
----> 8                    columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
          9 
         10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
     45     args = ["java", "-jar", jar_path] + options + [input_path]
     46 
---> 47     output = subprocess.check_output(args)
     48 
     49     if len(output) == 0:

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    624 
    625     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626                **kwargs).stdout
    627 
    628 

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    706         if check and retcode:
    707             raise CalledProcessError(retcode, process.args,
--> 708                                      output=stdout, stderr=stderr)
    709     return CompletedProcess(process.args, retcode, stdout, stderr)
    710 

CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns', 

Nevertheless, it did not work.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

6 reactions
chezou commented, Jan 18, 2017

In short, you can extract the table with the area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use the area option

The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

Using macOS's Preview, I got the area information:

image

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where the coordinates are given as follows (quoting the wiki):

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width
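
The calculation above can be sketched in Python as follows. This is only an illustration: `preview_rect_to_area` is a hypothetical helper name, and the left/top/width/height values are assumed measurements chosen so that the result matches the area used in this thread, not values taken from the screenshot.

```python
def preview_rect_to_area(left, top, width, height):
    """Convert a left/top/width/height selection into (y1, x1, y2, x2).

    The order (top, left, bottom, right) is the one the -a / area option
    in this thread expects.
    """
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Assumed measurements, picked to reproduce the area from the answer above.
area = preview_rect_to_area(left=226.49, top=337.29, width=158.42, height=135.56)

# Round for display, since adding floats can leave tiny representation errors.
print([round(v, 2) for v in area])
```

The rounded tuple can then be passed as the area argument shown in the tabula.read_pdf call above.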

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.
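
To make the relationship between the Python call and the Java command line more concrete, here is a rough sketch that assembles the same argument list as the command above, in the style of the `["java", "-jar", jar_path] + options + [input_path]` line visible in the traceback. `build_tabula_command` is a hypothetical helper, not part of tabula-py; only the `-g`, `-r`, and `-a` flags shown in this thread are used.

```python
def build_tabula_command(jar_path, pdf_path, area, spreadsheet=True, guess=True):
    """Assemble a tabula-java command line as a list of arguments.

    area is (y1, x1, y2, x2), i.e. (top, left, bottom, right).
    """
    y1, x1, y2, x2 = area
    options = []
    if guess:
        options.append("-g")  # guess flag, as in the command above
    if spreadsheet:
        options.append("-r")  # same as --spreadsheet, per the answer
    options += ["-a", f"{y1},{x1},{y2},{x2}"]
    return ["java", "-jar", jar_path] + options + [pdf_path]


cmd = build_tabula_command(
    "./tabula/tabula-0.9.1-jar-with-dependencies.jar",
    "table.pdf",
    (337.29, 226.49, 472.85, 384.91),
)
print(" ".join(cmd))
```

If Java and the jar are available locally, the resulting list could be handed to `subprocess.check_output(cmd)`; that step is left out here so the sketch runs without either.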

1 reaction
sfinotti commented, Jan 2, 2020

@sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.

I just tried "columns" and got the error: 'float' object is not iterable. After changing "col_def" from =(186.681) to =(186.681,), it worked. So even if you have only ONE column delimiter, you need to add a "," at the end.

Thanks a lot !!!
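
The trailing-comma behavior described in this comment is plain Python and can be checked without tabula. The sketch below shows why `(186.681)` triggers the 'float' object is not iterable error while `(186.681,)` does not. `as_column_list` is a hypothetical convenience helper, not a tabula-py function; it assumes, as in the comment, that each value is an x coordinate of a column boundary.

```python
not_a_tuple = (186.681)   # parentheses only group the value: this is a float
one_tuple = (186.681,)    # the trailing comma creates a one-element tuple

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Iterating over the bare float reproduces the error from the comment.
try:
    list(not_a_tuple)
except TypeError as exc:
    print(exc)


def as_column_list(columns):
    """Accept a single number or an iterable of x coordinates; return a list."""
    if isinstance(columns, (int, float)):
        return [float(columns)]
    return [float(c) for c in columns]


print(as_column_list(186.681))
print(as_column_list((186.681,)))
```

Writing the value as a list, for example `[186.681]`, sidesteps the trailing-comma pitfall entirely.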
