Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Given a document how ignore the header and set the columns of a table?

See original GitHub issue

I am working with a PDF very similar to this document:

captura de pantalla 2017-01-17 a las 11 43 37 a m

As you can see the above document has a header, when I try to use tabula-py to extract it, I am getting everything merged in a single column:

In:

df = read_pdf_table('file.pdf')

Out:

captura de pantalla 2017-01-17 a las 12 44 00 p m

Thus, my question is how can I ignore the header and get the content of the table?. I also tried with the options:

In:

df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

Out:


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
      6 
      7 df = read_pdf_table('file.pdf',
----> 8                    columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
          9 
         10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
     45     args = ["java", "-jar", jar_path] + options + [input_path]
     46 
---> 47     output = subprocess.check_output(args)
     48 
     49     if len(output) == 0:

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    624 
    625     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626                **kwargs).stdout
    627 
    628 

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    706         if check and retcode:
    707             raise CalledProcessError(retcode, process.args,
--> 708                                      output=stdout, stderr=stderr)
    709     return CompletedProcess(process.args, retcode, stdout, stderr)
    710 

CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns',

Nevertheless, it did not worked.

Issue Analytics

State:
Created 7 years ago
Comments:8 (3 by maintainers)

Top GitHub Comments

6reactions

chezoucommented, Jan 18, 2017

In short, you can extract with area and spreadsheet option.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use `area` option

According to tabula-java wiki, there is a explain how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

Using macOS’s preview, I got area information:

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

given

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without -r(same as --spreadsheet) option, it does not work properly.

1reaction

sfinotticommented, Jan 2, 2020

@sfinotti Use columns instead if column. Note that columns option doesn’t work with lattice mode.

Just tried with “columns”, and got the error: ‘float’ object is not iterable Then, changing “col_def” from =(186.681) to =(186.681,), it worded out So, even if you have only ONE column delimiter, it’s necessary to ad a “,” at the end.

Thanks a lot !!!