Given a document, how can I ignore the header and set the columns of a table?
I am working with a PDF very similar to this document:
![captura de pantalla 2017-01-17 a las 11 43 37 a m](https://cloud.githubusercontent.com/assets/13632106/22032481/36e6297c-dcaa-11e6-9a63-516bef5109a4.png)
As you can see, the document above has a header. When I try to use tabula-py to extract the table, everything gets merged into a single column:
In:
df = read_pdf_table('file.pdf')
Out:
![captura de pantalla 2017-01-17 a las 12 44 00 p m](https://cloud.githubusercontent.com/assets/13632106/22034711/b6adbef6-dcb2-11e6-8e7f-ae3e717f83c9.png)
So my question is: how can I ignore the header and get the content of the table? I also tried the columns option:
In:
df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
Out:
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
6
7 df = read_pdf_table('file.pdf',
----> 8 columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
9
10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')
/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
45 args = ["java", "-jar", jar_path] + options + [input_path]
46
---> 47 output = subprocess.check_output(args)
48
49 if len(output) == 0:
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
624
625 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626 **kwargs).stdout
627
628
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
706 if check and retcode:
707 raise CalledProcessError(retcode, process.args,
--> 708 output=stdout, stderr=stderr)
709 return CompletedProcess(process.args, retcode, stdout, stderr)
710
CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns',
Nevertheless, it did not work.
Issue Analytics
- Created 7 years ago
- Comments: 8 (3 by maintainers)
In short, you can extract the table with the `area` and `spreadsheet` options.
How to use the `area` option
The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
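As a rough sketch of the recipe on that wiki page: Preview's Inspector reports a rectangular selection as left, top, width and height (in points), while tabula expects the area as top, left, bottom, right. The helper names and the sample numbers below are hypothetical and only illustrate the arithmetic; they are not taken from the PDF in this issue.

```python
def preview_rect_to_area(left, top, width, height):
    """Convert a Preview selection (left, top, width, height), in points,
    to tabula's area order: (top, left, bottom, right)."""
    return (top, left, top + height, left + width)


def area_to_cli_value(area):
    """Format an area tuple as the comma-separated value given to -a/--area."""
    return ",".join(str(v) for v in area)


# Hypothetical selection; read the real numbers from Preview's Inspector.
area = preview_rect_to_area(left=50.0, top=120.5, width=500.0, height=300.0)
print(area)                     # (120.5, 50.0, 420.5, 550.0)
print(area_to_cli_value(area))  # 120.5,50.0,420.5,550.0
```

Choosing a selection that starts below the page header is what keeps the header out of the extracted table.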
Using macOS's Preview, I got the area information:
given
I confirmed with tabula-java:
Without the `-r` option (same as `--spreadsheet`), it does not work properly.
I just tried "columns" and got the error: 'float' object is not iterable. After changing "col_def" from =(186.681) to =(186.681,), it worked. So even if you have only ONE column delimiter, you need to add a "," at the end.
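The trailing-comma point is plain Python behaviour rather than anything specific to tabula: parentheses alone only group an expression, so `(186.681)` is a float, and iterating over it raises the error quoted above. The sketch below shows the difference. `columns_to_cli_value` is a hypothetical helper, not tabula-py's own code, and it assumes (as with tabula-java's `--columns` flag) that the column boundaries are x-coordinates joined by commas. That assumption would also explain why passing column names such as 'Col1' in the question failed.

```python
not_a_tuple = (186.681)   # parentheses only group: this is a plain float
one_tuple = (186.681,)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)  # float
print(type(one_tuple).__name__)    # tuple


def columns_to_cli_value(columns):
    """Hypothetical helper: join x-coordinates into a --columns value."""
    return ",".join(str(x) for x in columns)


print(columns_to_cli_value(one_tuple))  # 186.681
try:
    columns_to_cli_value(not_a_tuple)
except TypeError as exc:
    print(exc)                          # 'float' object is not iterable
```

A one-element list, `[186.681]`, avoids the pitfall altogether.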
Thanks a lot !!!