Read_PDF generating single item list

When using tabula.read_pdf on an online PDF file, I expected a DataFrame as output but instead received a list containing the entire table data as a single item.

Checklist before submitting

[x] Did you read the FAQ?
[x] (Optional, but really helpful) Your PDF URL: https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf
[x] Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
    3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Java version:
    java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) Client VM (build 25.251-b08, mixed mode)
tabula-py version: 2.1.0
platform: Windows-10-10.0.18362-SP0
uname:
    uname_result(system='Windows', node='SENE01-VM02', release='10', version='10.0.18362', machine='AMD64', processor='Intel64 Family 6 Model 63 Stepping 2, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

Paste the output of python --version command on your terminal: ?
Paste the output of java -version command on your terminal: ?
Does java -h command work well?; Ensure your java command is included in PATH
Write your OS and its version: ?

What did you do when you faced the problem?

I tried to manually convert the tabula.read_pdf() output to a DataFrame via pd.DataFrame(output), but the output was a single-cell DataFrame that showed only a truncated part of the first text extracted from the PDF (‘Unnam…’). I searched StackOverflow for similar issues but did not find any.

Code:

import pandas as pd
import tabula

# Extract tables from the online PDF
df = tabula.read_pdf("https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf")
print(df)

## output not as expected, used pd.DataFrame() in an attempt to get the expected df
df2 = pd.DataFrame(df)

Expected behavior:

Expected a table output similar to the examples provided on GitHub. https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb

Actual behavior:

A single-cell DataFrame that does not show the table data the way the examples in the linked GitHub notebook do.

Related Issues:

Top GitHub Comments

chezou commented, Jun 4, 2020

I couldn’t find any issue. As of tabula-py 2.0.0, read_pdf() returns a list of DataFrames. You can find an actual DataFrame with df[0] for example.

>>> import tabula
>>> fname = "https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf"
>>> df = tabula.read_pdf(fname)
'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Jun 04, 2020 11:19:49 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Jun 04, 2020 11:19:49 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Jun 04, 2020 11:19:53 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 870 fonts

>>> df
[                                           Unnamed: 0      Unnamed: 1                       Unnamed: 2               Unnamed: 3  ... Unnamed: 10 Unnamed: 11 Unnamed: 12    Mos.
0                                                Year            U.S.                        Northeast                  Midwest  ...        West         NaN  Inventory*  Supply
1                                                2017      5 ,510,000                         7 40,000                1,300,000  ...           *         NaN   1,460,000     3.9
2                                                2018      5 ,340,000                         6 90,000                1,270,000  ...           *         NaN   1,530,000     4.0
3                                                2019      5 ,340,000                         6 90,000                1,250,000  ...           *         NaN   1,390,000     3.9
...snip...


[43 rows x 14 columns]]
>>> len(df)
1
>>> df[0]
                                           Unnamed: 0      Unnamed: 1                       Unnamed: 2               Unnamed: 3  ... Unnamed: 10 Unnamed: 11 Unnamed: 12    Mos.
0                                                Year            U.S.                        Northeast                  Midwest  ...        West         NaN  Inventory*  Supply
1                                                2017      5 ,510,000                         7 40,000                1,300,000  ...           *         NaN   1,460,000     3.9
2                                                2018      5 ,340,000                         6 90,000                1,270,000  ...           *         NaN   1,530,000     4.0
3                                                2019      5 ,340,000                         6 90,000                1,250,000  ...           *         NaN   1,390,000     3.9
...snip...
[43 rows x 14 columns]
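
As a minimal sketch of the indexing step described above, the helper below picks the first table out of a read_pdf()-style result. The names first_table and read_first_table are hypothetical and not part of tabula-py; the only tabula-py call used is tabula.read_pdf(), and the pass-through for a non-list result is a defensive assumption rather than documented behavior.

```python
def first_table(result):
    """Return the first table from a read_pdf()-style result.

    Hypothetical helper: as of tabula-py 2.0.0, read_pdf() returns a list
    of DataFrames, so the table itself is result[0]. A non-list result is
    passed through unchanged (assumption: the caller already has a table).
    """
    if isinstance(result, list):
        if not result:
            # Nothing was extracted from the requested page(s).
            raise ValueError("no tables were extracted")
        return result[0]
    return result


def read_first_table(path_or_url, **kwargs):
    """Call tabula.read_pdf() and return only the first extracted table."""
    # Imported lazily so first_table() works without tabula-py or Java installed.
    import tabula

    return first_table(tabula.read_pdf(path_or_url, **kwargs))
```

With this in place, read_first_table(fname, pages=1) would give the DataFrame directly instead of a one-item list. When several tables are expected, it is probably clearer to keep the list and iterate over it rather than take only the first item.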

clarakheinz commented, Jun 4, 2020

Ah, I see I missed the indexing piece. Thank you both so much for clearing that up.