Read_PDF generating single item list

When using tabula.read_pdf() on an online PDF file, I expected a DataFrame as output but instead received a list with the entire table data as a single item.

Check list before submit

Python version:
    3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Java version:
    java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) Client VM (build 25.251-b08, mixed mode)
tabula-py version: 2.1.0
platform: Windows-10-10.0.18362-SP0
uname:
    uname_result(system='Windows', node='SENE01-VM02', release='10', version='10.0.18362', machine='AMD64', processor='Intel64 Family 6 Model 63 Stepping 2, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '') 

If it is not possible to execute tabula.environment_info(), please answer the following questions manually.

  • Paste the output of the python --version command on your terminal: ?
  • Paste the output of the java -version command on your terminal: ?
  • Does the java -h command work well? Ensure your java command is included in PATH.
  • Write your OS and its version: ?

What did you do when you faced the problem?

I tried to manually convert the tabula.read_pdf() output to a DataFrame via pd.DataFrame(output), but the result was a single-cell DataFrame that showed only a truncated part of the first text extracted from the PDF (‘Unnam…’). I searched Stack Overflow for similar issues but did not find any.

Code:

import pandas as pd
import tabula

# read_pdf() takes the PDF path or URL as its argument
df = tabula.read_pdf("https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf")
print(df)

## output not as expected, used pd.DataFrame() in an attempt to get the expected df
df2 = pd.DataFrame(df)

Expected behavior:

Expected a table output similar to the examples provided on GitHub: https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb

Actual behavior:

A single-cell DataFrame that doesn’t display the data the way the examples in the linked GitHub notebook do.

Related Issues:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
chezou commented, Jun 4, 2020

I couldn’t find any issue. As of tabula-py 2.0.0, read_pdf() returns a list of DataFrames. You can find an actual DataFrame with df[0] for example.

>>> fname = "https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf"
>>> df = tabula.read_pdf(fname)
'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Jun 04, 2020 11:19:49 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Jun 04, 2020 11:19:49 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Jun 04, 2020 11:19:53 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 870 fonts

>>> df
[                                           Unnamed: 0      Unnamed: 1                       Unnamed: 2               Unnamed: 3  ... Unnamed: 10 Unnamed: 11 Unnamed: 12    Mos.
0                                                Year            U.S.                        Northeast                  Midwest  ...        West         NaN  Inventory*  Supply
1                                                2017      5 ,510,000                         7 40,000                1,300,000  ...           *         NaN   1,460,000     3.9
2                                                2018      5 ,340,000                         6 90,000                1,270,000  ...           *         NaN   1,530,000     4.0
3                                                2019      5 ,340,000                         6 90,000                1,250,000  ...           *         NaN   1,390,000     3.9
...snip...


[43 rows x 14 columns]]
>>> len(df)
1
>>> df[0]
                                           Unnamed: 0      Unnamed: 1                       Unnamed: 2               Unnamed: 3  ... Unnamed: 10 Unnamed: 11 Unnamed: 12    Mos.
0                                                Year            U.S.                        Northeast                  Midwest  ...        West         NaN  Inventory*  Supply
1                                                2017      5 ,510,000                         7 40,000                1,300,000  ...           *         NaN   1,460,000     3.9
2                                                2018      5 ,340,000                         6 90,000                1,270,000  ...           *         NaN   1,530,000     4.0
3                                                2019      5 ,340,000                         6 90,000                1,250,000  ...           *         NaN   1,390,000     3.9
...snip...
[43 rows x 14 columns]

0 reactions
clarakheinz commented, Jun 4, 2020

Ah, I see I missed the indexing piece. Thank you both so much for clearing that up.
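
For anyone landing here with the same symptom, the fix discussed above can be sketched roughly as follows. This is only an illustration, not code from the thread or from tabula-py itself: `first_table` and `read_first_table` are hypothetical helper names, and the sketch assumes that the list returned by read_pdf() may be empty when no table is detected on the requested pages.

```python
def first_table(tables):
    """Return the first table from a read_pdf()-style result.

    Hypothetical helper: `tables` is expected to be the list of
    DataFrames that tabula-py >= 2.0.0 returns from read_pdf().
    """
    if isinstance(tables, list):
        if not tables:
            # Assumption: an empty list means no table was detected.
            raise ValueError("no tables were extracted from the PDF")
        return tables[0]
    # Anything that is not a list is passed through unchanged.
    return tables


def read_first_table(path_or_url, pages=1):
    """Read a PDF with tabula-py and return only its first table."""
    # Imported lazily: this needs tabula-py and a working Java runtime.
    import tabula

    tables = tabula.read_pdf(path_or_url, pages=pages)
    return first_table(tables)
```

With this in place, `read_first_table(fname)` should give the same DataFrame as `df[0]` in the session above, and there is no need to wrap the list in pd.DataFrame().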
