Read_PDF generating single item list
When using tabula.read_pdf on an online PDF file, I expected a DataFrame as output but instead received a list with the entire table data as a single item.
Check list before submit
- [x] Did you read FAQ?
- [x] (Optional, but really helpful) Your PDF URL: https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf
- [x] Paste the output of import tabula; tabula.environment_info() on Python REPL:
3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Java version:
java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) Client VM (build 25.251-b08, mixed mode)
tabula-py version: 2.1.0
platform: Windows-10-10.0.18362-SP0
uname:
uname_result(system='Windows', node='SENE01-VM02', release='10', version='10.0.18362', machine='AMD64', processor='Intel64 Family 6 Model 63 Stepping 2, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
If it is not possible to execute tabula.environment_info(), please answer the following questions manually.
- Paste the output of python --version command on your terminal: ?
- Paste the output of java -version command on your terminal: ?
- Does java -h command work well?; Ensure your java command is included in PATH
- Write your OS and its version: ?
What did you do when you faced the problem?
I tried to manually convert the tabula.read_pdf() output to a DataFrame via pd.DataFrame(output), but the output was a single-cell DataFrame that showed a truncated part of the first text extracted from the PDF (‘Unnam…’). I searched Stack Overflow for similar issues but did not find any.
Code:
import pandas as pd
import tabula
df = tabula.read_pdf("https://www.nar.realtor/sites/default/files/documents/ehs-03-2020-overview-2020-04-21.pdf")
print(df)
## output not as expected, used pd.DataFrame() in an attempt to get the expected df
df2 = pd.DataFrame(df)
Expected behavior:
Expected a table output similar to the examples provided on GitHub. https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb
Actual behavior:
A single-cell DataFrame that does not display the data the way the examples in the linked GitHub notebook do.
Related Issues:
Top GitHub Comments
I couldn’t find any issue. As of tabula-py 2.0.0, read_pdf() returns a list of DataFrames. You can find an actual DataFrame with df[0], for example.
Ah, I see I missed the indexing piece. Thank you both so much for clearing that up.
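The indexing point from the comments can be sketched as a small helper. This is only an illustration, not part of tabula-py: the names first_table and read_first_table are hypothetical, and the fallback for a non-list result assumes that an older tabula-py release (before 2.0.0) may hand back a single DataFrame instead of a list.

```python
def first_table(tables):
    """Return the first table from a tabula.read_pdf() result.

    As of tabula-py 2.0.0, read_pdf() returns a list of DataFrames,
    so the actual table is reached by indexing, e.g. tables[0].
    """
    if not isinstance(tables, list):
        # Assumption: an older tabula-py may return a single DataFrame.
        return tables
    if not tables:
        raise ValueError("no tables were extracted from the PDF")
    return tables[0]


def read_first_table(path_or_url, **kwargs):
    """Hypothetical convenience wrapper around tabula.read_pdf()."""
    # Imported lazily so first_table() stays usable without tabula-py and Java.
    import tabula

    return first_table(tabula.read_pdf(path_or_url, **kwargs))
```

With tabula-py and Java installed, read_first_table(url) on the PDF URL above should give a DataFrame directly, which avoids wrapping the whole list in pd.DataFrame(). If the PDF contains several tables, iterating over the list returned by read_pdf() is the more general approach.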