
'convert_into()' can convert the PDF file to JSON, but executing 'read_pdf()' as JSON gives a UTF-8 encoding error.


Summary of your issue

'convert_into()' can convert the PDF file to JSON, but executing 'read_pdf()' as JSON gives a UTF-8 encoding error.

Environment

Not provided.

What did you do when you faced the problem?

I don't understand why the convert_into() function works fine with this PDF, but passing the same PDF into read_pdf() yields an encoding error. Shouldn't the default options for both functions be identical?

Example code:

from tabula import read_pdf
from tabula import convert_into

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Writing the extracted tables to a JSON file on disk succeeds.
convert_into(file, "test.json", output_format='json')

# Reading the same PDF directly as JSON raises a UnicodeDecodeError.
df = read_pdf(file, output_format='json')

Output:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-208-fc7babef8e03> in <module>()
----> 1 df = read_pdf(file, output_format='json')

C:\Users\ETurner\AppData\Local\Continuum\Anaconda3\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
     90 
     91         else:
---> 92             return json.loads(output.decode(encoding))
     93 
     94     else:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5134: invalid start byte
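The traceback suggests why the two calls behave differently: read_pdf() decodes the extractor's output inside Python via output.decode(encoding), which defaults to UTF-8, while convert_into() writes the output to a file rather than decoding it in Python, so no decode error can occur there. Byte 0xb0 is the degree sign '°' in ISO-8859-1 but is not a valid UTF-8 start byte, which matches the error above. A minimal sketch of the failure mode, using a hypothetical sample string rather than this PDF:

# Hypothetical sample: 0xb0 is '°' in ISO-8859-1 (Latin-1).
raw = "25 °C".encode("iso-8859-1")  # stand-in for the bytes the extractor emits

try:
    raw.decode("utf-8")  # what read_pdf() does by default
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xb0 ...

print(raw.decode("iso-8859-1"))  # decoding with the right charset works: 25 °C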

What did you expect to happen?

Ideally, the behavior of both functions should be identical. I am actually trying to read this PDF as a pandas DataFrame, but the result is very messy; reading it as JSON works for me, since I can parse out the items I need. However, I don't want to have to convert files to disk first, which wastes space.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
evanleeturner commented, Oct 12, 2017

Thanks, doing "read_pdf(file, output_format='json', encoding='ISO-8859-1')" works with this file.
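Expanded into a runnable form; the loop at the end assumes tabula's usual JSON schema (a list of tables, each with a 'data' list of rows of cell objects carrying a 'text' field), which may differ across versions:

from tabula import read_pdf

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Decode the extractor's output as Latin-1 instead of the UTF-8 default.
tables = read_pdf(file, output_format='json', encoding='ISO-8859-1')

# Assumed schema: each table object has a 'data' list of rows,
# and each row is a list of cell objects with a 'text' field.
for table in tables:
    for row in table['data']:
        print([cell['text'] for cell in row])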

0 reactions
islander23 commented, May 16, 2022

encoding='ISO-8859-1' outputs the PDF, but some information is missing when a row spans more than one line.
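If data goes missing when a row wraps onto several lines, one speculative thing to try (only applicable when the table actually draws cell borders) is tabula's lattice mode, which detects cells by ruling lines instead of whitespace:

from tabula import read_pdf

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Speculative workaround: lattice=True switches to ruled-line detection,
# which often keeps cells that span multiple text lines intact.
tables = read_pdf(file, output_format='json',
                  encoding='ISO-8859-1', lattice=True)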
