
'convert_into()' can convert the PDF file to JSON, but executing 'read_pdf()' as JSON gives a UTF-8 encoding error.


Summary of your issue

'convert_into()' can convert the PDF file to JSON, but executing 'read_pdf()' as JSON gives a UTF-8 encoding error.

Environment

Not provided.

What did you do when you faced the problem?

I don't understand why the convert_into() function works fine with this PDF, but passing the same PDF into read_pdf() yields an encoding error. Shouldn't the default options for both functions be identical?

Example code:

from tabula import read_pdf
from tabula import convert_into

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Writing the extracted tables to a JSON file on disk succeeds.
convert_into(file, "test.json", output_format='json')

# Reading the same PDF directly as JSON raises a UnicodeDecodeError.
df = read_pdf(file, output_format='json')

Output:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-208-fc7babef8e03> in <module>()
----> 1 df = read_pdf(file, output_format='json')

C:\Users\ETurner\AppData\Local\Continuum\Anaconda3\lib\site-packages\tabula\wrapper.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)
     90 
     91         else:
---> 92             return json.loads(output.decode(encoding))
     93 
     94     else:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5134: invalid start byte
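The traceback suggests why the two calls behave differently: read_pdf() decodes the extractor's output inside Python via output.decode(encoding), which defaults to UTF-8, while convert_into() writes the output to a file rather than decoding it in Python, so no decode error can occur there. Byte 0xb0 is the degree sign '°' in ISO-8859-1 but is not a valid UTF-8 start byte, which matches the error above. A minimal sketch of the failure mode, using a hypothetical sample string rather than this PDF:

# Hypothetical sample: 0xb0 is '°' in ISO-8859-1 (Latin-1).
raw = "25 °C".encode("iso-8859-1")  # stand-in for the bytes the extractor emits

try:
    raw.decode("utf-8")  # what read_pdf() does by default
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xb0 ...

print(raw.decode("iso-8859-1"))  # decoding with the right charset works: 25 °C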

What did you expect to happen?

Ideally, the behavior of both functions should be identical. I am actually trying to read this PDF as a pandas DataFrame, but the result is very messy; reading it as JSON works for me, since I can parse out the items I need. However, I don't want to have to convert files to disk first, which wastes space.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
evanleeturner commented, Oct 12, 2017

Thanks, doing "read_pdf(file, output_format='json', encoding='ISO-8859-1')" works with this file.
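Expanded into a runnable form; the loop at the end assumes tabula's usual JSON schema (a list of tables, each with a 'data' list of rows of cell objects carrying a 'text' field), which may differ across versions:

from tabula import read_pdf

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Decode the extractor's output as Latin-1 instead of the UTF-8 default.
tables = read_pdf(file, output_format='json', encoding='ISO-8859-1')

# Assumed schema: each table object has a 'data' list of rows,
# and each row is a list of cell objects with a 'text' field.
for table in tables:
    for row in table['data']:
        print([cell['text'] for cell in row])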

0 reactions
islander23 commented, May 16, 2022

encoding='ISO-8859-1' outputs the PDF, but some information is missing when a row spans more than one line.
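If data goes missing when a row wraps onto several lines, one speculative thing to try (only applicable when the table actually draws cell borders) is tabula's lattice mode, which detects cells by ruling lines instead of whitespace:

from tabula import read_pdf

file = 'T:/baysestuaries/Data/WDFT-Coastal/db_archive/QA/QA-17H104161-2017-09-22-DO.pdf'

# Speculative workaround: lattice=True switches to ruled-line detection,
# which often keeps cells that span multiple text lines intact.
tables = read_pdf(file, output_format='json',
                  encoding='ISO-8859-1', lattice=True)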
