
Transistor Image Tutorial - UnicodeEncodeError from CorpusParser


Hello,

First off, thanks for this interesting library. I have only gone through it a little bit but am quite excited to explore its full capabilities.

I am going through the transistor_image_tutorial at the moment. Running the lines:

corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path, flatten=[])
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

I get the following UnicodeEncodeError.

[Screenshot: UnicodeEncodeError traceback from the transistor_image_tutorial]

As a result, when I execute the next line, I get

Documents: 0 Sentences: 0 Figures: 0

I was wondering whether there is a workaround for this, or whether I need to do something first to avoid the error and get the same result as the original transistor_image_tutorial file. Thank you.


Top GitHub Comments


lukehsiao commented, Aug 2, 2018

Which version of PostgreSQL are you using?

Can you verify that your Postgres server encoding is set to UTF8? Check by running:

$ psql stg_temp_max_figure -c "SHOW SERVER_ENCODING"

Also, can we see what encoding each of your databases is using? List them by running:

$ psql -l
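The same check can be scripted from the tutorial's Python environment. The following is only a rough sketch: it assumes `psql` is on the PATH and that the database is named `stg_temp_max_figure` as in the command above. The helper names (`server_encoding_command`, `is_utf8`, `check_server_encoding`) are hypothetical and not part of Fonduer.

```python
import subprocess


def server_encoding_command(dbname):
    """Build the psql invocation that prints only the server encoding.

    -A (unaligned) and -t (tuples only) strip the table decoration so the
    output is just the encoding name, e.g. UTF8 or SQL_ASCII.
    """
    return ["psql", dbname, "-A", "-t", "-c", "SHOW SERVER_ENCODING"]


def is_utf8(encoding_output):
    """Return True if psql reported a UTF-8 server encoding."""
    return encoding_output.strip().upper() in {"UTF8", "UTF-8"}


def check_server_encoding(dbname):
    """Run psql and report whether the database uses UTF-8.

    Assumes a local PostgreSQL server reachable with default credentials.
    """
    result = subprocess.run(
        server_encoding_command(dbname),
        capture_output=True,
        text=True,
        check=True,
    )
    return is_utf8(result.stdout)
```

Calling `check_server_encoding("stg_temp_max_figure")` before running the parser would flag a non-UTF8 database up front instead of partway through `corpus_parser.apply`.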

I suspect the issue is that your database is using an ASCII encoding rather than Unicode, as it should be. If that hypothesis is correct, try rerunning the tutorial after explicitly making sure the database uses Unicode, by dropping the database and creating a new one:

$ dropdb stg_temp_max_figure
$ createdb -E UTF8 stg_temp_max_figure
$ psql -l
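To see why an ASCII encoding would produce this class of error, here is a small Python illustration. It is not the traceback from this issue; the sample string and the `try_encode` helper are made up for the example. Datasheet text routinely contains characters such as `°` or `β`, which have no ASCII representation.

```python
def try_encode(text, encoding):
    """Encode text, returning the bytes or the exception that was raised."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Hypothetical snippet of the kind of text a transistor datasheet contains.
sample = "Gain β measured at 25 °C"

ascii_result = try_encode(sample, "ascii")  # a UnicodeEncodeError instance
utf8_result = try_encode(sample, "utf-8")   # bytes that round-trip cleanly

print(type(ascii_result).__name__)
print(utf8_result.decode("utf-8") == sample)
```

Any layer that falls back to ASCII, whether the database, the client connection, or the terminal, can fail this way on the first non-ASCII character it meets, which is why confirming UTF8 end to end is a reasonable first step.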


lukehsiao commented, Aug 3, 2018

Thanks for the follow-up @Allen8838! I don’t know offhand how to change the default encoding, but I do suspect it comes down to setting the templates to UTF8 rather than ASCII.

Also note that Fonduer expects Postgres 9.6 or above, so you may run into other issues with Postgres 9.5.13. I’d recommend upgrading if you can.
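As a rough sketch of the version requirement mentioned above, the comparison can be done on parsed version tuples. The helper names and the `MIN_POSTGRES` constant are hypothetical. The version string would come from something like `psql --version` or `SHOW server_version`, and the parsing only assumes that the string begins with a dotted number.

```python
import re

# Minimum version stated in the comment above (Postgres 9.6 or above).
MIN_POSTGRES = (9, 6)


def parse_pg_version(version):
    """Turn a string such as '9.5.13' or '10.4 (Ubuntu ...)' into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum=MIN_POSTGRES):
    """Return True if the given version is at least the minimum."""
    return parse_pg_version(version) >= minimum
```

With these helpers, `meets_minimum("9.5.13")` is false and `meets_minimum("9.6")` is true, since tuples compare element by element.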
