Transistor Image Tutorial - UnicodeEncodeError from CorpusParser

Hello,

First off, thanks for this interesting library. I have only gone through it a little bit but am quite excited to explore its full capabilities.

I am going through the transistor_image_tutorial at the moment. Running the lines:

corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path, flatten=[])
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

I get the following UnicodeEncodeError.

[Screenshot: UnicodeEncodeError traceback from transistor_image_tutorial]

As a result, when I execute the next line, I get

Documents: 0 Sentences: 0 Figures: 0

Is there a workaround for this, or do I need to do something first to avoid the error and get the same result as the original transistor_image_tutorial file? Thank you.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
lukehsiao commented, Aug 2, 2018

Which version of PostgreSQL are you using?

Can you verify that your Postgres server encoding is set to UTF8? Check by running:

$ psql stg_temp_max_figure -c "SHOW SERVER_ENCODING"

Also, can we see which encoding each of your databases is using? List them by running:

$ psql -l

I suspect the issue is that your database is using an ASCII encoding rather than the Unicode encoding it should use. If that hypothesis is correct, try rerunning the tutorial after explicitly making sure the database uses Unicode, by dropping the database and creating a new one:

$ dropdb stg_temp_max_figure
$ createdb -E UTF8 stg_temp_max_figure
$ psql -l
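
To illustrate the hypothesis, here is a minimal Python sketch of why an ASCII encoding can trigger a UnicodeEncodeError while UTF-8 does not. The sample string and the `can_encode` helper are hypothetical and only stand in for text that a parsed datasheet might contain; this is not Fonduer's or PostgreSQL's actual code path.

```python
# Hypothetical sample text with non-ASCII characters (degree sign, micro sign),
# similar to what a transistor datasheet might contain.
sample = "Storage temperature: 150 °C, collector current: 100 µA"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Raised when a character has no representation in the target encoding.
        return False


if __name__ == "__main__":
    print("ascii:", can_encode(sample, "ascii"))
    print("utf-8:", can_encode(sample, "utf-8"))
```

If the database or client encoding behaves like the ASCII case here, any document containing such characters would fail in a similar way, which would be consistent with the parser storing no documents.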
1 reaction
lukehsiao commented, Aug 3, 2018

Thanks for the follow-up @Allen8838! I don’t know offhand how to change the default encoding, but I suspect it depends on the templates being set to UTF8 rather than ASCII.

Also note that Fonduer expects Postgres 9.6 or above, so you may run into other issues with Postgres 9.5.13. I’d recommend upgrading if you can.
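
As a rough sketch of the version requirement above, the following Python snippet compares a dotted version string against the 9.6 minimum mentioned in this comment. The helpers `parse_version` and `meets_minimum` are hypothetical names, and the sketch assumes the version string is purely numeric and dot-separated (for example `9.5.13`), without any distribution suffix.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as '9.5.13' into (9, 5, 13)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version: str, minimum: str = "9.6") -> bool:
    """Return True if `version` is at least `minimum` (tuple comparison)."""
    return parse_version(version) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("9.5.13", "9.6", "10.4"):
        print(candidate, meets_minimum(candidate))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "10.4" would sort before "9.6".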
