Transistor Image Tutorial - UnicodeEncodeError from CorpusParser
Hello,
First off, thanks for this interesting library. I have only gone through it a little bit but am quite excited to explore its full capabilities.
I am going through the transistor_image_tutorial at the moment. Running the lines:
corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path, flatten=[])
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
I get the following UnicodeEncodeError.
As a result, when I execute the next line, I get
Documents: 0 Sentences: 0 Figures: 0
Is there a workaround for this, or do I need to do something first to avoid the error and get the same result as the original transistor_image_tutorial file? Thank you.
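The traceback itself is not reproduced in this copy of the issue. As a general illustration of this class of error (not the tutorial's actual failure), a UnicodeEncodeError is raised whenever text containing non-ASCII characters is encoded with a codec that cannot represent them. The sample string and the `try_encode` helper below are made up for the example:

```python
# Illustration only: the sample string is hypothetical, not taken from the tutorial PDFs.
text = "Collector current: 100 µA at 25 °C"


def try_encode(s, codec):
    """Return the encoded bytes, or a short error description if encoding fails."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return f"{type(exc).__name__}: {exc.reason}"


print(try_encode(text, "ascii"))  # fails: 'µ' and '°' have no ASCII representation
print(try_encode(text, "utf-8"))  # succeeds: UTF-8 can encode any Unicode text
```

Datasheet text routinely contains symbols such as µ, ° and Ω, so any ASCII-only step in the pipeline is a plausible place for the error to surface.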
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (4 by maintainers)
Which version of PostgreSQL are you using?
Can you verify that your postgres encoding is set to UTF8? Check by running:
Also, can we see what encoding your database is using? List those by running:
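The commands for these two checks are not preserved in this copy of the thread. As one possible way to run both checks from Python, the sketch below uses two standard PostgreSQL statements, `SHOW SERVER_ENCODING` and a query on the `pg_database` catalog. The helper names are hypothetical, and the functions assume only a DB-API style cursor (for example, one obtained from psycopg2):

```python
# Standard PostgreSQL statements; the helper functions around them are illustrative.
SERVER_ENCODING_SQL = "SHOW SERVER_ENCODING"
DATABASE_ENCODINGS_SQL = (
    "SELECT datname, pg_encoding_to_char(encoding) "
    "FROM pg_database ORDER BY datname"
)


def server_encoding(cursor):
    """Return the encoding of the database the cursor is connected to."""
    cursor.execute(SERVER_ENCODING_SQL)
    return cursor.fetchone()[0]


def database_encodings(cursor):
    """Return a {database name: encoding name} mapping for the whole cluster."""
    cursor.execute(DATABASE_ENCODINGS_SQL)
    return dict(cursor.fetchall())


def non_utf8_databases(encodings):
    """List databases whose encoding is not UTF8 (e.g. SQL_ASCII)."""
    return sorted(name for name, enc in encodings.items() if enc.upper() != "UTF8")
```

With a live connection, `server_encoding(conn.cursor())` should report `UTF8` for the database the tutorial uses. Any other value, such as `SQL_ASCII`, would be consistent with the hypothesis discussed in this thread.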
I suspect the issue is that your database is using an ASCII encoding rather than Unicode (UTF8), as it should be. If that hypothesis is correct, you should try to rerun the tutorial after explicitly making sure the database is using Unicode by dropping the database and making a new one:
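The exact commands are not preserved in this copy of the thread. A minimal sketch of the drop-and-recreate step, assuming the standard PostgreSQL `dropdb` and `createdb` command-line tools are on the PATH, might look like the following. The function name and the database name passed to it are placeholders, not values from the tutorial:

```python
import subprocess


def recreate_utf8_database(dbname, run=False):
    """Build (and optionally run) the commands that recreate dbname as UTF8.

    WARNING: running these commands deletes all data in the existing database.
    """
    commands = [
        ["dropdb", "--if-exists", dbname],
        # template0 is used because createdb may refuse an encoding that
        # differs from the default template's encoding.
        ["createdb", "--encoding=UTF8", "--template=template0", dbname],
    ]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands without executing them.
for cmd in recreate_utf8_database("my_tutorial_db"):
    print(" ".join(cmd))
```

If the cluster's locale is not compatible with UTF8, `createdb` may also need an explicit `--locale` option. That depends on the local installation and is not covered here.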
Thanks for the follow-up @Allen8838! I don’t know offhand how to change the default encoding, but I suspect it comes down to setting the templates to be UTF8 rather than ASCII.
Also note that Fonduer expects Postgres 9.6 or above, so you may run into other issues with Postgres 9.5.13. I’d recommend upgrading if you can.
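As a rough way to check the version requirement mentioned above, PostgreSQL exposes its version as an integer through `SHOW server_version_num` (for example, 90513 for 9.5.13 and 90600 for 9.6.0). The helper below is a hypothetical sketch that compares that integer against the 9.6 minimum:

```python
# 9.6.0 expressed in PostgreSQL's integer version format.
MINIMUM_VERSION_NUM = 90600


def meets_minimum_version(server_version_num, minimum=MINIMUM_VERSION_NUM):
    """Return True if the value of `SHOW server_version_num` is at least `minimum`."""
    return int(server_version_num) >= minimum


print(meets_minimum_version("90513"))   # 9.5.13, the version reported in this thread
print(meets_minimum_version("100004"))  # 10.4
```

The value is passed as a string here because database drivers commonly return the result of `SHOW` statements as text.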