Encoding error while reading database

Hi, I am experiencing an encoding error while reading the calima_star database in a Windows environment, because database.py relies on the default encoding:

with open(fpath, 'r') as dbfile:

Then, if I call db = CalimaStarDB.builtin_db(), it raises a UnicodeError, because the database is in UTF-8 (I guess) and Python's default encoding on Windows is that of the system locale (mine is CP1252). Is there any way I could specify the encoding while reading the builtin_db?

If I replace that line in database.py with the one below, the error is fixed, but that is not a straightforward solution, since I intend to make camel_tools pip-installable through requirements.

with open(fpath, 'r', encoding="utf-8") as dbfile:

Thanks in advance
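The failure described in the question can be reproduced without camel_tools. The sketch below is only an illustration: the file name, the sample text, and the variable names are hypothetical, and it assumes a Windows locale encoding of CP1252, which it requests explicitly so that it behaves the same on any platform.

```python
import os
import tempfile

# Hypothetical sample: the Arabic word "fi" (U+0641 U+064A).
# In UTF-8 it is encoded as the bytes D9 81 D9 8A.
sample = "\u0641\u064a"

with tempfile.TemporaryDirectory() as tmpdir:
    fpath = os.path.join(tmpdir, "sample.db")

    # Write the file as UTF-8, the encoding the question guesses the database uses.
    with open(fpath, "w", encoding="utf-8") as dbfile:
        dbfile.write(sample)

    # Reading with an explicit UTF-8 encoding round-trips the text.
    with open(fpath, "r", encoding="utf-8") as dbfile:
        decoded_utf8 = dbfile.read()

    # Reading as CP1252 (what open() falls back to on a CP1252 Windows locale)
    # fails, because byte 0x81 has no mapping in Python's cp1252 codec.
    cp1252_error = None
    try:
        with open(fpath, "r", encoding="cp1252") as dbfile:
            dbfile.read()
    except UnicodeDecodeError as exc:
        cp1252_error = exc

print("UTF-8 read matches original:", decoded_utf8 == sample)
print("CP1252 read failed:", cp1252_error is not None)
```

Not every UTF-8 byte sequence is invalid in CP1252, so with other text the read may succeed and silently return garbled characters instead of raising. As a possible stopgap that does not require patching the installed package, Python 3.7+ offers UTF-8 mode (the PYTHONUTF8=1 environment variable or the -X utf8 flag), which makes open() default to UTF-8; whether that suits a given deployment is not discussed in the thread.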

Top GitHub Comments

owo commented, Sep 15, 2020

Hi @alvelvis,

Just a quick update.

So camel-tools should be installable on plain Windows now through the master branch. I’ve added installation instructions. However, dialectid will not be available.

Also, camel_tools.calima_star is now camel_tools.morphology (these were the breaking changes I mentioned). The usage is almost the same, other than things being renamed (CalimaStarDB -> MorphologyDB, CalimaStarAnalyzer -> Analyzer, etc.). Just take a look at the docs to see what changed.

A pip release should be coming in the next couple of weeks or so once everything else is stabilized.
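For code that has to run against both the old and the renamed package, the rename described above could be absorbed with a small import shim. This is a sketch under assumptions: the helper name load_morphology_db_class is hypothetical, and the exact submodule path camel_tools.morphology.database is inferred from the old layout (database.py under calima_star) rather than confirmed in this thread, so check it against the docs.

```python
import importlib


def load_morphology_db_class():
    """Return the database class under its new or old name, or None.

    The module paths are assumptions based on the rename described in the
    thread (camel_tools.calima_star -> camel_tools.morphology).
    """
    candidates = [
        ("camel_tools.morphology.database", "MorphologyDB"),   # assumed new path
        ("camel_tools.calima_star.database", "CalimaStarDB"),  # old path
    ]
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package missing or laid out differently; try the next candidate.
            continue
        cls = getattr(module, class_name, None)
        if cls is not None:
            return cls
    return None


db_class = load_morphology_db_class()
if db_class is None:
    print("camel_tools does not appear to be installed")
else:
    print("Using database class:", db_class.__name__)
```

The class returned this way would then be used as in the question, for example db_class.builtin_db(), assuming that method kept its name across the rename.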

owo commented, Sep 3, 2020

Will it be available in the “pip” install in the future?

Yes and no.

Yes, the encoding="utf-8" fix will be present in a future version on pip. However, as I mentioned before, there will be breaking changes in the API that you’ll need to account for. No, it will not be installable on “plain” Windows for the foreseeable future because of the issues I mentioned in my last response.

However, I will discuss with the team whether we should offer an option to install just the components that work on a “plain” Windows setup. This would mean that users won’t have access to the Dialect Identification system (and perhaps other components as well). The analyzer alone should work on a plain Windows setup without issues.

It will be some time from now before a new pip version is released, so please keep an eye out (you can also fill out the form in the README to get an email when the next version is released).
