UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

I tried opening a UTF-8 text file (text/plain; charset=utf-8) with torchtext:

# file -i data/simple_questions_wikidata/train.tsv
data/simple_questions_wikidata/train.tsv: text/plain; charset=utf-8

Got this stack trace:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
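
For context, 0xc3 is the lead byte of many two-byte UTF-8 sequences (for example, "é" is encoded as 0xc3 0xa9), so this error usually means UTF-8 data is being decoded as ASCII. A minimal sketch that reproduces the same class of error outside torchtext; the sample string is an assumption for illustration and is not taken from the dataset:

```python
# Hypothetical sample text; any non-ASCII character triggers the same failure.
raw = "café".encode("utf-8")  # the "é" becomes the two bytes 0xc3 0xa9

try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    # ASCII only covers bytes 0x00-0x7f, so 0xc3 is rejected.
    error_message = str(exc)

print(error_message)

# The same bytes decode cleanly once the correct codec is named.
decoded = raw.decode("utf-8")
print(decoded)
```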

I fixed it by passing the encoding explicitly when opening the file:

with open(os.path.expanduser(path), encoding='utf-8') as f:

The change goes here: https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L104
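
The fix above can be tried in isolation. The sketch below does not use torchtext at all; the helper name read_tsv_lines, the temporary file, and its contents are hypothetical and only illustrate that passing encoding='utf-8' to open() makes the result independent of the process locale:

```python
import os
import tempfile


def read_tsv_lines(path):
    """Read a tab-separated file as UTF-8, regardless of the process locale."""
    with open(os.path.expanduser(path), encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f]


# Write a small UTF-8 sample file (hypothetical data) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "train.tsv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("où est Zürich\tQ72\n")
    rows = read_tsv_lines(sample_path)

print(rows)
```

Without the encoding argument, the same open() call falls back to the locale's encoding, which is how the ASCII codec ended up in the traceback.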

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

nelson-liu commented, Dec 2, 2020 (16 reactions)

The default Ubuntu Docker image doesn’t have the en_US.UTF-8 locale. That’s the warning you’re getting when exporting. Try:

RUN apt-get update --fix-missing && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8

nelson-liu commented, Jul 20, 2017 (11 reactions)

Yes, the sys.getdefaultencoding() output looks unexpected. Python 3 changed the system encoding to default to UTF-8, but only when LC_CTYPE is Unicode-aware.

I’m betting that echo $LANG and echo $LC_CTYPE will print C or something on your machine – try setting these environment variables beforehand and let me know how that goes:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
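
To check whether the exported variables took effect, it can help to look at what Python itself reports. When open() is called in text mode without an encoding argument (and UTF-8 mode is not enabled), it uses locale.getpreferredencoding(False), which is the value that follows LANG, LC_ALL and LC_CTYPE. A small sketch; the function name describe_encodings is hypothetical:

```python
import codecs
import locale
import os
import sys


def describe_encodings():
    """Collect the settings that influence how text files are decoded."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # Used by open() in text mode when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Default for str.encode() and bytes.decode() on Python 3.
        "default": sys.getdefaultencoding(),
    }


info = describe_encodings()
for key, value in info.items():
    print(key, value, sep=": ")

# Normalize the codec name: the C locale reports ASCII as 'ANSI_X3.4-1968'.
uses_ascii = codecs.lookup(info["preferred"]).name == "ascii"
print("open() defaults to ASCII:", uses_ascii)
```

If the last line prints True, files opened without an explicit encoding will fail on the first non-ASCII byte, which matches the traceback in this issue.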