UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
Tried opening a text/plain; charset=utf-8 file with torchtext.
# file -i data/simple_questions_wikidata/train.tsv
data/simple_questions_wikidata/train.tsv: text/plain; charset=utf-8
Got this stack trace:
Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
Fixed with:
with open(os.path.expanduser(path), encoding='utf-8') as f:
Here: https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L104
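For anyone wanting to reproduce this outside torchtext, here is a minimal sketch. It assumes a hypothetical one-line TSV file, and it forces encoding="ascii" to stand in for what open() falls back to under a C/POSIX locale. When no encoding is passed, open() uses locale.getpreferredencoding(False), which is why the same file can load on one machine and fail on another.

```python
import os
import tempfile

# Hypothetical sample row: "é" is stored as the bytes 0xc3 0xa9 in UTF-8.
sample = "caf\u00e9\tcoffee\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Stand-in for a C/POSIX locale, where open() would decode as ASCII.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
        ascii_error = ""
    except UnicodeDecodeError as err:
        ascii_failed = True
        ascii_error = str(err)

    # Passing the encoding explicitly makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f]

print("ASCII read failed:", ascii_failed)
print("UTF-8 read returned", len(lines), "line(s)")
```

The explicit encoding argument is the same change as the one-line fix above, just shown in isolation.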
Top GitHub Comments
The default Ubuntu Docker image doesn’t have en_US.UTF-8. That’s the warning you’re getting when exporting. Try:
RUN apt-get update --fix-missing && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
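To check whether locale settings like these actually took effect, a small diagnostic along the following lines could be run inside the container. This is only a sketch: describe_text_encoding is a hypothetical helper name, not part of torchtext. The value to watch is the preferred encoding, since that is what open() uses when no encoding argument is given.

```python
import locale
import os
import sys


def describe_text_encoding():
    """Collect the settings that influence how open() decodes text files."""
    return {
        # Encoding open() uses when called without an encoding argument.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Default for str.encode() / bytes.decode().
        "default_encoding": sys.getdefaultencoding(),
        # Locale environment variables; None means the variable is unset.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


report = describe_text_encoding()
for name, value in report.items():
    print(name, "=", value)
```

If the preferred encoding comes back as an ASCII variant rather than UTF-8, the locale is probably still C or POSIX.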
Yes, the sys.getdefaultencoding() looks unexpected. Python 3 changed the system encoding to default to utf-8, but only when LC_CTYPE is unicode-aware. I’m betting that echo $LANG and echo $LC_CTYPE will print C or something on your machine. Try setting these environment variables beforehand and let me know how that goes: