UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs
Describe the bug
The library fails to decode a byte into a character while iterating over the documents.
Affected dataset(s)
msmarco-passage/dev/small
To Reproduce
Steps to reproduce the behavior:
- Make sure collectionandqueries.tar.gz has already been downloaded into the respective dataset folder under ~/.ir_datasets
- Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
doc
- Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=0) for doc in train.docs_iter():
[2](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=1) doc
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
146 def __next__(self):
--> 147 return next(self.it)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
91 def __next__(self):
---> 92 line = next(self.line_iter)
93 cols = line.rstrip('\n').split('\t')
94 num_cols = len(self.cls._fields)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
28 self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
29 while self.pos < self.start:
---> 30 line = self.stream.readline()
31 if line != '\n':
32 self.pos += 1
File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>
Expected behavior
Decoding completes without error.
Additional context
Screenshot: (omitted)
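The traceback suggests the cause: FileLineIter wraps the raw stream in io.TextIOWrapper without an explicit encoding, so on Windows the locale codec (cp1252 here) is used, and cp1252 assigns no character to byte 0x9D, while the collection is presumably UTF-8. A minimal sketch of that failure mode (the sample string below is illustrative, not taken from the actual collection):

import io

# U+201D (right curly quote) encodes in UTF-8 as the bytes E2 80 9D.
raw = "a curly quote \u201d somewhere in a passage".encode("utf-8")

try:
    # Mirrors the TextIOWrapper call in the traceback: no explicit encoding on
    # Windows means the locale codec (typically cp1252), which has no character
    # assigned to byte 0x9D.
    io.TextIOWrapper(io.BytesIO(raw), encoding="cp1252").read()
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x9d ... maps to <undefined>

# Decoding the same bytes as UTF-8 succeeds.
print(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read())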
Top GitHub Comments
Thanks! I suspect it’s this issue: https://github.com/allenai/ir_datasets/issues/151
There’s a branch that fixes it, but for some reason, it hasn’t been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes
I’ll look into merging the changes that have been made since the branch was created and into pulling it into the main branch.
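Until that branch is merged, one possible stopgap on Windows (an environment-level workaround, not anything ir_datasets itself provides) is Python's UTF-8 mode (PEP 540), which makes locale-dependent defaults, including the TextIOWrapper call in the traceback above, decode as UTF-8 instead of cp1252. It has to be enabled before the interpreter starts; a quick sketch to confirm it is active:

import locale
import sys

# UTF-8 mode must be enabled before Python starts, e.g.
#   cmd:        set PYTHONUTF8=1
#   PowerShell: $env:PYTHONUTF8 = "1"
# or by launching with:  python -X utf8 ...
#
# With it enabled, locale-dependent defaults (including TextIOWrapper's)
# use UTF-8, so the cp1252 decoder from the traceback is never involved.
print(sys.flags.utf8_mode)                  # 1 when UTF-8 mode is active
print(locale.getpreferredencoding(False))   # reports a UTF-8 codec in UTF-8 mode

If the underlying file turns out not to be valid UTF-8, this only moves the failure elsewhere, so the encoding-fixes branch remains the proper fix.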
Just to chime in, we’ve seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney is there any update on getting the encoding-fixes branch merged? I only ask because I assigned my class a homework involving this dataset, and now students who use Windows are reporting that they cannot load it without errors.