UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs
Describe the bug
The library fails to decode a byte into a character while iterating over the documents.
Affected dataset(s)
msmarco-passage/dev/small
To Reproduce
Steps to reproduce the behavior:
- Make sure collectionandqueries.tar.gz has already been downloaded into the respective dataset folder under ~/.ir_datasets
- Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
doc
- Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=0) for doc in train.docs_iter():
[2](vscode-notebook-cell:/d%3A/Repos/XpressAI/vecto-reranking/1%20-%20Dataset%20Exploration.ipynb#W5sZmlsZQ%3D%3D?line=1) doc
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
146 def __next__(self):
--> 147 return next(self.it)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
91 def __next__(self):
---> 92 line = next(self.line_iter)
93 cols = line.rstrip('\n').split('\t')
94 num_cols = len(self.cls._fields)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
28 self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
29 while self.pos < self.start:
---> 30 line = self.stream.readline()
31 if line != '\n':
32 self.pos += 1
File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>
Expected behavior
Decoding completes without error.
Additional context
Screenshot: (omitted)
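The traceback suggests the cause: FileLineIter wraps the raw stream in io.TextIOWrapper without an explicit encoding, so on Windows the locale codec (cp1252 here) is used, and cp1252 assigns no character to byte 0x9D, while the collection is presumably UTF-8. A minimal sketch of that failure mode (the sample string below is illustrative, not taken from the actual collection):

import io

# U+201D (right curly quote) encodes in UTF-8 as the bytes E2 80 9D.
raw = "a curly quote \u201d somewhere in a passage".encode("utf-8")

try:
    # Mirrors the TextIOWrapper call in the traceback: no explicit encoding on
    # Windows means the locale codec (typically cp1252), which has no character
    # assigned to byte 0x9D.
    io.TextIOWrapper(io.BytesIO(raw), encoding="cp1252").read()
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x9d ... maps to <undefined>

# Decoding the same bytes as UTF-8 succeeds.
print(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read())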
Top GitHub Comments
Thanks! I suspect it’s this issue: https://github.com/allenai/ir_datasets/issues/151
There’s a branch that fixes it, but for some reason, it hasn’t been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes
I’ll look into merging the changes that have been made since the branch was created and into pulling it into the main branch.
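Until that branch is merged, one possible stopgap on Windows (an environment-level workaround, not anything ir_datasets itself provides) is Python's UTF-8 mode (PEP 540), which makes locale-dependent defaults, including the TextIOWrapper call in the traceback above, decode as UTF-8 instead of cp1252. It has to be enabled before the interpreter starts; a quick sketch to confirm it is active:

import locale
import sys

# UTF-8 mode must be enabled before Python starts, e.g.
#   cmd:        set PYTHONUTF8=1
#   PowerShell: $env:PYTHONUTF8 = "1"
# or by launching with:  python -X utf8 ...
#
# With it enabled, locale-dependent defaults (including TextIOWrapper's)
# use UTF-8, so the cp1252 decoder from the traceback is never involved.
print(sys.flags.utf8_mode)                  # 1 when UTF-8 mode is active
print(locale.getpreferredencoding(False))   # reports a UTF-8 codec in UTF-8 mode

If the underlying file turns out not to be valid UTF-8, this only moves the failure elsewhere, so the encoding-fixes branch remains the proper fix.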
Just to chime in, we’ve seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney is there any update on getting the encoding-fixes branch merged? I only ask because I assigned my class a homework involving this dataset, and now students who use Windows are reporting that they cannot load it without errors.