UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs


Describe the bug The library failed to decode a byte (0x9d) into a character while iterating over the dataset's documents.

Affected dataset(s)

  • msmarco-passage/dev/small

To Reproduce Steps to reproduce the behavior:

  1. Make sure collectionandqueries.tar.gz has already been downloaded to the respective dataset folder under ~/.ir_datasets
  2. Run:
import ir_datasets
train = ir_datasets.load('msmarco-passage/dev/small')
for doc in train.docs_iter():
    doc
  3. Wait for it to run, and you will see an error:
[INFO] [starting] fixing encoding
[INFO] [finished] fixing encoding: [07:07] [3.06GB] [7.16MB/s]
                                            
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
d:\Repos\XpressAI\vecto-reranking\1 - Dataset Exploration.ipynb Cell 6 in <cell line: 1>()
----> 1 for doc in train.docs_iter():
      2     doc

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\formats\tsv.py:30, in FileLineIter.__next__(self)
     28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
---> 30     line = self.stream.readline()
     31     if line != '\n':
     32         self.pos += 1

File ~\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined>

Expected behavior Decoding completes without error.
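
The last frame of the traceback shows the stream being decoded with cp1252, the usual default text encoding on Windows, and the `io.TextIOWrapper` call in `tsv.py` is shown without an `encoding` argument. The sketch below reproduces the same class of error on any platform. It assumes (this is not confirmed in the report) that the collection file is UTF-8 and contains a character such as U+201D, whose UTF-8 encoding ends in byte 0x9d. The helper `read_line` and the sample text are hypothetical and are not part of ir_datasets.

```python
import io

# Hypothetical sample text: U+201D (right double quotation mark)
# is encoded in UTF-8 as the three bytes E2 80 9D.
SAMPLE = "He said \u201dhi\u201d\n"
raw = SAMPLE.encode("utf-8")
assert b"\x9d" in raw


def read_line(data: bytes, encoding: str) -> str:
    """Wrap raw bytes in a text stream, as the traceback's TextIOWrapper does,
    but with an explicit encoding instead of the platform default."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding)
    return stream.readline()


# cp1252 has no mapping for byte 0x9d, so decoding raises UnicodeDecodeError.
try:
    read_line(raw, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# Decoding the same bytes as UTF-8 succeeds.
utf8_line = read_line(raw, "utf-8")
```

If that assumption holds, passing an explicit `encoding="utf-8"` wherever the text stream is created would avoid depending on the platform default.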


Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6

Top GitHub Comments

1 reaction
seanmacavaney commented, Sep 3, 2022

Thanks! I suspect it’s this issue: https://github.com/allenai/ir_datasets/issues/151

There’s a branch that fixes it, but for some reason, it hasn’t been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes

I’ll look into bringing the branch up to date with the changes made since it was created, and then into merging it into the main branch.

0 reactions
davidjurgens commented, Sep 25, 2022

Just to chime in, we’ve seen this same issue crop up with the irds:nfcorpus/dev dataset too. @seanmacavaney is there any update on getting the encoding-fix branch merged? I only ask because I assigned my class a homework involving this dataset, and now students who use Windows are reporting that they cannot load it without error.
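
Until a fix is merged, one possible workaround for Windows users is Python's UTF-8 mode (enabled with the `PYTHONUTF8=1` environment variable or the `-X utf8` command-line option, available since Python 3.7), which makes UTF-8 the default text encoding. Whether this resolves the error for these datasets is an assumption drawn from the traceback, not something confirmed in this thread. The sketch below only inspects the current interpreter's settings; the two helper names are hypothetical.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Encoding used for text streams when no explicit encoding is given."""
    return locale.getpreferredencoding(False)


def utf8_mode_enabled() -> bool:
    """True if the interpreter was started in UTF-8 mode."""
    return sys.flags.utf8_mode == 1


print("default text encoding:", default_text_encoding())
print("UTF-8 mode enabled:", utf8_mode_enabled())
```

On an affected Windows setup the first line would be expected to report cp1252. After setting `PYTHONUTF8=1` and starting a new interpreter, UTF-8 mode should be reported as enabled.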


