UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>

See original GitHub issue

Description

pandas.CSVDataSet with default arguments hits a Unicode error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>

Steps

Steps to reproduce:

catalog.yml

sample:
  type: pandas.CSVDataSet
  filepath: data2/sample.csv

Python script to create and load the dataset

import pandas as pd

df = pd.DataFrame()
df['test'] = ['123']
df['test'] = '\x9d'

df.to_csv('sample.csv')  # to data/sample.csv
pd.read_csv('sample.csv')  # success
catalog.load('sample')  # catalog is the project's Kedro DataCatalog; raises UnicodeDecodeError: 'charmap'

Kedro: 0.16.4, Windows 10

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

1 reaction
limdauto commented, Aug 28, 2020

~I think this is a discrepancy in pandas’ API, not Kedro.~

When writing data, pandas.CSVDataSet uses pandas' to_csv method. This method uses utf-8 as the default encoding.

On the other hand, when reading data, pandas.CSVDataSet uses pandas' read_csv method. The subtle difference is that the encoding here is optional rather than defaulting to utf-8. If it is not provided, the dataset defers to the encoding of the open() call on the filesystem, which uses the system's default encoding. On some Windows systems that is cp1252, hence the charmap error message.
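
The mismatch described above can be reproduced without Kedro or pandas. The following is a minimal sketch using only Python's built-in codecs. It assumes the file was written as utf-8 and then decoded as cp1252, which is the scenario the comment describes. The variable names are illustrative only.

```python
# The character from the reproduction script, U+009D.
text = "\x9d"

# Writing side: utf-8 encodes it as two bytes, the second being 0x9d.
utf8_bytes = text.encode("utf-8")

# Reading with the matching codec round-trips cleanly.
decoded_ok = utf8_bytes.decode("utf-8")

# Reading with cp1252 fails: that code page defines no character at 0x9d,
# so Python's charmap-based codec raises UnicodeDecodeError.
try:
    utf8_bytes.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The byte position in the message differs from the one in the issue because this sketch decodes a two-byte string rather than a full CSV file.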

You can fix this by either changing the encoding on the reading side or the writing side. Both are equally valid.

  • On the reading side, pass encoding to both load_args, which forwards it to pandas' read_csv, and open_args_load, which forwards it to open(), so that both use utf-8.
  • Or, on the writing side, pass cp1252 to save_args so that pandas writes the file in your system's default encoding, with no need to adjust the load args.
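
Both options come down to making the writer and the reader agree on one encoding. Here is a rough sketch of that idea using only the standard library, with plain open() and the csv module standing in for the dataset. The helper names write_csv and read_csv are hypothetical. How the encoding keys map onto a catalog entry (for example, whether open_args_load sits under a parent key) is an assumption to check against your Kedro version. The sample value is U+201D, a character that cp1252 can represent, so that the write-side option can be shown end to end.

```python
import csv
import os
import tempfile


def write_csv(path, rows, encoding):
    # Stand-in for the save path: the encoding is passed explicitly.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path, encoding):
    # Stand-in for the load path: the same encoding goes to open().
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


rows = [["test"], ["\u201d"]]
tmpdir = tempfile.mkdtemp()

# Option 1 (reading side): the file is utf-8, so read it as utf-8.
utf8_path = os.path.join(tmpdir, "sample_utf8.csv")
write_csv(utf8_path, rows, "utf-8")
read_side_fix = read_csv(utf8_path, "utf-8")

# Option 2 (writing side): write cp1252 so a cp1252 reader works unchanged.
cp1252_path = os.path.join(tmpdir, "sample_cp1252.csv")
write_csv(cp1252_path, rows, "cp1252")
write_side_fix = read_csv(cp1252_path, "cp1252")

print(read_side_fix == rows, write_side_fix == rows)
```

Being explicit about utf-8 on the reading side is usually the more portable choice, since the file then reads the same way on every platform.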

I hope this helps.

0 reactions
lorenabalan commented, Jan 12, 2021

@noklam I believe this is addressed in 0.17.0, as per commit referenced above. Could you please confirm if that’s the case, and if so consider closing the issue?
