
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>


Description

Loading a pandas.CSVDataSet with default arguments raises a UnicodeDecodeError:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>

Steps to reproduce

catalog.yml

sample:
  type: pandas.CSVDataSet
  filepath: data2/sample.csv

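Python script to create and load the dataset

import pandas as pd

df = pd.DataFrame()
df['test'] = ['123']
df['test'] = '\x9d'  # overwrite the column with a character outside cp1252

df.to_csv('sample.csv') # to data/sample.csv
pd.read_csv('sample.csv') # success
# `catalog` is the project's Kedro DataCatalog built from catalog.yml
catalog.load('sample') # UnicodeDecodeError: 'charmap'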

Kedro: 0.16.4, Windows 10


Top GitHub Comments

limdauto commented, Aug 28, 2020

~I think this is a discrepancy in pandas’ API, not Kedro.~

When writing data, the pandas.CSVDataSet uses pandas' to_csv method. This method uses utf-8 as the default encoding.

On the other hand, when reading data, the pandas.CSVDataSet uses pandas' read_csv method. The subtle difference here is that this method has an optional encoding instead of a utf-8 default. If it's not provided, the dataset defers to the encoding of the open() method on the filesystem, which uses the system's default encoding. On some Windows systems that is cp1252, hence the charmap error message.
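The mismatch described above can be reproduced without Kedro or pandas. The following is a minimal sketch using only the standard library: the file contents and variable names are hypothetical stand-ins for the CSV that to_csv writes, and cp1252 is forced explicitly so the behaviour does not depend on the machine's default encoding.

```python
import os
import tempfile

# Hypothetical stand-in for a CSV that was written as UTF-8.
text = "test\n\x9d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    # Writing side: UTF-8, the default encoding of pandas' to_csv.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Reading side with an explicit UTF-8 encoding round-trips cleanly.
    with open(path, encoding="utf-8", newline="") as f:
        utf8_result = f.read()

    # Reading side with cp1252 (a common Windows default) fails on byte 0x9d.
    try:
        with open(path, encoding="cp1252", newline="") as f:
            f.read()
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc

print(repr(utf8_result))
print(cp1252_error)
```

Reading with the encoding that was used for writing succeeds, while reading the same bytes as cp1252 raises a UnicodeDecodeError of the kind reported in this issue.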

You can fix this by either changing the encoding on the reading side or the writing side. Both are equally valid.

On the reading side, you need to pass encoding to both load_args, which passes it on to pandas' read_csv, and open_args_load, which passes it on to the open() method, so that both use utf-8.
Or, on the writing side, you can pass cp1252 to save_args so that pandas writes your file with your system's default encoding and the load args need no adjustment.
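As a rough illustration of the reading-side option, the sketch below mirrors the catalog entry as a Python dict and checks that the open() arguments decode a UTF-8 file. It makes some assumptions: the dict is a hypothetical Python rendering of the catalog.yml entry, open_args_load is assumed to sit under an fs_args key (as in Kedro 0.16-era datasets, so check the docs for your version), and the file handling is plain standard library code rather than Kedro's own loader.

```python
import os
import tempfile

# Hypothetical Python mirror of the catalog.yml entry with the reading-side fix.
# Assumption: open_args_load is nested under fs_args in this Kedro version.
sample_entry = {
    "type": "pandas.CSVDataSet",
    "filepath": "data2/sample.csv",
    # Forwarded to pandas' read_csv.
    "load_args": {"encoding": "utf-8"},
    # Forwarded to the filesystem's open() call when loading.
    "fs_args": {"open_args_load": {"encoding": "utf-8"}},
}

text = "test\n\x9d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Stand-in for the dataset's load step: open() with the configured args.
    open_args = sample_entry["fs_args"]["open_args_load"]
    with open(path, newline="", **open_args) as f:
        loaded = f.read()

# Caveat for the writing-side option: Python's cp1252 codec has no mapping
# for the character '\x9d' used in the reproduction, so encoding it fails.
try:
    "\x9d".encode("cp1252")
    write_side_fails = False
except UnicodeEncodeError:
    write_side_fails = True

print(repr(loaded), write_side_fails)
```

With the particular character from the reproduction script, writing as cp1252 would itself raise an encoding error, so the reading-side change is likely the more practical choice for that data.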

I hope this helps.

lorenabalan commented, Jan 12, 2021

@noklam I believe this is addressed in 0.17.0, as per the commit referenced above. Could you please confirm whether that's the case and, if so, consider closing the issue?