UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>
Description
pandas.CSVDataSet with default arguments hits a Unicode error on load:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 78552: character maps to <undefined>
Steps to reproduce
catalog.yml
sample:
  type: pandas.CSVDataSet
  filepath: data2/sample.csv
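Python script to create and load the dataset
import pandas as pd

df = pd.DataFrame()
df['test'] = ['123']
df['test'] = '\x9d'
df.to_csv('sample.csv')  # to data/sample.csv
pd.read_csv('sample.csv')  # success
# `catalog` is the project's Kedro DataCatalog built from catalog.yml
catalog.load('sample')  # UnicodeDecodeError: 'charmap'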
Kedro 0.16.4, Windows 10

~I think this is a discrepancy in pandas’ API, not Kedro.~
When writing data, the `pandas.CSVDataSet` uses pandas' `to_csv` method. This method uses `utf-8` as the default encoding.

On the other hand, when reading data, the `pandas.CSVDataSet` uses pandas' `read_csv` method. The subtle difference here is that this method has an optional encoding instead of a `utf-8` default. If it's not provided, the dataset defers to the encoding of the `open()` method on the filesystem, which uses the system's default encoding. On some Windows systems that is `cp1252`, hence the `charmap` error message.

You can fix this by changing the encoding on either the reading side or the writing side. Both are equally valid.

- Reading side: add `encoding` to both `load_args`, which will then pass it to pandas' `read_csv`, and `open_args_load`, which will then pass it to the `open()` method, so both use `utf-8` correctly.
- Writing side: add `cp1252` as the encoding in `save_args`, so pandas writes your file with your system's default encoding and there is no need to adjust the load args.

I hope this helps.
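The mismatch described in this comment can be reproduced without Kedro or pandas. The following is only a minimal sketch using Python's built-in `open()`: the file names, the helper functions `write_text` and `read_text`, and the sample character U+201D (whose UTF-8 bytes end in `0x9d`) are assumptions made for illustration, not taken from the issue. `cp1252` is passed explicitly to stand in for the Windows default, so the sketch behaves the same on any platform.

```python
import os
import tempfile

# Hypothetical sample data: U+201D is encoded as bytes E2 80 9D in UTF-8,
# and byte 0x9d has no mapping in Python's cp1252 codec.
TEXT = "test\n\u201d\n"


def write_text(path, text, encoding):
    """Write text with an explicit encoding (illustrative helper)."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def read_text(path, encoding):
    """Read text with an explicit encoding (illustrative helper)."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


tmp_dir = tempfile.mkdtemp()
utf8_path = os.path.join(tmp_dir, "utf8.csv")
cp1252_path = os.path.join(tmp_dir, "cp1252.csv")

# Mismatch: written as UTF-8, read back as cp1252.
write_text(utf8_path, TEXT, "utf-8")
try:
    read_text(utf8_path, "cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Fix on the reading side: decode with the encoding used for writing.
read_side = read_text(utf8_path, "utf-8")

# Fix on the writing side: encode with the encoding the reader will use.
write_text(cp1252_path, TEXT, "cp1252")
write_side = read_text(cp1252_path, "cp1252")

print(error_message)
```

Either fix works as long as writer and reader agree. Note that the writing-side fix only works for characters that `cp1252` can represent; the literal `'\x9d'` from the reproduction script is not one of them.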
@noklam I believe this is addressed in 0.17.0, as per commit referenced above. Could you please confirm if that’s the case, and if so consider closing the issue?