Suppress UnicodeEncodeError when executing to_csv method
Code Sample, a copy-pastable example if possible
# error pattern
import pandas as pd
unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
df.to_csv("./test.csv", encoding="cp932") # UnicodeEncodeError: 'cp932' codec can't encode character '\u070a' in position 6: illegal multibyte sequence
# good pattern
import pandas as pd
unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
with open("./test.csv", mode="w", encoding="cp932", errors="ignore") as f:
    df.to_csv(f)
Problem description
UnicodeEncodeError occurs when executing to_csv with the encoding parameter set to SHIFT-JIS or cp932.
We can avoid this error using with open (the good pattern above), but that code is verbose.
So I want to suppress UnicodeEncodeError via a to_csv parameter.
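For context, the with open workaround works because Python's codecs accept an errors argument, which open() simply forwards. A minimal sketch of the underlying behavior:

# Python's codec error handlers, which the workaround relies on:
print("\u070a".encode("cp932", errors="ignore"))   # b'' -- character dropped
print("\u070a".encode("cp932", errors="replace"))  # b'?' -- substituted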
Expected Output
# good pattern
import pandas as pd
unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
df.to_csv("./test.csv", encoding="cp932", ignore_error=True)
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 0.25.0
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.6.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.1
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
Closing as to_csv (errors) and read_csv (encoding_errors) both have arguments to ignore encoding errors.
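A minimal sketch of those two arguments (to_csv's errors requires pandas >= 1.1, read_csv's encoding_errors requires pandas >= 1.3):

import pandas as pd

df = pd.DataFrame([["key", "\u070a"]])
# "ignore" drops characters that cp932 cannot represent instead of raising
df.to_csv("./test.csv", encoding="cp932", errors="ignore")
# the reading counterpart tolerates undecodable bytes the same way
df2 = pd.read_csv("./test.csv", encoding="cp932", encoding_errors="ignore")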
On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) at the start of the file. Files store bytes, so all Unicode text has to be encoded into bytes before it can be stored in a file. read_csv takes an encoding option to deal with files in different formats, so you have to specify an encoding, such as utf-8:
df.to_csv(r'D:\panda.csv', sep='\t', encoding='utf-8')  # raw string avoids backslash escapes in the path
If you don’t specify an encoding, then the encoding used by df.to_csv defaults to ascii in Python2, or utf-8 in Python3.
Also, you can encode a problematic series first and then decode it back to utf-8.
df['column-name'] = df['column-name'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
This will also rectify the problem.
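To illustrate what the round trip does: non-ASCII characters become literal escape sequences, so the resulting text survives any output encoding. A small sketch:

import pandas as pd

df = pd.DataFrame({"col": ["key", "\u070a"]})
df["col"] = df["col"].map(lambda x: x.encode("unicode-escape").decode("utf-8"))
print(df["col"].tolist())  # ['key', '\\u070a'] -- now plain ASCII text
df.to_csv("./test.csv", encoding="cp932")  # no longer raises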