Suppress UnicodeEncodeError when executing to_csv method

Code Sample, a copy-pastable example if possible

# error pattern
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
# Raises UnicodeEncodeError: 'cp932' codec can't encode character '\u070a' in position 6: illegal multibyte sequence
df.to_csv("./test.csv", encoding="cp932")

# good pattern
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
with open("./test.csv", mode="w", encoding="cp932", errors="ignore") as f:
    df.to_csv(f)

Problem description

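UnicodeEncodeError occurs when calling to_csv with the encoding parameter set to SHIFT-JIS or cp932 and the data contains a character that the codec cannot represent. The error can be avoided by opening the file with open(..., errors="ignore") and passing the file handle to to_csv (the good pattern above), but that code is redundant. I would like to be able to suppress UnicodeEncodeError through a to_csv parameter instead.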

Expected Output

# desired pattern (ignore_error is the proposed parameter, not an existing one)
import pandas as pd

unicode_data = [["key", "\u070a"]]
df = pd.DataFrame(unicode_data)
df.to_csv("./test.csv", encoding="cp932", ignore_error=True)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 0.25.0
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.6.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.1
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
twoertwein commented, Sep 12, 2021

Though, as per @bsolomon1124's remark, it would be meaningful to add this errors argument to read_csv as well.

Closing as to_csv (errors) and read_csv (encoding_errors) both have arguments to ignore encoding errors.
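
As an illustration of the two arguments mentioned in the closing comment, here is a minimal sketch. It assumes a pandas version recent enough to have both arguments (to my understanding, errors for to_csv arrived in pandas 1.1 and encoding_errors for read_csv in pandas 1.3). The temporary directory and the variable names are only for the example.

```python
import os
import tempfile

import pandas as pd

# Same data as in the issue: U+070A cannot be represented in cp932.
df = pd.DataFrame([["key", "\u070a"]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")

    # errors="ignore" silently drops characters the codec cannot encode
    # instead of raising UnicodeEncodeError.
    df.to_csv(path, encoding="cp932", errors="ignore")

    # Read the raw bytes back to see what was actually written.
    with open(path, "rb") as f:
        raw = f.read()

    # encoding_errors plays the same role on the reading side:
    # undecodable bytes are skipped instead of raising UnicodeDecodeError.
    restored = pd.read_csv(
        path, encoding="cp932", encoding_errors="ignore", index_col=0
    )

print(raw)
print(restored)
```

Note that "ignore" discards data without any warning. The other standard Python error handlers, such as "replace" or "backslashreplace", can be passed in the same way if losing characters silently is not acceptable.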

0 reactions
linehammer commented, Apr 5, 2021

On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) character at the start of the file. Files store bytes, which means all Unicode text has to be encoded into bytes before it can be stored in a file. read_csv takes an encoding option to deal with files in different formats, and to_csv accepts one as well. So, you have to specify an encoding, such as utf-8.

df.to_csv(r'D:\panda.csv', sep='\t', encoding='utf-8')

If you don’t specify an encoding, then the encoding used by df.to_csv defaults to ascii in Python 2, or utf-8 in Python 3.

Alternatively, you can escape the problematic characters in a series first: encode each value with unicode-escape, then decode the resulting bytes back to a string.

df['column-name'] = df['column-name'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

This will also rectify the problem.
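
To make the effect of that escape step concrete, here is a small sketch. The column names key and value are made up for the example, and the bytes are decoded as ascii, which is equivalent here because unicode-escape only produces ASCII output.

```python
import pandas as pd

# Hypothetical frame with one value that cp932 cannot encode.
df = pd.DataFrame({"key": ["a"], "value": ["\u070a"]})

# unicode-escape turns non-ASCII characters into textual escapes such as \uXXXX.
df["value"] = df["value"].map(
    lambda x: x.encode("unicode-escape").decode("ascii")
)

escaped = df.loc[0, "value"]

# The escaped text is plain ASCII, so encoding it to cp932 no longer raises.
encoded = escaped.encode("cp932")

print(escaped)
print(encoded)
```

Keep in mind that unicode-escape rewrites every non-ASCII character, including Japanese text that cp932 could encode perfectly well, and it also escapes backslashes and control characters such as newlines. The file then contains the escape sequences rather than the original characters, so this is only suitable when that is acceptable to whoever reads the CSV.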
