to_csv() surrogates not allowed
Code Sample
import pandas as pd

s = '\ud800'
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')
Stack trace:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
4 srs = pd.Series()
5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
3779 index_label=index_label, mode=mode,
3780 encoding=encoding, compression=compression,
-> 3781 date_format=date_format, decimal=decimal)
3782 if path is None:
3783 return result
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
1743 doublequote=doublequote,
1744 escapechar=escapechar, decimal=decimal)
-> 1745 formatter.save()
1746
1747 if path_or_buf is None:
/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
169 self.writer = UnicodeWriter(f, **writer_kwargs)
170
--> 171 self._save()
172
173 finally:
/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
284 break
285
--> 286 self._save_chunk(start_i, end_i)
287
288 def _save_chunk(self, start_i, end_i):
/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
311
312 libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313 self.cols, self.writer)
pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed
Problem description
The presence of Unicode surrogates in a DataFrame (or Series) causes an error in .to_csv(). This has already been fixed in .to_hdf() by allowing the errors= argument to be used, where we can pass the surrogatepass or surrogateescape error handler.
See the original bug report and the PR that fixed it.
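For readers unfamiliar with the two error handlers named above, here is a minimal standard-library sketch of how they behave with str.encode(). It does not use pandas, and the variable names are only illustrative.

```python
s = "\ud800"  # a lone (unpaired) high surrogate, as in the code sample

# The default handler ("strict") refuses to encode lone surrogates.
try:
    s.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass" encodes the surrogate code point as if it were a
# regular character, so the value survives a round trip.
passed = s.encode("utf-8", errors="surrogatepass")
roundtrip = passed.decode("utf-8", errors="surrogatepass")

# "surrogateescape" only covers U+DC80..U+DCFF, the surrogates Python
# produces when decoding undecodable bytes; it maps them back to bytes.
escaped = "\udcff".encode("utf-8", errors="surrogateescape")

print(strict_failed, passed, roundtrip == s, escaped)
```

Note that surrogateescape would still reject '\ud800' itself, so for the string in this report surrogatepass is the handler that applies.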
Expected Output
No error.
Output of pd.show_versions()
I forgot to grab this before the end of my workshop and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4 I think.
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (7 by maintainers)
Top GitHub Comments
Writing the same string to a file through a plain open() yields the same UnicodeEncodeError.
But open() supports passing a codec error handler with the errors= named argument, and with that handler set the write doesn't generate an error.
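The snippets from this comment did not survive in this copy of the thread, so the following is only a sketch of the comparison it describes, not the commenter's original code. It uses the standard csv module rather than pandas, and the temporary path and column names are assumptions made for illustration.

```python
import csv
import os
import tempfile

s = "\ud800"  # the lone surrogate from the code sample
path = os.path.join(tempfile.mkdtemp(), "testcase.csv")

# Plain open(): the default "strict" handler raises on the surrogate.
try:
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(s)
    plain_open_failed = False
except UnicodeEncodeError:
    plain_open_failed = True

# open() with errors="surrogatepass": the same value is written without error.
with open(path, "w", encoding="utf-8", errors="surrogatepass", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "value"])
    writer.writerow([0, s])

# Reading back needs the same handler, otherwise decoding fails.
with open(path, "r", encoding="utf-8", errors="surrogatepass", newline="") as f:
    rows = list(csv.reader(f))

print(plain_open_failed, rows)
```

If I recall correctly, later pandas releases did add an errors= argument to to_csv(); check the documentation for your installed version before relying on it.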
Implementing the named argument errors= in to_csv() satisfies the principle of least surprise. To me, this is the way to go. Having to explain why all fields should be re-encoded with encode() before using to_csv() while everything else worked without it (and used to work without it before) was a painful moment for young data scientists.
take