to_csv() surrogates not allowed

Code Sample

import pandas as pd

s = '\ud800'
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')

Stack trace:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
      4 srs = pd.Series()
      5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
   3779                            index_label=index_label, mode=mode,
   3780                            encoding=encoding, compression=compression,
-> 3781                            date_format=date_format, decimal=decimal)
   3782         if path is None:
   3783             return result

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    169                 self.writer = UnicodeWriter(f, **writer_kwargs)
    170 
--> 171             self._save()
    172 
    173         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
    284                 break
    285 
--> 286             self._save_chunk(start_i, end_i)
    287 
    288     def _save_chunk(self, start_i, end_i):

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    311 
    312         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed

Problem description

Unicode surrogates in a DataFrame (or Series) cause .to_csv() to raise a UnicodeEncodeError. The same problem was already fixed in .to_hdf() by adding an errors= argument, which lets the caller select the surrogatepass or surrogateescape error handler.

See the original bug report and the PR that fixed it.
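
For readers unfamiliar with the two error handlers named above, the following is a minimal sketch using only Python's built-in str.encode() and bytes.decode(); it does not involve pandas, and the variable names are illustrative. It suggests that surrogatepass is the handler that lets a lone surrogate such as '\ud800' through, while surrogateescape only covers the narrower range U+DC80 to U+DCFF.

```python
lone = "\ud800"

# The default "strict" handler refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogatepass encodes the surrogate code point as a three-byte sequence,
# and the same handler is needed to decode those bytes back.
passed = lone.encode("utf-8", errors="surrogatepass")
restored = passed.decode("utf-8", errors="surrogatepass")

# surrogateescape only handles U+DC80..U+DCFF (bytes smuggled in on decode),
# so it still fails for '\ud800' but succeeds for '\udc80'.
try:
    lone.encode("utf-8", errors="surrogateescape")
    escape_failed = False
except UnicodeEncodeError:
    escape_failed = True
escaped = "\udc80".encode("utf-8", errors="surrogateescape")

print(strict_failed, passed, restored == lone, escape_failed, escaped)
```

Bytes written with surrogatepass are not valid UTF-8 for strict decoders, so whatever reads the file later needs the same handler.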

Expected Output

No error.

Output of `pd.show_versions()`

I forgot to grab this before the end of my workshop, and I had already destroyed the cloud instance. Sorry. I think it was Python 3.6 and pandas 0.23.4.

Top GitHub Comments

obilodeau commented, Sep 5, 2018

This code, which uses a plain open():

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w") as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

will yield the error below:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-12-c276abf97bef> in <module>()
      3 with open("test-you-can-delete.csv", "w") as _file:
      4    writer = csv.writer(_file)
----> 5    writer.writerow(row)

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But open() supports passing a codec error handler with the errors= named argument:

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w", errors='surrogatepass') as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

This doesn’t generate an error.
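
As a rough follow-up sketch (not part of the original comment), the file written this way can be read back if the same error handler is passed when reopening it. The scratch path below is hypothetical, and encoding="utf-8" is set explicitly so the result does not depend on the platform's default encoding.

```python
import csv
import os
import tempfile

row = ["\ud800"]  # one field containing a lone surrogate

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "surrogate-roundtrip.csv")

# Write with surrogatepass so the lone surrogate is encoded instead of rejected.
with open(path, "w", encoding="utf-8", errors="surrogatepass", newline="") as f:
    csv.writer(f).writerow(row)

# Read back with the same handler; a strict reader would fail on these bytes.
with open(path, "r", encoding="utf-8", errors="surrogatepass", newline="") as f:
    rows = list(csv.reader(f))

print(rows == [row])
```

The same idea should carry over to pandas by passing an already opened file object to to_csv() instead of a path, although that variant is not tested here.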

Implementing the errors= named argument in to_csv() would satisfy the principle of least surprise. To me, this is the way to go. Having to explain why every field must be re-encoded with encode() before calling to_csv(), when everything else worked without it (and used to work without it), was a painful moment for young data scientists.
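
The re-encoding workaround mentioned above might look roughly like the sketch below. The helper name scrub_surrogates is hypothetical and not part of pandas or the standard library; it relies only on str.encode() with the "replace" handler, which substitutes "?" for each character the codec cannot encode. This is lossy, which is part of why an errors= argument is the more convenient option.

```python
def scrub_surrogates(value):
    """Replace lone surrogates in a str with '?'; return other values unchanged."""
    if not isinstance(value, str):
        return value
    # "replace" swaps each unencodable character for "?"; valid text is untouched.
    return value.encode("utf-8", errors="replace").decode("utf-8")


fields = ["plain", "bad\ud800char", "caf\u00e9", 42]
cleaned = [scrub_surrogates(v) for v in fields]
print(cleaned)
```

With pandas, a helper like this could presumably be applied to a Series through its map() method before calling to_csv().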

roberthdevries commented, Mar 14, 2020

take