to_csv() surrogates not allowed

Code Sample

import pandas as pd

s = '\ud800'  # a lone (unpaired) high surrogate
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')  # raises UnicodeEncodeError

Stack trace:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
      4 srs = pd.Series()
      5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
   3779                            index_label=index_label, mode=mode,
   3780                            encoding=encoding, compression=compression,
-> 3781                            date_format=date_format, decimal=decimal)
   3782         if path is None:
   3783             return result

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    169                 self.writer = UnicodeWriter(f, **writer_kwargs)
    170 
--> 171             self._save()
    172 
    173         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
    284                 break
    285 
--> 286             self._save_chunk(start_i, end_i)
    287 
    288     def _save_chunk(self, start_i, end_i):

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    311 
    312         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed

Problem description

The presence of Unicode surrogates in a DataFrame (or Series) causes .to_csv() to raise a UnicodeEncodeError. The same problem has already been fixed in .to_hdf() by adding an errors= argument, which lets the caller select the surrogatepass or surrogateescape error handler.
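As a rough illustration of what those error handlers do, here is a minimal sketch that uses only Python's built-in str.encode and no pandas. It is not taken from the issue, and the sample string is hypothetical.

```python
# Sketch: how Python's codec error handlers treat a lone surrogate.
s = "a\ud800b"  # hypothetical sample containing an unpaired high surrogate

# The default handler ("strict") refuses to encode it.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes the surrogate code point like any other character.
passed = s.encode("utf-8", errors="surrogatepass")
round_trip = passed.decode("utf-8", errors="surrogatepass")

# "surrogateescape" only covers U+DC80..U+DCFF (the code points it produces
# when decoding undecodable bytes), so it still rejects '\ud800'.
try:
    s.encode("utf-8", errors="surrogateescape")
    escape_ok = True
except UnicodeEncodeError:
    escape_ok = False

# "replace" is lossy: the surrogate becomes '?'.
replaced = s.encode("utf-8", errors="replace")
```

For the '\ud800' value in this report, surrogatepass is the handler that lets the write go through; surrogateescape alone would not help with this particular character.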

See the original bug report and the PR that fixed it.

Expected Output

No error.

Output of pd.show_versions()

I forgot to grab this before the end of my workshop and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4 I think.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
obilodeau commented, Sep 5, 2018

This snippet, which uses a plain open() call:

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w") as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

will yield the error below:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-12-c276abf97bef> in <module>()
      3 with open("test-you-can-delete.csv", "w") as _file:
      4    writer = csv.writer(_file)
----> 5    writer.writerow(row)

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

But open() supports passing a codec error handler with the errors= named argument:

import csv
row = '\ud800'
with open("test-you-can-delete.csv", "w", errors='surrogatepass') as _file:
   writer = csv.writer(_file)
   writer.writerow(row)

This doesn’t generate an error.
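One caveat worth sketching: a file written with errors='surrogatepass' is not strictly valid UTF-8, so whoever reads it back needs the same handler. The example below assumes a temporary directory and a hypothetical file name, and uses only the standard library.

```python
import csv
import os
import tempfile

# Hypothetical scratch location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "surrogate.csv")

# Writing succeeds because the handler lets the lone surrogate through.
with open(path, "w", encoding="utf-8", errors="surrogatepass", newline="") as f:
    csv.writer(f).writerow(["\ud800"])

# Reading with the default (strict) handler fails on those bytes.
try:
    with open(path, encoding="utf-8", newline="") as f:
        f.read()
    strict_read_ok = True
except UnicodeDecodeError:
    strict_read_ok = False

# Reading with the same handler restores the original value.
with open(path, encoding="utf-8", errors="surrogatepass", newline="") as f:
    rows = list(csv.reader(f))
```

Any downstream consumer of such a file would presumably have to opt in to the same handler, which is a trade-off to weigh against replacing the offending characters.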

Implementing a named errors= argument in to_csv() would satisfy the principle of least surprise. To me, this is the way to go. Having to explain to young data scientists why every field had to be re-encoded with encode() before calling to_csv(), when everything else worked without it (and used to work without it before), was a painful moment.
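For readers wondering what the re-encoding workaround might look like, here is a minimal sketch using only the standard library. The helper name scrub_surrogates and the sample rows are hypothetical, and the approach is lossy: each lone surrogate is replaced with '?'.

```python
import csv
import io

def scrub_surrogates(value):
    """Hypothetical helper: replace lone surrogates in strings with '?'."""
    if isinstance(value, str):
        return value.encode("utf-8", errors="replace").decode("utf-8")
    return value  # leave non-string cells untouched

# Hypothetical data with one problematic cell.
rows = [["id", "text"], [0, "ok"], [1, "bad\ud800"]]

# A strict UTF-8 text layer over an in-memory byte buffer stands in for a file.
buffer = io.BytesIO()
text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
writer = csv.writer(text)
for row in rows:
    writer.writerow([scrub_surrogates(cell) for cell in row])
text.flush()
output = buffer.getvalue()
```

With pandas, the same helper could presumably be mapped over the string columns before calling to_csv(). As far as I know, pandas 1.1 and later also accept an errors= argument on to_csv() directly, which would make this kind of scrubbing unnecessary.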

0 reactions
roberthdevries commented, Mar 14, 2020

take

