s3.open() doesn't seem to understand user-defined encoding
I have a UTF-8-encoded .csv file with user-defined data that I’m trying to upload to my S3 bucket.
My script is as follows. It works perfectly with a dummy dataframe, and it also works if I save the data locally with the built-in open() function (see the reference snippet after the script), but with my .csv it breaks at the last line:
import s3fs  # import shown explicitly; `session` and `data` are created in the elided setup

# [..] other stuff
s3 = s3fs.S3FileSystem(session=session)
with s3.open('my_bucket/my_file.csv', 'w', encoding='utf-8') as output:
    data.to_csv(output, index=False, encoding='utf-8')
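For reference, the local write that succeeds looks roughly like this (the local filename is illustrative):

# works: the built-in open() honors encoding='utf-8' on the text stream
with open('my_file.csv', 'w', encoding='utf-8') as output:
    data.to_csv(output, index=False)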
and here’s the traceback:
Traceback (most recent call last):
  File "s3_csv_gz.py", line 35, in <module>
    upload(file_path)
  File "s3_csv_gz.py", line 30, in upload
    data.to_csv(output, index=False, encoding='utf-8')
  File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
    formatter.save()
  File "C:\Anaconda\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
    self._save()
  File "C:\Anaconda\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
    self._save_chunk(start_i, end_i)
  File "C:\Anaconda\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
  File "C:\Anaconda\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-22: character maps to <undefined>
As you can see from the last call, for some reason the codec module used is cp1252.py instead of utf-8.py. Now, I’m not 100% sure that’s how it’s supposed to work, but I’m quite certain that cp1252 has nothing to do with UTF-8.
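Where cp1252 comes from: it is Python’s platform-preferred encoding on Western-locale Windows, which the text stream falls back to when the requested encoding isn’t applied. A quick way to confirm on the affected machine (a minimal sketch):

import locale
print(locale.getpreferredencoding())  # typically 'cp1252' on Western-locale Windows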
Is there a way to circumvent this? I’d really love to use this package instead of boto3 to upload my files, but I can’t seem to make it work.
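One possible workaround (a sketch, not taken from the issue itself): render the CSV in memory, encode it explicitly, and upload the bytes, so no platform-default codec is ever involved:

# data.to_csv() with no target returns the CSV as a str;
# encoding it ourselves sidesteps the default-codec problem
csv_bytes = data.to_csv(index=False).encode('utf-8')
with s3.open('my_bucket/my_file.csv', 'wb') as output:
    output.write(csv_bytes)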
Top GitHub Comments
Ah, so there was indeed a bug, fixed in the linked PR. In the way you are doing it now with ‘wb’, the file is opened in binary mode, and I suppose pandas is doing the right thing in dealing with that.
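A minimal sketch of that ‘wb’ route, assuming a pandas version that accepts binary file handles in to_csv (documented from pandas 1.2 onward):

# in binary mode, pandas encodes the text itself using the
# encoding argument, so no platform default is consulted
with s3.open('my_bucket/my_file.csv', 'wb') as output:
    data.to_csv(output, index=False, encoding='utf-8')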
I was surprised as well; apparently the compression only works when passing a file path as the argument, not a file object, for some magical reason. A quick Google search shows that this has already been noticed by the community but never addressed.
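One way to sidestep that quirk (a sketch, not taken from the thread): compress manually by wrapping the binary S3 file object in gzip.GzipFile, rather than relying on to_csv’s compression= argument, which needs a path:

import gzip

# gzip the encoded CSV ourselves so compression works with a file object
with s3.open('my_bucket/my_file.csv.gz', 'wb') as raw:
    with gzip.GzipFile(fileobj=raw, mode='wb') as gz:
        gz.write(data.to_csv(index=False).encode('utf-8'))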