BUG: read_csv fails with an encoding other than UTF-8 when memory_map is set to True in version 1.2.4
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

# Build a small frame and write it out as UTF-16.
df = pd.DataFrame({'name': ['Raphael', 'Donatello', 'Miguel Angel', 'Leonardo'],
                   'mask': ['red', 'purple', 'orange', 'blue'],
                   'weapon': ['sai', 'bo staff', 'nunchunk', 'katana']})
df.to_csv("tmnt.csv", index=False, encoding="utf-16")
# Reading it back with memory_map=True raises UnicodeDecodeError on 1.2.4.
pd.read_csv(filepath_or_buffer="tmnt.csv", encoding="utf-16", sep=",", header=0, decimal=".", memory_map=True)
Problem description
This works perfectly with version 1.1.1 but fails in 1.2.4. memory_map is a useful feature because it reduces I/O overhead, so I think this is worth looking into, especially since the regression could break a lot of existing code.
Expected Output
           name    mask    weapon
0       Raphael     red       sai
1     Donatello  purple  bo staff
2  Miguel Angel  orange  nunchunk
3      Leonardo    blue    katana
Traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 1898, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 518, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 649, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.8.final.0
python-bits : 64
OS : Darwin
OS-release : 20.2.0
Version : Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.2.4
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 54.0.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.1
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

I looked into pandas/io/common.py and pandas/io/parsers.py and found some points to discuss.
1. At line 783 of common.py, bytes are always decoded as "utf-8" when the memory_map option is set; the encoding option has no effect.
2. "utf-16" bytes returned by f.readline() cannot be decoded directly when the line ends with "\n".
Code Snippet.
3. Once the file handle is wrapped in an mmap object, the wrapped handle behaves differently from the raw handle. This may be related to point 2.
Code snippet
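The original snippets for points 2 and 3 are not reproduced here. As a rough illustration of points 1 and 2, the sketch below uses only the standard library and no pandas internals. It assumes a little-endian UTF-16 payload with a BOM, built explicitly so the byte layout does not depend on the machine, and the helper name try_decode is made up for this example. A byte-level readline() stops at the first 0x0A byte, which is only the first half of the two-byte UTF-16 newline, so the returned chunk has an odd length.

```python
import codecs
import io

# Assumption: little-endian UTF-16 with a BOM, built by hand so the
# byte layout is the same on every platform.
text = "name,mask\nRaphael,red\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text or a short error summary."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


# A byte-level readline() splits right after the 0x0A byte and leaves the
# trailing 0x00 of the UTF-16 newline for the next call.
first_line = io.BytesIO(raw).readline()

# Point 1: decoding as UTF-8 regardless of the requested encoding fails on the BOM.
utf8_result = try_decode(first_line, "utf-8")

# Point 2: even the right codec fails, because the chunk is cut mid-character.
utf16_result = try_decode(first_line, "utf-16")

# Decoding at the stream level, before splitting into lines, works.
wrapped_line = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16").readline()

print(len(first_line), utf8_result, utf16_result, repr(wrapped_line), sep="\n")
```

This suggests that decoding has to happen on the whole stream, not line by line on raw bytes. Whether that matches the eventual fix in pandas is not something this sketch can confirm.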
Does it work with the python engine on 1.1.x and 1.2.x?
pd.read_csv(filepath_or_buffer="tmnt.csv", encoding="utf-16", memory_map=True, ..., engine="python")
I will have more time to look into this next week.
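One way to answer the engine question would be to run a small check like the one below under each pandas version of interest. This is only a sketch: it assumes pandas is installed, uses a cut-down version of the frame from the report, and writes to a temporary file. The results dictionary and its "ok" marker are made up for this example. It records whether each engine and memory_map combination reads the file or raises, without assuming what any particular version does.

```python
import os
import tempfile

import pandas as pd

# Cut-down version of the frame from the report.
df = pd.DataFrame({"name": ["Raphael", "Donatello"], "mask": ["red", "purple"]})

results = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tmnt.csv")
    df.to_csv(path, index=False, encoding="utf-16")

    # Try every engine / memory_map combination and record the outcome.
    for engine in ("c", "python"):
        for memory_map in (False, True):
            try:
                pd.read_csv(path, encoding="utf-16", engine=engine, memory_map=memory_map)
                results[(engine, memory_map)] = "ok"
            except Exception as exc:  # record the exception type, whatever it is
                results[(engine, memory_map)] = type(exc).__name__

print("pandas", pd.__version__)
for (engine, memory_map), outcome in results.items():
    print(f"engine={engine!r} memory_map={memory_map}: {outcome}")
```

Comparing the printed table between 1.1.x and 1.2.x would show whether the regression is specific to the C engine.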