BUG: TextFileReader uses an incorrect encoding to test the size of the separator
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Run in UTF-8 mode (`python -X utf8 reproducible_example.py`) just in case.
import pandas as pd
import tempfile
with tempfile.NamedTemporaryFile("wt", encoding="latin-1") as f:
    f.write("key\u00A5value\ntables\u00A5rectangular")
    f.seek(0)
    pd.read_csv(f.name, sep="\u00A5", encoding="latin-1")
Issue Description
Although the encoding of the file to be read is clearly specified as "latin-1" (ISO 8859-1), which encodes U+00A5 ¥ YEN SIGN as a single octet, read_csv raises the following warning:
.../venv/lib/python3.9/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the separator encoded in utf-8 is > 1 char long, and the 'c' engine does not support such separators; you can avoid this warning by specifying engine='python'.
  return func(*args, **kwargs)
The cause of this bug is that TextFileReader erroneously uses sys.getfilesystemencoding() to determine the encoding. Even as a default, this would be incorrect: the filesystem encoding is about file names, not about file contents.
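The mismatch can be checked directly in Python. The following is a minimal sketch that only measures the encoded size of the separator and prints the filesystem encoding; it does not call pandas, and the variable names are chosen here for illustration.

```python
import sys

sep = "\u00A5"  # U+00A5 YEN SIGN, the separator from the example above

# Size of the separator in the encoding of the file contents (ISO 8859-1)
# versus its size in UTF-8.
latin1_len = len(sep.encode("latin-1"))
utf8_len = len(sep.encode("utf-8"))
print(latin1_len, utf8_len)

# The filesystem encoding describes how file names are encoded. It is
# unrelated to the encoding= argument passed to read_csv, and its value
# depends on the platform and on how Python was started.
print(sys.getfilesystemencoding())
```

When the filesystem encoding is UTF-8, a test against it sees a two-octet separator even though the file itself stores it as a single octet.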
This bug has been present from the commit that initially added this warning.
Expected Behavior
The encoding used for the separator-length test should be based on the "encoding" keyword argument (or its default value). In this case, no warning should be produced.
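A rough sketch of the expected check follows. Both functions are hypothetical helpers written for this illustration and are not part of pandas; the sketch also assumes that encoding=None should be treated as UTF-8. It only illustrates which encoding the length test would use, and says nothing about whether the C parser could then handle the separator.

```python
from typing import Optional


def separator_byte_length(sep: str, encoding: Optional[str] = None) -> int:
    """Hypothetical helper (not pandas code): size of `sep` in the file's encoding."""
    # Assumption: encoding=None is treated as UTF-8.
    return len(sep.encode(encoding or "utf-8"))


def needs_python_engine(sep: str, encoding: Optional[str] = None) -> bool:
    """Hypothetical check: True when the encoded separator is wider than one byte."""
    return separator_byte_length(sep, encoding) > 1


# The separator from the example, tested against the encoding of the file contents.
print(needs_python_engine("\u00A5", "latin-1"))
# The same separator tested against UTF-8, which is what triggers the warning.
print(needs_python_engine("\u00A5", "utf-8"))
```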
Installed Versions
INSTALLED VERSIONS
commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-12-amd64
Version : #1 SMP Debian 5.10.103-1 (2022-03-07)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Issue Analytics
- Created 2 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
I just naively removed the warning, but the c-engine then throws an error: there seems to be a reason why this warning is there 😉
Unless you are familiar with cython, I wouldn’t recommend this as a first contribution. And before working on it, it would be good to get feedback from someone who is familiar with the c-engine (not me).
It looks like the C engine in _libs/parsers.pyx hardcodes the encoding as utf-8 in various places. So it would probably be best to check the encoding in io/parsers/readers.py, and if the separator character does not conform to the length == 1 requirement in UTF-8, it could silently change the engine to python instead of producing a warning. Alternatively, OP could just pass the parameter engine="python" to read_csv and get rid of the warning.
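The workaround from this comment can be sketched as follows. For brevity the example reads from an in-memory buffer instead of a temporary file, which is a change from the original reproducer; the data and separator are the same.

```python
import io
import warnings

import pandas as pd

# Same contents as the reproducer, encoded as ISO 8859-1 bytes.
data = "key\u00A5value\ntables\u00A5rectangular".encode("latin-1")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Requesting the python engine explicitly means there is no fallback
    # from the C engine, which is what the ParserWarning reports.
    df = pd.read_csv(
        io.BytesIO(data), sep="\u00A5", encoding="latin-1", engine="python"
    )

# Collect any ParserWarning raised during the read, for inspection.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]
print(df)
print(len(parser_warnings))
```

This avoids the warning but does not address the underlying issue, since the file is no longer parsed by the C engine.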