BUG: TextFileReader uses an incorrect encoding to test the size of the separator

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Run in UTF-8 mode (`python -X utf8 reproducible_example.py`) just in case.

import pandas as pd
import tempfile

with tempfile.NamedTemporaryFile("wt", encoding="latin-1") as f:
    f.write("key\u00A5value\ntables\u00A5rectangular")
    f.seek(0)
    pd.read_csv(f.name, sep="\u00A5", encoding="latin-1")

Issue Description

Despite the fact that the encoding of the file to be read is clearly specified as "latin-1" (ISO 8859-1), which encodes U+00A5 ¥ YEN SIGN as a single octet, read_csv raises the following warning:

.../venv/lib/python3.9/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the separator encoded in utf-8 is > 1 char long, and the 'c' engine does not support such separators; you can avoid this warning by specifying engine='python'.
  return func(*args, **kwargs)

The cause of this bug is that TextFileReader erroneously uses sys.getfilesystemencoding() to determine the encoding. Even as a default, this would be incorrect: the filesystem encoding is about file names, not about file contents.
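To make the mismatch concrete, here is a minimal sketch using only the standard library. It compares the byte length of the separator under the file's declared encoding with its length under UTF-8, which is what the filesystem encoding usually is on a Linux system like the one reported here. The variable names are illustrative and are not taken from pandas.

```python
import sys

sep = "\u00A5"  # U+00A5 YEN SIGN, the separator from the example above

# Length under the encoding actually passed to read_csv
latin1_len = len(sep.encode("latin-1"))  # 1

# Length under UTF-8, the usual filesystem encoding on Linux
utf8_len = len(sep.encode("utf-8"))  # 2

print("latin-1 length:", latin1_len)
print("utf-8 length:", utf8_len)
print("filesystem encoding:", sys.getfilesystemencoding())
```

The separator is a single byte in the file's own encoding, so a length check made against the filesystem encoding reports a multi-byte separator that the file does not actually contain.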

This bug has been present since the commit that initially added this warning.

Expected Behavior

The encoding to be tested should be based on the encoding keyword argument (or its default value). In this case, no warning should be produced.
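A rough sketch of the check described here could look like the following. The helper name `separator_is_multibyte` is hypothetical and does not exist in pandas, and the fallback to UTF-8 when no encoding is given is an assumption about what "its default value" should mean.

```python
from typing import Optional


def separator_is_multibyte(sep: str, encoding: Optional[str] = None) -> bool:
    """Hypothetical helper: test the separator against the file's encoding.

    Assumes UTF-8 when no encoding is given; this is not pandas code.
    """
    effective_encoding = encoding or "utf-8"
    return len(sep.encode(effective_encoding)) > 1


# With the encoding from the example, the separator is a single byte.
print(separator_is_multibyte("\u00A5", encoding="latin-1"))
# Without an explicit encoding, the UTF-8 fallback makes it two bytes.
print(separator_is_multibyte("\u00A5"))
```

Note that the maintainer comments further down point out that the C engine itself assumes UTF-8, so changing only this check would not be enough to make the C engine accept the separator.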

Installed Versions

INSTALLED VERSIONS

commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-12-amd64
Version : #1 SMP Debian 5.10.103-1 (2022-03-07)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Issue Analytics

Created 2 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions

twoertwein commented, Mar 23, 2022

I just naively removed the warning, but the C engine then throws an error: there seems to be a reason why this warning is there 😉

File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 661, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 3: unexpected end of data

Unless you are familiar with Cython, I wouldn't recommend this as a first contribution. And before working on it, it would be good to get feedback from someone who is familiar with the C engine (not me).
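One plausible reading of the error above, not confirmed anywhere in this thread, is that the C engine works on UTF-8 bytes and splits on the single byte 0xA5, which is only the second half of the two-byte UTF-8 sequence for U+00A5. The sketch below reproduces the same message in pure Python under that assumption; it does not call into pandas.

```python
# The header line from the example, as UTF-8 bytes: b"key\xc2\xa5value"
utf8_line = "key\u00A5value".encode("utf-8")

# Assumption: the tokenizer splits on the single byte 0xA5,
# leaving a dangling 0xC2 lead byte at the end of the first field.
first_field = utf8_line.split(b"\xa5")[0]

message = ""
try:
    first_field.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)

print(first_field)
print(message)
```

The first field ends in an incomplete UTF-8 sequence, which is consistent with the "byte 0xc2 in position 3: unexpected end of data" text in the traceback.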

0 reactions

roberthdevries commented, Jun 23, 2022

It looks like the C engine in _libs/parsers.pyx hardcodes the encoding as UTF-8 in various places. So it would probably be best to check the encoding in io/parsers/readers.py and, if the separator does not satisfy the length == 1 requirement in UTF-8, silently switch the engine to python instead of producing a warning. Alternatively, OP could just pass engine="python" to read_csv to get rid of the warning.
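The workaround mentioned at the end of this comment can be sketched as follows. It mirrors the reproducible example but writes the file into a temporary directory (the file name example.csv is arbitrary) and requests the python engine explicitly, so no engine fallback is involved.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"
    # Write the same two rows as the original example, encoded as latin-1.
    path.write_text("key\u00A5value\ntables\u00A5rectangular\n", encoding="latin-1")

    # Asking for the python engine up front avoids the fallback ParserWarning.
    df = pd.read_csv(path, sep="\u00A5", encoding="latin-1", engine="python")

print(df)
```

This sidesteps the warning rather than fixing the underlying check, and the python engine is generally slower than the C engine on large files.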
