BUG: TextFileReader uses an incorrect encoding to test the size of the separator

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Run in UTF-8 mode (`python -X utf8 reproducible_example.py`) just in case.

import pandas as pd
import tempfile

with tempfile.NamedTemporaryFile("wt", encoding="latin-1") as f:
    f.write("key\u00A5value\ntables\u00A5rectangular")
    f.seek(0)
    pd.read_csv(f.name, sep="\u00A5", encoding="latin-1")

Issue Description

Despite the fact that the encoding of the file to be read is clearly specified as "latin-1" (ISO 8859-1), which encodes U+00A5 ¥ YEN SIGN as a single octet, read_csv raises the following warning:

.../venv/lib/python3.9/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the separator encoded in utf-8 is > 1 char long, and the 'c' engine does not support such separators; you can avoid this warning by specifying engine='python'.
  return func(*args, **kwargs)

The cause of this bug is that TextFileReader erroneously uses sys.getfilesystemencoding() to determine the encoding. Even as a default, this would be incorrect: the filesystem encoding is about file names, not about file contents.
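To make the mismatch concrete, here is a small standard-library sketch (not pandas code) that compares the byte length of the separator under UTF-8, under Latin-1, and under whatever `sys.getfilesystemencoding()` reports on the current machine. The variable names are illustrative only.

```python
import sys

sep = "\u00A5"  # U+00A5 YEN SIGN, the separator from the example

# Byte length of the separator under each candidate encoding.
utf8_len = len(sep.encode("utf-8"))      # two octets: 0xC2 0xA5
latin1_len = len(sep.encode("latin-1"))  # one octet: 0xA5

# The encoding the check described above consults; it depends on the
# interpreter and platform, not on the file being read.
fs_encoding = sys.getfilesystemencoding()
# errors="replace" only keeps this sketch from raising on exotic platforms.
fs_len = len(sep.encode(fs_encoding, errors="replace"))

print(utf8_len, latin1_len, fs_encoding, fs_len)
```

On a system whose filesystem encoding is UTF-8, the last number is 2 even though the file itself stores the separator in one octet.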

This bug has been present since the commit that initially added this warning.

Expected Behavior

The encoding to be tested should be based on the encoding keyword argument (or its default value). In this case, no warning should be produced.
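A minimal sketch of the kind of check this describes, assuming a hypothetical helper named `separator_byte_length` (it is not part of pandas) and assuming that a missing `encoding` is treated as UTF-8, in line with the documented `read_csv` default:

```python
from typing import Optional


def separator_byte_length(sep: str, encoding: Optional[str] = None) -> int:
    """Hypothetical helper: byte length of `sep` in the file's declared encoding."""
    # Assumption: no encoding given means UTF-8.
    return len(sep.encode(encoding or "utf-8"))


# With the declared encoding, the yen sign is a single octet.
print(separator_byte_length("\u00A5", "latin-1"))
# With the UTF-8 default, it is two octets.
print(separator_byte_length("\u00A5"))
```

Whether such a check would be sufficient on its own depends on how the C engine handles non-UTF-8 input, which the comments below touch on.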

Installed Versions

INSTALLED VERSIONS

commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-12-amd64
Version : #1 SMP Debian 5.10.103-1 (2022-03-07)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

twoertwein commented, Mar 23, 2022 (2 reactions)

I just naively removed the warning, but the C engine then throws an error: there seems to be a reason why this warning is there 😉

File "pandas/_libs/parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 661, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 3: unexpected end of data

Unless you are familiar with Cython, I wouldn't recommend this as a first contribution. And before working on it, it would be good to get feedback from someone who is familiar with the C engine (not me).
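The error message quoted in this comment is consistent with UTF-8 bytes being split on a single-byte delimiter. The standard-library sketch below reproduces the same message. It rests on two assumptions that the thread does not confirm: that the header line reaches the tokenizer as UTF-8, and that the delimiter is reduced to the single byte 0xA5.

```python
# Assumption: the header line reaches the tokenizer as UTF-8 bytes.
header = "key\u00A5value".encode("utf-8")  # b'key\xc2\xa5value'

# Assumption: the delimiter is reduced to the single byte 0xA5, so the
# split lands in the middle of the two-byte sequence 0xC2 0xA5.
fields = header.split(b"\xa5")  # [b'key\xc2', b'value']

try:
    fields[0].decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    # The first field ends with a dangling lead byte 0xC2.
    message = str(exc)

print(message)
```

If that reading is right, the warning guards against exactly this kind of mid-character split.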

roberthdevries commented, Jun 23, 2022 (0 reactions)

It looks like the C engine in _libs/parsers.pyx hardcodes the encoding as UTF-8 in various places. So it would probably be best to check the encoding in io/parsers/readers.py: if the separator does not meet the length == 1 requirement in UTF-8, read_csv could silently switch the engine to python instead of producing a warning. Alternatively, OP could just pass engine="python" to read_csv and get rid of the warning.
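The workaround mentioned at the end can be sketched as follows. The file name `sample.csv` and the temporary directory are illustrative. The sketch records any warnings raised during the call so the absence of a `ParserWarning` can be checked rather than assumed.

```python
import os
import tempfile
import warnings

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write the two-line sample from the report as Latin-1 bytes.
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="latin-1") as f:
        f.write("key\u00A5value\ntables\u00A5rectangular")

    # Request the python engine explicitly so no fallback is needed.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        df = pd.read_csv(path, sep="\u00A5", encoding="latin-1", engine="python")

# Keep only the parser-related warnings, if any were raised.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]

print(list(df.columns), len(parser_warnings))
```

The trade-off is that the python engine is generally slower than the C engine, so this is a workaround rather than a fix for the underlying check.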
