question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

read_csv fails to read file if there are cyrillic symbols in filename

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas
cyrillic_filename = "./файл_1.csv"
# 'c' engine fails:
df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-9cb08141730c> in <module>()
      2 
      3 cyrillic_filename = "./файл_1.csv"
----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251")

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604 
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606 
   1607         # XXX
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)()
OSError: Initializing from file failed

# 'python' engine work:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>>172440

# 'c' engine works if filename can be encoded to utf-8
latin_filename = "./file_1.csv"
df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251")
df.size
>>172440

Problem description

The ‘c’ engine should read the files with non-UTF-8 filenames

Expected Output

File content readed into dataframe

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None python: 3.6.1.final.0 python-bits: 32 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 28.8.0 Cython: None numpy: 1.13.2 scipy: 0.19.1 xarray: None IPython: 6.2.1 sphinx: None patsy: None dateutil: 2.6.1 pytz: 2017.2 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: 2.4.8 xlrd: None xlwt: None xlsxwriter: None lxml: 4.0.0 bs4: None html5lib: 1.0b10 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None None

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Reactions:1
  • Comments:11 (6 by maintainers)

github_iconTop GitHub Comments

1reaction
gfyoungcommented, Jan 14, 2019

it’s really embarassing to still have that error when engine is set to C

@fingoldo : Yeah, it’s awkward no doubt, but I hope you understand that there are many of these types of “embarrassing errors” than there are man-hours (devs and contributors combined) to correct them, especially if people have a hard time reproducing them.

Lucky for you, I currently have in my possession a Windows machine, and I was able to patch this issue pretty quickly in a PR:

https://github.com/pandas-dev/pandas/pull/24758

1reaction
gfyoungcommented, Jan 12, 2019

Open source community seems to be no better than Microsoft in this regard, where known bugs are not getting fixed for years.

@fingoldo : Sorry about this! We do get a lot of issues every day, and unlike at Microsoft, we have way fewer code maintainers to work and address all of these issues that we receive.

That being said, if you would like to tackle the issue, that would be great! Part of the issue that we have right now is that it’s hard for us to test and validate any fixes, so a community contribution would be most welcome for something like this.

xref https://github.com/pandas-dev/pandas/issues/15086#issuecomment-445799283

Read more comments on GitHub >

github_iconTop Results From Across the Web

UnicodeDecodeError when reading CSV file in Pandas with ...
Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object....
Read more >
Error - unable to read the csv file in pandas
i am unable to read the csv file. import pandas as pd df=pd.read_csv(“data.csv”). the error i am getting is :- ...
Read more >
Encoding | PyCharm Documentation - JetBrains
If PyCharm displays characters in a file incorrectly, it probably couldn't detect the file encoding. In this case, you need to specify the...
Read more >
Encoding problems (cyrillic) for importing data with CSV Import
When you export data via “save as csv”, the file is saved in UTF-8 with correct headers and content. After csv reimport with...
Read more >
codecs — Codec registry and base classes — Python 3.11.1 ...
If encoding is not None , then the underlying encoded files are always ... Therefore it does not support bytes-to-bytes encoders such as...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found