Encoding error : `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte`
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Modin installed from: pip install modin[ray]
- Modin version: 0.6.3
- Python version: 3.7.3
Describe the problem
Hello,
I'm trying to use Modin to reduce peak memory usage, given the volume of my data, so I replaced pandas with modin.pandas. I am doing a simple read of a file encoded in 'latin-1' (French). With pandas everything goes smoothly, but with Modin I get the following encoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte
The script used (which works fine with pandas but not with Modin):
caract = pd.read_csv(path, sep="\t", encoding="ISO-8859-1")
PS: I tried other encodings with the same result (works with pandas, fails with Modin backed by Ray): ISO-8859-1, ISO-8859-9, latin-1.
Is there any solution?
Thanks
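One possible workaround, sketched here under assumptions rather than confirmed against Modin 0.6.3, is to re-encode the file as UTF-8 once and then read the UTF-8 copy without an `encoding` argument. The helper name `transcode_to_utf8`, the file names, and the sample rows are hypothetical and only serve the illustration.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="ISO-8859-1", chunk_chars=1 << 20):
    """Copy src_path to dst_path, re-encoding the text as UTF-8 chunk by chunk."""
    # newline="" disables newline translation, so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
    return dst_path


# Small demonstration on a made-up tab-separated file (not the reporter's data).
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "caract_latin1.tsv")
utf8_path = os.path.join(workdir, "caract_utf8.tsv")

with open(latin1_path, "w", encoding="ISO-8859-1", newline="") as f:
    f.write("id\tlibellé\n1\tdonnées\n")

transcode_to_utf8(latin1_path, utf8_path)

# The copy now decodes cleanly as UTF-8.
with open(utf8_path, "rb") as f:
    utf8_bytes = f.read()
```

The UTF-8 copy could then be passed to `pd.read_csv(utf8_path, sep="\t")`. Whether this avoids the error in this Modin version has not been verified here.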
Source code / logs
`RayTaskError: ray_worker (pid=10815, host=ubuntu)
  File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte
During handling of the above exception, another exception occurred:
ray_worker (pid=10815, host=ubuntu)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py", line 8, in deploy_ray_func
    return func(**args)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/modin/backends/pandas/parsers.py", line 69, in parse
    pandas_df = pandas.read_csv(BytesIO(to_read), **kwargs)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 973, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1158, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte`
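The final frame of the traceback (`_string_box_utf8`) shows the worker decoding the column data as UTF-8. The following minimal sketch illustrates why a byte such as 0xe9 fails in that case: in ISO-8859-1 it is the single character `é`, while in UTF-8 it is the lead byte of a three-byte sequence and must be followed by continuation bytes. The sample word is made up, and the sketch does not reproduce Modin's internals.

```python
# "données" encoded as ISO-8859-1 is b'donn\xe9es': é is the single byte 0xe9.
raw = "données".encode("ISO-8859-1")

# Decoding with the encoding the data was written in succeeds.
as_latin1 = raw.decode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails at the 0xe9 byte, because the
# following byte (ASCII "e") is not a valid continuation byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# error.start is the index of the offending byte within raw.
bad_index = error.start
bad_byte = raw[bad_index]
```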
Issue Analytics
- Created 4 years ago
- Comments:12 (5 by maintainers)
Top GitHub Comments
System information
- OS Platform: Windows 10 Home
- Modin installed from: pip install modin[dask]
- Modin version: 0.6.3
- Python version: 3.7.3
Hi @devin-petersohn,
I am facing a similar issue to @ghsama's, on Windows, using Modin with the Dask engine. With vanilla pandas this works just fine:
pd.read_csv('fires_50k.csv', encoding = "ISO-8859-1")
However, reading the CSV with modin.pandas like this:
mpd.read_csv('fires_50k.csv', encoding = "ISO-8859-1")
throws this UnicodeDecodeError:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\pandas\io.py", line 97, in parser_func
    return _read(**kwargs)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\pandas\io.py", line 110, in _read
    pd_obj = BaseFactory.read_csv(**kwargs)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\data_management\factories.py", line 52, in read_csv
    return cls._determine_engine()._read_csv(**kwargs)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\data_management\factories.py", line 56, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\engines\base\io\text\csv_reader.py", line 197, in read
    row_lengths = cls.materialize(index_ids)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\engines\dask\task_wrapper.py", line 20, in materialize
    return client.gather(future)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 1876, in gather
    asynchronous=asynchronous,
  File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 771, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\utils.py", line 318, in f
    result[0] = yield future
  File "D:\Users\hvard\Anaconda3\lib\site-packages\tornado\gen.py", line 1133, in run
    value = future.result()
  File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 1732, in _gather
    raise exception.with_traceback(traceback)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\backends\pandas\parsers.py", line 69, in parse
    pandas_df = pandas.read_csv(BytesIO(to_read), **kwargs)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 973, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1158, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8: invalid start byte
This is the CSV I am trying to read. It is a 50K-row sample of a larger dataset with 1.88M rows. I think you should be able to reproduce the issue with this data. Please let me know if you cannot.
Thanks!
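When preparing a reproduction like the one above, it can help to confirm which byte in the file is not valid UTF-8. The sketch below is only a diagnostic aid: the helper name `first_non_utf8_byte`, the file name, and the sample rows are hypothetical and are not taken from the fires dataset. It reads the whole file into memory, which should be acceptable for a 50K-row sample but not necessarily for the full dataset.

```python
import os
import tempfile


def first_non_utf8_byte(path):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte within data.
        return exc.start, data[exc.start]
    return None


# Made-up sample: 0xa0 is a non-breaking space in ISO-8859-1. In UTF-8 it can
# only appear as a continuation byte, so it is rejected as a start byte.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_latin1.csv")
with open(sample_path, "wb") as f:
    f.write(b"name,state\nSan\xa0Jose,CA\n")

result = first_non_utf8_byte(sample_path)
```

Here `result` holds the byte offset and the value of the first offending byte, which can be compared with the byte reported in the `UnicodeDecodeError` message.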
df = pd.read_csv( filepath, encoding='windows-1251', names=['field', 'names', 'here'], sep=';', skiprows=0, na_values='\\N', engine='c' )
Thank You