read_csv with Ray engine fails with some combinations of `Column and Index Locations and Names` parameters

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Modin version (modin.__version__): 0.8.1.1+34.ga571e10
Python version: 3.8.6
Code we can use to reproduce:

import os

os.environ["MODIN_ENGINE"] = "ray"

import pandas
import modin.pandas as pd
from modin.pandas.test.utils import df_equals

test_filename = "test.csv"
kwargs = {
    "filepath_or_buffer": test_filename,
    "index_col": "col1",
    "usecols": ["col1"],
}
str_two_cols = """col1,col2
0,1
2,3
"""

try:
    with open(test_filename, "w") as f:
        f.write(str_two_cols)

    df_pandas = pandas.read_csv(**kwargs)
    print(df_pandas)
    df_pd = pd.read_csv(**kwargs)
    print(df_pd)
    df_equals(df_pd, df_pandas)
finally:
    os.remove(test_filename)

Describe the problem

Source code / logs

Empty DataFrame
Columns: []
Index: [0, 2]
Traceback (most recent call last):
  File "test.py", line 26, in <module>
    df_pd = pd.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/pandas/io.py", line 109, in parser_func
    return _read(**kwargs)
  File "/localdisk/amyskov/modin/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/data_management/factories/dispatcher.py", line 104, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/data_management/factories/factories.py", line 87, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/engines/base/io/file_reader.py", line 29, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/localdisk/amyskov/modin/modin/engines/base/io/text/csv_reader.py", line 142, in _read
    column_chunksize = compute_chunksize(empty_pd_df, num_splits, axis=1)
  File "/localdisk/amyskov/modin/modin/data_management/utils.py", line 54, in compute_chunksize
    col_chunksize = get_default_chunksize(len(df.columns), num_splits)
  File "/localdisk/amyskov/modin/modin/data_management/utils.py", line 29, in get_default_chunksize
    length // num_splits if length % num_splits == 0 else length // num_splits + 1
ZeroDivisionError: integer division or modulo by zero
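
The last frame of the traceback points at the arithmetic that fails: `length % num_splits` raises as soon as `num_splits` is zero. The sketch below copies that expression from the traceback and calls it with zero columns, which is what remains when `usecols=["col1"]` and `index_col="col1"` name the same column. That `num_splits` is derived from the column count is an assumption made for illustration, and `safe_chunksize` is a hypothetical guard, not Modin's actual fix.

```python
def get_default_chunksize(length, num_splits):
    # Same expression as the last frame of the traceback.
    return length // num_splits if length % num_splits == 0 else length // num_splits + 1


def safe_chunksize(length, num_splits):
    # Hypothetical guard (not Modin's actual fix): nothing to split means chunk size 0.
    if num_splits == 0:
        return 0
    return get_default_chunksize(length, num_splits)


# index_col="col1" combined with usecols=["col1"] leaves zero data columns.
n_columns = 0
# Assumption: the number of splits is derived from the column count.
num_splits = n_columns

try:
    get_default_chunksize(n_columns, num_splits)
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)
print(safe_chunksize(n_columns, num_splits))
print(safe_chunksize(5, 2))
```

A guard of this kind only avoids the exception; the reader would still need to return an index-only frame to match the pandas output shown above.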

Top GitHub Comments

amyskov commented, Jul 14, 2021

@ymoslem, as a temporary workaround, this bug can be avoided by passing index_col=False to read_csv. See the example below:

import os

# Set the engine before importing Modin, as in the reproduction script.
os.environ["MODIN_ENGINE"] = "ray"

import ray
ray.init()

import pandas
import modin.pandas as pd
from modin.pandas.test.utils import df_equals

file_name = "ELRA-W0309.en-fr.en"

df_pandas = pandas.read_csv(file_name, names=['English'], sep="\n")
df_pd = pd.read_csv(file_name, names=['English'], sep="\n", index_col=False)
df_pd.reset_index(drop=True, inplace=True)  # align the Modin index with the pandas one

df_equals(df_pandas, df_pd) # df_pd and df_pandas are equal!

Hope this helps!

pyrito commented, Aug 22, 2022

I am not able to reproduce this bug on master, so I’ll go ahead and close this issue. Please re-open if needed!