read_csv with Ray engine fails with some combinations of `Column and Index Locations and Names` parameters

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Modin version (modin.__version__): 0.8.1.1+34.ga571e10
  • Python version: 3.8.6
  • Code we can use to reproduce:
import os

os.environ["MODIN_ENGINE"] = "ray"

import pandas
import modin.pandas as pd
from modin.pandas.test.utils import df_equals

test_filename = "test.csv"
kwargs = {
    "filepath_or_buffer": test_filename,
    "index_col": "col1",
    "usecols": ["col1"],
}
str_two_cols = """col1,col2
0,1
2,3
"""

try:
    with open(test_filename, "w") as f:
        f.write(str_two_cols)

    df_pandas = pandas.read_csv(**kwargs)
    print(df_pandas)
    df_pd = pd.read_csv(**kwargs)
    print(df_pd)
    df_equals(df_pd, df_pandas)
finally:
    os.remove(test_filename)
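
To see what these arguments ask for, the same call can be tried against plain pandas without Modin or Ray. This is only a sketch, assuming pandas is installed; it reads the CSV text from an in-memory buffer instead of a file. Because the only column selected by `usecols` is also the one promoted to the index by `index_col`, the resulting frame keeps its two rows but has no data columns left.

```python
import io

import pandas

# Same two-column CSV as in the reproducer, kept in memory.
csv_text = "col1,col2\n0,1\n2,3\n"

# The single selected column becomes the index, so no data columns remain.
df = pandas.read_csv(io.StringIO(csv_text), index_col="col1", usecols=["col1"])

print(df.shape)
print(list(df.index))
print(len(df.columns))
```

A frame with zero columns is the edge case worth keeping in mind when reading the logs further down.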

Describe the problem

Source code / logs

Empty DataFrame
Columns: []
Index: [0, 2]
Traceback (most recent call last):
  File "test.py", line 26, in <module>
    df_pd = pd.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/pandas/io.py", line 109, in parser_func
    return _read(**kwargs)
  File "/localdisk/amyskov/modin/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/data_management/factories/dispatcher.py", line 104, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/data_management/factories/factories.py", line 87, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/localdisk/amyskov/modin/modin/engines/base/io/file_reader.py", line 29, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/localdisk/amyskov/modin/modin/engines/base/io/text/csv_reader.py", line 142, in _read
    column_chunksize = compute_chunksize(empty_pd_df, num_splits, axis=1)
  File "/localdisk/amyskov/modin/modin/data_management/utils.py", line 54, in compute_chunksize
    col_chunksize = get_default_chunksize(len(df.columns), num_splits)
  File "/localdisk/amyskov/modin/modin/data_management/utils.py", line 29, in get_default_chunksize
    length // num_splits if length % num_splits == 0 else length // num_splits + 1
ZeroDivisionError: integer division or modulo by zero
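
The last frame of the traceback divides only by `num_splits`, so that value must have been 0 when the error was raised. Why it was 0 is not stated in the logs; a plausible reading (an assumption, not confirmed here) is that it was derived from the number of data columns, which is zero for this input. The sketch below copies the expression from the traceback into a standalone function with a hypothetical name, and adds a hypothetical guarded variant purely for illustration; it is not Modin's actual fix.

```python
def default_chunksize(length: int, num_splits: int) -> int:
    """Ceiling division, written like the expression in the last traceback frame."""
    return length // num_splits if length % num_splits == 0 else length // num_splits + 1


def safe_chunksize(length: int, num_splits: int) -> int:
    """Hypothetical guarded variant: never divide by fewer than one split."""
    return default_chunksize(length, max(num_splits, 1))


# Ordinary case: 10 columns over 4 splits.
print(default_chunksize(10, 4))

# Edge case from the issue: zero columns and zero splits.
raised = False
try:
    default_chunksize(0, 0)
except ZeroDivisionError:
    raised = True
print("unguarded call raised ZeroDivisionError:", raised)

# The guarded variant returns a chunk size of 0 instead of raising.
print(safe_chunksize(0, 0))
```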

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
amyskov commented, Jul 14, 2021

@ymoslem, as a temporary workaround, this bug can be avoided by passing index_col=False to the read_csv function; see the example below:

import os

# Select the engine before importing Modin so the setting is sure to be picked up.
os.environ["MODIN_ENGINE"] = "ray"

import pandas
import ray

ray.init()

import modin.pandas as pd
from modin.pandas.test.utils import df_equals

file_name = "ELRA-W0309.en-fr.en"

df_pandas = pandas.read_csv(file_name, names=["English"], sep="\n")
# Workaround: pass index_col=False to the Modin call.
df_pd = pd.read_csv(file_name, names=["English"], sep="\n", index_col=False)
df_pd.reset_index(drop=True, inplace=True)  # align the Modin index with the pandas one

df_equals(df_pandas, df_pd)  # df_pd and df_pandas are equal

Hope this helps!

0 reactions
pyrito commented, Aug 22, 2022

I am not able to reproduce this bug on master, so I’ll go ahead and close this issue. Please re-open if needed!
