Modin fails to load csv from s3 with ray client

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu
  • Modin version (modin.__version__): master (0.8.3+22.ge99b629)
  • Python version: 3.7
  • Code we can use to reproduce:
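import ray
import ray.util
# Connect to the remote cluster through Ray client
ray.util.connect("<service_ip>:50051")
import modin.pandas as pd
pd.DEFAULT_NPARTITIONS = 10
# Fails with the AssertionError shown under "Source code / logs"
df = pd.read_csv("s3://<bucket>/HIGGS_100k.csv")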

Describe the problem

Modin fails to load a CSV file from S3 when connected through Ray client: pd.read_csv raises an AssertionError from the PandasOnRayFramePartition constructor.

Source code / logs

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-9b3c648a226d> in <module>()
----> 1 df = pd.read_csv("s3://<s3_bucket>/HIGGS_100k.csv")

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/pandas/io.py in parser_func(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    114 
    115         kwargs = {k: v for k, v in f_locals.items() if k in _pd_read_csv_signature}
--> 116         return _read(**kwargs)
    117 
    118     return parser_func

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/pandas/io.py in _read(**kwargs)
    133 
    134     Engine.subscribe(_update_engine)
--> 135     pd_obj = EngineDispatcher.read_csv(**kwargs)
    136     # This happens when `read_csv` returns a TextFileReader object for iterating through
    137     if isinstance(pd_obj, pandas.io.parsers.TextFileReader):

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/data_management/factories/dispatcher.py in read_csv(cls, **kwargs)
    102     @classmethod
    103     def read_csv(cls, **kwargs):
--> 104         return cls.__engine._read_csv(**kwargs)
    105 
    106     @classmethod

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/data_management/factories/factories.py in _read_csv(cls, **kwargs)
     85     @classmethod
     86     def _read_csv(cls, **kwargs):
---> 87         return cls.io_cls.read_csv(**kwargs)
     88 
     89     @classmethod

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/base/io/file_dispatcher.py in read(cls, *args, **kwargs)
     27     @classmethod
     28     def read(cls, *args, **kwargs):
---> 29         query_compiler = cls._read(*args, **kwargs)
     30         # TODO (devin-petersohn): Make this section more general for non-pandas kernel
     31         # implementations.

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/base/io/text/csv_dispatcher.py in _read(cls, filepath_or_buffer, **kwargs)
    192         dtypes = cls.get_dtypes(dtypes_ids) if len(dtypes_ids) > 0 else None
    193 
--> 194         partition_ids = cls.build_partition(partition_ids, row_lengths, column_widths)
    195         # If parse_dates is present, the column names that we have might not be
    196         # the same length as the returned column names. If we do need to modify

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/base/io/text/text_file_dispatcher.py in build_partition(cls, partition_ids, row_lengths, column_widths)
     51                     for j in range(len(partition_ids[i]))
     52                 ]
---> 53                 for i in range(len(partition_ids))
     54             ]
     55         )

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/base/io/text/text_file_dispatcher.py in <listcomp>(.0)
     51                     for j in range(len(partition_ids[i]))
     52                 ]
---> 53                 for i in range(len(partition_ids))
     54             ]
     55         )

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/base/io/text/text_file_dispatcher.py in <listcomp>(.0)
     49                         width=column_widths[j],
     50                     )
---> 51                     for j in range(len(partition_ids[i]))
     52                 ]
     53                 for i in range(len(partition_ids))

/home/bhavya.agarwal/.local/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py in __init__(self, object_id, length, width, ip, call_queue)
     25 class PandasOnRayFramePartition(BaseFramePartition):
     26     def __init__(self, object_id, length=None, width=None, ip=None, call_queue=None):
---> 27         assert type(object_id) is ray.ObjectID
     28 
     29         self.oid = object_id

AssertionError:
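
The last frame of the traceback is an exact type check, `assert type(object_id) is ray.ObjectID`. The sketch below illustrates how such a check rejects any handle whose class is not exactly the expected one, even when the handle carries the same information. It does not import Ray: `ObjectRef`, `ClientObjectRef`, `Partition`, and `try_build` are hypothetical stand-ins, and the idea that Ray client hands back a different handle type is an assumption, not something confirmed in this thread.

```python
# Stand-in classes only; these are NOT Ray's or Modin's real types.
class ObjectRef:
    """Hypothetical handle created by a local driver."""

    def __init__(self, key):
        self.key = key


class ClientObjectRef:
    """Hypothetical handle returned through a remote client connection."""

    def __init__(self, key):
        self.key = key


class Partition:
    """Mimics the constructor in the traceback: an exact type check."""

    def __init__(self, object_id):
        # Same shape as the failing line: only the exact class passes.
        assert type(object_id) is ObjectRef
        self.oid = object_id


def try_build(handle):
    """Return True if Partition accepts the handle, False on AssertionError."""
    try:
        Partition(handle)
        return True
    except AssertionError:
        return False


local_ok = try_build(ObjectRef("a"))
client_ok = try_build(ClientObjectRef("a"))
print(local_ok, client_ok)
```

Under these assumptions, the first handle is accepted and the second is rejected with a bare `AssertionError`, which matches the empty message at the end of the traceback above.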

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
devin-petersohn commented, Feb 23, 2021

This is currently an issue with the latest Ray wheels. It is being discussed here: https://github.com/ray-project/ray/issues/14279

1 reaction
shossain commented, Feb 22, 2021

@devin-petersohn: I tried to submit the following script to an existing cluster using ray submit:

import ray
ray.init(address='auto', _redis_password='5241590000000000')


import modin.pandas as pd

columns_names = [
        "trip_id", "vendor_id", "pickup_datetime", "dropoff_datetime", "store_and_fwd_flag",
        "rate_code_id", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude",
        "passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax", "tip_amount",
        "tolls_amount", "ehail_fee", "improvement_surcharge", "total_amount", "payment_type",
        "trip_type", "pickup", "dropoff", "cab_type", "precipitation", "snow_depth", "snowfall",
        "max_temperature", "min_temperature", "average_wind_speed", "pickup_nyct2010_gid",
        "pickup_ctlabel", "pickup_borocode", "pickup_boroname", "pickup_ct2010",
        "pickup_boroct2010", "pickup_cdeligibil", "pickup_ntacode", "pickup_ntaname", "pickup_puma",
        "dropoff_nyct2010_gid", "dropoff_ctlabel", "dropoff_borocode", "dropoff_boroname",
        "dropoff_ct2010", "dropoff_boroct2010", "dropoff_cdeligibil", "dropoff_ntacode",
        "dropoff_ntaname", "dropoff_puma",
    ]


df = pd.read_csv('https://modin-datasets.s3.amazonaws.com/trips_data.csv', names=columns_names)

def q1(df):
    return df.groupby("cab_type")["cab_type"].count()
print(df)      # Works fine
print(q1(df))  # Throws exception

But I am getting the following exception:

(raylet) [2021-02-22 12:17:09,584 C 4100 4100] pull_manager.cc:100:  Check failed: active_object_pull_requests_[obj_id].erase(request_it->first) 
(raylet) [2021-02-22 12:17:09,584 E 4100 4100] logging.cc:435: *** Aborted at 1614025029 (unix time) try "date -d @1614025029" if you are using GNU date ***
(raylet) [2021-02-22 12:17:09,584 E 4100 4100] logging.cc:435: PC: @                0x0 (unknown)
(raylet) [2021-02-22 12:17:09,593 E 4100 4100] logging.cc:435: *** SIGABRT (@0x3e800001004) received by PID 4100 (TID 0x7f3902812800) from PID 4100; stack trace: ***
(raylet) [2021-02-22 12:17:09,595 E 4100 4100] logging.cc:435:     @     0x556fe05a223f google::(anonymous namespace)::FailureSignalHandler()
(raylet) [2021-02-22 12:17:09,596 E 4100 4100] logging.cc:435:     @     0x7f3902d743c0 (unknown)
(raylet) [2021-02-22 12:17:09,596 E 4100 4100] logging.cc:435:     @     0x7f390285d18b gsignal
(raylet) [2021-02-22 12:17:09,596 E 4100 4100] logging.cc:435:     @     0x7f390283c859 abort
(raylet) [2021-02-22 12:17:09,599 E 4100 4100] logging.cc:435:     @     0x556fe0593615 ray::SpdLogMessage::Flush()
(raylet) [2021-02-22 12:17:09,601 E 4100 4100] logging.cc:435:     @     0x556fe059364d ray::RayLog::~RayLog()
(raylet) [2021-02-22 12:17:09,602 E 4100 4100] logging.cc:435:     @     0x556fe028df8d ray::PullManager::DeactivatePullBundleRequest()
(raylet) [2021-02-22 12:17:09,603 E 4100 4100] logging.cc:435:     @     0x556fe0290ed9 ray::PullManager::CancelPull()
(raylet) [2021-02-22 12:17:09,604 E 4100 4100] logging.cc:435:     @     0x556fe027e28a ray::ObjectManager::CancelPull()
(raylet) [2021-02-22 12:17:09,605 E 4100 4100] logging.cc:435:     @     0x556fe01d0b77 ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet) [2021-02-22 12:17:09,606 E 4100 4100] logging.cc:435:     @     0x556fe023afdd ray::raylet::ClusterTaskManager::DispatchScheduledTasksToWorkers()
(raylet) [2021-02-22 12:17:09,607 E 4100 4100] logging.cc:435:     @     0x556fe0209d2f ray::raylet::NodeManager::HandleWorkerAvailable()
(raylet) [2021-02-22 12:17:09,608 E 4100 4100] logging.cc:435:     @     0x556fe0209e30 ray::raylet::NodeManager::HandleWorkerAvailable()
(raylet) [2021-02-22 12:17:09,608 E 4100 4100] logging.cc:435:     @     0x556fe020a373 ray::raylet::NodeManager::ProcessAnnounceWorkerPortMessage()
(raylet) [2021-02-22 12:17:09,609 E 4100 4100] logging.cc:435:     @     0x556fe0226f1a ray::raylet::NodeManager::ProcessClientMessage()
(raylet) [2021-02-22 12:17:09,610 E 4100 4100] logging.cc:435:     @     0x556fe01852a1 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZNS1_6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEEUlS3_lS8_E0_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
(raylet) [2021-02-22 12:17:09,614 E 4100 4100] logging.cc:435:     @     0x556fe054da4e ray::ClientConnection::ProcessMessage()
(raylet) [2021-02-22 12:17:09,618 E 4100 4100] logging.cc:435:     @     0x556fe054aaec boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) [2021-02-22 12:17:09,622 E 4100 4100] logging.cc:435:     @     0x556fe0910e41 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-02-22 12:17:09,624 E 4100 4100] logging.cc:435:     @     0x556fe09124e9 boost::asio::detail::scheduler::run()
(raylet) [2021-02-22 12:17:09,624 E 4100 4100] logging.cc:435:     @     0x556fe09149d7 boost::asio::io_context::run()
(raylet) [2021-02-22 12:17:09,627 E 4100 4100] logging.cc:435:     @     0x556fe0151572 main
(raylet) [2021-02-22 12:17:09,627 E 4100 4100] logging.cc:435:     @     0x7f390283e0b3 __libc_start_main
(raylet) [2021-02-22 12:17:09,629 E 4100 4100] logging.cc:435:     @     0x556fe0166665 (unknown)