Modin fails on a simple code snippet with `ray==2.1.0` in development environment
Modin fails with the recently released ray 2.1.0 even for this simple code snippet:
import modin.pandas as pd
print((pd.DataFrame([[1]]) + 1)._to_pandas())
Traceback
2022-11-08 12:37:32,128 ERROR services.py:1403 -- Failed to start the dashboard: Failed to start the dashboard, return code -11
The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,128 ERROR services.py:1404 -- Failed to start the dashboard, return code -11
The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
Traceback (most recent call last):
File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/services.py", line 1389, in start_api_server
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard, return code -11
The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,251 INFO worker.py:1528 -- Started a local Ray instance.
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
File "t3.py", line 6, in <module>
print((pd.DataFrame([[1]]) + 1)._to_pandas())
File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/localdisk/dchigare/repos/modin/modin/pandas/dataframe.py", line 2883, in _to_pandas
return self._query_compiler.to_pandas()
File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/localdisk/dchigare/repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 287, in to_pandas
return self._modin_frame.to_pandas()
File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 125, in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3089, in to_pandas
df = self._partition_mgr_cls.to_pandas(self._partitions)
File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition.py", line 145, in to_pandas
dataframe = self.get()
File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 81, in get
result = RayWrapper.materialize(self._data)
File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/common/engine_wrapper.py", line 92, in materialize
return ray.get(obj_id)
File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/worker.py", line 2291, in get
raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(raylet) [2022-11-08 12:37:32,802 E 2718405 2718451] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
The log files that the error mentions do not seem to contain any useful info:
dashboard_agent.log
2022-11-08 12:37:32,673 INFO agent.py:102 -- Parent pid is 2718405
2022-11-08 12:37:32,674 INFO agent.py:128 -- Dashboard agent grpc address: 0.0.0.0:63816
python-core-driver0.log
Global stats: 12 total (8 active)
Queueing time: mean = 20.740 us, max = 112.582 us, min = 7.329 us, total = 248.875 us
Execution time: mean = 9.748 us, total = 116.980 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 1.763 us, total = 10.579 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 88.407 us, total = 88.407 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 17.994 us, total = 17.994 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[2022-11-08 12:37:32,260 I 2718136 2718471] accessor.cc:608: Received notification for node id = b8eb05a6aaee9b4f7f7d8b5c2188bfc617ee57f6583b546a554f6939, IsAlive = 1
[2022-11-08 12:37:32,807 W 2718136 2718471] direct_task_transport.cc:488: The worker failed to receive a response from the local raylet because the raylet is unavailable (crashed). Error: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details:
[2022-11-08 12:37:32,807 I 2718136 2718471] task_manager.cc:507: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.partition, class_name=, function_name=_apply_func, function_hash=f11822ef8d2c45b093f2b1e8e55ee73f}, task_id=c8ef45ccd0112571ffffffffffffffffffffffff01000000, task_name=_apply_func, job_id=01000000, num_args=4, num_returns=4, depth=1, attempt_number=0, max_retries=3, serialized_runtime_env={"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}}, eager_install=1, setup_timeout_seconds=600
[2022-11-08 12:37:33,260 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,261 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:637: Disconnecting to the raylet.
[2022-11-08 12:37:34,268 I 2718136 2718136] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-08 12:37:34,268 W 2718136 2718136] raylet_client.cc:188: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet. This means the raylet the worker is connected is probably already dead.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:580: Shutting down a core worker.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:604: Disconnecting a GCS client.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:608: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-08 12:37:34,268 I 2718136 2718471] core_worker.cc:736: Core worker main io service stopped.
[2022-11-08 12:37:34,268 W 2718136 2718474] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:617: Core worker ready to be deallocated.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:571: Core worker is destructed
[2022-11-08 12:37:34,518 I 2718136 2718136] core_worker_process.cc:147: Destructing CoreWorkerProcessImpl. pid: 2718136
[2022-11-08 12:37:34,518 I 2718136 2718136] io_service_pool.cc:47: IOServicePool is stopped.
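Since the dashboard fails to start with return code -11 and the raylet fate-shares with the dashboard agent, one quick check (a diagnostic sketch, not a confirmed fix) is to start Ray with the dashboard disabled before importing modin.pandas, assuming Modin picks up the already-initialized Ray instance; if the snippet then succeeds, the crash is likely tied to the dashboard/agent startup rather than to Modin itself:

```python
# Diagnostic sketch only (assumes Modin reuses an already-initialized Ray instance):
# start Ray with the dashboard disabled, then run the failing snippet.
import ray

ray.init(include_dashboard=False)  # skip the dashboard that crashes with return code -11

import modin.pandas as pd

print((pd.DataFrame([[1]]) + 1)._to_pandas())
```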
The error occurs only on Linux and only in our development environment (environment-dev.yml). A fresh install of modin with ray 2.1.0 works fine:
$ conda create -n clean_env python=3.8
$ pip install modin/
$ pip install ray==2.1.0
$ python reproducer.py # works fine
----
$ conda env create -f modin/environment-dev.yml
$ python reproducer.py # fails
The error also occurs in our CI, where all of the non-Windows jobs with ray fail. Example: https://github.com/modin-project/modin/actions/runs/3420080409/jobs/5697742798
P.S. The issue does not seem to be related to the redis version requirement in our env recipe:
https://github.com/modin-project/modin/blob/c30ab4c132295b1e168fbb06a050c76554759e11/environment-dev.yml#L53
Both the non-working and working environments use the same version of redis (3.5.1).
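To narrow down which package actually differs between the two environments, one option (this helper script is an assumption for illustration, not part of the original report) is to dump the versions of the likely suspects in each conda env and diff the output:

```python
# version_check.py - hypothetical helper; run once in each conda env and diff the output.
from importlib.metadata import PackageNotFoundError, version

# Suspects mentioned in this thread: ray, grpcio (per the comment below), redis, modin.
for pkg in ("modin", "ray", "grpcio", "redis", "protobuf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

Running this in both the clean env and the environment-dev.yml env should make any differing package (e.g. grpcio) stand out.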
There are a lot of things wrong with that environment:
- dask[complete]>=2.22.0 - conda does not understand this syntax and ~it does not do what you expect it to~ only happens to work by accident; unfortunately (and in contrast to other packages like ray and modin) there’s no direct replacement in conda-forge yet.
- matplotlib<=3.2.2 - why oh why are you forcing the use of a version that’s over 2 years old? This will force the solver into very weird contortions.
- coverage<5.0 - likewise, over 3 years old.
- pygithub==1.53 - over 2 years old.
- rpyc==4.1.5 - over 2 years old.

Generally: move asv, black, connectorx, flake8, numpydoc, tqdm, xgboost from pip to conda-forge dependencies (ideally also ray).

Of course things shouldn’t break wherever possible, sorry for that (and we’re looking into what’s going on with grpcio 1.49.1 in this context). Still, I feel obliged to point out that your requirement constraints are somewhere between “actively hostile to the solver” and “shooting yourself in the foot”. I’ve kept banging that drum in https://github.com/modin-project/modin/issues/3371 and on the feedstock, but I cannot seem to get that point across.
It’s one of the prime reasons I (personally) haven’t spent much time with modin - it doesn’t “play nice with others” from the POV of dependency constraints, which makes it very hard to use together with other projects (and their dependencies)[^1]. Note that I’m not criticising temporary pins to unbreak CI, but with multiple several-year-old constraints, you’re externalising a lot of costs onto the ecosystem (incl. other libraries, packagers and your users).
[^1]: It also makes it really hard to debug situations as in this bug.
@h-vetinari thank you for raising this point! I think a lot of what you said makes sense and can be addressed in a few PRs. I agree with you that pinning versions can cause dependency headaches.
You raise a fair point w.r.t. pinning pandas to the patch level. We will take steps towards coming to a solution there!