
Modin fails on a simple code snippet with `ray==2.1.0` in development environment

See original GitHub issue

Modin fails with the recently released ray 2.1.0 even for this simple code snippet:

import modin.pandas as pd
print((pd.DataFrame([[1]]) + 1)._to_pandas())
Traceback
2022-11-08 12:37:32,128 ERROR services.py:1403 -- Failed to start the dashboard: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,128 ERROR services.py:1404 -- Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
Traceback (most recent call last):
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/services.py", line 1389, in start_api_server
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,251 INFO worker.py:1528 -- Started a local Ray instance.
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
  File "t3.py", line 6, in <module>
    print((pd.DataFrame([[1]]) + 1)._to_pandas())
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/pandas/dataframe.py", line 2883, in _to_pandas
    return self._query_compiler.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 287, in to_pandas
    return self._modin_frame.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 125, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3089, in to_pandas
    df = self._partition_mgr_cls.to_pandas(self._partitions)
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition.py", line 145, in to_pandas
    dataframe = self.get()
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 81, in get
    result = RayWrapper.materialize(self._data)
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/common/engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/worker.py", line 2291, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(raylet) [2022-11-08 12:37:32,802 E 2718405 2718451] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

The log files that the error mentions do not seem to contain any useful info:

dashboard_agent.log

2022-11-08 12:37:32,673 INFO agent.py:102 -- Parent pid is 2718405
2022-11-08 12:37:32,674 INFO agent.py:128 -- Dashboard agent grpc address: 0.0.0.0:63816

python-core-driver0.log

Global stats: 12 total (8 active)
Queueing time: mean = 20.740 us, max = 112.582 us, min = 7.329 us, total = 248.875 us
Execution time:  mean = 9.748 us, total = 116.980 us
Event stats:
        PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 1.763 us, total = 10.579 us
        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 88.407 us, total = 88.407 us
        WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 17.994 us, total = 17.994 us
        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-11-08 12:37:32,260 I 2718136 2718471] accessor.cc:608: Received notification for node id = b8eb05a6aaee9b4f7f7d8b5c2188bfc617ee57f6583b546a554f6939, IsAlive = 1
[2022-11-08 12:37:32,807 W 2718136 2718471] direct_task_transport.cc:488: The worker failed to receive a response from the local raylet because the raylet is unavailable (crashed). Error: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details:
[2022-11-08 12:37:32,807 I 2718136 2718471] task_manager.cc:507: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.partition, class_name=, function_name=_apply_func, function_hash=f11822ef8d2c45b093f2b1e8e55ee73f}, task_id=c8ef45ccd0112571ffffffffffffffffffffffff01000000, task_name=_apply_func, job_id=01000000, num_args=4, num_returns=4, depth=1, attempt_number=0, max_retries=3, serialized_runtime_env={"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}}, eager_install=1, setup_timeout_seconds=600
[2022-11-08 12:37:33,260 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,261 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:637: Disconnecting to the raylet.
[2022-11-08 12:37:34,268 I 2718136 2718136] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-08 12:37:34,268 W 2718136 2718136] raylet_client.cc:188: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet. This means the raylet the worker is connected is probably already dead.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:580: Shutting down a core worker.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:604: Disconnecting a GCS client.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:608: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-08 12:37:34,268 I 2718136 2718471] core_worker.cc:736: Core worker main io service stopped.
[2022-11-08 12:37:34,268 W 2718136 2718474] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:617: Core worker ready to be deallocated.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:571: Core worker is destructed
[2022-11-08 12:37:34,518 I 2718136 2718136] core_worker_process.cc:147: Destructing CoreWorkerProcessImpl. pid: 2718136
[2022-11-08 12:37:34,518 I 2718136 2718136] io_service_pool.cc:47: IOServicePool is stopped.

The error occurs only on Linux and only in our development environment (environment-dev.yml). A freshly installed Modin with ray 2.1.0 works fine:

$ conda create -n clean_env python=3.8
$ pip install modin/
$ pip install ray==2.1.0
$ python reproducer.py # works fine
----
$ conda env create -f modin/environment-dev.yml
$ python reproducer.py # fails
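
Not part of the original report: since only the dev environment breaks, a hedged way to narrow down the culprit is to diff the two environments and bisect the packages that differ (the "modin" env name is assumed from environment-dev.yml; grpcio is mentioned as a suspect later in the thread):

$ conda activate clean_env && pip freeze | sort > clean.txt
$ conda activate modin && pip freeze | sort > dev.txt  # env name assumed from environment-dev.yml
$ diff clean.txt dev.txt                               # bisect the differing pins, e.g. grpcio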

The error also occurs in our CI, where all of the non-Windows jobs that use Ray fail. Example: https://github.com/modin-project/modin/actions/runs/3420080409/jobs/5697742798


P.S. The issue does not seem to be related to the redis version requirement in our env recipe: https://github.com/modin-project/modin/blob/c30ab4c132295b1e168fbb06a050c76554759e11/environment-dev.yml#L53. Both the non-working and the working environments use the same version of redis (3.5.1).
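
An editorial aside, not from the original issue: because the chain of failures starts with the dashboard process dying (return code -11) and the raylet fate-sharing with its agent, a hedged workaround sketch is to start Ray explicitly without the dashboard before importing Modin. include_dashboard is a standard ray.init() argument and MODIN_ENGINE is a documented Modin setting; whether skipping the dashboard actually avoids the crash in this particular environment is an assumption, not something verified in this thread.

import os

# Select the Ray engine explicitly (MODIN_ENGINE is a documented Modin setting).
os.environ["MODIN_ENGINE"] = "ray"

import ray

# Start Ray ourselves and skip the dashboard whose startup failure brings the raylet down.
ray.init(include_dashboard=False)

import modin.pandas as pd

# The original reproducer from this issue.
print((pd.DataFrame([[1]]) + 1)._to_pandas())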

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
h-vetinari commented, Nov 14, 2022

The error occurs only on Linux and only in our development environment (environment-dev.yml).

There are a lot of things wrong with that environment:

  • dask[complete]>=2.22.0 - conda does not understand this syntax and ~it does not do what you expect it to~ only happens to work by accident; unfortunately (and in contrast to other packages like ray and modin) there’s no direct replacement in conda-forge yet.
  • matplotlib<=3.2.2 - why oh why are you forcing the use of a version that’s over 2 years old? This will force the solver into very weird contortions
  • coverage<5.0 - likewise, over 3 years old
  • pygithub==1.53 - over 2 years old
  • rpyc==4.1.5 - over 2 years old

Generally:

  • pandas shouldn’t be pinned to patch version, see https://github.com/modin-project/modin/issues/3371
  • move asv, black, connectorx, flake8, numpydoc, tqdm, xgboost from pip to conda-forge dependencies (ideally also ray)
  • please alphabetize your dependencies (a rough sketch of these changes follows after this list)
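
For illustration only (not proposed in this thread), a hedged sketch of what a loosened, alphabetized environment-dev.yml fragment along these lines could look like; the package names come from the comment above, while the version bounds are placeholders rather than tested constraints:

# sketch of an environment-dev.yml fragment, assuming conda-forge builds exist for the listed packages
name: modin
channels:
  - conda-forge
dependencies:
  - asv              # moved from pip, per the list above
  - black            # moved from pip
  - connectorx       # moved from pip
  - coverage>=5.0    # floor instead of the years-old "<5.0" cap
  - dask>=2.22.0     # plain conda spec instead of the pip-style "dask[complete]" extra
  - flake8           # moved from pip
  - matplotlib       # "<=3.2.2" cap dropped
  - numpydoc         # moved from pip
  - pandas>=1.5      # floor only, no patch-level pin (see modin#3371)
  - pygithub>=1.53
  - ray-default>=2.0 # "ray-default" is assumed to be the conda-forge package name
  - rpyc>=4.1.5
  - tqdm             # moved from pip
  - xgboost          # moved from pip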

Of course things shouldn’t break wherever possible, sorry for that (and we’re looking into what’s going on with grpcio 1.49.1 in this context). Still, I feel obliged to point out that your requirement constraints are somewhere between “actively hostile to the solver” and “shooting yourself in the foot”. I’ve kept banging that drum in https://github.com/modin-project/modin/issues/3371 and on the feedstock, but I cannot seem to get that point across.

It’s one of the prime reasons I (personally) haven’t spent much time with modin - it doesn’t “play nice with others” from the POV of dependency constraints, which makes it very hard to use together with other projects (and their dependencies)[^1]. Note that I’m not criticizing temporary pins to unbreak CI, but with multiple several-year-old constraints, you’re externalising a lot of costs onto the ecosystem (incl. other libraries, packagers and your users).

[^1]: It also makes it really hard to debug situations as in this bug.

1 reaction
pyrito commented, Nov 14, 2022

@h-vetinari thank you for raising this point! I think a lot of what you said makes sense and can be addressed in a few PRs. I agree with you that pinning versions can cause dependency headaches.

You raise a fair point w.r.t. pinning pandas to the patch level. We will take steps towards coming to a solution there!

Read more comments on GitHub

Top Results From Across the Web

Modin fails to load csv from s3 with ray client #2688 - GitHub
The issue is the Client API, which is new. If you create and connect to a Ray cluster without the Client, it will...

Pandas Modin ray library fails to startup - Stack Overflow
Try init-ing ray before you import modin: import os; os.environ["MODIN_ENGINE"] = "ray"; import ray; ray.init(); import modin.pandas as pd.

Troubleshooting — Modin 0.18.0+0.gba7ab8eb.dirty ...
This can happen when Ray fails to start. It will keep retrying, but often it is faster to just restart the notebook or...

Installing Ray — Ray 2.2.0 - the Ray documentation
While using a conda environment, it is recommended to install Ray from PyPI using pip install ray in the newly created environment. Building...

Ray Documentation - Read the Docs
Ray is a flexible, high-performance distributed execution framework. Ray is easy to install: pip install ray.
