[ray] Modin on ray causes ray.tune to hang
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
- Modin version (modin.__version__): 0.10.2
- Python version: 3.7
Describe the problem
tune.run() starts work on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. Once no CPUs are utilized, tune.run() still hasn't finished. The expected behavior is that once tune.run() is called, all cluster resources are utilized until tune.run() finishes.
As discussed with the Ray devs, it seems to be a Modin issue 😃: https://github.com/ray-project/ray/issues/18808. @Yard1: “Each Ray trial takes up 1 CPU resource. Modin operations inside those trials also take up 1 CPU resource each. Because all resources are taken up by trials, the Modin operations cannot progress as they are waiting for resources to become free, which will never happen because the trials are waiting on the Modin operations to finish. Classic deadlock. And this is also why limiting concurrency works, as it allows some CPUs to stay free and thus usable by Modin.”
Additional info: ray monitor cluster.yaml shows that all CPUs are in use.
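For reference, the same check can be done from a Python shell instead of ray monitor; a minimal sketch, assuming the same cluster as the reproduction below, with illustrative resource counts in the comments:

import ray

ray.init(address='auto', _redis_password='xxx')  # same cluster as in the reproduction below

# While tune.run() appears hung, compare total vs. currently free resources.
# In the deadlock described above, the free "CPU" count drops to ~0 and stays there.
print(ray.cluster_resources())    # e.g. {'CPU': 32.0, ...}
print(ray.available_resources())  # e.g. {'CPU': 0.0, ...}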
Source code / logs
import modin.pandas as pd
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

# NOTE: `df` below is assumed to be a Modin DataFrame loaded earlier at module level
# (with columns such as test, bid, ask, decimals_price); it is not shown in the report.

def easy_objective(config, data):
    data_df = data[0]
    # Here be dragons. If either of the lines below is included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"]).test.sum())
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).sum()
    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.index.values, df.bid.values, df.ask.values, df.decimals_price[0]]),
    name="test_study",
    time_budget_s=3600 * 24 * 3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
        "steps": 100,
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
        "activation": tune.grid_search(["relu", "tanh"]),
    },
    metric="score",
    mode="max",
    # but works with this enabled (see the sketch after this snippet)
    # search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1),  # N.B. "-1", else it hangs
)
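A minimal sketch of the workaround in the commented-out search_alg line above, assuming a one-CPU reserve is enough headroom for Modin; the dummy trainable and sample count are placeholders, not part of the original report:

import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

def dummy_objective(config):
    # Placeholder trainable; in the report this is the easy_objective above.
    tune.report(score=config["width"])

# Cap trial concurrency below the cluster CPU count so that Modin tasks
# launched from inside trials always have at least one free CPU to run on.
cluster_cpus = int(ray.cluster_resources().get("CPU", 1))

tune.run(
    dummy_objective,
    config={"width": tune.uniform(0, 20)},
    num_samples=8,
    metric="score",
    mode="max",
    search_alg=BasicVariantGenerator(max_concurrent=max(cluster_cpus - 1, 1)),
)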
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)

In the short term, we should strive to improve documentation to ensure that users are aware that libraries like Modin use the same resource pool as Tune. I’ll be getting out a PR to improve that in Tune on Monday - would be great if a similar mention could be put into Modin’s docs!
Hi @jmakov, thanks for posting! This is a tricky one. Our Ray workers each occupy 1 CPU so that Ray can schedule properly; otherwise, Ray’s scheduler would place Modin tasks inefficiently and potentially oversubscribe the system. Happy to discuss alternative ways of efficiently sharing resources with other Ray libraries with the Ray team, @Yard1 @richardliaw @simon-mo. Do you folks think we need to designate a custom resource for Modin, and how can the scheduler make sure there are enough resources for both Modin and Tune?
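For context on the custom-resource idea, a generic sketch of how a Ray custom resource is declared and consumed; the "modin" resource name and the counts are hypothetical, not something Modin currently does:

import ray

# Hypothetical: expose 4 units of a custom "modin" resource on this node.
ray.init(resources={"modin": 4})

# A task scheduled against the custom resource rather than a CPU slot,
# so it cannot be starved by trials that hold all of the "CPU" resources.
@ray.remote(num_cpus=0, resources={"modin": 1})
def modin_style_task(x):
    return x * 2

print(ray.get([modin_style_task.remote(i) for i in range(4)]))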
I love my Ray developer friends; they are always so happy to blame me 😄. Jokes aside, I’m not sure it’s as simple as one library or another’s fault: as I mention above, it’s something we have to coordinate with them on.