[ray] Modin on ray causes ray.tune to hang
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
- Modin version (modin.__version__): 0.10.2
- Python version: 3.7
Describe the problem
tune.run() starts work on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. Once no CPUs are utilized, tune.run() still hasn't finished. The expected behavior is that once tune.run() is called, all cluster resources are utilized until tune.run() finishes.
As discussed with the Ray devs, it seems to be a Modin issue 😃: https://github.com/ray-project/ray/issues/18808. @Yard1: “Each Ray trial takes up 1 CPU resource. Modin operations inside those trials also take up 1 CPU resource each. Because all resources are taken up by trials, the Modin operations cannot progress as they are waiting for resources to become free, which will never happen because the trials are waiting on the Modin operations to finish. Classic deadlock. And this is also why limiting concurrency works, as it allows some CPUs to stay free and thus usable by Modin.”
Additional info: ray monitor cluster.yaml shows that all CPUs are in use.
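For reference, the same check can be done from a Python shell instead of ray monitor; a minimal sketch, assuming the same cluster as the reproduction below, with illustrative resource counts in the comments:

import ray

ray.init(address='auto', _redis_password='xxx')  # same cluster as in the reproduction below

# While tune.run() appears hung, compare total vs. currently free resources.
# In the deadlock described above, the free "CPU" count drops to ~0 and stays there.
print(ray.cluster_resources())    # e.g. {'CPU': 32.0, ...}
print(ray.available_resources())  # e.g. {'CPU': 0.0, ...}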
Source code / logs
import modin.pandas as pd
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

# NOTE: `df` below is assumed to be a Modin DataFrame loaded earlier at module level
# (with columns such as test, bid, ask, decimals_price); it is not shown in the report.

def easy_objective(config, data):
    data_df = data[0]
    # Here be dragons. If either of the lines below is included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"]).test.sum())
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).sum()
    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.index.values, df.bid.values, df.ask.values, df.decimals_price[0]]),
    name="test_study",
    time_budget_s=3600 * 24 * 3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
        "steps": 100,
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
        "activation": tune.grid_search(["relu", "tanh"]),
    },
    metric="score",
    mode="max",
    # but works with this enabled (see the sketch after this snippet)
    # search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1),  # N.B. "-1", else it hangs
)
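A minimal sketch of the workaround in the commented-out search_alg line above, assuming a one-CPU reserve is enough headroom for Modin; the dummy trainable and sample count are placeholders, not part of the original report:

import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

def dummy_objective(config):
    # Placeholder trainable; in the report this is the easy_objective above.
    tune.report(score=config["width"])

# Cap trial concurrency below the cluster CPU count so that Modin tasks
# launched from inside trials always have at least one free CPU to run on.
cluster_cpus = int(ray.cluster_resources().get("CPU", 1))

tune.run(
    dummy_objective,
    config={"width": tune.uniform(0, 20)},
    num_samples=8,
    metric="score",
    mode="max",
    search_alg=BasicVariantGenerator(max_concurrent=max(cluster_cpus - 1, 1)),
)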
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)

In the short term, we should strive to improve documentation to ensure that users are aware that libraries like Modin use the same resource pool as Tune. I’ll be getting out a PR to improve that in Tune on Monday - would be great if a similar mention could be put into Modin’s docs!
Hi @jmakov, thanks for posting! This is a tricky one. Our Ray workers each occupy 1 CPU so that Ray can schedule properly; otherwise, Ray’s scheduler would place Modin tasks inefficiently and potentially oversubscribe the system. Happy to discuss alternative ways of efficiently sharing resources with other Ray libraries with the Ray team, @Yard1 @richardliaw @simon-mo. Do you folks think we need to designate a custom resource for Modin, and how can the scheduler make sure there are enough resources for both Modin and Tune?
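For context on the custom-resource idea, a generic sketch of how a Ray custom resource is declared and consumed; the "modin" resource name and the counts are hypothetical, not something Modin currently does:

import ray

# Hypothetical: expose 4 units of a custom "modin" resource on this node.
ray.init(resources={"modin": 4})

# A task scheduled against the custom resource rather than a CPU slot,
# so it cannot be starved by trials that hold all of the "CPU" resources.
@ray.remote(num_cpus=0, resources={"modin": 1})
def modin_style_task(x):
    return x * 2

print(ray.get([modin_style_task.remote(i) for i in range(4)]))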
I love my Ray developer friends; they are always so happy to blame me 😄. Jokes aside, I’m not sure it’s as simple as one library or another’s fault: as I mention above, it’s something we have to coordinate with them on.