[Core] [Bug] Remote client environment is not setting up properly.

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Ray Clusters

What happened + What you expected to happen

Hi everyone, I have a Ray cluster deployed on Azure K8s. I connect to it using the kubectl command given in the documentation. I first ran the task in a local Ray environment to test its scaling and behavior. Then, when I try to put an object on the cluster as a test, it throws the following error:

Put failed:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-9-bd8872eca93e> in <module>
     16     'combined_dfs': combined_dfs_ray,
     17 }
---> 18 run_validation = validate_source.remote(config)
ModuleNotFoundError: No module named 'sklearn'

But sklearn is available in the local env. Also, for the algorithm we are running, we pass the runtime_env option to ray.init to make our custom code available. All the dependencies are already installed in the env.

LOCAL_PORT = 10001
ray.init(f"ray://127.0.0.1:{LOCAL_PORT}",
         runtime_env={
             "working_dir": "../src",
         })

From the put command I expected an object ID, indicating that the object had been successfully transferred to the cluster.

Versions / Dependencies

I am using:

  • Conda (Windows)
  • Python 3.8.12
  • Ray[default] 1.9.1
  • Remote client on Azure k8s

Reproduction script

I am trying to work out a small code sample, but simple code with no external dependencies works fine in the remote client environment. The error occurs only when I use our existing dev env. I am working on an example in the meantime. Also, if there are any logs that would help, please let me know.
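One way to narrow this down without a full reproduction is to ask a worker which modules it can actually import. The sketch below is an illustration, not part of the original report: `missing_modules` and `check_on_cluster` are hypothetical helper names, and the approach assumes the helper is defined in a notebook or script so Ray can ship it to the worker by value. It only handles top-level module names such as `sklearn`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this process."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_on_cluster(names, address):
    """Run missing_modules inside a Ray worker.

    Hypothetical helper: needs Ray installed and a reachable cluster at `address`.
    """
    import ray  # imported lazily so missing_modules works without Ray

    ray.init(address)
    remote_check = ray.remote(missing_modules)
    return ray.get(remote_check.remote(names))
```

Calling `missing_modules(["sklearn"])` locally and `check_on_cluster(["sklearn"], "ray://127.0.0.1:10001")` against the cluster, then comparing the two results, would show whether the package is absent only on the cluster side.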

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

1 reaction
dongruixiao commented, Jan 11, 2022

I had the same problem.

In [4]: ray.init(address='auto', namespace='algo-serve', runtime_env={'pip':['requests==2.1.2'], 'env_vars': dict(os.environ)})
2022-01-10 12:20:57,224	INFO worker.py:843 -- Connecting to existing Ray cluster at address: 10.251.192.213:6379
Out[4]:
{'node_ip_address': '10.251.192.213',
 'raylet_ip_address': '10.251.192.213',
 'redis_address': '10.251.192.213:6379',
 'object_store_address': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411/sockets/raylet',
 'webui_url': '10.251.192.213:8265',
 'session_dir': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411',
 'metrics_export_port': 61428,
 'node_id': 'ac4b598c8477048e81b883069027527fbaf0ba9be758441c427be970'}

(raylet) [2022-01-10 12:20:57,396 E 49504 49504] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet) [2022-01-10 12:20:57,397 E 49504 49504] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,615 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,617 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,682 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,809 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.

Thanks @dongruixiao, could you share some more details about your setup? Are you using helm charts as well?

I only run it on my custom cluster, implemented via node_provider.py, and do not use Helm.

This is my config:

cluster_name: default

max_workers: 2

upscaling_speed: 1.0

idle_timeout_minutes: 5

provider:
    type: external
    module: test.my_provider
auth:
    ssh_user: ubuntu

available_node_types:
    ray.head.default:
        resources: 
          ...

    ray.worker.default:
        min_workers: 0
        max_workers: 32

        resources:
            ... 
        
head_node_type: ray.head.default

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude: []

rsync_filter: []

initialization_commands: []

setup_commands:
    - test ! -z $all_proxy || echo 'export all_proxy="..."' >> ~/.bashrc
    - test -d $HOME/anaconda3 || wget https://repo.continuum.io/archive/Anaconda3-2021.11-Linux-x86_64.sh
    - test -d $HOME/anaconda3 || bash Anaconda3-2021.11-Linux-x86_64.sh -b -p $HOME/anaconda3
    - which conda || echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
 
head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited
    - ray start --head --port=6379 --object-manager-port=8076 --include-dashboard true --dashboard-host 0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -c unlimited
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

1 reaction
edoakes commented, Jan 7, 2022

When running on the cluster, you need to make sure the deps are also installed there. You don't need to do that when running locally, because everything runs in the same env (the local env).
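As a minimal sketch of one way to follow this advice from the client side, the `runtime_env` passed to `ray.init` can carry a `pip` list next to `working_dir`, so the cluster installs the packages for the job. The helper names `build_runtime_env` and `connect` are hypothetical, the package list is only an example (`scikit-learn` is the PyPI name of `sklearn`), and whether the `pip` field behaves as expected on this particular Ray 1.9.1 setup is an assumption that was not verified in the thread. Installing the dependencies into the cluster image is the other option.

```python
def build_runtime_env(working_dir, pip_packages):
    """Build a runtime_env dict that ships local code and requests pip installs."""
    return {
        "working_dir": working_dir,   # local directory uploaded to the cluster
        "pip": list(pip_packages),    # packages to be installed on the cluster side
    }


def connect(local_port=10001):
    """Connect through Ray Client with the runtime_env above (requires Ray)."""
    import ray  # imported lazily so build_runtime_env can be used without Ray

    return ray.init(
        f"ray://127.0.0.1:{local_port}",
        runtime_env=build_runtime_env("../src", ["scikit-learn"]),
    )
```

Keeping the dict construction separate from the connection makes it easy to print and inspect the environment before connecting.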
