
time budget setting doesn't take effect in remote Ray cluster


Describe the bug

I am running Ludwig on a single-node Ray cluster (head node only). Here is my configuration:

trainer:
  batch_size: auto
  learning_rate: auto
  learning_rate_scaling: sqrt
  decay: true
  decay_steps: 20000
  decay_rate: 0.8
  optimizer:
    type: adam
  validation_field: recommended
  validation_metric: roc_auc
hyperopt:
  search_alg:
    type: hyperopt
    random_state_seed: null
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    scheduler:
....

I notice that it periodically prints the trial status table along with logs like Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s], etc.

== Status ==
Current time: 2022-08-19 23:00:25 (running for 00:04:14.72)
Memory usage on this node: 3.7/15.2 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 72.000: None
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/8.0 GiB heap, 0.0/2.33 GiB objects
Result logdir: /data/results/hyperopt
Number of trials: 4/10 (1 PENDING, 3 RUNNING)
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+
| Trial name     | status   | loc                |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   trainer.decay_rate |   trainer.decay_steps |   trainer.learning_rate |
|----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------|
| trial_1e90f97c | RUNNING  | 192.168.48.244:554 |                   0.4  |                     4096 |                    7 |                    128 |                            2 |               8 |              0      |                 0.8  |                  8000 |                   0.02  |
| trial_20e1a424 | RUNNING  | 192.168.48.244:613 |                   0.05 |                     1024 |                    7 |                     64 |                            2 |              32 |              1e-06  |                 0.95 |                  2000 |                   0.005 |
| trial_20e552f4 | RUNNING  | 192.168.48.244:616 |                   0.3  |                     2048 |                    9 |                     16 |                            1 |              24 |              0.0001 |                 0.95 |                  2000 |                   0.005 |
| trial_20e900ac | PENDING  |                    |                   0.3  |                      256 |                    6 |                     24 |                            2 |              24 |              0      |                 0.9  |                 10000 |                   0.025 |
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+

Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:08<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:09<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:09<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:06<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]

The tricky thing is that after 5 minutes (time_budget_s: 300), it only prints output like the lines below and never shows the trial status table again. It seems some problem with the dataset is preventing the driver program from stopping. Has anyone run into the same problem?

Read->Map_Batches:   0%|          | 0/1 [05:41<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:45<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:42<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:42<?, ?it/s]
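
For context (not part of the original report): Ludwig's hyperopt.executor.time_budget_s is, as far as I understand, passed through to Ray Tune's global time budget. A minimal standalone sketch, assuming only Ray is installed, can be used to sanity-check that behavior outside of Ludwig; the trainable and the 15-second budget are made up for illustration.

import time

import ray
from ray import tune


def trainable(config):
    # Each trial sleeps and reports a dummy metric once per second.
    for step in range(60):
        time.sleep(1)
        tune.report(score=step)


if __name__ == "__main__":
    ray.init()
    analysis = tune.run(
        trainable,
        num_samples=100,          # far more samples than the budget allows
        time_budget_s=15,         # analogous to hyperopt.executor.time_budget_s
        resources_per_trial={"cpu": 1},
    )
    print(f"Trials created before the budget ran out: {len(analysis.trials)}")

If this stops on time while the Ludwig run keeps printing Read->Map_Batches at 0%, the hang is more likely in the dataset tasks than in the budget handling itself, which matches the analysis in the comments below.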

To Reproduce

  1. Create a single-node Ray cluster using the RayCluster manifest below
  2. ludwig hyperopt --config /data/rotten_tomatoes.yaml --dataset /data/rotten_tomatoes.csv
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: x-automl-cluster
  namespace: ray-system
spec:
  rayVersion: '1.13.0'
  headGroupSpec:
    serviceType: NodePort
    replicas: 1
    rayStartParams:
      port: '6379'
      dashboard-host: '0.0.0.0'
      node-ip-address: $MY_POD_IP
    template:
      spec:
        containers:
        - name: ray-head
          image: seedjeffwan/automl:v0.3.3
          env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "3"
              memory: "8192Mi"
            requests:
              cpu: "3"
              memory: "8192Mi"


Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
ShreyaR commented, Sep 8, 2022

@Jeffwan, after looking into this more, I can confirm that this issue is caused by insufficient CPU resources.

In your current setup, all CPU cores are reserved by the trials, so there are no cores available for Ray Dataset tasks like data loading and map_batches calls.
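
A quick way to confirm this from the head node (a hedged diagnostic sketch, not from the original thread) is to attach to the running cluster while the trials are going and compare total versus available CPUs:

import ray

# Run this on the head node while hyperopt is in progress.
ray.init(address="auto")

total_cpus = ray.cluster_resources().get("CPU", 0.0)
free_cpus = ray.available_resources().get("CPU", 0.0)
print(f"CPUs available: {free_cpus}/{total_cpus}")

if free_cpus == 0:
    print("All CPUs are reserved by trials; Ray Dataset tasks such as "
          "Read->Map_Batches cannot be scheduled and will sit at 0%.")

On the 3-CPU head node from the manifest above, this should report 0 available CPUs once three one-CPU trials are running.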

I’ve filed issue #2465, which will allow manually throttling the number of concurrent trials Ludwig trains when there are limited CPU resources.

In the meantime, there are 2 options for getting unblocked:

  • Setting max_concurrent_trials in the hyperopt config to <= 2 if training on your current cluster (see the short sketch after this list for one way to arrive at that number). This should look like:

      hyperopt:
        executor:
          max_concurrent_trials: 2

  • Training on a larger cluster.
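
As a rough illustration of the reasoning above (an assumption for this writeup, not logic that ships with Ludwig), one way to arrive at a max_concurrent_trials value is to leave at least one CPU free for Ray Dataset and preprocessing work:

import ray

ray.init(address="auto")
total_cpus = int(ray.cluster_resources().get("CPU", 1))

# Assumes each trial requests a single CPU; adjust if your trials ask for more.
max_concurrent_trials = max(1, total_cpus - 1)
print(f"Suggested hyperopt.executor.max_concurrent_trials: {max_concurrent_trials}")

On the 3-CPU head node defined in the reproduction manifest, this yields 2, matching the recommendation above.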

Let me know if you have any follow up questions about this!

1 reaction
Jeffwan commented, Oct 5, 2022

I am using v0.6.0 (Ray 2.0) and I haven’t seen any similar issues, so I can confirm my issue is resolved. Thanks a lot!
