
time budget setting doesn't take effect in remote Ray cluster


Describe the bug

I am running Ludwig on a single-node Ray cluster (head node only). Here is my configuration:

trainer:
  batch_size: auto
  learning_rate: auto
  learning_rate_scaling: sqrt
  decay: true
  decay_steps: 20000
  decay_rate: 0.8
  optimizer:
    type: adam
  validation_field: recommended
  validation_metric: roc_auc
hyperopt:
  search_alg:
    type: hyperopt
    random_state_seed: null
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    scheduler:
....

I notice that it periodically prints the trial status table along with logs like Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s], etc.

== Status ==
Current time: 2022-08-19 23:00:25 (running for 00:04:14.72)
Memory usage on this node: 3.7/15.2 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 72.000: None
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/8.0 GiB heap, 0.0/2.33 GiB objects
Result logdir: /data/results/hyperopt
Number of trials: 4/10 (1 PENDING, 3 RUNNING)
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+
| Trial name     | status   | loc                |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   trainer.decay_rate |   trainer.decay_steps |   trainer.learning_rate |
|----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------|
| trial_1e90f97c | RUNNING  | 192.168.48.244:554 |                   0.4  |                     4096 |                    7 |                    128 |                            2 |               8 |              0      |                 0.8  |                  8000 |                   0.02  |
| trial_20e1a424 | RUNNING  | 192.168.48.244:613 |                   0.05 |                     1024 |                    7 |                     64 |                            2 |              32 |              1e-06  |                 0.95 |                  2000 |                   0.005 |
| trial_20e552f4 | RUNNING  | 192.168.48.244:616 |                   0.3  |                     2048 |                    9 |                     16 |                            1 |              24 |              0.0001 |                 0.95 |                  2000 |                   0.005 |
| trial_20e900ac | PENDING  |                    |                   0.3  |                      256 |                    6 |                     24 |                            2 |              24 |              0      |                 0.9  |                 10000 |                   0.025 |
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+

Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:08<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:09<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:09<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:06<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [04:05<?, ?it/s]

The tricky thing is that after 5 minutes (time_budget_s: 300), it only prints output like the lines below and never shows the trial status table again. It seems some problem with the dataset is preventing the driver program from stopping. Has anyone run into the same problem?

Read->Map_Batches:   0%|          | 0/1 [05:41<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:45<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:42<?, ?it/s]
Read->Map_Batches:   0%|          | 0/1 [05:42<?, ?it/s]
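
For context (not part of the original report): Ludwig's hyperopt.executor.time_budget_s is, as far as I understand, passed through to Ray Tune's global time budget. A minimal standalone sketch, assuming only Ray is installed, can be used to sanity-check that behavior outside of Ludwig; the trainable and the 15-second budget are made up for illustration.

import time

import ray
from ray import tune


def trainable(config):
    # Each trial sleeps and reports a dummy metric once per second.
    for step in range(60):
        time.sleep(1)
        tune.report(score=step)


if __name__ == "__main__":
    ray.init()
    analysis = tune.run(
        trainable,
        num_samples=100,          # far more samples than the budget allows
        time_budget_s=15,         # analogous to hyperopt.executor.time_budget_s
        resources_per_trial={"cpu": 1},
    )
    print(f"Trials created before the budget ran out: {len(analysis.trials)}")

If this stops on time while the Ludwig run keeps printing Read->Map_Batches at 0%, the hang is more likely in the dataset tasks than in the budget handling itself, which matches the analysis in the comments below.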

To Reproduce

  1. Create a single-node Ray cluster using the RayCluster manifest below
  2. ludwig hyperopt --config /data/rotten_tomatoes.yaml --dataset /data/rotten_tomatoes.csv
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: x-automl-cluster
  namespace: ray-system
spec:
  rayVersion: '1.13.0'
  headGroupSpec:
    serviceType: NodePort
    replicas: 1
    rayStartParams:
      port: '6379'
      dashboard-host: '0.0.0.0'
      node-ip-address: $MY_POD_IP
    template:
      spec:
        containers:
        - name: ray-head
          image: seedjeffwan/automl:v0.3.3
          env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "3"
              memory: "8192Mi"
            requests:
              cpu: "3"
              memory: "8192Mi"


Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
ShreyaR commented, Sep 8, 2022

@Jeffwan, after looking into this more, I can confirm that this issue is caused by insufficient CPU resources.

In your current setup, all CPU cores are reserved by the trials, so there are no cores available for Ray Dataset tasks like data loading and map_batches calls.
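
A quick way to confirm this from the head node (a hedged diagnostic sketch, not from the original thread) is to attach to the running cluster while the trials are going and compare total versus available CPUs:

import ray

# Run this on the head node while hyperopt is in progress.
ray.init(address="auto")

total_cpus = ray.cluster_resources().get("CPU", 0.0)
free_cpus = ray.available_resources().get("CPU", 0.0)
print(f"CPUs available: {free_cpus}/{total_cpus}")

if free_cpus == 0:
    print("All CPUs are reserved by trials; Ray Dataset tasks such as "
          "Read->Map_Batches cannot be scheduled and will sit at 0%.")

On the 3-CPU head node from the manifest above, this should report 0 available CPUs once three one-CPU trials are running.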

I’ve filed issue #2465, which will allow manually throttling the number of concurrent trials Ludwig trains when there are limited CPU resources.

In the meantime, there are 2 options for getting unblocked:

  • Setting max_concurrent_trials in the hyperopt config to <= 2 if training on your current cluster (see the short sketch after this list for one way to arrive at that number). This should look like:

      hyperopt:
        executor:
          max_concurrent_trials: 2

  • Training on a larger cluster.
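
As a rough illustration of the reasoning above (an assumption for this writeup, not logic that ships with Ludwig), one way to arrive at a max_concurrent_trials value is to leave at least one CPU free for Ray Dataset and preprocessing work:

import ray

ray.init(address="auto")
total_cpus = int(ray.cluster_resources().get("CPU", 1))

# Assumes each trial requests a single CPU; adjust if your trials ask for more.
max_concurrent_trials = max(1, total_cpus - 1)
print(f"Suggested hyperopt.executor.max_concurrent_trials: {max_concurrent_trials}")

On the 3-CPU head node defined in the reproduction manifest, this yields 2, matching the recommendation above.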

Let me know if you have any follow up questions about this!

1 reaction
Jeffwan commented, Oct 5, 2022

I am using v0.6.0 (Ray 2.0) and I haven’t seen any similar issues, so I can confirm my issue is resolved. Thanks a lot!
