time budget setting doesn't take effect in remote Ray cluster
Describe the bug
I am running Ludwig in a single-node Ray cluster (head only). Here is my config:
```yaml
trainer:
  batch_size: auto
  learning_rate: auto
  learning_rate_scaling: sqrt
  decay: true
  decay_steps: 20000
  decay_rate: 0.8
  optimizer:
    type: adam
  validation_field: recommended
  validation_metric: roc_auc
hyperopt:
  search_alg:
    type: hyperopt
    random_state_seed: null
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    scheduler:
      ....
```
I notice it periodically prints the trial information along with logs like `Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]`, etc.
== Status ==
Current time: 2022-08-19 23:00:25 (running for 00:04:14.72)
Memory usage on this node: 3.7/15.2 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 72.000: None
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/8.0 GiB heap, 0.0/2.33 GiB objects
Result logdir: /data/results/hyperopt
Number of trials: 4/10 (1 PENDING, 3 RUNNING)
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | trainer.decay_rate | trainer.decay_steps | trainer.learning_rate |
|----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------|
| trial_1e90f97c | RUNNING | 192.168.48.244:554 | 0.4 | 4096 | 7 | 128 | 2 | 8 | 0 | 0.8 | 8000 | 0.02 |
| trial_20e1a424 | RUNNING | 192.168.48.244:613 | 0.05 | 1024 | 7 | 64 | 2 | 32 | 1e-06 | 0.95 | 2000 | 0.005 |
| trial_20e552f4 | RUNNING | 192.168.48.244:616 | 0.3 | 2048 | 9 | 16 | 1 | 24 | 0.0001 | 0.95 | 2000 | 0.005 |
| trial_20e900ac | PENDING | | 0.3 | 256 | 6 | 24 | 2 | 24 | 0 | 0.9 | 10000 | 0.025 |
+----------------+----------+--------------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+----------------------+-----------------------+-------------------------+
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:08<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:09<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:09<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:06<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [04:05<?, ?it/s]
The tricky thing is that after 5 minutes (time_budget_s: 300), it only prints output like the following and never prints the trial information again. It seems there is some problem with the dataset processing that prevents the driver program from stopping. Has anyone run into the same problem?
Read->Map_Batches: 0%| | 0/1 [05:41<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [05:45<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [05:42<?, ?it/s]
Read->Map_Batches: 0%| | 0/1 [05:42<?, ?it/s]
To Reproduce
- Create a single node ray cluster
- ludwig hyperopt --config /data/rotten_tomatoes.yaml --dataset /data/rotten_tomatoes.csv
```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: x-automl-cluster
  namespace: ray-system
spec:
  rayVersion: '1.13.0'
  headGroupSpec:
    serviceType: NodePort
    replicas: 1
    rayStartParams:
      port: '6379'
      dashboard-host: '0.0.0.0'
      node-ip-address: $MY_POD_IP
    template:
      spec:
        containers:
        - name: ray-head
          image: seedjeffwan/automl:v0.3.3
          env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "ray stop"]
          resources:
            limits:
              cpu: "3"
              memory: "8192Mi"
            requests:
              cpu: "3"
              memory: "8192Mi"
```
Issue Analytics
- Created a year ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
@Jeffwan after looking into this more, I can confirm that this issue is caused by a lack of CPU resources.
In your current setup, all CPU cores are reserved by the trials, so there are no cores left for Ray Datasets tasks such as data loading and map-batches calls.
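A minimal sketch of the resource math, assuming each trial reserves one CPU (consistent with the `Resources requested: 3.0/3 CPUs` line in the status output above):

```python
# Hypothetical resource accounting for the 3-CPU cluster described above.
# Assumption: each running trial reserves 1 CPU, matching the status log.
total_cpus = 3
cpus_per_trial = 1
concurrent_trials = 3

# CPUs left over for Ray Datasets tasks (data loading, map_batches):
cpus_for_datasets = total_cpus - concurrent_trials * cpus_per_trial
print(cpus_for_datasets)  # 0 -> Read->Map_Batches tasks can never be scheduled
```

With zero free CPUs, the `Read->Map_Batches` progress bars stay at 0% indefinitely, which is why the run never winds down after the time budget expires.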
I’ve filed issue #2465, which will allow Ludwig users to manually throttle the number of concurrent trials when CPU resources are limited.
In the meantime, there are 2 options for getting unblocked; one is to set max_concurrent_trials in the hyperopt config to <= 2 when training on your current cluster. Let me know if you have any follow-up questions about this!
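As a sketch of that workaround (hedged: the exact placement of `max_concurrent_trials` may vary by Ludwig version; in recent releases it sits under the hyperopt executor), the config above could be adjusted like so:

```yaml
hyperopt:
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    # Leave at least one CPU free for Ray Datasets tasks on a 3-CPU node.
    max_concurrent_trials: 2
```

Capping concurrency at 2 leaves one of the three CPUs available for data loading and map-batches work, so trials can actually make progress and the time budget can take effect.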
I am using v0.6.0 (Ray 2.0) and I didn’t see any similar issues; I can confirm my issue was resolved. Thanks a lot!