0% GPU usage when using `hyperparameter_search`
## Environment info
- `transformers` version: 4.4.0.dev0
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0+cu101 (True)
- Tensorflow version (GPU?): 2.4.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No (single GPU, on Colab)
## Who can help
Models:
- ray/raytune: @richardliaw, @amogkam
- trainer: @sgugger
## Information
Model I am using (Bert, XLNet …): RoBERTa
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
## To reproduce
This is a continuation of #10055, where the underlying code is the same, and it is more or less the same as the official example. The problem is that when I start `hyperparameter_search`, it just keeps running with 0% GPU usage (memory is occupied) and the CPU also remains relatively idle:
```
== Status ==
Memory usage on this node: 5.9/25.5 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 1/1 GPUs, 0.0/14.99 GiB heap, 0.0/5.18 GiB objects (0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/_inner_2021-02-15_11-45-33
Number of trials: 1/100 (1 RUNNING)
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+
| Trial name         | status   | loc   | adafactor   | adam_beta1   | adam_beta2   | adam_epsilon   | learning_rate   | max_grad_norm   | num_train_epochs   | per_device_train_batch_size   | seed    | weight_decay   |
|--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------|
| _inner_4fd43_00000 | RUNNING  |       | True        | 0.862131     | 0.813033     | 1e-09          | 2.34754e-05     | 0.0056821       | 2                  | 16                            | 21.1968 | 0.95152        |
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+
```
Sometimes there are also warnings that the single worker is pending due to lack of resources, even though my CPU usage is minimal, plenty of RAM is free (~24 GB), and the GPU also has about a gig of free memory:
```
2021-02-15 13:56:53,761 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff44ed5e1383be630817647ecd01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {3.000000/4.000000 CPU, 14.990234 GiB/14.990234 GiB memory, 0.000000/1.000000 GPU, 1.000000/1.000000 node:172.28.0.2, 5.126953 GiB/5.126953 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
```
This is what the tuner looks like:
```python
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.schedulers import PopulationBasedTraining
from ray import tune
import random

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="accuracy",
    mode="max",
    perturbation_interval=10,  # every 10 `time_attr` units (training_iterations in this case)
    hyperparam_mutations={
        # tune.uniform takes (lower, upper)
        "weight_decay": tune.uniform(0.0001, 1.0),
        "seed": tune.uniform(1, 20000),
        "learning_rate": tune.choice([1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5, 2e-7, 1e-7, 3e-7, 2e-8]),
        "adafactor": tune.choice([True, False]),
        "adam_beta1": tune.uniform(0.0, 1.0),
        "adam_beta2": tune.uniform(0.0, 1.0),
        "adam_epsilon": tune.choice([1e-8, 2e-8, 3e-8, 1e-9, 2e-9, 3e-10]),
        "max_grad_norm": tune.uniform(0.0, 1.0),
    },
)

best_run = trainer.hyperparameter_search(
    n_trials=100,
    compute_objective="accuracy",
    direction="maximize",
    backend="ray",
    scheduler=pbt,
)
```
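Given the scheduling warning above, one thing worth sanity-checking (not something confirmed in this thread) is an explicit per-trial resource request. Assuming extra keyword arguments to `hyperparameter_search` are forwarded to `ray.tune.run`, a minimal sketch would be:

```python
# Hedged sketch: explicitly request one CPU and one GPU per trial
# (assumes the ray backend forwards extra kwargs to ray.tune.run).
best_run = trainer.hyperparameter_search(
    n_trials=100,
    direction="maximize",
    backend="ray",
    scheduler=pbt,
    resources_per_trial={"cpu": 1, "gpu": 1},
)
```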
Using `HyperOptSearch` instead causes OOMs.
## Top GitHub Comments
So I tried that above, but apparently `evaluate` does not return "accuracy", so as a workaround I switched to `eval_accuracy`. But this creates a new problem: the error comes in the first trial, but it doesn't go on to the next trial. Could it be that it is training? GPU usage seems to be 0, so I doubt it is training, but it is not terminating the process or moving on either. Strange. It looks like it is pointing to `objective`, which is the same function you wrote above.

Interestingly, removing the args `compute_objective` and `direction` does not yield anything, so I figured the problem must be elsewhere. Putting `eval_accuracy` in the PBT parameters and making `compute_objective` return it solves the issue. Thanks a lot @amogkam for your support!! We need more people like you 👍 🚀 🥳
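A minimal sketch of the working configuration described above (the mutation ranges here are illustrative, not the exact ones used):

```python
from ray.tune.schedulers import PopulationBasedTraining
from ray import tune

# Use the key that trainer.evaluate() actually reports ("eval_accuracy")
# rather than "accuracy".
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_accuracy",
    mode="max",
    perturbation_interval=10,
    hyperparam_mutations={
        "learning_rate": tune.uniform(1e-7, 6e-5),
        "weight_decay": tune.uniform(0.0001, 1.0),
    },
)
```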
Ah @neel04, the error message is happening because `compute_metrics` is being passed as the `compute_objective` arg in `trainer.hyperparameter_search`. If you remove this arg, your code runs fine. `compute_objective` should be a function that takes in the output of `evaluate` (which is the dictionary returned by `compute_metrics`) as an input and returns a single float value (see the docstring). It is not the same as `compute_metrics`. So here you should just be returning the "accuracy" value from the input dictionary. Something like this should work, I believe:
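A minimal sketch of such a `compute_objective` (the metric key is an assumption; use whichever key your `compute_metrics` actually returns):

```python
def compute_objective(metrics):
    # `metrics` is the dict returned by trainer.evaluate() / compute_metrics
    return metrics["eval_accuracy"]

best_run = trainer.hyperparameter_search(
    n_trials=100,
    direction="maximize",
    backend="ray",
    scheduler=pbt,
    compute_objective=compute_objective,
)
```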