
# 0% GPU usage when using `hyperparameter_search`


## Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No (Single GPU) --> Colab

## Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

## To reproduce

This is a continuation of #10055, where the underlying code is the same and more or less matches the official example. The problem is that when I start `hyperparameter_search`, the trial just keeps running with 0% GPU usage (GPU memory is occupied) while the CPU also remains mostly idle:


== Status ==
Memory usage on this node: 5.9/25.5 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 1/1 GPUs, 0.0/14.99 GiB heap, 0.0/5.18 GiB objects (0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/_inner_2021-02-15_11-45-33
Number of trials: 1/100 (1 RUNNING)
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+
| Trial name         | status   | loc   | adafactor   |   adam_beta1 |   adam_beta2 |   adam_epsilon |   learning_rate |   max_grad_norm |   num_train_epochs |   per_device_train_batch_size |    seed |   weight_decay |
|--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------|
| _inner_4fd43_00000 | RUNNING  |       | True        |     0.862131 |     0.813033 |          1e-09 |     2.34754e-05 |       0.0056821 |                  2 |                            16 | 21.1968 |        0.95152 |
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+

Sometimes there are also warnings that the single worker is pending due to lack of resources, even though my CPU usage is minimal, plenty of RAM is free (~24 GB), and the GPU still has about a gigabyte of free memory:


2021-02-15 13:56:53,761	WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff44ed5e1383be630817647ecd01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {3.000000/4.000000 CPU, 14.990234 GiB/14.990234 GiB memory, 0.000000/1.000000 GPU, 1.000000/1.000000 node:172.28.0.2, 5.126953 GiB/5.126953 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
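
As a sanity check (this snippet is not from the original notebook, just a sketch), Ray itself can report what resources it has registered and what is still unclaimed. And since, as far as I can tell, extra keyword arguments to `hyperparameter_search` are forwarded to `ray.tune.run`, the per-trial resource request can also be made explicit:

```python
import ray

# Sketch only: inspect what Ray believes this node has before launching the search.
ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # everything Ray registered (CPUs, GPUs, memory)
print(ray.available_resources())  # what is still unclaimed right now

# Extra kwargs to hyperparameter_search should be forwarded on to ray.tune.run,
# so the per-trial request can be stated explicitly (assumes the `trainer` and
# `pbt` objects from the reproduction below):
# best_run = trainer.hyperparameter_search(
#     n_trials=100,
#     backend="ray",
#     scheduler=pbt,
#     resources_per_trial={"cpu": 1, "gpu": 1},
# )
```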

This is what the tuner looks like:


from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.schedulers import PopulationBasedTraining
from ray import tune
import random

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="accuracy",
    mode="max",
    perturbation_interval=10,  # every 10 `time_attr` units
                               # (training_iterations in this case)
    hyperparam_mutations={
        "weight_decay": tune.uniform(1, 0.0001),
        "seed": tune.uniform(1, 20000),
        "learning_rate": tune.choice([1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5, 2e-7, 1e-7, 3e-7, 2e-8]),
        "adafactor": tune.choice(['True', 'False']),
        "adam_beta1": tune.uniform(1.0, 0.0),
        "adam_beta2": tune.uniform(1.0, 0.0),
        "adam_epsilon": tune.choice([1e-8, 2e-8, 3e-8, 1e-9, 2e-9, 3e-10]),
        "max_grad_norm": tune.uniform(1.0, 0.0),
    })

best_run = trainer.hyperparameter_search(n_trials=100, compute_objective='accuracy', direction="maximize", backend='ray',
                                         scheduler=pbt)

Using HyperOptSearch instead causes OOMs.


## Top GitHub Comments

neel04 commented, Feb 17, 2021 (6 reactions)

So I tried that above, but apparently `evaluate` does not return “accuracy”, so as a workaround I switched to `eval_accuracy`. That creates a new problem: the error below appears in the first trial, but the run does not move on to the next trial. Could it still be training? GPU usage seems to be 0%, so I doubt it, yet the process neither terminates nor moves on. Strange.


2021-02-17 10:57:12,244	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::ImplicitFunc.train_buffered() (pid=1340, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1340, ip=172.28.0.2)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 576, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 651, in _inner
    inner(config, checkpoint_dir=None)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 645, in inner
    fn(config, **fn_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 160, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 925, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-12-cd510628f360>", line 10, in __getitem__
TypeError: new(): invalid data type 'str'

It looks like it is pointing to ‘objective’, which is the same function you wrote above:

def compute_objective(metrics):
    return metrics["eval_accuracy"]  # evaluate() does not return a plain "accuracy" key
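
For context (this example is mine, not part of the original comment): the reason a plain “accuracy” key is missing is that `Trainer.evaluate()` prefixes every key coming out of `compute_metrics` with `eval_`. A minimal `compute_metrics` sketch:

```python
import numpy as np

def compute_metrics(eval_pred):
    # Returns a key named "accuracy" here, but Trainer.evaluate() re-exposes it
    # as "eval_accuracy" in the metrics dict that reaches compute_objective.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}
```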

Interestingly, removing the `compute_objective` and `direction` args doesn’t help either, so I figured the problem must be elsewhere.

Putting `eval_accuracy` in the PBT parameters and having `compute_objective` return it solves the issue (roughly as sketched below).
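
The working combination looks roughly like this (a sketch reconstructed from this thread, not a verbatim copy of my notebook; `trainer` is the same Trainer as in the reproduction above, and the mutation space is trimmed for brevity):

```python
from ray.tune.schedulers import PopulationBasedTraining
from ray import tune

# PBT has to track the prefixed metric name that Trainer.evaluate() reports...
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_accuracy",
    mode="max",
    perturbation_interval=10,
    hyperparam_mutations={
        "learning_rate": tune.choice([1e-5, 2e-5, 3e-5, 4e-5, 5e-5]),
        "weight_decay": tune.uniform(0.0001, 1.0),
    },
)

# ...and compute_objective has to read the same key.
def compute_objective(metrics):
    return metrics["eval_accuracy"]

best_run = trainer.hyperparameter_search(
    n_trials=100,
    direction="maximize",
    backend="ray",
    scheduler=pbt,
    compute_objective=compute_objective,
)
```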

Thanks a lot @amogkam for your support!! We need more people like you 👍 🚀 🥳

amogkam commented, Feb 17, 2021 (0 reactions)

Ah @neel04, the error message is happening because `compute_metrics` is being passed as the `compute_objective` arg in `trainer.hyperparameter_search`. If you remove this arg, your code runs fine.

`compute_objective` should be a function that takes the output of `evaluate` (which is the dictionary returned by `compute_metrics`) as input and returns a single float value (see the docstring). It is not the same as `compute_metrics`. So here you should just return the “accuracy” value from the input dictionary. Something like this should work, I believe:

def compute_objective(metrics):
    # Return the single float value that the search should maximize.
    return metrics["accuracy"]
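
In short: `compute_objective` must be a callable that takes the metrics dictionary and returns a float, not a string name (as in the original `compute_objective='accuracy'` call above) and not the `compute_metrics` function itself.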

