
Cannot run grid search using Trainer API and Ray Tune


Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): 2.6.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help

@richardliaw, @amogkam

@sgugger

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Hi, I am trying to run a grid search on my RoBERTa model.

Steps to reproduce the behavior:

from transformers import Trainer, TrainingArguments

# Candidate hyperparameter values to search over
hyperParameters = {
    "per_gpu_batch_size": [32],
    "learning_rate": [2e-5],
    "num_epochs": [2, 3]
}

def my_hp_space_ray(trial):
    from ray import tune

    return {
        "learning_rate": tune.choice(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.choice(hyperParameters.get("num_epochs")),
    }

training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",  # can be "epoch" or "steps"
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)

trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_val,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
)

2021-08-30 21:07:06,743 ERROR trial_runner.py:773 -- Trial _objective_2e533_00001: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1762, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1794, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1184, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 529, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 414, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 344, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 257, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.90 GiB total capacity; 13.14 GiB already allocated; 312.75 MiB free; 13.29 GiB reserved in total by PyTorch)
Result for _objective_2e533_00001: {}
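
A side note alongside the OOM error: tune.choice samples each value at random, so the hp_space above is a random search over the listed values rather than an exhaustive grid. If a true grid search is the goal, a minimal sketch using Ray Tune's tune.grid_search (reusing the same hyperParameters dict; my_hp_space_grid is a hypothetical name, and as far as I understand, n_trials is forwarded to Ray as num_samples, which repeats the full grid rather than capping it):

def my_hp_space_grid(trial):
    from ray import tune

    # grid_search enumerates every listed value instead of sampling at random,
    # so Ray runs one trial per combination in the grid.
    return {
        "learning_rate": tune.grid_search(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.grid_search(hyperParameters.get("num_epochs")),
    }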

Expected behavior

Hi, I would like to hyperparameter-tune my RoBERTa model using Ray Tune and the Trainer API. Is there a way to avoid running out of memory, even if it takes longer to finish? Or is there some other type of parameter tuning I should use instead?

I spent the whole day trying to figure it out, so any help would be hugely appreciated.
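
One hedged way to trade runtime for memory with the Trainer API is to shrink the per-device batch size and keep the effective batch size via gradient accumulation, optionally with mixed precision. A minimal sketch, reusing the TrainingArguments from the reproduction above:

from transformers import TrainingArguments

# Sketch: smaller micro-batches plus gradient accumulation keep the
# effective train batch size at 32 while lowering peak GPU memory per step.
training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # 8 * 4 = effective train batch of 32
    fp16=True,                      # mixed precision, if the GPU supports it
    evaluation_strategy="epoch",
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)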

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

1 reaction
mosh98 commented, Aug 31, 2021

@Yard1 @richardliaw updating transformers did help. Huge thanks!

Although I have one last question: is there a way to specify using TPUs instead of GPUs in the API?

At the moment I have it like this:

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=1,
    hp_space=my_hp_space_ray,
    resources_per_trial={"cpu": 1, "gpu": 1},
    fail_fast="raise",
)
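
For context, resources_per_trial tells Ray how much of the machine each trial reserves, and Ray Tune also accepts fractional GPU values, which lets several trials share one device (at the cost of higher memory pressure per GPU). A minimal sketch, assuming the same my_hp_space_ray as above:

# Sketch: request half a GPU per trial so two trials can share one device.
trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
    fail_fast="raise",
)
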
1 reaction
richardliaw commented, Aug 31, 2021

Seems like the error is:

(pid=1376)     output = fn()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
(pid=1376)     trainable(config, **fn_kwargs)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
(pid=1376)     local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1331, in train
(pid=1376)     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1426, in _maybe_log_save_evaluate
(pid=1376)     metrics = self.evaluate()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2031, in evaluate
(pid=1376)     metric_key_prefix=metric_key_prefix,
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2260, in evaluation_loop
(pid=1376)     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
(pid=1376)   File "<ipython-input-38-b8a033e8f995>", line 5, in compute_metrics
(pid=1376) NameError: name 'metric' is not defined

As a tip, you could also consider passing hyperparameter_search(..., fail_fast="raise") to surface errors more clearly.
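
For reference, the NameError suggests compute_metrics refers to a metric object that is never created inside the Ray worker process. A minimal self-contained sketch, assuming an accuracy metric loaded from the datasets library (the metric name is an assumption, not from the original issue):

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    # Load the metric inside the function so it also exists in every Ray worker.
    metric = load_metric("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)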
