
Cannot run grid search using Trainer API and Ray Tune


Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): 2.6.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help

@richardliaw, @amogkam

@sgugger

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Hi, I am trying to run a grid search on my RoBERTa model.

Steps to reproduce the behavior:

from transformers import Trainer, TrainingArguments

# Candidate hyperparameter values to search over
hyperParameters = {
    "per_gpu_batch_size": [32],
    "learning_rate": [2e-5],
    "num_epochs": [2, 3]
}

def my_hp_space_ray(trial):
    from ray import tune

    return {
        "learning_rate": tune.choice(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.choice(hyperParameters.get("num_epochs")),
    }

training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",  # can be "epoch" or "steps"
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)

trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_val,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
)

2021-08-30 21:07:06,743 ERROR trial_runner.py:773 -- Trial _objective_2e533_00001: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1182, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f67395b5ed0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
    trainable(config, **fn_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1762, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1794, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1184, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 529, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 414, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 344, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 257, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 15.90 GiB total capacity; 13.14 GiB already allocated; 312.75 MiB free; 13.29 GiB reserved in total by PyTorch)
Result for _objective_2e533_00001: {}
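
A side note alongside the OOM error: tune.choice samples each value at random, so the hp_space above is a random search over the listed values rather than an exhaustive grid. If a true grid search is the goal, a minimal sketch using Ray Tune's tune.grid_search (reusing the same hyperParameters dict; my_hp_space_grid is a hypothetical name, and as far as I understand, n_trials is forwarded to Ray as num_samples, which repeats the full grid rather than capping it):

def my_hp_space_grid(trial):
    from ray import tune

    # grid_search enumerates every listed value instead of sampling at random,
    # so Ray runs one trial per combination in the grid.
    return {
        "learning_rate": tune.grid_search(hyperParameters.get("learning_rate")),
        "num_train_epochs": tune.grid_search(hyperParameters.get("num_epochs")),
    }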

Expected behavior

Hi, I would like to hyperparameter-tune my RoBERTa model using Ray Tune and the Trainer API. Is there a way to avoid running out of memory, even if it takes longer to finish? Or is there some other type of parameter tuning I should use instead?

I spent the whole day trying to figure it out, so any help would be hugely appreciated.
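
One hedged way to trade runtime for memory with the Trainer API is to shrink the per-device batch size and keep the effective batch size via gradient accumulation, optionally with mixed precision. A minimal sketch, reusing the TrainingArguments from the reproduction above:

from transformers import TrainingArguments

# Sketch: smaller micro-batches plus gradient accumulation keep the
# effective train batch size at 32 while lowering peak GPU memory per step.
training_args = TrainingArguments(
    "test",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # 8 * 4 = effective train batch of 32
    fp16=True,                      # mixed precision, if the GPU supports it
    evaluation_strategy="epoch",
    weight_decay=0.01,
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="wandb",
)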

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

1 reaction
mosh98 commented, Aug 31, 2021

@Yard1 @richardliaw updating transformers did help. Huge thanks!

Although I have one last question: is there a way to specify using TPUs instead of GPUs in the API?

At the moment I have it like this:

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=1,
    hp_space=my_hp_space_ray,
    resources_per_trial={"cpu": 1, "gpu": 1},
    fail_fast="raise",
)
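
For context, resources_per_trial tells Ray how much of the machine each trial reserves, and Ray Tune also accepts fractional GPU values, which lets several trials share one device (at the cost of higher memory pressure per GPU). A minimal sketch, assuming the same my_hp_space_ray as above:

# Sketch: request half a GPU per trial so two trials can share one device.
trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=2,
    hp_space=my_hp_space_ray,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
    fail_fast="raise",
)
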
1 reaction
richardliaw commented, Aug 31, 2021

Seems like the error is:

(pid=1376)     output = fn()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py", line 344, in inner
(pid=1376)     trainable(config, **fn_kwargs)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/integrations.py", line 162, in _objective
(pid=1376)     local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1331, in train
(pid=1376)     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1426, in _maybe_log_save_evaluate
(pid=1376)     metrics = self.evaluate()
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2031, in evaluate
(pid=1376)     metric_key_prefix=metric_key_prefix,
(pid=1376)   File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2260, in evaluation_loop
(pid=1376)     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
(pid=1376)   File "<ipython-input-38-b8a033e8f995>", line 5, in compute_metrics
(pid=1376) NameError: name 'metric' is not defined

As a tip, you could also consider passing hyperparameter_search(..., fail_fast="raise") to surface errors more clearly.
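
For reference, the NameError suggests compute_metrics refers to a metric object that is never created inside the Ray worker process. A minimal self-contained sketch, assuming an accuracy metric loaded from the datasets library (the metric name is an assumption, not from the original issue):

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    # Load the metric inside the function so it also exists in every Ray worker.
    metric = load_metric("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)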
