
Version 4.12.5 doesn't work on sagemaker


Environment info

  • transformers version: 4.12.5
  • Platform: sagemaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.8.1 GPU
  • Using GPU in script?: yes, using sagemaker instance
  • Using distributed or parallel set-up in script?: no

Information

I have code that I use to train on SageMaker. Previously this code worked fine, but today I initiated a new training run with new data and no changes to the code. The training run on SageMaker fails when it reaches trainer.train() (using transformers.Trainer). Code below:

    logging.info(f"Loading Tokenizer")
    tokenizer = T5Tokenizer.from_pretrained(args.model_base

    # Generate tokenized dataset
    logging.info(f"Tokenizing dataset")
    train_dataset = ParaphraseDataset(
        tokenizer=tokenizer,
        file_path=training_file,
    )
    eval_dataset = ParaphraseDataset(
        tokenizer=tokenizer,
        file_path=testing_file,
    )

    # Initialize model
    logging.info(f"Initializing model from {args.model_base}.")
    model = T5ForConditionalGeneration.from_pretrained(args.model_base)
    
    # Training arguments
    training_args = TrainingArguments(
        **_get_training_args(args),
        report_to="wandb"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollator(tokenizer),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    )
    logging.info(f"Beginning model training using the following params:\n{training_args}.")
    output = trainer.train()

The error occurs during trainer.train():

  0% 0/7680 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "sm_train_deploy.py", line 271, in <module>
    train(parser.parse_args())
  File "sm_train_deploy.py", line 210, in train
    output = trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1849, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1881, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
    self._forward_pre_hooks.values()):
RuntimeError: OrderedDict mutated during iteration
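For context, this RuntimeError is Python's built-in protection against modifying an OrderedDict while something is iterating over it; in the traceback above, the dict being iterated is the module's _forward_pre_hooks. A minimal, self-contained illustration of the mechanism (not the SageMaker code path):

    # Mutating an OrderedDict while iterating its .values() view raises the
    # same RuntimeError seen in the traceback above.
    from collections import OrderedDict

    hooks = OrderedDict(first=lambda: None)

    for hook in hooks.values():
        # Adding an entry mid-iteration mutates the dict the loop is walking.
        hooks["second"] = lambda: None
    # -> RuntimeError: OrderedDict mutated during iteration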

I’m launching this on SageMaker using the following estimator:

# Instantiate estimator with parameters
estimator = PyTorch(
    entry_point="sm_train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,  # this script only supports distributed training for GPU instances.
    instance_type="ml.p3.2xlarge",
    output_path=output_path,
    module_dir=output_path,
    code_location=output_path,
#     checkpoint_s3_uri=output_path,
    hyperparameters={
        **params
    },
    disable_profiler=True # disable debugger
)

estimator.fit({"training": training_file_s3, "testing": test_file_s3})

Expected behavior

This behavior is not seen when I run the code locally on my own GPU. The training completes as expected.

If I revert to transformers 4.12.2 (the last version where this code worked on SageMaker), the code runs as expected on SageMaker without any other changes.
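(Not part of the original report: one way to control which transformers version the training container uses is to ship a requirements.txt inside the source_dir passed to the estimator, here code/; the SageMaker PyTorch container installs it before running the entry point. A minimal sketch, assuming you want to stay on the last known-good release until a fix ships:)

    # code/requirements.txt -- installed by the SageMaker PyTorch container
    # before sm_train_deploy.py runs
    transformers==4.12.2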

Sorry if I missed anything important; this is my first time submitting a bug report like this.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Dec 6, 2021

This is the same error as with DeepSpeed; I’m guessing they are using a similar thing with the hooks not being allowed to be mutated. This is fixed on master, so we should just hurry up and do a release, and clearly document that this patch release won’t work on SageMaker.
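(Not part of the thread, but a hedged sketch of the hook-mutation pattern described above. On PyTorch 1.8.x, Module._call_impl iterates self._forward_pre_hooks directly, so a pre-hook that registers another hook on the same module during the forward pass mutates that OrderedDict mid-iteration and raises exactly the error in the traceback. The hook below is hypothetical and only illustrates the mechanism:)

    import torch
    import torch.nn as nn

    layer = nn.Linear(4, 4)

    def self_registering_hook(module, inputs):
        # Registering another pre-hook here mutates module._forward_pre_hooks
        # while Module._call_impl is still iterating it.
        module.register_forward_pre_hook(lambda m, i: None)

    layer.register_forward_pre_hook(self_registering_hook)

    layer(torch.randn(1, 4))
    # On PyTorch 1.8.x: RuntimeError: OrderedDict mutated during iteration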

1 reaction
setu4993 commented, Dec 10, 2021

@sgugger @philschmid : Confirming 4.13.0 fixed this for us.
