Version 4.12.5 doesn't work on SageMaker
Environment info
- `transformers` version: 4.12.5
- Platform: SageMaker
- Python version: 3.6
- PyTorch version (GPU?): 1.8.1 (GPU)
- Using GPU in script?: yes, using a SageMaker instance
- Using distributed or parallel set-up in script?: no
Information
I have code that I use to train on SageMaker. Previously this code has worked just fine, but today I initiated a new training run with new data and no changes to the code. The training run on SageMaker now fails when it hits `trainer.train()` using `transformers.Trainer`. Code below:
logging.info(f"Loading Tokenizer")
tokenizer = T5Tokenizer.from_pretrained(args.model_base
# Generate tokenized dataset
logging.info(f"Tokenizing dataset")
train_dataset = ParaphraseDataset(
tokenizer=tokenizer,
file_path=training_file,
)
eval_dataset = ParaphraseDataset(
tokenizer=tokenizer,
file_path=testing_file,
)
# Initialize model
logging.info(f"Initializing model from {args.model_base}.")
model = T5ForConditionalGeneration.from_pretrained(args.model_base)
# Training arguments
training_args = TrainingArguments(
**_get_training_args(args),
report_to="wandb"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=DataCollator(tokenizer),
callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
)
logging.info(f"Beginning model training using the following params:\n{training_args}.")
output = trainer.train()
The failure occurs during `trainer.train()`, with the following traceback:
```
  0% 0/7680 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "sm_train_deploy.py", line 271, in <module>
    train(parser.parse_args())
  File "sm_train_deploy.py", line 210, in train
    output = trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1849, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1881, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
    self._forward_pre_hooks.values()):
RuntimeError: OrderedDict mutated during iteration
```
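For context on the exception itself: `RuntimeError: OrderedDict mutated during iteration` is what Python raises whenever an `OrderedDict` is modified while one of its view iterators is still live. A minimal, self-contained sketch (plain Python, not the Trainer code itself):

```python
from collections import OrderedDict

hooks = OrderedDict([(1, "hook_a"), (2, "hook_b")])
try:
    for hook in hooks.values():
        hooks.pop(1)  # removing an entry mid-iteration mutates the dict
except RuntimeError as err:
    print(err)  # -> OrderedDict mutated during iteration
```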
I'm launching this on SageMaker with the following estimator code:
```python
# Instantiate estimator with parameters
estimator = PyTorch(
    entry_point="sm_train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,  # this script only supports distributed training for GPU instances
    instance_type="ml.p3.2xlarge",
    output_path=output_path,
    module_dir=output_path,
    code_location=output_path,
    # checkpoint_s3_uri=output_path,
    hyperparameters={**params},
    disable_profiler=True,  # disable debugger
)

estimator.fit({"training": training_file_s3, "testing": test_file_s3})
```
Expected behavior
This behavior is not seen when I run the code locally on my own GPU; the training completes as expected.
If I revert to transformers 4.12.2 (the last version of transformers where this code worked on SageMaker), the code runs as expected on SageMaker without changing anything else.
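For anyone else hitting this before a fixed release lands, one way to pin the working release for a SageMaker job (a sketch assuming the standard mechanism where the framework estimators install a `requirements.txt` placed in `source_dir`, here the `code/` directory) would be:

```
# code/requirements.txt -- hypothetical pin, not from the original report:
# keeps the training container on the last release this script worked with
transformers==4.12.2
```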
Sorry if I missed anything important; this is my first time submitting a bug report like this.
This is the same error as with DeepSpeed; I'm guessing they are doing something similar where the hooks are not allowed to be mutated during iteration. This is fixed on master, so we should hurry up and do a release, and clearly document that this patch release won't work on SageMaker.
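For context, a minimal sketch of the failure mode being described (an illustration of the mechanism only, not the actual SageMaker or DeepSpeed hook code), on PyTorch 1.8.x where `nn.Module._call_impl` iterates `self._forward_pre_hooks.values()` directly; later releases may iterate hooks differently:

```python
import torch
import torch.nn as nn

module = nn.Linear(4, 4)

def self_removing_hook(mod, inputs):
    # Removing the handle from inside the hook deletes an entry from
    # mod._forward_pre_hooks while _call_impl is iterating its .values()
    handle.remove()

handle = module.register_forward_pre_hook(self_removing_hook)
module(torch.randn(1, 4))  # RuntimeError: OrderedDict mutated during iteration
```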
@sgugger @philschmid: Confirming 4.13.0 fixed this for us.