Version 4.12.5 doesn't work on SageMaker
Environment info
- `transformers` version: 4.12.5
- Platform: SageMaker
- Python version: 3.6
- PyTorch version (GPU?): 1.8.1 (GPU)
- Using GPU in script?: yes, using a SageMaker instance
- Using distributed or parallel set-up in script?: no
Information
I have code that I use to train on SageMaker. Previously this code has worked just fine, but today I initiated a new training run with new data and no changes to the code. The training run on SageMaker now fails when it hits `trainer.train()` using `transformers.Trainer`. Code below:
logging.info(f"Loading Tokenizer")
tokenizer = T5Tokenizer.from_pretrained(args.model_base
# Generate tokenized dataset
logging.info(f"Tokenizing dataset")
train_dataset = ParaphraseDataset(
tokenizer=tokenizer,
file_path=training_file,
)
eval_dataset = ParaphraseDataset(
tokenizer=tokenizer,
file_path=testing_file,
)
# Initialize model
logging.info(f"Initializing model from {args.model_base}.")
model = T5ForConditionalGeneration.from_pretrained(args.model_base)
# Training arguments
training_args = TrainingArguments(
**_get_training_args(args),
report_to="wandb"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=DataCollator(tokenizer),
callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
)
logging.info(f"Beginning model training using the following params:\n{training_args}.")
output = trainer.train()
The failure occurs during `trainer.train()`, with the following traceback:
```
  0% 0/7680 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "sm_train_deploy.py", line 271, in <module>
    train(parser.parse_args())
  File "sm_train_deploy.py", line 210, in train
    output = trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1849, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1881, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
    self._forward_pre_hooks.values()):
RuntimeError: OrderedDict mutated during iteration
```
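For context on the exception itself: `RuntimeError: OrderedDict mutated during iteration` is what Python raises whenever an `OrderedDict` is modified while one of its view iterators is still live. A minimal, self-contained sketch (plain Python, not the Trainer code itself):

```python
from collections import OrderedDict

hooks = OrderedDict([(1, "hook_a"), (2, "hook_b")])
try:
    for hook in hooks.values():
        hooks.pop(1)  # removing an entry mid-iteration mutates the dict
except RuntimeError as err:
    print(err)  # -> OrderedDict mutated during iteration
```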
I'm launching this on SageMaker with the following estimator code:
```python
# Instantiate estimator with parameters
estimator = PyTorch(
    entry_point="sm_train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,  # this script only supports distributed training for GPU instances
    instance_type="ml.p3.2xlarge",
    output_path=output_path,
    module_dir=output_path,
    code_location=output_path,
    # checkpoint_s3_uri=output_path,
    hyperparameters={**params},
    disable_profiler=True,  # disable debugger
)

estimator.fit({"training": training_file_s3, "testing": test_file_s3})
```
Expected behavior
This behavior is not seen when I run the code locally on my own GPU; the training completes as expected.
If I revert to transformers 4.12.2 (the last version of transformers where this code worked on SageMaker), the code runs as expected on SageMaker without changing anything else.
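For anyone else hitting this before a fixed release lands, one way to pin the working release for a SageMaker job (a sketch assuming the standard mechanism where the framework estimators install a `requirements.txt` placed in `source_dir`, here the `code/` directory) would be:

```
# code/requirements.txt -- hypothetical pin, not from the original report:
# keeps the training container on the last release this script worked with
transformers==4.12.2
```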
Sorry if I missed anything important; this is my first time submitting a bug report like this.
This is the same error as with DeepSpeed; I'm guessing they are doing something similar where the hooks are not allowed to be mutated during iteration. This is fixed on master, so we should hurry up and do a release, and clearly document that this patch release won't work on SageMaker.
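For context, a minimal sketch of the failure mode being described (an illustration of the mechanism only, not the actual SageMaker or DeepSpeed hook code), on PyTorch 1.8.x where `nn.Module._call_impl` iterates `self._forward_pre_hooks.values()` directly; later releases may iterate hooks differently:

```python
import torch
import torch.nn as nn

module = nn.Linear(4, 4)

def self_removing_hook(mod, inputs):
    # Removing the handle from inside the hook deletes an entry from
    # mod._forward_pre_hooks while _call_impl is iterating its .values()
    handle.remove()

handle = module.register_forward_pre_hook(self_removing_hook)
module(torch.randn(1, 4))  # RuntimeError: OrderedDict mutated during iteration
```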
@sgugger @philschmid: Confirming 4.13.0 fixed this for us.