zero_optimization.cpu_offload: true leads to a silent crash

See original GitHub issue

I'm experimenting with various zero_optimization config options, and I noticed that when I set zero_optimization.cpu_offload to true, the application exits without crashing and without doing any training.

{
    "train_batch_size": 20,
    "steps_per_print": 2000,

    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    
   "zero_optimization": {
       "stage": 0,
       "allgather_partitions": true,
       "allgather_bucket_size": 500000000,
       "overlap_comm": true,
       "reduce_scatter": true,
       "reduce_bucket_size": 500000000,
       "contiguous_gradients": false,
       "cpu_offload": false
   },

   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   },
   "wall_clock_breakdown": false
}
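
To compare the two settings side by side, it can help to generate both config variants from a single base file. The following is a minimal sketch under assumptions: `with_cpu_offload` is a hypothetical helper, and `base_config` is a trimmed stand-in for the config above, keeping only the keys relevant to this report.

```python
import copy
import json


def with_cpu_offload(config: dict, enabled: bool) -> dict:
    """Return a copy of a DeepSpeed config dict with cpu_offload set.

    Hypothetical helper for illustration; it only edits the dict and
    does not call into DeepSpeed.
    """
    variant = copy.deepcopy(config)
    variant.setdefault("zero_optimization", {})["cpu_offload"] = enabled
    return variant


# Trimmed stand-in for the config in this report (assumption: the other
# keys are left out only to keep the example short).
base_config = {
    "train_batch_size": 20,
    "zero_optimization": {"stage": 0, "cpu_offload": False},
}

offload_on = with_cpu_offload(base_config, True)
offload_off = with_cpu_offload(base_config, False)

# Print only the section that differs between the two variants.
print(json.dumps(offload_on["zero_optimization"], sort_keys=True))
print(json.dumps(offload_off["zero_optimization"], sort_keys=True))
```

Each variant can then be written to its own file with `json.dump` and passed via `--deepspeed_config`, as in the commands in this report.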

With zero_optimization.cpu_offload set to true, this config leads to a silent exit without doing any training:

Full log

export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=…/…/src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 19:42:37,871] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 19:42:37,897] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 19:42:38,631] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 19:42:38,631] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 19:42:38,631] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 19:42:38,631] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 19:42:38,631] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,139 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,141 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}
[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,415 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,417 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}
[INFO|tokenization_utils_base.py:1718] 2020-12-18 19:42:41,418 >> Model name 'sshleifer/distill-mbart-en-ro-12-4' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/distill-mbart-en-ro-12-4' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/62ed1799c9b9a3c199222637281d38762ae87e00165a2613e31c93b3673f08b8.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/9423d956f3dd4d8fd97112a8d3f87081f6256ce54ccfecd27938c48e294b8aa8.72fa8565f9c8b5dc27e7ac070020aec80359d9da2e5628b3f313f41bf44d322c
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/f5629ec54e86b66e2e9879777df84ce24ede4c93495e6ce9f9161011260c5344.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 19:42:43,989 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
[INFO|modeling_utils.py:1024] 2020-12-18 19:42:44,314 >> loading weights file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d2a7ade93d629fb16e06233407ab8aa0e70af5532c66c3b38ce2ff905743bf78.fa8ebf3af9c5dec8982ce624e74de87e85c9a944e776b79b8e8bd65126ed2073
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1045] 2020-12-18 19:43:06,939 >> load time=0.8602
[2020-12-18 19:43:07,280] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,280] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[INFO|modeling_utils.py:1145] 2020-12-18 19:43:07,318 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.
[WARNING|modeling_utils.py:1147] 2020-12-18 19:43:07,318 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 19:43:07,512] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,512] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 19:43:11,225] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 19:43:11,229] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,258] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,262] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 19:43:13,262] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
amsgrad: False
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
weight_decay: 3e-07
)
[2020-12-18 19:43:13,262] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-18 19:43:13,262] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : False
group 0 param 0 = 1048576
group 0 param 0 = 1048576
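
Since the log above stops without a traceback, one thing that may help narrow this down is checking how the process actually ended. The following is a minimal sketch, not part of the original report: `describe_exit` and `run_and_describe` are hypothetical helper names, and it only uses Python's standard `subprocess` and `signal` modules. Note that a launcher which spawns its own worker processes may not propagate a worker's signal to its own exit status, so this is most informative when pointed at a single worker command.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "exited normally (code 0)"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def run_and_describe(cmd: list) -> str:
    """Run cmd to completion and describe how it ended."""
    completed = subprocess.run(cmd)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process; replace with the real worker command.
    print(run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

A result such as "killed by SIGSEGV" would point at a native crash rather than a clean exit, while "exited normally (code 0)" would suggest the script returned early on its own.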

If I flip zero_optimization.cpu_offload to false everything works:

Full log

export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 20:29:55,608] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 20:29:55,634] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 20:29:56,371] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 20:29:56,372] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 20:29:56,372] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 20:29:56,372] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 20:29:56,372] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 20:29:58,890 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:58,892 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}
[INFO|configuration_utils.py:431] 2020-12-18 20:29:59,191 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:59,192 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}
[INFO|tokenization_utils_base.py:1718] 2020-12-18 20:29:59,192 >> Model name 'sshleifer/tiny-mbart' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/tiny-mbart' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/13a2c62c1dabc5357bc38b0694f5829f3db0708d51f1a0f07734f62cc0a825a0.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/33fa7894ab257a74cede3060dca6d2fc609918785e80160f6c057723ece47292.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/e9c580e6446c42ed20fb148206f2a9bd75a825278ffa029df063682077d45bb6.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 20:30:01,779 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1024] 2020-12-18 20:30:02,107 >> loading weights file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d6eec704737db03a21a794f08b07fcbb71d855562a992cfb1be6193b37a7ff68.61ce63751e40ea882dd1a22b6c9303b954b81ec69d631ab0541750fd856720be
[INFO|modeling_utils.py:1045] 2020-12-18 20:30:02,150 >> load time=0.0017
[INFO|modeling_utils.py:1145] 2020-12-18 20:30:02,152 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.
[WARNING|modeling_utils.py:1147] 2020-12-18 20:30:02,152 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 20:30:02,195] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,195] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:02,339] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,339] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:05,642] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,645] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,674] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 20:30:05,674] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
weight_decay: 3e-07
)
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
step: 1
weight_decay: 3e-07
)
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
step: 1
weight_decay: 3e-07
)
[2020-12-18 20:30:05,680] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 4294967296, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:1'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:1')}}, 'param_groups': [{'lr': 3e-05, 'bias_correction': True, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'step': 1, 'params': [0]}]}, 'fp32_groups_flat': [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
-1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
-1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
-1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
-2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
-2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
-6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
-2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
-2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
-2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
-1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
-1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
-5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
-1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
-5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
-1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
-3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
-3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
-3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00], device=‘cuda:1’,
requires_grad=True)], ‘clip_grad’: 0.0}
FusedAdam (
Parameter Group 0
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
step: 1
weight_decay: 3e-07
)
<deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fee4132d5e0>
[2020-12-18 20:30:05,681] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {‘dynamic_loss_scale’: True, ‘cur_scale’: 4294967296, ‘cur_iter’: 0, ‘last_overflow_iter’: -1, ‘scale_factor’: 2, ‘scale_window’: 1000, ‘optimizer_state_dict’: {‘state’: {0: {‘exp_avg’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device=‘cuda:0’), ‘exp_avg_sq’: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device=‘cuda:0’)}}, ‘param_groups’: [{‘lr’: 3e-05, ‘bias_correction’: True, ‘betas’: [0.8, 0.999], ‘eps’: 1e-08, ‘weight_decay’: 3e-07, ‘step’: 1, ‘params’: [0]}]}, ‘fp32_groups_flat’: [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
-1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
-1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
-1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
-2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
-2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
-6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
-2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
-2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
-2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
-1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
-1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
-5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
-1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
-5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
-1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
-3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
-3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
-3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
1.0000e+00,  0.0000e+00,  0.0000e+00], device=‘cuda:0’,
requires_grad=True)], ‘clip_grad’: 0.0}
[2020-12-18 20:30:05,681] [INFO] [engine.py:457:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640>
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]
[2020-12-18 20:30:05,681] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f303160db50>
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   allreduce_always_fp32 … False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_enabled … False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_params … False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   disable_allgather … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dump_state … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dynamic_loss_scale_args … {‘init_scale’: 4294967296, ‘scale_window’: 1000, ‘delayed_shift’: 2, ‘min_scale’: 1}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   fp16_enabled … True
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   global_rank … 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_accumulation_steps … 1
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_clipping … 0.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_predivide_factor … 1.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   initial_dynamic_scale … 4294967296
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   loss_scale … 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   memory_breakdown … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_legacy_fusion … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_name … adam
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_params … {‘lr’: 3e-05, ‘betas’: [0.8, 0.999], ‘eps’: 1e-08, ‘weight_decay’: 3e-07, ‘adam_w_mode’: True}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pipeline … {‘stages’: ‘auto’, ‘partition’: ‘best’, ‘seed_layers’: False, ‘activation_checkpoint_interval’: 0}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_enabled … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_params … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   prescale_gradients … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_name … WarmupLR
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_params … {‘warmup_min_lr’: 0, ‘warmup_max_lr’: 3e-05, ‘warmup_num_steps’: 500}
2020-12-18 20:30:05 | INFO | main | *** Train ***
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_attention … None
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_gradients_enabled … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   steps_per_print … 2000
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_enabled … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_job_name … DeepSpeedJobName
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_output_path …
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_batch_size … 20
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  10
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to Seq2SeqTrainer, --lr_scheduler arg is ignored.
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   wall_clock_breakdown … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   world_size … 2
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_config … {
“allgather_bucket_size”: 500000000,
“allgather_partitions”: true,
“contiguous_gradients”: true,
“cpu_offload”: false,
“elastic_checkpoint”: true,
“load_from_fp32_weights”: true,
“overlap_comm”: false,
“reduce_bucket_size”: 500000000,
“reduce_scatter”: false,
“stage”: 0
}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_enabled … False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_optimization_stage … 0
[2020-12-18 20:30:05,682] [INFO] [config.py:650:print]   json = {
“fp16”:{
“enabled”:true,
“hysteresis”:2,
“loss_scale”:0,
“loss_scale_window”:1000,
“min_loss_scale”:1
},
“optimizer”:{
“params”:{
“adam_w_mode”:true,
“betas”:[
0.8,
0.999
],
“eps”:1e-08,
“lr”:3e-05,
“weight_decay”:3e-07
},
“type”:“Adam”
},
“scheduler”:{
“params”:{
“warmup_max_lr”:3e-05,
“warmup_min_lr”:0,
“warmup_num_steps”:500
},
“type”:“WarmupLR”
},
“steps_per_print”:2000,
“train_batch_size”:20,
“wall_clock_breakdown”:false,
“zero_optimization”:{
“allgather_bucket_size”:500000000,
“allgather_partitions”:true,
“contiguous_gradients”:true,
“cpu_offload”:false,
“overlap_comm”:false,
“reduce_bucket_size”:500000000,
“reduce_scatter”:false,
“stage”:0
}
}
FusedAdam (
Parameter Group 0
betas: [0.8, 0.999]
bias_correction: True
eps: 1e-08
lr: 3e-05
step: 1
weight_decay: 3e-07
)
<deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f303160d640>
2020-12-18 20:30:05 | INFO | main | *** Train ***
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to Seq2SeqTrainer, --lr_scheduler arg is ignored.
[INFO|trainer.py:723] 2020-12-18 20:30:05,688 >> ***** Running training *****
[INFO|trainer.py:724] 2020-12-18 20:30:05,688 >>   Num examples = 500
[INFO|trainer.py:725] 2020-12-18 20:30:05,688 >>   Num Epochs = 1
[INFO|trainer.py:726] 2020-12-18 20:30:05,688 >>   Instantaneous batch size per device = 20
[INFO|trainer.py:727] 2020-12-18 20:30:05,688 >>   Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:728] 2020-12-18 20:30:05,688 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:729] 2020-12-18 20:30:05,688 >>   Total optimization steps = 13
{‘loss’: inf, ‘learning_rate’: 0.0, ‘epoch’: 0.07692307692307693}
92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 12/13 [00:02<00:00,  5.65it/s][INFO|trainer.py:883] 2020-12-18 20:30:08,588 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{‘epoch’: 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  5.95it/s]
[INFO|trainer.py:1247] 2020-12-18 20:30:08,589 >> Saving model checkpoint to output_dir
[INFO|trainer.py:1251] 2020-12-18 20:30:08,589 >> Trainer.model is not a PreTrainedModel, only saving its state dict.
2020-12-18 20:30:08 | INFO | main | ***** train metrics *****
2020-12-18 20:30:08 | INFO | main |   train_samples_per_second = 172.096
2020-12-18 20:30:08 | INFO | main |   train_runtime = 2.9054
2020-12-18 20:30:08 | INFO | main |   train_n_ojbs = 500

I haven’t provided reproduction steps yet because the integration with HF transformers isn’t quite finished, but it should be ready soon. I was hoping you could tell from the logs what went wrong. If they aren’t helpful, I will update this issue with reproduction details once I have a transformers branch you can experiment with.

Issue Analytics

State:
Created 3 years ago
Comments:27 (25 by maintainers)

Top GitHub Comments

3 reactions

tjruwase commented, Dec 21, 2020

I think I see the issue, based on your stack trace.

File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 1182, in training_step
    loss.backward()

Can you please call model.backward() instead of loss.backward()? I assume that model is the return value of deepspeed.initialize().
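
A minimal sketch of the suggested change, assuming `model_engine` is the engine object that `deepspeed.initialize()` returns as the first element of its result. `StubEngine` and `training_step` are hypothetical names used only so the snippet runs without DeepSpeed or a GPU; with a real engine, the relevant part is the pair of calls `model_engine.backward(loss)` and `model_engine.step()`.

```python
class StubEngine:
    """Hypothetical stand-in for a DeepSpeed engine; it only records calls."""

    def __init__(self):
        self.calls = []

    def __call__(self, batch):
        # A real engine would run the wrapped model's forward pass
        # and return a loss tensor; here we return a plain float.
        self.calls.append("forward")
        return sum(batch) / len(batch)

    def backward(self, loss):
        # A real engine performs the backward pass itself (for example,
        # applying fp16 loss scaling) instead of the caller doing so.
        self.calls.append("backward")

    def step(self):
        # A real engine runs the optimizer and LR scheduler step here.
        self.calls.append("step")


def training_step(model_engine, batch):
    """One training step that routes backward/step through the engine."""
    loss = model_engine(batch)
    model_engine.backward(loss)  # instead of loss.backward()
    model_engine.step()          # instead of optimizer.step()
    return loss


engine = StubEngine()
loss = training_step(engine, [1.0, 2.0, 3.0])
print(engine.calls)
```

Routing the backward pass through the engine lets DeepSpeed apply its own loss scaling and gradient handling, which a direct `loss.backward()` call bypasses.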

2 reactions

stas00 commented, Dec 24, 2020

We very much appreciate you offering to support our DS integration process too, @g-karthik!
