[BUG] NVMe Offload, error while fetching submodule parameters.
Describe the bug I want to test ZeRO-Infinity NVMe offload for a large model, but the error below occurs and I don't know why. With ZeRO-Infinity (offloading optimizer states and parameters to NVMe), the failure happens while fetching sub-module parameters: the expected parameter id differs from the id in the fetch queue.
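As far as I can tell, the ZeRO-3 parameter coordinator records the order in which sub-module parameters are fetched during the traced forward pass and expects later fetches to replay that order. A toy sketch of that kind of check (illustrative only, not DeepSpeed's actual code; all names here are made up):
from collections import deque

def fetch_next_param(trace_queue: deque, param_id: int, step_id: int) -> None:
    # trace_queue holds the parameter ids recorded during the traced forward pass,
    # in the order they were fetched; later passes are expected to replay it.
    expected_id = trace_queue.popleft()
    if expected_id != param_id:
        raise RuntimeError(
            f"tracing error at step {step_id}: expected parameter {expected_id} "
            f"next in the fetch queue but got {param_id}"
        )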
Error message
38636.36 | swap_out_gradient: 25736.14 | swap_out_param: 75127.72 | swap_in_gradient: 13291.40 | async_swap_gradient_wait: 25195.09
[2022-03-08 08:58:08,735] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_swap_in_state: 77148.05 | optimizer_swap_out_state: 75127.83 | optimizer_step: 156991.31
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] step=1, skipped=0, lr=[4.6874999999999995e-08, 4.6874999999999995e-08], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 10651.79 | backward_microstep: 24895.01 | backward_inner_microstep: 24778.36 | backward_allreduce_microstep: 116.51 | step_microstep: 160185.78
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 10651.79 | backward: 24895.01 | backward_inner: 24778.38 | backward_allreduce: 116.50 | step: 160185.78
iteration 1/ 320000 | elapsed time per iteration (ms): 195766.2 | learning rate: 4.687E-08 | lm loss: 1.154783E+01 | loss scale: 1024.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
after 1 iterations memory (MB) | allocated: 20.43310546875 | max allocated: 3943.98779296875 | reserved: 6786.0 | max reserved: 6786.0
time (ms) | forward: 10684.03 | backward: 24895.16 | backward-backward: 24895.11 | backward-allreduce: 0.00 | optimizer: 160185.98 | batch generator: 4.91
Effective Tera Flops per GPU: 2.23 and total parameters 6.654 B
Traceback (most recent call last):
File "pretrain_gpt2.py", line 134, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 110, in pretrain
train_data_iterator, valid_data_iterator)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 481, in train
lr_scheduler)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 330, in train_step
loss, loss_reduced = forward_step_func(data_iterator, model)
File "pretrain_gpt2.py", line 100, in forward_step
losses = model(tokens, position_ids, attention_mask, labels=labels)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1597, in forward
loss = self.module(*inputs, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/gpt2_model.py", line 81, in forward
get_key_value=get_key_value)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py", line 333, in forward
get_key_value=get_key_value)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 988, in forward
attention_mask)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 965, in _checkpointed_forward
hidden_states, attention_mask)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 748, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
outputs = run_function(*inputs_cuda)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 955, in custom_forward
x_ = layer(x_, inputs[1])
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
result = hook(self, input)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1412, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1529, in pre_sub_module_forward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 359, in fetch_sub_module
f"tracing error at step {self.__step_id}: "
RuntimeError: tracing error at step 19: expected the next 1 parameters in the parameter fetch queue to be ({'id': 19, 'status': 'AVAILABLE', 'numel': 4096, 'ds_numel': 4096, 'shape': (4096,), 'ds_shape': (4096,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {20}},) but got ({'id': 17, 'status': 'AVAILABLE', 'numel': 12288, 'ds_numel': 12288, 'shape': (12288,), 'ds_shape': (12288,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()},).
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24210
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24211
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24212
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24213
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24214
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24215
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24216
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24217
[2022-03-08 08:58:13,818] [ERROR] [launch.py:184:sigkill_handler] ['/usr/bin/python3', '-u', 'pretrain_gpt2.py', '--local_rank=7', '--model-parallel-size', '1', '--num-layers', '32', '--hidden-size', '4096', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--batch-size', '8', '--train-iters', '320000', '--lr-decay-iters', '320000', '--save', 'checkpoints/gpt2_345m_ds', '--load', 'checkpoints/gpt2_345m_ds', '--data-path', '/data/Megatron-LM/data/indexed_datasets/megatron', '--vocab-file', '/data/Megatron-LM/data/gpt2-vocab.json', '--merge-file', '/data/Megatron-LM/data/gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '1.5e-4', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--save-interval', '10000', '--eval-interval', '2000', '--eval-iters', '10', '--fp16', '--scattered-embeddings', '--split-transformers', '--deepspeed', '--deepspeed_config', '/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/examples/infinity2.json', '--zero-stage', '3', '--zero-reduce-bucket-size', '5000000', '--zero-allgather-bucket-size', '50000000', '--zero-contigious-gradients', '--zero-reduce-scatter', '--deepspeed-activation-checkpointing', '--checkpoint-num-layers', '1', '--partition-activations', '--checkpoint-in-cpu', '--synchronize-each-layer', '--contigious-checkpointing'] exits with return code = 1
Expected behavior Training is expected to proceed while offloading to NVMe.
ds_report output
System info:
- OS: Ubuntu 18.04
- GPU: 8x NVIDIA T4 (AWS g4dn.metal)
- 1 node
- Python version: 3.7.5
Launcher context: bash examples/ds_pretrain_gpt2-zero3.sh
#! /bin/bash
# Change for multinode config
MP_SIZE=1
DEBUG=1
if [[ ${DEBUG} == 1 ]]; then
MP_SIZE=1
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=8
HIDDEN_SIZE=4096
NUM_ATTN_HEADS=32
NUM_LAYERS=32
BATCHSIZE=8
else
NUM_WORKERS=${DLTS_NUM_WORKER}
NUM_GPUS_PER_WORKER=${DLTS_NUM_GPU_PER_WORKER}
HIDDEN_SIZE=8192
NUM_ATTN_HEADS=32
NUM_LAYERS=50
BATCHSIZE=4
#HIDDEN_SIZE=4096
#NUM_LAYERS=24 # 50
#BATCHSIZE=16
fi
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_PATH=${BASE_DATA_PATH}/indexed_datasets/megatron
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m_ds
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
if [[ -z $1 ]]; then
#config_json="$script_dir/ds_zero_stage_3_config.json"
# offloads to NVMe
config_json="$script_dir/infinity2.json"
else
config_json=$script_dir/`basename $1`
fi
#ZeRO Configs
stage=3
reduce_scatter=true
contigious_gradients=true
rbs=5000000
#agbs=5000000000
agbs=50000000
#Activation Checkpointing and Contigious Memory
chkp_layers=1
PA=true
PA_CPU=true
CC=true
SYNCHRONIZE=true
PROFILE=false
# TiledLinear splits, 0 is disable
TILED_LINEAR="false"
TILE_DIM=1
# Megatron Model Parallelism
LOGDIR="tboard-zero3/stage${stage}-lazyscatter-${NUM_LAYERS}l_${HIDDEN_SIZE}h_${NUM_WORKERS}n_${NUM_GPUS_PER_WORKER}g_${MP_SIZE}mp_${BATCHSIZE}b"
gpt_options=" \
--model-parallel-size ${MP_SIZE} \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length 1024 \
--max-position-embeddings 1024 \
--batch-size $BATCHSIZE \
--train-iters 320000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_PATH \
--merge-file $MERGE_PATH \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1.5e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup 0.01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 10000 \
--eval-interval 2000 \
--eval-iters 10 \
--fp16 \
--scattered-embeddings \
--split-transformers \
"
#--tensorboard-dir ${LOGDIR}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${stage} \
--zero-reduce-bucket-size ${rbs} \
--zero-allgather-bucket-size ${agbs}
"
if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-contigious-gradients"
fi
if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-reduce-scatter"
fi
chkp_opt=" \
--deepspeed-activation-checkpointing \
--checkpoint-num-layers ${chkp_layers}"
if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} --partition-activations"
fi
if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt} \
--checkpoint-in-cpu"
fi
if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--synchronize-each-layer"
fi
if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt} \
--contigious-checkpointing"
fi
if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--profile-backward"
fi
if [ "${TILED_LINEAR}" = "true" ]; then
tile_opt="${tile_opt} \
--memory-centric-tiled-linear \
--tile-factor=${TILE_DIM}"
fi
full_options="${gpt_options} ${deepspeed_options} ${chkp_opt} ${tile_opt}"
run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} pretrain_gpt2.py ${@:2} ${full_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
Configuration file: infinity2.json
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1e8,
"allgather_partitions": true,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 9000000,
"sub_group_size": 1e10,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/mnt/nvme2" ,
"buffer_count": 4,
"pin_memory": true
},
"offload_param": {
"device": "nvme",
"nvme_path": "/mnt/nvme2",
"buffer_count": 5,
"pin_memory": true
}
},
"activation_checkpointing": {
"profile": true,
"cpu_checkpointing": true,
"partition_activations": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 1024,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false,
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"single_submit": false,
"overlap_events": true,
"thread_count": 1
}
}
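For reference, a config like this is normally passed to deepspeed.initialize by the training script. A minimal sketch under that assumption (the model, optimizer, and learning rate below are placeholders, not the Megatron GPT-2 setup, and keyword names can differ slightly between DeepSpeed versions):
import torch
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Placeholder model; the real run builds the Megatron GPT-2 model instead.
model = torch.nn.Linear(4096, 4096)
# NVMe optimizer offload generally expects a CPU-capable optimizer such as DeepSpeedCPUAdam.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1.5e-4)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="infinity2.json",  # the ZeRO-Infinity config shown above
)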
Docker context I don't use Docker.
Top GitHub Comments
@lkm2835, can you please confirm that you are still seeing this issue?
I had the same issue, but managed to resolve it by rolling back to 0.5.10. I'm not sure of the exact root cause right now, but the 0.6.0 update included a lot of changes to stage 3.
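For anyone trying the same workaround, a quick sanity check of the installed version after pinning DeepSpeed (the 0.5.10 pin is just the rollback mentioned above, not a confirmed fix):
import deepspeed

# After running: pip install deepspeed==0.5.10
print(deepspeed.__version__)  # expect "0.5.10" if the rollback took effect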