[BUG] NVMe Offload, error while fetching submodule parameters.
Describe the bug I want to test ZeRO-Infinity NVMe offload for a large model, but the error below occurs and I don't know why. With ZeRO-Infinity (offloading optimizer states and parameters to NVMe), the failure happens while fetching sub-module parameters: the expected parameter id differs from the id in the fetch queue.
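As far as I can tell, the ZeRO-3 parameter coordinator records the order in which sub-module parameters are fetched during the traced forward pass and expects later fetches to replay that order. A toy sketch of that kind of check (illustrative only, not DeepSpeed's actual code; all names here are made up):
from collections import deque

def fetch_next_param(trace_queue: deque, param_id: int, step_id: int) -> None:
    # trace_queue holds the parameter ids recorded during the traced forward pass,
    # in the order they were fetched; later passes are expected to replay it.
    expected_id = trace_queue.popleft()
    if expected_id != param_id:
        raise RuntimeError(
            f"tracing error at step {step_id}: expected parameter {expected_id} "
            f"next in the fetch queue but got {param_id}"
        )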
Error message
38636.36 | swap_out_gradient: 25736.14 | swap_out_param: 75127.72 | swap_in_gradient: 13291.40 | async_swap_gradient_wait: 25195.09
[2022-03-08 08:58:08,735] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_swap_in_state: 77148.05 | optimizer_swap_out_state: 75127.83 | optimizer_step: 156991.31
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] step=1, skipped=0, lr=[4.6874999999999995e-08, 4.6874999999999995e-08], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 10651.79 | backward_microstep: 24895.01 | backward_inner_microstep: 24778.36 | backward_allreduce_microstep: 116.51 | step_microstep: 160185.78
[2022-03-08 08:58:08,736] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 10651.79 | backward: 24895.01 | backward_inner: 24778.38 | backward_allreduce: 116.50 | step: 160185.78
iteration 1/ 320000 | elapsed time per iteration (ms): 195766.2 | learning rate: 4.687E-08 | lm loss: 1.154783E+01 | loss scale: 1024.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
after 1 iterations memory (MB) | allocated: 20.43310546875 | max allocated: 3943.98779296875 | reserved: 6786.0 | max reserved: 6786.0
time (ms) | forward: 10684.03 | backward: 24895.16 | backward-backward: 24895.11 | backward-allreduce: 0.00 | optimizer: 160185.98 | batch generator: 4.91
Effective Tera Flops per GPU: 2.23 and total parameters 6.654 B
Traceback (most recent call last):
File "pretrain_gpt2.py", line 134, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 110, in pretrain
train_data_iterator, valid_data_iterator)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 481, in train
lr_scheduler)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 330, in train_step
loss, loss_reduced = forward_step_func(data_iterator, model)
File "pretrain_gpt2.py", line 100, in forward_step
losses = model(tokens, position_ids, attention_mask, labels=labels)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1597, in forward
loss = self.module(*inputs, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/gpt2_model.py", line 81, in forward
get_key_value=get_key_value)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py", line 333, in forward
get_key_value=get_key_value)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 988, in forward
attention_mask)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 965, in _checkpointed_forward
hidden_states, attention_mask)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 748, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
outputs = run_function(*inputs_cuda)
File "/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py", line 955, in custom_forward
x_ = layer(x_, inputs[1])
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
result = hook(self, input)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1412, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1529, in pre_sub_module_forward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 359, in fetch_sub_module
f"tracing error at step {self.__step_id}: "
RuntimeError: tracing error at step 19: expected the next 1 parameters in the parameter fetch queue to be ({'id': 19, 'status': 'AVAILABLE', 'numel': 4096, 'ds_numel': 4096, 'shape': (4096,), 'ds_shape': (4096,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {20}},) but got ({'id': 17, 'status': 'AVAILABLE', 'numel': 12288, 'ds_numel': 12288, 'shape': (12288,), 'ds_shape': (12288,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()},).
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24210
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24211
[2022-03-08 08:58:13,817] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24212
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24213
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24214
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24215
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24216
[2022-03-08 08:58:13,818] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 24217
[2022-03-08 08:58:13,818] [ERROR] [launch.py:184:sigkill_handler] ['/usr/bin/python3', '-u', 'pretrain_gpt2.py', '--local_rank=7', '--model-parallel-size', '1', '--num-layers', '32', '--hidden-size', '4096', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--batch-size', '8', '--train-iters', '320000', '--lr-decay-iters', '320000', '--save', 'checkpoints/gpt2_345m_ds', '--load', 'checkpoints/gpt2_345m_ds', '--data-path', '/data/Megatron-LM/data/indexed_datasets/megatron', '--vocab-file', '/data/Megatron-LM/data/gpt2-vocab.json', '--merge-file', '/data/Megatron-LM/data/gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '1.5e-4', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--save-interval', '10000', '--eval-interval', '2000', '--eval-iters', '10', '--fp16', '--scattered-embeddings', '--split-transformers', '--deepspeed', '--deepspeed_config', '/home/ubuntu/git/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/examples/infinity2.json', '--zero-stage', '3', '--zero-reduce-bucket-size', '5000000', '--zero-allgather-bucket-size', '50000000', '--zero-contigious-gradients', '--zero-reduce-scatter', '--deepspeed-activation-checkpointing', '--checkpoint-num-layers', '1', '--partition-activations', '--checkpoint-in-cpu', '--synchronize-each-layer', '--contigious-checkpointing'] exits with return code = 1
Expected behavior Training is expected to proceed while offloading to NVMe.
ds_report output
System info:
- OS: Ubuntu 18.04
- GPU: 8x NVIDIA T4 (AWS g4dn.metal)
- 1 node
- Python version: 3.7.5
Launcher context: bash examples/ds_pretrain_gpt2-zero3.sh
#! /bin/bash
# Change for multinode config
MP_SIZE=1
DEBUG=1
if [[ ${DEBUG} == 1 ]]; then
MP_SIZE=1
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=8
HIDDEN_SIZE=4096
NUM_ATTN_HEADS=32
NUM_LAYERS=32
BATCHSIZE=8
else
NUM_WORKERS=${DLTS_NUM_WORKER}
NUM_GPUS_PER_WORKER=${DLTS_NUM_GPU_PER_WORKER}
HIDDEN_SIZE=8192
NUM_ATTN_HEADS=32
NUM_LAYERS=50
BATCHSIZE=4
#HIDDEN_SIZE=4096
#NUM_LAYERS=24 # 50
#BATCHSIZE=16
fi
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_PATH=${BASE_DATA_PATH}/indexed_datasets/megatron
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m_ds
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
if [[ -z $1 ]]; then
#config_json="$script_dir/ds_zero_stage_3_config.json"
# offloads to NVMe
config_json="$script_dir/infinity2.json"
else
config_json=$script_dir/`basename $1`
fi
#ZeRO Configs
stage=3
reduce_scatter=true
contigious_gradients=true
rbs=5000000
#agbs=5000000000
agbs=50000000
#Activation Checkpointing and Contigious Memory
chkp_layers=1
PA=true
PA_CPU=true
CC=true
SYNCHRONIZE=true
PROFILE=false
# TiledLinear splits, 0 is disable
TILED_LINEAR="false"
TILE_DIM=1
# Megatron Model Parallelism
LOGDIR="tboard-zero3/stage${stage}-lazyscatter-${NUM_LAYERS}l_${HIDDEN_SIZE}h_${NUM_WORKERS}n_${NUM_GPUS_PER_WORKER}g_${MP_SIZE}mp_${BATCHSIZE}b"
gpt_options=" \
--model-parallel-size ${MP_SIZE} \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length 1024 \
--max-position-embeddings 1024 \
--batch-size $BATCHSIZE \
--train-iters 320000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_PATH \
--merge-file $MERGE_PATH \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1.5e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup 0.01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 10000 \
--eval-interval 2000 \
--eval-iters 10 \
--fp16 \
--scattered-embeddings \
--split-transformers \
"
#--tensorboard-dir ${LOGDIR}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${stage} \
--zero-reduce-bucket-size ${rbs} \
--zero-allgather-bucket-size ${agbs}
"
if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-contigious-gradients"
fi
if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-reduce-scatter"
fi
chkp_opt=" \
--deepspeed-activation-checkpointing \
--checkpoint-num-layers ${chkp_layers}"
if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} --partition-activations"
fi
if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt} \
--checkpoint-in-cpu"
fi
if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--synchronize-each-layer"
fi
if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt} \
--contigious-checkpointing"
fi
if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--profile-backward"
fi
if [ "${TILED_LINEAR}" = "true" ]; then
tile_opt="${tile_opt} \
--memory-centric-tiled-linear \
--tile-factor=${TILE_DIM}"
fi
full_options="${gpt_options} ${deepspeed_options} ${chkp_opt} ${tile_opt}"
run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} pretrain_gpt2.py ${@:2} ${full_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
Configuration file: infinity2.json
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1e8,
"allgather_partitions": true,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 9000000,
"sub_group_size": 1e10,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/mnt/nvme2" ,
"buffer_count": 4,
"pin_memory": true
},
"offload_param": {
"device": "nvme",
"nvme_path": "/mnt/nvme2",
"buffer_count": 5,
"pin_memory": true
}
},
"activation_checkpointing": {
"profile": true,
"cpu_checkpointing": true,
"partition_activations": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 1024,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false,
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"single_submit": false,
"overlap_events": true,
"thread_count": 1
}
}
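For reference, a config like this is normally passed to deepspeed.initialize by the training script. A minimal sketch under that assumption (the model, optimizer, and learning rate below are placeholders, not the Megatron GPT-2 setup, and keyword names can differ slightly between DeepSpeed versions):
import torch
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Placeholder model; the real run builds the Megatron GPT-2 model instead.
model = torch.nn.Linear(4096, 4096)
# NVMe optimizer offload generally expects a CPU-capable optimizer such as DeepSpeedCPUAdam.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1.5e-4)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="infinity2.json",  # the ZeRO-Infinity config shown above
)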
Docker context I don't use Docker.
Top GitHub Comments
@lkm2835, can you please confirm that you are still seeing this issue?
I had the same issue, but managed to resolve it by rolling back to 0.5.10. I'm not sure of the exact root cause right now, but the 0.6.0 update included a lot of changes to stage 3.
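For anyone trying the same workaround, a quick sanity check of the installed version after pinning DeepSpeed (the 0.5.10 pin is just the rollback mentioned above, not a confirmed fix):
import deepspeed

# After running: pip install deepspeed==0.5.10
print(deepspeed.__version__)  # expect "0.5.10" if the rollback took effect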