AsyncIO Error with Stage 3 + NVME offload
Hi,
When trying to use ZeRO Stage 3 with NVMe offloading, which is required for fitting large models into memory, I am seeing the following error:
/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp: buffer nbytes != file bytes 28824000 != 28311552
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): \
Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
I have inserted some debug print() statements at both the write and read Python call sites, here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/swap_tensor/utils.py#L19-L26
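These just compare the tensor's byte count against the size of the swap file on disk. A minimal sketch of that kind of check (not the exact prints; check_swap_file is an illustrative helper, not part of DeepSpeed):

```python
import os
import torch

def check_swap_file(buffer: torch.Tensor, path: str) -> None:
    """Print the tensor's byte count next to the swap file's size on disk."""
    expected = buffer.numel() * buffer.element_size()
    on_disk = os.path.getsize(path) if os.path.exists(path) else -1
    print(f"{path}: tensor bytes = {expected}, file bytes = {on_disk}")

# Called at the write and read call sites linked above, e.g.
# check_swap_file(buffer, '/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp')
```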
And I observe that when I am writing, the tensor really is trying to write 28824000 bytes (because it has 14412000 elements). However, when I do an ls -l /nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp, I observe that the file only has 28311552 bytes, as mentioned in the error message. So it seems that somehow the async write command is failing to properly write the full contents of the tensor.
Any idea why this would happen? Or suggestions for how to debug further?
I have tried looking at kernel logs via dmesg, but nothing turns up. I have also tried running the program with strace -e trace=io_submit,io_getevents,io_setup, but I only see the io_setup syscall and not the io_submit or io_getevents syscalls.
I do have libaio-dev installed.
My DeepSpeed config looks like this:
zero_optimization:
  stage: 3
  stage3_prefetch_bucket_size: 1e9
  stage3_param_persistence_threshold: 1e6
  stage3_max_live_parameters: 1e9
  overlap_comm: true
  contiguous_gradients: true
  offload_param:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
    max_in_cpu: 1e9
    buffer_size: 1e9
    buffer_count: 5
  offload_optimizer:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
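For reference, the same settings expressed as the dict passed to deepspeed.initialize would look roughly like this (a sketch; the model/optimizer setup is omitted, and I am assuming the YAML-style block above maps one-to-one onto the standard DeepSpeed config keys):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1e9,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
            "max_in_cpu": 1e9,
            "buffer_size": 1e9,
            "buffer_count": 5,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
        },
    },
}

# Passed in along the lines of:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
# (older releases take this argument as config_params instead of config)
```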
Thanks, Stephen
Top GitHub Comments
@tjruwase yes, I have tested #1086 via commit 38d46848f450a080c8ab96427d9da00c2e5b6327, and it works for me now. Thanks for the quick fix!
Actually, everything works fine if I use a single GPU. I think the underlying problem is this line: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L681-L682
i.e.: while the tensor size is aligned properly, after dividing by world size the partition_size is not aligned. And the get_buffer() call here uses the compute_buffer, not the aligned swap_buffer: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L689
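To make the alignment point concrete, here is a minimal sketch (the sizes and the alignment constant below are illustrative only, not DeepSpeed's actual values): aligning the full tensor does not, in general, leave each rank's partition aligned, so the per-rank swap buffer and the file written through the aligned AIO path can end up with different byte counts.

```python
ALIGNMENT = 1024  # illustrative alignment granularity, in elements

def align_up(numel: int, align: int) -> int:
    """Round numel up to the next multiple of align."""
    return ((numel + align - 1) // align) * align

world_size = 32
tensor_numel = align_up(345_000_000, ALIGNMENT)   # full tensor: aligned
partition_numel = tensor_numel // world_size      # per-rank slice

print(tensor_numel % ALIGNMENT)      # 0 -> the full tensor is aligned
print(partition_numel % ALIGNMENT)   # non-zero here -> the partition is not
```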