AsyncIO Error with Stage 3 + NVME offload
Hi,
When trying to use ZeRO Stage 3 with NVMe offloading, which is required for fitting large models into memory, I am seeing the following error:
/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp: buffer nbytes != file bytes 28824000 != 28311552
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): \
Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
I have inserted some debug print() statements at both the write and read Python call sites, here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/swap_tensor/utils.py#L19-L26
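These just compare the tensor's byte count against the size of the swap file on disk. A minimal sketch of that kind of check (not the exact prints; check_swap_file is an illustrative helper, not part of DeepSpeed):

```python
import os
import torch

def check_swap_file(buffer: torch.Tensor, path: str) -> None:
    """Print the tensor's byte count next to the swap file's size on disk."""
    expected = buffer.numel() * buffer.element_size()
    on_disk = os.path.getsize(path) if os.path.exists(path) else -1
    print(f"{path}: tensor bytes = {expected}, file bytes = {on_disk}")

# Called at the write and read call sites linked above, e.g.
# check_swap_file(buffer, '/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp')
```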
And I observe that when I am writing, the tensor really is trying to write 28824000 bytes (because it has 14412000 elements). However, when I do an ls -l /nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp, I observe that the file only has 28311552 bytes, as mentioned in the error message. So it seems that somehow the async write command is failing to properly write the full contents of the tensor.
Any idea why this would happen? Or suggestions for how to debug further?
I have tried looking at kernel logs via dmesg, but nothing turns up. I have also tried running the program with strace -e trace=io_submit,io_getevents,io_setup, but I only see the io_setup syscall and not the io_submit or io_getevents syscalls.
I do have libaio-dev installed.
My DeepSpeed config looks like this:
zero_optimization:
  stage: 3
  stage3_prefetch_bucket_size: 1e9
  stage3_param_persistence_threshold: 1e6
  stage3_max_live_parameters: 1e9
  overlap_comm: true
  contiguous_gradients: true
  offload_param:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
    max_in_cpu: 1e9
    buffer_size: 1e9
    buffer_count: 5
  offload_optimizer:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
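For reference, the same settings expressed as the dict passed to deepspeed.initialize would look roughly like this (a sketch; the model/optimizer setup is omitted, and I am assuming the YAML-style block above maps one-to-one onto the standard DeepSpeed config keys):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1e9,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
            "max_in_cpu": 1e9,
            "buffer_size": 1e9,
            "buffer_count": 5,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
        },
    },
}

# Passed in along the lines of:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
# (older releases take this argument as config_params instead of config)
```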
Thanks, Stephen
Top GitHub Comments
@tjruwase yes, I have tested #1086 via commit 38d46848f450a080c8ab96427d9da00c2e5b6327, and it works for me now. Thanks for the quick fix!
Actually, everything works fine if I use a single GPU. I think the underlying problem is this line: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L681-L682
i.e.: while the tensor size is aligned properly, after dividing by world size the partition_size is not aligned. And the get_buffer() call here uses the compute_buffer, not the aligned swap_buffer: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L689
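To make the alignment point concrete, here is a minimal sketch (the sizes and the alignment constant below are illustrative only, not DeepSpeed's actual values): aligning the full tensor does not, in general, leave each rank's partition aligned, so the per-rank swap buffer and the file written through the aligned AIO path can end up with different byte counts.

```python
ALIGNMENT = 1024  # illustrative alignment granularity, in elements

def align_up(numel: int, align: int) -> int:
    """Round numel up to the next multiple of align."""
    return ((numel + align - 1) // align) * align

world_size = 32
tensor_numel = align_up(345_000_000, ALIGNMENT)   # full tensor: aligned
partition_numel = tensor_numel // world_size      # per-rank slice

print(tensor_numel % ALIGNMENT)      # 0 -> the full tensor is aligned
print(partition_numel % ALIGNMENT)   # non-zero here -> the partition is not
```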