
AsyncIO Error with Stage 3 + NVMe offload


Hi,

When trying to use ZeRO Stage 3 with NVMe offloading, which is required for fitting large models into memory, I am seeing the following error:

/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp: buffer nbytes != file bytes 28824000 != 28311552
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): \
Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.

I have inserted some debug print() statements at both the write and read Python call sites, here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/swap_tensor/utils.py#L19-L26

And I observe that when writing, the tensor really is trying to write 28824000 bytes (because it has 14412000 elements). However, when I run ls -l /nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp, I see that the file has only 28311552 bytes, as mentioned in the error message. So it seems that the async write command is somehow failing to write the full contents of the tensor.
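
For illustration, here is a minimal sketch of that size comparison (the path and element count are copied from the error message above; 1048576 is DeepSpeed's documented default aio block_size, so the arithmetic is a consistency check rather than a confirmed diagnosis):

import os
import torch

# Values taken from the error message above.
swap_path = "/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp"
tensor = torch.empty(14412000, dtype=torch.float16)  # 14412000 * 2 = 28824000 bytes

expected = tensor.numel() * tensor.element_size()
on_disk = os.path.getsize(swap_path)
print(f"expected {expected} bytes, file has {on_disk} bytes")

# Suggestive arithmetic: 28311552 == 27 * 1048576, i.e. exactly 27 full 1 MiB
# aio blocks -- consistent with a block-aligned async write dropping the
# unaligned 512448-byte tail of the tensor.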

Any idea why this would happen? Or suggestions for how to debug further?

I have tried looking at kernel logs via dmesg, but nothing turns up. I have also tried running the program under strace -e trace=io_submit,io_getevents,io_setup, but I only see the io_setup syscall, not io_submit or io_getevents.

I do have libaio-dev installed.

My DeepSpeed config looks like this:

zero_optimization:
  stage: 3
  stage3_prefetch_bucket_size: 1e9
  stage3_param_persistence_threshold: 1e6
  stage3_max_live_parameters: 1e9
  overlap_comm: true
  contiguous_gradients: true
  offload_param:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
    max_in_cpu: 1e9
    buffer_size: 1e9
    buffer_count: 5
  offload_optimizer:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
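
For reference, the same settings transcribed into the JSON-style dict form that the DeepSpeed configuration docs use (untested; shown mainly to make the intended nesting of offload_param and offload_optimizer explicit):

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1e9,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
            "max_in_cpu": 1e9,
            "buffer_size": 1e9,
            "buffer_count": 5,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
        },
    }
}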

Thanks, Stephen

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 20 (9 by maintainers)

Top GitHub Comments

1 reaction
stephenrawls commented, May 20, 2021

@tjruwase yes, I have tested #1086 via commit 38d46848f450a080c8ab96427d9da00c2e5b6327, and it works for me now. Thanks for the quick fix!

1 reaction
stephenrawls commented, May 19, 2021

Actually, everything works fine if I use a single GPU. I think the underlying problem is these lines: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L681-L682

i.e.:

tensor_size = self._aligned_size(param)
partition_size = tensor_size // self.world_size

While tensor_size is aligned properly, after dividing by world_size the resulting partition_size is not aligned.
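
A toy numeric example of the mismatch (hypothetical alignment and world size, just to show the arithmetic):

ALIGNMENT = 512                              # hypothetical alignment, in elements
world_size = 3

tensor_size = 1024                           # a multiple of ALIGNMENT, so the full tensor is aligned
partition_size = tensor_size // world_size   # 341 -- no longer a multiple of ALIGNMENT

assert tensor_size % ALIGNMENT == 0
assert partition_size % ALIGNMENT != 0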

And the get_buffer() call here uses the compute_buffer, not the aligned swap_buffer: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L689
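
Schematically, the distinction I mean is something like this (names and sizes are hypothetical, not DeepSpeed's actual buffer management):

import torch

ALIGNMENT = 512                    # hypothetical element alignment required by aio
partition_numel = 341              # the unaligned partition from the example above

# Swap buffer: rounded up to the alignment, so its byte count can match a
# block-aligned file on disk.
swap_numel = -(-partition_numel // ALIGNMENT) * ALIGNMENT  # round up to a multiple of 512
swap_buffer = torch.empty(swap_numel, dtype=torch.float16)

# Compute buffer: exactly partition_numel elements. Handing this unaligned
# view to the aio read is what trips the buffer.nbytes == num_file_bytes
# assertion.
compute_buffer = swap_buffer.narrow(0, 0, partition_numel)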


