[testing] failing tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_stage3_nvme_offload
So a few days ago tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_stage3_nvme_offload
started hanging and getting killed by pytest-timeout.
It gets stuck in _jit_compile, which never completes. This is NVMe-specific, as all other DeepSpeed tests that use JIT compilation work just fine.
If I run it on my own setup after first clearing the cache with rm -rf ~/.cache/torch_extensions/, it works just fine. So it happens only on that github-actions runner.
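For reference, a minimal sketch of what that local cleanup amounts to, if anyone wants to script it (the default cache location and the TORCH_EXTENSIONS_DIR override are assumptions about the standard PyTorch layout, not something the test itself sets):

```python
import os
import shutil
from pathlib import Path

# PyTorch JIT-builds extensions (cpu_adam, utils, async_io, ...) under this
# directory by default; TORCH_EXTENSIONS_DIR can override the location.
ext_root = Path(os.environ.get("TORCH_EXTENSIONS_DIR",
                               Path.home() / ".cache" / "torch_extensions"))

if ext_root.exists():
    # Equivalent to `rm -rf ~/.cache/torch_extensions/` - forces a clean
    # rebuild on the next run and removes any stale build lock files.
    shutil.rmtree(ext_root)
```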
I went back to the logs from a few days back, when it wasn't failing, and verified that the same libaio packages were installed in both cases:
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libaio1 amd64 0.3.112-5 [7184 B]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 libaio-dev amd64 0.3.112-5 [13.7 kB]
@tjruwase, any insight into why it might have started hanging while building the NVMe CUDA extension?
The main difference is that the successful run was using deepspeed-0.4.2, and the failures started with the deepspeed-0.4.3 release. I looked through the changes since 0.4.2 and I don't see anything remotely related to the op_builder other than https://github.com/microsoft/DeepSpeed/pull/1213 - could that be related?
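For debugging this outside the test suite, here is a minimal standalone sketch that exercises the same JIT path the test hangs in. It assumes deepspeed 0.4.x with libaio-dev installed; AsyncIOBuilder and its load()/is_compatible() methods are taken from the traceback below and that release's op_builder API, so treat it as a sketch rather than a guaranteed reproducer:

```python
# Triggers the same AsyncIOBuilder().load() -> jit_load() -> _jit_compile()
# chain the test gets stuck in, without spinning up a Trainer.
from deepspeed.ops.op_builder import AsyncIOBuilder

builder = AsyncIOBuilder()
print("async_io compatible:", builder.is_compatible())  # checks for libaio
aio_op = builder.load(verbose=True)                      # JIT-builds the op if not cached
print("loaded async_io op:", aio_op)
```

If this hangs the same way on the runner, the problem is in the extension build/caching layer rather than in the Trainer integration.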
The full log is:
self = <test_deepspeed.TrainerIntegrationDeepSpeed testMethod=test_stage3_nvme_offload>

    @require_deepspeed_aio
    def test_stage3_nvme_offload(self):
        with mockenv_context(**self.dist_env_1_gpu):
            # this actually doesn't have to be on NVMe, any storage will do since this test only
            # runs a simple check that we can use some directory as if it were NVMe
            nvme_path = self.get_auto_remove_tmp_dir()
            nvme_config = dict(device="nvme", nvme_path=nvme_path)
            ds_config_zero3_dict = self.get_config_dict(ZERO3)
            ds_config_zero3_dict["zero_optimization"]["offload_optimizer"] = nvme_config
            ds_config_zero3_dict["zero_optimization"]["offload_param"] = nvme_config
            trainer = get_regression_trainer(local_rank=0, fp16=True, deepspeed=ds_config_zero3_dict)
            with CaptureLogger(deepspeed_logger) as cl:
>               trainer.train()

tests/deepspeed/test_deepspeed.py:321:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/trainer.py:1124: in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
src/transformers/deepspeed.py:370: in deepspeed_init
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py:126: in initialize
    engine = DeepSpeedEngine(args=args,
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:194: in __init__
    self._configure_optimizer(optimizer, model_parameters)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:726: in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:940: in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py:809: in __init__
    self._configure_tensor_swapping(offload_optimizer_config, aio_config)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py:938: in _configure_tensor_swapping
    self.optimizer_swapper = swapper_type(
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py:47: in __init__
    aio_op = AsyncIOBuilder().load()
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py:239: in load
    return self.jit_load(verbose)
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py:267: in jit_load
    op_module = load(
/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:1074: in load
    return _jit_compile(
/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:1301: in _jit_compile
    baton.wait()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <torch.utils.file_baton.FileBaton object at 0x7f7418fe1fa0>

    def wait(self):
        '''
        Periodically sleeps for a certain amount until the baton is released.

        The amount of time slept depends on the ``wait_seconds`` parameter
        passed to the constructor.
        '''
        while os.path.exists(self.lock_file_path):
>           time.sleep(self.wait_seconds)
E       Failed: Timeout >60.0s

/opt/conda/lib/python3.8/site-packages/torch/utils/file_baton.py:42: Failed
----------------------------- Captured stdout call -----------------------------
[2021-07-14 20:39:36,891] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2021-07-14 20:39:36,892] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-07-14 20:39:36,914] [INFO] [engine.py:179:__init__] DeepSpeed Flops Profiler Enabled: False
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.25669288635253906 seconds
Adam Optimizer #19 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-07-14 20:39:37,652] [INFO] [engine.py:708:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-07-14 20:39:37,653] [INFO] [engine.py:713:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2021-07-14 20:39:37,653] [INFO] [utils.py:43:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-07-14 20:39:37,653] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2021-07-14 20:39:37,653] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2021-07-14 20:39:37,653] [INFO] [stage3.py:633:__init__] Reduce bucket size 1
[2021-07-14 20:39:37,653] [INFO] [stage3.py:634:__init__] Allgather bucket size 0.9
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005452632904052734 seconds
[2021-07-14 20:39:37,656] [INFO] [stage3.py:933:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors
[2021-07-14 20:39:37,657] [INFO] [utils.py:30:print_object] SwapBufferManager:
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] count ........................ 4
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3]
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] gigabytes .................... 3.814697265625e-06
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] num_elems .................... 256
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] used_buffer_index ............ {}
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
----------------------------- Captured stderr call -----------------------------
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp fp16 backend
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~~~~~ Stack of Thread-1 (140136515512064) ~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/site-packages/tqdm/_monitor.py", line 59, in run
    self.was_killed.wait(self.sleep_interval)
  File "/opt/conda/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/opt/conda/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~~~~~~~~~ Stack of <unknown> (140136768341760) ~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 285, in _perform_spawn
    reply.run()
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 220, in run
    self._result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 967, in _thread_receiver
    msg = Message.from_io(io)
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 432, in from_io
    header = io.read(9) # type 1, channel 4, payload 4
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 400, in read
    data = self._read(numbytes - len(buf))
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Top GitHub Comments
I think it did the trick, thank you @tjruwase! https://github.com/huggingface/transformers/pull/12723
That's very cool!!! I had been stuck here for a long time, and finally I found this solution!
The system was just waiting after the log:
The debug process located at
After I removed the folder, the process went back to normal. Cool!
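For context on why removing that cached folder unblocks things: as the traceback above shows, torch.utils.cpp_extension serializes concurrent JIT builds with a file "baton" - a lock file in the extension's build directory - and the waiting side simply polls for that file to disappear. A stale lock file left behind by a killed or wedged build therefore makes wait() spin forever. Roughly, paraphrasing the wait() loop from the traceback (the lock path in the usage comment is illustrative, not taken from the log):

```python
import os
import time

def wait_for_baton(lock_file_path: str, wait_seconds: float = 0.1) -> None:
    """Paraphrase of torch.utils.file_baton.FileBaton.wait() from the traceback:
    poll until whichever process holds the baton deletes its lock file."""
    while os.path.exists(lock_file_path):
        # If the builder process died without cleaning up its lock file,
        # this loop never exits - exactly the hang pytest-timeout kills at 60s.
        time.sleep(wait_seconds)

# Illustrative stale-lock path (name is an assumption, not from the log):
# wait_for_baton(os.path.expanduser("~/.cache/torch_extensions/async_io/lock"))
```

Deleting ~/.cache/torch_extensions/ (or just the stale lock file inside it) breaks that loop, which is presumably why both the CI fix linked above and the manual cleanup described in this comment work.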