[testing] failing tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_stage3_nvme_offload
So a few days ago tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_stage3_nvme_offload
started hanging and getting killed by pytest-timeout.
It gets stuck in _jit_compile, which never completes. This is NVMe-specific, as all other DeepSpeed tests that use JIT compilation work just fine.
If I run it on my own setup after first clearing the cache with rm -rf ~/.cache/torch_extensions/, it works just fine. So it happens only on that github-actions runner.
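For reference, a minimal sketch of what that local cleanup amounts to, if anyone wants to script it (the default cache location and the TORCH_EXTENSIONS_DIR override are assumptions about the standard PyTorch layout, not something the test itself sets):

```python
import os
import shutil
from pathlib import Path

# PyTorch JIT-builds extensions (cpu_adam, utils, async_io, ...) under this
# directory by default; TORCH_EXTENSIONS_DIR can override the location.
ext_root = Path(os.environ.get("TORCH_EXTENSIONS_DIR",
                               Path.home() / ".cache" / "torch_extensions"))

if ext_root.exists():
    # Equivalent to `rm -rf ~/.cache/torch_extensions/` - forces a clean
    # rebuild on the next run and removes any stale build lock files.
    shutil.rmtree(ext_root)
```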
I went back to the logs from a few days back, when it wasn't failing, and verified that the same libaio packages were installed in both cases:
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libaio1 amd64 0.3.112-5 [7184 B]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 libaio-dev amd64 0.3.112-5 [13.7 kB]
@tjruwase, any insight into why it might have started hanging while building the NVMe CUDA extension?
The main difference is that the successful run was using deepspeed-0.4.2, and the failures started with the deepspeed-0.4.3 release. I looked through the changes since 0.4.2 and I don't see anything remotely related to the op_builder other than https://github.com/microsoft/DeepSpeed/pull/1213 - could that be related?
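For debugging this outside the test suite, here is a minimal standalone sketch that exercises the same JIT path the test hangs in. It assumes deepspeed 0.4.x with libaio-dev installed; AsyncIOBuilder and its load()/is_compatible() methods are taken from the traceback below and that release's op_builder API, so treat it as a sketch rather than a guaranteed reproducer:

```python
# Triggers the same AsyncIOBuilder().load() -> jit_load() -> _jit_compile()
# chain the test gets stuck in, without spinning up a Trainer.
from deepspeed.ops.op_builder import AsyncIOBuilder

builder = AsyncIOBuilder()
print("async_io compatible:", builder.is_compatible())  # checks for libaio
aio_op = builder.load(verbose=True)                      # JIT-builds the op if not cached
print("loaded async_io op:", aio_op)
```

If this hangs the same way on the runner, the problem is in the extension build/caching layer rather than in the Trainer integration.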
The full log is:
self = <test_deepspeed.TrainerIntegrationDeepSpeed testMethod=test_stage3_nvme_offload>

    @require_deepspeed_aio
    def test_stage3_nvme_offload(self):
        with mockenv_context(**self.dist_env_1_gpu):
            # this actually doesn't have to be on NVMe, any storage will do since this test only
            # runs a simple check that we can use some directory as if it were NVMe
            nvme_path = self.get_auto_remove_tmp_dir()
            nvme_config = dict(device="nvme", nvme_path=nvme_path)
            ds_config_zero3_dict = self.get_config_dict(ZERO3)
            ds_config_zero3_dict["zero_optimization"]["offload_optimizer"] = nvme_config
            ds_config_zero3_dict["zero_optimization"]["offload_param"] = nvme_config
            trainer = get_regression_trainer(local_rank=0, fp16=True, deepspeed=ds_config_zero3_dict)
            with CaptureLogger(deepspeed_logger) as cl:
>               trainer.train()

tests/deepspeed/test_deepspeed.py:321:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/trainer.py:1124: in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
src/transformers/deepspeed.py:370: in deepspeed_init
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py:126: in initialize
    engine = DeepSpeedEngine(args=args,
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:194: in __init__
    self._configure_optimizer(optimizer, model_parameters)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:726: in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:940: in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py:809: in __init__
    self._configure_tensor_swapping(offload_optimizer_config, aio_config)
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py:938: in _configure_tensor_swapping
    self.optimizer_swapper = swapper_type(
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py:47: in __init__
    aio_op = AsyncIOBuilder().load()
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py:239: in load
    return self.jit_load(verbose)
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py:267: in jit_load
    op_module = load(
/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:1074: in load
    return _jit_compile(
/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py:1301: in _jit_compile
    baton.wait()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <torch.utils.file_baton.FileBaton object at 0x7f7418fe1fa0>

    def wait(self):
        '''
        Periodically sleeps for a certain amount until the baton is released.

        The amount of time slept depends on the ``wait_seconds`` parameter
        passed to the constructor.
        '''
        while os.path.exists(self.lock_file_path):
>           time.sleep(self.wait_seconds)
E       Failed: Timeout >60.0s

/opt/conda/lib/python3.8/site-packages/torch/utils/file_baton.py:42: Failed
----------------------------- Captured stdout call -----------------------------
[2021-07-14 20:39:36,891] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2021-07-14 20:39:36,892] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-07-14 20:39:36,914] [INFO] [engine.py:179:__init__] DeepSpeed Flops Profiler Enabled: False
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.25669288635253906 seconds
Adam Optimizer #19 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-07-14 20:39:37,652] [INFO] [engine.py:708:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-07-14 20:39:37,653] [INFO] [engine.py:713:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2021-07-14 20:39:37,653] [INFO] [utils.py:43:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-07-14 20:39:37,653] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2021-07-14 20:39:37,653] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2021-07-14 20:39:37,653] [INFO] [stage3.py:633:__init__] Reduce bucket size 1
[2021-07-14 20:39:37,653] [INFO] [stage3.py:634:__init__] Allgather bucket size 0.9
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005452632904052734 seconds
[2021-07-14 20:39:37,656] [INFO] [stage3.py:933:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors
[2021-07-14 20:39:37,657] [INFO] [utils.py:30:print_object] SwapBufferManager:
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] count ........................ 4
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3]
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] gigabytes .................... 3.814697265625e-06
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] num_elems .................... 256
[2021-07-14 20:39:37,657] [INFO] [utils.py:34:print_object] used_buffer_index ............ {}
Using /github/home/.cache/torch_extensions as PyTorch extensions root...
----------------------------- Captured stderr call -----------------------------
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp fp16 backend
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~~~~~ Stack of Thread-1 (140136515512064) ~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/site-packages/tqdm/_monitor.py", line 59, in run
    self.was_killed.wait(self.sleep_interval)
  File "/opt/conda/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/opt/conda/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
~~~~~~~~~~~~~~~~~~~~~ Stack of <unknown> (140136768341760) ~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 285, in _perform_spawn
    reply.run()
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 220, in run
    self._result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 967, in _thread_receiver
    msg = Message.from_io(io)
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 432, in from_io
    header = io.read(9) # type 1, channel 4, payload 4
  File "/opt/conda/lib/python3.8/site-packages/execnet/gateway_base.py", line 400, in read
    data = self._read(numbytes - len(buf))
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
Top GitHub Comments
I think it did the trick, thank you @tjruwase! https://github.com/huggingface/transformers/pull/12723
That's very cool!!! I had been stuck here for a long time, and finally I found this solution!
The system was just waiting after the log:
The debug process located at
After I removed the folder, the process went back to normal. Cool!
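For context on why removing that cached folder unblocks things: as the traceback above shows, torch.utils.cpp_extension serializes concurrent JIT builds with a file "baton" - a lock file in the extension's build directory - and the waiting side simply polls for that file to disappear. A stale lock file left behind by a killed or wedged build therefore makes wait() spin forever. Roughly, paraphrasing the wait() loop from the traceback (the lock path in the usage comment is illustrative, not taken from the log):

```python
import os
import time

def wait_for_baton(lock_file_path: str, wait_seconds: float = 0.1) -> None:
    """Paraphrase of torch.utils.file_baton.FileBaton.wait() from the traceback:
    poll until whichever process holds the baton deletes its lock file."""
    while os.path.exists(lock_file_path):
        # If the builder process died without cleaning up its lock file,
        # this loop never exits - exactly the hang pytest-timeout kills at 60s.
        time.sleep(wait_seconds)

# Illustrative stale-lock path (name is an assumption, not from the log):
# wait_for_baton(os.path.expanduser("~/.cache/torch_extensions/async_io/lock"))
```

Deleting ~/.cache/torch_extensions/ (or just the stale lock file inside it) breaks that loop, which is presumably why both the CI fix linked above and the manual cleanup described in this comment work.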