Describe the bug Running DeepSpeed with the Hugging Face transformers Trainer.train() leads to a "RuntimeError: Tensors must be CUDA and dense".
There is no problem with deepspeed 0.5.4, but the bug is present in deepspeed 0.5.5 and in the current 0.5.6 GitHub version.
To Reproduce The script is adapted from the following public example: https://huggingface.co/ErykWdowiak/GPTalian/blob/main/scripts/run_clm.py
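The exact launch command and DeepSpeed config are not included in the report, but the stage3.py frames in the traceback below show that ZeRO stage 3 is enabled. A minimal sketch of an equivalent setup (the model name, dataset, and config values here are placeholders, not taken from the report; the real run uses the linked run_clm.py under the deepspeed launcher):

```python
# Minimal sketch of an HF Trainer + DeepSpeed ZeRO-3 run. "gpt2" and the dataset are
# placeholders; batch sizes/epochs mirror the log below. Launch under the deepspeed
# (or torch distributed) launcher so a process group exists.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

ds_config = {
    # ZeRO stage 3 partitions parameters across ranks; the failing all_gather in the
    # traceback happens while re-gathering them for the forward pass.
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    num_train_epochs=10,
    deepspeed=ds_config,  # Trainer builds the DeepSpeed engine from this config
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = ...  # any tokenized causal-LM dataset (omitted here)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()  # fails with "Tensors must be CUDA and dense" on deepspeed 0.5.5 / 0.5.6
```

With deepspeed 0.5.5 or 0.5.6 the run fails at the first optimization step: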
[INFO|trainer.py:1196] 2021-11-03 17:18:24,597 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-11-03 17:18:24,597 >> Num examples = 1707
[INFO|trainer.py:1198] 2021-11-03 17:18:24,597 >> Num Epochs = 10
[INFO|trainer.py:1199] 2021-11-03 17:18:24,597 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1200] 2021-11-03 17:18:24,598 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1201] 2021-11-03 17:18:24,598 >> Gradient Accumulation steps = 32
[INFO|trainer.py:1202] 2021-11-03 17:18:24,598 >> Total optimization steps = 260
0%|▌ | 1/260 [00:24<1:46:01, 24.56s/it]
Traceback (most recent call last):
File "/mnt/default/code/finetune/run_clm.py", line 521, in <module>
main()
File "/mnt/default/code/finetune/run_clm.py", line 471, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1849, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1347, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1060, in _call_impl
result = hook(self, input)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1452, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1569, in pre_sub_module_forward_function
self.param_coordinator.prefetch_next_sub_modules(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in prefetch_next_sub_modules
self._all_gather(params_to_prefetch, async_op=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 524, in _all_gather
handles = partitioned_params[0].all_gather(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 590, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 689, in _all_gather
handle = self._allgather_param(param,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 891, in _allgather_param
handle = dist._all_gather_base(flat_tensor,
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1968, in _all_gather_base
work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: Tensors must be CUDA and dense
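For context (my reading of the traceback, not stated in the report): "Tensors must be CUDA and dense" is the argument check NCCL collectives apply, so the flat buffer that _allgather_param hands to dist._all_gather_base is apparently not a dense CUDA tensor at this point. A hypothetical standalone snippet that trips the same check, assuming a single-GPU NCCL process group:

```python
# Hypothetical illustration (not from the report): NCCL collectives reject CPU or sparse
# tensors with exactly this error. Launch so a process group can form, e.g.
#   python -m torch.distributed.launch --nproc_per_node=1 repro.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads rank/world size from the environment

inp = torch.ones(4)                           # CPU tensor by mistake -- NCCL needs CUDA tensors
out = torch.empty(4 * dist.get_world_size())  # output buffer, also on CPU here

# Same call that fails inside partition_parameters.py above
dist._all_gather_base(out, inp)  # -> RuntimeError: Tensors must be CUDA and dense
```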
Expected behavior Training completes without error, as it does with deepspeed 0.5.4.
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
Launcher context deepspeed
Docker context pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel
Top GitHub Comments
Thanks for testing so quickly.
Looks good now, big thanks!