Describe the bug

Running DeepSpeed with the Hugging Face transformers Trainer.train() fails with “RuntimeError: Tensors must be CUDA and dense”.

DeepSpeed 0.5.4 runs without this error. The bug appears with DeepSpeed 0.5.5 and with the current 0.5.6 GitHub version.
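To check quickly whether an environment falls in the range described above, a minimal sketch along these lines may help. The helper names (`parse_version`, `report_status`) are hypothetical, and the classification only encodes what this report states: 0.5.4 worked, 0.5.5 and a 0.5.6 source checkout failed. It makes no claim about any other release, including a packaged 0.5.6.

```python
import re

# Versions named in this report. Assumption: the report does not say which
# release contains the fix, so anything else is treated as unknown. Note that
# 0.5.6 here refers to a GitHub checkout, not necessarily the released wheel.
REPORTED_OK = {(0, 5, 4)}
REPORTED_FAILING = {(0, 5, 5), (0, 5, 6)}


def parse_version(text):
    """Return the leading (major, minor, patch) of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def report_status(version_text):
    """Classify a DeepSpeed version against what this report states."""
    version = parse_version(version_text)
    if version in REPORTED_FAILING:
        return "reported failing"
    if version in REPORTED_OK:
        return "reported working"
    return "not covered by this report"


if __name__ == "__main__":
    # importlib.metadata is part of the standard library from Python 3.8.
    from importlib import metadata

    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        print("deepspeed is not installed in this environment")
    else:
        print(installed, "->", report_status(installed))
```

Local version suffixes such as `0.5.6+abc123` are reduced to their first three numeric components before comparison.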

To Reproduce

Run a training script adapted from the public example linked below the traceback:

[INFO|trainer.py:1196] 2021-11-03 17:18:24,597 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-11-03 17:18:24,597 >>   Num examples = 1707
[INFO|trainer.py:1198] 2021-11-03 17:18:24,597 >>   Num Epochs = 10
[INFO|trainer.py:1199] 2021-11-03 17:18:24,597 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1200] 2021-11-03 17:18:24,598 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1201] 2021-11-03 17:18:24,598 >>   Gradient Accumulation steps = 32
[INFO|trainer.py:1202] 2021-11-03 17:18:24,598 >>   Total optimization steps = 260
  0%|▌ | 1/260 [00:24<1:46:01, 24.56s/it]
Traceback (most recent call last):
  File "/mnt/default/code/finetune/run_clm.py", line 521, in <module>
    main()
  File "/mnt/default/code/finetune/run_clm.py", line 471, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1849, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1881, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1347, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1060, in _call_impl
    result = hook(self, input)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1452, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1569, in pre_sub_module_forward_function
    self.param_coordinator.prefetch_next_sub_modules(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in prefetch_next_sub_modules
    self._all_gather(params_to_prefetch, async_op=True)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 524, in _all_gather
    handles = partitioned_params[0].all_gather(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 590, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 689, in _all_gather
    handle = self._allgather_param(param,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 891, in _allgather_param
    handle = dist._all_gather_base(flat_tensor,
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1968, in _all_gather_base
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: Tensors must be CUDA and dense

https://huggingface.co/ErykWdowiak/GPTalian/blob/main/scripts/run_clm.py
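The error message states the condition that failed: the tensor handed to the all-gather call was either not on a CUDA device or had a sparse layout. The traceback does not show which of the two applied, or which tensor was involved. As a debugging aid, a small sketch like the following could be used to list offending tensors before a collective call. The function names are hypothetical and not part of DeepSpeed or PyTorch. It relies only on the `is_cuda` and `is_sparse` attributes, which `torch.Tensor` exposes, so stand-in objects are used here to keep the example runnable without a GPU.

```python
from types import SimpleNamespace


def describe_collective_input(name, tensor):
    """Return a problem description, or None if the tensor looks acceptable.

    Mirrors the condition in the error message: the tensor must be on a CUDA
    device and must not be sparse. Works on any object exposing `is_cuda` and
    `is_sparse`.
    """
    problems = []
    if not tensor.is_cuda:
        problems.append("not on a CUDA device")
    if tensor.is_sparse:
        problems.append("sparse layout")
    if not problems:
        return None
    return f"{name}: " + ", ".join(problems)


def find_bad_tensors(named_tensors):
    """Collect descriptions for every tensor that would fail the check."""
    reports = (describe_collective_input(n, t) for n, t in named_tensors)
    return [r for r in reports if r is not None]


if __name__ == "__main__":
    # Stand-ins for tensors so the sketch runs without torch installed.
    on_gpu = SimpleNamespace(is_cuda=True, is_sparse=False)
    on_cpu = SimpleNamespace(is_cuda=False, is_sparse=False)
    print(find_bad_tensors([("weight", on_gpu), ("bias", on_cpu)]))
```

With real tensors, the same function accepts any iterable of `(name, tensor)` pairs.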

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja … [OKAY]

op name … installed … compatible

cpu_adam … [NO] … [OKAY]
cpu_adagrad … [NO] … [OKAY]
fused_adam … [NO] … [OKAY]
fused_lamb … [NO] … [OKAY]
sparse_attn … [NO] … [OKAY]
transformer … [NO] … [OKAY]
stochastic_transformer . [NO] … [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io … [NO] … [NO]
transformer_inference … [NO] … [OKAY]
utils … [NO] … [OKAY]
quantizer … [NO] … [OKAY]

DeepSpeed general environment info:
torch install path … ['/opt/conda/lib/python3.8/site-packages/torch']
torch version … 1.9.0+cu111
torch cuda version … 11.1
nvcc version … 11.1
deepspeed install path … ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info … 0.5.5, unknown, unknown
deepspeed wheel compiled w. … torch 1.8, cuda 11.1
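One detail in the environment info is worth noting: the installed torch is 1.9.0+cu111, while the DeepSpeed wheel was compiled against torch 1.8. Nothing in this report establishes that the mismatch is related to the error. The sketch below only shows one way to compare such version strings at the major.minor level. The function names are hypothetical, and the input strings are copied from the ds_report output above.

```python
import re


def major_minor(version_text):
    """Extract (major, minor) from strings such as '1.9.0+cu111' or 'torch 1.8'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(runtime_version, compiled_against):
    """True when both strings carry the same major.minor version."""
    return major_minor(runtime_version) == major_minor(compiled_against)


if __name__ == "__main__":
    # Values taken from the ds_report output in this report.
    print("torch matches wheel:", versions_match("1.9.0+cu111", "torch 1.8"))
    print("cuda matches wheel:", versions_match("11.1", "cuda 11.1"))
```

Only the first `major.minor` pair in each string is compared, so patch levels and local suffixes such as `+cu111` are ignored.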

Launcher context

deepspeed

Docker context

pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

tjruwase commented, Nov 11, 2021 (1 reaction)

Thanks for testing so quickly.

Chenglong-MS commented, Nov 11, 2021 (1 reaction)

Looks good now, big thanks!
