
DeepSpeed internal error on CPU

See original GitHub issue

🐛 Bug

DeepSpeed raises an internal error when the Trainer runs on CPU. I imagine they don’t support CPU training so we should raise a MisconfigurationException in that case.
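
For illustration, a minimal sketch of the kind of guard this proposal implies (hypothetical function name and placement; the actual fix may look different). MisconfigurationException is Lightning's standard exception for invalid Trainer setups:

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _validate_deepspeed_accelerator(accelerator: str) -> None:
    # Hypothetical guard: DeepSpeed requires CUDA devices, so fail fast with a
    # clear message instead of crashing deep inside deepspeed.initialize().
    if accelerator == "cpu":
        raise MisconfigurationException(
            "The DeepSpeed strategy does not support CPU training."
            " Remove `strategy='deepspeed'` or run on a GPU machine."
        )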

To Reproduce

Code

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(1, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="cpu",
        strategy="deepspeed",
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
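
Saving this script and running it on a machine without GPUs (filename assumed) reproduces the crash:

python bug_report_model.py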

Stacktrace

Traceback (most recent call last):
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 66, in <module>
    run()
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 62, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1218, in _run
    self.strategy.setup(self)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 360, in setup
    self.init_deepspeed()
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 459, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 492, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 424, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
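
The immediate cause is a plain Python type error: in DeepSpeed's _set_distributed_vars, device_rank ends up as None (presumably because no CUDA device is available on a CPU-only machine), and None cannot be compared to an int. A standalone illustration, not DeepSpeed code:

device_rank = None  # assumption: what DeepSpeed resolves when no CUDA device exists
try:
    if device_rank >= 0:  # the comparison from deepspeed/runtime/engine.py
        pass
except TypeError as err:
    print(err)  # '>=' not supported between instances of 'NoneType' and 'int'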

Expected behavior

A clear error message, e.g. a MisconfigurationException explaining that DeepSpeed does not support CPU training, rather than an internal TypeError.

Environment

-e git+https://github.com/PyTorchLightning/pytorch-lightning@523fa74bfe4fcd387c042c7cb22c8abcf3e9f968#egg=pytorch_lightning
torch==1.11.0
torchmetrics==0.7.3
torchtext==0.12.0
deepspeed==0.6.1

cc @borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
akihironitta commented, Apr 8, 2022

@myxik Sure! Thank you 😃

0 reactions
carmocca commented, Jul 28, 2022

Or better yet, the 1.7.0rc0 release: pip install --pre -U pytorch_lightning

Read more comments on GitHub.

Top Results From Across the Web

DeepSpeed internal error on CPU · Issue #12607 - GitHub
Bug DeepSpeed raises an internal error when the Trainer runs on CPU. I imagine they don't support CPU training so we should raise...

DeepSpeed - Hugging Face
ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example ... This will result in an error because one can only use DS Scheduler when...

ZeRO & Fastest BERT: Increasing the scale and speed of ...
In this webinar, the DeepSpeed team will discuss what DeepSpeed is, how to use it with your existing PyTorch models, and advancements in...

KDD 2020: Hands on Tutorials: Deep Speed - YouTube
with over 100 billion parameters. Jing Zhao: Microsoft Bing; Yuxiong He: Microsoft; Samyam Rajbhandari: Microsoft; Hongzhi Li: Microsoft ...

Distributed communication package - torch.distributed - PyTorch
Use the Gloo backend for distributed CPU training. ... NCCL_BLOCKING_WAIT will provide errors to the user which can be caught and handled, but...
