DeepSpeed internal error on CPU
🐛 Bug
DeepSpeed raises an internal error when the Trainer runs on CPU. I imagine DeepSpeed doesn't support CPU training, so we should raise a MisconfigurationException in that case.
To Reproduce
Code
import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(1, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="cpu",
        strategy="deepspeed",
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
Stacktrace
Traceback (most recent call last):
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 66, in <module>
    run()
  File "/home/carmocca/git/pytorch-lightning/pl_examples/bug_report/bug_report_model.py", line 62, in run
    trainer.fit(model, train_dataloaders=train_data)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1218, in _run
    self.strategy.setup(self)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 360, in setup
    self.init_deepspeed()
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 459, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 492, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/home/carmocca/git/pytorch-lightning/pytorch_lightning/strategies/deepspeed.py", line 424, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/home/carmocca/git/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
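The last frame fails because `device_rank` is `None` at the point of the comparison, presumably because a CPU device carries no device index (this cause is an assumption, not confirmed by the trace). The snippet below is a minimal illustration of the same failure outside DeepSpeed; `is_valid_rank` is a hypothetical helper, not a DeepSpeed function.

```python
from typing import Optional


def is_valid_rank(device_rank: Optional[int]) -> bool:
    """Hypothetical helper mirroring the `device_rank >= 0` check in the trace."""
    return device_rank >= 0


# Comparing None with an int raises TypeError in Python 3.
try:
    is_valid_rank(None)
    message = ""
except TypeError as err:
    message = str(err)

print(message)
```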
Expected behavior
A clearer error message (for example, the MisconfigurationException proposed above) instead of the internal TypeError.
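One way such a check could look is sketched below. This is not the actual Lightning implementation: `check_deepspeed_accelerator` is a hypothetical function, the exception class is a local stand-in for `pytorch_lightning.utilities.exceptions.MisconfigurationException` so the snippet runs without Lightning installed, and matching on the `"deepspeed"` name prefix is an assumption about how the strategy is identified.

```python
class MisconfigurationException(Exception):
    """Local stand-in for Lightning's MisconfigurationException."""


def check_deepspeed_accelerator(accelerator: str, strategy: str) -> None:
    """Hypothetical early validation: fail fast with a readable message."""
    # Assumption: DeepSpeed strategy names all start with "deepspeed".
    if strategy.startswith("deepspeed") and accelerator == "cpu":
        raise MisconfigurationException(
            f"Strategy {strategy!r} cannot be used with accelerator='cpu'."
        )


# Usage: the configuration from the reproduction script is rejected up front.
try:
    check_deepspeed_accelerator(accelerator="cpu", strategy="deepspeed")
    error = None
except MisconfigurationException as err:
    error = str(err)

print(error)
```

Running the check before the strategy is set up would surface the problem at Trainer construction rather than deep inside `deepspeed.initialize`.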
Environment
-e git+https://github.com/PyTorchLightning/pytorch-lightning@523fa74bfe4fcd387c042c7cb22c8abcf3e9f968#egg=pytorch_lightning
torch==1.11.0
torchmetrics==0.7.3
torchtext==0.12.0
deepspeed==0.6.1
@myxik Sure! Thank you 😃
Or better yet, the 1.7.0rc0 release: pip install --pre -U pytorch_lightning