ZeRO optimization doesn't work
When I enable any ZeRO optimization stage in the config, I get the error below. Environment: PyTorch 1.8.1, Windows 10, GTX 1070.
AttributeError                            Traceback (most recent call last)
<ipython-input-23-51abee19513b> in <module>
      4     config_params=config,
      5     #model_parameters=model.parameters(),
----> 6     dist_init_required=False
      7 )
      8

E:\programs\Anaconda\lib\site-packages\deepspeed\__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123         dist_init_required=dist_init_required,
    124         collate_fn=collate_fn,
--> 125         config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

E:\programs\Anaconda\lib\site-packages\deepspeed\runtime\engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    199         self.save_non_zero_checkpoint = False
    200         self.save_zero_checkpoint = False
--> 201         self._configure_checkpointing(dist_init_required)
    202
    203         if self.pld_enabled():

E:\programs\Anaconda\lib\site-packages\deepspeed\runtime\engine.py in _configure_checkpointing(self, dist_init_required)
    469         if self.zero_optimization():
    470             param_rank = torch.distributed.get_rank(
--> 471                 group=self.optimizer.dp_process_group)
    472
    473             # Only the first parameter parallel process needs to store the

AttributeError: 'NoneType' object has no attribute 'dp_process_group'
My code
import torch
import deepspeed
import os
import json
import numpy as np

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9867'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"
#%%
class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = torch.nn.Flatten()
        self.out = torch.nn.Sequential(
            torch.nn.Linear(28*28, 128),
            torch.nn.SELU(),
            torch.nn.Linear(128, 10),
            torch.nn.SELU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.out(x)
        return logits

test_model = NeuralNetwork()
#%%
torch.distributed.init_process_group(backend="gloo")
#%%
deepspeed.init_distributed("gloo")
device = torch.device("cuda", 0)
#%%
with open("config2.json") as file:
    config = json.load(file)

parameters = filter(lambda p: p.requires_grad, test_model.parameters())
test_model.requires_grad_(False)

model_engine, optimizer, _, __ = deepspeed.initialize(model=test_model,
                                                      config_params=config,
                                                      #model_parameters=test_model.parameters(),
                                                      dist_init_required=True
                                                      )
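The traceback above fails on self.optimizer.dp_process_group because the engine appears to have no optimizer to work with: with ZeRO enabled, deepspeed.initialize needs either model_parameters or a client optimizer, and here the model_parameters argument is commented out. A minimal sketch of one workaround, reusing test_model and config from above; torch.optim.Adam and the learning rate are illustrative choices, not part of the original report:

# Sketch only: give the ZeRO engine an optimizer to wrap.
# Adam and lr=1e-3 are placeholder choices; the parameters should stay
# trainable (i.e. skip requires_grad_(False)) so there is something to shard.
client_optimizer = torch.optim.Adam(test_model.parameters(), lr=1e-3)

model_engine, optimizer, _, __ = deepspeed.initialize(model=test_model,
                                                      optimizer=client_optimizer,
                                                      config_params=config,
                                                      dist_init_required=True
                                                      )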
My config
{
  "train_batch_size": 1,
  "zero_optimization": {
    "stage": 2
  },
  "fp16": {
    "enabled": true
  }
}
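Alternatively (not from the issue), the config itself can define the optimizer, in which case model_parameters would need to be passed to deepspeed.initialize instead of being left commented out. The Adam type and learning rate below are illustrative:

{
  "train_batch_size": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001
    }
  },
  "zero_optimization": {
    "stage": 2
  },
  "fp16": {
    "enabled": true
  }
}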
Top GitHub Comments
@tjruwase Sure, I will open a PR soon.
Nope, I solved it: https://github.com/microsoft/DeepSpeed/issues/1499