Potential bugs when mmdetection runs on PyTorch < 1.8
Thanks for your error report and we appreciate it a lot.
Describe the bug
There is a small bug in PyTorch (version < 1.8). In short, when we use a DataLoader with num_workers != 0, the forked child processes share the same numpy random seed.
A toy example, run with PyTorch 1.6:
import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 8

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=2)

for batch in dataloader:
    print(batch)
Output:
tensor([[437, 650, 998], # process 0
[108, 34, 197]])
tensor([[437, 650, 998], # process 1
[108, 34, 197]])
tensor([[153, 629, 103], # process 0
[695, 102, 728]])
tensor([[153, 629, 103], # process 1
[695, 102, 728]])
Currently, mmdetection sets a different numpy seed for each worker according to its worker_id. However, the worker processes are killed at the end of each epoch and all worker state is lost, so at the next epoch mmdetection sets the same seeds as in the previous epoch.
Example:
import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 8

def worker_init_fn(worker_id):
    np.random.seed(worker_id)

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn)

print("the first epoch is ok")
for batch in dataloader:
    print(batch)

print("the second epoch is the same as the first epoch")
for batch in dataloader:
    print(batch)
Output:
the first epoch is ok
tensor([[684, 559, 629],
[192, 835, 763]])
tensor([[ 37, 235, 908],
[ 72, 767, 905]])
tensor([[707, 359, 9],
[723, 277, 754]])
tensor([[715, 645, 847],
[960, 144, 129]])
the second epoch is the same as the first epoch
tensor([[684, 559, 629],
[192, 835, 763]])
tensor([[ 37, 235, 908],
[ 72, 767, 905]])
tensor([[707, 359, 9],
[723, 277, 754]])
tensor([[715, 645, 847],
[960, 144, 129]])
Maybe we can adopt the solution used by detectron2 or pytorch-image-models: they derive the numpy seed from a value that changes dynamically with the epoch, e.g. torch.initial_seed() or torch.utils.data.get_worker_info().seed.
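For illustration, a minimal sketch of that approach (not the exact code of detectron2 or pytorch-image-models): seed numpy inside worker_init_fn from the per-worker torch seed, which is derived from a base seed that PyTorch draws anew each time workers are created, so it differs across workers and across epochs.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 8

def worker_init_fn(worker_id):
    # get_worker_info().seed is the per-worker torch seed; it changes every
    # time the workers are re-created (i.e. every epoch when workers are not
    # persistent), so each epoch now gets different numpy random numbers.
    worker_seed = torch.utils.data.get_worker_info().seed % (2 ** 32)
    np.random.seed(worker_seed)

dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn)

for epoch in range(2):  # the two epochs now print different batches
    for batch in dataloader:
        print(batch)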
Relevant issue: https://github.com/pytorch/pytorch/issues/5059
Environment
PyTorch 1.6.0
@serend1p1ty Thanks a lot for your feedback. In fact, this problem exists in all PyTorch versions, and an official solution has been provided: the persistent_workers parameter was introduced in PyTorch 1.7, and the problem can be avoided as long as it is set to True. If it is left as False, the problem still appears.

I reproduced the same problem following this issue and have made a PR to fix it. Thank you again for this bug report.
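For reference, a minimal sketch of the persistent_workers option mentioned above (reusing the dataset and worker_init_fn from the earlier examples in this issue; requires PyTorch >= 1.7):

from torch.utils.data import DataLoader

# With persistent_workers=True the worker processes are kept alive between
# epochs, so their numpy random state keeps advancing instead of being
# re-initialized with the same seeds at every epoch.
dataloader = DataLoader(dataset, batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn,
                        persistent_workers=True)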