[Bug] Synchronization issue on GPU
I’m using the v0.0.4 version from this branch: https://github.com/libffcv/ffcv/tree/v0.0.4
There’s a (possibly major) bug where two models will not receive the same inputs from the FFCV dataloader unless torch.cuda.synchronize() is explicitly called. Below is a simple code snippet to reproduce this issue:
import torch
from torchvision.models import resnet18
from tqdm import tqdm
from copy import deepcopy

dataloader = create_ffcv_dataloader()  # Your own custom dataloader factory
model1 = resnet18(pretrained=False).cuda()
model2 = deepcopy(model1)

with torch.no_grad():
    for it, (imgs, *_) in enumerate(tqdm(dataloader)):
        model1(imgs)
        model2(imgs)
        # Uncommenting the following line makes the assertion at the bottom pass;
        # leaving it commented triggers an assertion error.
        # torch.cuda.synchronize()
        if it == 20:
            break

assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)
BatchNorm tracks running stats, which can be used to check whether two identical models received the same inputs on the forward pass. Without torch.cuda.synchronize(), the above code triggers an assertion error, since the two models received different inputs at some point. With torch.cuda.synchronize(), no assertion error is triggered.
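To illustrate why the running stats work as a check here, below is a minimal plain-Python sketch of a BatchNorm-style running-mean update. It is not FFCV or PyTorch code: the helper names (update_running_mean, track) and the toy batches are made up for illustration, and only the momentum value of 0.1 is taken from PyTorch's default. The point is just that two trackers fed identical batches end up with identical statistics, while a single differing batch leaves a visible difference.

```python
# Plain-Python sketch of BatchNorm-style running-mean tracking.
# MOMENTUM = 0.1 mirrors PyTorch's default; all names and data are hypothetical.
MOMENTUM = 0.1


def update_running_mean(running_mean, batch):
    """Apply one exponential-moving-average update from a batch of scalars."""
    batch_mean = sum(batch) / len(batch)
    return (1 - MOMENTUM) * running_mean + MOMENTUM * batch_mean


def track(batches, start=0.0):
    """Feed every batch through the update and return the final running mean."""
    running_mean = start
    for batch in batches:
        running_mean = update_running_mean(running_mean, batch)
    return running_mean


batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Two trackers that see identical batches end with identical statistics.
same_a = track(batches)
same_b = track(batches)

# A tracker that sees one altered batch ends with a different running mean.
altered = [batches[0], [40.0, 50.0, 60.0], batches[2]]
different = track(altered)

print("identical inputs agree:", same_a == same_b)
print("altered input detected:", abs(same_a - different) > 1e-6)
```

In the reproduction above, model1.bn1.running_mean and model2.bn1.running_mean play the role of same_a and different: if they disagree, the two models must have seen different data at some step.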
Also, I have noticed that this behavior does not necessarily happen with larger models, where the forward pass takes a longer time.
Issue Analytics
- Created 2 years ago
- Comments: 14 (2 by maintainers)
Top GitHub Comments
Here is the strictly minimal version of the test that makes it fail 100% of the time on my 3090:
@andrewilyas @GuillaumeLeclerc Just tested the code - this seems to be fixed in the fix-198 branch. Doesn’t seem to be merged in the v1.0.0 branch yet.
Thanks for the fix! I’m surprised the fix was just a single line of code.