
[Bug] Synchronization issue on GPU


I’m using the v0.0.4 version from this branch: https://github.com/libffcv/ffcv/tree/v0.0.4

There’s a (possibly major) bug where two models will not receive the same inputs from the FFCV dataloader unless torch.cuda.synchronize() is explicitly called. Below is a simple code snippet that reproduces the issue:

import torch
from torchvision.models import resnet18
from tqdm import tqdm
from copy import deepcopy

dataloader = create_ffcv_dataloader()  # Your own custom dataloader factory
model1 = resnet18(pretrained=False).cuda()
model2 = deepcopy(model1)
with torch.no_grad():
    for it, (imgs, *_) in enumerate(tqdm(dataloader)):
        model1(imgs)
        model2(imgs)
        # Uncommenting the following line makes the assertion below pass;
        # leaving it commented triggers an AssertionError.
        # torch.cuda.synchronize()
        if it == 20:
            break

    assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)

BatchNorm tracks running statistics, which can be used to check whether two identical models received the same inputs on the forward pass (the models stay in training mode, so the running statistics are updated even under torch.no_grad()). Without torch.cuda.synchronize(), the code above trips the assertion, since the two models received different inputs at some point; with torch.cuda.synchronize(), it passes. I have also noticed that the problem does not necessarily appear with larger models, where the forward pass takes longer.
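
A plausible explanation, consistent with the clone/synchronize workarounds discussed below, is that the loader reuses device-side buffers, so the asynchronous transfer of the next batch can overwrite a buffer while the models are still reading it. The following diagnostic is only a sketch under that assumption; it reuses the hypothetical create_ffcv_dataloader factory from the snippet above and checks directly whether a previously yielded batch is overwritten once the next one arrives.

import torch

dataloader = create_ffcv_dataloader()  # hypothetical factory, as in the snippet above

prev_ref = prev_copy = None
for it, (imgs, *_) in enumerate(dataloader):
    if prev_ref is not None:
        # If the loader reuses its device buffer, prev_ref now aliases the memory
        # holding the *new* batch and no longer matches the snapshot taken earlier.
        torch.cuda.synchronize()
        if not torch.equal(prev_ref, prev_copy):
            print(f"iteration {it}: the batch from iteration {it - 1} was overwritten in place")
    prev_ref = imgs           # keep a reference to the (possibly reused) buffer
    prev_copy = imgs.clone()  # defensive copy taken as soon as the batch arrives
    if it == 20:
        break

If the two tensors diverge, the per-iteration workarounds mentioned in this thread (cloning imgs before use, or calling torch.cuda.synchronize()) should make them match again.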

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 14 (2 by maintainers)

Top GitHub Comments

1 reaction
GuillaumeLeclerc commented, Mar 25, 2022

Here is the strictly minimal version of the test that makes it fail 100% of the time on my 3090:

import torch
from ffcv import Loader
from ffcv.fields.rgb_image import SimpleRGBImageDecoder
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Convert
from torchvision.models import resnet18
from tqdm import tqdm

def main():
    beton_path = 'cifar_train.beton'    # Your FFCV .beton file here
    image_pipeline = [
        SimpleRGBImageDecoder(),
        ToTensor(),
        ToDevice(torch.device(0), non_blocking=False),
        ToTorchImage(),
        Convert(torch.float32),
    ]
    loader = Loader(beton_path, batch_size=512, num_workers=1,
                    pipelines={'image': image_pipeline, 'label': None})
    model1 = resnet18(pretrained=False).cuda()
    model2 = resnet18(pretrained=False).cuda()
    model2.load_state_dict(model1.state_dict())

    while True:
        with torch.no_grad():
            for it, (imgs,) in enumerate(tqdm(loader)):
                # imgs = imgs.clone()
                model1(imgs)
                model2(imgs)
                # torch.cuda.synchronize()
                # breakpoint()
                if it == 2: # 1 works sometimes but not 100% of the time on my GPU
                    break

            assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)


if __name__ == "__main__":
    main()
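
Until the race is fixed in the installed version, the commented-out imgs = imgs.clone() line above points at a practical workaround: take a defensive copy of each batch before the models touch it. Below is a minimal sketch of wrapping that idea around any loader; ClonedBatches is a hypothetical helper, not part of FFCV.

import torch

class ClonedBatches:
    """Hypothetical wrapper: yields defensive copies of every tensor in a batch,
    so downstream models never read from a device buffer the loader may reuse."""

    def __init__(self, loader):
        self.loader = loader

    def __len__(self):
        return len(self.loader)

    def __iter__(self):
        for batch in self.loader:
            yield tuple(t.clone() if torch.is_tensor(t) else t for t in batch)

# Usage with the Loader from the snippet above:
# for (imgs,) in ClonedBatches(loader):
#     model1(imgs)
#     model2(imgs)

Cloning costs one extra device-to-device copy per batch, so it is a stop-gap rather than a substitute for the upstream fix.
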
0 reactions
numpee commented, Sep 30, 2022

@andrewilyas @GuillaumeLeclerc Just tested the code - this seems to be fixed in the fix-198 branch. Doesn’t seem to be merged in the v1.0.0 branch yet.

Thanks for the fix! I’m surprised the fix was just a single line of code.

