negligible performance gains and non-convergence on DCGAN using apex (what to change?)
I bought an RTX 2070 with the goal of training my DCGAN in fp16 for bigger and faster models. After carefully adjusting my models and trying vanilla model.half() (without apex), AMP, and FP16_Optimizer, I'm not convinced by the results. Maybe I did something wrong?
The architecture:
# Loss function:
criterion = nn.BCELoss()
# Generator
"512px output": (
nn.Sequential(
# Input Z (100x1x1)
nn.ConvTranspose2d(nz, ngf * 64, 4, 1, 0, bias=False),
nn.BatchNorm2d(ngf * 64),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 4x4x(ngf*64)
nn.ConvTranspose2d(ngf * 64, ngf * 32, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 32),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 8x8x(ngf*32)
nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 16),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 16x16x(ngf*16)
nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 8),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 32x32x(ngf*8)
nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 4),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 64x64x(ngf*4)
nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 2),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 128x128x(ngf * 2)
nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 256x256x(ngf)
nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
nn.Tanh()
# 512x512x3 Output
),
# Discriminator
nn.Sequential(
# Input 512x512x3
nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf),
nn.LeakyReLU(0.2, inplace=True),
# 256x256xndf
nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 2),
nn.LeakyReLU(0.2, inplace=True),
# 128x128x(ndf * 2)
nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 4),
nn.LeakyReLU(0.2, inplace=True),
# 64x64x(ndf * 4)
nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 8),
nn.LeakyReLU(0.2, inplace=True),
# 32x32x(ndf * 8)
nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 16),
nn.LeakyReLU(0.2, inplace=True),
# 16x16x(ndf * 16)
nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 32),
nn.LeakyReLU(0.2, inplace=True),
# 8x8x(ndf * 32)
nn.Conv2d(ndf * 32, ndf * 64, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 64),
nn.LeakyReLU(0.2, inplace=True),
# 4x4x(ndf * 64)
nn.Conv2d(ndf * 64, 1, 4, 1, 0, bias=False),
nn.Sigmoid()
# 1x1x1
)),
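As a quick sanity check on the shapes above: the first ConvTranspose2d maps the 1x1 latent vector to 4x4, and each of the seven stride-2 layers that follow doubles the spatial size, giving 4 * 2^7 = 512. A minimal sketch of the doubling rule (the channel counts 16 and 8 here are arbitrary, just for illustration):
import torch
import torch.nn as nn

x = torch.randn(1, 16, 4, 4)
# kernel 4, stride 2, padding 1 exactly doubles the spatial size:
# out = (in - 1) * 2 - 2 * 1 + 4 = 2 * in
up = nn.ConvTranspose2d(16, 8, 4, 2, 1, bias=False)
print(up(x).shape)  # torch.Size([1, 8, 8, 8])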
I changed the following parts of my code to accommodate FP16:
from apex.fp16_utils import network_to_half, FP16_Optimizer

network_to_half(netG)
network_to_half(netD)
optimizerD = FP16_Optimizer(optimizerD, dynamic_loss_scale=True, verbose=False)
optimizerG = FP16_Optimizer(optimizerG, dynamic_loss_scale=True, verbose=False)
In the training loop:
for i, data in enumerate(dataloader, 0):
    # make the input batch fp16
    input_batch = data[0].cuda().half()
    ....
    # accumulate gradients for the real batch in the discriminator
    optimizerD.backward(errD_real, update_master_grads=False)
    ....
    # accumulate gradients for the fake batch in the discriminator
    optimizerD.backward(errD_fake, update_master_grads=False)
    ....
    # copy fp16 gradients to the fp32 master weights and step the discriminator
    optimizerD.update_master_grads()
    optimizerD.step()
    ....
    # backward and step for the generator
    optimizerG.backward(errG)
    optimizerG.step()
    ....
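For reference, here is a minimal, self-contained sketch of the same FP16_Optimizer pattern on a toy model (assumes a GPU and an apex version that still ships apex.fp16_utils; the toy model, data, and learning rate are made up for illustration, not taken from the issue):
import torch
import torch.nn as nn
from apex.fp16_utils import network_to_half, FP16_Optimizer

toy = network_to_half(nn.Linear(16, 1).cuda())
opt = FP16_Optimizer(torch.optim.Adam(toy.parameters(), lr=2e-4),
                     dynamic_loss_scale=True, verbose=False)

x = torch.randn(8, 16).cuda().half()
target = torch.randn(8, 1).cuda()

opt.zero_grad()
loss = nn.MSELoss()(toy(x).float(), target)  # compute the loss in fp32
opt.backward(loss)                           # loss-scaled fp16 backward pass
opt.step()                                   # fp32 master-weight update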
Results:
- using stock model.half() without apex: the model is 2x slower and not converging after 1 epoch
- using AMP: the model is 1.5x slower and not converging after 1 epoch
- using FP16_Optimizer: the model is 1.2x slower and converging only if dynamic_loss_scale is used
Basically, the model only somewhat behaves if I use dynamic_loss_scale in FP16_Optimizer, yet it still produces garbage outputs even though the architecture is unchanged from the FP32 model that worked.
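For context on why loss scaling makes the difference between converging and not: gradient values below fp16's subnormal range (smallest positive value around 6e-8) are rounded to zero, so unscaled fp16 gradients can silently vanish. A tiny self-contained illustration (the numbers are made up):
import torch

tiny = torch.tensor(1e-8)              # a gradient value below fp16's subnormal range
print(tiny.half())                     # tensor(0., dtype=torch.float16): underflows to zero

scale = 2.0 ** 16                      # loss scaling multiplies the loss (and hence
scaled = (tiny * scale).half()         # every gradient) before the fp16 backward pass
print(scaled.float() / scale)          # ~1e-08 survives once unscaled in fp32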
AMP is supposed to apply dynamic loss scaling automatically, but it always collapses after 1 iteration and is very slow.
I expected the model to be faster and to at least converge like the FP32 model did. The only benefit is that the model occupies around 51% less memory on the GPU, so bigger models can be trained.
Questions:
What do I need to change in my architecture and training setup to make FP16 work with this DCGAN?
System information
- PyTorch version: 0.4.1
- Is debug build: No
- CUDA used to build PyTorch: 9.2
- OS: Microsoft Windows 10 Home
- GCC version: Could not collect
- CMake version: Could not collect
- Python version: 3.7
- Is CUDA available: Yes
- CUDA runtime version: 9.2.148
- GPU models and configuration: GPU 0: GeForce RTX 2070
- Nvidia driver version: 416.81
- cuDNN version: Could not collect
- Versions of relevant libraries:
  - [pip] Could not collect
  - [conda] cuda92 1.0 0 pytorch
  - [conda] pytorch 0.4.1 py37_cuda92_cudnn7he774522_1 [cuda92] pytorch
  - [conda] torchvision 0.2.1 <pip>
@bearpelican, if I read your notebook right, for the unet you get a slowdown in fp16 with the upsampling architecture and a ~50% speed-up with the conv-transpose architecture? Upsampling in fp16 is slow (slower than in fp32) because the backward pass of the upsampling layer is implemented with atomicAdd, and since there is no native support for half-precision atomicAdd there, the performance is pretty bad: https://github.com/pytorch/pytorch/blob/master/aten/src/THCUNN/SpatialUpSamplingNearest.cu#L94. You'd be better off converting the upsampling layers to fp32. For the conv-transpose architecture the speed-up is smaller than for a resnet because there are many strided transposed-convolution layers, and those benefit less. If you are not using cuDNN 7.4.1 or 7.4.2, try those versions; they improve strided convolution performance.
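One way to follow the "convert upsampling layers to fp32" advice is a small wrapper that runs just the upsampling in fp32 inside an otherwise half-precision network. This is only a sketch; the FP32Upsample name is made up, not an apex or PyTorch API:
import torch
import torch.nn as nn

class FP32Upsample(nn.Module):
    """Run nn.Upsample in fp32 inside an otherwise half-precision network."""

    def __init__(self, scale_factor=2, mode="nearest"):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale_factor, mode=mode)

    def forward(self, x):
        # cast up, upsample in fp32, then cast back to the caller's dtype
        return self.up(x.float()).to(x.dtype)

# usage: drop it in wherever an nn.Upsample would go
block = nn.Sequential(nn.Conv2d(8, 8, 3, 1, 1), FP32Upsample(scale_factor=2))
out = block(torch.randn(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 32, 32])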
Unfortunately no, it still maps to the same CAS emulation underneath, IIRC. What would help is rewriting the upsampling backward without atomicAdds (I think average pooling forward can be tortured into computing what upsampling-nearest backward computes, and it does not use atomics), or rewriting it so that atomicAdd is always called on half2. It's hard to guarantee the necessary alignment for arbitrary image sizes, though.
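The "average pooling computes upsampling-nearest backward" observation is easy to check numerically: for scale factor s, the gradient with respect to the input of a nearest upsample is the sum of the upstream gradient over each s x s block, i.e. avg_pool2d times s^2. A quick sketch (the tensor sizes are arbitrary):
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
y = F.interpolate(x, scale_factor=2, mode="nearest")
g = torch.randn_like(y)
y.backward(g)

# backward of nearest upsampling == sum-pool of the upstream gradient
manual = F.avg_pool2d(g, kernel_size=2) * 4
print(torch.allclose(x.grad, manual, atol=1e-6))  # True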