negligible performance gains and non-convergence on DCGAN using apex (what to change?)
I bought an RTX 2070 with the goal of training my DCGAN in fp16 for bigger and faster models. After carefully adjusting my models and trying vanilla model.half() (without apex), AMP, and FP16_Optimizer, I'm not convinced by the results. Maybe I did something wrong?
The architecture:
# Loss function:
criterion = nn.BCELoss()
# Generator
"512px output": (
nn.Sequential(
# Input Z (100x1x1)
nn.ConvTranspose2d(nz, ngf * 64, 4, 1, 0, bias=False),
nn.BatchNorm2d(ngf * 64),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 4x4x(ngf*64)
nn.ConvTranspose2d(ngf * 64, ngf * 32, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 32),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 8x8x(ngf*32)
nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 16),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 16x16x(ngf*16)
nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 8),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 32x32x(ngf*8)
nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 4),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 64x64x(ngf*4)
nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf * 2),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 128x128x(ngf * 2)
nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
nn.BatchNorm2d(ngf),
nn.LeakyReLU(negative_slope=0.2, inplace=True),
# 256x256x(ngf)
nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
nn.Tanh()
# 512x512x3 Output
),
# Discriminator
nn.Sequential(
# Input 512x512x3
nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf),
nn.LeakyReLU(0.2, inplace=True),
# 256x256xndf
nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 2),
nn.LeakyReLU(0.2, inplace=True),
# 128x128x(ndf * 2)
nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 4),
nn.LeakyReLU(0.2, inplace=True),
# 64x64x(ndf * 4)
nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 8),
nn.LeakyReLU(0.2, inplace=True),
# 32x32x(ndf * 8)
nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 16),
nn.LeakyReLU(0.2, inplace=True),
# 16x16x(ndf * 16)
nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 32),
nn.LeakyReLU(0.2, inplace=True),
# 8x8x(ndf * 32)
nn.Conv2d(ndf * 32, ndf * 64, 4, 2, 1, bias=False),
nn.BatchNorm2d(ndf * 64),
nn.LeakyReLU(0.2, inplace=True),
# 4x4x(ndf * 64)
nn.Conv2d(ndf * 64, 1, 4, 1, 0, bias=False),
nn.Sigmoid()
# 1x1x1
)),
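As a quick sanity check on the shapes above: the first ConvTranspose2d maps the 1x1 latent vector to 4x4, and each of the seven stride-2 layers that follow doubles the spatial size, giving 4 * 2^7 = 512. A minimal sketch of the doubling rule (the channel counts 16 and 8 here are arbitrary, just for illustration):
import torch
import torch.nn as nn

x = torch.randn(1, 16, 4, 4)
# kernel 4, stride 2, padding 1 exactly doubles the spatial size:
# out = (in - 1) * 2 - 2 * 1 + 4 = 2 * in
up = nn.ConvTranspose2d(16, 8, 4, 2, 1, bias=False)
print(up(x).shape)  # torch.Size([1, 8, 8, 8])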
I changed the following parts of my code to accommodate FP16:
from apex.fp16_utils import network_to_half, FP16_Optimizer

network_to_half(netG)
network_to_half(netD)
optimizerD = FP16_Optimizer(optimizerD, dynamic_loss_scale=True, verbose=False)
optimizerG = FP16_Optimizer(optimizerG, dynamic_loss_scale=True, verbose=False)
In the training loop:
for i, data in enumerate(dataloader, 0):
    # make the input batch fp16
    input_batch = data[0].cuda().half()
    ....
    # accumulate gradients for the real batch in the discriminator
    optimizerD.backward(errD_real, update_master_grads=False)
    ....
    # accumulate gradients for the fake batch in the discriminator
    optimizerD.backward(errD_fake, update_master_grads=False)
    ....
    # copy fp16 gradients to the fp32 master weights and step the discriminator
    optimizerD.update_master_grads()
    optimizerD.step()
    ....
    # backward and step for the generator
    optimizerG.backward(errG)
    optimizerG.step()
    ....
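For reference, here is a minimal, self-contained sketch of the same FP16_Optimizer pattern on a toy model (assumes a GPU and an apex version that still ships apex.fp16_utils; the toy model, data, and learning rate are made up for illustration, not taken from the issue):
import torch
import torch.nn as nn
from apex.fp16_utils import network_to_half, FP16_Optimizer

toy = network_to_half(nn.Linear(16, 1).cuda())
opt = FP16_Optimizer(torch.optim.Adam(toy.parameters(), lr=2e-4),
                     dynamic_loss_scale=True, verbose=False)

x = torch.randn(8, 16).cuda().half()
target = torch.randn(8, 1).cuda()

opt.zero_grad()
loss = nn.MSELoss()(toy(x).float(), target)  # compute the loss in fp32
opt.backward(loss)                           # loss-scaled fp16 backward pass
opt.step()                                   # fp32 master-weight update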
Results:
- using stock model.half() without apex: the model is 2x slower and not converging after 1 epoch
- using AMP: the model is 1.5x slower and not converging after 1 epoch
- using FP16_Optimizer: the model is 1.2x slower and converging only if dynamic_loss_scale is used
Basically, the model only somewhat behaves if I use dynamic_loss_scale in FP16_Optimizer, yet it still produces garbage outputs even though the architecture is unchanged from the FP32 model that worked.
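For context on why loss scaling makes the difference between converging and not: gradient values below fp16's subnormal range (smallest positive value around 6e-8) are rounded to zero, so unscaled fp16 gradients can silently vanish. A tiny self-contained illustration (the numbers are made up):
import torch

tiny = torch.tensor(1e-8)              # a gradient value below fp16's subnormal range
print(tiny.half())                     # tensor(0., dtype=torch.float16): underflows to zero

scale = 2.0 ** 16                      # loss scaling multiplies the loss (and hence
scaled = (tiny * scale).half()         # every gradient) before the fp16 backward pass
print(scaled.float() / scale)          # ~1e-08 survives once unscaled in fp32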
AMP is supposed to apply dynamic loss scaling automatically, but it always collapses after 1 iteration and is very slow.
I expected the model to be faster and to at least converge like the FP32 model did. The only benefit is that the model occupies around 51% less memory on the GPU, so bigger models can be trained.
Questions:
What do I need to change in my architecture and training setup to make FP16 work with this DCGAN?
System information
- PyTorch version: 0.4.1
- Is debug build: No
- CUDA used to build PyTorch: 9.2
- OS: Microsoft Windows 10 Home
- GCC version: Could not collect
- CMake version: Could not collect
- Python version: 3.7
- Is CUDA available: Yes
- CUDA runtime version: 9.2.148
- GPU models and configuration: GPU 0: GeForce RTX 2070
- Nvidia driver version: 416.81
- cuDNN version: Could not collect
- Versions of relevant libraries:
  - [pip] Could not collect
  - [conda] cuda92 1.0 0 pytorch
  - [conda] pytorch 0.4.1 py37_cuda92_cudnn7he774522_1 [cuda92] pytorch
  - [conda] torchvision 0.2.1 <pip>
@bearpelican, if I read your notebook right, for the unet you get a slowdown in fp16 with the upsampling architecture and a ~50% speed-up with the conv-transpose architecture? Upsampling in fp16 is slow (slower than in fp32) because the backward pass of the upsampling layer is implemented with atomicAdd, and since there is no native support for half-precision atomicAdd there, the performance is pretty bad: https://github.com/pytorch/pytorch/blob/master/aten/src/THCUNN/SpatialUpSamplingNearest.cu#L94. You'd be better off converting the upsampling layers to fp32. For the conv-transpose architecture the speed-up is smaller than for a resnet because there are many strided transposed-convolution layers, and those benefit less. If you are not using cuDNN 7.4.1 or 7.4.2, try those versions; they improve strided convolution performance.
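One way to follow the "convert upsampling layers to fp32" advice is a small wrapper that runs just the upsampling in fp32 inside an otherwise half-precision network. This is only a sketch; the FP32Upsample name is made up, not an apex or PyTorch API:
import torch
import torch.nn as nn

class FP32Upsample(nn.Module):
    """Run nn.Upsample in fp32 inside an otherwise half-precision network."""

    def __init__(self, scale_factor=2, mode="nearest"):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale_factor, mode=mode)

    def forward(self, x):
        # cast up, upsample in fp32, then cast back to the caller's dtype
        return self.up(x.float()).to(x.dtype)

# usage: drop it in wherever an nn.Upsample would go
block = nn.Sequential(nn.Conv2d(8, 8, 3, 1, 1), FP32Upsample(scale_factor=2))
out = block(torch.randn(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 32, 32])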
Unfortunately no, it still maps to the same CAS emulation underneath, IIRC. What would help is rewriting the upsampling backward without atomicAdds (I think average pooling forward can be tortured into computing what upsampling-nearest backward computes, and it does not use atomics), or rewriting it so that atomicAdd is always called on half2. It's hard to guarantee the necessary alignment for arbitrary image sizes, though.
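The "average pooling computes upsampling-nearest backward" observation is easy to check numerically: for scale factor s, the gradient with respect to the input of a nearest upsample is the sum of the upstream gradient over each s x s block, i.e. avg_pool2d times s^2. A quick sketch (the tensor sizes are arbitrary):
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
y = F.interpolate(x, scale_factor=2, mode="nearest")
g = torch.randn_like(y)
y.backward(g)

# backward of nearest upsampling == sum-pool of the upstream gradient
manual = F.avg_pool2d(g, kernel_size=2) * 4
print(torch.allclose(x.grad, manual, atol=1e-6))  # True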