Loss not dropping on custom dataset :(
Hi, thanks for the wonderful work, @mathildecaron31!
The reported video is inspiring 😄
__
I am experimenting with a custom dataset.
The thing is, training a vision transformer (deit_small) on it in a supervised manner works fine and the loss drops as expected.
I even managed to apply visualize_attention.py to see attention heatmaps for the separately trained ViT.
But when I switch to the self-supervised DINO setup, there is almost no change in the loss during training.
Do you have any idea why this could happen, or any possible solutions?
__
Thanks!
I am attaching a screenshot from training and the arguments I used for the training script.
arch = 'deit_small'
patch_size = 16
out_dim = 10000                  # default: 65536
norm_last_layer = False
momentum_teacher = 0.996         # adjust according to batch size
bsize = 256
use_bn_in_head = False
warmup_teacher_temp = 0.0005     # lower this if the loss does not decrease, default: 0.04
teacher_temp = 0.3               # increase if needed, default: 0.04
warmup_teacher_temp_epochs = 0   # default: 30 warmup epochs
use_fp16 = False                 # disable if the loss is unstable, default: True
weight_decay = 0.04              # a smaller value works well
weight_decay_end = 0.4           # final value of the weight decay
clip_grad = 3.0                  # max parameter gradient norm, 0 to disable, default: 3.0
batch_size_per_gpu = 256         # reduce if it does not fit in memory, default: 64
epochs = 100
freeze_last_layer = 5            # default: 1, try increasing this if the loss does not decrease
lr = 0.005                       # scaled linearly with batch size (reference 256), default: 0.0005
warmup_epochs = 0                # linear warmup, default: 10
min_lr = 1e-6                    # target lr at the end of optimization
optimizer = 'sgd'                # default: adamw
global_crops_scale = (0.4, 1.)
local_crops_number = 8           # number of small local views
local_crops_scale = (0.05, 0.4)  # default: (0.05, 0.4)
data_path = train_dataset_dir
output_dir = "./dirlog"
saveckp_freq = 20
seed = 0                         # random seed
num_workers = 40                 # default: 10
dist_url = "env://"
local_rank = 0
device_ids = [0, 1, 2, 3, 4, 5]  # use 6 GPUs
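As a quick sanity check, the comment on lr above implies linear scaling with the total batch size relative to a reference of 256; a minimal sketch of what that works out to under this config (the scaling formula base_lr * total_batch / 256 is assumed from that comment, not stated elsewhere in the issue):

# Sanity check of the effective batch size and learning rate implied above,
# assuming lr is scaled linearly with total batch size relative to 256
# (base_lr * total_batch / 256), as the comment on lr suggests.
batch_size_per_gpu = 256
n_gpus = 6                                                # len(device_ids) above
base_lr = 0.005
effective_batch_size = batch_size_per_gpu * n_gpus        # 1536
scaled_lr = base_lr * effective_batch_size / 256          # 0.03
print(effective_batch_size, scaled_lr)                    # 1536 0.03

With the default base lr of 0.0005 the scaled value would be 0.003, so this configuration effectively runs at roughly ten times the default learning rate, on top of switching to sgd.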
Issue Analytics
- Created: 2 years ago
- Comments: 11 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @tuttelikz, thanks for your kind words. Can you try the following to improve stability:
- --norm_last_layer true (this will l2-normalize the last-layer weights)
- --use_fp16 false (but you're already doing that 😃)
- --optimizer adamw (is there a motivation for using sgd instead of adamw?) and adapt the learning rate accordingly. If you do choose sgd, it is possible that you also need to re-adapt the weight decay (maybe use a much lower value). I'd recommend starting from the default optim params with adamw.

If I understand correctly, the effective batch size is 1536 (256 * 6). Can you try reducing that a bit? I've observed that large-batch training can be unstable.
Hope that helps.
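For reference, a minimal sketch of how these suggestions could map back onto the argument list from the issue; the reduced batch size and the restored default learning rate are illustrative assumptions, not values prescribed in the comment above:

# Config changes following the suggestions above (values marked as assumptions
# are illustrative, not prescribed by the maintainer).
norm_last_layer = True        # l2-normalize the last-layer weights
use_fp16 = False              # already disabled in the original config
optimizer = 'adamw'           # back to the default optimizer
lr = 0.0005                   # assumption: default base lr, still scaled linearly with batch size
weight_decay = 0.04           # default optim params for adamw
weight_decay_end = 0.4
batch_size_per_gpu = 128      # assumption: 128 * 6 GPUs = 768 effective, down from 1536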
Best way to thank me is to star the repo haha 😄