
Loss not dropping on custom dataset :(

See original GitHub issue

Hi, thanks for the wonderful work, @mathildecaron31! The reported video is inspiring 😄 I am experimenting with a custom dataset. Training a vision transformer (deit_small) in a supervised manner works fine and the loss drops as expected; I even managed to apply visualize_attention.py to see heatmaps for the separately trained ViT. But when I switch to the self-supervised DINO setup, there is almost no change in the loss during training. Do you have any idea why this could happen, or possible solutions? Thanks!

I am attaching a screenshot from training and the arguments I used for the training script.

[screenshot: training loss curve, showing the loss not decreasing]

arch ='deit_small'
patch_size = 16
out_dim = 10000 # default 65536
norm_last_layer = False
momentum_teacher = 0.996 # check this according to batch_size
bsize = 256 #####
use_bn_in_head = False
warmup_teacher_temp = 0.0005 # lower this if loss does not decrease, default 0.04
teacher_temp = 0.3 # increase if needed, default: 0.04
warmup_teacher_temp_epochs = 0 # default 30 to warmup
use_fp16 = False # disable if loss is unstable, default: True
weight_decay = 0.04 # a smaller value works well
weight_decay_end = 0.4 # final value of weight decay
clip_grad = 3.0 # max parameter gradient norm, 0 for disabling # default, 3.0
batch_size_per_gpu = 256 # reduce this if not fit, default 64
epochs = 100
freeze_last_layer = 5 # default 1; try increasing this value if the loss does not decrease
lr = 0.005 # scaled linearly with batch size; default 0.0005 for a reference batch size of 256
warmup_epochs = 0 # linear warmup, default 10
min_lr = 1e-6 # target lr at the end of optimization
optimizer = 'sgd' # def: adamw
global_crops_scale = (0.4, 1.)
local_crops_number = 8 # local small views
local_crops_scale = (0.05, 0.4) # def (0.05, 0.4)
data_path = train_dataset_dir #
output_dir = "./dirlog"
saveckp_freq = 20
seed = 0 # random seed
num_workers = 40 #def:10
dist_url = "env://"
local_rank = 0
device_ids = [0, 1, 2, 3, 4, 5] # use 6 gpus
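
For reference, the linear learning-rate scaling mentioned in the lr comment above works out as follows for this 6-GPU run (a quick sketch in plain Python; the variable names besides those already in the config are mine, not from the training script):

base_lr = 0.0005                                             # DINO default, defined for a reference batch size of 256
effective_batch_size = batch_size_per_gpu * len(device_ids)  # 256 * 6 = 1536
scaled_lr = base_lr * effective_batch_size / 256             # 0.003

Under that rule, the default base lr of 0.0005 would scale to 0.003 for an effective batch of 1536, whereas the value set above is 0.005.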

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:11 (6 by maintainers)

Top GitHub Comments

5 reactions
mathildecaron31 commented, Jun 2, 2021

Hi @tuttelikz, thanks for your kind words. Can you try the following to improve stability:

  • --norm_last_layer true (This will l2 normalize the last layer weights)
  • --use_fp16 false (But you’re already doing that 😃 )
  • --optimizer adamw (Is there a motivation for using sgd instead of adamw?) and adapt the learning rate accordingly. If you choose to use sgd, it is possible that you need to re-adapt the weight decay (maybe use a much lower value). I'd recommend starting from the default optim params with adamw.

If I understand correctly, the effective batch size is 1536 (256 * 6). Can you try reducing that a bit? I've observed that large-batch training can be unstable. (A sketch of these adjustments follows below.)

Hope that helps.
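
For reference, here is a sketch of the configuration above with these suggestions applied. Only the three flags named in this comment come from the maintainer; the other values are the defaults already quoted in the original config, and the reduced per-GPU batch size is purely illustrative:

norm_last_layer = True   # l2-normalizes the last layer weights
use_fp16 = False         # already disabled in the original config
optimizer = 'adamw'      # back to the default optimizer
lr = 0.0005              # default base lr for adamw, still scaled linearly with the effective batch size
weight_decay = 0.04      # default optim params
weight_decay_end = 0.4
batch_size_per_gpu = 128 # illustrative: brings the effective batch from 1536 down to 768 on 6 GPUs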

4 reactions
mathildecaron31 commented, Aug 25, 2021

Best way to thank me is to star the repo haha 😄

Top Results From Across the Web

  • Having issues with neural network training. Loss not decreasing
    Try to overfit your network on much smaller data and for many epochs without augmenting first, say one-two batches for many epochs. If...
  • Validation loss is not decreasing - Data Science Stack Exchange
    The model is overfitting right from epoch 10, the validation loss is increasing while the training loss is decreasing. Dealing with such a...
  • Custom loss function not decreasing - PyTorch Forums
    Custom loss function not decreasing · First try different learning rates, while simultaneously turning off all regularization. · Try using a standard loss...
  • Solving the TensorFlow Keras Model Loss Problem
    How to Implement a Non-trivial TensorFlow Keras Loss Function ... Not to mention the fact that the more custom code that you include...
  • Why does the loss or accuracy fluctuate during the training?
    Very small batch_size · Large network, small dataset · Tensorflow Pooling layers in Convolutional Neural Network · Training Spacy Models on custom data...
