EMA Bug
Hi Phil,
This morning I tried to run the decoder training part. I decided to use DecoderTrainer,
but found one issue with the EMA update.
After sampling with decoder_trainer, the next training step throws a RuntimeError:
Traceback (most recent call last):
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 321, in <module>
    main()
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 318, in main
    train(decoder_trainer, train_dl, val_dl, train_config, device)
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 195, in train
    trainer.update(unet_number)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 288, in update
    self.ema_unets[index].update()
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 119, in update
    self.update_moving_average(self.ema_model, self.online_model)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 129, in update_moving_average
    ema_param.data = calculate_ema(self.beta, old_weight, up_weight)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 125, in calculate_ema
    return old * beta + new * (1 - beta)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and CPU!
I checked up_weight.device (the online model) and old_weight.device (the EMA model), and found that the online model is on cuda:0 while the EMA model is on cpu. It's really weird; I debugged for a long time and I think it is caused by the DecoderTrainer.sample() call.
When sample() swaps between the EMA and online models, something goes wrong with the device placement:
https://github.com/lucidrains/DALLE2-pytorch/blob/6021945fc8e1ec27bbebfa1e181e892a7c4d05fb/dalle2_pytorch/train.py#L298-L308
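For what it's worth, here is a self-contained illustration of the failure mode I suspect. The swap/restore logic and all names below are my own guess at the pattern, not the actual DecoderTrainer code, and it assumes a CUDA device is available:

```python
import torch
from torch import nn

online = nn.Linear(4, 4).cuda()   # online unet: stays on cuda:0 for training
ema = nn.Linear(4, 4).cuda()      # EMA copy: starts out on cuda:0 as well

def sample_with_ema(ema_model):
    # ... sampling would run here using the EMA weights ...
    ema_model.cpu()               # hypothetical restore step that parks the EMA copy on the CPU

sample_with_ema(ema)

# next update: the moving average now mixes cuda:0 and cpu tensors
beta = 0.99
for ema_param, online_param in zip(ema.parameters(), online.parameters()):
    # raises: RuntimeError: Expected all tensors to be on the same device ...
    ema_param.data = ema_param.data * beta + online_param.data * (1 - beta)
```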
The way I fixed it was simply to add
self.ema_model = self.ema_model.to(next(self.online_model.parameters()).device)
before the call to self.update_moving_average(self.ema_model, self.online_model) (pretty naive, haha).
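For reference, a minimal sketch of what that workaround looks like folded into an EMA wrapper. The class and scaffolding below are my own, not the actual dalle2_pytorch code; only the device re-sync line mirrors the fix above:

```python
import copy
import torch
from torch import nn

class EMA(nn.Module):
    """Minimal EMA wrapper that re-syncs devices before every update (sketch)."""

    def __init__(self, online_model: nn.Module, beta: float = 0.99):
        super().__init__()
        self.online_model = online_model
        self.ema_model = copy.deepcopy(online_model)
        self.beta = beta

    @torch.no_grad()
    def update(self):
        # the workaround: put the EMA copy on the same device as the online
        # model before mixing their weights
        device = next(self.online_model.parameters()).device
        self.ema_model.to(device)

        for ema_param, online_param in zip(self.ema_model.parameters(),
                                           self.online_model.parameters()):
            ema_param.data.mul_(self.beta).add_(online_param.data, alpha=1 - self.beta)
```

An alternative would be to make sure the EMA copy always travels with the trainer when it is moved between devices, but the single .to(...) call right before the update was enough to get training going again here.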
Hope to hear your solution.
Enjoy!
Top GitHub Comments
@lucidrains lol. But when moving to a cluster to train, things can get out of control sometimes (I hate bugs)
@CiaoHe I've come full circle and just use a simple test.py in the root directory + print, lol