Does the batch size affect the training results?
I ran train.py with the command below on the KITTI raw data:
python3 train.py /path/to/the/formatted/data/ -b4 -m0 -s2.0 --epoch-size 1000 --sequence-length 5 --log-output --with-gt
The only differences from this command are that I used batch-size=80 and a different train (41664) / valid (2452) split.
The result I get is:
Results with scale factor determined by GT/prediction ratio (like the original paper) :
abs_rel, sq_rel, rms, log_rms, a1, a2, a3
0.2058, 1.6333, 6.7410, 0.2895, 0.6762, 0.8853, 0.9532
pose:
Results 10
ATE, RE
mean 0.0223, 0.0053
std 0.0188, 0.0036
Results 09
ATE, RE
mean 0.0284, 0.0055
std 0.0241, 0.0035

You can see that there is still quite a big margin compared with yours:

Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3
--- | --- | --- | --- | --- | --- | ---
0.181 | 1.341 | 6.236 | 0.262 | 0.733 | 0.901 | 0.964
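For readers comparing these numbers, here is a minimal sketch of how the seven depth metrics are commonly computed. It assumes the usual definitions from the depth estimation literature and assumes the "GT/prediction ratio" scale factor is the ratio of medians; the repository's own test script may mask, crop, or clamp depths differently. The function name `depth_metrics` is hypothetical.

```python
import math
from statistics import median


def depth_metrics(gt, pred):
    """Return (abs_rel, sq_rel, rms, log_rms, a1, a2, a3).

    gt, pred: equal-length sequences of positive depths (valid pixels only).
    """
    # Assumption: the scale factor is median(gt) / median(pred).
    scale = median(gt) / median(pred)
    pred = [p * scale for p in pred]
    n = len(gt)
    pairs = list(zip(gt, pred))

    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    rms = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    log_rms = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n
    )

    # Accuracy under threshold: fraction of pixels whose ratio
    # max(gt/pred, pred/gt) is below 1.25, 1.25^2 and 1.25^3.
    ratios = [max(g / p, p / g) for g, p in pairs]
    a1, a2, a3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))
    return abs_rel, sq_rel, rms, log_rms, a1, a2, a3
```

Lower is better for the first four values and higher is better for a1, a2, a3. Because the prediction is rescaled first, a prediction that is off by a constant factor scores perfectly.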
I don't think any factor other than the batch size and the data split could cause this difference. So, does the batch size affect the training results?
What's more, when I train the model on two Titan GPUs with batch-size=80*2=160, the memory usage is about 11G on GPU0 and about 6G on GPU1. This large imbalance seriously hampers multi-GPU training. I then found that all the loss calculations are placed on the first GPU, and that the memory is mainly used to compute photometric_reconstruction_loss at the 4 depth scales. I think it might be better to keep some scales on cuda:0 and move the others to cuda:1.
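As a rough way to reason about that proposal, the sketch below assigns the loss scales to devices greedily by pixel count. It rests on two assumptions that are not confirmed in this thread: that scale k has resolution (height / 2^k, width / 2^k), and that pixel count is a usable proxy for the memory each scale's loss needs. The function `assign_scales` is hypothetical and does not touch any GPU.

```python
def assign_scales(height, width, num_scales=4, devices=("cuda:0", "cuda:1")):
    """Greedily assign loss scales to devices, balancing pixel counts."""
    # Assumption: each scale halves the resolution of the previous one.
    costs = {
        k: (height // 2 ** k) * (width // 2 ** k) for k in range(num_scales)
    }
    load = {d: 0 for d in devices}
    plan = {d: [] for d in devices}
    # Place the largest scale first, always on the least-loaded device.
    for k in sorted(costs, key=costs.get, reverse=True):
        target = min(devices, key=lambda d: load[d])
        plan[target].append(k)
        load[target] += costs[k]
    return plan, load


plan, load = assign_scales(128, 416)  # 128x416 input size is an assumption
print(plan)
print(load)
```

Under these assumptions the full-resolution scale alone holds about 75% of the pixels, so the best split by scale still leaves one device with roughly three times the load of the other. Splitting the batch for the loss, rather than the scales, may balance memory better.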
Top GitHub Comments
Results with your split, using model_best :
Results with scale factor determined by GT/prediction ratio (like the original paper) :
abs_rel, sq_rel, rms, log_rms, a1, a2, a3
0.1854, 1.3986, 6.4104, 0.2687, 0.7149, 0.8985, 0.9619
Results with your split, using checkpoint :
Results with scale factor determined by GT/prediction ratio (like the original paper) :
abs_rel, sq_rel, rms, log_rms, a1, a2, a3
0.2040, 1.8203, 6.6266, 0.2914, 0.6971, 0.8848, 0.9510
As such, I think you only used checkpoint.pth.tar. This is consistent with the author's claim that you eventually end up with worse results if you keep training for more than 140K iterations.