Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Training stopping by itself after (250/batch_size) steps

See original GitHub issue

Hello,

I’m trying to launch the training script on the pre-trained VGG-16 model that is linked in the Readme.md. The problem I have is that training stops by itself after (250/batch_size) steps. I didn’t touch anything in the train_ssd.py script except for the batch size, because my GPU was running out of memory with a batch_size of 32. I tried 4, 2 and 1 as batch sizes, and training stopped after 62, 125 and 250 steps respectively, saving a checkpoint each time. My accuracy also doesn’t seem to change at all, staying at 0.750000 the whole time.

My question is: “What makes my training stop so quickly?” I can’t find where it comes from, although it definitely looks intentional, since it stops quite cleanly.
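
For reference, the stop points reported above are internally consistent with a training cycle that consumes a fixed pool of roughly 250 examples and drops the last partial batch. The quick check below assumes that figure; it is inferred from the reported step counts, not stated anywhere in the repository.

# Sketch: 62 / 125 / 250 steps for batch sizes 4 / 2 / 1 all point at ~250
# examples per cycle (the 250 figure is an inference, not taken from the repo).
for batch_size in (4, 2, 1):
    steps = 250 // batch_size      # integer division == final partial batch dropped
    print(f"batch_size={batch_size}: stops after {steps} steps")
# batch_size=4: stops after 62 steps
# batch_size=2: stops after 125 steps
# batch_size=1: stops after 250 steps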

Here is the output of launching train_ssd.py with batch_size=2 (as you can see, it starts at step 1841 and stops by itself at step 1966):

(tensorflowgpu-env) C:\Users\tomri\Documents\SSD.TensorFlow-master>python train_ssd.py

WARNING:tensorflow:From train_ssd.py:427: replicate_model_fn (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating: Please use tf.contrib.distribute.MirroredStrategy instead.
2019-03-13 16:10:10.829767: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-13 16:10:12.219015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493 pciBusID: 0000:01:00.0 totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-03-13 16:10:12.234896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:12.683094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:12.686969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:12.689310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:12.692227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Replicating the model_fn across ['/device:GPU:0']. Variables are going to be placed on ['/device:GPU:0']. Consolidation device is going to be /device:GPU:0.
INFO:tensorflow:Using config: {'_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } allow_soft_placement: true, '_save_checkpoints_steps': None, '_evaluation_master': '', '_global_id_in_cluster': 0, '_train_distribute': None, '_task_id': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': 20180503, '_service': None, '_log_step_count_steps': 10, '_num_ps_replicas': 0, '_task_type': 'worker', '_save_summary_steps': 1000, '_save_checkpoints_secs': 7200, '_is_chief': True, '_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000025555533550>, '_keep_checkpoint_every_n_hours': 10000, '_device_fn': None, '_num_worker_replicas': 1, '_model_dir': './logs/'}
Starting a training cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From C:\Users\tomri\Documents\SSD.TensorFlow-master\net\ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating: keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From train_ssd.py:388: TowerOptimizer.__init__ (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating: Please use tf.contrib.distribute.MirroredStrategy instead.
INFO:tensorflow:Ignoring --checkpoint_path because a checkpoint already exists in ./logs/.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-03-13 16:10:25.877210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:25.882028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:25.886118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:25.890165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:25.893592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1841
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1841 into ./logs/model.ckpt.
2019-03-13 16:10:52.813970: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-13 16:10:53.824309: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:lr=0.001000, l2=7.292980, loc=3.477014, loss=53.604458, acc=0.750000, ce=42.834465
INFO:tensorflow:loss = 53.604458, step = 1841
INFO:tensorflow:global_step/sec: 2.24063
INFO:tensorflow:lr=0.001000, l2=7.292490, loc=3.637795, loss=25.027014, acc=0.750000, ce=14.096728
INFO:tensorflow:loss = 25.027014, step = 1851 (4.462 sec)
INFO:tensorflow:global_step/sec: 3.73853
INFO:tensorflow:lr=0.001000, l2=7.292079, loc=3.079521, loss=22.236305, acc=0.750000, ce=11.864706
INFO:tensorflow:loss = 22.236305, step = 1861 (2.676 sec)
INFO:tensorflow:global_step/sec: 3.75393
INFO:tensorflow:lr=0.001000, l2=7.291565, loc=1.217845, loss=11.967870, acc=0.750000, ce=3.458459
INFO:tensorflow:loss = 11.96787, step = 1871 (2.665 sec)
INFO:tensorflow:global_step/sec: 3.73993
INFO:tensorflow:lr=0.001000, l2=7.291090, loc=3.487017, loss=17.143929, acc=0.750000, ce=6.365822
INFO:tensorflow:loss = 17.143929, step = 1881 (2.672 sec)
INFO:tensorflow:global_step/sec: 3.66937
INFO:tensorflow:lr=0.001000, l2=7.290852, loc=3.540547, loss=32.483604, acc=0.750000, ce=21.652206
INFO:tensorflow:loss = 32.483604, step = 1891 (2.728 sec)
INFO:tensorflow:global_step/sec: 3.78508
INFO:tensorflow:lr=0.001000, l2=7.290482, loc=2.461484, loss=13.741415, acc=0.750000, ce=3.989449
INFO:tensorflow:loss = 13.741415, step = 1901 (2.640 sec)
INFO:tensorflow:global_step/sec: 3.79082
INFO:tensorflow:lr=0.001000, l2=7.290192, loc=3.321430, loss=21.725256, acc=0.750000, ce=11.113634
INFO:tensorflow:loss = 21.725256, step = 1911 (2.638 sec)
INFO:tensorflow:global_step/sec: 3.75983
INFO:tensorflow:lr=0.001000, l2=7.289964, loc=2.578511, loss=13.988897, acc=0.750000, ce=4.120422
INFO:tensorflow:loss = 13.988897, step = 1921 (2.663 sec)
INFO:tensorflow:global_step/sec: 3.76216
INFO:tensorflow:lr=0.001000, l2=7.289906, loc=5.934974, loss=40.700157, acc=0.750000, ce=27.475279
INFO:tensorflow:loss = 40.700157, step = 1931 (2.655 sec)
INFO:tensorflow:global_step/sec: 3.76096
INFO:tensorflow:lr=0.001000, l2=7.289580, loc=2.893521, loss=17.323259, acc=0.750000, ce=7.140158
INFO:tensorflow:loss = 17.32326, step = 1941 (2.659 sec)
INFO:tensorflow:global_step/sec: 3.77241
INFO:tensorflow:lr=0.001000, l2=7.289131, loc=2.843292, loss=13.380405, acc=0.750000, ce=3.247983
INFO:tensorflow:loss = 13.380405, step = 1951 (2.651 sec)
INFO:tensorflow:global_step/sec: 3.77656
INFO:tensorflow:lr=0.001000, l2=7.288894, loc=2.549889, loss=13.519226, acc=0.750000, ce=3.680444
INFO:tensorflow:loss = 13.519226, step = 1961 (2.648 sec)
INFO:tensorflow:Saving checkpoints for 1966 into ./logs/model.ckpt.
INFO:tensorflow:Loss for final step: 16.647438.
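
The thread does not pin down what ends the cycle, but the log shape (a clean checkpoint save followed by “Loss for final step”) is exactly what tf.estimator produces when either the step budget passed to that train() call is reached or the input_fn’s dataset runs out of data. Below is a minimal TF 1.x sketch of the second case; it illustrates the general Estimator behaviour and is not the actual input pipeline from SSD.TensorFlow, and tfrecord_files / parse_fn are placeholders.

import tensorflow as tf  # TF 1.x

def input_fn(tfrecord_files, batch_size, parse_fn):
    # Hypothetical pipeline: the names here are placeholders, not repo code.
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    dataset = dataset.map(parse_fn)
    dataset = dataset.shuffle(1000)
    # Without .repeat(), Estimator.train() stops cleanly once this dataset is
    # exhausted -- after roughly num_records / batch_size steps.
    # dataset = dataset.repeat()
    return dataset.batch(batch_size)

# estimator.train(lambda: input_fn(files, batch_size=2, parse_fn=parse_fn))

If the converted TFRecords only expose a few hundred images (for example, only one VOC split was converted), this alone would reproduce the 250/batch_size stopping pattern, which is presumably why the maintainer asks below whether a custom dataset was used.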

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
tomrichardon commented, Mar 14, 2019

No, I’m working with Pascal VOC 2007+2012

1 reaction
HiKapok commented, Mar 14, 2019

@tomrichardon did you use a custom dataset?

Read more comments on GitHub >

Top Results From Across the Web

All You Need to Know about Batch Size, Epochs and Training ...
If you get a plot like this, you must stop the training process (early stopping) at the 5th epoch as the validation loss...
How to maximize GPU utilization by finding the right batch size
In this article, we examine the effects of batch size on DL model training times and accuracies, and go on to describe a...
Training process was killed without throwing any problems
I have tried reducing the batch size to 1. I am able to do training but parallel evaluation is not working.
Training Tips for the Transformer Model
In order to convert training steps to epochs, we need to multiply the steps by the effective batch size and divide by the... (a short worked example of this conversion follows this list).
Training & evaluation with the built-in methods - Keras
Introduction. This guide covers training, evaluation, and prediction (inference) models when using built-in APIs for training & validation ...
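
The steps-to-epochs relationship quoted in the Transformer training-tips result above is easy to apply to the numbers in this thread. The sketch below assumes the ~250-examples-per-cycle figure inferred earlier, so treat the outputs as illustrative rather than authoritative.

# epochs = steps * effective_batch_size / dataset_size (and the inverse).
def steps_to_epochs(steps, batch_size, dataset_size):
    return steps * batch_size / dataset_size

def epochs_to_steps(epochs, batch_size, dataset_size):
    return int(epochs * dataset_size) // batch_size

print(steps_to_epochs(1966, 2, 250))   # ~15.7 "epochs" over an assumed 250-image pool
print(epochs_to_steps(1, 2, 250))      # 125 steps per epoch at batch_size=2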
