Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Training stopping by itself after (250/batch_size) steps

See original GitHub issue

Hello,

I’m trying to launch the training script on the pre-trained VGG-16 model that is linked in the Readme.md. The problem I have is that training stops by itself after (250/batch_size) steps. I didn’t touch anything in the train_ssd.py script except for the batch size, because my GPU was running out of memory with a batch_size of 32. I tried 4, 2 and 1 as batch sizes, and training stopped after 62, 125 and 250 steps respectively, saving a checkpoint each time. My accuracy also doesn’t seem to change at all, staying at 0.750000 the whole time.

My question is: “What makes my training stop so quickly?” I can’t find where it comes from, although it definitely looks intentional, since it stops quite cleanly.
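
For reference, the stop points reported above are internally consistent with a training cycle that consumes a fixed pool of roughly 250 examples and drops the last partial batch. The quick check below assumes that figure; it is inferred from the reported step counts, not stated anywhere in the repository.

# Sketch: 62 / 125 / 250 steps for batch sizes 4 / 2 / 1 all point at ~250
# examples per cycle (the 250 figure is an inference, not taken from the repo).
for batch_size in (4, 2, 1):
    steps = 250 // batch_size      # integer division == final partial batch dropped
    print(f"batch_size={batch_size}: stops after {steps} steps")
# batch_size=4: stops after 62 steps
# batch_size=2: stops after 125 steps
# batch_size=1: stops after 250 steps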

Here is the output of launching train_ssd.py with batch_size=2 (as you can see, it starts at step 1841 and stops by itself at step 1966):

(tensorflowgpu-env) C:\Users\tomri\Documents\SSD.TensorFlow-master>python train_ssd.py

WARNING:tensorflow:From train_ssd.py:427: replicate_model_fn (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating: Please use tf.contrib.distribute.MirroredStrategy instead.
2019-03-13 16:10:10.829767: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-13 16:10:12.219015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493 pciBusID: 0000:01:00.0 totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-03-13 16:10:12.234896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:12.683094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:12.686969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:12.689310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:12.692227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Replicating the model_fn across ['/device:GPU:0']. Variables are going to be placed on ['/device:GPU:0']. Consolidation device is going to be /device:GPU:0.
INFO:tensorflow:Using config: {'_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } allow_soft_placement: true, '_save_checkpoints_steps': None, '_evaluation_master': '', '_global_id_in_cluster': 0, '_train_distribute': None, '_task_id': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': 20180503, '_service': None, '_log_step_count_steps': 10, '_num_ps_replicas': 0, '_task_type': 'worker', '_save_summary_steps': 1000, '_save_checkpoints_secs': 7200, '_is_chief': True, '_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000025555533550>, '_keep_checkpoint_every_n_hours': 10000, '_device_fn': None, '_num_worker_replicas': 1, '_model_dir': './logs/'}
Starting a training cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From C:\Users\tomri\Documents\SSD.TensorFlow-master\net\ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating: keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From train_ssd.py:388: TowerOptimizer.__init__ (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating: Please use tf.contrib.distribute.MirroredStrategy instead.
INFO:tensorflow:Ignoring --checkpoint_path because a checkpoint already exists in ./logs/.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-03-13 16:10:25.877210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:25.882028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:25.886118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:25.890165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:25.893592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1841
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1841 into ./logs/model.ckpt.
2019-03-13 16:10:52.813970: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-13 16:10:53.824309: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:lr=0.001000, l2=7.292980, loc=3.477014, loss=53.604458, acc=0.750000, ce=42.834465
INFO:tensorflow:loss = 53.604458, step = 1841
INFO:tensorflow:global_step/sec: 2.24063
INFO:tensorflow:lr=0.001000, l2=7.292490, loc=3.637795, loss=25.027014, acc=0.750000, ce=14.096728
INFO:tensorflow:loss = 25.027014, step = 1851 (4.462 sec)
INFO:tensorflow:global_step/sec: 3.73853
INFO:tensorflow:lr=0.001000, l2=7.292079, loc=3.079521, loss=22.236305, acc=0.750000, ce=11.864706
INFO:tensorflow:loss = 22.236305, step = 1861 (2.676 sec)
INFO:tensorflow:global_step/sec: 3.75393
INFO:tensorflow:lr=0.001000, l2=7.291565, loc=1.217845, loss=11.967870, acc=0.750000, ce=3.458459
INFO:tensorflow:loss = 11.96787, step = 1871 (2.665 sec)
INFO:tensorflow:global_step/sec: 3.73993
INFO:tensorflow:lr=0.001000, l2=7.291090, loc=3.487017, loss=17.143929, acc=0.750000, ce=6.365822
INFO:tensorflow:loss = 17.143929, step = 1881 (2.672 sec)
INFO:tensorflow:global_step/sec: 3.66937
INFO:tensorflow:lr=0.001000, l2=7.290852, loc=3.540547, loss=32.483604, acc=0.750000, ce=21.652206
INFO:tensorflow:loss = 32.483604, step = 1891 (2.728 sec)
INFO:tensorflow:global_step/sec: 3.78508
INFO:tensorflow:lr=0.001000, l2=7.290482, loc=2.461484, loss=13.741415, acc=0.750000, ce=3.989449
INFO:tensorflow:loss = 13.741415, step = 1901 (2.640 sec)
INFO:tensorflow:global_step/sec: 3.79082
INFO:tensorflow:lr=0.001000, l2=7.290192, loc=3.321430, loss=21.725256, acc=0.750000, ce=11.113634
INFO:tensorflow:loss = 21.725256, step = 1911 (2.638 sec)
INFO:tensorflow:global_step/sec: 3.75983
INFO:tensorflow:lr=0.001000, l2=7.289964, loc=2.578511, loss=13.988897, acc=0.750000, ce=4.120422
INFO:tensorflow:loss = 13.988897, step = 1921 (2.663 sec)
INFO:tensorflow:global_step/sec: 3.76216
INFO:tensorflow:lr=0.001000, l2=7.289906, loc=5.934974, loss=40.700157, acc=0.750000, ce=27.475279
INFO:tensorflow:loss = 40.700157, step = 1931 (2.655 sec)
INFO:tensorflow:global_step/sec: 3.76096
INFO:tensorflow:lr=0.001000, l2=7.289580, loc=2.893521, loss=17.323259, acc=0.750000, ce=7.140158
INFO:tensorflow:loss = 17.32326, step = 1941 (2.659 sec)
INFO:tensorflow:global_step/sec: 3.77241
INFO:tensorflow:lr=0.001000, l2=7.289131, loc=2.843292, loss=13.380405, acc=0.750000, ce=3.247983
INFO:tensorflow:loss = 13.380405, step = 1951 (2.651 sec)
INFO:tensorflow:global_step/sec: 3.77656
INFO:tensorflow:lr=0.001000, l2=7.288894, loc=2.549889, loss=13.519226, acc=0.750000, ce=3.680444
INFO:tensorflow:loss = 13.519226, step = 1961 (2.648 sec)
INFO:tensorflow:Saving checkpoints for 1966 into ./logs/model.ckpt.
INFO:tensorflow:Loss for final step: 16.647438.
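
The thread does not pin down what ends the cycle, but the log shape (a clean checkpoint save followed by “Loss for final step”) is exactly what tf.estimator produces when either the step budget passed to that train() call is reached or the input_fn’s dataset runs out of data. Below is a minimal TF 1.x sketch of the second case; it illustrates the general Estimator behaviour and is not the actual input pipeline from SSD.TensorFlow, and tfrecord_files / parse_fn are placeholders.

import tensorflow as tf  # TF 1.x

def input_fn(tfrecord_files, batch_size, parse_fn):
    # Hypothetical pipeline: the names here are placeholders, not repo code.
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    dataset = dataset.map(parse_fn)
    dataset = dataset.shuffle(1000)
    # Without .repeat(), Estimator.train() stops cleanly once this dataset is
    # exhausted -- after roughly num_records / batch_size steps.
    # dataset = dataset.repeat()
    return dataset.batch(batch_size)

# estimator.train(lambda: input_fn(files, batch_size=2, parse_fn=parse_fn))

If the converted TFRecords only expose a few hundred images (for example, only one VOC split was converted), this alone would reproduce the 250/batch_size stopping pattern, which is presumably why the maintainer asks below whether a custom dataset was used.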

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
tomrichardon commented, Mar 14, 2019

No, I’m working with Pascal VOC 2007+2012

1 reaction
HiKapok commented, Mar 14, 2019

@tomrichardon did you use a custom dataset?

Read more comments on GitHub >

Top Results From Across the Web

All You Need to Know about Batch Size, Epochs and Training ...
If you get a plot like this, you must stop the training process (early stopping) at the 5th epoch as the validation loss...
How to maximize GPU utilization by finding the right batch size
In this article, we examine the effects of batch size on DL model training times and accuracies, and go on to describe a...
Training process was killed without throwing any problems
I have tried reducing the batch size to 1. I am able to do training but parallel evaluation is not working.
Training Tips for the Transformer Model
In order to convert training steps to epochs, we need to multiply the steps by the effective batch size and divide by the... (a short worked example of this conversion follows this list).
Training & evaluation with the built-in methods - Keras
Introduction. This guide covers training, evaluation, and prediction (inference) models when using built-in APIs for training & validation ...
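
The steps-to-epochs relationship quoted in the Transformer training-tips result above is easy to apply to the numbers in this thread. The sketch below assumes the ~250-examples-per-cycle figure inferred earlier, so treat the outputs as illustrative rather than authoritative.

# epochs = steps * effective_batch_size / dataset_size (and the inverse).
def steps_to_epochs(steps, batch_size, dataset_size):
    return steps * batch_size / dataset_size

def epochs_to_steps(epochs, batch_size, dataset_size):
    return int(epochs * dataset_size) // batch_size

print(steps_to_epochs(1966, 2, 250))   # ~15.7 "epochs" over an assumed 250-image pool
print(epochs_to_steps(1, 2, 250))      # 125 steps per epoch at batch_size=2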
