Training stopping by itself after (250/batch_size) steps
Hello,
I'm trying to launch the training script on the pre-trained VGG-16 model that is linked in the Readme.md. The problem is that training stops by itself after (250/batch_size) steps. I didn't touch anything in the train_ssd.py script except the batch size, because my GPU was running out of memory with a batch_size of 32. I tried 4, 2 and 1 as batch sizes, and training stopped after 62, 125 and 250 steps respectively, saving a checkpoint each time. My accuracy also doesn't change at all, staying at 0.750000 the whole time.
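For reference, those stopping points line up with an input pipeline that yields roughly 250 examples in total before ending; the quick check below is only my own back-of-the-envelope arithmetic, not anything taken from the script:

```python
# Hypothesis (not confirmed from train_ssd.py): the pipeline produces ~250
# examples in total, so training ends after floor(250 / batch_size) steps.
total_examples = 250
for batch_size in (4, 2, 1):
    print(f"batch_size={batch_size}: ~{total_examples // batch_size} steps")
# batch_size=4: ~62 steps
# batch_size=2: ~125 steps
# batch_size=1: ~250 steps
```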
My question is: what makes my training stop so quickly? I can't find where it comes from, although it definitely looks intentional, since it stops pretty cleanly.
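One thing that would explain a clean stop (this is only a guess, not taken from this repository's code) is the input pipeline simply running out of data: tf.estimator ends a training run without any error and writes a final checkpoint as soon as the input_fn's dataset is exhausted. Here is a minimal, self-contained toy sketch of that behaviour, using made-up data and a hypothetical ./toy_logs directory:

```python
import numpy as np
import tensorflow as tf  # TF 1.x, matching the log below

def input_fn():
    # A small dataset that is not repeated: 250 examples with batch_size=2
    # yield 125 batches, after which the training loop stops cleanly.
    features = np.random.rand(250, 4).astype(np.float32)
    labels = np.random.randint(0, 2, size=(250,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
    return dataset.batch(2)

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="./toy_logs")
# Ends after 125 steps even though no max_steps was given, then saves a
# final checkpoint -- the same "stops cleanly" pattern as in my log.
estimator.train(input_fn)
```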
Here is the output of launching train_ssd.py with batch_size=2 (as you can see, it resumes at step 1841 and stops by itself at step 1966):
(tensorflowgpu-env) C:\Users\tomri\Documents\SSD.TensorFlow-master>python train_ssd.py
WARNING:tensorflow:From train_ssd.py:427: replicate_model_fn (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating:
Please use tf.contrib.distribute.MirroredStrategy
instead.
2019-03-13 16:10:10.829767: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-13 16:10:12.219015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-03-13 16:10:12.234896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:12.683094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:12.686969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:12.689310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:12.692227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Replicating the model_fn
across ['/device:GPU:0']. Variables are going to be placed on ['/device:GPU:0']. Consolidation device is going to be /device:GPU:0.
INFO:tensorflow:Using config: {'_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
allow_soft_placement: true
, '_save_checkpoints_steps': None, '_evaluation_master': '', '_global_id_in_cluster': 0, '_train_distribute': None, '_task_id': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': 20180503, '_service': None, '_log_step_count_steps': 10, '_num_ps_replicas': 0, '_task_type': 'worker', '_save_summary_steps': 1000, '_save_checkpoints_secs': 7200, '_is_chief': True, '_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000025555533550>, '_keep_checkpoint_every_n_hours': 10000, '_device_fn': None, '_num_worker_replicas': 1, '_model_dir': './logs/'}
Starting a training cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From C:\Users\tomri\Documents\SSD.TensorFlow-master\net\ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From train_ssd.py:388: TowerOptimizer.__init__ (from tensorflow.contrib.estimator.python.estimator.replicate_model_fn) is deprecated and will be removed after 2018-05-31.
Instructions for updating:
Please use tf.contrib.distribute.MirroredStrategy
instead.
INFO:tensorflow:Ignoring --checkpoint_path because a checkpoint already exists in ./logs/.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-03-13 16:10:25.877210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-13 16:10:25.882028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 16:10:25.886118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-13 16:10:25.890165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-13 16:10:25.893592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1841
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1841 into ./logs/model.ckpt.
2019-03-13 16:10:52.813970: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-13 16:10:53.824309: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:lr=0.001000, l2=7.292980, loc=3.477014, loss=53.604458, acc=0.750000, ce=42.834465
INFO:tensorflow:loss = 53.604458, step = 1841
INFO:tensorflow:global_step/sec: 2.24063
INFO:tensorflow:lr=0.001000, l2=7.292490, loc=3.637795, loss=25.027014, acc=0.750000, ce=14.096728
INFO:tensorflow:loss = 25.027014, step = 1851 (4.462 sec)
INFO:tensorflow:global_step/sec: 3.73853
INFO:tensorflow:lr=0.001000, l2=7.292079, loc=3.079521, loss=22.236305, acc=0.750000, ce=11.864706
INFO:tensorflow:loss = 22.236305, step = 1861 (2.676 sec)
INFO:tensorflow:global_step/sec: 3.75393
INFO:tensorflow:lr=0.001000, l2=7.291565, loc=1.217845, loss=11.967870, acc=0.750000, ce=3.458459
INFO:tensorflow:loss = 11.96787, step = 1871 (2.665 sec)
INFO:tensorflow:global_step/sec: 3.73993
INFO:tensorflow:lr=0.001000, l2=7.291090, loc=3.487017, loss=17.143929, acc=0.750000, ce=6.365822
INFO:tensorflow:loss = 17.143929, step = 1881 (2.672 sec)
INFO:tensorflow:global_step/sec: 3.66937
INFO:tensorflow:lr=0.001000, l2=7.290852, loc=3.540547, loss=32.483604, acc=0.750000, ce=21.652206
INFO:tensorflow:loss = 32.483604, step = 1891 (2.728 sec)
INFO:tensorflow:global_step/sec: 3.78508
INFO:tensorflow:lr=0.001000, l2=7.290482, loc=2.461484, loss=13.741415, acc=0.750000, ce=3.989449
INFO:tensorflow:loss = 13.741415, step = 1901 (2.640 sec)
INFO:tensorflow:global_step/sec: 3.79082
INFO:tensorflow:lr=0.001000, l2=7.290192, loc=3.321430, loss=21.725256, acc=0.750000, ce=11.113634
INFO:tensorflow:loss = 21.725256, step = 1911 (2.638 sec)
INFO:tensorflow:global_step/sec: 3.75983
INFO:tensorflow:lr=0.001000, l2=7.289964, loc=2.578511, loss=13.988897, acc=0.750000, ce=4.120422
INFO:tensorflow:loss = 13.988897, step = 1921 (2.663 sec)
INFO:tensorflow:global_step/sec: 3.76216
INFO:tensorflow:lr=0.001000, l2=7.289906, loc=5.934974, loss=40.700157, acc=0.750000, ce=27.475279
INFO:tensorflow:loss = 40.700157, step = 1931 (2.655 sec)
INFO:tensorflow:global_step/sec: 3.76096
INFO:tensorflow:lr=0.001000, l2=7.289580, loc=2.893521, loss=17.323259, acc=0.750000, ce=7.140158
INFO:tensorflow:loss = 17.32326, step = 1941 (2.659 sec)
INFO:tensorflow:global_step/sec: 3.77241
INFO:tensorflow:lr=0.001000, l2=7.289131, loc=2.843292, loss=13.380405, acc=0.750000, ce=3.247983
INFO:tensorflow:loss = 13.380405, step = 1951 (2.651 sec)
INFO:tensorflow:global_step/sec: 3.77656
INFO:tensorflow:lr=0.001000, l2=7.288894, loc=2.549889, loss=13.519226, acc=0.750000, ce=3.680444
INFO:tensorflow:loss = 13.519226, step = 1961 (2.648 sec)
INFO:tensorflow:Saving checkpoints for 1966 into ./logs/model.ckpt.
INFO:tensorflow:Loss for final step: 16.647438.
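If the pipeline really is exhausting its data, counting the records in the converted tfrecord files should confirm it. A rough check (the path pattern below is a guess on my part; adjust it to wherever the VOC tfrecords were actually written):

```python
import glob
import tensorflow as tf  # TF 1.x

# Hypothetical location/pattern for the training tfrecords.
pattern = "./dataset/tfrecords/*train*"

total = 0
for filename in glob.glob(pattern):
    # tf.python_io.tf_record_iterator walks every serialized example in a file.
    total += sum(1 for _ in tf.python_io.tf_record_iterator(filename))
print("training examples found:", total)
```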
@tomrichardon did you use a custom dataset?
No, I'm working with Pascal VOC 2007+2012.