
sync optimizer problem

See original GitHub issue

Hi Vincent, I have another problem. I use GPUs 0 and 1 for training with numbatches_to_aggregate=0 in the default config standardtrainer.cfg, but I see three "Start master session" entries in the log. Is this behavior correct?

2018-08-21 07:20:16.556159: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 07:20:22.265699: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 6066331584d0d493 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 07:20:22.711547: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 51400be6a5c9e9bc with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.890086, learning rate: 0.001000 
         time elapsed: 8.318348 sec
         peak memory usage: 21/22604 MB
WORKER 0: step 1/15600 loss: 3.776481, learning rate: 0.001000 
         time elapsed: 1.858370 sec
         peak memory usage: 688/22604 MB
WORKER 0: step 2/15600 loss: 3.670535, learning rate: 0.001000 
         time elapsed: 1.519805 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 3/15600 loss: 3.695373, learning rate: 0.001000 
         time elapsed: 1.608191 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 4/15600 loss: 3.627747, learning rate: 0.000999 
         time elapsed: 2.627351 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 5/15600 loss: 3.646843, learning rate: 0.000999 
         time elapsed: 1.744121 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 6/15600 loss: 3.629820, learning rate: 0.000999 
         time elapsed: 1.468782 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 7/15600 loss: 3.638589, learning rate: 0.000999 
         time elapsed: 1.179851 sec
         peak memory usage: 893/22604 MB
2018-08-21 07:20:52.899330: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c5b480c57e96eea5 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 8/15600 loss: 3.618140, learning rate: 0.000999 
         time elapsed: 1.060105 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 9/15600 loss: 3.596526, learning rate: 0.000999 
         time elapsed: 2.330892 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 10/15600 loss: 3.601762, learning rate: 0.000999 
         time elapsed: 1.179645 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 11/15600 loss: 3.615230, learning rate: 0.000998 
         time elapsed: 0.954928 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 12/15600 loss: 3.602469, learning rate: 0.000998 
         time elapsed: 1.251472 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 13/15600 loss: 3.601032, learning rate: 0.000998 
         time elapsed: 0.886364 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 14/15600 loss: 3.603386, learning rate: 0.000998 
         time elapsed: 1.565970 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 15/15600 loss: 3.622668, learning rate: 0.000998 
         time elapsed: 0.885060 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 16/15600 loss: 3.613049, learning rate: 0.000998 
         time elapsed: 1.336841 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 17/15600 loss: 3.610620, learning rate: 0.000997 
         time elapsed: 0.818247 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 18/15600 loss: 3.618244, learning rate: 0.000997 
         time elapsed: 0.795765 sec
         peak memory usage: 893/22604 MB
WORKER 1: step 17/15600 loss: 3.590606, learning rate: 0.000997 
         time elapsed: 8.500966 sec
         peak memory usage: 893/22604 MB
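
For context, here is a minimal sketch of what the numbatches_to_aggregate switch typically controls. This is not nabu's actual code: it uses tf.train.AdamOptimizer as a placeholder optimizer and hard-codes the values, but it shows the difference between 0 (plain asynchronous updates, as in the run above) and a value greater than 0 (the sync replicas path that the run below goes through).

import tensorflow as tf  # TF 1.x API

# Values that would normally come from standardtrainer.cfg and the cluster.
numbatches_to_aggregate = 0   # 0 -> asynchronous, > 0 -> synchronous
num_replicas = 2              # e.g. two GPU workers

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

if numbatches_to_aggregate > 0:
    # Synchronous training: gradients from several replicas are accumulated
    # before one update is applied. This wrapper is also what raises the
    # "Global step is required to check staleness" error shown below when
    # apply_gradients is called without a global_step.
    optimizer = tf.train.SyncReplicasOptimizer(
        optimizer,
        replicas_to_aggregate=numbatches_to_aggregate,
        total_num_replicas=num_replicas)
# With numbatches_to_aggregate == 0 the plain optimizer is kept and every
# worker applies its own gradients independently (asynchronous updates).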

In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, I get an error message like this:

Traceback (most recent call last):
  File "nabu/scripts/train.py", line 112, in <module>
    testing=False)
  File "nabu/scripts/train.py", line 90, in train
    tr.train(testing)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 607, in train
    outputs = self._create_graph()
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 187, in _create_graph
    cluster=cluster)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 569, in _update
    name='apply_gradients')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 238, in apply_gradients
    raise ValueError("Global step is required to check staleness")
ValueError: Global step is required to check staleness

So I added the global_step parameter to the apply_gradients_op call in the _update function:

# operation to apply the gradients
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    name='apply_gradients',
    global_step=global_step)

Then I started training again, but no training log is printed anymore. How should global_step be passed to apply_gradients_op? @vrenkens
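
For what it's worth, here is a minimal TF 1.x sketch of the usual sync-replicas wiring, independent of nabu: besides passing global_step to apply_gradients, the SyncReplicasOptimizer also needs its session hook registered; if that hook is missing, the workers block waiting for sync tokens before the first step, which would look exactly like training hanging with no log output. The toy loss, the local server, and replicas_to_aggregate=1 below are only there to keep the sketch self-contained and runnable in a single process.

import tensorflow as tf  # TF 1.x API

# Stand-ins for pieces the real trainer builds elsewhere: a toy loss and a
# single-process server. In a real run, replicas_to_aggregate would be
# numbatches_to_aggregate and total_num_replicas the number of workers.
server = tf.train.Server.create_local_server()
w = tf.get_variable('w', initializer=1.0)
loss = tf.square(w - 3.0)
is_chief = True

base_optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
optimizer = tf.train.SyncReplicasOptimizer(
    base_optimizer, replicas_to_aggregate=1, total_num_replicas=1)

# The global step is what the "Global step is required to check staleness"
# error asks for: it lets the optimizer drop gradients that were computed
# from stale parameter values.
global_step = tf.train.get_or_create_global_step()

grads_and_vars = optimizer.compute_gradients(loss)
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    global_step=global_step,
    name='apply_gradients')

# The sync optimizer's hook initializes the token queue and, on the chief,
# starts the queue runners that hand tokens to the workers. Without it the
# workers wait forever and nothing gets printed.
sync_hook = optimizer.make_session_run_hook(is_chief=is_chief)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    for _ in range(3):
        print(sess.run([apply_gradients_op, global_step]))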

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

1 reaction
vrenkens commented, Aug 21, 2018

@AzizCode92 The task_name is different; the indices for different task names can be the same.

1 reaction
vrenkens commented, Aug 21, 2018

About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters.

About the second question: I will check it out when I have the time
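
To make the first answer concrete, here is a small sketch (again not nabu's code, with made-up localhost addresses) of the kind of cluster a 2-GPU run maps to: two worker tasks plus one parameter-server task, so three processes, each reporting its own "Start master session" line. It also illustrates the earlier point about task names: indices only need to be unique within a job, so worker:0, worker:1 and ps:0 coexist.

import tensorflow as tf  # TF 1.x API

# Hypothetical addresses; in practice these come from the cluster config.
cluster = tf.train.ClusterSpec({
    'worker': ['localhost:2222', 'localhost:2223'],  # one task per GPU
    'ps': ['localhost:2224'],                        # holds the parameters
})

# Every process starts one server for its own (job_name, task_index) pair.
# This one is worker:0; the others would use job_name='worker', task_index=1
# and job_name='ps', task_index=0.
server = tf.train.Server(cluster, job_name='worker', task_index=0)
print(server.target)  # the address a session for this task connects to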

Read more comments on GitHub
