[CLI]: v0.13.1 Training results in IndexError: pop when using WandbCallback
Describe the bug
Training starts but never completes. At the end of the training phase, the call to model.fit fails with the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-30-f5faf23168b9> in <module>
1 results = model.fit(z_train, y_train, batch_size=config.batch_size, epochs=config.epochs, callbacks=callbacks,\
----> 2 validation_data=(z_test, y_test))
~/.local/lib/python3.6/site-packages/wandb/integration/keras/keras.py in new_v2(*args, **kwargs)
171 for cbk in cbks:
172 set_wandb_attrs(cbk, val_data)
--> 173 return old_v2(*args, **kwargs)
174
175 training_arrays.orig_fit_loop = old_arrays
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
106 def _method_wrapper(self, *args, **kwargs):
107 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
--> 108 return method(self, *args, **kwargs)
109
110 # Running inside `run_distribute_coordinator` already.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1144 del self._eval_data_handler
1145 callbacks.on_train_end(logs=training_logs)
-> 1146 return self.history
1147
1148 def test_step(self, data):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py in __exit__(self, exception_type, exception_value, traceback)
425 "tf.distribute.set_strategy() out of `with` scope."),
426 e)
--> 427 _pop_per_thread_mode()
428
429
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribution_strategy_context.py in _pop_per_thread_mode()
63
64 def _pop_per_thread_mode():
---> 65 ops.get_default_graph()._distribution_strategy_stack.pop(-1) # pylint: disable=protected-access
66
67
IndexError: pop from empty list
When I disable WandbCallback(), everything works fine. I use 2 GPUs on a single machine via TensorFlow's MirroredStrategy.
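For readers unfamiliar with this failure mode, the sketch below is a toy model of how a scope's `__exit__` can end up popping an empty stack, as in the last traceback frame. It is not TensorFlow or wandb code and not a diagnosis of this bug. `ToyGraph`, `ToyScope`, `set_default_graph`, and `run` are hypothetical names made up for illustration. The sketch assumes only that the stack lives on whatever the "default graph" is at the moment of the lookup.

```python
# Toy model only: every name here is a hypothetical stand-in, not TensorFlow code.


class ToyGraph:
    """Stand-in for a graph object that owns its own strategy stack."""

    def __init__(self):
        self.strategy_stack = []


_default_graph = ToyGraph()


def get_default_graph():
    return _default_graph


def set_default_graph(graph):
    global _default_graph
    _default_graph = graph


class ToyScope:
    """Pushes onto the current default graph's stack on enter, pops on exit."""

    def __enter__(self):
        get_default_graph().strategy_stack.append("strategy")
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # The graph is looked up again here, so it may differ from the one
        # that received the push in __enter__.
        get_default_graph().strategy_stack.pop(-1)
        return False


def run(swap_graph_inside_scope):
    """Return "ok", or the IndexError message if the exit-time pop fails."""
    try:
        with ToyScope():
            if swap_graph_inside_scope:
                # Something inside the scope replaces the default graph.
                set_default_graph(ToyGraph())
        return "ok"
    except IndexError as err:
        return str(err)
```

In this toy, `run(False)` pushes and pops on the same stack, while `run(True)` pushes onto one graph's stack and pops from a fresh, empty one. Whether anything comparable happens in the reporter's setup is not established by the issue.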
Additional Files
No response
Environment
WandB version: 0.13.1
OS: Ubuntu 18.04.5 LTS
Python version: 3.6.9
Versions of relevant libraries: tensorflow v2.3.1+nv
Additional Context
No response
Issue Analytics
- State:
- Created a year ago
- Comments:15

Hi @alikaanguven, @bermeitinger-b, @jnatale11. We have identified the regression that this issue stems from and have begun work on a fix. We will provide an update once it's released.
Hi @bermeitinger-b, @alikaanguven, thank you both for the feedback. We are still investigating and will work on reproducing the issue soon.