Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[CLI]: v0.13.1 Training results in IndexError: pop when using WandbCallback

See original GitHub issue

Describe the bug

Training starts, but never completes due to the bug. At the end of the training phase I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-f5faf23168b9> in <module>
      1 results = model.fit(z_train, y_train, batch_size=config.batch_size, epochs=config.epochs, callbacks=callbacks,\
----> 2                     validation_data=(z_test, y_test))

~/.local/lib/python3.6/site-packages/wandb/integration/keras/keras.py in new_v2(*args, **kwargs)
    171             for cbk in cbks:
    172                 set_wandb_attrs(cbk, val_data)
--> 173         return old_v2(*args, **kwargs)
    174 
    175     training_arrays.orig_fit_loop = old_arrays

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1144         del self._eval_data_handler
   1145       callbacks.on_train_end(logs=training_logs)
-> 1146       return self.history
   1147 
   1148   def test_step(self, data):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py in __exit__(self, exception_type, exception_value, traceback)
    425                          "tf.distribute.set_strategy() out of `with` scope."),
    426             e)
--> 427     _pop_per_thread_mode()
    428 
    429 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribution_strategy_context.py in _pop_per_thread_mode()
     63 
     64 def _pop_per_thread_mode():
---> 65   ops.get_default_graph()._distribution_strategy_stack.pop(-1)  # pylint: disable=protected-access
     66 
     67 

IndexError: pop from empty list

When I disable WandbCallback() everything works fine. I use 2 GPUs on single machine via MirroredStrategy on tensorflow.

Additional Files

No response

Environment

WandB version: 0.13.1

OS: Ubuntu 18.04.5 LTS

Python version: 3.6.9

Versions of relevant libraries: tensorflow v2.3.1+nv

Additional Context

No response

Issue Analytics

State:
Created a year ago
Comments:15

Top GitHub Comments

2reactions

MBakirWBcommented, Sep 24, 2022

Hi @alikaanguven , @bermeitinger-b , @jnatale11 . We have identified the regression where this issue is stemming from and have began work on a fix and will provide an update once it’s released.

1reaction

MBakirWBcommented, Sep 1, 2022

Hi @bermeitinger-b , @alikaanguven , thank-you both for the feedback. We are still investigating and will work on reproducing soon.