Interrupting training crashes Python kernel when training on GPU for the first time with MirroredStrategy
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): binary (pip install tensorflow)
- TensorFlow version (use command below): 2.4.3
- Python version: 3.7
- GPU model and memory: Nvidia Tesla V100
Describe the problem.
When using multi-GPU training in a Jupyter Notebook:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = ...  # build the Keras model inside the strategy scope
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
Then launching training:
model.fit(...)
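For reference, here is a minimal self-contained sketch of this setup. The actual model and dataset are not part of the report, so a small dense classifier and random arrays are assumed as stand-ins:

import numpy as np
import tensorflow as tf

# Stand-ins for the real data: 1024 random samples, 10 one-hot classes
x_train = np.random.rand(1024, 32).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(10, size=1024), num_classes=10)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Stand-in for the real model: a small dense classifier
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

# Interrupting this first fit() call (notebook Stop button) is what kills the kernel
model.fit(x_train, y_train, epochs=5, batch_size=64)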
Describe the current behavior.
- when the initial training runs to the end, subsequent trainings can be interrupted fine
- when the initial training itself is interrupted, the Jupyter kernel dies
Detailed step-by-step procedure:
1. Compile the model with MirroredStrategy.
2. Launch training with model.fit().
3. Let the training run to completion (if it is interrupted at this point, the kernel dies).
4. Once the initial training has completed, re-compile the model with MirroredStrategy.
5. Launch another training.
6. Interrupt this training: this works fine, the kernel does not die.
If training is interrupted at step 3, the kernel dies systematically.
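As a sketch, steps 4-6 amount to re-running the compile-and-fit sequence. Here make_model() is a hypothetical helper standing in for whatever builds the actual model, and x_train/y_train for the actual data:

# Steps 4-6: after the initial training has run to completion
with strategy.scope():
    model = make_model()  # hypothetical model builder
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

# Interrupting this run is handled cleanly and the kernel survives
model.fit(x_train, y_train, epochs=10)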

Describe the expected behavior.
- when the initial training is interrupted, the Jupyter kernel should not die
Source code / logs.
No error message is produced; the Jupyter IPython kernel simply dies.
Current workaround.
The workaround I found is to run the first training with just 1 epoch and let it complete; after that, I can recompile the model and re-train at will without crashes. Less than ideal, but it works.
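A sketch of that workaround, using the same placeholder names as above:

# First training: a single epoch, allowed to run to completion
with strategy.scope():
    model = make_model()  # hypothetical model builder
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1)

# Re-compile and train for real; this run can now be interrupted
# without killing the kernel
with strategy.scope():
    model = make_model()
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.fit(x_train, y_train, epochs=50)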
@JivanRoquet Quick question to find out whether the root cause is Keras or the distribution strategy (my guess is the distribution strategy). Are you using a Keras model or a TF model? Did you try without MirroredStrategy? Do you notice this issue without a distribution strategy?
If you think this is more related to the distribution strategy, please open this issue in the TF repository (https://github.com/tensorflow/tensorflow/issues), as this Keras repo is mainly for Keras-related issues. Thanks!
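One way to isolate the distribution strategy is to run the same training without any strategy scope (a sketch, same placeholders as above); if interrupting this run leaves the kernel alive, the crash is likely tied to MirroredStrategy:

# Same model and data, but no distribution strategy
model = make_model()  # hypothetical model builder
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

# Interrupt this run and check whether the kernel survives
model.fit(x_train, y_train, epochs=5)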
Sure thing. Closing here for now.