
Interrupting training crashes Python kernel when training on GPU for the first time with MirroredStrategy

See original GitHub issue

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary (pip install tensorflow)
  • TensorFlow version (use command below): 2.4.3
  • Python version: 3.7
  • GPU model and memory: Nvidia Tesla V100

Describe the problem.

When using multi-GPU training in a Jupyter Notebook:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # create and compile the model inside the strategy scope;
    # note that compile() returns None, so its result must not be assigned
    model = ...  # model definition elided in the original report
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

Then launching training:

model.fit(...)
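
For context, a minimal end-to-end sketch of the pattern described above might look like the following. The dataset, model architecture, batch size, and epoch count are placeholders chosen for illustration; they are not taken from the original report.

import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real dataset
x = np.random.random((1024, 32)).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(10, size=1024), num_classes=10)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # model is created and compiled inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Interrupting this first run (Kernel -> Interrupt in Jupyter) is what kills the kernel
model.fit(x, y, epochs=10, batch_size=64)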

Describe the current behavior.

  • when the initial training run completes, subsequent training runs can be interrupted without issue
  • when the initial training run itself is interrupted, the Jupyter kernel dies

Detailed step-by-step procedure:

  1. compile the model with MirroredStrategy
  2. launch training with model.fit()
  3. let the training run to completion (if it is interrupted at this point, the kernel dies)
  4. once the initial training has completed, re-compile the model with MirroredStrategy
  5. launch another training run
  6. interrupt this training run -> works fine, the kernel doesn't die

If training is interrupted at step 3, the kernel dies systematically.
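
As a diagnostic (not part of the original report), one way to see whether the interrupt is handled at the Python level is to wrap the first fit() call in a try/except. In a healthy setup, interrupting fit() raises KeyboardInterrupt inside the cell and the kernel survives; in the failing case described here, the kernel process dies before the except block is ever reached.

try:
    # first training run under MirroredStrategy
    model.fit(x, y, epochs=10, batch_size=64)
except KeyboardInterrupt:
    # reached only if the interrupt propagates as a normal Python exception
    print("Training interrupted cleanly; kernel still alive")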

[Screenshot attached in the original issue, 2021-11-11 at 13:13]

Describe the expected behavior.

  • when initial training is interrupted, the Jupyter kernel should not die

Source code / logs.

No error message is produced; the Jupyter IPython kernel simply dies silently.

Current workaround.

The workaround I found is to run the first training for just 1 epoch and let it complete; after that, I can recompile the model and re-train at will with no crashes. Less than ideal, but it works.
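
A sketch of that workaround, reusing the placeholder model and data from the earlier snippet (the epoch counts are illustrative):

# Warm-up run: a single epoch that is allowed to finish
model.fit(x, y, epochs=1, batch_size=64)

# Re-compile under the same strategy, then train normally;
# interrupting these later runs no longer kills the kernel
with strategy.scope():
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x, y, epochs=50, batch_size=64)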

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
jvishnuvardhan commented, Nov 12, 2021

@JivanRoquet Quick question to find out whether the root cause is Keras or the distribution strategy (my guess is the distribution strategy). Are you using a Keras model or a TF model? Did you try without MirroredStrategy? Do you notice this issue without a distribution strategy?

If you think this is more related to the distribution strategy, please open this issue in the TF repository (https://github.com/tensorflow/tensorflow/issues), as this Keras repo is mainly for Keras-related issues. Thanks!
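
For reference, a minimal way to run the check the maintainer suggests (the same training without any distribution strategy) could look like this, reusing the placeholder data from the sketches above; this is not code from the thread:

# Same placeholder model, created and compiled without a distribution strategy
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Interrupt this run; if the kernel survives, the crash is tied to MirroredStrategy
model.fit(x, y, epochs=10, batch_size=64)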

0 reactions
JivanRoquet commented, Nov 12, 2021

Sure thing. Closing here for now.
