
Interrupting training crashes Python kernel when training on GPU for the first time with MirroredStrategy

See original GitHub issue

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary (pip install tensorflow)
  • TensorFlow version (use command below): 2.4.3
  • Python version: 3.7
  • GPU model and memory: Nvidia Tesla V100

Describe the problem.

When using multi-GPU training in a Jupyter Notebook:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # create and compile the model inside the strategy scope;
    # note that compile() returns None, so its result must not be assigned
    model = ...  # model definition elided in the original report
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

Then launching training:

model.fit(...)
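
For context, a minimal end-to-end sketch of the pattern described above might look like the following. The dataset, model architecture, batch size, and epoch count are placeholders chosen for illustration; they are not taken from the original report.

import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real dataset
x = np.random.random((1024, 32)).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(10, size=1024), num_classes=10)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # model is created and compiled inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Interrupting this first run (Kernel -> Interrupt in Jupyter) is what kills the kernel
model.fit(x, y, epochs=10, batch_size=64)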

Describe the current behavior.

  • when the initial training run completes, subsequent training runs can be interrupted without issue
  • when the initial training run itself is interrupted, the Jupyter kernel dies

Detailed step-by-step procedure:

  1. compile the model with MirroredStrategy
  2. launch training with model.fit()
  3. let the training run to completion (if it is interrupted at this point, the kernel dies)
  4. once the initial training has completed, re-compile the model with MirroredStrategy
  5. launch another training run
  6. interrupt this training run -> works fine, the kernel doesn't die

If training is interrupted at step 3, the kernel dies systematically.
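
As a diagnostic (not part of the original report), one way to see whether the interrupt is handled at the Python level is to wrap the first fit() call in a try/except. In a healthy setup, interrupting fit() raises KeyboardInterrupt inside the cell and the kernel survives; in the failing case described here, the kernel process dies before the except block is ever reached.

try:
    # first training run under MirroredStrategy
    model.fit(x, y, epochs=10, batch_size=64)
except KeyboardInterrupt:
    # reached only if the interrupt propagates as a normal Python exception
    print("Training interrupted cleanly; kernel still alive")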

[Screenshot attached in the original issue, 2021-11-11 at 13:13]

Describe the expected behavior.

  • when initial training is interrupted, the Jupyter kernel should not die

Source code / logs.

No error message is produced; the Jupyter IPython kernel simply dies silently.

Current workaround.

The workaround I found is to run the first training for just 1 epoch and let it complete; after that, I can recompile the model and re-train at will with no crashes. Less than ideal, but it works.
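
A sketch of that workaround, reusing the placeholder model and data from the earlier snippet (the epoch counts are illustrative):

# Warm-up run: a single epoch that is allowed to finish
model.fit(x, y, epochs=1, batch_size=64)

# Re-compile under the same strategy, then train normally;
# interrupting these later runs no longer kills the kernel
with strategy.scope():
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x, y, epochs=50, batch_size=64)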

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
jvishnuvardhan commented, Nov 12, 2021

@JivanRoquet Quick question to find out whether the root cause is Keras or the distribution strategy (my guess is the distribution strategy). Are you using a Keras model or a TF model? Did you try without MirroredStrategy? Do you notice this issue without a distribution strategy?

If you think this is more related to the distribution strategy, please open this issue in the TF repository (https://github.com/tensorflow/tensorflow/issues), as this Keras repo is mainly for Keras-related issues. Thanks!
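
For reference, a minimal way to run the check the maintainer suggests (the same training without any distribution strategy) could look like this, reusing the placeholder data from the sketches above; this is not code from the thread:

# Same placeholder model, created and compiled without a distribution strategy
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Interrupt this run; if the kernel survives, the crash is tied to MirroredStrategy
model.fit(x, y, epochs=10, batch_size=64)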

0 reactions
JivanRoquet commented, Nov 12, 2021

Sure thing. Closing here for now.
