
Kernel freeze at tf.keras.Sequential.fit()


What did I do?

Link to Colab: https://colab.research.google.com/drive/1g6BFapSuG0-WCQzxlrDsPKCcmaGemB9f?usp=sharing

Please request access using the email connected to your GitHub account - I'll accept it. The notebook is related to my graduation project and I don't want the work to go fully public yet.

I created a custom layer with a quantum circuit built in quantum_circuit() to represent an 8x8 image: 4 readout qubits with two H gates each, connected to 16 qubits by ZZ**(param) gates per readout (an 8x8 extension of what can be found in the MNIST classification example).

The image is divided into four 4x4 pieces, each connected to a single readout qubit.

The data is encoded similarly to what can be found in the example (an X gate is applied if normalized_color > 0.5).
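
For context, the encoding would look roughly like the convert_to_circuit() helper from that example (a sketch only; the exact function in the private notebook may differ):

import cirq
import numpy as np

def convert_to_circuit(image):
    # One qubit per pixel; apply an X gate wherever the normalized
    # pixel value exceeds 0.5, as described above.
    values = np.ndarray.flatten(image)
    qubits = cirq.GridQubit.rect(8, 8)
    circuit = cirq.Circuit()
    for value, qubit in zip(values, qubits):
        if value > 0.5:
            circuit.append(cirq.X(qubit))
    return circuit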

I attached a softmax layer directly to the quantum one for classification using a tf.keras.Sequential model, since I want to extend it further - up to all 10 digits.
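
A minimal sketch of how quantum_circuit() might assemble the model circuit and readout operators used below (the real function is in the private notebook; the qubit placement, parameter naming, and return signature here are my assumptions):

import cirq
import sympy

def quantum_circuit():
    # One data qubit per pixel of the compressed 8x8 image.
    data_qubits = cirq.GridQubit.rect(8, 8)
    # One readout qubit per 4x4 quadrant, placed off the main grid.
    readouts = [cirq.GridQubit(-1, i) for i in range(4)]
    circuit = cirq.Circuit()
    circuit.append(cirq.H.on_each(readouts))            # first H on each readout
    params = sympy.symbols('theta0:64')                 # one parameter per pixel
    for i, q in enumerate(data_qubits):
        quadrant = (q.row // 4) * 2 + (q.col // 4)      # which 4x4 piece q belongs to
        circuit.append((cirq.ZZ ** params[i]).on(q, readouts[quadrant]))
    circuit.append(cirq.H.on_each(readouts))            # second H on each readout
    return circuit, [cirq.Z(r) for r in readouts]

model_circuit, model_readout = quantum_circuit()

With 64 symbols, this matches the 64 trainable parameters reported for the quantum layer in the summary below.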

qnn_model = tf.keras.Sequential([
    # Input: circuits serialized as tf.string tensors.
    tf.keras.Input(shape=(), dtype=tf.string, name='q_input'),
    # Parametrized quantum circuit layer: one expectation value per readout qubit.
    tfq.layers.PQC(model_circuit, model_readout, name='quantum'),
    # Classical classification head on top of the 4 expectation values.
    tf.keras.layers.Dense(2, activation=tf.keras.activations.softmax, name='softmax'),
])
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantum (PQC)                (None, 4)                 64        
_________________________________________________________________
softmax (Dense)              (None, 2)                 10        
=================================================================
Total params: 74
Trainable params: 74
Non-trainable params: 0
_________________________________________________________________

I compiled the model and tried to fit it.
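
The exact compile/fit calls aren't shown here; a hypothetical version (the loss, optimizer, batch size, and tensor names are assumptions) would be:

qnn_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),  # assumed; matches the softmax head
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'])

history = qnn_model.fit(
    x_train_tfcirc, y_train,      # circuits serialized with tfq.convert_to_tensor
    batch_size=32,
    epochs=10,                    # matches the "Epoch 1/10" output below
    validation_data=(x_test_tfcirc, y_test))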

What was expected to happen?

The model should start iterating over the given number of epochs.

What happened?

Epoch 1/10 is displayed, but nothing else happens.

  • In Colab, the kernel restarts, producing the log that can be found in the Attachments section.
  • In the local WSL2 environment, I encountered what I would call 'a kernel freeze': the cell appeared to be running, but nothing was happening - no CPU or RAM usage. The operation could not be interrupted; only a kernel restart worked.

Environment

tensorflow          2.3.1
tensorflow-quantum  0.4.0

for both:

  • Google Colab
  • Windows Subsystem Linux 2 (Ubuntu 20.04.1 LTS; Windows 10 Pro, build 20270)

No GPU involved.

What did I find out?

When I try to run the notebook with compressed_image_size = 4, everything works as intended. I've checked my quantum_circuit(), and it seems to work as intended for the 8x8 version as well - it generates a circuit with the desired architecture.
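
For reference, a self-contained sketch of how the compression target drives the qubit count (the tf.image.resize downsampling step is an assumption borrowed from the MNIST example, not taken from the private notebook):

import numpy as np
import tensorflow as tf

compressed_image_size = 4            # 4 works; 8 freezes the kernel
compressed_image_shape = (compressed_image_size, compressed_image_size)
x = np.random.rand(10, 28, 28, 1)    # stand-in batch of images
x_small = tf.image.resize(x, compressed_image_shape).numpy()
print(x_small.shape)                 # (10, 4, 4, 1) - 16 qubits; (8, 8) would need 64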

When I tried to trace down the error, I found out that in data_adapter.py, enumerate_epochs() yields the correct epoch, but the tf.data.Iterator data_iterator raises AttributeErrors like

AttributeError: 'OwnedIterator' object has no attribute '_self_unconditional_checkpoint_dependencies'

in _checkpoint_dependencies and _deferred_dependencies, and

AttributeError: 'OwnedIterator' object has no attribute '_self_name_based_restores'

in _name_based_restores, as well as:

AttributeError("'OwnedIterator' object has no attribute '_self_unconditional_checkpoint_dependencies'")
AttributeError("'OwnedIterator' object has no attribute '_self_unconditional_dependency_names'")
AttributeError("'OwnedIterator' object has no attribute '_self_update_uid'")
I’m not sure if this is relevant.

Attachments

colab-jupyter.log

Dec 15, 2020, 10:41:32 AM | WARNING | WARNING:root:kernel b6193863-8d44-476f-b8cc-eadbe7129967 restarted
Dec 15, 2020, 10:41:32 AM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.133076: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.133022: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b91640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.131837: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
Dec 15, 2020, 10:40:56 AM | WARNING | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.125112: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.124271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0071d832075f): /proc/driver/nvidia/version does not exist
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.123595: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.109400: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
Dec 15, 2020, 10:40:53 AM | WARNING | 2020-12-15 09:40:53.250994: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Dec 15, 2020, 10:37:53 AM | WARNING | WARNING:root:kernel b6193863-8d44-476f-b8cc-eadbe7129967 restarted
Dec 15, 2020, 10:37:53 AM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.601416: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.601370: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20c3640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.600345: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
Dec 15, 2020, 10:36:24 AM | WARNING | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.593357: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.592695: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0071d832075f): /proc/driver/nvidia/version does not exist
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.592632: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.531111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
Dec 15, 2020, 10:36:20 AM | WARNING | 2020-12-15 09:36:20.926549: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Dec 15, 2020, 10:36:01 AM | INFO | Adapting to protocol v5.1 for kernel b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:42 AM | INFO | Adapting to protocol v5.1 for kernel b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:41 AM | INFO | Kernel started: b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:13 AM | INFO | Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Dec 15, 2020, 10:33:13 AM | INFO | http://172.28.0.2:9000/
Dec 15, 2020, 10:33:13 AM | INFO | The Jupyter Notebook is running at:
Dec 15, 2020, 10:33:13 AM | INFO | 0 active kernels
Dec 15, 2020, 10:33:13 AM | INFO | Serving notebooks from local directory: /
Dec 15, 2020, 10:33:13 AM | INFO | google.colab serverextension initialized.
Dec 15, 2020, 10:33:13 AM | INFO | Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Dec 15, 2020, 10:33:13 AM | WARNING | Config option `delete_to_trash` not recognized by `ColabFileContentsManager`.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

1 reaction
MichaelBroughton commented, Jul 22, 2021

That’s awesome! Always happy to see more publications making use of TFQ!

1 reaction
MichaelBroughton commented, Dec 16, 2020

No problem. So at first glance I think you’ve solved your own problem in your comment on the side there.

The compressed_image_size is too big with a value of 8. A quick review of quantum circuit simulation:

Simulating n qubits requires memory proportional to 2^n. So, looking at your code:

compressed_image_size=8 => compressed_image_shape = (8,8)

Then in the line: qubits = cirq.GridQubit.rect(*compressed_image_shape) => len(qubits) == 64

Mathing that out really quick: a state vector of 2^64 complex amplitudes, where each amplitude takes 64 bits, means you requested about 147 exabytes of RAM. A bit too much 😃. In general, simulations cap out around 30 qubits; with some serious hardware you might be able to push things up to 35-40.
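
A quick back-of-the-envelope check of that estimate in Python:

n_qubits = 8 * 8                # one qubit per pixel of the 8x8 image
amplitudes = 2 ** n_qubits      # state-vector length for n qubits
total_bytes = amplitudes * 8    # one 64-bit complex amplitude = 8 bytes
print(total_bytes / 1e18)       # ~147.6 exabytes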

My guess is that the malloc call didn't fail gracefully at that size, which is a bug we should probably look into. Does this help clear things up?


