
Potential memory leak in the TensorFlow Swin model on Kaggle!


System Info

Info:

Framework: TensorFlow 2 (Keras)
Version: 2.6
OS: Kaggle

Who can help?

Swin Model Card: @amyeroberts; TensorFlow: @Rocketknight1; Vision: @NielsRogge, @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

In a recent Kaggle competition (hosted by Google), I tried to use a pretrained TF Swin transformer model from Hugging Face, but even with the base model I consistently received an out-of-memory error. Below is the submission status with a base_tf_swin model.

[Image: Kaggle submission status showing the out-of-memory failure for the base_tf_swin model]

Some notes:

  • Other frameworks like PyTorch work fine here.
  • Other than this model, much larger models like tf_convnext_xlarge are able to run without OOM.

So I'm assuming there might be a memory leak in the tf_swin implementation. Below is the code I use to build the complete model.

id = "microsoft/swin-base-patch4-window7-224-in22k"

from transformers import AutoFeatureExtractor, TFSwinModel
feature_extractor = AutoFeatureExtractor.from_pretrained(id)
inputs = keras.Input(shape=(None, None, 3), dtype='uint8')
mode_inputs = tf.cast(inputs, tf.float32)

mode_inputs = keras.layers.Resizing(*INPUT_SHAPE)(mode_inputs)
mode_inputs = keras.layers.Rescaling(scale=1.0 / 255)(mode_inputs)
mode_inputs = keras.layers.Normalization(
    mean=feature_extractor.image_mean,
    variance=[x ** 2 for x in feature_extractor.image_std ],
    axis=3
)(mode_inputs)
mode_inputs = keras.layers.Permute(dims=(3, 1, 2))(mode_inputs)

tf_huggingface_module = TFSwinModel.from_pretrained(id)
tf_huggingface_module.trainable = False
logits = tf_huggingface_module(mode_inputs)
adv_logits = keras.Dense(64)(logits.pooler_output)

outputs = keras.layers.Lambda(
    lambda x: tf.math.l2_normalize(x, axis=-1), name='embedding_norm'
)(adv_logits)

tf_huggingface_classifier = keras.Model(inputs, outputs)
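
As a sanity check, here is a minimal smoke test (not part of the original report; the input size is arbitrary since the model resizes internally) that confirms the wrapped model builds and emits L2-normalized 64-d embeddings:

# Hypothetical check, not from the issue: push one dummy batch through
# the classifier and inspect the output shape and norm.
dummy = tf.zeros((1, 256, 256, 3), dtype=tf.uint8)
embeddings = tf_huggingface_classifier(dummy)
print(embeddings.shape)                # (1, 64)
print(tf.norm(embeddings, axis=-1))    # ~1.0 after l2_normalize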

Expected behavior

It should work like other models. To reproduce the issue exactly, you may (in the worst case) need to run it on the Kaggle platform; the submission status (as shown in the image above) is not very descriptive beyond pass/fail 😦. Mainly, I'd like to know what could be the cause and any possible solution.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
ydshieh commented, Aug 18, 2022

Randomly jumping in this thread 😃

  • Are you able to reproduce this issue on a machine with similar specs to the Kaggle machines?
  • One way to narrow down the root cause is to gradually remove parts of the code.
  • From the provided notebook, we can't draw any conclusion about a memory leak. A memory leak means memory usage keeps increasing across repeated calls to the same code block.
  • Suggestion: check whether the issue occurs during model saving, or whether memory usage increases during inference; a simple check is sketched below.
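
A minimal sketch of such a check (assuming psutil is available and tf_huggingface_classifier from the report above is already built): roughly flat RSS across iterations argues against a leak in inference, while steady growth points to one.

import os
import psutil
import tensorflow as tf

process = psutil.Process(os.getpid())
dummy = tf.zeros((1, 224, 224, 3), dtype=tf.uint8)

# Repeat the identical inference call and watch resident memory.
for step in range(20):
    _ = tf_huggingface_classifier(dummy)
    rss_mb = process.memory_info().rss / 1024 ** 2
    print(f"step {step:2d}: RSS = {rss_mb:8.1f} MiB")
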
1 reaction
amyeroberts commented, Aug 17, 2022

Hi @innat. As mentioned above, it's quite hard to debug without knowing what's happening during submission and without logs from the Kaggle notebook. My current best guess is that it's due to the size of the saved Swin model.

Using your script to create and save out a model, I looked at the sizes across different checkpoints:

"microsoft/resnet-50"                              # 23,561,152 params
"google/vit-base-patch16-224-in21k"                # 86,389,248 params
"microsoft/swin-base-patch4-window7-224-in22k"     # 86,743,224 params
"microsoft/swin-tiny-patch4-window7-224"           # 27,519,354 params
"facebook/convnext-large-224-22k-1k"               # 196,230,336 params
tf_hf_classifier_convnext_large_224_22k_1k:
total 25712
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:13 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:13 assets
-rw-r--r--   1 amyroberts  staff   510K 10 Aug 13:13 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    12M 10 Aug 13:13 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:13 variables

tf_hf_classifier_resnet_50:
total 12048
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:51 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:51 assets
-rw-r--r--   1 amyroberts  staff   488K 10 Aug 12:51 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff   5.4M 10 Aug 12:51 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:51 variables

tf_hf_classifier_swin_base_patch4_window7_224_in22k:
total 179216
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:00 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:59 assets
-rw-r--r--   1 amyroberts  staff   7.4M 10 Aug 13:00 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    80M 10 Aug 13:00 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:59 variables

tf_hf_classifier_swin_tiny_patch4_window7_224:
total 83944
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:09 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:09 assets
-rw-r--r--   1 amyroberts  staff   474K 10 Aug 13:09 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    41M 10 Aug 13:09 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:09 variables

tf_hf_classifier_vit_base_patch16_224_in21k:
total 21328
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:53 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:53 assets
-rw-r--r--   1 amyroberts  staff   162K 10 Aug 12:53 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    10M 10 Aug 12:53 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:53 variables

I haven’t dug much into why the model is so much larger. A cursory glance at the model graphs didn’t reveal anything particularly surprising.
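
For reference, a comparison like the one above can be reproduced along these lines (a sketch, not the exact script used here; the checkpoint id and export path are illustrative):

import pathlib
from transformers import TFSwinModel

checkpoint = "microsoft/swin-base-patch4-window7-224-in22k"
model = TFSwinModel.from_pretrained(checkpoint)
print(f"params: {model.count_params():,}")

# Export as a TF SavedModel and total the on-disk size of its files.
export_dir = "tf_hf_swin_base"
model.save(export_dir)
size_bytes = sum(
    p.stat().st_size for p in pathlib.Path(export_dir).rglob("*") if p.is_file()
)
print(f"SavedModel size: {size_bytes / 1024 ** 2:.1f} MiB")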
