
Potential memory leak in the TensorFlow Swin model on Kaggle!


System Info

Info:

Framework: TensorFlow 2 (Keras)
Version: 2.6
OS: Kaggle

Who can help?

Swin Model Card: @amyeroberts; TensorFlow: @Rocketknight1; Vision: @NielsRogge, @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

In a recent Kaggle competition (hosted by Google), I tried to use a pretrained TF Swin transformer model from Hugging Face, but even with the base model I consistently received an out-of-memory error. Below is the submission status with a base_tf_swin model.

[Image: Kaggle submission status showing the out-of-memory failure for the base_tf_swin model]

Some notes:

  • Other frameworks like PyTorch work fine here.
  • Other than this model, much larger models like tf_convnext_xlarge are able to run without OOM.

So I'm assuming there might be a memory leak in the tf_swin implementation. Below is the code I use to build the complete model.

id = "microsoft/swin-base-patch4-window7-224-in22k"

from transformers import AutoFeatureExtractor, TFSwinModel
feature_extractor = AutoFeatureExtractor.from_pretrained(id)
inputs = keras.Input(shape=(None, None, 3), dtype='uint8')
mode_inputs = tf.cast(inputs, tf.float32)

mode_inputs = keras.layers.Resizing(*INPUT_SHAPE)(mode_inputs)
mode_inputs = keras.layers.Rescaling(scale=1.0 / 255)(mode_inputs)
mode_inputs = keras.layers.Normalization(
    mean=feature_extractor.image_mean,
    variance=[x ** 2 for x in feature_extractor.image_std ],
    axis=3
)(mode_inputs)
mode_inputs = keras.layers.Permute(dims=(3, 1, 2))(mode_inputs)

tf_huggingface_module = TFSwinModel.from_pretrained(id)
tf_huggingface_module.trainable = False
logits = tf_huggingface_module(mode_inputs)
adv_logits = keras.Dense(64)(logits.pooler_output)

outputs = keras.layers.Lambda(
    lambda x: tf.math.l2_normalize(x, axis=-1), name='embedding_norm'
)(adv_logits)

tf_huggingface_classifier = keras.Model(inputs, outputs)
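
As a sanity check, here is a minimal smoke test (not part of the original report; the input size is arbitrary since the model resizes internally) that confirms the wrapped model builds and emits L2-normalized 64-d embeddings:

# Hypothetical check, not from the issue: push one dummy batch through
# the classifier and inspect the output shape and norm.
dummy = tf.zeros((1, 256, 256, 3), dtype=tf.uint8)
embeddings = tf_huggingface_classifier(dummy)
print(embeddings.shape)                # (1, 64)
print(tf.norm(embeddings, axis=-1))    # ~1.0 after l2_normalize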

Expected behavior

It should work like other models. To reproduce the issue exactly, you may (in the worst case) need to run it on the Kaggle platform; the submission status (as shown in the image above) is not very descriptive beyond pass/fail 😦. Mainly, I'd like to know what could be the cause and any possible solution.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

2 reactions
ydshieh commented, Aug 18, 2022

Randomly jumping in this thread 😃

  • Are you able to reproduce this issue on a machine with similar specs to the Kaggle machines?
  • One way to narrow down the root cause is to gradually remove parts of the code.
  • From the provided notebook, we can't draw any conclusion about a memory leak. A memory leak means memory usage keeps increasing across repeated calls to the same code block.
  • Suggestion: check whether the issue occurs during model saving, or whether memory usage increases during inference; a simple check is sketched below.
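
A minimal sketch of such a check (assuming psutil is available and tf_huggingface_classifier from the report above is already built): roughly flat RSS across iterations argues against a leak in inference, while steady growth points to one.

import os
import psutil
import tensorflow as tf

process = psutil.Process(os.getpid())
dummy = tf.zeros((1, 224, 224, 3), dtype=tf.uint8)

# Repeat the identical inference call and watch resident memory.
for step in range(20):
    _ = tf_huggingface_classifier(dummy)
    rss_mb = process.memory_info().rss / 1024 ** 2
    print(f"step {step:2d}: RSS = {rss_mb:8.1f} MiB")
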
1 reaction
amyeroberts commented, Aug 17, 2022

Hi @innat. As mentioned above, it's quite hard to debug without knowing what's happening during submission and without logs from the Kaggle notebook. My current best guess is that it's due to the size of the saved Swin model.

Using your script to create and save out a model, I looked at the sizes across different checkpoints:

"microsoft/resnet-50"                              # 23,561,152 params
"google/vit-base-patch16-224-in21k"                # 86,389,248 params
"microsoft/swin-base-patch4-window7-224-in22k"     # 86,743,224 params
"microsoft/swin-tiny-patch4-window7-224"           # 27,519,354 params
"facebook/convnext-large-224-22k-1k"               # 196,230,336 params
tf_hf_classifier_convnext_large_224_22k_1k:
total 25712
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:13 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:13 assets
-rw-r--r--   1 amyroberts  staff   510K 10 Aug 13:13 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    12M 10 Aug 13:13 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:13 variables

tf_hf_classifier_resnet_50:
total 12048
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:51 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:51 assets
-rw-r--r--   1 amyroberts  staff   488K 10 Aug 12:51 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff   5.4M 10 Aug 12:51 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:51 variables

tf_hf_classifier_swin_base_patch4_window7_224_in22k:
total 179216
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:00 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:59 assets
-rw-r--r--   1 amyroberts  staff   7.4M 10 Aug 13:00 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    80M 10 Aug 13:00 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:59 variables

tf_hf_classifier_swin_tiny_patch4_window7_224:
total 83944
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:09 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:09 assets
-rw-r--r--   1 amyroberts  staff   474K 10 Aug 13:09 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    41M 10 Aug 13:09 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:09 variables

tf_hf_classifier_vit_base_patch16_224_in21k:
total 21328
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:53 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:53 assets
-rw-r--r--   1 amyroberts  staff   162K 10 Aug 12:53 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    10M 10 Aug 12:53 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:53 variables

I haven’t dug much into why the model is so much larger. A cursory glance at the model graphs didn’t reveal anything particularly surprising.
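
For reference, a comparison like the one above can be reproduced along these lines (a sketch, not the exact script used here; the checkpoint id and export path are illustrative):

import pathlib
from transformers import TFSwinModel

checkpoint = "microsoft/swin-base-patch4-window7-224-in22k"
model = TFSwinModel.from_pretrained(checkpoint)
print(f"params: {model.count_params():,}")

# Export as a TF SavedModel and total the on-disk size of its files.
export_dir = "tf_hf_swin_base"
model.save(export_dir)
size_bytes = sum(
    p.stat().st_size for p in pathlib.Path(export_dir).rglob("*") if p.is_file()
)
print(f"SavedModel size: {size_bytes / 1024 ** 2:.1f} MiB")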
