Memory leak with resnet image encoder

Describe the bug

While testing with synthetic image data, I encountered this error during model.train() with the Ludwig ResNet image encoder.

╒══════════╕
│ TRAINING │
╘══════════╛

Training for 2 step(s), approximately 2 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:   0%|          | 0/2 [00:00<?, ?it/s]
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

The error occurs in a 12 GB Docker container, and ONLY with the resnet encoder. All the other encoders (stacked_cnn, mlp_mixer and vit) work as expected.
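For context on the status in the log above: shells and IDE runners conventionally report a process terminated by signal N as exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL). That is consistent with the kernel's out-of-memory killer stopping the process once the container hits its memory limit, although the status alone does not prove it. A minimal sketch of decoding such a status with the standard library follows; the helper name `describe_exit_status` is hypothetical and not part of Ludwig.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description.

    Statuses above 128 conventionally mean "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

On Linux this prints "killed by SIGKILL" for 137. A SIGKILL cannot be caught by the process itself, which is why the training log ends without a Python traceback.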

To Reproduce

Steps to reproduce the behavior:

Run this reproducible example:

import logging
import os
import shutil

from ludwig.api import LudwigModel
from ludwig.data.dataset_synthesizer import cli_synthesize_dataset

FEATURES_LIST = [
    {"name": "category", "type": "category"},
    {
        "name": "image", "type": "image",
        "destination_folder": os.path.join(os.getcwd(), "data2/images"),
        "preprocessing": {"height": 224, "width": 224, "num_channels": 3}
    }
]

CONFIG = {
    "input_features": [
        {
            "name": "image", "type": "image",
            "encoder": {
                "type": "resnet",
            },
        }
    ],
    "output_features": [
        {
            "name": "category", "type": "category",
        }
    ],
    "trainer": {"epochs": 2},
}


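if __name__ == "__main__":
    # Start from a clean working directory for the synthetic data.
    shutil.rmtree("data2", ignore_errors=True)
    os.makedirs("data2", exist_ok=True)
    # Generate 40 synthetic rows: a category label plus a 224x224x3 image each.
    cli_synthesize_dataset(40, FEATURES_LIST, "data2/syn_train.csv")

    # With the resnet encoder, the process is killed during train().
    model = LudwigModel(CONFIG, logging_level=logging.INFO)
    model.train(dataset="data2/syn_train.csv")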
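Because the report says only the resnet encoder fails, it can help to run the same script with nothing changed except the encoder type. The sketch below builds one config per encoder from a shared base; `BASE_CONFIG` and `config_for_encoder` are hypothetical names introduced here, and the block only constructs dictionaries (it does not import Ludwig or train anything).

```python
import copy

# Same structure as CONFIG in the script above.
BASE_CONFIG = {
    "input_features": [
        {"name": "image", "type": "image", "encoder": {"type": "resnet"}}
    ],
    "output_features": [{"name": "category", "type": "category"}],
    "trainer": {"epochs": 2},
}


def config_for_encoder(encoder_type: str) -> dict:
    """Return a deep copy of BASE_CONFIG with the image encoder type replaced."""
    config = copy.deepcopy(BASE_CONFIG)
    config["input_features"][0]["encoder"]["type"] = encoder_type
    return config


# Encoder names taken from the report. Each resulting dict could be passed to
# LudwigModel(...) in place of CONFIG, ideally in a separate process per run so
# that one killed run does not take the others down with it.
for encoder_type in ("resnet", "stacked_cnn", "vit"):
    encoder = config_for_encoder(encoder_type)["input_features"][0]["encoder"]
    print(encoder_type, encoder)
```

The deep copy keeps the base config untouched, so every variant differs from the failing one in exactly one field.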

Expected behavior

model.train() runs to completion.

Screenshots

Here is a screenshot of the output from docker stats for the container just before the out-of-memory error occurs (image not included: Screen Shot 2022-09-11 at 16 48 28).

As the program is running, I can see MEM USAGE increase until the Error 137 is reported.
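docker stats reports memory for the whole container. To watch the growth from inside the Python process as well, one option is the standard library's `resource` module, which exposes the peak resident set size. This is a minimal sketch, not something the original report used: the helper name `peak_rss_mib` is hypothetical, and the 50 MiB allocation only stands in for a training step.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
# Stand-in for a training step: allocate and fill roughly 50 MiB.
buffer = b"x" * (50 * 1024 * 1024)
after = peak_rss_mib()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Calling such a helper before and after model.train(), or from a callback between steps, would give a number to compare against the container limit. Since the value is a high-water mark, it never decreases, so it shows how much memory was needed rather than how much is currently held.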

Environment (please complete the following information):

OS: macOS 12.5.1 with Docker Desktop 4.9.1 (81317)
Python version: 3.8
Ludwig version: 0.6.dev on the master branch

Additional context

Here is the full log file:

77567a00d862:python -u /opt/project/sandbox/vision_models/mwe_dask_backend_issue1.py
NumExpr defaulting to 8 threads.

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤═══════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                    │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                               │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Output directory │ /opt/project/sandbox/vision_models/results/api_experiment_run_552 │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ ludwig_version   │ '0.6.dev'                                                         │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ command          │ '/opt/project/sandbox/vision_models/mwe_dask_backend_issue1.py'   │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ commit_hash      │ '4b0825bd4be7'                                                    │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ random_seed      │ 42                                                                │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ dataset          │ 'data2/syn_train.csv'                                             │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ data_format      │ 'csv'                                                             │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ torch_version    │ '1.12.1+cu102'                                                    │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ compute          │ {'num_nodes': 1}                                                  │
╘══════════════════╧═══════════════════════════════════════════════════════════════════╛

╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛

{   'combiner': {   'activation': 'relu',
                    'bias_initializer': 'zeros',
                    'dropout': 0.0,
                    'fc_layers': None,
                    'flatten_inputs': False,
                    'norm': None,
                    'norm_params': None,
                    'num_fc_layers': 0,
                    'output_size': 256,
                    'residual': False,
                    'type': 'concat',
                    'use_bias': True,
                    'weights_initializer': 'xavier_uniform'},
    'defaults': {   'audio': {   'preprocessing': {   'audio_file_length_limit_in_s': 7.5,
                                                      'computed_fill_value': None,
                                                      'fill_value': None,
                                                      'in_memory': True,
                                                      'missing_value_strategy': 'bfill',
                                                      'norm': None,
                                                      'num_fft_points': None,
                                                      'num_filter_bands': 80,
                                                      'padding_value': 0.0,
                                                      'type': 'fbank',
                                                      'window_length_in_s': 0.04,
                                                      'window_shift_in_s': 0.02,
                                                      'window_type': 'hamming'}},
                    'bag': {   'preprocessing': {   'computed_fill_value': '<UNK>',
                                                    'fill_value': '<UNK>',
                                                    'lowercase': False,
                                                    'missing_value_strategy': 'fill_with_const',
                                                    'most_common': 10000,
                                                    'tokenizer': 'space'}},
                    'binary': {   'preprocessing': {   'computed_fill_value': None,
                                                       'fallback_true_label': None,
                                                       'fill_value': None,
                                                       'missing_value_strategy': 'fill_with_false'}},
                    'category': {   'preprocessing': {   'computed_fill_value': '<UNK>',
                                                         'fill_value': '<UNK>',
                                                         'lowercase': False,
                                                         'missing_value_strategy': 'fill_with_const',
                                                         'most_common': 10000}},
                    'date': {   'preprocessing': {   'computed_fill_value': '',
                                                     'datetime_format': None,
                                                     'fill_value': '',
                                                     'missing_value_strategy': 'fill_with_const'}},
                    'h3': {   'preprocessing': {   'computed_fill_value': 576495936675512319,
                                                   'fill_value': 576495936675512319,
                                                   'missing_value_strategy': 'fill_with_const'}},
                    'image': {   'preprocessing': {   'computed_fill_value': None,
                                                      'fill_value': None,
                                                      'height': None,
                                                      'in_memory': True,
                                                      'infer_image_dimensions': True,
                                                      'infer_image_max_height': 256,
                                                      'infer_image_max_width': 256,
                                                      'infer_image_num_channels': True,
                                                      'infer_image_sample_size': 100,
                                                      'missing_value_strategy': 'bfill',
                                                      'num_channels': None,
                                                      'num_processes': 1,
                                                      'resize_method': 'interpolate',
                                                      'scaling': 'pixel_normalization',
                                                      'width': None}},
                    'number': {   'preprocessing': {   'computed_fill_value': 0.0,
                                                       'fill_value': 0.0,
                                                       'missing_value_strategy': 'fill_with_const',
                                                       'normalization': None}},
                    'sequence': {   'preprocessing': {   'computed_fill_value': '<UNK>',
                                                         'fill_value': '<UNK>',
                                                         'lowercase': False,
                                                         'max_sequence_length': 256,
                                                         'missing_value_strategy': 'fill_with_const',
                                                         'most_common': 20000,
                                                         'padding': 'right',
                                                         'padding_symbol': '<PAD>',
                                                         'tokenizer': 'space',
                                                         'unknown_symbol': '<UNK>',
                                                         'vocab_file': None}},
                    'set': {   'preprocessing': {   'computed_fill_value': '<UNK>',
                                                    'fill_value': '<UNK>',
                                                    'lowercase': False,
                                                    'missing_value_strategy': 'fill_with_const',
                                                    'most_common': 10000,
                                                    'tokenizer': 'space'}},
                    'text': {   'preprocessing': {   'computed_fill_value': '<UNK>',
                                                     'fill_value': '<UNK>',
                                                     'lowercase': True,
                                                     'max_sequence_length': 256,
                                                     'missing_value_strategy': 'fill_with_const',
                                                     'most_common': 20000,
                                                     'padding': 'right',
                                                     'padding_symbol': '<PAD>',
                                                     'pretrained_model_name_or_path': None,
                                                     'tokenizer': 'space_punct',
                                                     'unknown_symbol': '<UNK>',
                                                     'vocab_file': None}},
                    'timeseries': {   'preprocessing': {   'computed_fill_value': '',
                                                           'fill_value': '',
                                                           'missing_value_strategy': 'fill_with_const',
                                                           'padding': 'right',
                                                           'padding_value': 0.0,
                                                           'timeseries_length_limit': 256,
                                                           'tokenizer': 'space'}},
                    'vector': {   'preprocessing': {   'computed_fill_value': '',
                                                       'fill_value': '',
                                                       'missing_value_strategy': 'fill_with_const',
                                                       'vector_size': None}}},
    'input_features': [   {   'column': 'image',
                              'encoder': {'type': 'resnet'},
                              'name': 'image',
                              'preprocessing': {},
                              'proc_column': 'image_mZFLky',
                              'tied': None,
                              'type': 'image'}],
    'ludwig_version': '0.6.dev',
    'model_type': 'ecd',
    'output_features': [   {   'column': 'category',
                               'decoder': {'type': 'classifier'},
                               'dependencies': [],
                               'loss': {   'class_similarities_temperature': 0,
                                           'class_weights': None,
                                           'confidence_penalty': 0.0,
                                           'robust_lambda': 0,
                                           'type': 'softmax_cross_entropy',
                                           'weight': 1.0},
                               'name': 'category',
                               'preprocessing': {   'missing_value_strategy': 'drop_row'},
                               'proc_column': 'category_mZFLky',
                               'reduce_dependencies': 'sum',
                               'reduce_input': 'sum',
                               'top_k': 3,
                               'type': 'category'}],
    'preprocessing': {   'oversample_minority': None,
                         'sample_ratio': 1.0,
                         'split': {   'probabilities': [0.7, 0.1, 0.2],
                                      'type': 'random'},
                         'undersample_majority': None},
    'trainer': {   'batch_size': 128,
                   'checkpoints_per_epoch': 0,
                   'decay': False,
                   'decay_rate': 0.96,
                   'decay_steps': 10000,
                   'early_stop': 5,
                   'epochs': 2,
                   'eval_batch_size': None,
                   'evaluate_training_set': True,
                   'gradient_clipping': {   'clipglobalnorm': 0.5,
                                            'clipnorm': None,
                                            'clipvalue': None},
                   'increase_batch_size_eval_metric': 'loss',
                   'increase_batch_size_eval_split': 'training',
                   'increase_batch_size_on_plateau': 0,
                   'increase_batch_size_on_plateau_max': 512,
                   'increase_batch_size_on_plateau_patience': 5,
                   'increase_batch_size_on_plateau_rate': 2.0,
                   'learning_rate': 0.001,
                   'learning_rate_scaling': 'linear',
                   'learning_rate_warmup_epochs': 1.0,
                   'optimizer': {   'amsgrad': False,
                                    'betas': (0.9, 0.999),
                                    'eps': 1e-08,
                                    'lr': 0.001,
                                    'type': 'adam',
                                    'weight_decay': 0.0},
                   'reduce_learning_rate_eval_metric': 'loss',
                   'reduce_learning_rate_eval_split': 'training',
                   'reduce_learning_rate_on_plateau': 0.0,
                   'reduce_learning_rate_on_plateau_patience': 5,
                   'reduce_learning_rate_on_plateau_rate': 0.5,
                   'regularization_lambda': 0.0,
                   'regularization_type': 'l2',
                   'should_shuffle': True,
                   'staircase': False,
                   'steps_per_checkpoint': 0,
                   'train_steps': None,
                   'type': 'trainer',
                   'validation_field': 'combined',
                   'validation_metric': 'loss'}}

╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛

Using full raw dataset, no hdf5 and json file with the same name have been found
Building dataset (it may take a while)
Inferring num_channels from the first 40 images.
  images with 3 channels: 40
Using 3 channels because it is the majority in sample. If an image with a different depth is read, will attempt to convert to 3 channels.
To explicitly set the number of channels, define num_channels in the preprocessing dictionary of the image input feature config.
Building dataset: DONE
Writing preprocessed training set cache
Writing preprocessed test set cache
Writing preprocessed validation set cache
Writing train set metadata

Dataset sizes:
╒════════════╤════════╕
│ Dataset    │   Size │
╞════════════╪════════╡
│ Training   │     28 │
├────────────┼────────┤
│ Validation │      4 │
├────────────┼────────┤
│ Test       │      8 │
╘════════════╧════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Warnings and other logs:
/usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
  warnings.warn(

╒══════════╕
│ TRAINING │
╘══════════╛

Training for 2 step(s), approximately 2 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:   0%|          | 0/2 [00:00<?, ?it/s]
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Top GitHub Comments

w4nderlust commented, Sep 16, 2022

Our implementation was the original TF1 code from the official ResNet repo back in the day; we ported it to TF2 and then to PyTorch. It's possible that some suboptimal choices were made in the process. For instance, I would be curious to see what layer they are using in the torchvision implementation instead of Conv2dLayerFixedPadding. Anyway, @justinxzhao, there is no big reason to keep our own implementation around; we can just adopt the torchvision one. The only issue is backward compatibility of models trained using the old implementation, but we will just add a note to the release and suggest that users retrain because of the advantages in memory usage. So @jimthompson5802, I think we can just replace implementations, ditch the old one and keep the tv one.

jimthompson5802 commented, Sep 18, 2022

Closing. No action.
