Memory leak with resnet image encoder
Describe the bug
While testing with synthetic image data, I encountered this error during model.train()
with the Ludwig ResNet image encoder.
╒══════════╕
│ TRAINING │
╘══════════╛
Training for 2 step(s), approximately 2 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).
Starting with step 0, epoch: 0
Training: 0%| | 0/2 [00:00<?, ?it/s]
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
The error occurs in a 12GB Docker container, and ONLY with the resnet encoder.
All the other encoders (stacked_cnn, mlp_mixer and vit) WORK as expected.
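The exit code in the output above follows the usual shell convention in which a status above 128 means "terminated by signal (status - 128)". A minimal sketch for decoding it, where describe_exit_code is a hypothetical helper name and not part of Ludwig:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status; values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```

In a memory-limited container, a SIGKILL like this often comes from the kernel's out-of-memory killer, although the exit code alone does not prove it.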
To Reproduce
Steps to reproduce the behavior:
Run this reproducible example:
import logging
import os
import shutil

from ludwig.api import LudwigModel
from ludwig.data.dataset_synthesizer import cli_synthesize_dataset

FEATURES_LIST = [
    {"name": "category", "type": "category"},
    {
        "name": "image", "type": "image",
        "destination_folder": os.path.join(os.getcwd(), "data2/images"),
        "preprocessing": {"height": 224, "width": 224, "num_channels": 3},
    },
]

CONFIG = {
    "input_features": [
        {
            "name": "image", "type": "image",
            "encoder": {
                "type": "resnet",
            },
        }
    ],
    "output_features": [
        {
            "name": "category", "type": "category",
        }
    ],
    "trainer": {"epochs": 2},
}

if __name__ == "__main__":
    shutil.rmtree("data2", ignore_errors=True)
    os.makedirs("data2", exist_ok=True)
    cli_synthesize_dataset(40, FEATURES_LIST, "data2/syn_train.csv")
    model = LudwigModel(CONFIG, logging_level=logging.INFO)
    model.train(dataset="data2/syn_train.csv")
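For a sense of scale, here is a rough back-of-the-envelope estimate of how much memory the synthetic images themselves need once decoded. It assumes each pixel value is held as a 4-byte float32, which is an assumption for illustration and not something the report states:

```python
# Image dimensions and row count taken from the reproducible example above.
HEIGHT, WIDTH, CHANNELS = 224, 224, 3
NUM_ROWS = 40
BYTES_PER_VALUE = 4  # assumption: float32 storage

bytes_per_image = HEIGHT * WIDTH * CHANNELS * BYTES_PER_VALUE
total_bytes = NUM_ROWS * bytes_per_image

print(f"{bytes_per_image} bytes per image, "
      f"{total_bytes / 2**20:.1f} MiB for all {NUM_ROWS} rows")
# prints: 602112 bytes per image, 23.0 MiB for all 40 rows
```

Under that assumption the raw input data is tiny compared with the 12GB container limit, so the dataset on its own would not account for the memory growth.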
Expected behavior
The model.train() method runs successfully.
Screenshots
Here is a screenshot of the output from docker stats for the container just before the out-of-memory error occurs.
As the program is running, I can see MEM USAGE increase until the Error 137 is reported.
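The same growth can also be observed from inside the Python process. The following is a minimal sketch using only the standard library; PeakMemoryLogger and peak_rss_mib are hypothetical names introduced here, and the resource module is only available on Unix-like systems:

```python
import resource
import sys
import threading


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


class PeakMemoryLogger:
    """Context manager that prints the peak RSS at a fixed interval from a daemon thread."""

    def __init__(self, interval_s: float = 1.0):
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _sample(self):
        value = peak_rss_mib()
        self.samples.append(value)
        # flush=True so the line is written out even if the process is killed later.
        print(f"peak RSS: {value:.1f} MiB", flush=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self.interval_s):
            self._sample()

    def __enter__(self):
        self._sample()
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        self._sample()
        return False
```

Wrapping the training call in `with PeakMemoryLogger():` would leave a trail of readings in the log up to the point where the process is killed. Since SIGKILL cannot be caught, the final sample in `__exit__` would not be printed in that case.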
Environment (please complete the following information):
- OS: macOS 12.5.1 with Docker Desktop 4.9.1 (81317)
- Python version: 3.8
- Ludwig version: 0.6.dev on the master branch
Additional context
Here is the full log file:
77567a00d862:python -u /opt/project/sandbox/vision_models/mwe_dask_backend_issue1.py
NumExpr defaulting to 8 threads.
╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛
╒══════════════════╤═══════════════════════════════════════════════════════════════════╕
│ Experiment name │ api_experiment │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Model name │ run │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ Output directory │ /opt/project/sandbox/vision_models/results/api_experiment_run_552 │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ ludwig_version │ '0.6.dev' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ command │ '/opt/project/sandbox/vision_models/mwe_dask_backend_issue1.py' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ commit_hash │ '4b0825bd4be7' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ random_seed │ 42 │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ dataset │ 'data2/syn_train.csv' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ data_format │ 'csv' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ torch_version │ '1.12.1+cu102' │
├──────────────────┼───────────────────────────────────────────────────────────────────┤
│ compute │ {'num_nodes': 1} │
╘══════════════════╧═══════════════════════════════════════════════════════════════════╛
╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛
{ 'combiner': { 'activation': 'relu',
'bias_initializer': 'zeros',
'dropout': 0.0,
'fc_layers': None,
'flatten_inputs': False,
'norm': None,
'norm_params': None,
'num_fc_layers': 0,
'output_size': 256,
'residual': False,
'type': 'concat',
'use_bias': True,
'weights_initializer': 'xavier_uniform'},
'defaults': { 'audio': { 'preprocessing': { 'audio_file_length_limit_in_s': 7.5,
'computed_fill_value': None,
'fill_value': None,
'in_memory': True,
'missing_value_strategy': 'bfill',
'norm': None,
'num_fft_points': None,
'num_filter_bands': 80,
'padding_value': 0.0,
'type': 'fbank',
'window_length_in_s': 0.04,
'window_shift_in_s': 0.02,
'window_type': 'hamming'}},
'bag': { 'preprocessing': { 'computed_fill_value': '<UNK>',
'fill_value': '<UNK>',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000,
'tokenizer': 'space'}},
'binary': { 'preprocessing': { 'computed_fill_value': None,
'fallback_true_label': None,
'fill_value': None,
'missing_value_strategy': 'fill_with_false'}},
'category': { 'preprocessing': { 'computed_fill_value': '<UNK>',
'fill_value': '<UNK>',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000}},
'date': { 'preprocessing': { 'computed_fill_value': '',
'datetime_format': None,
'fill_value': '',
'missing_value_strategy': 'fill_with_const'}},
'h3': { 'preprocessing': { 'computed_fill_value': 576495936675512319,
'fill_value': 576495936675512319,
'missing_value_strategy': 'fill_with_const'}},
'image': { 'preprocessing': { 'computed_fill_value': None,
'fill_value': None,
'height': None,
'in_memory': True,
'infer_image_dimensions': True,
'infer_image_max_height': 256,
'infer_image_max_width': 256,
'infer_image_num_channels': True,
'infer_image_sample_size': 100,
'missing_value_strategy': 'bfill',
'num_channels': None,
'num_processes': 1,
'resize_method': 'interpolate',
'scaling': 'pixel_normalization',
'width': None}},
'number': { 'preprocessing': { 'computed_fill_value': 0.0,
'fill_value': 0.0,
'missing_value_strategy': 'fill_with_const',
'normalization': None}},
'sequence': { 'preprocessing': { 'computed_fill_value': '<UNK>',
'fill_value': '<UNK>',
'lowercase': False,
'max_sequence_length': 256,
'missing_value_strategy': 'fill_with_const',
'most_common': 20000,
'padding': 'right',
'padding_symbol': '<PAD>',
'tokenizer': 'space',
'unknown_symbol': '<UNK>',
'vocab_file': None}},
'set': { 'preprocessing': { 'computed_fill_value': '<UNK>',
'fill_value': '<UNK>',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000,
'tokenizer': 'space'}},
'text': { 'preprocessing': { 'computed_fill_value': '<UNK>',
'fill_value': '<UNK>',
'lowercase': True,
'max_sequence_length': 256,
'missing_value_strategy': 'fill_with_const',
'most_common': 20000,
'padding': 'right',
'padding_symbol': '<PAD>',
'pretrained_model_name_or_path': None,
'tokenizer': 'space_punct',
'unknown_symbol': '<UNK>',
'vocab_file': None}},
'timeseries': { 'preprocessing': { 'computed_fill_value': '',
'fill_value': '',
'missing_value_strategy': 'fill_with_const',
'padding': 'right',
'padding_value': 0.0,
'timeseries_length_limit': 256,
'tokenizer': 'space'}},
'vector': { 'preprocessing': { 'computed_fill_value': '',
'fill_value': '',
'missing_value_strategy': 'fill_with_const',
'vector_size': None}}},
'input_features': [ { 'column': 'image',
'encoder': {'type': 'resnet'},
'name': 'image',
'preprocessing': {},
'proc_column': 'image_mZFLky',
'tied': None,
'type': 'image'}],
'ludwig_version': '0.6.dev',
'model_type': 'ecd',
'output_features': [ { 'column': 'category',
'decoder': {'type': 'classifier'},
'dependencies': [],
'loss': { 'class_similarities_temperature': 0,
'class_weights': None,
'confidence_penalty': 0.0,
'robust_lambda': 0,
'type': 'softmax_cross_entropy',
'weight': 1.0},
'name': 'category',
'preprocessing': { 'missing_value_strategy': 'drop_row'},
'proc_column': 'category_mZFLky',
'reduce_dependencies': 'sum',
'reduce_input': 'sum',
'top_k': 3,
'type': 'category'}],
'preprocessing': { 'oversample_minority': None,
'sample_ratio': 1.0,
'split': { 'probabilities': [0.7, 0.1, 0.2],
'type': 'random'},
'undersample_majority': None},
'trainer': { 'batch_size': 128,
'checkpoints_per_epoch': 0,
'decay': False,
'decay_rate': 0.96,
'decay_steps': 10000,
'early_stop': 5,
'epochs': 2,
'eval_batch_size': None,
'evaluate_training_set': True,
'gradient_clipping': { 'clipglobalnorm': 0.5,
'clipnorm': None,
'clipvalue': None},
'increase_batch_size_eval_metric': 'loss',
'increase_batch_size_eval_split': 'training',
'increase_batch_size_on_plateau': 0,
'increase_batch_size_on_plateau_max': 512,
'increase_batch_size_on_plateau_patience': 5,
'increase_batch_size_on_plateau_rate': 2.0,
'learning_rate': 0.001,
'learning_rate_scaling': 'linear',
'learning_rate_warmup_epochs': 1.0,
'optimizer': { 'amsgrad': False,
'betas': (0.9, 0.999),
'eps': 1e-08,
'lr': 0.001,
'type': 'adam',
'weight_decay': 0.0},
'reduce_learning_rate_eval_metric': 'loss',
'reduce_learning_rate_eval_split': 'training',
'reduce_learning_rate_on_plateau': 0.0,
'reduce_learning_rate_on_plateau_patience': 5,
'reduce_learning_rate_on_plateau_rate': 0.5,
'regularization_lambda': 0.0,
'regularization_type': 'l2',
'should_shuffle': True,
'staircase': False,
'steps_per_checkpoint': 0,
'train_steps': None,
'type': 'trainer',
'validation_field': 'combined',
'validation_metric': 'loss'}}
╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛
Using full raw dataset, no hdf5 and json file with the same name have been found
Building dataset (it may take a while)
Inferring num_channels from the first 40 images.
images with 3 channels: 40
Using 3 channels because it is the majority in sample. If an image with a different depth is read, will attempt to convert to 3 channels.
To explicitly set the number of channels, define num_channels in the preprocessing dictionary of the image input feature config.
Building dataset: DONE
Writing preprocessed training set cache
Writing preprocessed test set cache
Writing preprocessed validation set cache
Writing train set metadata
Dataset sizes:
╒════════════╤════════╕
│ Dataset │ Size │
╞════════════╪════════╡
│ Training │ 28 │
├────────────┼────────┤
│ Validation │ 4 │
├────────────┼────────┤
│ Test │ 8 │
╘════════════╧════════╛
╒═══════╕
│ MODEL │
╘═══════╛
Warnings and other logs:
/usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
warnings.warn(
╒══════════╕
│ TRAINING │
╘══════════╛
Training for 2 step(s), approximately 2 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).
Starting with step 0, epoch: 0
Training: 0%| | 0/2 [00:00<?, ?it/s]
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Top GitHub Comments
Our implementation started as the original TF1 code from the official ResNet repo; we ported it to TF2 and then to PyTorch. It's possible that some suboptimal choices were made in the process. For instance, I would be curious to see which layer the torchvision implementation uses instead of Conv2dLayerFixedPadding. Anyway, @justinxzhao, there is no big reason to keep our own implementation around; we can just adopt the torchvision one. The only issue is backward compatibility of models trained using the old implementation, but we will just add a note to the release and suggest that users retrain because of the advantages in memory usage.
So @jimthompson5802, I think we can just replace implementations, ditch the old one and keep the tv one.
Closing. No action.