Tabular: os.fork() called by MXNet NN model during prediction causing OOM

This OOM error occurred during training on the Albert OpenML dataset (large) on an m5.2xlarge instance. It had never been encountered before (it happened 3 times in the latest benchmark run, on the Albert and Airlines datasets).

The error happened during setup for training the weighted ensemble, with bagging and stacking disabled. In this configuration, each model must make predictions on the validation data in turn (the validation set is very small, under 10,000 rows, and should therefore not cause issues). For whatever reason, TabularNN is forking the process, which likely leads to a doubling of memory usage.

Question: Why are we running out of memory during a stage that should require very little memory, and why are we forking the process?
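
For context, here is a minimal sketch of the kind of call that triggers the fork. The dataset, shapes, and batch size below are synthetic stand-ins for illustration, not values from the issue; the real DataLoader is constructed inside tabular_nn_dataset.py (see the trace below).

    import numpy as np
    import mxnet as mx
    from mxnet.gluon.data import ArrayDataset, DataLoader

    # Synthetic stand-in for the validation data; the real dataset is
    # built inside tabular_nn_dataset.py.
    features = mx.nd.array(np.random.rand(10000, 50).astype('float32'))
    dataset = ArrayDataset(features)

    # With num_workers > 0, Gluon creates a multiprocessing.Pool whose
    # workers are started via os.fork() (popen_fork.py in the trace below).
    # Each child inherits a copy-on-write view of the parent's address
    # space, so a parent already close to the memory limit can fail here
    # with OSError: [Errno 12] Cannot allocate memory.
    loader = DataLoader(dataset, batch_size=512, num_workers=4)

    for batch in loader:
        pass  # prediction would consume the batches here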

Stack trace (note that the memory check shortly before the error reports only 32% memory usage):

[INFO] [amlb.utils.process:01:07:58.029] CPU Utilization: 41.7%
[INFO] [amlb.utils.process:01:07:58.029] Memory Usage: 32.1%
[INFO] [amlb.utils.process:01:07:58.029] Disk Usage: 15.7%
[ERROR] [amlb.benchmark:01:08:55.237] [Errno 12] Cannot allocate memory
Traceback (most recent call last):
  File "/home/ubuntu/workspace/automlbenchmark/amlb/benchmark.py", line 391, in run
    meta_result = framework.run(self._dataset, task_config)
  File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon_nobag/__init__.py", line 4, in run
    return run(*args, **kwargs)
  File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon_nobag/exec.py", line 16, in run
    return exec_template.run(dataset=dataset, config=config, parameters=parameters)
  File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon/exec_template.py", line 82, in run
    **fit_params
  File "/home/ubuntu/workspace/autogluon/autogluon/task/tabular_prediction/tabular_prediction.py", line 425, in fit
    hyperparameters=hyperparameters, time_limit=time_limits_orig, save_data=cache_data, verbosity=verbosity)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/learner/default_learner.py", line 99, in fit
    hyperparameters=hyperparameters)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/auto_trainer.py", line 39, in train
    self.train_multi_and_ensemble(X_train, y_train, X_test, y_test, models, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 507, in train_multi_and_ensemble
    self.train_multi_levels(X_train, y_train, X_test, y_test, models=models, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune, level_start=0, level_end=self.stack_ensemble_levels)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 522, in train_multi_levels
    self.stack_new_level(X=X_train, y=y_train, X_test=X_test, y_test=y_test, models=models, level=level, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune, time_limit_core=time_limit_core, time_limit_aux=time_limit_aux)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 533, in stack_new_level
    aux_models = self.stack_new_level_aux(X=X_test, y=y_test, fit=False, level=level+1, time_limit=time_limit_aux)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 568, in stack_new_level_aux
    X_train_stack_preds = self.get_inputs_to_stacker(X, level_start=0, level_end=level, fit=fit)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 797, in get_inputs_to_stacker
    X = dummy_stackers[level+1].preprocess(X=X, preprocess=False, fit=False, compute_base_preds=True)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/ensemble/stacker_ensemble_model.py", line 92, in preprocess
    y_pred_proba = base_model.predict_proba(X)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 384, in predict_proba
    return self._predict_tabular_data(new_data=X, process=True, predict_proba=True)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 398, in _predict_tabular_data
    new_data = self.process_test_data(new_data, batch_size=self.batch_size, num_dataloading_workers=self.num_dataloading_workers, labels=None)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 486, in process_test_data
    problem_type=self.problem_type, labels=labels, is_test=True)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_dataset.py", line 150, in __init__
    self.generate_dataset_and_dataloader(data_list=data_list)
  File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_dataset.py", line 158, in generate_dataset_and_dataloader
    num_workers=self.num_dataloading_workers)  # no need to shuffle test data
  File "/home/ubuntu/virtual/automlbenchmark/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 642, in __init__
    initargs=[self._dataset, is_np_shape(), is_np_array()])
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
    self._repopulate_pool()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

2 reactions
zhreshold commented, Apr 16, 2020

Discussed offline with @jwmueller: given the existing use case, there is no reason to use multiprocessing workers, since each worker process has to keep a copy of the original ndarray.

Two choices here (see the sketch after this list):

  • Stick with num_workers=0 for the DataLoader, to avoid duplicating the data in child processes
  • Use num_workers>0 together with thread_pool=True; this does not bypass the GIL, but can significantly reduce memory usage while potentially improving IO
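
A minimal sketch of the two options, calling mxnet.gluon.data.DataLoader directly; the dataset, batch size, and worker count are illustrative stand-ins, not values from AutoGluon:

    import numpy as np
    import mxnet as mx
    from mxnet.gluon.data import ArrayDataset, DataLoader

    dataset = ArrayDataset(mx.nd.array(np.random.rand(10000, 50).astype('float32')))

    # Choice 1: no worker processes at all. Data is loaded in the main
    # process, so nothing is duplicated into forked children.
    loader = DataLoader(dataset, batch_size=512, num_workers=0)

    # Choice 2: worker threads instead of worker processes. thread_pool=True
    # makes the workers threads that share the parent's memory (no fork,
    # no copy), at the cost of contending on the GIL.
    loader = DataLoader(dataset, batch_size=512, num_workers=4, thread_pool=True)
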
0 reactions
Innixma commented, Apr 20, 2020

Please see #422 for the proposed solution.
