Tabular: os.fork() called by MXNet NN model during prediction causing OOM
This OOM error occurred while training on the Albert OpenML dataset (Large) on an m5.2xlarge instance. It has never been encountered before (it happened 3 times on the latest benchmark run, on the Albert and Airlines datasets).
The error happened during setup for training the weighted ensemble, with bagging and stacking disabled. In this setup, each model needs to predict on the validation data in turn (the validation set is very small, under 10,000 rows, and therefore should not cause issues). For whatever reason, TabularNN forks the process, which likely leads to a doubling of memory usage.
Question: Why are we running OOM during a stage that should require very minimal memory usage, and why are we forking the process?
Stack trace (note that the memory check shortly before this indicates 32% memory usage):
[INFO] [amlb.utils.process:01:07:58.029] CPU Utilization: 41.7%
[INFO] [amlb.utils.process:01:07:58.029] Memory Usage: 32.1%
[INFO] [amlb.utils.process:01:07:58.029] Disk Usage: 15.7%
[ERROR] [amlb.benchmark:01:08:55.237] [Errno 12] Cannot allocate memory
Traceback (most recent call last):
File "/home/ubuntu/workspace/automlbenchmark/amlb/benchmark.py", line 391, in run
meta_result = framework.run(self._dataset, task_config)
File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon_nobag/__init__.py", line 4, in run
return run(*args, **kwargs)
File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon_nobag/exec.py", line 16, in run
return exec_template.run(dataset=dataset, config=config, parameters=parameters)
File "/home/ubuntu/workspace/automlbenchmark/frameworks/autogluon/exec_template.py", line 82, in run
**fit_params
File "/home/ubuntu/workspace/autogluon/autogluon/task/tabular_prediction/tabular_prediction.py", line 425, in fit
hyperparameters=hyperparameters, time_limit=time_limits_orig, save_data=cache_data, verbosity=verbosity)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/learner/default_learner.py", line 99, in fit
hyperparameters=hyperparameters)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/auto_trainer.py", line 39, in train
self.train_multi_and_ensemble(X_train, y_train, X_test, y_test, models, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 507, in train_multi_and_ensemble
self.train_multi_levels(X_train, y_train, X_test, y_test, models=models, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune, level_start=0, level_end=self.stack_ensemble_levels)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 522, in train_multi_levels
self.stack_new_level(X=X_train, y=y_train, X_test=X_test, y_test=y_test, models=models, level=level, hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune, time_limit_core=time_limit_core, time_limit_aux=time_limit_aux)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 533, in stack_new_level
aux_models = self.stack_new_level_aux(X=X_test, y=y_test, fit=False, level=level+1, time_limit=time_limit_aux)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 568, in stack_new_level_aux
X_train_stack_preds = self.get_inputs_to_stacker(X, level_start=0, level_end=level, fit=fit)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 797, in get_inputs_to_stacker
X = dummy_stackers[level+1].preprocess(X=X, preprocess=False, fit=False, compute_base_preds=True)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/ensemble/stacker_ensemble_model.py", line 92, in preprocess
y_pred_proba = base_model.predict_proba(X)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 384, in predict_proba
return self._predict_tabular_data(new_data=X, process=True, predict_proba=True)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 398, in _predict_tabular_data
new_data = self.process_test_data(new_data, batch_size=self.batch_size, num_dataloading_workers=self.num_dataloading_workers, labels=None)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 486, in process_test_data
problem_type=self.problem_type, labels=labels, is_test=True)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_dataset.py", line 150, in __init__
self.generate_dataset_and_dataloader(data_list=data_list)
File "/home/ubuntu/workspace/autogluon/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_dataset.py", line 158, in generate_dataset_and_dataloader
num_workers=self.num_dataloading_workers) # no need to shuffle test data
File "/home/ubuntu/virtual/automlbenchmark/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 642, in __init__
initargs=[self._dataset, is_np_shape(), is_np_array()])
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
self._repopulate_pool()
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
w.start()
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
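The fork comes from MXNet's Gluon DataLoader: as the trace shows, constructing a DataLoader with `num_workers > 0` creates a `multiprocessing.Pool`, and on Linux each pool worker is started via `os.fork()` of the (already large) parent process. Below is a minimal sketch of that trigger, with illustrative shapes and batch size rather than the actual AutoGluon values:

```python
from mxnet import gluon, nd

# Stand-in tabular dataset; the real one is built in tabular_nn_dataset.py.
dataset = gluon.data.ArrayDataset(nd.random.uniform(shape=(10000, 100)))

# With num_workers > 0, DataLoader.__init__ creates a multiprocessing.Pool,
# and each pool worker is forked from the current process (popen_fork.py ->
# os.fork() in the trace above). If the parent process is large, the fork
# itself can fail with ENOMEM even though little extra memory is needed.
loader = gluon.data.DataLoader(dataset, batch_size=512, shuffle=False,
                               num_workers=2)
```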
Discussed offline with @jwmueller: given the existing use case, there's no reason to use multiprocessing workers, since each worker process has to keep a copy of the original ndarray.
Two choices here (see the sketch after this list):
- `num_workers=0` for the DataLoader, to avoid duplicating the data in child processes
- `num_workers>0` together with `thread_pool=True`; this will not bypass the GIL, but can significantly reduce memory usage while potentially improving I/O

Please see #422 for the proposed solution.
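As a rough illustration of the two options above (a minimal sketch with a made-up dataset and batch size, not the actual AutoGluon code):

```python
from mxnet import gluon, nd

dataset = gluon.data.ArrayDataset(nd.random.uniform(shape=(10000, 100)))
batch_size = 512

# Option 1: num_workers=0 -- all loading happens in the main process,
# so no forked child has to hold a copy of the underlying ndarrays.
loader_main_process = gluon.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=False, num_workers=0)

# Option 2: num_workers>0 with thread_pool=True -- workers are threads
# rather than forked processes, so memory is shared; the GIL still applies,
# but I/O-bound loading can still benefit.
loader_thread_pool = gluon.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=False, num_workers=4,
    thread_pool=True)
```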