Auto-sklearn consumes a lot of memory compared to dataset size

Describe the bug

By default, memory_limit is set to 3 GB for machine learning models. I get the following error even with a dataset of only ~3.5 MB:

ValueError: Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}

According to subsample_if_too_large, a memory limit of 10x the dataset size should be sufficient to train the best model, but I have to raise memory_limit to 4-5 GB before the experiments run successfully. We ran a few experiments to find the root cause. It appears to be related to pynisher, and some of the memory is used by the Python interpreter itself. Why is so much memory required for such a small dataset?
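To put numbers on the mismatch, here is a rough sketch that estimates the dense size of the training matrix and reads the current process's own memory counters. It rests on two assumptions that are not confirmed in this report: that the data is held as float64, and that pynisher enforces the limit through an address-space resource limit, so virtual memory (interpreter plus imported libraries) counts against it rather than only the dataset. The helper names are made up for illustration, and the `/proc` read works on Linux only.

```python
import os


def dataset_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Size of a dense matrix in MB (1 MB = 1024**2 bytes), float64 by default."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 2


def process_memory_mb():
    """Return this process's virtual (VmSize) and resident (VmRSS) memory in MB.

    Linux-only: parses /proc/self/status. Returns None on other platforms.
    """
    path = "/proc/self/status"
    if not os.path.exists(path):
        return None
    values = {}
    with open(path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                values[key] = int(rest.split()[0]) / 1024  # kB -> MB
    return values


if __name__ == "__main__":
    data_mb = dataset_megabytes(33908, 16)  # shape printed by the script
    print(f"training matrix: {data_mb:.1f} MB, 10x: {10 * data_mb:.1f} MB")
    print("process memory before any training:", process_memory_mb())
```

Running the second helper after importing the heavy libraries (numpy, scikit-learn, auto-sklearn) would show how much of the limit is already taken before any model is trained. If VmSize is far above VmRSS at that point, that would be consistent with the limit being hit by address space rather than by the data.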

To Reproduce

I am using the following script. I get the same error with n = 1, 2, 3, and 4 (i.e. a memory_limit of 1-4 GB).

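import sklearn.datasets
import sklearn.model_selection  # needed for train_test_split below
import pandas as pd
import autosklearn.classification

# as_frame=True returns a pandas DataFrame, which the column loop below requires
X, y = sklearn.datasets.fetch_openml(data_id=1461, return_X_y=True, as_frame=True)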
for col in X.columns:
    if X[col].dtype.name == 'object':
        X[col] = X[col].astype('category')
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
print('data shape:', X_train.shape)

n = 1  # memory limit in GB; the same error occurs for n = 2, 3 and 4
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    per_run_time_limit=60,
    tmp_folder='/tmp/autosklearn_classification_example_tmp',
    output_folder='/tmp/autosklearn_classification_example_out',
    memory_limit=n*1024,
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')
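Instead of editing n by hand for every run, the search for the smallest working limit could be automated along these lines. This is only a sketch: `smallest_working_limit` and `try_fit` are hypothetical names, and `try_fit` stands for any callable that builds the classifier with the given memory_limit, calls fit, and lets the ValueError from the failed dummy prediction propagate. The demo uses a simulated fit so that it runs without auto-sklearn installed.

```python
def smallest_working_limit(try_fit, limits_mb):
    """Return the first limit (in MB) for which try_fit(limit) succeeds, else None.

    try_fit is a caller-supplied callable (hypothetical here) that raises
    ValueError when the run fails, as the dummy prediction does on MEMOUT.
    """
    for limit in sorted(limits_mb):
        try:
            try_fit(limit)
        except ValueError as exc:
            print(f"memory_limit={limit} MB failed: {exc}")
            continue
        return limit
    return None


if __name__ == "__main__":
    # Stand-in for constructing AutoSklearnClassifier(memory_limit=limit) and
    # calling fit(); it merely pretends that 5 GB is the first value that works.
    def fake_fit(limit):
        if limit < 5 * 1024:
            raise ValueError("Memout (simulated)")

    print(smallest_working_limit(fake_fit, [n * 1024 for n in range(1, 7)]))
```

With a real fit in place of the stand-in, each attempt should use fresh tmp_folder and output_folder paths, since the script above reuses fixed directories.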

Actual behavior, stacktrace or logfile

data shape: (33908, 16)
Traceback (most recent call last):
  File "test_memory_askl.py", line 31, in <module>
[ERROR] [2021-04-18 00:19:13,203:Client-AutoML(1):breast_cancer] Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}.
    automl.fit(X_train, y_train, dataset_name='breast_cancer')
  File "automl_env/lib/python3.7/site-packages/autosklearn/estimators.py", line 598, in fit
    dataset_name=dataset_name,
  File "automl_env/lib/python3.7/site-packages/autosklearn/estimators.py", line 357, in fit
    self.automl_.fit(load_models=self.load_models, **kwargs)
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 1422, in fit
    is_classification=True,
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 623, in fit
    self._do_dummy_prediction(datamanager, num_run)
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 438, in _do_dummy_prediction
    % (str(status), str(additional_info))
ValueError: Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}.
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/python/python37/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
Process ForkProcess-1:
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Exception ignored in: <function AutoML.__del__ at 0x7ff16baa0840>
Traceback (most recent call last):
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 1373, in __del__
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 352, in _clean_logger
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 140, in join
  File "/opt/python/python37/lib/python3.7/multiprocessing/popen_fork.py", line 44, in wait
TypeError: 'NoneType' object is not callable
Traceback (most recent call last):
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "automl_env/lib/python3.7/site-packages/autosklearn/util/logging_.py", line 295, in start_log_server
    receiver.serve_until_stopped()
  File "automl_env/lib/python3.7/site-packages/autosklearn/util/logging_.py", line 327, in serve_until_stopped
    self.timeout)
KeyboardInterrupt