Auto-sklearn consumes a lot of memory compared to dataset size

Describe the bug

By default, memory_limit is set to 3GB for machine learning models. I get the following error even with a dataset of only ~3.5MB:

ValueError: Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}

According to subsample_if_too_large, memory of 10x the dataset size should be enough to train the best model, yet I have to raise memory_limit to 4-5GB for experiments to run successfully. We ran a few experiments to find the root cause. It appears to be related to pynisher, with part of the limit consumed by the Python interpreter itself. Why is so much memory required for such a small dataset?
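One way to probe the pynisher suspicion is to see how a cap on the process address space behaves on its own. The sketch below is an illustration only and does not use pynisher or auto-sklearn: it assumes (this is not confirmed in the issue) that the limit is enforced through something like the POSIX RLIMIT_AS resource limit, which counts everything the interpreter maps, including imported libraries, and not just the dataset. The values LIMIT_MB and REQUEST_MB are hypothetical and chosen only for the demonstration. It is intended for Linux; other platforms may not enforce RLIMIT_AS.

```python
import resource
import subprocess
import sys

LIMIT_MB = 512     # hypothetical address-space cap for the child process
REQUEST_MB = 1024  # allocation the child attempts, deliberately above the cap


def cap_address_space():
    """Runs in the child before exec: lower the soft RLIMIT_AS, keep the hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    limit = LIMIT_MB * 1024 * 1024
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed an existing hard limit.
        limit = min(limit, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))


# Code executed by the capped child interpreter.
child_code = (
    "try:\n"
    f"    buf = bytearray({REQUEST_MB} * 1024 * 1024)\n"
    "    print('allocated')\n"
    "except MemoryError:\n"
    "    print('MemoryError')\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    preexec_fn=cap_address_space,
    capture_output=True,
    text=True,
)
outcome = result.stdout.strip()
print("child outcome:", outcome)
```

Under such a cap, the interpreter's own mappings and every imported module count against the same budget, so the headroom left for the data can be much smaller than the configured limit suggests.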

To Reproduce

I am using the following script. I get the same error with n = 1, 2, 3 and 4 (i.e. memory_limit of 1-4GB).

import sklearn.datasets
import sklearn.model_selection
import pandas as pd
import autosklearn.classification

# as_frame=True returns a pandas DataFrame, which the column loop below requires.
X, y = sklearn.datasets.fetch_openml(data_id=1461, return_X_y=True, as_frame=True)
# Convert string (object) columns to the pandas categorical dtype.
for col in X.columns:
    if X[col].dtype.name == 'object':
        X[col] = X[col].astype('category')
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
print('data shape:', X_train.shape)

# Memory limit in GB; the error also occurs for n = 2, 3 and 4.
n = 1
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,
    per_run_time_limit=60,
    tmp_folder='/tmp/autosklearn_classification_example_tmp',
    output_folder='/tmp/autosklearn_classification_example_out',
    memory_limit=n * 1024,  # in MB
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')

Actual behavior, stacktrace or logfile

data shape: (33908, 16)
Traceback (most recent call last):
  File "test_memory_askl.py", line 31, in <module>
[ERROR] [2021-04-18 00:19:13,203:Client-AutoML(1):breast_cancer] Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}.
    automl.fit(X_train, y_train, dataset_name='breast_cancer')
  File "automl_env/lib/python3.7/site-packages/autosklearn/estimators.py", line 598, in fit
    dataset_name=dataset_name,
  File "automl_env/lib/python3.7/site-packages/autosklearn/estimators.py", line 357, in fit
    self.automl_.fit(load_models=self.load_models, **kwargs)
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 1422, in fit
    is_classification=True,
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 623, in fit
    self._do_dummy_prediction(datamanager, num_run)
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 438, in _do_dummy_prediction
    % (str(status), str(additional_info))
ValueError: Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 1024 MB).', 'configuration_origin': 'DUMMY'}.
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/python/python37/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
Process ForkProcess-1:
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Exception ignored in: <function AutoML.__del__ at 0x7ff16baa0840>
Traceback (most recent call last):
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 1373, in __del__
  File "automl_env/lib/python3.7/site-packages/autosklearn/automl.py", line 352, in _clean_logger
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 140, in join
  File "/opt/python/python37/lib/python3.7/multiprocessing/popen_fork.py", line 44, in wait
TypeError: 'NoneType' object is not callable
Traceback (most recent call last):
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/python/python37/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "automl_env/lib/python3.7/site-packages/autosklearn/util/logging_.py", line 295, in start_log_server
    receiver.serve_until_stopped()
  File "automl_env/lib/python3.7/site-packages/autosklearn/util/logging_.py", line 327, in serve_until_stopped
    self.timeout)
KeyboardInterrupt
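For scale, the shape printed at the top of the log can be compared with the memory the Python process already holds. The sketch below is a rough estimate only: it assumes the data is stored as dense 8-byte values (an assumption, since the actual dtypes are not stated in the issue), and the helper names peak_rss_mb and dataset_size_mb are hypothetical. It uses only the standard library, so the baseline it reports is for a bare interpreter; importing auto-sklearn and its dependencies would raise it further.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def dataset_size_mb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size of a dense n_rows x n_cols table, assuming fixed-width values."""
    return n_rows * n_cols * bytes_per_value / (1024 * 1024)


# Shape taken from the "data shape" line in the log.
data_mb = dataset_size_mb(33908, 16)
baseline_mb = peak_rss_mb()

print(f"dataset (dense, 8 bytes per value): {data_mb:.2f} MB")
print(f"interpreter peak RSS before any ML imports: {baseline_mb:.2f} MB")
```

Running the same measurement after importing autosklearn.classification would show how much of a 1024 MB limit is spent before any model is trained.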

Environment and installation:

  • Is your installation in a virtual environment or conda environment? Virtual Environment
  • Python Version: 3.7.3
  • Auto-sklearn: 0.12.3

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
rabsr commented, Jun 28, 2021

Sure. I will try with the auto-sklearn Docker container and keep you posted once I have results.

0 reactions
github-actions[bot] commented, Aug 6, 2021

This issue has been automatically closed due to inactivity.
