Error running "fit" with many cores.
Hi! I’m experiencing a problem when I fit an AutoSklearn instance in a virtual machine with many cores.
I have run exactly the same code, with the same dataset, on three different virtual machines:
- a VM with 4 cores and 15 GB of RAM: works OK ✅
- a VM with 8 cores and 30 GB of RAM: works OK ✅
- a VM with 40 cores and 157 GB of RAM: fails ❌ with the following error:
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output:
{'error': 'Result queue is empty',
 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>",
 'subprocess_stdout': '',
 'subprocess_stderr': 'Process pynisher function call:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func
    return_value = ((func(*args, **kwargs), 0))
  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator
    return ta(queue=queue, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout
    budget_type=budget_type,
  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__
    budget_type=budget_type,
  File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__
    threadpool_limits(limits=1)
  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 171, in __init__
    self._original_info = self._set_threadpool_limits()
  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 280, in _set_threadpool_limits
    module.set_num_threads(num_threads)
  File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 659, in set_num_threads
    return set_func(num_threads)
KeyboardInterrupt
',
 'exitcode': 1, 'configuration_origin': 'DUMMY'}
This is the code I was running:
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import roc_auc
automl = AutoSklearnClassifier(time_left_for_this_task=600, metric=roc_auc)
automl.fit(x_train, y_train, x_validation, y_validation)
Limiting the number of cores with the param `nproc` seems to work, but it’s a pity that we cannot take advantage of larger infra 😦
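For reference, the released AutoSklearnClassifier constructor exposes parallelism through the `n_jobs` argument rather than a parameter named `nproc`; assuming that is what is meant, a minimal sketch of capping it could look like the following, with the value 4 chosen arbitrarily:

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import roc_auc

# Cap auto-sklearn's own worker count instead of letting it scale to every core.
# n_jobs=4 is an arbitrary example value, not a recommendation.
# x_train, y_train, x_validation, y_validation are the same arrays as in the snippet above.
automl = AutoSklearnClassifier(
    time_left_for_this_task=600,
    metric=roc_auc,
    n_jobs=4,
)
automl.fit(x_train, y_train, x_validation, y_validation)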
The dataset doesn’t seem to be the problem. I reproduced the bug with datasets of different sizes and different feature types, and it raises the same error every time (it’s not something that happens stochastically). Also, the error is almost instantaneous: clearly the fit hasn’t even started when it fails.
Environment and installation:
- OS: linux
- Python version: 3.7
- Auto-sklearn version: 0.13.0
The workaround I found to fix this issue is to limit the number of cores with the env var `OPENBLAS_NUM_THREADS` before importing anything from autosklearn. For example:
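The commenter’s exact snippet isn’t reproduced above; a minimal sketch of that workaround, with an arbitrary thread count of 8, would be:

import os

# Must be set before autosklearn (and, through it, numpy/OpenBLAS) is first imported;
# once OpenBLAS is loaded, its thread pool has already been sized to the machine.
os.environ["OPENBLAS_NUM_THREADS"] = "8"  # 8 is an arbitrary example value

from autosklearn.classification import AutoSklearnClassifier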
I have been getting this error as well on macOS Monterey 12.0 and `auto-sklearn==0.13.0`, and I had not updated any libraries in my environment before this error started showing up. It happens when calling `fit`, regardless of parameters:
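The commenter’s snippet isn’t shown; purely as an illustration of the kind of call described, with hypothetical toy data:

import numpy as np
from autosklearn.classification import AutoSklearnClassifier

# Hypothetical toy data; per the report, the crash happens on fit()
# no matter which constructor parameters are used.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

automl = AutoSklearnClassifier(time_left_for_this_task=60)
automl.fit(X, y)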