Segfault in HistGradientBoostingClassifier

Describe the bug

I trigger a segfault in HistGradientBoostingClassifier during cross-validation with n_jobs=-1. I can no longer trigger it with n_jobs=1, although I could before (in a run without random_state set).

I am using both missing-value handling and categorical-feature handling at the same time. I don’t know whether that is part of the issue.

Steps/Code to Reproduce

# %%
import pandas as pd

target_name = "RainTomorrow"
data = pd.read_csv("./weather.csv", parse_dates=["Date"])
data = data.dropna(axis="index", subset=[target_name])
X, y = data.drop(columns=["Date", target_name]), data[target_name]

# %%
X.info()

# %%
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

categorical_columns = make_column_selector(dtype_include=object)(X)
preprocessing = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
    remainder="passthrough",
)

# %%
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

model = make_pipeline(
    preprocessing,
    HistGradientBoostingClassifier(
        categorical_features=range(len(categorical_columns)),
        random_state=0,
    ),
)

# %%
from sklearn.model_selection import cross_validate

cross_validate(model, X, y, n_jobs=-1)

I am also attaching the dataset that I used to trigger the problem.

weather.csv

I tried to reproduce with a random dataset containing both categorical and missing values, but it did not segfault.

Expected Results

At least it should not segfault.

Actual Results

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
~/Documents/scratch/bug_hist_gradient_boosting.py in <module>
      40 from sklearn.model_selection import cross_validate
      41 
----> 42 cross_validate(model, X, y, n_jobs=-1)

~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    265     # independent, and that it is pickle-able.
    266     parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 267     results = parallel(
    268         delayed(_fit_and_score)(
    269             clone(estimator),

~/Documents/packages/joblib/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~/Documents/packages/joblib/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~/Documents/packages/joblib/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    442                     raise CancelledError()
    443                 elif self._state == FINISHED:
--> 444                     return self.__get_result()
    445                 else:
    446                     raise TimeoutError()

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    387         if self._exception:
    388             try:
--> 389                 raise self._exception
    390             finally:
    391                 # Break a reference cycle with the exception in self._exception

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}
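
The `SIGSEGV(-11)` value follows the POSIX convention of reporting a process killed by signal N as exit code -N. Below is a minimal sketch, using only the standard library, that decodes such a code and re-runs a script in a child interpreter with `faulthandler` enabled so that Python dumps a traceback when the crash occurs. The helper names `describe_exit_code` and `run_with_faulthandler` are hypothetical, not part of joblib or scikit-learn.

```python
import signal
import subprocess
import sys


def describe_exit_code(returncode):
    """Return a readable name for a subprocess return code."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        return signal.Signals(-returncode).name
    return f"exit status {returncode}"


def run_with_faulthandler(script_path):
    """Run a script in a child interpreter with faulthandler enabled."""
    completed = subprocess.run(
        [sys.executable, "-X", "faulthandler", script_path]
    )
    return describe_exit_code(completed.returncode)


print(describe_exit_code(-11))
```

Calling `run_with_faulthandler` on the reproduction script (for example the `bug_hist_gradient_boosting.py` file named in the traceback) would show where the main process crashes. This sketch does not check whether the handler is also active inside joblib worker processes.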

Versions

System:
    python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21)  [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
   machine: macOS-11.6-arm64-arm-64bit

Python dependencies:
          pip: 21.2.4
   setuptools: 58.2.0
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.3.3
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 3.0.0

Built with OpenMP: True

Top GitHub Comments

glemaitre commented, Oct 14, 2021

Tomorrow is Friday. It could be a nice day to release 😃

On Thu, 14 Oct 2021 at 19:49, Olivier Grisel wrote:

We should probably hurry the 1.0.1 release for this and for #21188 https://github.com/scikit-learn/scikit-learn/issues/21188.

– Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/

ogrisel commented, Oct 25, 2021

I think we can consider that #21227 will fix it in 1.0.1.