Segfault in HistGradientBoostingClassifier

Describe the bug

I trigger a segfault in HistGradientBoostingClassifier during cross-validation with n_jobs=-1. I can no longer trigger it with n_jobs=1, although I could before (in a run without random_state set).

I am using both missing values and categorical feature handling at the same time. I don't know whether this could be one of the causes.

Steps/Code to Reproduce

# %%
import pandas as pd

target_name = "RainTomorrow"
data = pd.read_csv("./weather.csv", parse_dates=["Date"])
data = data.dropna(axis="index", subset=[target_name])
X, y = data.drop(columns=["Date", target_name]), data[target_name]

# %%
X.info()

# %%
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

categorical_columns = make_column_selector(dtype_include=object)(X)
preprocessing = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
    remainder="passthrough",
)

# %%
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

model = make_pipeline(
    preprocessing,
    HistGradientBoostingClassifier(
        categorical_features=range(len(categorical_columns)),
        random_state=0,
    ),
)

# %%
from sklearn.model_selection import cross_validate

cross_validate(model, X, y, n_jobs=-1)

I am also attaching the dataset that I used to trigger the problem.

weather.csv

I tried to reproduce with a random dataset containing both categorical and missing values, but it did not segfault.

Expected Results

At least it should not segfault.

Actual Results

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
~/Documents/scratch/bug_hist_gradient_boosting.py in <module>
      40 from sklearn.model_selection import cross_validate
      41 
----> 42 cross_validate(model, X, y, n_jobs=-1)

~/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    265     # independent, and that it is pickle-able.
    266     parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 267     results = parallel(
    268         delayed(_fit_and_score)(
    269             clone(estimator),

~/Documents/packages/joblib/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~/Documents/packages/joblib/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~/Documents/packages/joblib/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    442                     raise CancelledError()
    443                 elif self._state == FINISHED:
--> 444                     return self.__get_result()
    445                 else:
    446                     raise TimeoutError()

~/mambaforge/envs/dev/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    387         if self._exception:
    388             try:
--> 389                 raise self._exception
    390             finally:
    391                 # Break a reference cycle with the exception in self._exception

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}
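
For readers unfamiliar with the SIGSEGV(-11) notation: on POSIX systems, a negative exit code -N means the process was killed by signal N. The sketch below uses only the standard library and a stand-in child process that sends SIGSEGV to itself (child_code is a hypothetical placeholder, not the scikit-learn worker) to show how such a code maps back to a signal name.

```python
import signal
import subprocess
import sys

# Stand-in for a crashing worker: a child process that sends SIGSEGV to itself.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

completed = subprocess.run([sys.executable, "-c", child_code])

# On POSIX, returncode == -N means the child was terminated by signal N.
signal_name = signal.Signals(-completed.returncode).name
print(completed.returncode, signal_name)
```

To find where a real crash happens, running the reproducer with n_jobs=1 under `python -X faulthandler` makes the interpreter dump the Python traceback when a fatal signal such as SIGSEGV is received.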

Versions

System:
    python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21)  [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
   machine: macOS-11.6-arm64-arm-64bit

Python dependencies:
          pip: 21.2.4
   setuptools: 58.2.0
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.3.3
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 3.0.0

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 13 (12 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Oct 14, 2021

Tomorrow is Friday. It could be a nice day to release 😃

On Thu, 14 Oct 2021 at 19:49, Olivier Grisel @.***> wrote:

We should probably hurry the 1.0.1 release for this and for #21188 https://github.com/scikit-learn/scikit-learn/issues/21188.

– Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/

0 reactions
ogrisel commented, Oct 25, 2021

I think we can consider that #21227 will fix it in 1.0.1.
