
Ivis seems to provoke errors when composed into a sklearn.pipeline.Pipeline that is passed to sklearn.model_selection.GridSearchCV and executed in parallel


The problem

I noticed that when Ivis is one of the steps in a sklearn.pipeline.Pipeline that is passed to sklearn.model_selection.GridSearchCV to fine-tune hyper-parameters across all estimators/transformers, and GridSearchCV has n_jobs=-1 (i.e., the executions within GridSearchCV are parallel), errors are thrown. This does not happen when n_jobs=1 (i.e., when the executions within GridSearchCV are sequential).

Since n_jobs is regulated globally for the search over the whole Pipeline, and parallelizing only specific steps is not supported, this problem forces the global use of n_jobs=1, which noticeably slows down the fine-tuning process by underusing the computational power of the machine on which the script is executed (even in parts where n_jobs=-1 would work).
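For reference, the only configuration that completes without errors is the fully sequential one, along the lines of the sketch below (reusing the same pipeline_with_ivis and parameter_grid defined in the minimal reproducible example further down):

# Sequential execution: slow, but the only configuration that does not raise errors.
grid_search = model_selection.GridSearchCV(
    pipeline_with_ivis, parameter_grid,
    scoring="accuracy", cv=10,
    n_jobs=1,  # everything runs in the main process
    return_train_score=True, verbose=3,
).fit(X, y)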

Environment

A virtual environment was created specifically for this repository, in which all modules listed in requirements.txt were installed. My setup runs an up-to-date version of Windows 10 (no WSL).

Runtime

python=3.8.4

Relevant modules

ivis=2.0.3
tensorflow=2.5.0

Minimal reproducible example

Code

if __name__ == "__main__":
    # Imports live under the __main__ guard because child processes are
    # spawned on Windows.
    import tempfile
    import ivis

    from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing
    from os import environ

    # Silence TensorFlow's C++ logging.
    environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

    X, y = datasets.load_iris(return_X_y=True)

    # Pipeline with Ivis as the dimensionality-reduction step.
    pipeline_with_ivis = pipeline.Pipeline([
        ("normalize", preprocessing.MinMaxScaler()),
        ("project", ivis.Ivis()),
        ("classify", ensemble.RandomForestClassifier()),
    ], memory=tempfile.mkdtemp())

    parameter_grid = {
        "project__k": (15,),
        "project__verbose": (True,),

        "classify__random_state": (2021,)
    }

    # n_jobs=-1 triggers the error below; n_jobs=1 does not.
    grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10, n_jobs=-1,
                                               return_train_score=True, verbose=3).fit(X, y)

Error

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 212, in extract_knn
    process.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\process.py", line 39, in _Popen
    return Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\popen_loky_win32.py", line 70, in __init__
    child_env.update(process_obj.env)
AttributeError: 'KnnWorker' object has no attribute 'env'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 591, in __call__
    return self._cached_call(args, kwargs)[0]
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 534, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 761, in call
    output = self.func(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 350, in fit_transform
    self.fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 328, in fit
    self._fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 190, in _fit
    self.neighbour_matrix = AnnoyKnnMatrix.build(X, path=self.annoy_index_path,
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 63, in build
    return cls(index, X.shape, path, k, search_k, precompute, include_distances, verbose)
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 48, in __init__
    self.precomputed_neighbours = self.get_neighbour_indices()
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 96, in get_neighbour_indices
    return extract_knn(
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 236, in extract_knn
    process.terminate()
  File "C:\Python38\lib\multiprocessing\process.py", line 133, in terminate
    self._popen.terminate()
AttributeError: 'NoneType' object has no attribute 'terminate'
  warnings.warn("Estimator fit failed. The score on this train-test"

[...]

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py:922: UserWarning: One or more of the test scores are non-finite: [nan]
  warnings.warn(

Discussion

By writing and playing with the example above, I came to the understanding that, since sklearn uses joblib while ivis uses multiprocessing, these modules might not be playing well with each other for some reason.

I would rule out the hypothesis that nested estimators/transformers with parallel routines are the problem: estimators like sklearn.ensemble.RandomForestClassifier can be set to n_jobs=-1 without problems within the Pipeline passed to GridSearchCV.

I am particularly affected by this issue because I want to employ ivis in projects that involve hyper-parameter fine-tuning using cross-validation via GridSearchCV with concurrent executions. I attempted to diagnose the problem, but to no avail, which is why I bring this issue to your attention.
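One thing that might be worth trying, assuming the conflict really is between joblib's loky backend (used by GridSearchCV) and ivis's use of multiprocessing, is forcing the search to run its parallel fits with joblib's threading backend instead of spawning worker processes. This is only a sketch, reusing the objects from the minimal example above, and I have not verified that it avoids the error:

from joblib import parallel_backend

# Run the cross-validated fits in threads rather than loky worker processes,
# so ivis's own multiprocessing code does not start inside a loky worker.
with parallel_backend("threading", n_jobs=-1):
    grid_search = model_selection.GridSearchCV(
        pipeline_with_ivis, parameter_grid,
        scoring="accuracy", cv=10, n_jobs=-1,
        return_train_score=True, verbose=3,
    ).fit(X, y)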

Observation: another part of this problem is a design choice that does not adhere to the sklearn API guidelines, for which I propose and detail a solution in #95. That design choice does not cause the aforementioned error, but it might cause other errors affecting the same use scenario (a Pipeline inside GridSearchCV running in parallel).

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
imatheussm commented, Jun 1, 2021

Just a quick update: I am still testing ivis on the minimal reproducible example, as well as on a pipeline I have been working on. I still managed to find some errors, but they seem to happen only when I am running GridSearchCV with n_jobs=-1 inside a Docker container. I am currently ascertaining whether this is a Docker problem rather than an ivis one.

If it is of any help, here is the error I have been seeing under Docker. It seems to happen whenever I run GridSearchCV with n_jobs != 1. It runs for some time without any problems, and then this happens:

free(): invalid pointer
exception calling callback for <Future at 0x7f907840c220 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/venv/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGABRT(-6)}

I tried to find more information on this on the web, but the only thing I was able to find that resembles this problem was this unanswered question on StackOverflow. It seems to happen whenever the Pipeline includes ivis, and it does not seem to happen with other projectors (e.g., UMAP, PCA) in the few tests I made, which makes me wonder whether ivis plays a part in this. If pertinent, I will produce another minimal reproducible example involving Docker so you can try to reproduce this error on your end.
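If memory pressure really is the culprit, as the TerminatedWorkerError message above suggests, a possible mitigation would be to limit how many fits run and are dispatched at once. The sketch below reuses the objects from the minimal reproducible example; the parameter values are arbitrary and I have not tested whether this actually avoids the crash:

# Limit concurrency and pre-dispatched work to keep memory usage bounded.
grid_search = model_selection.GridSearchCV(
    pipeline_with_ivis, parameter_grid,
    scoring="accuracy", cv=10,
    n_jobs=2,                # fewer concurrent worker processes
    pre_dispatch="n_jobs",   # dispatch only as many tasks as there are workers
    return_train_score=True, verbose=3,
).fit(X, y)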

As I said, I am still running tests, so take everything I said above with a pinch of salt. And before I forget, thank you for the diligence with which you assisted me in solving this issue. I really appreciate it.

0 reactions
imatheussm commented, Jun 3, 2021

Hmm, it never occurred to me that this could be memory. Thank you for the clarification on this matter and for solving the issue with GridSearchCV, I really appreciate it. Feel free to close this issue if there is nothing else to be added.
