fetch_openml can raise "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process"

Describe the bug

On Windows, if fetch_openml is run concurrently in 2 processes, for instance when running the tests with pytest-xdist, one sometimes gets errors such as:

[...]
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x1439D0F0>
gzip_response = True

    @pytest.mark.parametrize("gzip_response", [True, False])
        version    = 'active'
C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\sklearn\datasets\_openml.py:449: in _get_data_description_by_id
    url, error_message, data_home=data_home
        data_home  = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml'
        data_id    = 2
        error_message = 'Dataset with data_id 2 not found.'
        url        = 'api/v1/json/data/2'
C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\sklearn\datasets\_openml.py:172: in _get_json_content_from_openml_api
    return _load_json()
        _load_json = <function _get_json_content_from_openml_api.<locals>._load_json at 0x14167C00>
        data_home  = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml'
        error_message = 'Dataset with data_id 2 not found.'
        url        = 'api/v1/json/data/2'
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (), kw = {}
local_path = 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml\\openml.org\\api/v1/json/data/2.gz'

    @wraps(f)
    def wrapper(*args, **kw):
        if data_home is None:
            return f(*args, **kw)
        try:
            return f(*args, **kw)
        except HTTPError:
            raise
        except Exception:
            warn("Invalid cache, redownloading file", RuntimeWarning)
            local_path = _get_local_path(openml_path, data_home)
            if os.path.exists(local_path):
>               os.unlink(local_path)
E               PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\VssAdministrator\\scikit_learn_data\\openml\\openml.org\\api/v1/json/data/2.gz'

Full error log:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=35377&view=logs&j=18b0749f-dd9a-5274-d197-77895e43d4e4&t=ba53dc33-2c0b-592b-6f69-b1c7af7ca977

Steps/Code to Reproduce

Run pytest -x -n 4 --pyargs sklearn many times.

Expected Results

No crash: fetch_openml should be safe to call concurrently.

Actual Results

See error report above.

Versions

Python dependencies:
          pip: 21.3.1
   setuptools: 47.1.0
      sklearn: 1.1.dev0
        numpy: 1.21.4
        scipy: 1.7.3
       Cython: 0.29.24
       pandas: None
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\numpy\.libs\libopenblas.VTYUM5MXKVFE4PZZER3L7PNO6YB4XFF3.gfortran-win32.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Nehalem
    num_threads: 2

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\hostedtoolcache\windows\Python\3.7.9\x86\lib\site-packages\scipy\.libs\libopenblas.VTYUM5MXKVFE4PZZER3L7PNO6YB4XFF3.gfortran-win32.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Nehalem
    num_threads: 2

       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Windows\SYSTEM32\VCOMP140.DLL
        version: None
    num_threads: 2

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
adrinjalali commented, Dec 2, 2021

@siavrez Python has tempfile builtin 😃

I’d be in favor of making fetch_* multiprocess safe.
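
As a rough illustration of the tempfile idea, here is a minimal sketch of writing a cache file atomically: the data goes to a temporary file in the same directory and is then moved into place with os.replace, so a concurrent reader never sees a half-written file. The helper name atomic_write_bytes is hypothetical and this is not the code scikit-learn uses. On Windows, os.replace may still raise PermissionError while another process holds the destination open, so this narrows the race rather than removing it.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Hypothetical helper: write `data` to `path` via a temp file + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file next to the destination so that the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace overwrites an existing destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Each writer uses its own uniquely named temporary file, so two processes downloading the same dataset do not write into the same file at the same time.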

1 reaction
thomasjpfan commented, Nov 27, 2021

We have introduced some complexity in our test code for the fetch_* functions other than fetch_openml:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/conftest.py#L81-L82

This code downloads all the necessary files before pytest-xdist distributes the work. To me it still feels like a workaround to get tests to work with pytest-xdist.
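
A pre-download step of this kind could be sketched roughly as follows. This is not the linked conftest code: prefetch_datasets and the fetchers mapping are hypothetical names. The sketch relies on pytest-xdist setting the PYTEST_XDIST_WORKER environment variable ("gw0", "gw1", ...) in each worker process, and it assumes the function is called during test collection, before any worker starts running tests.

```python
import os


def prefetch_datasets(fetchers, environ=os.environ):
    """Hypothetical helper: run every fetcher once, in a single process only.

    `fetchers` maps a dataset name to a zero-argument download function.
    Returns the names that were fetched by this process.
    """
    # Without pytest-xdist the variable is unset, so default to "gw0".
    worker_id = environ.get("PYTEST_XDIST_WORKER", "gw0")
    if worker_id != "gw0":
        # Other workers skip the download and rely on the shared cache.
        return []
    fetched = []
    for name, fetch in sorted(fetchers.items()):
        fetch()
        fetched.append(name)
    return fetched
```

With only one process filling the cache, the tests themselves find the files already on disk and never need to delete or rewrite them concurrently.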

Any function, class or IO operation in sklearn or any other code might be used concurrently and implemented in different ways and a large proportion of them might not be thread safe.

We have to choose what we want to be threadsafe and I would prefer to have fetch_* be threadsafe.

Given all that, I think it is important to fix the tests so the CI is stable. I opened https://github.com/scikit-learn/scikit-learn/pull/21806 as a quick workaround to fix the tests.
