[BUG] 1.24.0 regression: Deadlock when downloading directory (async download)
Regression: Downloading directories using SFTP may cause deadlocks on 1.24.0
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
- No. I cannot contribute a bug fix at this time.
System information
- OS Platform and Distribution: Ubuntu 20.04.3 LTS
- MLflow installed from: binary
- MLflow version: 1.24 (works on version 1.23)
- Python version: 3.8.10
Describe the problem
Downloading a directory of artifacts from SFTP artifact storage deadlocks, and sometimes fails with a paramiko error instead (SFTPError: Garbage packet received).
Code to reproduce issue
Working:
from mlflow.tracking import MlflowClient
run_id = "run-id"
server = "server-uri"
# This works; fetches the contents of model sequentially.
MlflowClient(server).download_artifacts(run_id, "model/MLmodel", "out")
MlflowClient(server).download_artifacts(run_id, "model/conda.yaml", "out")
MlflowClient(server).download_artifacts(run_id, "model/model.pkl", "out")
MlflowClient(server).download_artifacts(run_id, "model/requirements.txt", "out")
Not working:
# Hangs or throws exception
MlflowClient(server).download_artifacts(run_id, "model", "out2")
Both snippets work on 1.23.1.
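Until a fix is available, the working snippet above can be generalized into a small helper that walks the artifact directory and fetches one file at a time. This is only a sketch of a possible workaround, not part of MLflow: `download_sequentially` is a hypothetical name, and `client` is assumed to behave like `MlflowClient` on 1.24, with `list_artifacts(run_id, path)` returning entries that have `.path` and `.is_dir`, and `download_artifacts(run_id, path, dst_path)` returning a local path.

```python
import os


def download_sequentially(client, run_id, artifact_dir, dst_path):
    """Download every file under artifact_dir one at a time.

    Hypothetical helper, not an MLflow API. `client` is assumed to expose
    list_artifacts(run_id, path) and download_artifacts(run_id, path, dst_path)
    the way mlflow.tracking.MlflowClient does.
    """
    # download_artifacts expects the destination directory to exist.
    os.makedirs(dst_path, exist_ok=True)
    downloaded = []
    pending = [artifact_dir]
    while pending:
        current = pending.pop()
        for info in client.list_artifacts(run_id, current):
            if info.is_dir:
                # Descend into subdirectories instead of downloading them whole.
                pending.append(info.path)
            else:
                # Single-file downloads are the case reported as working.
                downloaded.append(
                    client.download_artifacts(run_id, info.path, dst_path)
                )
    return downloaded
```

With the names from the report, the call would look like `download_sequentially(MlflowClient(server), run_id, "model", "out2")`. It trades download speed for never requesting a whole directory in a single call.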
Other info / logs
Paramiko exception:
---------------------------------------------------------------------------
MlflowException Traceback (most recent call last)
<ipython-input-4-3293f4eca114> in <module>
----> 1 MlflowClient("<server url>").download_artifacts("run-id", "model", "out2")
~/.local/lib/python3.8/site-packages/mlflow/tracking/client.py in download_artifacts(self, run_id, path, dst_path)
1411 Artifacts: ['features.txt']
1412 """
-> 1413 return self._tracking_client.download_artifacts(run_id, path, dst_path)
1414
1415 def set_terminated(
~/.local/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py in download_artifacts(self, run_id, path, dst_path)
389 :return: Local path of desired artifact.
390 """
--> 391 return self._get_artifact_repo(run_id).download_artifacts(path, dst_path)
392
393 def set_terminated(self, run_id, status=None, end_time=None):
~/.local/lib/python3.8/site-packages/mlflow/store/artifact/artifact_repo.py in download_artifacts(self, artifact_path, dst_path)
263
264 if len(failed_downloads) > 0:
--> 265 raise MlflowException(
266 message=(
267 "The following failures occurred while downloading one or more"
MlflowException: The following failures occurred while downloading one or more artifacts from sftp://server:/mlflow_ftp/upload/exp-id/run-id/artifacts: {'model/MLmodel': "SFTPError('Garbage packet received')", 'model/conda.yaml': "SFTPError('Garbage packet received')", 'model/model.pkl': "SFTPError('Garbage packet received')", 'model/requirements.txt': "SFTPError('Garbage packet received')"}
Exception when interrupting deadlock:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-4-3293f4eca114> in <module>
----> 1 MlflowClient(server).download_artifacts(run_id, "model", "out2")
~/.local/lib/python3.8/site-packages/mlflow/tracking/client.py in download_artifacts(self, run_id, path, dst_path)
1411 Artifacts: ['features.txt']
1412 """
-> 1413 return self._tracking_client.download_artifacts(run_id, path, dst_path)
1414
1415 def set_terminated(
~/.local/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py in download_artifacts(self, run_id, path, dst_path)
389 :return: Local path of desired artifact.
390 """
--> 391 return self._get_artifact_repo(run_id).download_artifacts(path, dst_path)
392
393 def set_terminated(self, run_id, status=None, end_time=None):
~/.local/lib/python3.8/site-packages/mlflow/store/artifact/artifact_repo.py in download_artifacts(self, artifact_path, dst_path)
258 for inflight_download in inflight_downloads:
259 try:
--> 260 inflight_download.download_future.result()
261 except Exception as e:
262 failed_downloads[inflight_download.src_artifact_path] = repr(e)
/usr/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
437 return self.__get_result()
438
--> 439 self._condition.wait(timeout)
440
441 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:
/usr/lib/python3.8/threading.py in wait(self, timeout)
300 try: # restore state no matter what (e.g., KeyboardInterrupt)
301 if timeout is None:
--> 302 waiter.acquire()
303 gotit = True
304 else:
KeyboardInterrupt:
What component(s), interfaces, languages, and integrations does this bug affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- State:
- Created a year ago
- Reactions:3
- Comments:14 (11 by maintainers)
@KrSiddhartha They’re on master in rsundqvist/mlflow. See the commits linked above in this thread 😃
They can be installed with pip using:
pip install git+https://github.com/rsundqvist/mlflow.git
@rsundqvist Thank you for diagnosing this! Would you be able to introduce a connection pool to SFTPArtifactRepository to address this issue?
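For readers unfamiliar with the idea, a connection pool could look roughly like the sketch below. It assumes, as the request implies but this thread does not spell out, that concurrent download threads should each get exclusive use of a connection. `ConnectionPool` and the `connect` factory are hypothetical names for illustration; this is not MLflow's or paramiko's API, and not necessarily how the eventual fix is structured.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: each connection is held by one thread at a time."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument factory that returns a new connection,
        # for example a function that opens an SFTP session (assumption).
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        # Blocks until some other thread returns a connection to the pool.
        conn = self._idle.get()
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the download failed.
            self._idle.put(conn)
```

A worker thread would wrap each transfer in `with pool.connection() as conn:`, so a thread pool with more workers than connections waits for a free connection instead of sharing one.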