[BUG] 1.24.0 regression: Deadlock when downloading directory (async download)

Regression: Downloading directories using SFTP may cause deadlocks on 1.24.0

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • OS Platform and Distribution: Ubuntu 20.04.3 LTS
  • MLflow installed from: binary
  • MLflow version: 1.24.0 (works on version 1.23.1)
  • Python version: 3.8.10

Describe the problem

Downloading a directory of artifacts from SFTP artifact storage causes a deadlock, and sometimes a paramiko error (Garbage packet received) instead.

Code to reproduce issue

Working:

from mlflow.tracking import MlflowClient
run_id = "run-id"
server = "server-uri"
# This works; fetches the contents of model sequentially.
MlflowClient(server).download_artifacts(run_id, "model/MLmodel", "out")
MlflowClient(server).download_artifacts(run_id, "model/conda.yaml", "out")
MlflowClient(server).download_artifacts(run_id, "model/model.pkl", "out")
MlflowClient(server).download_artifacts(run_id, "model/requirements.txt", "out")

Not working:

# Hangs or throws exception
MlflowClient(server).download_artifacts(run_id, "model", "out2")

Both snippets work on 1.23.1.
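Until a fix is available, one possible workaround is to walk the artifact directory and fetch the files one at a time, since single-file downloads work in the report above. This is only a sketch: `download_dir_sequentially` is a hypothetical helper, it assumes `list_artifacts` itself still works against the SFTP store on 1.24.0, and it takes the client as a parameter so it can be tried without a server.

```python
import os


def download_dir_sequentially(client, run_id, artifact_dir, dst_path):
    """Download every file under artifact_dir one at a time.

    `client` is expected to behave like mlflow.tracking.MlflowClient:
    list_artifacts(run_id, path) returns entries with .path and .is_dir,
    and download_artifacts(run_id, path, dst_path) fetches a single file.
    """
    os.makedirs(dst_path, exist_ok=True)  # the destination must exist
    downloaded = []
    pending = [artifact_dir]
    while pending:
        current = pending.pop()
        for info in client.list_artifacts(run_id, current):
            if info.is_dir:
                pending.append(info.path)  # descend into it later
            else:
                # One file per call, so no parallel directory download.
                downloaded.append(
                    client.download_artifacts(run_id, info.path, dst_path)
                )
    return downloaded
```

With the names from the snippets above, the call would be `download_dir_sequentially(MlflowClient(server), run_id, "model", "out2")`.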

Other info / logs

Paramiko exception:

---------------------------------------------------------------------------
MlflowException                           Traceback (most recent call last)
<ipython-input-4-3293f4eca114> in <module>
----> 1 MlflowClient("<server url>").download_artifacts("run-id", "model", "out2")

~/.local/lib/python3.8/site-packages/mlflow/tracking/client.py in download_artifacts(self, run_id, path, dst_path)
   1411             Artifacts: ['features.txt']
   1412         """
-> 1413         return self._tracking_client.download_artifacts(run_id, path, dst_path)
   1414 
   1415     def set_terminated(

~/.local/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py in download_artifacts(self, run_id, path, dst_path)
    389         :return: Local path of desired artifact.
    390         """
--> 391         return self._get_artifact_repo(run_id).download_artifacts(path, dst_path)
    392 
    393     def set_terminated(self, run_id, status=None, end_time=None):

~/.local/lib/python3.8/site-packages/mlflow/store/artifact/artifact_repo.py in download_artifacts(self, artifact_path, dst_path)
    263 
    264         if len(failed_downloads) > 0:
--> 265             raise MlflowException(
    266                 message=(
    267                     "The following failures occurred while downloading one or more"

MlflowException: The following failures occurred while downloading one or more artifacts from sftp://server:/mlflow_ftp/upload/exp-id/run-id/artifacts: {'model/MLmodel': "SFTPError('Garbage packet received')", 'model/conda.yaml': "SFTPError('Garbage packet received')", 'model/model.pkl': "SFTPError('Garbage packet received')", 'model/requirements.txt': "SFTPError('Garbage packet received')"}

Exception when interrupting deadlock:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-4-3293f4eca114> in <module>
----> 1 MlflowClient(server).download_artifacts(run_id, "model", "out2")

~/.local/lib/python3.8/site-packages/mlflow/tracking/client.py in download_artifacts(self, run_id, path, dst_path)
   1411             Artifacts: ['features.txt']
   1412         """
-> 1413         return self._tracking_client.download_artifacts(run_id, path, dst_path)
   1414 
   1415     def set_terminated(

~/.local/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py in download_artifacts(self, run_id, path, dst_path)
    389         :return: Local path of desired artifact.
    390         """
--> 391         return self._get_artifact_repo(run_id).download_artifacts(path, dst_path)
    392 
    393     def set_terminated(self, run_id, status=None, end_time=None):

~/.local/lib/python3.8/site-packages/mlflow/store/artifact/artifact_repo.py in download_artifacts(self, artifact_path, dst_path)
    258         for inflight_download in inflight_downloads:
    259             try:
--> 260                 inflight_download.download_future.result()
    261             except Exception as e:
    262                 failed_downloads[inflight_download.src_artifact_path] = repr(e)

/usr/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437                     return self.__get_result()
    438 
--> 439                 self._condition.wait(timeout)
    440 
    441                 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

/usr/lib/python3.8/threading.py in wait(self, timeout)
    300         try:    # restore state no matter what (e.g., KeyboardInterrupt)
    301             if timeout is None:
--> 302                 waiter.acquire()
    303                 gotit = True
    304             else:

KeyboardInterrupt:
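The traceback shows the main thread blocked in `Future.result()` with no timeout, which is why the hang only ends on Ctrl+C. As a debugging aid (not what MLflow does), the same wait can be bounded so that a stuck download is reported instead of blocking forever. `wait_with_timeout` and its arguments are hypothetical names used for illustration.

```python
from concurrent.futures import TimeoutError as FutureTimeout


def wait_with_timeout(futures_by_name, seconds):
    """Collect results, recording a failure instead of blocking forever."""
    results, failures = {}, {}
    for name, future in futures_by_name.items():
        try:
            results[name] = future.result(timeout=seconds)
        except FutureTimeout:
            # The worker is still running (or stuck); report and move on.
            failures[name] = f"no result after {seconds}s"
        except Exception as e:
            # Same bookkeeping as the failed_downloads dict in the traceback.
            failures[name] = repr(e)
    return results, failures
```

Note that a timeout only unblocks the caller; a worker thread that is truly deadlocked keeps running in the background.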

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction
rsundqvist commented, Apr 28, 2022

@KrSiddhartha They’re on master in rsundqvist/mlflow. See the commits linked above in this thread 😃

Can be installed with pip using pip install git+https://github.com/rsundqvist/mlflow.git

1 reaction
dbczumar commented, Apr 20, 2022

@rsundqvist Thank you for diagnosing this! Would you be able to introduce a connection pool to SFTPArtifactRepository to address this issue?
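The diagnosis itself is not quoted here, so the following is only a generic sketch of the connection-pool idea, under the assumption that the trouble comes from parallel download threads sharing one connection. It is not MLflow's implementation: `ConnectionPool` and the `connect` factory are hypothetical, and `connect` stands in for whatever opens an SFTP session.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: each connection is used by one thread at a time."""

    def __init__(self, connect, size):
        # `connect` is a caller-supplied factory (an assumption of this sketch).
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand it back, even on error
```

A download worker would then wrap its transfer in `with pool.connection() as conn:`, so that no two threads ever talk over the same connection at once.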
