How to make mlflow tracking useable on shared file systems?

Hello, first time user here. I am failing to use the mlflow ui because it is just awfully slow for even very low numbers of runs: [CRITICAL] WORKER TIMEOUT

What I tried:

Changed the file store to a shared file system for parallel training runs (using mlflow.set_tracking_uri()), but found that mlflow is awfully slow this way by default. Simply running mlflow.search_runs() for 120 runs and 9 metrics takes 30s.
Tried to change to an sqlite URI on the shared filesystem, but this causes artifacts to land in the local folder again (why??).
Tried to find mlflow.set_artifact_uri() but could not find one.

TL;DR: How do I make mlflow work with a shared filesystem? How can I store the artifacts in the same folder as the mlflow.db file from sqlite?

Issue Analytics

Created 3 years ago
Reactions: 2
Comments: 6 (1 by maintainers)

Top GitHub Comments

3 reactions

Hoeze commented, Feb 20, 2021

Thanks for your answers 😃 This issue somehow slipped my attention, sorry for that…

So what I wanted to do with mlflow was basically serverless offline tracking with SQLite3 DB:

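import mlflow

project_dir = "/proj/myfancymodel"

# SQLAlchemy-style URI; an absolute path needs four slashes:
# sqlite:////proj/myfancymodel/mlflow.db
mlflow.set_tracking_uri("sqlite:///" + project_dir + "/mlflow.db")
# file:///proj/myfancymodel/artifacts/
# Wished-for call: mlflow has no set_artifact_uri(), so this line fails.
mlflow.set_artifact_uri("file://" + project_dir + "/artifacts/")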

sqlite3 should be more than fast enough to handle < 50,000 rows
artifacts + database are stored at the same location
no fancy database connection and user authentication setup
simple file access permission is enough
database concurrency is handled through file locking of mlflow.db

This failed because there is no possibility to specify the artifact_uri inside a python script.
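One possible workaround, sketched here rather than taken from the thread: the artifact location can be fixed per experiment when the experiment is created, using the `artifact_location` argument of `mlflow.create_experiment()`. The helper names `build_uris` and `configure_mlflow` and the experiment name are hypothetical, and the sketch assumes POSIX-style absolute paths.

```python
from pathlib import PurePosixPath


def build_uris(project_dir):
    """Return (tracking_uri, artifact_uri) for a project directory.

    Hypothetical helper; assumes project_dir is an absolute POSIX path.
    """
    root = PurePosixPath(project_dir)
    # SQLAlchemy-style SQLite URI: "sqlite:///" followed by the absolute path.
    tracking_uri = "sqlite:///" + str(root / "mlflow.db")
    # Local file URI for the artifact folder next to the database.
    artifact_uri = "file://" + str(root / "artifacts")
    return tracking_uri, artifact_uri


def configure_mlflow(project_dir, experiment_name):
    """Point mlflow at the SQLite DB and create the experiment if missing."""
    import mlflow  # imported lazily so build_uris() works without mlflow

    tracking_uri, artifact_uri = build_uris(project_dir)
    mlflow.set_tracking_uri(tracking_uri)
    # artifact_location only takes effect when the experiment is created;
    # an existing experiment keeps the location it was created with.
    if mlflow.get_experiment_by_name(experiment_name) is None:
        mlflow.create_experiment(experiment_name, artifact_location=artifact_uri)
    mlflow.set_experiment(experiment_name)


tracking_uri, artifact_uri = build_uris("/proj/myfancymodel")
print(tracking_uri)  # sqlite:////proj/myfancymodel/mlflow.db
print(artifact_uri)  # file:///proj/myfancymodel/artifacts
```

Calling `configure_mlflow("/proj/myfancymodel", "my-experiment")` before `mlflow.start_run()` should then keep the database and the artifacts under the same project directory.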

2 reactions

hughperkins commented, Feb 20, 2021

Just in case it’s useful (I’m just another user): I was getting [CRITICAL] WORKER TIMEOUT at the start, when my mlruns was on a shared NFS file system. I changed to Postgres on a local file system, just on the server, and now my mlflow ui zips along smoothly 😃 The artifacts are stored by the client, not by the server, so a shared file system might work well for them. I’m pretty sure the ui is slow because of the tracking db, i.e. mlruns, not because of the artifacts.
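A setup along these lines could look roughly like the sketch below, which assembles an `mlflow server` command with a Postgres backend store and a shared artifact folder. The `--backend-store-uri`, `--default-artifact-root`, `--host` and `--port` flags belong to the mlflow CLI; the helper name, the connection string, and the paths are hypothetical placeholders, not values from the thread.

```python
import shlex


def mlflow_server_command(backend_store_uri, artifact_root,
                          host="127.0.0.1", port=5000):
    """Build the argument list for an `mlflow server` invocation.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,   # run and metric metadata
        "--default-artifact-root", artifact_root,   # where clients write artifacts
        "--host", host,
        "--port", str(port),
    ]


cmd = mlflow_server_command(
    # Placeholder credentials and database name; replace with real ones.
    "postgresql://mlflow_user:change-me@localhost/mlflow_db",
    # Placeholder path on the shared file system, visible to all clients.
    "/shared/proj/myfancymodel/artifacts",
)
print(shlex.join(cmd))
```

The printed command can be started on the server, for example with `subprocess.run(cmd)`; clients would then call `mlflow.set_tracking_uri()` with the server's HTTP address instead of a file or SQLite URI.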