[BUG] Deadlock issue raised from backend store DB
MLflow Roadmap Item
This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers.
System information
- OS Platform and Distribution: MLflow Docker image based on 'python:3.9-slim-buster'. Deployed in a k8s cluster.
- MLflow installed from: official PyPI package (pip)
- MLflow version: 1.16.0
- MLflow backend store: AWS RDS MySQL 8.0.23
- MLflow artifact root: AWS S3 bucket
- Python version: 3.9
- **Command**:
  mlflow server --host 0.0.0.0 --port 5000 --default-artifact-root ${BUCKET} --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE}
Describe the problem
I have been load testing the MLflow Tracking Server with the multiprocessing sample code below. For a few of the runs, the following error was raised:
mlflow.exceptions.MlflowException: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: INSERT INTO latest_metrics (`key`, value, timestamp, step, is_nan, run_uuid) VALUES (%(key)s, %(value)s, %(timestamp)s, %(step)s, %(is_nan)s, %(run_uuid)s)]
[parameters: {'key': 'rmse', 'value': 0.859125957974236, 'timestamp': 1620801352207, 'step': 0, 'is_nan': 0, 'run_uuid': 'eb6d1876d37d4706a97cec4db51937a6'}]
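One possible client-side workaround is to retry the failing logging call: MySQL deadlocks are transient, and the error itself says "try restarting transaction". The helper below is a minimal sketch, assuming the deadlock surfaces as an mlflow.exceptions.MlflowException whose message contains "Deadlock found" (as in the traceback above); the function name, attempt count, and backoff values are illustrative, not part of MLflow.

import time

import mlflow
from mlflow.exceptions import MlflowException


def log_metric_with_retry(key, value, max_attempts=5, backoff_seconds=0.5):
    # Retry transient backend-store errors (e.g. MySQL deadlock 1213) when logging a metric.
    for attempt in range(1, max_attempts + 1):
        try:
            mlflow.log_metric(key=key, value=value)
            return
        except MlflowException as exc:
            if "Deadlock found" in str(exc) and attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff before retrying
                continue
            raise

For example, replacing mlflow.log_metric(key="rmse", value=rmse) in the reproduction below with log_metric_with_retry("rmse", rmse) masks the occasional deadlock at the cost of a short delay.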
Code to reproduce issue
import time
import multiprocessing
from multiprocessing import Pool
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri({REMOTE_TRACKING_SERVER_URI})
mlflow.set_experiment({EXPERIMENT_NAME})


def eval_metrics(actual, pred):
    # compute relevant metrics
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


def load_data(data_path):
    data = pd.read_csv(data_path)

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    return train_x, train_y, test_x, test_y


def train(x, alpha=0.5, l1_ratio=0.5):
    # train a model with given parameters
    warnings.filterwarnings("ignore")
    np.random.seed(40)
    mlflow.sklearn.autolog()

    # Read the wine-quality csv file (make sure you're running this from the root of MLflow!)
    data_path = "./data/wine-quality.csv"
    train_x, train_y, test_x, test_y = load_data(data_path)

    # Useful for multiple runs (only doing one run in this sample notebook)
    with mlflow.start_run(run_name='run-infy 1000 - ' + str(x)):
        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        # Evaluate Metrics
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("############## logging params and metrics for " + str(x) + "/" + str(500) + " ###########")

        # Log parameter, metrics, and model to MLflow
        mlflow.log_param(key="alpha", value=alpha)
        mlflow.log_param(key="l1_ratio", value=l1_ratio)
        mlflow.log_metric(key="rmse", value=rmse)
        mlflow.log_metrics({"mae": mae, "r2": r2})
        mlflow.log_artifact(data_path)
        mlflow.sklearn.log_model(lr, "model")


def multiprocessing_func(x):
    time.sleep(2)
    train(x, 0.5, 0.8)


if __name__ == '__main__':
    starttime = time.time()
    pool = Pool()
    pool.map(multiprocessing_func, range(1, 1000))
    pool.close()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))
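Note: Pool() with no arguments starts one worker process per CPU core, so many runs execute concurrently and each one logs parameters, metrics, artifacts, and a model against the same tracking server. The resulting concurrent inserts into the latest_metrics table are what trigger the deadlock shown above.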
What component(s), interfaces, languages, and integrations does this bug affect?
- help wanted: Need help from community
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Addressed via #5460. Apologies for the delay!
Hi @NieuweNils! We've identified a similar issue and fixed it in the Databricks-hosted MLflow Tracking service. I'm going to apply a corresponding fix to OSS to close out this issue shortly!
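For readers curious about the general shape of such a fix (this sketch is illustrative only and is not the actual change in #5460): a common pattern for MySQL error 1213 in a SQLAlchemy-backed store is to roll back and re-run the failed transaction a bounded number of times. The session_factory and operation arguments below are hypothetical stand-ins for the store's own session management and write logic.

import time

from sqlalchemy.exc import OperationalError

MYSQL_DEADLOCK_ERROR_CODE = 1213


def run_with_deadlock_retry(session_factory, operation, max_attempts=3):
    # Run operation(session) in a transaction, restarting it if MySQL reports a deadlock (1213).
    for attempt in range(1, max_attempts + 1):
        session = session_factory()
        try:
            operation(session)
            session.commit()
            return
        except OperationalError as exc:
            session.rollback()
            is_deadlock = getattr(exc.orig, "args", None) and exc.orig.args[0] == MYSQL_DEADLOCK_ERROR_CODE
            if is_deadlock and attempt < max_attempts:
                time.sleep(0.1 * attempt)  # brief backoff before restarting the transaction
                continue
            raise
        finally:
            session.close()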