[BUG] Deadlock issue raised from backend store DB
MLflow Roadmap Item
This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers.
System information
- OS Platform and Distribution: MLflow Docker image based on 'python:3.9-slim-buster'. Deployed in a k8s cluster.
- MLflow installed from: official PyPI package (pip)
- MLflow version: 1.16.0
- MLflow backend store: AWS RDS MySQL 8.0.23
- MLflow artifact root: AWS S3 bucket
- Python version: 3.9
- **Command**:
  mlflow server --host 0.0.0.0 --port 5000 --default-artifact-root ${BUCKET} --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE}
Describe the problem
I have been load testing the MLflow Tracking Server with the multiprocessing sample code below. For a few of the runs, the following error was raised:
mlflow.exceptions.MlflowException: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: INSERT INTO latest_metrics (`key`, value, timestamp, step, is_nan, run_uuid) VALUES (%(key)s, %(value)s, %(timestamp)s, %(step)s, %(is_nan)s, %(run_uuid)s)]
[parameters: {'key': 'rmse', 'value': 0.859125957974236, 'timestamp': 1620801352207, 'step': 0, 'is_nan': 0, 'run_uuid': 'eb6d1876d37d4706a97cec4db51937a6'}]
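One possible client-side workaround is to retry the failing logging call: MySQL deadlocks are transient, and the error itself says "try restarting transaction". The helper below is a minimal sketch, assuming the deadlock surfaces as an mlflow.exceptions.MlflowException whose message contains "Deadlock found" (as in the traceback above); the function name, attempt count, and backoff values are illustrative, not part of MLflow.

import time

import mlflow
from mlflow.exceptions import MlflowException


def log_metric_with_retry(key, value, max_attempts=5, backoff_seconds=0.5):
    # Retry transient backend-store errors (e.g. MySQL deadlock 1213) when logging a metric.
    for attempt in range(1, max_attempts + 1):
        try:
            mlflow.log_metric(key=key, value=value)
            return
        except MlflowException as exc:
            if "Deadlock found" in str(exc) and attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff before retrying
                continue
            raise

For example, replacing mlflow.log_metric(key="rmse", value=rmse) in the reproduction below with log_metric_with_retry("rmse", rmse) masks the occasional deadlock at the cost of a short delay.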
Code to reproduce issue
import time
import multiprocessing
from multiprocessing import Pool
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri({REMOTE_TRACKING_SERVER_URI})
mlflow.set_experiment({EXPERIMENT_NAME})


def eval_metrics(actual, pred):
    # compute relevant metrics
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


def load_data(data_path):
    data = pd.read_csv(data_path)

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    return train_x, train_y, test_x, test_y


def train(x, alpha=0.5, l1_ratio=0.5):
    # train a model with given parameters
    warnings.filterwarnings("ignore")
    np.random.seed(40)
    mlflow.sklearn.autolog()

    # Read the wine-quality csv file (make sure you're running this from the root of MLflow!)
    data_path = "./data/wine-quality.csv"
    train_x, train_y, test_x, test_y = load_data(data_path)

    # Useful for multiple runs (only doing one run in this sample notebook)
    with mlflow.start_run(run_name='run-infy 1000 - ' + str(x)):
        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        # Evaluate Metrics
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("############## logging params and metrics for " + str(x) + "/" + str(500) + " ###########")

        # Log parameter, metrics, and model to MLflow
        mlflow.log_param(key="alpha", value=alpha)
        mlflow.log_param(key="l1_ratio", value=l1_ratio)
        mlflow.log_metric(key="rmse", value=rmse)
        mlflow.log_metrics({"mae": mae, "r2": r2})
        mlflow.log_artifact(data_path)
        mlflow.sklearn.log_model(lr, "model")


def multiprocessing_func(x):
    time.sleep(2)
    train(x, 0.5, 0.8)


if __name__ == '__main__':
    starttime = time.time()
    pool = Pool()
    pool.map(multiprocessing_func, range(1, 1000))
    pool.close()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))
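Note: Pool() with no arguments starts one worker process per CPU core, so many runs execute concurrently and each one logs parameters, metrics, artifacts, and a model against the same tracking server. The resulting concurrent inserts into the latest_metrics table are what trigger the deadlock shown above.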
What component(s), interfaces, languages, and integrations does this bug affect?
- help wanted: Need help from community
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Addressed via #5460. Apologies for the delay!
Hi @NieuweNils! We've identified a similar issue and fixed it in the Databricks-hosted MLflow Tracking service. I'm going to apply a corresponding fix to OSS to close out this issue shortly!
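For readers curious about the general shape of such a fix (this sketch is illustrative only and is not the actual change in #5460): a common pattern for MySQL error 1213 in a SQLAlchemy-backed store is to roll back and re-run the failed transaction a bounded number of times. The session_factory and operation arguments below are hypothetical stand-ins for the store's own session management and write logic.

import time

from sqlalchemy.exc import OperationalError

MYSQL_DEADLOCK_ERROR_CODE = 1213


def run_with_deadlock_retry(session_factory, operation, max_attempts=3):
    # Run operation(session) in a transaction, restarting it if MySQL reports a deadlock (1213).
    for attempt in range(1, max_attempts + 1):
        session = session_factory()
        try:
            operation(session)
            session.commit()
            return
        except OperationalError as exc:
            session.rollback()
            is_deadlock = getattr(exc.orig, "args", None) and exc.orig.args[0] == MYSQL_DEADLOCK_ERROR_CODE
            if is_deadlock and attempt < max_attempts:
                time.sleep(0.1 * attempt)  # brief backoff before restarting the transaction
                continue
            raise
        finally:
            session.close()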