[BUG] Deadlock issue raised from backend store DB

MLflow Roadmap Item

This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers.

System information

  • OS Platform and Distribution: MLflow docker image based on ‘python:3.9-slim-buster’. Deployed in k8s cluster.
  • MLflow installed from: official pip / PyPI
  • MLflow version: 1.16.0
  • MLflow backend store: AWS RDS MySQL 8.0.23
  • MLflow artifact root: AWS S3 bucket
  • Python version: 3.9
  • Command: mlflow server --host 0.0.0.0 --port 5000 --default-artifact-root ${BUCKET} --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE}

Describe the problem

I have been load testing the MLflow Tracking server with a multiprocessing sample script. For a few of the runs, the following error was raised:

mlflow.exceptions.MlflowException: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: INSERT INTO latest_metrics (key, value, timestamp, step, is_nan, run_uuid) VALUES (%(key)s, %(value)s, %(timestamp)s, %(step)s, %(is_nan)s, %(run_uuid)s)] [parameters: {'key': 'rmse', 'value': 0.859125957974236, 'timestamp': 1620801352207, 'step': 0, 'is_nan': 0, 'run_uuid': 'eb6d1876d37d4706a97cec4db51937a6'}]
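
MySQL's own error text suggests restarting the transaction, so one possible client-side workaround is to retry the failing logging call with a short backoff. The sketch below is only an illustration of that idea. `retry_on_deadlock` is a hypothetical helper, not part of MLflow, and it is not the fix that was later applied to the server. It assumes the deadlock text shows up in the exception message, as it does in the error above.

```python
import random
import time


def retry_on_deadlock(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying when the error message mentions a MySQL deadlock.

    Hypothetical helper, not part of MLflow. It matches on the error text
    because the exception in the report wraps pymysql error 1213.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            is_deadlock = "Deadlock found" in str(exc)
            # Re-raise unrelated errors immediately, and give up after the
            # last allowed attempt.
            if not is_deadlock or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so concurrent workers do not
            # retry in lockstep and collide again.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

In the reproduction script, a call such as mlflow.log_metric(key="rmse", value=rmse) could be wrapped as retry_on_deadlock(lambda: mlflow.log_metric(key="rmse", value=rmse)). This only hides the symptom on the client; the contention on the latest_metrics table still has to be resolved on the server side.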

Code to reproduce issue

import time
import multiprocessing
from multiprocessing import Pool
import os
import warnings
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri({REMOTE_TRACKING_SERVER_URI}) 
mlflow.set_experiment({EXPERIMENT_NAME})

def eval_metrics(actual, pred):
    # compute relevant metrics
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
    
def load_data(data_path):
    data = pd.read_csv(data_path)
    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)
    
    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    return train_x, train_y, test_x, test_y

def train(x, alpha=0.5, l1_ratio=0.5):
    # train a model with given parameters
    warnings.filterwarnings("ignore")
    np.random.seed(40)
    mlflow.sklearn.autolog()

    # Read the wine-quality csv file (make sure you're running this from the root of MLflow!)
    data_path = "./data/wine-quality.csv"
    train_x, train_y, test_x, test_y = load_data(data_path)
    # Useful for multiple runs (only doing one run in this sample notebook)

    with mlflow.start_run(run_name='run-infy 1000 - ' + str(x)):
        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)
        # Evaluate Metrics
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("##############  logging params and metrics for " + str(x) + "/" + str(500) + " ###########")
        # Log parameter, metrics, and model to MLflow
        mlflow.log_param(key="alpha", value=alpha)
        mlflow.log_param(key="l1_ratio", value=l1_ratio)
        mlflow.log_metric(key="rmse", value=rmse)
        mlflow.log_metrics({"mae": mae, "r2": r2})
        mlflow.log_artifact(data_path)
        mlflow.sklearn.log_model(lr, "model")

def multiprocessing_func(x):
    # Worker entry point: pause briefly, then train and log one run
    time.sleep(2)
    train(x, alpha=0.5, l1_ratio=0.8)

if __name__ == '__main__':
    starttime = time.time()
    pool = Pool()
    pool.map(multiprocessing_func, range(1, 1000))
    pool.close()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

What component(s), interfaces, languages, and integrations does this bug affect?

  • help wanted: Need help from community

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
dbczumar commented, Apr 14, 2022

Addressed via #5460. Apologies for the delay!

1 reaction
dbczumar commented, Nov 8, 2021

Hi @NieuweNils! We’ve identified a similar issue and fixed it in the Databricks-hosted MLflow Tracking service. I’m going to apply a corresponding fix to OSS to close out this issue shortly!
