[BUG] Internal Server Error on delete
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
1.19.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
- Python version: 3.9
- yarn version, if running the dev UI:
Describe the problem
After using mlflow for ~1 year, we've run into this difficult-to-reproduce bug several times: an experiment tab shows an INTERNAL_SERVER_ERROR and stops displaying the runs that were stored in that experiment.
This issue seems to come up when deleting one or multiple runs from an experiment, but it doesn't happen every time a delete is attempted. When it does fail, it usually fails on the first delete attempt in an experiment. The delete command hangs longer than usual and then returns the INTERNAL_SERVER_ERROR.

Besides a full fix, I would also be interested to learn if it's possible to retrieve the runs that were stored in the broken experiment.
Steps to reproduce the bug
This is inconsistently reproducible (and I apologize, I know that's not very helpful), but I logged 50-100 runs in a new experiment. Each run had 5 metrics with ~200 values. I then tried deleting a random number of runs, sometimes one and sometimes the whole page. Sometimes the experiment broke, but most times the delete executed with no problem.
Code to generate data required to reproduce the bug
import mlflow
import boto3  # imported in the original report; not used directly below
import random

MLFLOW_EXPERIMENT_NAME = "break server"

# Create the experiment on first use, then make it the active experiment.
if not mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME):
    mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

mlflow.start_run()
mlflow.log_param("number", 1)

# Log 200 values for each of the 5 metrics.
randomlist = random.sample(range(10, 500), 200)
for x in randomlist:
    mlflow.log_metric("rand1", x)
    mlflow.log_metric("rand2", x)
    mlflow.log_metric("rand3", x)
    mlflow.log_metric("rand4", x)
    mlflow.log_metric("rand5", x)
mlflow.end_run()
I then ran this 50-100 times to fill the experiment with runs to delete.
Is the console panel in DevTools showing errors relevant to the bug?

Error: Promised response from onMessage listener went out of scope 5 [background.js:841:170](moz-extension://d6a7add6-de11-4239-898f-0a287b30c3e7/dist/background.js)
XHR failed
Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), setRequestHeader: setRequestHeader(e, t), overrideMimeType: overrideMimeType(e), statusCode: statusCode(e), abort: abort(e), state: state(), always: always(), catch: catch(e), … }
  abort: function abort(e)
  always: function always()
  catch: function catch(e)
  done: function add()
  fail: function add()
  getAllResponseHeaders: function getAllResponseHeaders()
  getResponseHeader: function getResponseHeader(e)
  overrideMimeType: function overrideMimeType(e)
  pipe: function pipe()
  progress: function add()
  promise: function promise(e)
  readyState: 4
  responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
  setRequestHeader: function setRequestHeader(e, t)
  state: function state()
  status: 500
  statusCode: function statusCode(e)
  statusText: "Internal Server Error"
  then: function then(e, r, o)
  <prototype>: Object { … }
[main.895f3836.chunk.js:1:15852](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
error https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
l https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
fireWith https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
S https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
t https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
Object { xhr: {…} }
  xhr: Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), … }
    abort: function abort(e)
    always: function always()
    catch: function catch(e)
    done: function add()
    fail: function add()
    getAllResponseHeaders: function getAllResponseHeaders()
    getResponseHeader: function getResponseHeader(e)
    overrideMimeType: function overrideMimeType(e)
    pipe: function pipe()
    progress: function add()
    promise: function promise(e)
    readyState: 4
    responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
    setRequestHeader: function setRequestHeader(e, t)
    state: function state()
    status: 500
    statusCode: function statusCode(e)
    statusText: "Internal Server Error"
    then: function then(e, r, o)
    <prototype>: Object { … }
  <prototype>: Object { … }
[main.895f3836.chunk.js:1:8885](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
Does the network panel in DevTools contain failed requests relevant to the bug?

Happy to help! You should be able to view runs again by identifying and removing any empty meta.yaml files in your mlruns directory. Unfortunately, runs whose meta.yaml files are empty may be a bit harder to restore, as you would need to reconstruct the contents of meta.yaml; this can be done by copying the contents of a non-empty meta.yaml file and changing the run ID field (start and end times will be incorrect, unfortunately).

This is a very interesting class of failure that we'll make sure to address. Thank you for reporting it to us.
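
The recovery steps described in this reply (finding empty meta.yaml files, then rebuilding one from a healthy run) could be sketched roughly as follows. This is only an illustration under a few assumptions, not an official MLflow tool: the helper names find_empty_meta_files and rebuild_meta_from_template are hypothetical, the store is assumed to use the local file layout mlruns/<experiment_id>/<run_id>/meta.yaml, and the run ID is assumed to appear literally in the healthy file so that a plain text substitution is enough. Neither helper writes or deletes anything; back up the mlruns directory before acting on the results.

```python
from pathlib import Path


def find_empty_meta_files(mlruns_dir):
    """Return every meta.yaml under mlruns_dir that is empty or whitespace-only."""
    root = Path(mlruns_dir)
    return sorted(
        path
        for path in root.rglob("meta.yaml")
        if path.is_file() and not path.read_bytes().strip()
    )


def rebuild_meta_from_template(empty_meta, template_meta):
    """Build replacement text for an empty run meta.yaml from a healthy one.

    Assumption: each run directory is named after its run ID, and that ID
    appears literally in the healthy file. Every occurrence of the template's
    run ID is swapped for the broken run's ID. Start and end times are copied
    unchanged from the template, so they will be wrong for the restored run.
    """
    empty_meta = Path(empty_meta)
    template_meta = Path(template_meta)
    old_run_id = template_meta.parent.name  # directory name of the healthy run
    new_run_id = empty_meta.parent.name     # directory name of the broken run
    text = template_meta.read_text(encoding="utf-8")
    return text.replace(old_run_id, new_run_id)
```

Calling find_empty_meta_files("mlruns") lists the candidate files to inspect. The rebuild helper returns the proposed text instead of writing it, so the result can be reviewed by hand before it is saved over the empty file.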
This might not be what's really happening, but we can reproduce the error with this script:
UI
Tracking server log