[BUG] Internal Server Error on delete

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.19.0

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
Python version: 3.9
yarn version, if running the dev UI:

Describe the problem

After using MLflow for ~1 year, we’ve run into this difficult-to-reproduce bug several times: an experiment tab shows an INTERNAL_SERVER_ERROR and stops displaying the runs that were stored in that experiment.

This issue seems to come up when deleting one or more runs from an experiment, but it doesn’t happen on every delete. When it does happen, it is usually on the first delete attempt in an experiment. The delete command hangs longer than usual, then the INTERNAL_SERVER_ERROR appears.

Besides a full fix, I would also be interested to learn if it’s possible to retrieve the runs that were stored in the broken experiment.

Steps to reproduce the bug

This is inconsistently reproducible (and I apologize, I know that’s not very helpful), but I logged 50-100 runs in a new experiment. Each run had 5 metrics with ~200 values. I then tried deleting a random number of runs: sometimes a single run, sometimes the whole page. Sometimes the experiment broke, but most of the time the delete executed without a problem.

Code to generate data required to reproduce the bug

import random

import mlflow

MLFLOW_EXPERIMENT_NAME = "break server"

# Create the experiment on first use, then make it the active experiment.
if not mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME):
    mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

# Log one run with a single parameter and 5 metrics of 200 values each.
with mlflow.start_run():
    mlflow.log_param("number", 1)

    randomlist = random.sample(range(10, 500), 200)
    for x in randomlist:
        mlflow.log_metric("rand1", x)
        mlflow.log_metric("rand2", x)
        mlflow.log_metric("rand3", x)
        mlflow.log_metric("rand4", x)
        mlflow.log_metric("rand5", x)

I then ran this 50-100 times to fill the experiment with runs to delete.

Is the console panel in DevTools showing errors relevant to the bug?

Error: Promised response from onMessage listener went out of scope 5 [background.js:841:170](moz-extension://d6a7add6-de11-4239-898f-0a287b30c3e7/dist/background.js)
XHR failed 
Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), setRequestHeader: setRequestHeader(e, t), overrideMimeType: overrideMimeType(e), statusCode: statusCode(e), abort: abort(e), state: state(), always: always(), catch: catch(e), … }

abort: function abort(e)
always: function always()
catch: function catch(e)
done: function add()
fail: function add()
getAllResponseHeaders: function getAllResponseHeaders()
getResponseHeader: function getResponseHeader(e)
overrideMimeType: function overrideMimeType(e)
pipe: function pipe()
progress: function add()
promise: function promise(e)

readyState: 4

responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"

setRequestHeader: function setRequestHeader(e, t)
state: function state()

status: 500

statusCode: function statusCode(e)

statusText: "Internal Server Error"

then: function then(e, r, o)
<prototype>: Object { … }
[main.895f3836.chunk.js:1:15852](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    error https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    l https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    fireWith https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    S https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    t https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
Object { xhr: {…} }

xhr: Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), … }

abort: function abort(e)
always: function always()
catch: function catch(e)
done: function add()
fail: function add()
getAllResponseHeaders: function getAllResponseHeaders()
getResponseHeader: function getResponseHeader(e)
overrideMimeType: function overrideMimeType(e)
pipe: function pipe()
progress: function add()
promise: function promise(e)

readyState: 4

responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"

setRequestHeader: function setRequestHeader(e, t)
state: function state()

status: 500

statusCode: function statusCode(e)

statusText: "Internal Server Error"

then: function then(e, r, o)
<prototype>: Object { … }

<prototype>: Object { … }
[main.895f3836.chunk.js:1:8885](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1

Does the network panel in DevTools contain failed requests relevant to the bug?

Top GitHub Comments

dbczumar commented, Jul 29, 2022

Happy to help! You should be able to view runs again by identifying and removing any empty meta.yaml files in your mlruns directory. Unfortunately, runs whose meta.yaml files are empty may be a bit harder to restore, as you would need to reconstruct the contents of meta.yaml; this can be done by copying the contents of a non-empty meta.yaml file and changing the run ID field (start and end times will be incorrect, unfortunately).

This is a very interesting class of failure that we’ll make sure to address. Thank you for reporting it to us.
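To locate the empty meta.yaml files mentioned in the comment above, something like the following might help. This is only a sketch using the standard library: the function name `find_empty_meta_files` is made up for illustration, and it assumes a file-store layout where each experiment and run directory under the tracking directory holds its own meta.yaml. It only lists candidates and does not modify or delete anything.

```python
from pathlib import Path


def find_empty_meta_files(mlruns_dir):
    """Return every meta.yaml under mlruns_dir that is empty or whitespace-only.

    Hypothetical helper, not part of MLflow. It only reports paths and
    never changes anything on disk.
    """
    empty = []
    for meta in sorted(Path(mlruns_dir).rglob("meta.yaml")):
        # A file that was truncated mid-write has no content at all.
        if meta.is_file() and not meta.read_text(encoding="utf-8").strip():
            empty.append(meta)
    return empty
```

Calling `find_empty_meta_files("mlruns")` from the directory where the tracking server was started would list the suspect files. Backing up the directory before changing anything is probably wise.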

harupy commented, Jul 28, 2022

This might not be what’s really happening but we can reproduce the error with this script:

# Before running this script: log some runs using the file store and launch MLflow UI

import time

path = "mlruns/0/f18249d1f76d445f8324550a74c0d07b/meta.yaml"

with open(path, "r") as f:
    a = f.read()

with open(path, "w") as f:
    print("Fetch some runs on MLflow UI")
    time.sleep(10)
    f.write(a)
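The script above works because opening a file in "w" mode truncates it immediately, so anything that reads the file before the write completes sees an empty file. An empty YAML document parses to None (for example, PyYAML's `yaml.safe_load("")` returns None). The sketch below shows the truncation window on a throwaway temporary file rather than a real run directory; the file name and contents are placeholders.

```python
import os
import tempfile

# Create a throwaway file that stands in for a run's meta.yaml.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "meta.yaml")
with open(path, "w") as f:
    f.write("run_id: example\n")

with open(path, "r") as f:
    original = f.read()

# Opening in "w" mode truncates the file right away, before any write happens.
with open(path, "w") as writer:
    # This read stands in for a concurrent request from the UI.
    with open(path, "r") as reader:
        seen_during_write = reader.read()
    writer.write(original)

# Once the writer has closed the file, the content is back.
with open(path, "r") as f:
    seen_after_write = f.read()

print(repr(seen_during_write))
print(repr(seen_after_write))
```

The first read returns an empty string even though the file ends up with its original content, which would match an error that shows up only occasionally.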

UI

Tracking server log

2022/07/28 21:36:08 ERROR mlflow.server: Exception on /ajax-api/2.0/preview/mlflow/runs/search [POST]
Traceback (most recent call last):
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 454, in wrapper
    return func(*args, **kwargs)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 512, in wrapper
    return func(*args, **kwargs)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 858, in _search_runs
    experiment_ids, filter_string, run_view_type, max_results, order_by, page_token
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/abstract_store.py", line 294, in search_runs
    experiment_ids, filter_string, run_view_type, max_results, order_by, page_token
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 809, in _search_runs
    run_infos = self._list_run_infos(experiment_id, run_view_type)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 777, in _list_run_infos
    run_info = self._get_run_info_from_dir(r_dir)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 610, in _get_run_info_from_dir
    run_info = _read_persisted_run_info_dict(meta)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 112, in _read_persisted_run_info_dict
    dict_copy = run_info_dict.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
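The last frame fails because the parsed content of meta.yaml is None, so the `.copy()` call has nothing to operate on, and one unreadable run takes down the whole search request. A minimal sketch of a more tolerant listing step is shown below. This is not MLflow's code: `copy_run_info` and `list_valid_run_infos` are hypothetical names, and the input is assumed to be a mapping from run directory name to already-parsed meta.yaml content.

```python
import logging

logger = logging.getLogger(__name__)


def copy_run_info(run_info_dict):
    """Hypothetical stand-in for the step that fails in the traceback."""
    if run_info_dict is None:
        # An empty meta.yaml parses to None; fail with a descriptive error.
        raise ValueError("meta.yaml is empty or could not be parsed")
    return run_info_dict.copy()


def list_valid_run_infos(parsed_metas):
    """Copy each parsed meta.yaml mapping, skipping runs that parsed to None.

    parsed_metas maps a run directory name to the parsed content of its
    meta.yaml (a dict, or None when the file was empty).
    """
    run_infos = {}
    for run_dir, meta in parsed_metas.items():
        try:
            run_infos[run_dir] = copy_run_info(meta)
        except ValueError as exc:
            # Skip the malformed run instead of failing the whole listing.
            logger.warning("Skipping run %s: %s", run_dir, exc)
    return run_infos
```

With this shape, a single truncated meta.yaml would produce a warning in the server log while the remaining runs in the experiment stay visible.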