[BUG] Internal Server Error on delete


Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.19.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
  • Python version: 3.9
  • yarn version, if running the dev UI:

Describe the problem

After using MLflow for about a year, we’ve run into this difficult-to-reproduce bug several times: an experiment tab shows an INTERNAL_SERVER_ERROR and stops displaying the runs that were stored in that experiment.

The issue seems to come up when deleting one or more runs from an experiment, but not on every delete. When it does fail, it usually happens on the first delete attempt in an experiment: the delete command hangs longer than usual and then returns the INTERNAL_SERVER_ERROR.

Screen Shot 2022-07-26 at 4 30 40 PM

Besides a full fix, I would also be interested to learn if it’s possible to retrieve the runs that were stored in the broken experiment.

Steps to reproduce the bug

This is inconsistently reproducible (and I apologize, I know that’s not very helpful). I logged 50-100 runs in a new experiment, each with 5 metrics of ~200 values, and then tried deleting a random number of runs, sometimes one and sometimes the whole page. Occasionally the experiment broke, but most of the time the delete executed without a problem.

Code to generate data required to reproduce the bug

import random

import mlflow

MLFLOW_EXPERIMENT_NAME = "break server"

# Create the experiment on first use, then make it the active one.
if not mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME):
    mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

# The context manager ends the run even if logging raises.
with mlflow.start_run():
    mlflow.log_param("number", 1)

    # Log 200 values for each of the 5 metrics.
    randomlist = random.sample(range(10, 500), 200)
    for x in randomlist:
        mlflow.log_metric("rand1", x)
        mlflow.log_metric("rand2", x)
        mlflow.log_metric("rand3", x)
        mlflow.log_metric("rand4", x)
        mlflow.log_metric("rand5", x)

I then ran this 50-100 times to fill the experiment with runs to delete.

Is the console panel in DevTools showing errors relevant to the bug?

Screen Shot 2022-07-26 at 5 19 29 PM
Error: Promised response from onMessage listener went out of scope 5 [background.js:841:170](moz-extension://d6a7add6-de11-4239-898f-0a287b30c3e7/dist/background.js)
XHR failed
Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), setRequestHeader: setRequestHeader(e, t), overrideMimeType: overrideMimeType(e), statusCode: statusCode(e), abort: abort(e), state: state(), always: always(), catch: catch(e), … }
  abort: function abort(e)
  always: function always()
  catch: function catch(e)
  done: function add()
  fail: function add()
  getAllResponseHeaders: function getAllResponseHeaders()
  getResponseHeader: function getResponseHeader(e)
  overrideMimeType: function overrideMimeType(e)
  pipe: function pipe()
  progress: function add()
  promise: function promise(e)
  readyState: 4
  responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
  setRequestHeader: function setRequestHeader(e, t)
  state: function state()
  status: 500
  statusCode: function statusCode(e)
  statusText: "Internal Server Error"
  then: function then(e, r, o)
  <prototype>: Object { … }
[main.895f3836.chunk.js:1:15852](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    error https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    l https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    fireWith https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    S https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
    t https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/2.8c754ded.chunk.js:1
Object { xhr: {…} }
  xhr: Object { readyState: 4, getResponseHeader: getResponseHeader(e), getAllResponseHeaders: getAllResponseHeaders(), … }
    abort: function abort(e)
    always: function always()
    catch: function catch(e)
    done: function add()
    fail: function add()
    getAllResponseHeaders: function getAllResponseHeaders()
    getResponseHeader: function getResponseHeader(e)
    overrideMimeType: function overrideMimeType(e)
    pipe: function pipe()
    progress: function add()
    promise: function promise(e)
    readyState: 4
    responseText: "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>\n"
    setRequestHeader: function setRequestHeader(e, t)
    state: function state()
    status: 500
    statusCode: function statusCode(e)
    statusText: "Internal Server Error"
    then: function then(e, r, o)
    <prototype>: Object { … }
  <prototype>: Object { … }
[main.895f3836.chunk.js:1:8885](https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js)
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1
    value https://mlwf-devel-server.mlflow.pnl.gov/static-files/static/js/main.895f3836.chunk.js:1

Does the network panel in DevTools contain failed requests relevant to the bug?

Screen Shot 2022-07-26 at 5 24 17 PM

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

dbczumar commented, Jul 29, 2022 (2 reactions)

Happy to help! You should be able to view runs again by identifying and removing any empty meta.yaml files in your mlruns directory. Unfortunately, runs whose meta.yaml files are empty may be a bit harder to restore, as you would need to reconstruct the contents of meta.yaml; this can be done by copying the contents of a non-empty meta.yaml file and changing the run ID field (start and end times will be incorrect, unfortunately).

This is a very interesting class of failure that we’ll make sure to address. Thank you for reporting it to us.
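The recovery steps described in this comment could be scripted roughly as follows. This is a minimal sketch, not an MLflow utility: `find_empty_meta_files` and `rebuild_meta_from_template` are hypothetical helper names, and the sketch assumes that a run's directory name is its run ID (as in a path like mlruns/0/<run_id>/meta.yaml), so that replacing the template's run ID string updates every field that embeds it. Back up the mlruns directory before changing anything.

```python
from pathlib import Path


def find_empty_meta_files(mlruns_dir):
    """Return every meta.yaml under mlruns_dir that is empty or whitespace-only."""
    return [
        meta
        for meta in sorted(Path(mlruns_dir).rglob("meta.yaml"))
        if not meta.read_text().strip()
    ]


def rebuild_meta_from_template(empty_meta, template_meta):
    """Build replacement text for empty_meta from a healthy run's meta.yaml.

    Assumption: the parent directory name of each meta.yaml is the run ID.
    Every occurrence of the template's run ID is swapped for the broken
    run's ID. Start and end times are copied from the template, so they
    will be wrong for the restored run.
    """
    old_run_id = Path(template_meta).parent.name
    new_run_id = Path(empty_meta).parent.name
    return Path(template_meta).read_text().replace(old_run_id, new_run_id)
```

Calling `find_empty_meta_files("mlruns")` lists the candidates. The search is recursive, so an empty experiment-level meta.yaml would be listed too and should not be rebuilt from a run template. The text returned by `rebuild_meta_from_template` is only a starting point and should be reviewed before it is written back.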

harupy commented, Jul 28, 2022 (1 reaction)

This might not be what’s really happening, but we can reproduce the error with this script:

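# Before running this script: log some runs using the file store and launch MLflow UI

import time

path = "mlruns/0/f18249d1f76d445f8324550a74c0d07b/meta.yaml"

# Save the current contents so they can be written back afterwards.
with open(path, "r") as f:
    content = f.read()

# Opening in "w" mode truncates meta.yaml immediately, so the file stays
# empty for the 10 seconds before the contents are restored.
with open(path, "w") as f:
    print("Fetch some runs on MLflow UI")
    time.sleep(10)
    f.write(content)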

UI

image

Tracking server log

2022/07/28 21:36:08 ERROR mlflow.server: Exception on /ajax-api/2.0/preview/mlflow/runs/search [POST]
Traceback (most recent call last):
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/haru/miniconda3/envs/mlflow-dev-env/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 454, in wrapper
    return func(*args, **kwargs)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 512, in wrapper
    return func(*args, **kwargs)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/server/handlers.py", line 858, in _search_runs
    experiment_ids, filter_string, run_view_type, max_results, order_by, page_token
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/abstract_store.py", line 294, in search_runs
    experiment_ids, filter_string, run_view_type, max_results, order_by, page_token
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 809, in _search_runs
    run_infos = self._list_run_infos(experiment_id, run_view_type)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 777, in _list_run_infos
    run_info = self._get_run_info_from_dir(r_dir)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 610, in _get_run_info_from_dir
    run_info = _read_persisted_run_info_dict(meta)
  File "/home/haru/Desktop/repositories/mlflow/mlflow/store/tracking/file_store.py", line 112, in _read_persisted_run_info_dict
    dict_copy = run_info_dict.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
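
The last frame of the traceback shows `run_info_dict` being `None`, which is what a YAML parser returns for an empty document. The reproduction script gets there by truncating meta.yaml with `open(path, "w")` while the server is reading it. One common way to avoid exposing a truncated file to readers is to write to a temporary file and rename it into place. The sketch below shows that pattern using only the standard library; `write_atomically` is a hypothetical helper, not an MLflow function, and this is not necessarily how MLflow addressed the issue.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so readers see the old or the new content, never an empty file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # os.replace swaps the file in a single step (atomic on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach the target file is never opened in "w" mode, so a concurrent reader cannot observe the truncated state that the reproduction script creates.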