[FR] Validate request JSON with JSON schema or similar
Thank you for submitting a feature request. Before proceeding, please review MLflow's Issue Policy for feature requests and the MLflow Contributing Guide.
Please fill in this feature request template to ensure a timely and thorough response.
Willingness to contribute
The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?
- Yes. I can contribute this feature independently.
- Yes. I would be willing to contribute this feature with guidance from the MLflow community.
- No. I cannot contribute this feature at this time.
Proposal Summary
API calls should return HTTP 400 when parameters don't match the expected data types (for example), instead of failing with a 500. Creating a JSON schema for the MLflow REST API (using jsonschema, for example) to check requests against would fix these issues. This would result in far friendlier UX, easier debugging, more predictable responses, and a generally more RESTful API.
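As a rough sketch of the proposal, here is what schema validation could look like with the jsonschema library. The schema below is illustrative only (it is not MLflow's actual request format), and `check_request` is a hypothetical helper name:

```python
# Hypothetical sketch: validate a log-batch-style request body with the
# `jsonschema` library before doing any real work. Schema shape is made up
# for illustration; it is not MLflow's actual wire format.
from jsonschema import validate, ValidationError

LOG_BATCH_SCHEMA = {
    "type": "object",
    "properties": {
        "run_id": {"type": "string"},
        "metrics": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "key": {"type": "string"},
                    "value": {"type": "number"},
                    "timestamp": {"type": "integer"},
                },
                "required": ["key", "value", "timestamp"],
            },
        },
    },
    "required": ["run_id"],
}

def check_request(body):
    """Return None if the body is valid, or an error message for a 400 response."""
    try:
        validate(instance=body, schema=LOG_BATCH_SCHEMA)
        return None
    except ValidationError as e:
        return f"Invalid request: {e.message}"

# A string timestamp is caught up front, with a message naming the problem,
# instead of surfacing later as a 500.
bad = {"run_id": "abc",
       "metrics": [{"key": "loss", "value": 0.5, "timestamp": "2021-12-29"}]}
print(check_request(bad))
```

The key point is that the error string from `ValidationError` already names the offending value and the expected type, which is exactly the information the 500 responses currently hide.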
Motivation
I keep getting 500 errors for things like supplying a parameter of the wrong data type to an API call. See this issue for an example. This has also happened with calls for logging parameters (both individually and in batches) and all kinds of other functions.
Right now, this means that an end user of a running MLflow service gets an error message like this back when something goes wrong:
Response [https://<<host>>/api/2.0/mlflow/runs/log-batch]
Date: 2021-12-29 20:15
Status: 500
Content-Type: text/html; charset=utf-8
Size: 290 B
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
This error was caused by providing a timestamp value to log-batch that was a character string rather than a numeric timestamp.
Obviously, this error is unhelpful: there's no indication of what went wrong or how to fix it. More importantly, a 500 implies that the client did nothing wrong and that there was a legitimate issue on the server side. For bad parameters, that is clearly not the case: the client should see an error message naming the incorrect parameter and the expected type, not a cryptic 500 saying it was "unable to complete your request".
The value proposition here should be relatively obvious, so I won't write much beyond saying that validating requests against a JSON schema would let users of the MLflow REST API (in other words, every MLflow user) more easily and reliably use MLflow, develop wrappers for the MLflow API, debug their code when things go wrong, and so on.
What component(s), interfaces, languages, and integrations does this feature affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interfaces
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Languages
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Details
I haven't written any JSON schema in Python, but in R I know it's easy to set up a function to validate requests, and then use that function to validate the JSON body of any incoming request before doing any actual work. If a request fails the JSON validation checks, you can easily return an HTTP 400 with a message like "JSON validation failed with << some error >>".
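The validate-then-400 pattern described above translates directly to Python. Here is a minimal, framework-agnostic sketch; `handle_metric`, `validate_body`, and the type map are all hypothetical names, and a real implementation would plug into the server framework's request handling:

```python
# Hypothetical validate-then-400 sketch. The field names and expected types
# are illustrative, not MLflow's actual API contract.

EXPECTED_TYPES = {"key": str, "value": float, "timestamp": int}

def validate_body(body, expected):
    """Return a list of human-readable type errors (empty if the body is valid)."""
    errors = []
    for field, typ in expected.items():
        if field not in body:
            errors.append(f"missing required field '{field}'")
        elif not isinstance(body[field], typ):
            errors.append(
                f"field '{field}' expected {typ.__name__}, "
                f"got {type(body[field]).__name__}"
            )
    return errors

def handle_metric(body):
    """Validate first; only do the real work if the body passes."""
    errors = validate_body(body, EXPECTED_TYPES)
    if errors:
        return 400, "JSON validation failed: " + "; ".join(errors)
    return 200, "OK"

# A string timestamp now yields a 400 with a message naming the bad field,
# rather than a 500 from deep inside the handler.
status, msg = handle_metric({"key": "loss", "value": 0.5, "timestamp": "oops"})
```

Because validation runs before any business logic, the handler never sees malformed input, and every error message identifies the specific field and the expected type.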
Let me know if I can help with this improvement! I think it'd be a major step forward for everyone using MLflow and for the project in general.
Issue Analytics
- Created 2 years ago
- Comments: 6 (3 by maintainers)
@mrkaye97 That sounds great!
Hi @mrkaye97, yes, you can use the result of ParseDict for validation. To ensure that validation is applied across the various handlers, I'd recommend adding it to _get_request_message(). You can create a mapping from each message type to its associated validation function; when _get_request_message() is called, it should resolve the output of ParseDict to the appropriate validation function and invoke it. If ParseDict fails, we should return a 400 (not sure if we're doing this already). Thank you for taking this on!
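The dispatch the maintainer describes could be sketched as follows. Everything here is a stand-in: `LogBatch` mimics a generated protobuf message class, and `get_request_message` mimics the role of MLflow's internal `_get_request_message()`, not its actual signature:

```python
# Illustrative sketch of per-message-type validation dispatch. LogBatch and
# the validator are stand-ins, not MLflow internals or generated protobufs.

class LogBatch:
    """Stand-in for a parsed request message (e.g. the output of ParseDict)."""
    def __init__(self, timestamp):
        self.timestamp = timestamp

def _validate_log_batch(msg):
    if not isinstance(msg.timestamp, int):
        raise ValueError(f"timestamp must be an integer, got {msg.timestamp!r}")

# One validator per message type, resolved by the type of the parsed message.
_VALIDATORS = {LogBatch: _validate_log_batch}

def get_request_message(msg):
    """Mimics the proposed flow: parse, look up the validator, invoke it,
    and surface validation failures as a 400 instead of a 500."""
    validator = _VALIDATORS.get(type(msg))
    if validator is not None:
        try:
            validator(msg)
        except ValueError as e:
            return 400, str(e)
    return 200, msg
```

Message types without an entry in the mapping pass through unchanged, so validators can be added incrementally, one handler at a time.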