500 encountered from metaflow service
See original GitHub issue

@russellbrooks reported the following issue:
Metaflow service error:
Metadata request (/flows/TuningXGB/runs/572/steps/start/tasks/1821/metadata) failed (code 500): {"message":"Internal server error"}
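A 500 like the one above is often transient, so one client-side mitigation is to retry the metadata request with backoff instead of failing immediately. The sketch below is a hypothetical illustration (the function name and retry policy are not from Metaflow itself); `get` stands in for any HTTP getter such as `requests.get`.

```python
import time

def fetch_with_retry(get, url, attempts=4, base_delay=0.5):
    # `get` is any callable returning an object with a .status_code
    # attribute (e.g. requests.get). Retry transient 5xx responses with
    # exponential backoff; return the first non-5xx response, or the
    # last response if every attempt failed.
    resp = None
    for i in range(attempts):
        resp = get(url)
        if resp.status_code < 500:
            return resp
        time.sleep(base_delay * (2 ** i))
    return resp
```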
For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model training evaluation) after ~6 hours with 125 runs successfully completed. Everything is executed on Batch with 12 variants being trained in parallel, then feeding those results back to a Bayesian optimizer to generate the next batch.
The CloudWatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that's running the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
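The orchestration pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the reporter's code: the flow file name `tuning_flow.py` and the `--lr` parameter are placeholders.

```python
import asyncio

async def run_variant(cmd: str) -> int:
    # Launch one flow invocation as a shell subprocess and return its
    # exit code once it finishes.
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    await proc.communicate()
    return proc.returncode

async def run_generation(param_sets):
    # One optimizer generation: launch all variants concurrently and
    # collect every exit code before proposing the next batch of
    # parameters. Flow name and parameter flag are hypothetical.
    cmds = [
        f"python tuning_flow.py run --with retry --lr {p['lr']}"
        for p in param_sets
    ]
    return await asyncio.gather(*(run_variant(c) for c in cmds))
```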
Issue Analytics
- State:
- Created 4 years ago
- Comments: 20 (4 by maintainers)
Thanks for the extra data points. I am actively looking into this issue.
I worked with @dpatschke to triage this issue (many thanks!). It looks like AWS Batch sometimes refuses to send the correct response to the `describe_jobs` API call we make to ascertain job status (the response code is still 200). Upgrading to Metaflow 2.3.0 should address this issue: https://github.com/Netflix/metaflow/pull/543.
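The gist of the failure mode is that `DescribeJobs` can return HTTP 200 yet omit some of the requested jobs. A defensive poller can treat a missing entry as transient and re-ask rather than hard-failing. This sketch mirrors the idea behind the fix, not the actual code in the linked PR; `describe` stands in for boto3's `batch_client.describe_jobs`.

```python
def missing_jobs(requested_ids, response):
    # Job ids the DescribeJobs response failed to include, even though
    # the HTTP status was 200.
    returned = {j["jobId"] for j in response.get("jobs", [])}
    return [jid for jid in requested_ids if jid not in returned]

def poll_job_statuses(describe, job_ids, max_attempts=3):
    # `describe` is any callable shaped like boto3's
    # batch.describe_jobs(jobs=[...]) -> {"jobs": [...]}.
    # Accumulate job entries across attempts until every requested id
    # has been seen, or give up after max_attempts.
    jobs = {}
    for _ in range(max_attempts):
        resp = describe(jobs=job_ids)
        for j in resp.get("jobs", []):
            jobs[j["jobId"]] = j
        if not missing_jobs(job_ids, {"jobs": list(jobs.values())}):
            break
    return jobs
```

The caller can then decide how to treat any ids still absent after the final attempt, instead of crashing on the first incomplete response.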