500 encountered from metaflow service

@russellbrooks reported the following issue

Metaflow service error:
Metadata request (/flows/TuningXGB/runs/572/steps/start/tasks/1821/metadata) failed (code 500): {"message":"Internal server error"}

For context, this occurred in a hyperparameter tuning framework (each flow run is one model training evaluation) after about 6 hours, with 125 runs completed successfully. Everything is executed on Batch: 12 variants are trained in parallel, and their results are fed back to a Bayesian optimizer that generates the next set of variants.

The CloudWatch logs from Batch indicate that the step completed successfully. The Metaflow service error was raised on a separate on-demand EC2 instance that runs the optimizer and launches the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
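
Since the error surfaced in the process that launches the flows rather than inside a Batch step, one possible stopgap is to retry at the launcher level. The following is a minimal sketch, not code from the original report: the function name `run_with_retries` and its `attempts` and `delay` parameters are hypothetical, and it assumes a failed flow run is signalled by a non-zero exit code.

```python
import asyncio


async def run_with_retries(cmd: str, attempts: int = 3, delay: float = 1.0) -> int:
    """Run a shell command, relaunching it when it exits non-zero.

    Returns the exit code of the last attempt (0 on success).
    Hypothetical helper for illustration; not part of Metaflow.
    """
    returncode = -1
    for attempt in range(1, attempts + 1):
        proc = await asyncio.create_subprocess_shell(
            cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,  # merge stderr into stdout
        )
        output, _ = await proc.communicate()
        returncode = proc.returncode
        if returncode == 0:
            return 0
        # Show the tail of the output so the failure reason is visible.
        print(f"attempt {attempt}/{attempts} exited with code {returncode}")
        print(output.decode(errors="replace")[-500:])
        if attempt < attempts:
            # Exponential backoff between launches: delay, 2*delay, 4*delay, ...
            await asyncio.sleep(delay * 2 ** (attempt - 1))
    return returncode
```

A caller would use something like `asyncio.run(run_with_retries(flow_cmd))`, where `flow_cmd` stands for whatever command line starts the flow. Note that relaunching starts a fresh run, so this only makes sense when repeating an evaluation is acceptable.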

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 20 (4 by maintainers)

Top GitHub Comments

2 reactions
savingoyal commented, May 5, 2021

Thanks for the extra data points. I am actively looking into this issue.

1 reaction
savingoyal commented, May 29, 2021

I worked with @dpatschke to triage this issue (many thanks!). It looks like AWS Batch fails to send the correct response to some of the describe_jobs API calls we make to ascertain job status, even though the response code is still 200. Upgrading to Metaflow 2.3.0 should address this issue; see https://github.com/Netflix/metaflow/pull/543.
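
For readers who poll Batch themselves, the failure mode described above suggests checking the response body and not only the status code. The sketch below is an illustration under assumptions, not the change made in the linked pull request: it assumes the bad response is one where the requested job is missing from the `jobs` list, and the helper name `describe_job_with_retry` and its parameters are hypothetical. The `describe_jobs` argument is expected to behave like the boto3 Batch client method of the same name, which takes `jobs=[...]` and returns a dict with a `jobs` list.

```python
import time
from typing import Callable


def describe_job_with_retry(
    describe_jobs: Callable[..., dict],
    job_id: str,
    attempts: int = 5,
    delay: float = 1.0,
) -> dict:
    """Return the description of job_id, retrying when it is absent.

    Hypothetical helper: treats a response without the requested job as
    transient (an assumption) and polls again with exponential backoff.
    """
    for attempt in range(attempts):
        response = describe_jobs(jobs=[job_id])
        for job in response.get("jobs", []):
            if job.get("jobId") == job_id:
                return job
        if attempt < attempts - 1:
            time.sleep(delay * 2 ** attempt)
    raise RuntimeError(f"job {job_id} missing from describe_jobs response")
```

With boto3 this could be called as `describe_job_with_retry(boto3.client("batch").describe_jobs, job_id)`. Passing the method in as a callable keeps the helper testable without AWS credentials.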
