500 encountered from metaflow service
See original GitHub issue

@russellbrooks reported the following issue:
Metaflow service error:
Metadata request (/flows/TuningXGB/runs/572/steps/start/tasks/1821/metadata) failed (code 500): {"message":"Internal server error"}
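A 500 like the one above is often transient, so one client-side mitigation is to retry the metadata request with backoff instead of failing immediately. The sketch below is a hypothetical illustration (the function name and retry policy are not from Metaflow itself); `get` stands in for any HTTP getter such as `requests.get`.

```python
import time

def fetch_with_retry(get, url, attempts=4, base_delay=0.5):
    # `get` is any callable returning an object with a .status_code
    # attribute (e.g. requests.get). Retry transient 5xx responses with
    # exponential backoff; return the first non-5xx response, or the
    # last response if every attempt failed.
    resp = None
    for i in range(attempts):
        resp = get(url)
        if resp.status_code < 500:
            return resp
        time.sleep(base_delay * (2 ** i))
    return resp
```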
For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model training evaluation) after ~6 hours with 125 runs successfully completed. Everything is executed on Batch with 12 variants being trained in parallel, then feeding those results back to a Bayesian optimizer to generate the next batch.
The CloudWatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that's running the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
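The orchestration pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the reporter's code: the flow file name `tuning_flow.py` and the `--lr` parameter are placeholders.

```python
import asyncio

async def run_variant(cmd: str) -> int:
    # Launch one flow invocation as a shell subprocess and return its
    # exit code once it finishes.
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    await proc.communicate()
    return proc.returncode

async def run_generation(param_sets):
    # One optimizer generation: launch all variants concurrently and
    # collect every exit code before proposing the next batch of
    # parameters. Flow name and parameter flag are hypothetical.
    cmds = [
        f"python tuning_flow.py run --with retry --lr {p['lr']}"
        for p in param_sets
    ]
    return await asyncio.gather(*(run_variant(c) for c in cmds))
```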
Issue Analytics
- State:
- Created 4 years ago
- Comments: 20 (4 by maintainers)
Thanks for the extra data points. I am actively looking into this issue.
I worked with @dpatschke to triage this issue (many thanks!). It looks like AWS Batch sometimes refuses to send the correct response to the `describe_jobs` API call we make to ascertain job status (the response code is still 200). Upgrading to Metaflow 2.3.0 should address this issue: https://github.com/Netflix/metaflow/pull/543.
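The gist of the failure mode is that `DescribeJobs` can return HTTP 200 yet omit some of the requested jobs. A defensive poller can treat a missing entry as transient and re-ask rather than hard-failing. This sketch mirrors the idea behind the fix, not the actual code in the linked PR; `describe` stands in for boto3's `batch_client.describe_jobs`.

```python
def missing_jobs(requested_ids, response):
    # Job ids the DescribeJobs response failed to include, even though
    # the HTTP status was 200.
    returned = {j["jobId"] for j in response.get("jobs", [])}
    return [jid for jid in requested_ids if jid not in returned]

def poll_job_statuses(describe, job_ids, max_attempts=3):
    # `describe` is any callable shaped like boto3's
    # batch.describe_jobs(jobs=[...]) -> {"jobs": [...]}.
    # Accumulate job entries across attempts until every requested id
    # has been seen, or give up after max_attempts.
    jobs = {}
    for _ in range(max_attempts):
        resp = describe(jobs=job_ids)
        for j in resp.get("jobs", []):
            jobs[j["jobId"]] = j
        if not missing_jobs(job_ids, {"jobs": list(jobs.values())}):
            break
    return jobs
```

The caller can then decide how to treat any ids still absent after the final attempt, instead of crashing on the first incomplete response.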