Slow, expensive metadata endpoint
What happens
When a large workflow is queried for metadata, Cromwell spends a considerable amount of time preparing the response. This usually results in a timeout for the caller. In some cases, the preparation is so expensive that Cromwell either runs out of memory or enters a zombie-like state (#4105).
What should happen
The caller should receive a timely response, and Cromwell should not be endangered by operations on large workflows.
Speculation: Construction of result
The result is constructed in a two-phase manner: gather all the data, then produce a structured response (see the schematic after the list below).
This is done for two reasons:
- Unstructured metadata is difficult for a human to understand.
- There are possibly many duplicates due to the way restarts are handled.
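Schematically, the two-phase shape looks something like the following (all names hypothetical, not Cromwell's actual code). The cost is that phase one must hold every row in memory before phase two begins:

```scala
object TwoPhase {
  final case class Row(key: String, value: String)

  // Phase 1: gather every metadata row for the workflow into memory.
  def gather(workflowId: String): Seq[Row] =
    Seq(Row("calls.hello.start", "t0")) // stand-in for the actual DB read

  // Phase 2: fold the fully gathered rows into one structured response.
  def structure(rows: Seq[Row]): String =
    rows.map(r => s""""${r.key}": "${r.value}"""").mkString("{", ", ", "}")

  def metadata(workflowId: String): String = structure(gather(workflowId))
}
```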
Recommendation
~~Stream results (using the doobie SQL library?) and construct the response while gathering data. This should mean that a large pool of data is never present in memory, only the current result set and the partial response.~~
Not streaming for now. Instead, going to `foldMap` the large sequence into a `Map` monoid, then combine all those maps together into a final result.
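A minimal sketch of that approach, assuming cats and a hypothetical flattened `Row` shape (the real metadata rows carry more fields):

```scala
import cats.Semigroup
import cats.implicits._

object MetadataFold {
  final case class Row(key: String, value: String, timestamp: Long)

  // When two rows share a key (e.g. duplicates written across restarts),
  // keep the most recent one.
  implicit val latestWins: Semigroup[Row] =
    Semigroup.instance((a, b) => if (a.timestamp >= b.timestamp) a else b)

  // foldMap sends each row to a one-entry Map and combines them with the
  // Map monoid, which resolves key collisions via latestWins.
  def combine(rows: List[Row]): Map[String, Row] =
    rows.foldMap(r => Map(r.key -> r))
}
```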
There is some manipulation to be done after combining the maps (see the sketch after this list):
- Sort calls by time
- Prune duplicates by taking the most recent. This has some special cases that need to be considered.
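A sketch of that pass, with a hypothetical `Call` shape; the real pruning rules have special cases this glosses over:

```scala
final case class Call(name: String, attempt: Int, startTime: Long)

def tidy(calls: List[Call]): List[Call] =
  calls
    .groupBy(c => (c.name, c.attempt)) // duplicate copies from restarts
    .values
    .map(_.maxBy(_.startTime))         // prune: take the most recent copy
    .toList
    .sortBy(_.startTime)               // sort calls by time
```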
Speculation: Database table
The metadata table is currently an unindexed monster, comprising 10^6 to 10^9 rows and 2 to 3 TB of data. The query has historically been surprisingly performant, but it is likely to degrade over time.
Recommendation
Punt on DB changes for now.
Top GitHub Comments
FWIW Slick also supports streaming
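For posterity, a minimal sketch of Slick streaming, assuming a hypothetical table mapping (names are illustrative, not Cromwell's actual schema):

```scala
import slick.jdbc.MySQLProfile.api._

object StreamingSketch {
  class Metadata(tag: Tag) extends Table[(String, String)](tag, "METADATA_ENTRY") {
    def key   = column[String]("METADATA_KEY")
    def value = column[String]("METADATA_VALUE")
    def *     = (key, value)
  }
  val metadata = TableQuery[Metadata]

  val db = Database.forConfig("database") // hypothetical config path

  // db.stream returns a Reactive Streams DatabasePublisher that emits rows
  // incrementally; the fetchSize hint matters, or the JDBC driver may still
  // buffer the whole result client-side.
  val publisher = db.stream(
    metadata.result.withStatementParameters(fetchSize = 1000)
  )

  // Consume with backpressure-aware tooling (e.g. Akka Streams); foreach is
  // just the simplest consumer.
  val done = publisher.foreach { case (k, v) => println(s"$k -> $v") }
}
```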
You’ve done a bunch of investigation on this already, but adding for posterity: the metadata endpoint is particularly vulnerable to the joint-calling use case. In this use case (and in similar workflows), calls can scatter widely, and each call can have many inputs, each of which can be a substantial value. So, calls * scatter width * inputs * value length makes for a lot of data.
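As a purely illustrative back-of-envelope (all numbers hypothetical): 1,000 calls * a 100-wide scatter * 10 inputs * 1 KB per value is already 10^6 values and roughly 1 GB of metadata to gather and serialize in a single response.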