Slow, expensive metadata endpoint
What happens
When a large workflow is queried for metadata, Cromwell spends a considerable amount of time preparing the response. This usually results in a timeout for the caller. In some cases, the preparation is so expensive that Cromwell either runs out of memory or enters a zombie-like state (#4105).
What should happen
The caller should receive a timely response, and Cromwell should not be endangered by operations on large workflows.
Speculation: Construction of result
The result is constructed in a two-phase manner: gather all the data, then produce a structured response (see the schematic after the list below).
This is done for two reasons:
- Unstructured metadata is difficult for a human to understand.
- There are possibly many duplicates due to the way restarts are handled.
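Schematically, the two-phase shape looks something like the following (all names hypothetical, not Cromwell's actual code). The cost is that phase one must hold every row in memory before phase two begins:

```scala
object TwoPhase {
  final case class Row(key: String, value: String)

  // Phase 1: gather every metadata row for the workflow into memory.
  def gather(workflowId: String): Seq[Row] =
    Seq(Row("calls.hello.start", "t0")) // stand-in for the actual DB read

  // Phase 2: fold the fully gathered rows into one structured response.
  def structure(rows: Seq[Row]): String =
    rows.map(r => s""""${r.key}": "${r.value}"""").mkString("{", ", ", "}")

  def metadata(workflowId: String): String = structure(gather(workflowId))
}
```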
Recommendation
~~Stream results (using the doobie SQL library?) and construct the response while gathering data. This should mean that a large pool of data is never present in memory, only the current result set and the partial response.~~
Not streaming for now. Instead, going to `foldMap` the large sequence into a `Map` monoid, then combine all those maps together into a final result.
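A minimal sketch of that approach, assuming cats and a hypothetical flattened `Row` shape (the real metadata rows carry more fields):

```scala
import cats.Semigroup
import cats.implicits._

object MetadataFold {
  final case class Row(key: String, value: String, timestamp: Long)

  // When two rows share a key (e.g. duplicates written across restarts),
  // keep the most recent one.
  implicit val latestWins: Semigroup[Row] =
    Semigroup.instance((a, b) => if (a.timestamp >= b.timestamp) a else b)

  // foldMap sends each row to a one-entry Map and combines them with the
  // Map monoid, which resolves key collisions via latestWins.
  def combine(rows: List[Row]): Map[String, Row] =
    rows.foldMap(r => Map(r.key -> r))
}
```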
There is some manipulation to be done after combining the maps (see the sketch after this list):
- Sort calls by time
- Prune duplicates by taking the most recent. This has some special cases that need to be considered.
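A sketch of that pass, with a hypothetical `Call` shape; the real pruning rules have special cases this glosses over:

```scala
final case class Call(name: String, attempt: Int, startTime: Long)

def tidy(calls: List[Call]): List[Call] =
  calls
    .groupBy(c => (c.name, c.attempt)) // duplicate copies from restarts
    .values
    .map(_.maxBy(_.startTime))         // prune: take the most recent copy
    .toList
    .sortBy(_.startTime)               // sort calls by time
```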
Speculation: Database table
The metadata table is currently an unindexed monster, comprising 10^6 to 10^9 rows and 2 to 3 TB of data. The query has historically been surprisingly performant, but it is likely to degrade over time.
Recommendation
Punt on DB changes for now.
Top GitHub Comments
FWIW Slick also supports streaming
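For posterity, a minimal sketch of Slick streaming, assuming a hypothetical table mapping (names are illustrative, not Cromwell's actual schema):

```scala
import slick.jdbc.MySQLProfile.api._

object StreamingSketch {
  class Metadata(tag: Tag) extends Table[(String, String)](tag, "METADATA_ENTRY") {
    def key   = column[String]("METADATA_KEY")
    def value = column[String]("METADATA_VALUE")
    def *     = (key, value)
  }
  val metadata = TableQuery[Metadata]

  val db = Database.forConfig("database") // hypothetical config path

  // db.stream returns a Reactive Streams DatabasePublisher that emits rows
  // incrementally; the fetchSize hint matters, or the JDBC driver may still
  // buffer the whole result client-side.
  val publisher = db.stream(
    metadata.result.withStatementParameters(fetchSize = 1000)
  )

  // Consume with backpressure-aware tooling (e.g. Akka Streams); foreach is
  // just the simplest consumer.
  val done = publisher.foreach { case (k, v) => println(s"$k -> $v") }
}
```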
You’ve done a bunch of investigation on this already, but adding for posterity: the metadata endpoint is particularly vulnerable to the joint-calling use case. In this use case (and in similar workflows), calls can scatter widely, and each call can have many inputs, each of which can be a substantial value. So, calls * scatter width * inputs * value length makes for a lot of data.
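As a purely illustrative back-of-envelope (all numbers hypothetical): 1,000 calls * a 100-wide scatter * 10 inputs * 1 KB per value is already 10^6 values and roughly 1 GB of metadata to gather and serialize in a single response.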