Final operators stats are not always propagated
Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats are not yet populated into the pipeline stats (pipelineContext.driverFinished(this) happens afterwards). This might mark the task as finished (and the final task info set) with the driver stats lost.
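The window described above can be sketched with a deterministic toy model. The class and method names below are simplified stand-ins for illustration only, not the actual Trino code; the listener is invoked synchronously to make the racy interleaving reproducible:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy pipeline context: driver stats are merged in only when
// driverFinished() is called.
class ToyPipelineContext {
    final AtomicLong totalRows = new AtomicLong();

    void driverFinished(ToyDriverContext driver) {
        totalRows.addAndGet(driver.rows);
    }
}

// Toy driver context reproducing the problematic ordering:
// "done" is published before the stats reach the pipeline.
class ToyDriverContext {
    final ToyPipelineContext pipeline;
    final AtomicBoolean done = new AtomicBoolean();
    long rows;

    ToyDriverContext(ToyPipelineContext pipeline) {
        this.pipeline = pipeline;
    }

    void finished(long producedRows, Runnable doneListener) {
        rows = producedRows;
        done.set(true);                 // 1. completion published first...
        doneListener.run();             // 2. a listener fires in the gap
        pipeline.driverFinished(this);  // 3. ...stats merged only afterwards
    }
}

public class LostStatsDemo {
    public static void main(String[] args) {
        ToyPipelineContext pipeline = new ToyPipelineContext();
        ToyDriverContext driver = new ToyDriverContext(pipeline);

        long[] observed = new long[1];
        driver.finished(100, () -> observed[0] = pipeline.totalRows.get());

        System.out.println("observed=" + observed[0]
                + " eventual=" + pipeline.totalRows.get());
        // prints: observed=0 eventual=100
    }
}
```

A listener that builds the final task info at step 2 snapshots `totalRows == 0` even though the driver produced 100 rows, which is exactly how the final stats get lost.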
Relates to: https://github.com/prestosql/presto/issues/5120
Issue Analytics
- Created: 3 years ago
- Comments: 15 (14 by maintainers)
After quite some time spent reading code and analyzing logs, I think I figured it out. There are at least 3 issues that caused test flakiness in stats. All stem from the asynchronous nature of task updates flowing from workers to the coordinator, and from StateMachine listener events, which might also race with other code to update TaskInfo instances.

1. SqlTask status is transitioned to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is just an empty one, with no stats to provide to the final TaskInfo.
2. TaskInfo on the coordinator is constructed from a final task status and a partial TaskInfo received earlier from the worker, which might not have all the stats collected just yet, while the final TaskInfo on the worker is built just a bit later.
3. A task may still be in FLUSHING status. Sometimes the final TaskInfo (or even TaskStatus) has not yet reached the coordinator, meaning its stage is not yet completed; upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

I've submitted a PR https://github.com/trinodb/trino/pull/9733 with my attempt to fix those issues. I tested it using a loop of 10K queries run in sequence. Without those changes, the first lost stats appeared within the first 100 queries.
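The general shape of the remedy for the driver-side race is to propagate stats before completion becomes observable, so that any listener reacting to "done" already sees complete stats. The sketch below uses hypothetical simplified classes to show the reordered sequence; it is an illustration of the idea, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy pipeline accumulating per-driver stats.
class Pipeline {
    final AtomicLong totalRows = new AtomicLong();
}

// Toy driver with the safe ordering: stats are merged into the
// pipeline BEFORE the done flag is published to listeners.
class Driver {
    final Pipeline pipeline = new Pipeline();
    final AtomicBoolean done = new AtomicBoolean();

    void finished(long producedRows, Runnable doneListener) {
        pipeline.totalRows.addAndGet(producedRows); // 1. propagate stats first
        done.set(true);                             // 2. then publish completion
        doneListener.run();                         // 3. listeners see full stats
    }
}

public class FixedOrderingDemo {
    public static void main(String[] args) {
        Driver driver = new Driver();
        long[] observed = new long[1];
        driver.finished(100, () -> observed[0] = driver.pipeline.totalRows.get());
        System.out.println("observed=" + observed[0]);
        // prints: observed=100
    }
}
```

With this ordering there is no window in which a final-task-info builder can observe a finished driver whose stats are missing from the pipeline totals.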
Additional fix: https://github.com/trinodb/trino/pull/9913
Follow-up cleanup: https://github.com/trinodb/trino/issues/9898