Final operators stats are not always propagated
Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats are not yet populated into the pipeline stats (pipelineContext.driverFinished(this) happens afterwards). This might mark the task as finished (and the final task info set) with the driver stats lost.
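The window described above can be sketched with a deterministic toy model. The class and method names below are simplified stand-ins for illustration only, not the actual Trino code; the listener is invoked synchronously to make the racy interleaving reproducible:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy pipeline context: driver stats are merged in only when
// driverFinished() is called.
class ToyPipelineContext {
    final AtomicLong totalRows = new AtomicLong();

    void driverFinished(ToyDriverContext driver) {
        totalRows.addAndGet(driver.rows);
    }
}

// Toy driver context reproducing the problematic ordering:
// "done" is published before the stats reach the pipeline.
class ToyDriverContext {
    final ToyPipelineContext pipeline;
    final AtomicBoolean done = new AtomicBoolean();
    long rows;

    ToyDriverContext(ToyPipelineContext pipeline) {
        this.pipeline = pipeline;
    }

    void finished(long producedRows, Runnable doneListener) {
        rows = producedRows;
        done.set(true);                 // 1. completion published first...
        doneListener.run();             // 2. a listener fires in the gap
        pipeline.driverFinished(this);  // 3. ...stats merged only afterwards
    }
}

public class LostStatsDemo {
    public static void main(String[] args) {
        ToyPipelineContext pipeline = new ToyPipelineContext();
        ToyDriverContext driver = new ToyDriverContext(pipeline);

        long[] observed = new long[1];
        driver.finished(100, () -> observed[0] = pipeline.totalRows.get());

        System.out.println("observed=" + observed[0]
                + " eventual=" + pipeline.totalRows.get());
        // prints: observed=0 eventual=100
    }
}
```

A listener that builds the final task info at step 2 snapshots `totalRows == 0` even though the driver produced 100 rows, which is exactly how the final stats get lost.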
Relates to: https://github.com/prestosql/presto/issues/5120
Issue Analytics
- Created: 3 years ago
- Comments: 15 (14 by maintainers)
After quite some time spent reading code and analyzing logs, I think I figured it out. There are at least 3 issues that caused test flakiness in stats. All stem from the asynchronous nature of task updates flowing from workers to the coordinator, and from StateMachine listener events, which might also race with other code to update TaskInfo instances.

1. SqlTask status is transitioned to FINISHED while the initial SqlTaskExecution is still being created; at that point the TaskHolder reference is just an empty one, with no stats to provide to the final TaskInfo.
2. TaskInfo on the coordinator is constructed from a final task status and a partial TaskInfo received earlier from the worker, which might not have all the stats collected just yet, while the final TaskInfo on the worker is built just a bit later.
3. A task may still be in FLUSHING status. Sometimes the final TaskInfo (or even TaskStatus) has not yet reached the coordinator, meaning its stage is not yet completed; upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.

I've submitted a PR https://github.com/trinodb/trino/pull/9733 with my attempt to fix those issues. I tested it using a loop of 10K queries run in sequence. Without those changes, the first lost stats appeared within the first 100 queries.
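The general shape of the remedy for the driver-side race is to propagate stats before completion becomes observable, so that any listener reacting to "done" already sees complete stats. The sketch below uses hypothetical simplified classes to show the reordered sequence; it is an illustration of the idea, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy pipeline accumulating per-driver stats.
class Pipeline {
    final AtomicLong totalRows = new AtomicLong();
}

// Toy driver with the safe ordering: stats are merged into the
// pipeline BEFORE the done flag is published to listeners.
class Driver {
    final Pipeline pipeline = new Pipeline();
    final AtomicBoolean done = new AtomicBoolean();

    void finished(long producedRows, Runnable doneListener) {
        pipeline.totalRows.addAndGet(producedRows); // 1. propagate stats first
        done.set(true);                             // 2. then publish completion
        doneListener.run();                         // 3. listeners see full stats
    }
}

public class FixedOrderingDemo {
    public static void main(String[] args) {
        Driver driver = new Driver();
        long[] observed = new long[1];
        driver.finished(100, () -> observed[0] = driver.pipeline.totalRows.get());
        System.out.println("observed=" + observed[0]);
        // prints: observed=100
    }
}
```

With this ordering there is no window in which a final-task-info builder can observe a finished driver whose stats are missing from the pipeline totals.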
Additional fix: https://github.com/trinodb/trino/pull/9913
Follow-up cleanup: https://github.com/trinodb/trino/issues/9898