Final operators stats are not always propagated

Looking at io.prestosql.operator.DriverContext#finished, it seems that a driver might be marked as done while its stats have not yet been propagated into the pipeline stats (pipelineContext.driverFinished(this); is called afterwards). This might mark the task as finished (and set the final task info) with the driver stats lost.
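
The ordering problem described above can be shown with a minimal, single-threaded sketch. Every class and method name here (PipelineStats, Driver, finishDoneFirst, finishStatsFirst, finalRowsSeenByTask) is a hypothetical stand-in and not the real Presto code; the only thing carried over from the issue is the idea that "done" becomes visible before the driver's stats reach the pipeline.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these classes are hypothetical stand-ins, not Presto's real ones.
class DriverFinishOrdering {
    /** Pipeline-level accumulator that drivers report into when they finish. */
    static final class PipelineStats {
        private long processedRows;

        void driverFinished(long driverRows) {
            processedRows += driverRows;
        }

        long processedRows() {
            return processedRows;
        }
    }

    /** Driver that notifies listeners it is done and hands its stats to the pipeline. */
    static final class Driver {
        private final PipelineStats pipeline;
        private final long rows;
        private final List<Runnable> doneListeners = new ArrayList<>();

        Driver(PipelineStats pipeline, long rows) {
            this.pipeline = pipeline;
            this.rows = rows;
        }

        void addDoneListener(Runnable listener) {
            doneListeners.add(listener);
        }

        // Ordering described in the issue: "done" becomes visible first.
        void finishDoneFirst() {
            doneListeners.forEach(Runnable::run);
            pipeline.driverFinished(rows);
        }

        // Alternative ordering: stats are published before "done" is visible.
        void finishStatsFirst() {
            pipeline.driverFinished(rows);
            doneListeners.forEach(Runnable::run);
        }
    }

    /** Returns the row count captured by a listener that builds "final" info on done. */
    static long finalRowsSeenByTask(boolean statsFirst) {
        PipelineStats pipeline = new PipelineStats();
        Driver driver = new Driver(pipeline, 100);
        long[] captured = new long[1];
        driver.addDoneListener(() -> captured[0] = pipeline.processedRows());
        if (statsFirst) {
            driver.finishStatsFirst();
        } else {
            driver.finishDoneFirst();
        }
        return captured[0];
    }

    public static void main(String[] args) {
        System.out.println("done first:  " + finalRowsSeenByTask(false));
        System.out.println("stats first: " + finalRowsSeenByTask(true));
    }
}
```

In this sketch the listener plays the role of whatever builds the final task info. When "done" fires first, the listener reads the pipeline before the driver has reported, so the driver's rows are missing from the captured value.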

Relates to: https://github.com/prestosql/presto/issues/5120

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

1 reaction
atanasenko commented, Oct 22, 2021

After quite some time of reading code and analyzing logs, I think I figured it out. There are at least 3 issues that caused test flakiness in stats. All stem from the asynchronous nature of task updates coming from workers to coordinator, and StateMachine listener events which might also race with other code to update TaskInfo instances.

  • The most frequent one occurs when the worker’s SqlTask status transitions to FINISHED while the initial SqlTaskExecution is still being created. At that time, the TaskHolder reference is still an empty one and has no stats to provide to the final TaskInfo.
  • The second, less frequent but still prominent, occurs when the final TaskInfo on the coordinator is constructed from a final task status and a partial TaskInfo received previously from the worker, which might not have all the stats collected yet, while the final TaskInfo on the worker is built just a bit later.
  • The third one is similar to the previous, but in this case the final status of the task is set during substage cancellation on the parent’s FLUSHING status. Sometimes the final TaskInfo (or even TaskStatus) has not yet reached the coordinator, meaning its stage is not yet completed. Upon cancellation of a substage, any stats received by the coordinator subsequently are ignored.
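
The second case in the list above can be sketched as a small deterministic simulation. This is an assumption-laden illustration, not Trino's actual coordinator code: TaskInfo here is a simplified two-field class, and CoordinatorView, updateTaskInfo, updateTaskStatus and rowsSeen are hypothetical names. It only shows how a status-only update that arrives before the worker's final TaskInfo can freeze partial stats.

```java
// Illustrative only: a simplified, hypothetical model of coordinator-side task tracking.
class FinalTaskInfoRace {
    /** Simplified task info: a state name plus one stat. */
    static final class TaskInfo {
        final String state;
        final long rows;

        TaskInfo(String state, long rows) {
            this.state = state;
            this.rows = rows;
        }
    }

    /** Coordinator-side view of one remote task; updates are ignored once it is done. */
    static final class CoordinatorView {
        private TaskInfo current = new TaskInfo("RUNNING", 0);
        private boolean done;

        // Full update from the worker, carrying stats.
        void updateTaskInfo(TaskInfo fromWorker) {
            if (done) {
                return; // late stats are dropped
            }
            current = fromWorker;
            done = "FINISHED".equals(fromWorker.state);
        }

        // Status-only update: carries the state but no stats,
        // so the last known (possibly partial) stats are reused.
        void updateTaskStatus(String state) {
            if (done) {
                return;
            }
            current = new TaskInfo(state, current.rows);
            done = "FINISHED".equals(state);
        }

        TaskInfo finalInfo() {
            return current;
        }
    }

    /** Replays the same messages in two arrival orders and returns the final row count. */
    static long rowsSeen(boolean statusArrivesFirst) {
        CoordinatorView view = new CoordinatorView();
        view.updateTaskInfo(new TaskInfo("RUNNING", 40)); // partial stats
        TaskInfo workerFinal = new TaskInfo("FINISHED", 100); // complete stats
        if (statusArrivesFirst) {
            view.updateTaskStatus("FINISHED");
            view.updateTaskInfo(workerFinal); // ignored: task already done
        } else {
            view.updateTaskInfo(workerFinal);
            view.updateTaskStatus("FINISHED");
        }
        return view.finalInfo().rows;
    }

    public static void main(String[] args) {
        System.out.println("status first: " + rowsSeen(true));
        System.out.println("info first:   " + rowsSeen(false));
    }
}
```

The same two messages produce different final stats depending only on arrival order, which is why such losses show up as flaky tests rather than consistent failures.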

I’ve submitted a PR, https://github.com/trinodb/trino/pull/9733, with my attempt to fix those issues. I tested it using a loop of 10K queries run in sequence. Without those changes, the first lost stats appeared within the first 100 queries.

