DZL Proposal and Variance Measurement
I’ve been doing some thinking and background research on what’s out there. I’m taking down my thoughts here as a sounding board so we can narrow down to an MVP.
The Problem
The LH team needs to be able to understand metric variance and overall runtime/latency (bonus, if possible: accuracy) in different environments, how the changes we make affect these attributes, and how we are trending over time.
Recap
Need to monitor:
- Metric variance
- Overall LH runtime
- (maybe if possible) Accuracy to real phones
Across:
- Environment (i.e. LR vs. local vs. travis/cloud/whatever)
- Different site types/URLs (i.e. example.com-type vs. cnn.com-type)
- Throttling types (i.e. Lantern vs. DevTools vs. WPT vs. none)
Use Cases:
- Overall “health dashboard” i.e. what does master look like overall?
- Compare version A to version B i.e. does this change improve LH?
- Timeline view by commit i.e. are we suffering from a death by a thousand cuts over time?
Potential Solution Components
- Mechanism for running LH `n` times in a particular environment on given URLs and storing the results in some queryable format (see the sketch after this list)
- Mechanism for visualizing all the latest `master` results
- Mechanism for visualizing the difference between two different versions of LH
- Mechanism for visualizing the history of `master` results
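To make that first component concrete, here's a minimal sketch of what a run-`n`-times connector could look like, assuming the `lighthouse` and `chrome-launcher` Node modules are available. The metric selection and result shape are placeholders (exact audit/property names vary by Lighthouse version), not a settled design.

```js
// Hypothetical sketch: run Lighthouse n times against a URL and collect the
// numbers we care about for variance analysis. Assumes the `lighthouse` and
// `chrome-launcher` npm packages; everything else is made up for illustration.
const lighthouse = require('lighthouse');
const chromeLauncher = require('chrome-launcher');

async function collectRuns(url, n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    const chrome = await chromeLauncher.launch({chromeFlags: ['--headless']});
    try {
      const {lhr} = await lighthouse(url, {port: chrome.port, onlyCategories: ['performance']});
      results.push({
        url,
        runIndex: i,
        fetchTime: lhr.fetchTime,
        performanceScore: lhr.categories.performance.score,
        firstContentfulPaint: lhr.audits['first-contentful-paint'].numericValue,
        interactive: lhr.audits['interactive'].numericValue,
        rawLhr: JSON.stringify(lhr), // keep the full response so we can add columns later
      });
    } finally {
      await chrome.kill();
    }
  }
  return results;
}
```

A LR connector would presumably expose the same `collectRuns`-style interface and only swap out how the run is actually executed.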
Existing Solutions
The good news: we have an awesome community that has built lots of things to look at LH results over time 😃 The bad news: their big selling points usually revolve around making time series data easy and abstracting away the environment concerns (which is the one piece we will actually need to change up and have control over the most) 😕
Only one of the use cases here is really a time series problem (and even then it’s not a real-time time series, it’s a commit-level time series). That’s not to say we can’t repurpose a time series DB for our use cases, Grafana still supports histograms and all that, it’s just a bit of a shoehorn for some of the things we’ll want to do.
The other problem: one of the things we actually care about most in all of this is the differences between versions of Lighthouse. Given that abstracting the environment away and keeping it stable is a selling point of all these solutions, breaking in to make comparing versions our priority really cuts against the grain. Again, not impossible, but not exactly leveraging the strengths of these solutions.
Proposed MVP
K-I-S-S, keep it simple, stupid. Great advice, hurts my feelings every time.
Simple CLI with 2 commands.
- `run` - handles the run-`n`-times-and-save piece; a single js file for each connector we need to run, just local and LR to start
- `serve` - serves a site that enables the visualization pieces

These two commands share a CLI config that specifies the storage location. I’m thinking SQLite to start to avoid any crazy Docker mess and work with some hypothetical remote SQL server. We can include a field for the raw response so we can always add more columns easily later.
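To sketch the storage piece, here's one possible shape for the shared store, using `better-sqlite3` purely as an example driver; the config, schema, and column names are hypothetical and only illustrate the "raw response plus a few promoted columns" idea.

```js
// Hypothetical sketch of the shared storage layer backed by SQLite.
// `better-sqlite3` is just one option; the schema/column names are placeholders.
const Database = require('better-sqlite3');

function openStore(config = {storagePath: 'dzl.sqlite'}) {
  const db = new Database(config.storagePath);
  db.exec(`
    CREATE TABLE IF NOT EXISTS runs (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      lh_version TEXT,       -- Lighthouse version or commit hash being measured
      environment TEXT,      -- e.g. "local" or "LR"
      url TEXT,
      run_index INTEGER,
      performance_score REAL,
      fcp_ms REAL,
      tti_ms REAL,
      raw_lhr TEXT           -- full JSON response so we can add columns later
    )
  `);
  return db;
}

function saveRuns(db, lhVersion, environment, runs) {
  const insert = db.prepare(`
    INSERT INTO runs (lh_version, environment, url, run_index,
                      performance_score, fcp_ms, tti_ms, raw_lhr)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
  `);
  for (const run of runs) {
    insert.run(lhVersion, environment, run.url, run.runIndex,
               run.performanceScore, run.firstContentfulPaint,
               run.interactive, run.rawLhr);
  }
}
```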
Thoughts so far? Did I completely miss what the pain is from others’ perspective? Does it sound terrifyingly similar to plots? 😱
Top GitHub Comments
Hey, so yeah, I was thinking that I want to see the data commit-over-commit so that I can see whether a specific commit has introduced a problem, or whether variance has been reduced.
I am still liking the idea of a candlestick graph with the data laid out that way. It would let us visualize when the variance is narrowing, i.e. the std dev going down over time, or when a specific commit increased variance.
So that is kind of how I like to visualize the scores over time, either with a candlestick chart, or with a line chart + shaded area of +/- 1-2 std dev around it to show the variance.
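For what it's worth, the aggregation behind either chart is straightforward: group runs by commit and compute mean/std dev (plus min/max) per metric. A rough sketch, assuming rows shaped like the hypothetical schema above:

```js
// Hypothetical sketch: aggregate runs per commit into the numbers a
// candlestick or mean ± std dev chart would need.
function summarizeByCommit(rows, metric) {
  const byCommit = new Map();
  for (const row of rows) {
    if (!byCommit.has(row.lh_version)) byCommit.set(row.lh_version, []);
    byCommit.get(row.lh_version).push(row[metric]);
  }

  return [...byCommit].map(([commit, values]) => {
    const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
    const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
    return {
      commit,
      mean,
      stdDev: Math.sqrt(variance),
      min: Math.min(...values),
      max: Math.max(...values),
    };
  });
}
```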
I like the current visualizations, especially broken down by URL. But personally I want to see line charts/candlestick charts that show me what each metric is doing over time, so I can see if something is getting out of hand or degrading slowly. For snapshots, I like all the called-out percentages and variance boxes coded red/yellow/green.
Hey @patrickhulce, have you looked into Superset? (Disclaimer: I used to work on the Airflow DAG ingestion and viz on a tool that used Superset, so I like it, and it’s Python.)
Made some candles with some of the dumped data to show what variance in run duration over multiple commits could look like in candle form.
OK. I think I’m sufficiently convinced that these tools optimize for metrics arriving continuously and requiring grouping. Our use case of discrete metrics every hour or day isn’t supported well by them (your 3rd bold point).
I appreciate the attempt to make it work, but agree that Grafana isn’t a great solution for what we’re trying to do here.