
Rework global user accuracy metric


Motivation

As I understand it, the accuracy of a user is currently calculated as a weighted average of the accuracies of the top plays, using the same weighting system that pp uses. This approach has several issues:

  1. The average is not a robust measure of central tendency.

This means that an outlier accuracy reaching the top plays would heavily affect the resulting value, either boosting it or reducing it sharply (a quick numeric sketch follows this list).

  2. The weighting system heavily biases the result towards the accuracy of the very top pp plays. However, the relationship between the pp gained from a play and the difficulty of the map is non-trivial, which makes the weights used questionable.

After successfully passing a map beyond the player’s current comfort zone, that map could turn out to be a top-pp play with low accuracy. In a situation that is supposed to be joyful for the player, their global accuracy drops sharply instead, resulting in a bad™ user experience.

  3. The metric is hard to use for comparing the skill of multiple players.

Example: if player A has 98% accuracy and player B has 95% accuracy, who is better, given that player A is at 500pp and player B at 800pp? The higher accuracy of player A casts a shadow over the higher ranking of player B, because player A could be “sacrificing” accuracy by playing harder maps in order to gain pp. In the end, however, this interpretation of the higher accuracy is just a suspicion, not a reliable conclusion.
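To make point 1 concrete, here is a minimal Python sketch with hypothetical accuracy values. It assumes the profile accuracy uses the same 0.95^n weight decay that pp weighting does, which matches my understanding above but is an assumption about the live implementation:

```python
# Minimal sketch of issue 1: one low-accuracy outlier at the top of the
# play list moves the weighted average far more than the median.
# Play data is hypothetical; the 0.95**n decay assumes profile accuracy
# is weighted the same way pp is, as described above.
from statistics import median

def weighted_accuracy(accuracies):
    """pp-style weighted average: the n-th best play gets weight 0.95**n."""
    weights = [0.95 ** n for n in range(len(accuracies))]
    return sum(a * w for a, w in zip(accuracies, weights)) / sum(weights)

top_plays = [0.97, 0.96, 0.96, 0.95, 0.97]   # ordered by pp, best first
with_outlier = [0.78] + top_plays            # low-acc pass becomes the new top play

print(f"weighted: {weighted_accuracy(top_plays):.2%} -> {weighted_accuracy(with_outlier):.2%}")
print(f"median:   {median(top_plays):.2%} -> {median(with_outlier):.2%}")
```

With these numbers the weighted average drops from about 96.4% to 92.8% after the single outlier, while the median stays at 96%.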

Proposal

I have three proposals to fix these three issues. Each of them more or less builds upon the previous one.

  1. Switch the calculation from a weighted average to the median of the accuracies of the top pp plays (the median is a robust measure of central tendency).

  2. The displayed user accuracy should be a compound metric of two (or more) values that represent the user’s performance over different map difficulty brackets.

In particular, I am proposing to split the top pp plays into two sets, according to the maximum pp obtainable (or star rating?) on each map with the mods used.

Hence, the accuracy would be displayed as a ➜ b, where a is the median of the accuracies of plays on the easier maps, and b is the median of the accuracies of plays on the harder maps. Note that a would represent the accuracy for maps within the player’s comfort zone, while b would represent the accuracy for maps that challenge the player’s skill.

In this context, a ➜ b is to be read roughly as “a accuracy in comfortable maps, with b accuracy in harder maps”.

  3. Establish manageable pp cutoffs, such as 100 * 2^n. The accuracy then becomes a list of medians, one for each pp range associated with map difficulty. In most places the accuracy would be shown as just the two medians of the highest-difficulty sets, as sketched below. However, there would be a place in the user profile where the full list would be shown in order to compare several players.
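As a rough illustration of proposal 3, here is a minimal Python sketch. The play data and the (accuracy, max_pp) layout are invented for illustration; it only demonstrates the bucketing and median logic:

```python
# Hypothetical sketch of proposal 3: bucket the top plays by the maximum
# pp obtainable on the map (cutoffs at 100 * 2**n) and report the median
# accuracy per bucket. The (accuracy, max_pp) pair layout is invented here.
from collections import defaultdict
from statistics import median

def bracket(max_pp):
    """Index n of the 100 * 2**n bracket that max_pp falls into."""
    n = 0
    while max_pp >= 100 * 2 ** (n + 1):
        n += 1
    return n

def accuracy_profile(plays):
    """plays: iterable of (accuracy, max_obtainable_pp) pairs."""
    buckets = defaultdict(list)
    for acc, max_pp in plays:
        buckets[bracket(max_pp)].append(acc)
    return {n: median(accs) for n, accs in sorted(buckets.items())}

plays = [(0.98, 120), (0.97, 150), (0.95, 250), (0.93, 260), (0.85, 450)]
profile = accuracy_profile(plays)
print(profile)                      # full list of medians, for the profile page
a, b = list(profile.values())[-2:]  # the two highest-difficulty brackets
print(f"{a:.2%} ➜ {b:.2%}")         # compact "a ➜ b" display
```

With the sample data this prints the full profile {0: 0.975, 1: 0.94, 2: 0.85} and the compact display “94.00% ➜ 85.00%”.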

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 15 (5 by maintainers)

Top GitHub Comments

1 reaction
Slayer95 commented, Apr 27, 2019

Is there an issue with using the average without weighting? That’s another option I’d consider, though I haven’t thought about it much; it could be flawed.

Why median over average? I understand why you want to get rid of the weighted average […].

The arithmetic mean is too susceptible to outlier values. Let’s show this with an example:

Example: A new player installs osu! and plays a low-end Hard map, getting 95%. “Too ez,” he says, “I’ll play Insane.”

The new player then plays 5 Insane maps, getting accuracies of ~70%.

Later, on social media:

New Player: Hey! I finally installed that game you told me about, osu!
Old Player: Oh, really? That’s great!
Old Player: So, how good are you? What’s your accuracy?
New Player: Well, I’ve been getting around 70%.
(Old Player: lol that’s so bad, well whatever)
Old Player: Let’s have some multiplayer matches.
(Old Player gets online and searches for their friend.)
(Old Player reads in their profile: Accuracy: 74.17%. Oh, I guess they’re not that bad?)

They play some Insane maps together. Accuracies: 70%, 71%, 68%, 69%, 72%, 70%.

The value of 74.17% above was calculated as the unweighted average (5 × 70 + 95) / 6 ≈ 74.17. For comparison, the median would be 70%. When they play together, the median turns out to be a much better predictor of their typical accuracy. This illustrates the large effect of the single 95% outlier from the first, easy game.

tl;dr: the average is too vulnerable to outlier values. This issue is exacerbated by the current weighting system used, but in the end it’s intrinsic to the arithmetic mean.
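For completeness, the arithmetic from the dialogue checks out in a few lines of Python:

```python
# Reproduces the arithmetic from the example above: the unweighted mean
# is pulled up by the single 95% outlier, while the median stays at the
# player's typical level.
from statistics import mean, median

plays = [95, 70, 70, 70, 70, 70]       # one easy-map outlier plus five Insane plays
print(f"mean:   {mean(plays):.2f}%")   # 74.17%
print(f"median: {median(plays):.2f}%") # 70.00%

matches = [70, 71, 68, 69, 72, 70]     # accuracies from the multiplayer matches
print(f"typical match accuracy: {median(matches):.2f}%")   # 70.00%
```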

0 reactions
abraker95 commented, Apr 29, 2019

@abraker95, your data doesn’t seem right. See e.g. https://i.imgur.com/WNiuX2D.png

Furthermore, since I posted my own results above, that user has already played more matches. Now I get a median of 80.60%.

Yup, I forgot to factor in misses. I updated the code in the linked demo, but fixing the table will have to wait until tomorrow.


