Should bins implicitly filter data that is out of range?


Vega-Lite’s default for binning is to assign out-of-range data to the nearest defined bin. This is not a default I’ve come across in other charting libraries, and it is potentially very misleading. For example, here is a zoomed-in normal distribution (Vega editor):

{
  "data": {"sequence": {"start": 0, "stop": 1000}},
  "transform": [{"calculate": "sampleNormal()", "as": "x"}],
  "mark": "bar",
  "encoding": {
    "x": {"field": "x", "type": "quantitative", "bin": {"extent": [-1, 1]}},
    "y": {"type": "quantitative", "aggregate": "count"}
  }
}

[Image: resulting visualization]

If you don’t know to look for this (and my guess is most users wouldn’t) it would lead you to make erroneous conclusions about the content of the data.
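
One way to make the effect visible is to color each bar by whether the underlying values actually fall inside the extent. The spec below is a minimal sketch under assumptions: the `in_extent` field name is made up for illustration, and it presumes a Vega-Lite version that still clamps out-of-range values into the edge bins. If clamping is happening, the first and last bars should show a stacked "out of range" segment.

```json
{
  "description": "Sketch (assumption: out-of-range values are clamped into the edge bins). The in_extent field is hypothetical.",
  "data": {"sequence": {"start": 0, "stop": 1000}},
  "transform": [
    {"calculate": "sampleNormal()", "as": "x"},
    {
      "calculate": "datum.x >= -1 && datum.x <= 1 ? 'in range' : 'out of range'",
      "as": "in_extent"
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "x", "type": "quantitative", "bin": {"extent": [-1, 1]}},
    "y": {"type": "quantitative", "aggregate": "count"},
    "color": {"field": "in_extent", "type": "nominal"}
  }
}
```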

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

1 reaction
jheer commented, Dec 4, 2019

I like the proposal to output either null or infinite values for out-of-range bins. I will look into that soon!

1 reaction
zanarmstrong commented, Dec 4, 2019

+1 to Jake’s point that this is “potentially very misleading” and “it would lead you to make erroneous conclusions about the content of the data.” Can this get a higher priority?

In Jake’s example, the chart is showing that there are >200 points in the dataset that have a value between 0.8 and 1. But, there aren’t >200 points in the dataset with values between 0.8 and 1. There are ~50.

This breaks the fundamental concept of a histogram: the height of each bar corresponds to the number of points >=x0 and <x1 (where x0 is the left edge of a bin and x1 is the right edge).

Some options

(I’m writing from a user’s perspective, not knowing how this fits into Vega structure)

a. Values outside the extent aren’t shown at all: If a bin extent is defined (either explicitly or via a selection), it seems much more reasonable for those values to be excluded from the chart altogether than to end up in a bin that doesn’t match their value. This is what NumPy, ggplot2, and d3 all do.

b. n+2 bins exist: what if there is always an additional bin for “values < the left-most bin” and another for “values > the right-most bin”, perhaps with a flag to include or exclude them? Or a flag with the options “show bucket for left side”, “show bucket for right side”, “show both”, or “show neither”.

c. Single out of range bin: Something like what kanitw suggested above.

d. Do (a) with a warning saying that “x points have been excluded from the chart due to being out of range”. ggplot2 also does this.

e. The x-axis text clearly indicates that the left-most and right-most bins are for values -infinity to -0.8 and 0.8 to infinity, and not -1 to -0.8 and 0.8 to 1.

My personal preference is for (a) and (d), with a flag to enable (b). Option (c) could easily mess up the scale on the y-axis if you are purposefully excluding a lot of data (more than is in any single bin).

Option (e) would at least make it clear what’s happening in the current chart, without having to change how the data is being rendered.
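
Until the default changes, option (a) can probably be approximated by hand with a filter transform placed before the binned encoding. This is a sketch of a workaround, not a description of what Vega-Lite does internally; the [-1, 1] range simply repeats the bin extent from Jake’s example.

```json
{
  "description": "Sketch of option (a): drop out-of-range values before binning. A manual workaround, not built-in bin behavior.",
  "data": {"sequence": {"start": 0, "stop": 1000}},
  "transform": [
    {"calculate": "sampleNormal()", "as": "x"},
    {"filter": {"field": "x", "range": [-1, 1]}}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "x", "type": "quantitative", "bin": {"extent": [-1, 1]}},
    "y": {"type": "quantitative", "aggregate": "count"}
  }
}
```

Note that the field range predicate is inclusive on both ends, and that the filtered points are dropped silently, so this does not provide the warning described in option (d).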

What other libraries do:

NumPy: “Values outside the range are ignored”

ggplot2: “Note that, by default, any values outside the limits will be replaced with NA.” A warning is also printed with the number of data points not shown.

d3: if you set a domain for the bin, it ignores values outside of that range.

In the example notebook Fil adds some amusing warning labels.
