Reimagining histogram autobin
Histogram autobin works relatively well for a single trace, but could be better. But for multiple traces, despite several attempts to clean it up over time (#2028, #1944, #1901, …), it has a bunch of problems:
- If you leave out all binning information, we’ll push `autobinx: true` back to the first trace, but `autobinx: false` to all subsequent traces. This doesn’t have any immediate impact, but if you then try to alter bins later, the results depend on which trace you edit. In particular, if you edit the first one (which seems logical, right?), the later ones will keep their original bin sizes (https://codepen.io/alexcjohnson/pen/pOVrGj?editors=0010), but if instead you edit the other two (so change `[0]` to `[1,2]` in the `restyle` call at the end) the first one will get this new size as well.
- Stacked/grouped histograms can have different bin sizes or incompatible start positions (this is the resulting situation in the codepen above). This results in a misleading plot (you can make peaks shift, so two matching peaks look separated, for example, and make a flat distribution look like it has gaps). I would argue that even if the bins are explicitly supplied this way we should not allow this situation, and should take the size only from the first trace that explicitly specifies it, similar to how we handle stacked area options. It’s fine though to have independent sizes & starts if `barmode='overlay'`. I initially thought we might need a “`bingroup`” attribute similar to scatter’s `stackgroup`, but now I think it’s better to just enforce a match across the already known group.
- #1944, while matching up autobin sizes across histograms, made what I now think is the wrong decision: for multiple autobinned histograms the bin size “is the minimum any of them were auto-assigned”. I think a much better solution would be to concatenate all the data together and autobin it as a single unit. That will generally result in roughly the largest bin size of any of the constituent traces, but sometimes it will be bigger than any of the individual traces (if they have a bigger range together than separately), sometimes smaller (if they have similar ranges but the total sample size is now big enough that we choose a smaller bin), and it becomes very clear how to shift the start to reduce ambiguity (to minimize data exactly at bin edges). Initially I was thinking we should make this optional, but I’ve come to think that if we clean up the first two points above, the autobin size should just be changed to be the composite size.
- We’re mutating `gd.data` (`autobinx`, `xbins.(start|end|size)`), and doing so in buggy ways at that.
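To make the “concatenate and autobin as a single unit” idea concrete, here is a rough sketch. The sizing rule in `niceBinSize` (aim for about sqrt(n) bins, then round up to 1, 2, or 5 times a power of ten) is purely an assumption for illustration and is not the actual plotly.js autobin routine; both function names are hypothetical.

```javascript
// Hypothetical sizing rule, NOT the plotly.js algorithm: aim for roughly
// sqrt(n) bins, then round the raw size up to 1, 2, 5, or 10 times a power of ten.
function niceBinSize(values) {
  const min = values.reduce((m, v) => Math.min(m, v), Infinity);
  const max = values.reduce((m, v) => Math.max(m, v), -Infinity);
  const targetBins = Math.max(1, Math.ceil(Math.sqrt(values.length)));
  const rough = (max - min) / targetBins || 1; // fall back to 1 if all values are equal
  const pow10 = Math.pow(10, Math.floor(Math.log10(rough)));
  const candidates = [1, 2, 5, 10].map((m) => m * pow10);
  // Pick the smallest candidate that is at least the raw size.
  return candidates.find((c) => c >= rough) ?? 10 * pow10;
}

// Composite size: pool the x data of every trace in the group and size it once,
// instead of taking the minimum of the per-trace sizes.
function compositeBinSize(traces) {
  return niceBinSize(traces.flatMap((t) => t.x));
}

const traceA = { type: 'histogram', x: [0, 1, 2, 3] };
const traceB = { type: 'histogram', x: [10, 11, 12, 13] };
console.log(niceBinSize(traceA.x), niceBinSize(traceB.x), compositeBinSize([traceA, traceB]));
```

With this toy rule the two traces each get a size of 2 on their own, but the pooled data gets 5, because the combined range is much wider than either individual range. The minimum-size rule from #1944 would have picked 2.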
Proposal:
- Drop `autobin(x|y)` entirely, and just use the (improved) autobin routine to fill in whatever gaps there are in the explicitly specified attributes. So if you have explicitly specified an attribute and would like it to revert to auto, instead of turning on `autobin` you would delete the value.
- For backward compatibility with `restyle` we can convert `autobinx: true` into `xbins: null`. I’ll have to investigate the existing behavior when `autobinx: true` and `xbins` are both specified upfront to figure out what we would want to do in `cleanData`.
- Coerce `nbinsx` iff an explicit `xbins.size` is not found (in any trace in the group).
- Determine any needed auto values for `xbins`, and stash these and only these in `fullTrace._xbins`, but have both `supplyDefaults` and `calc` ensure the two are merged in `fullTrace.xbins`. So for example, if you specify `trace.xbins: {end: 30}` we’ll set `fullTrace._xbins: {start: 0, size: 5}` and `fullTrace.xbins: {start: 0, end: 30, size: 5}`. The reason for this is `Plotly.react`: we want the full set in `fullTrace`, and `supplyDefaults` needs access to the auto-generated values, but in case you delete an explicit value from `trace` we don’t want `supplyDefaults` to be able to fill it back in, so we will see the change and trigger a `calc`.
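The `_xbins` / `xbins` split in the last point can be sketched as below. Only the two attribute names come from the proposal; `resolveBins`, `BIN_KEYS`, and the shape of the `autoBins` argument are hypothetical, and the auto values are simply passed in rather than computed.

```javascript
const BIN_KEYS = ['start', 'end', 'size'];

// traceXbins: whatever the user put in trace.xbins (possibly undefined or partial).
// autoBins: a complete auto-computed {start, end, size} (assumed to exist already).
function resolveBins(traceXbins, autoBins) {
  const explicit = traceXbins || {};
  const _xbins = {}; // auto-generated values only
  const xbins = {};  // explicit values merged with the auto ones
  for (const key of BIN_KEYS) {
    if (explicit[key] !== undefined && explicit[key] !== null) {
      xbins[key] = explicit[key];
    } else {
      _xbins[key] = autoBins[key];
      xbins[key] = autoBins[key];
    }
  }
  return { _xbins, xbins };
}

// The example from the proposal: only `end` is explicit.
const resolved = resolveBins({ end: 30 }, { start: 0, end: 27, size: 5 });
console.log(resolved._xbins, resolved.xbins);
```

Because `_xbins` never holds a value the user supplied, deleting `end` from the trace and resolving again yields the auto `end` rather than a stale copy of 30.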
@etpinard I know this is a fairly big change, potentially breaking for some users, but the existing behavior is sufficiently broken and problematic already that I think we should consider it. This would also allow us to take a broader view of what “autobin” means, realizing it’s not just a boolean but `size`, `start`, and `end` can all be independently auto or explicit.
Note that `start` does not need to match exactly from one trace to the next but does need to be compatible (all of them must be the same modulo `size`). `end` can be completely independent from trace to trace. So this logic will still be a little bit intricate…
Top GitHub Comments
For the record, some aspects of the current ~disaster~ behavior:
Single trace
Starting from `{type: 'histogram', x: xArray}` add the following:
1. `{}` (full implied auto): We add `autobinx: true, xbins: {start, end, size}` to `gd.data`. Future data changes cause new autobin.
2. `{autobinx: false}` (specify not auto, but no values): We keep `autobinx: false` but add `xbins: {start, end, size}` to `gd.data`. So we autobin once but future data changes will not autobin again.
3. `{autobinx: ?, xbins: {size}}` (specify any auto, incomplete bin values, any combination except all 3 bin attributes specified): Partial bin info is discarded, future autobin as in above two cases. Unspecified `autobinx` gets set to `true`.
4. `{xbins: {start, end, size}}` (autobin not included, complete bin values): We add `autobinx: false` to `gd.data`, no autobin ever happens.
5. `{autobinx: false, xbins: {start, end, size}}` (specify not auto, complete values): No mutation to `gd.data`, no autobin ever happens.
6. `{autobinx: true, xbins: {start, end, size}}` (specify auto AND complete values): New autobin values overwrite `xbins`. Future data changes cause new autobin.

As proposed above, the only change to first draw behavior would be case 3, where partial bin specification would respect the specified parts and auto-determine the others. But all `gd.data` mutations would disappear except the `cleanData` step of clearing `xbins` if `autobinx=true` then removing `autobinx`.

On updates, the new proposal has no “autobin once” possibility. In fact I believe that’s theoretically impossible without mutating `gd.data`. I suppose we could keep an explicit `autobinx=false` around until we finish autobinning, then push the results back to `gd.data` and delete `autobinx`… This would be similar to how we handle `xaxis.autorange: 'reversed'` and turn it into `autorange: true, range: [high, low]`.
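The one remaining `gd.data` mutation, the `cleanData` step, might look roughly like this. It is a standalone sketch: `cleanAutobin` is a hypothetical name, the real `cleanData` in plotly.js does far more, and applying the same rule to `y` is an assumption based on the `autobin(x|y)` wording in the proposal.

```javascript
// Backward-compatibility sketch: an explicit autobin(x|y): true means
// "ignore any supplied bins", so clear them; then drop the legacy flag either way.
function cleanAutobin(trace) {
  for (const letter of ['x', 'y']) {
    const autoKey = 'autobin' + letter;
    const binsKey = letter + 'bins';
    if (trace[autoKey] === true) {
      delete trace[binsKey];
    }
    delete trace[autoKey];
  }
  return trace; // mutated in place, like the step described above
}

console.log(cleanAutobin({ type: 'histogram', autobinx: true, xbins: { start: 0, end: 10, size: 1 } }));
console.log(cleanAutobin({ type: 'histogram', autobinx: false, xbins: { start: 0, end: 10, size: 1 } }));
```

In the first call the stale `xbins` is dropped so the trace autobins from scratch; in the second the explicit bins survive and only the legacy flag is removed.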
Multiple traces
1. `[{}, {}, {}]` (full implied auto): `xbins` gets chosen to match across all traces (minimum size), but `autobinx` is set `true` in the first and `false` in others.
2. `[{autobinx:?}, {autobinx:?}, {autobinx:?}]` (no bin defs but explicit autobinx either true or false): `xbins` as in implied auto, `autobinx` takes the requested value for each trace.
3. `[{xbins:{size}}, {xbins:{size}}, {xbins:{size}}]` (partial bin defs): `autobinx=true` for all traces, but `xbins.start,end` are NOT filled in anywhere. PLOT FAILS TO RENDER. Same result if even one trace has partial bin defs and others have no bin defs (though specifying `start` and `end` with no size seems to nearly work).
4. `[{autobinx:true, xbins:{?}}, {autobinx:true, xbins:{?}}, {autobinx:true, xbins:{?}}]` (explicit autobin, any or no bin defs): `xbins` gets chosen to match across all traces, `autobinx` is respected.
5. `[{}, {xbins:{start,end,size}}, {}]` (full bin def for one trace): `xbins` of other traces sometimes get chosen to match the one with a full def, sometimes smaller bin sizes (we seem to have thought this acceptable at some point, it seems to always be the defined bin size divided by an integer? Whatever, it really makes no sense). `autobinx` is set `false` for the trace with defined bins, and otherwise `true` for the first trace and `false` for others (as in case 1).
6. `[{xbins:{start,end,size}}, {xbins:{start,end,size}}, {}]` (full bin def for more than one trace): The defined bins are allowed, even if they conflict. Any undefined bin(s) get sized to the first defined bin size divided by some integer, always small enough to be less than the smallest defined bin size.

All of these scenarios have problems, other than (4) explicit and complete autobin (though even there I think we agree we should pick the bin size differently, and usually larger). I’d say the multi-trace behavior is broken enough that anyone who’s currently using it successfully must either have data that happens to work fine fully auto, or must have inserted complete manual (and explicitly matching!) bin specs. Neither of those cases would break using the above proposal.
Referencing https://github.com/plotly/plotly.js/issues/1282 for some “old” thoughts on `auto*` attributes.