[WIP] Crossfilter discussion
Reactive, crossfiltered data visualization
Plotly originally focused on generating visualizations, and interactivity has increased over time. By now, Plotly has acquired rich layout, style and data update facilities, even animations. Data transformations such as declarative input grouping and filtering have also been added.
As there are growing expectations for fluid, efficient, yet still declarative interactions such as crossfiltering, we are starting a discussion with the purpose of shaping an API in line with Plotly conventions, current practices and future expectations.
Crossfilters are behaviors that let the user subset a multivariate dataset via direct manipulation across multiple views of that dataset. The technique is also known as linked brushing or linked filtering. The set of views included in one crossfilter is called coordinated views in the crossfilter.js documentation, or sometimes linked views. There is no clear-cut boundary for the functional scope and features of crossfilters.
The archetypal crossfilter example by Mike Bostock, author of crossfilter.js, showing multidimensional filtering and aggregation on a quarter million records, also updating a sample table:
This text is just to get the ball rolling. There is prior art surrounding the Plotly toolchain and its dependencies such as D3. Since these tools are in active use and well documented, this description doesn’t detail them beyond listing them and highlighting some of their properties.
Also, there is a fantastic discussion on the topic by Carson Sievert, including many of the crossfilter concepts. Due to the richness of that material, this writeup can be a bit sparser on the crossfilter behaviors and more detailed on implementation concerns.
It’s still useful to start with one way of thinking about interactivity, as crossfiltering is a particular instance of it. Also, crossfilter cores such as crossfilter.js usually peer-depend on change propagation or reactivity. Section 1 may be skipped to jump directly into the crossfilter-specific part.
1. Reactivity
What runtime changes may occur to a visualization?
Not all types of visualizations require sophisticated updates. For example, a command-line tool such as the typical use of ggplot2 is technically a single-step execution, even if the dataviz maker may repeatedly invoke it with various projections, aesthetics and data. These are common things that need data flow:
- respond to media characteristics such as document viewport size, device type, browser zoom level, device pixel density, print vs screen etc. - as the screen size or other attributes change, due to browser window resizing, a new, constraining peer element in the DOM, or a portrait -> landscape rotation, the dataviz view updates itself (the traditional, very narrow, hijacked notion of the term *responsive design*)
- respond to incoming data streams such as live stock market data, limit price reached or notifications, or incrementally loaded data, updating a view over time
- respond to user session lifecycle events such as timed logout on inactivity
- respond to elapsing time, physical or logical - for example,
  - transitions shift the view from one specific state to another, e.g. updating scatterplot data would gradually ease the scatter points to their new place, or fade glyphs in/out - it’s often an important usability aid called *object constancy*, or it can be used for effects of aesthetic appeal or engagement
  - animations are a way of storytelling, taking up some amount of time, causing a sequence of visual changes
  - feature tour for walking the user through the views, interactions and key points about the dataviz
  - recurrence relations or other simulation for kinetic scrolling or other effects in a dataviz (for example, simulating gravity to compare physics on different planets)
- respond to direct user intent such as filtering the view on a particular year, zooming, panning, hovering over data points to get details, or clicking on specific interactions such as reset view, printing, next page etc.
- domain-specific intent - for example, hover over a scatterpoint to see details, click a link to jump to a detailed view, or adjust an asset price to see the effects on the profit
- configuration - some of the power user’s intent may pertain to the dataviz layout or other settings itself, rather than its contents via regular use (the boundary is blurred as a dimension change can be categorized as configuration or domain-specific intent)
- development - no-latency redraw during development is a productivity aid, allowing the designer or developer to more effectively search the design space, see what works and what doesn’t, via tweaking otherwise unexposed knobs and sliders
Browser standards may cover some of the above items. For example, a CSS media query might provide print layout; the `<title>` SVG tag provides basic tooltip hover; CSS supports transitions and animations for HTML and DOM elements. Often, these have limitations: CSS transitions and animations do not work for Canvas and WebGL (and in IE11, even SVG is poorly supported); the tooltip is very basic; sometimes the browsers have bugs, making CSS-based layout changes hard or impossible (for example, `non-scaling-stroke` is buggy in some browser versions, and CSS translations can run into numerical issues).
Therefore, while following the standards is important for accessibility and progressive enhancement, they do not in general substitute for JavaScript execution for dataviz recalculation and rerender.
Why do runtime changes need some data flow concept?
Various terms exist for the need of a data flow concept. Perhaps the most often used term is “reactivity”, not to be confused with `react`, a library that solves some rendering aspects of a reactive UI. The term “responsive” is sometimes used, although it’s often meant in a regrettably limiting sense, such as redrawing on a window resize. There are various technical names such as streams and observables. Below, we’ll stick to the generic term “data flow”. There are related concepts such as promises, the publish/subscribe pattern and the observer pattern, all trying to solve some aspect of the data flow problem.
Some visualizations may not really need one
For example, a very simple `D3` or `react` based visualization may just rely on these respective libraries for the initial rendering and update (rerendering). Both `D3` and `react` have been designed to allow idempotent rendering, such that the user may have a simple concept of ‘data in, view out’ - and these libraries handle the rest. Even in this case, there’s some data flow concept, hidden beneath the library, but expressed through the API. In the case of `D3` there are `selection`s, data binding and the General Update Pattern, involving most DOM-specific API calls such as `selection.data().enter()`, `selection.attr()`, `selection.transition()`. `D3` also provides common interactions such as brushing and dragging, as well as simple event dispatch and HTTP request handling. In `react`, the basic idea is that a pure function maps some data object to a DOM fragment; its underlying mechanism is DOM diffing via the virtual DOM, and it allows methods for component lifecycle events such as insertion or removal of a node.
Anyway, `D3` views are often embedded in some framework that provides data flow functions, and `react`, or lighter-weight alternatives such as `inferno` and `preact`, are often accompanied by data-centric tools such as `MobX` or `redux`.
Also, some use cases simply involve a one-off rendering, for example, outputting a static visualization, with no or basic interactivity features.
Some visualizations do need a data flow concept
A lot can be done just by using the simplest approach with `D3` or `react`, so why go further?
A reminder is that Web standards are often quite limited (browser version limitations, IE feature lagging, no Canvas/WebGL animation
support via CSS, more complex dataviz, see above).
One reason is declarative, denotational semantics, letting users specify what the visualizations and interactions should result in, rather than how the desired effects are achieved (an operational notion, an implementation detail).
Some of the larger, more complex, ambitious data visualization libraries such as Plotly and Vega/Vega-Lite strive to be declarative, letting users tell what the dataviz should be - and this principle has merit even as an implementation concept. Current research is going into making not only visual output but also interactions declarative, which makes sense given how integral interactivity has become to data visualizations.
When a visualization gets complex, working with data flow declaratively helps developer understanding and system overview. Even a most basic view, a single line or area plot has a lot of calculations which are best described as relations in a directed graph (annotations added to a vanilla Apple Numbers template):
For another simple example, consider
- a newly arriving price point may be higher than currently shown point
- this updates the minimum-maximum range (provided the user doesn’t expressly constrain it)
- the new Y domain may be further increased for round tick values or padding
- the vertical scale needs to be updated
- the points need to move according to the new projection
- axis ticks must be rerendered
Relationships get much more complex if there are lots of lines, projections and transitions. For example, an exploratory tool may allow the replacement of one axis with another, or even the transition from one plot type (e.g. scatterplot) to another (e.g. beehive plot). Then there may be animations, filter, pan, zoom, small multiple or trellised views, multipanel views and dashboards with diverse sets of visuals on them. Being declarative in the implementation means that new time-varying or reactive behaviors may be easier to compose from existing ones, with easier reuse (pure functions), and testing is easier as mocks aren’t needed.
Another reason is efficiency, an operational concern which is important for fluidity thus good user experience. An idealized computer would be able to calculate with infinite speed, no impact on the battery life, and we’d have a way of just recalculating everything from direct inputs and the user’s interaction history. Actually, this is a bit like the model for the most basic react
or D3
use, as well as a main concept of elm
and redux
time travel, and this works fine for a lot of use cases (we’ll consider it a data flow model and come back to its pros and cons later).
But computers are not infinitely fast, so there is a host of reasons for why it’s not sufficient in general:
- rerendering artifacts: there may be flashing, avoidable relayout jumps or other artifacts on a full rerender
- browser freeze: the rendering time makes it impossible to achieve fluid updates, i.e. transitions and animations wouldn’t work, and the browser would become non-responsive all the time
- growing history replay: as the user session becomes longer, there would be more data and user inputs to process (reminder, we discuss the fully functional approach which recalculates everything from primary input), so calculations would take longer over time (yes, history consolidation is possible, but it’s already a scan operation on a data flow stream, see below)
- sizeable datasets: a visualization that can run fluidly for a small amount of data may be brought to a browser freeze by more data
- expensive model and view model calculations: even the most trivial aggregate statistics are often faster to calculate incrementally than to recalculate from scratch - for example, adding a new data point may trivially update the data min/max domain for an axis calculation in O(1) constant time, while a naive recalculation is O(n); there are dozens of metrics in even the simplest dataviz that take significant time to calculate for sizeable datasets
- user expectations: a slowish initial calculation, e.g. a few hundred milliseconds may be OK for an initial rendering (or even more with incremental rendering), but every tooltip hover or other non-pertinent interaction shouldn’t cause such a delay especially if it’s blocking the browser
In short, a basic reason for thinking about the data flow is that we want fluid user experience in a world of asynchronous actions, limited CPU and battery power. Janky interactions or avoidance of fluid interactions altogether underutilizes the computer medium and is a competitive disadvantage.
A simple example (follow link for writeup) of granular, incremental recalculations reflecting ongoing configuration on a live, real-time updated view, e.g. changing bandline quantiles for outlier-vs-not shading:

We also expect that morphing from one visual representation (projections, channels, aesthetics) to another will become more common, for dashboard building via direct manipulation as well as exploratory analysis; an early Plotly concept morphs from a parcoords panel to a scatterplot, preserving filtering:
Couldn’t we solve the problem without some data flow concept? (informal data flow)
We’ll categorize such solutions as data flow concepts 😃 But here they go anyway:
- Function application memoization (caching). Functions that have a big impact in the profiler get cached, so the next time around, the call is a simple lookup. The benefit is referential transparency and therefore easy testability: the workings are fully testable by supplying some input and making assertions about the output, and results don’t depend on some state. Basic functional programming, with or without caching, is a kind of data flow concept, as data values are transformed by a directed acyclic graph (DAG) of data transformation functions. The main problem is a high risk of memory leaks, especially as current JS is hostile to solving them (no weak references; no explicit GC trigger; no object finalization; no tail-call optimization; ES5 only supports string-based maps/hashtables and ES2015 Maps are still slower, etc.)
- Incremental update. Some `state` object gets incrementally updated on each new piece of input. For example, `newState.min = Math.min(previousState.min, input.newPrice)`. It’s the `redux` model. It’s great for single-layer, relatively simple actions, but isn’t that suitable for the type of deeply cascading changes that characterizes data visualization.
- Lean on D3 data binding. The data binding, especially with keyed `selection.data()` functions and carefully tailored `enter` vs `update` discrimination, is a powerful way for `SVG` visualizations. For example, it’s possible to enhance an initially raw dataset with expensive aggregate statistics, and run a recalculation only if needed (e.g. a new point is added), which requires that the `key` function incorporate the data array length or some surrogate (hash etc.). Limitations: large DOM trees may be slow; more convoluted, rigid, less component-oriented design; data needs to be naturally hierarchical or otherwise crosslinks are needed; easily introduced bugs when a recalculation isn’t done though it should be, or the other way around. Canvas support is doable but somewhat convoluted.
- Lean on `react` lifecycle methods. The lifecycle methods make it possible to compute things just once. But model calculations are an anti-pattern in `react`; even the presence of lifecycle methods removes quite a bit from the `react` philosophy; and the issues mentioned for `D3` above also apply.
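The memoization approach in the first bullet can be sketched in plain JS; `memoize` and `expensiveExtent` are illustrative names, and the `Map`-based cache deliberately shows the leak hazard mentioned above (entries are never evicted):

```javascript
// Function application memoization: cache results keyed by input.
// A plain Map never releases its keys - the memory-leak hazard the
// text mentions (pre-WeakRef JS offers no good fix for this).
function memoize(fn) {
  const cache = new Map();
  return function (key) {
    if (!cache.has(key)) cache.set(key, fn(key));
    return cache.get(key);
  };
}

let calls = 0;
const expensiveExtent = memoize(function (data) {
  calls++; // count actual computations
  return [Math.min.apply(null, data), Math.max.apply(null, data)];
});

const data = [3, 1, 4, 1, 5];
expensiveExtent(data); // computes: calls === 1
expensiveExtent(data); // cache hit: calls still 1
```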
Again, these approaches work, and can be very compact and natural to use, but they don’t scale well to complex visualizations. Now on to some alternatives that are often used for larger projects:
- Manual update processing. It’s often that a tool starts its life with a set of expectations such as single-pass rendering and then must assume more and more dynamic functionality. Initially, there are some objects onto which the results of expensive or repeated calculations are hung. Then, upon adding update functionality, there are methods that take new input and some of the previous state, and update various object properties. Usually, there are some means for change propagation, e.g. via the observer pattern or the pub/sub pattern (discussed separately). The drawback is that it’s, in essence, a manual, informally specified way of doing caching, which is prone not only to overeager recalculations or, worse, missed ones (stale props), but also makes things hard to test, because state is a leaky abstraction, and once state can be altered by other units and methods, there’s a combinatorial explosion of what might go wrong. Adding a new feature or refactoring requires knowledge of much implementation detail; missing these may result in broken things even if test suites pass. Refactoring can also incur friction if the test suite boundaries relate to such state (implementation detail) rather than effect (e.g. resulting DOM contents or, better, visible output). The appeal of manual update processing is that the code looks traceable: there are no magic mechanisms that need to be learned, just plain JavaScript everywhere. It’s easy to debug at a micro level, by putting in a `console.log` or a `debugger` statement. In contrast, sophisticated approaches require a good amount of learning and debugging practice (non-trivial costs).
- Model-View-Controller or derivative patterns (MVVM etc.). Though it sounds more authoritative than manual update processing, most problems are shared, even if some disciplines allude to a formal approach. Also, in data visualization, separating `model` from `view` or `view` from `controller` is not trivial. In the case of `MVVM`, the separation of `model` and `viewModel` is also a bit arbitrary. MV* also typically uses some data binding pattern, e.g. observer or pub/sub. There’s also competition among the MV* zoo, and the definitions aren’t clear enough to even firmly know which is which.
- Pub/sub and observer patterns. A lot has been written about the disillusionment caused by these patterns. Most of it centers around the fact that while components become appealingly decoupled at the source code level, they turn out to be semantically coupled in all sorts of ways.
The above three approaches have the common problem that they can lead to overreliance on tribal knowledge. There are no hard and fast rules or protocols about these approaches; they’re grown organically (manual update processing) or are vague guidelines that leave the details up to debate and an endless stream of ‘best practices’ books. Often, the data flow code (if this separate aspect is kept as separate code) is developed in-house, and lacks proper documentation.
> I think the lack of reusability comes in object-oriented languages, not in functional languages. Because the problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.
>
> - Joe Armstrong
- Using a comprehensive framework such as Angular. With Angular 1, there is two-way data binding, and with Angular 2, `RxJS` is incorporated (discussed separately, as it’s been an established library in its own right). Both Angular versions are rather large, opinionated frameworks with idiosyncrasies, and neither is quite efficient for dataviz. It’s unclear which of the two Angulars will be more popular. Similarly, as `react` is not a comprehensive framework, its complementary (independent) data flow tools are discussed in their own right.
Data flow tool categories
The below list includes a few specific libraries, not meant to imply that Plotly should follow or use any of these specifically.
A. Object-centered approaches
Usually, operations are done to objects via method calls, and methods achieve effects by altering various objects. It is hard to establish causal links: during debugging, one often can’t get to a root cause just by traversing up the call stack, since the failing calculation likely fails because some of its input object properties are wrong, but those properties were not set in a frame currently on the stack, rather in some unknown, different stack that preceded the current execution. In addition to familiarity with the API, a lot of implementation detail needs to be known to a contributor. Data is often exposed on objects, which commits the solution to particular representation structures, an operational rather than declarative concept. The flow of the data is implicit in the code and hard to form a mental image of.
- No formal approach: the code reflects a gradual evolution from an original code base that didn’t stress interactive features to a code base that’s expected to respond fluidly
- MVC pattern
- Plotly `relayout`/`restyle` - idempotent plot update
B. Special-purpose data flow tool: low-level, idempotent, data-driven renderers
Some view generator solutions have their built-in data propagation patterns, such as data binding, which are fairly powerful, yet not quite appropriate for complex functions such as a crossfilter. Also, these tools themselves don’t scale well to a moderate number of DOM elements for executions as frequent as the animation frame (60FPS).
- Leaning on `D3` data binding and frequent, on-event rerendering for dashboard-level data flow
- `react` component tree; often, lifecycle methods and stateful components
- `react` alternatives with smaller scope and minimal footprint (`inferno`, `preact`, `react-lite`…)
- `regl`, inspired by `react`, transforms specifications to efficiently generated and executed WebGL API calls (Plotly `parcoords` already uses `regl`.)
C. Special-purpose data flow tool: pipes
These tools usually facilitate one-off execution of a sequence of data transformations, sometimes including side-effecting processing steps or terminal nodes. Due to their one-off nature, they’re often built to handle explicit, e.g. command-line, execution, or individual input events, synchronously or via promises. The archetype is the Unix pipe. Usually, branching is beyond the scope or very limited, therefore it’s not as natural for handling diverse inputs that factor into various points in the series of transformations, or intermediary transformations that take data from and/or feed into multiple other transformations.
- Unix pipes
- magrittr in R (often with dplyr)
- Fluent-style method chaining in various libraries (d3, d3fc, RxJS)
- Promises
- Ramda.js compose / pipe
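The pipe archetype above can be sketched in a few lines of plain JS (an illustration of the pattern, not the Ramda or RxJS API):

```javascript
// A minimal pipe: left-to-right function composition, as in
// Ramda's pipe or magrittr's %>%. Strictly linear - no branching,
// which is the limitation noted above.
const pipe = (...fns) => input => fns.reduce((acc, fn) => fn(acc), input);

// three small transformation stages (illustrative names)
const parse = s => s.split(',').map(Number);
const positives = xs => xs.filter(x => x > 0);
const sum = xs => xs.reduce((a, b) => a + b, 0);

const total = pipe(parse, positives, sum)('3,-1,4,-1,5');
// total === 12
```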
D. Special-purpose data flow tool: crossfilters
Crossfilters usually aim to efficiently and scalably solve the problem of multidimensional selection of individual data points, for fast querying of the resulting sets or their aggregations. They typically process filter range changes, and even new data points, incrementally. Usually, processing is done with reducer functions, efficient if the incremental change is of limited frequency, but not as efficient when the changes are big enough to warrant a tight, cache-aware numerical processing loop. They often do not provide a mechanism for notification, whether related to their input (new data or interactions) or output (downstream changes to the changed itemized and aggregate queries), so a crossfilter, on its own, isn’t sufficient for crossfiltering; it needs to be embedded in a more general data propagation mechanism. Internally, crossfilters use interesting implementations for efficiently updating query sets, and are rather stateful so as to save computational costs when handling incremental changes with low latency.
- JS based
  1. crossfilter.js
  2. vega-crossfilter
  3. scijs/cwise based (idea: turn reducer functions into an efficient loop body)
- WebGL based
  1. Plotly vertex shader based mini-crossfilter as in the new Plotly `parcoords`
  2. regl-cwise based (idea: turn reducer functions into shader code and hierarchical aggregations)
E. General data flow tool categories
These can be thought of as a spreadsheet, in that the developer doesn’t state how a `sum` is calculated and updated: whenever some input changes, it propagates downstream in the directed acyclic graph that is the data flow structure. Proper FRP, a term coined by Conal Elliott, has a rigorous foundation, so we call these JS libraries FRP-inspired, as they center around operational concerns such as a data propagation graph, event emission, backpressure etc. While sound in principle, many of these libraries make it hard to debug userland code, because the stack is usually deep, verbose, nondescript, and even with blackboxing, it’s hard to see what initial change cascaded down to the current stack, and what transformations took place. MobX puts more emphasis on letting the coder understand cause and effect relationships in the debugger.
- Object magic based
  1. MobX
- FRP inspired libraries
  1. RxJS
  2. Bacon
  3. Kefir
  4. Flyd
  5. most.js
  6. xstream
- Real FRP libraries - motives and properties recapped here; libraries not listed, as none currently exists for JS
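The spreadsheet analogy can be made concrete with a toy push-based cell; this is only a sketch, far simpler than any of the libraries above (no glitch avoidance, no unsubscription, no backpressure):

```javascript
// Toy reactive cell: setting an input pushes the change through a
// DAG of derived cells, spreadsheet style.
function cell(value) {
  const subscribers = [];
  return {
    get: () => value,
    set(v) { value = v; subscribers.forEach(fn => fn()); },
    subscribe(fn) { subscribers.push(fn); }
  };
}

// a derived cell recomputes whenever any of its inputs changes
function derived(inputs, compute) {
  const out = cell(compute(...inputs.map(c => c.get())));
  inputs.forEach(c =>
    c.subscribe(() => out.set(compute(...inputs.map(i => i.get())))));
  return out;
}

const a = cell(2), b = cell(3);
const total = derived([a, b], (x, y) => x + y); // like "=A1+B1"
a.set(10);
// total.get() === 13 - nobody recomputed total "manually"
```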
F. Reducer based
Redux is a predictable state container, a reducer based library. It handles singular changes, called actions, elegantly and in a functionally pure way, responsible for the predictability part. Each action is mapped into a transform of a (current) state to a next state; the state object itself is modeled as a large, inert JSON-like object, whose hierarchical structure can represent inputs or derived data. Since redux handles direct actions and doesn’t in itself handle the rippling effects of such actions, it’s combined with change propagation means for deeper dependency graphs.
- redux only
- redux-saga
- redux-observable
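The reducer model can be sketched as a pure `(state, action) => nextState` function (a minimal sketch of the pattern, not the redux API itself):

```javascript
// Minimal reducer: each action maps (state, action) -> next state.
// Pure and predictable; the deeper rippling effects of a change are
// out of scope, as noted above.
const initial = { min: Infinity, max: -Infinity };

function reducer(state = initial, action) {
  switch (action.type) {
    case 'NEW_PRICE':
      return {
        min: Math.min(state.min, action.price),
        max: Math.max(state.max, action.price)
      };
    default:
      return state;
  }
}

let state = reducer(undefined, { type: '@@INIT' });
state = reducer(state, { type: 'NEW_PRICE', price: 42 });
state = reducer(state, { type: 'NEW_PRICE', price: 7 });
// state is { min: 7, max: 42 }
```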
G. View and logic together
These tools bind some data propagation concept / tool with a view rendering mechanism such as DOM updates. They can be made to work on Canvas/WebGL, though in this case the benefit of being cycle-oriented is somewhat underutilized.
- Elm (transpiles to JS)
- Vue.js
- Cycle-like
  1. cycle.js
  2. motorcycle
  3. TSERS (few recent commits)
- dc.js
2. Crossfiltering
Crossfiltering is a major data visualization interaction type that lets the user slice and subset their data, most often by highlighting a range on an axis or an area on a plot. An archetypal implementation (for me, having used it first) is Bostock’s crossfilter.js
published in 2012.
Interactivity in data visualization is only limited by creativity and practicality. Yet, there are archetypal interactions that can be easily identified in literature and implementations alike, such as
- zooming and panning,
- hovering over an element for details,
- selecting a point or a subset of the data
The latter is often called crossfiltering on a multi-plot view, when the purpose of selecting elements or a range of elements is not primarily to get detailed, itemized info on them, but to control what is shown on the other subplots, conveying to users the notion that they interact with a single dataset, filterable in any of the interactive subplots, all of which provide a particular view into that single dataset.
Crossfiltering is an important solution for what we can term the big problem of data visualization: the focusing problem. Crossfiltering lets the user start exploratory analysis by viewing the visualization based on the entirety of the data, or a pertinent set (e.g. last 30 days), but then focus on subsets of data, guided by their goals and by patterns in already rendered subsets. It is also usable in explanatory analytics such as interactive journalism or education: the reader or student may gain useful extra information using the same set of views, altering just the set of data in scope, e.g. selecting their city of residence or highlighting an interesting range of distance.
Common crossfiltering facilities - overview
Interactions:
- Select
  - a specific data instance, e.g. by clicking on the corresponding point on a scatterplot that shows all elements
  - a specific characteristic of the data, such as City in a climate data set
  - a subset of the data by individually selecting multiple items; unselect previously selected items
  - a subset of the data by a gesture such as
    - highlighting a range on an axis
    - highlighting a rectangular area on a Cartesian plot
    - drawing a straight section or freeform curve that intersects a subset of lines or other glyphs
  - a subset by selecting on aggregate views (usually assuming that one data point goes into one aggregate on that view)
- Select / unselect multiple sets
- Combine selection sets in various ways, e.g. via Boolean algebra
- Reset selection
  - on one specific subplot or table (for one or more dimensions) to the initial state
  - on all filters, to the initial state
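The set-combination item can be illustrated by composing filter predicates with Boolean algebra (a plain JS sketch; all names are illustrative):

```javascript
// Each brush contributes a predicate; the active selection is their
// conjunction (AND), with union and negation equally easy to express.
const and = (...ps) => d => ps.every(p => p(d));
const or = (...ps) => d => ps.some(p => p(d));
const not = p => d => !p(d);

// two hypothetical brushes on different dimensions
const inYearRange = d => d.year >= 2000 && d.year <= 2010;
const inPriceBrush = d => d.price < 50;

const data = [
  { year: 2005, price: 30 },
  { year: 2005, price: 80 },
  { year: 1990, price: 30 }
];
const selected = data.filter(and(inYearRange, inPriceBrush));
// selected contains only { year: 2005, price: 30 }
```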
Responses:
- All other subviews show increased salience for the selected items or decreased salience for others
- Aggregates get updated (e.g. brush rows here to see aggregates updated)
- All these updates may be tweened so that *object constancy* is preserved where possible
- Automatic zooming and/or panning may take place in linked views, if there is autofitting and data bounds change (preferably tweened as well, though maybe differently to this example)
- Selection reset interactions show up, or a timeout to reset starts
- Tabular views, showing all or sample rows, get updated (these are no different in concept from plots; tables may enable filtering too)
- The user may or may not be able to persist the current selection state, or perhaps multiple selection states for later retrieval
- Usage statistics might be gathered and relayed to understand usage patterns in UX technical or domain-specific terms
Crossfilter implementations
To inform crossfilter API design, it’s useful to touch on current, actually available crossfiltering methods. Features are listed so that the common, and perhaps some rare, functions are input to API design. Similarly, current limitations - subject to becoming obsolete - are mentioned not as criticism, but simply to gauge the extent to which each API has needed to cope with planned use cases.
Crossfilter.js
Crossfilter is an in-memory, incremental mapReduce implementation in JS created by Mike Bostock who also authored D3
.
- Have a bag of opaque objects
- you can add them in bulk
- you can add further ones later
- you can’t remove them once added - the solution for removal, if needed, is to equate objects with transactions (e.g. instead of adding bank movements, add bank transactions, where a subsequent transaction can invalidate an earlier transaction)
- Have some dimensions (attributes or virtual fields on the objects)
- Have some aggregations determined by a dimension and add/remove reducers
- Can filter on arbitrary dimension
- Can get:
- group aggregates in line with current filters
- groupAll
- group element counts
- group top/bottom K elements
- group constituent elements are basically top or bottom infinity
Possible gotcha: “a grouping intersects the crossfilter’s current filters, except for the associated dimension’s filter. Thus, group methods consider only records that satisfy every filter except this dimension’s filter. So, if the crossfilter of payments is filtered by type and total, then group by total only observes the filter by type”
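The gotcha can be restated in plain JS; this sketches only the semantics and does not use crossfilter.js itself:

```javascript
// crossfilter semantics sketch: a group on dimension X applies every
// active filter EXCEPT the filter on X itself, so brushing X still
// shows the unfiltered distribution along X.
const payments = [
  { type: 'cash', total: 10 },
  { type: 'cash', total: 90 },
  { type: 'visa', total: 90 }
];
const filters = {
  type: d => d.type === 'cash',
  total: d => d.total < 50
};

function groupCount(ownDimension) {
  return payments.filter(d =>
    Object.keys(filters)
      .filter(name => name !== ownDimension) // skip own dimension
      .every(name => filters[name](d))
  ).length;
}

groupCount('total'); // 2 - both cash payments; the total filter is ignored
groupCount('type');  // 1 - only records with total < 50
```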
Key features
- Very small (10kB uncompressed, 4.4kB compressed)
- Very mature and stable
- Fast for large datasets, e.g. 100k elements, if reducers are fast (though obviously not as fast as array looping)
- Does one thing and does it well
- Small API surface
Limitation of scope
These are inherent either in the focused scope of this component (do one thing well), or in the JS language and runtimes (no weak maps, no object finalization etc.) so they’re just observations rather than criticism.
- Needs resource management (discarding groups, dimensions); somewhat error prone and memory leak prone
- Doesn’t do loop fusing (it uses reducers naively)
- Limited dimensions; prone to overrunning with dimensions (again needs management)
- No notifications or triggering subsequent recalculations
- No bijective relation handling (eg. select bar in histogram, arrive at constituent points)
- No concept of foreign keys, star schema etc. - all is implemented manually
Vega crossfiltering
Vega is an interesting, long-running project run by the Interactive Data Lab; its approaches demonstrate important research, and there’s a level of rigorousness and compactness about the concepts. Vega implements a visualization grammar (see also Wilkinson’s Grammar of Graphics, ggplot2), a declarative format for creating interactive visualizations.
Vega is based on reactive data flow, and has enabled the creation of crossfiltering, although not in a particularly declarative way. The award winning research paper describes the addition of declarative graphics interactions.
Vega
Example: https://vega.github.io/vega-editor/?mode=vega&spec=crossfilter
Depends on `vega-dataflow` and `vega-crossfilter`.
Vega is a reactive library of broad, general data visualization scope. It implements its own reactive data flow rather than depending on another library. 342kB uncompressed.
While Vega supports crossfiltering, in that reactive streams implementing a crossfilter mechanism can be established, the setup is somewhat intricate, and isn’t a concise, high-level, declarative API.
Vega-dataflow
Dependency of `vega-crossfilter` and `vega`. Streams scalar and composite data.
https://github.com/vega/vega-dataflow
Relatively large, bundle is 88kB uncompressed.
Vega-crossfilter
https://github.com/vega/vega-crossfilter/blob/master/test/crossfilter-test.js
Uses `vega-dataflow`, but doesn’t use Bostock’s crossfilter.js.
Dependency of `vega`.
Vega-lite
Vega-Lite is a compact, higher-level visualization grammar; the vega-lite library is a translation layer that compiles it down to the more powerful, more verbose Vega visualization grammar.
As of January 16 2017 there’s no crossfilter or declarative (or any) interactions; declarative interactions are currently in feature branches and slated to arrive soon. If I’m not mistaken, even with declarative interactivity in Vega-lite, it won’t be as simple as identifying dimensions and subplots for a crossfiltering relationship. But at the expense of more verbosity, there’ll be more flexibility as well, permitting custom and hybrid interactions.
devDepends on `vega`.
Crosstalk (htmlwidgets)
Crosstalk is a protocol for linked brushing across multiple, possibly heterogeneous htmlwidgets. It uses shared state (SharedData) among various htmlwidgets. An htmlwidget can be made compatible with crosstalk by following a well-documented protocol.
Limitations (as of writing; evergreen doc):
- it handles atomic data points only, i.e. doesn’t handle aggregates such as histograms
- it only handles brushing
- naturally, since htmlwidgets can be published without central vetting, only some of the htmlwidgets support crosstalk
Bokeh crossfilter
Bokeh has a crossfilter, also referred to as linked brushing, that redraws subplots upon completion of the selection; the rectangular or lassoed area doesn’t persist and therefore cannot be interactively moved. This is a possible way of bypassing stringent latency requirements, and a useful option to consider for an initial Plotly implementation.
Interestingly, the Bokeh examples seen have no explicit crossfilter specification beyond listing the interaction start buttons `box_select` and `auto_select`. According to the text, the only other criterion is that multiple plots use the same dataset (same identity). This has a lot of appeal by virtue of its simplicity, although Plotly, given its numerous connectors, serialized tree representation and granular data structures, probably can’t follow this model. Still, it shows that the API search space should include very terse or implied linking. Short of relying on dataset identity, the closest option would be simply to add a `filtergroup` attribute to all plots (see below).
Upshot
- the crossfilter.js API is compact and regular
- it is tightly focused, and isn’t meant for high level declarative specification that a Plotly user would feel at home with
- but its internals are sound for fast indexing (though array loops, cwise and WebGL based filtering are possible alternatives)
- performance is dependent on use case: the ideal use case has very cheap dimension functions and reducers, and even then, a straight optimized loop for selection and reduction might be faster, not to mention a WebGL alternative
- library size is not insignificant but not large either (4.4kB compressed)
- the main difficulty is resource management, but this could be wrapped in some MRU cache-like dimension eviction mechanism
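The eviction idea in the last point might be sketched as a small cache wrapper. This is a sketch under assumptions: `createDim` and `dim.dispose()` are hypothetical stand-ins for `crossfilter.dimension()` and `dimension.dispose()`, and the eviction policy shown is least-recently-used:

```javascript
// Wrap dimension creation in an eviction cache, so crossfilter-style
// dimensions (which must be disposed explicitly) don't leak.
class DimensionCache {
  constructor(createDim, capacity) {
    this.createDim = createDim;
    this.capacity = capacity;
    this.cache = new Map(); // Map preserves insertion order: oldest first
  }
  get(name) {
    if (this.cache.has(name)) {
      const dim = this.cache.get(name);
      this.cache.delete(name);   // re-insert to mark as most recently used
      this.cache.set(name, dim);
      return dim;
    }
    if (this.cache.size >= this.capacity) {
      const [oldestName, oldestDim] = this.cache.entries().next().value;
      oldestDim.dispose();       // the explicit resource management step
      this.cache.delete(oldestName);
    }
    const dim = this.createDim(name);
    this.cache.set(name, dim);
    return dim;
  }
}
```

With capacity 2, requesting dimensions `a`, `b`, `a`, `c` evicts and disposes `b` (the least recently used), while `a` survives.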
Crossfilter API design thoughts for Plotly
Based on the above landscape and some motives below, as well as strong, preexisting Plotly API conventions that have been found useful by a wide base of users, we can start assembling thoughts on possible crossfilter API elements for Plotly.
For simplicity, the term Plotly means Plotly.js here; all the language API bindings and the Plotly Workspace would likely expose the crossfilter specifications to their respective users.
Existing interactive and related features in Plotly.js
Plotly already supports interactivity and data processing features that relate to crossfiltering:
- hovering over points or other glyphs (representing data points or aggregates) for tooltip information or raised event for custom callback
- zooming and panning
- rectangular area brush mechanism for selecting area to zoom to, 1D or 2D
- idempotent replotting (`restyle`, `relayout`)
- latency-optimized, scalable replotting (e.g. `pointcloud`)
- animations
- transforms, such as `groupBy` and `filter`
- GPU based mini crossfiltering in `parcoords`
- brush selection on the axes, also in `parcoords`
- subplots, shared Y domains or axes (one form of sharing a dimension)
- serialization of the plot description via JSON (naturally, the way a crossfilter and its state would be represented too)
- declarative plots: the JSON specifications tell what, rather than how to render
Currently limiting features in Plotly
- Coupling of data-related aspects with rendering-related aspects. The `data` block, in its name, is about data, and indeed contains column vectors such as `x` and `y`. But it also deals with plot (trace) specifications, for example whether a `scatter` or a `heatmap` is required, what aesthetics would be present (`markers`, `lines`) and with what channel styling.
- Grammar of Graphics, compositional semantics: while Plotly plots are declarative, the structures and output generation are established by conventions. Yet it’s possible to conceive and make novel plots, and it’s also true that the presence of a GoG wouldn’t guarantee a lot more flexibility, so for practical use, it’s not currently a big divide.
- Encoding logic or algebra in the JSON specification. Vega, in effect, lets the user define new functions, for example `filter` transforms that are code strings. Direct calculations are handy for specifying custom predicates for e.g. point inclusion or aggregation.
- A data propagation abstraction. Currently, data propagation is handled implicitly, via the convention of one central `plot` function, and invoking chart-specific mutations such as `attributes`, `defaults`, `calcs`, `render`. Some of the main functions have hundreds of code lines. The result is that updates are coarse-grained and much redundant recalculation occurs.
- Latency. Especially with higher point counts, the coarse-grained updates above limit the fluidity of interactions. An alternative to fluid updates is updating the coordinated views only upon completion of the selection, as in this Bokeh lasso selection example.
Understanding prior art
- crossfilter.js performance characteristics should be retested on large sets, to check if e.g. straight loops are much slower - CPU architectures and especially JavaScript engines have improved a lot since 2012
- consider basic features from `crossfilter.js` such as the possibility for enumerated or range based filtering, and generation of aggregates
- `vega` representations aren’t crossfilter-specific, but are at a finer granularity of interactions (Vega-Lite) and streams (Vega); yet it makes a lot of sense to learn from how Vega-Lite represents interactions in a compact yet versatile manner
- `crosstalk` is a great example that can be very simply used to establish crossfiltering links, although currently not dealing with the complexities of aggregates
Desired functional features in the Plotly crossfilter
Flexible data subsetting in crossfiltering
Specification for
- which of the rendered plots are included in the crossfilter,
- which of their dimensions (axes) are included and shared - although this can be implied, e.g. include all, and rely on identical axis labels for sharing (it’s not possible to rely on axis keys, because they need to be disambiguated for paper placement - `domain`),
- possibly multiple crossfilters on the same overall dashboard - again, not excluding the possibility of a `filtergroup` attribute per plot.
Diverse selection sets and filtering algebra
For compact, common representation, both enumerated values and contiguous ranges are ideally supported. We may consider
- operations such as union, intersection to join multiple enumerated values and/or ranges
- predicates
An initial implementation is already useful with one simple, single range based filter per dimension, as done for `parcoords`.
Aggregations
Some crossfilters, e.g. R’s `crosstalk`, may only (currently) support crossfiltering over atomic data. This is already useful, since it can yield linked brushing. Going beyond this, most crossfilters support the inclusion of groups or aggregates. Selecting a subset of the scatter points may lead to updated histograms, similar to this dc.js example.
In addition to updating aggregates, it is desirable if projection ranges (brushed areas) or glyphs corresponding to aggregates, such as histogram bars or choropleth maps, are themselves subject to selection. For example, highlighting a range of bars on a histogram would highlight the source scatter points, and other aggregates would be updated based on this highlighted set of scatter points (link below does this too, relying on crossfilter.js in dc.js).
This reverse direction requires an explicit bijective relationship between an aggregate plot and the source data; otherwise the corresponding atomic data points can’t be identified. I think Plotly doesn’t yet handle this aspect, but again, aggregates, and especially the selection of aggregates, need not be part of an initial step. Plotly currently handles a few discrete types of aggregation, such as binning for histograms, so adding inverse mapping doesn’t seem burdensome. More challenging is that users do, or are led to, preaggregate data themselves to make their own aggregations, in effect using Plotly as a dumb, static view with the data processing steps residing outside Plotly - in this case, establishing links is impossible unless we invent some heavyweight annotation for bijective mapping. Consequently, the Plotly API would need to move further into data handling territory, with datasets, dimensions and aggregation keys as first-class JSON structures; individual plots or traces would then refer to these datasets as their data, and to dimensions in their axes, as opposed to the current practice of supplying data directly to the traces.
Many dashboards in the wild display solely aggregates (no items in sight). It’s good to consider an API with at least eventual aggregation support in mind.
Some other dashboards such as an implementation of Stephen Few’s student dashboard in d3 feature itemized data selection, updating aggregates, where each item itself is composite, e.g. a student that’s a foreign key in a per-student attendance time series table:
If sorting is present (analogous to using `Plotly.restyle` with a different order for ordinal ticks), the previously contiguous selection range becomes fragmented (or conversely, we may use an ordering-then-brushing facility to avoid complications with multiple set selections), yet the aggregation itself doesn’t change:
Familiarity
Lots of good work went into crossfilters in JS and other languages via the above mentioned libraries and lots of libraries not mentioned here. To make things easy for users, our design should recognize established, learnt patterns. Since the concepts are fairly transferable across tools, yet the actual behaviors, limitations, method and granularity of specification is diverse, it’s best to follow the concepts and do it in a way that’s coherent with Plotly patterns, on principle of least surprise to the users.
Time series data
It’s often the case that crossfiltering is combined with, or applied to, time series data. This poses additional demands, because of the data volumes and especially the DOM impact involved. Headroom in smooth rerendering performance may be achieved by hybrid charts where the single or few performance-critical layers are rendered with WebGL, e.g. via regl. There are additional use cases with time series data:
- as the user zooms or pans, interactively focusing on parts of the time series, the currently visible temporal extent needs to act as a filter range on the other subplots
- even if no time series is rendered, there may be a need to move across time (i.e. continuously applying a range filter on a perhaps unrendered temporal dimension), analogously with animations.
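The first case above, the visible temporal extent acting as a filter, could be sketched as follows. The event shape mirrors Plotly's relayout payload keys (`'xaxis.range[0]'` / `'xaxis.range[1]'`), but treat that, and the function name, as assumptions of this sketch:

```javascript
// Turn a zoom/pan event's visible x-extent into a range predicate that
// other coordinated views can apply to their temporal dimension.
function temporalFilterFromRelayout(eventData) {
  const from = eventData['xaxis.range[0]'];
  const to = eventData['xaxis.range[1]'];
  if (from === undefined || to === undefined) return null; // not a zoom/pan
  return t => t >= from && t <= to; // usable as a crossfilter predicate
}

const inView = temporalFilterFromRelayout({
  'xaxis.range[0]': 10,
  'xaxis.range[1]': 20
});
[5, 10, 15, 25].filter(inView); // → [10, 15]
```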
Animating filters
It’s useful for animations to also work with crossfiltering, enabling that a single dimension filter is declared for animation, yet the visual effects show in all the rendered plots that involve the filtered dimension.
Desired non-functional features
Serializability
- Serializability of crossfilter specs (dimensions, aggregations)
- Serializability of filter states (ranges, items)
Low latency
The lower the latency, the better; the ideal is 30-60 FPS. If it’s worse than around 10-15 FPS, the illusion of direct manipulation, which often underpins crossfiltering, is lost, and users need to wait for debounced, delayed recalculations, i.e. views go out of sync. Therefore it’s important to optimize data paths in some systematic manner, or settle for deferred view updates.
Low latency has many elements: efficient filtering code (e.g. crossfilter.js uses heavy bit fiddling; our parcoords crossfilter runs in the vertex shader); avoiding unnecessary recalculations, since changes may be very cheap to compute compared to an initial rendering; and touching the DOM sparingly, e.g. using d3’s `selection.data(fun, key)` to detect changes and relying on the DOM diffing of the General Update Pattern.
Reusing existing Plotly facilities
A lot of existing Plotly facilities may be reused for crossfiltering.
- some plots already implement box brushing and lasso brushing, for selection events and/or zooming
- pointer interactions, currently used for hover tooltips and events (can be reused for point selection with modifier key), as well as panning (may be reused for moving the box selector)
- code paths put in place for animations
- existing `transforms` such as `filter` and `groupBy` are good candidates for both filtering and grouping for aggregations, although the current implementations are not incremental and are of somewhat high latency
- preexisting Plotly aggregations such as binning for histograms, or for heatmaps, as well as contour line generation, may be reused and exposed as transforms available to users
- axis brushing, as implemented for `parcoords`
- the Plotly `parcoords` GPU based crossfilter might be shared for supporting large data quantities plotted with WebGL; it works as an N dimensional crossfilter even for rendering parcoords lines
A sample API for simple, atomic crossfiltering
Unlike Bokeh, Plotly can’t currently rely on a single, shared data structure to deduce a default crossfiltering behavior. Also, the current axis keys (keys of the JSON object) can’t serve to indicate dimensional unity, because of their preexisting separation for e.g. layouting in screen space (called `domain` in Plotly).
But there are ways to retain the current Plotly semantics and API while introducing `datasets` as first-class objects.
Establishing unity of data and dimensions can be done by modeling these as first class entities. It would yield a compact, scalable and high level representation.
What looks like this now, with repeated vectors for disparate plots or traces:
```
{
  "data": [
    {
      "filtergroup": "cf1",
      "x": [1, 3, 2],
      "y": [4, 5, 6],
      "type": "scatter"
    },
    {
      "filtergroup": "cf1",
      "x": [1, 3, 2],
      "y": [50, 60, 70],
      "xaxis": "x2",
      "yaxis": "y2",
      "type": "scatter"
    }
  ],
  "layout": {...}
}
```
may be, in order to preserve relations, represented as
```
{
  "datasets": {
    "iris": {
      "petalwidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
      "sepalwidth": [4, 5, 6],
      "petallength": [50, 60, 70],
      "species": ["setosa", "setosa", "versicolor"]
    }
  },
  "data": [
    {
      "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
      "x": "iris.petalwidth", // just referencing the actual data
      "y": "iris.sepalwidth",
      "mode": "markers",
      "xaxis": "x",
      "yaxis": "y"
    },
    {
      "filtergroup": "myCrossfilterGroup1",
      "x": "iris.petalwidth",
      "y": "iris.petallength",
      "mode": "markers",
      "xaxis": "x2",
      "yaxis": "y2"
    }
  ],
  "layout": {...}
}
```
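For illustration, resolving such `"dataset.column"` string references back to concrete vectors, turning the proposed spec into the current repeated-vector form, could be a small preprocessing step. A sketch under assumptions: only `x` and `y` carry references, and `resolveTraces` is an invented name:

```javascript
// Resolve "dataset.column" references in traces to the shared vectors
// declared under `datasets`, leaving all other trace attributes intact.
function resolveTraces(spec) {
  const lookup = ref => {
    const [ds, col] = ref.split('.');
    return spec.datasets[ds][col];
  };
  return spec.data.map(trace => ({
    ...trace,
    x: typeof trace.x === 'string' ? lookup(trace.x) : trace.x,
    y: typeof trace.y === 'string' ? lookup(trace.y) : trace.y
  }));
}

const resolved = resolveTraces({
  datasets: { iris: { petalwidth: [1, 3, 2], sepalwidth: [4, 5, 6] } },
  data: [{ filtergroup: 'cf1', x: 'iris.petalwidth', y: 'iris.sepalwidth' }]
});
// resolved[0].x is now [1, 3, 2], and both traces referencing the same
// column would share one underlying vector.
```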
In addition to retaining data relations, it would have other benefits:
- save big on serialized payload size if there are lots of data points and subplots or grids
- save on bandwidth, data parsing, processing time, main or GPU memory consumption
- retain meaningful information
- allow template plots, into which e.g. one dataset is input, yet multiple, different plots and traces arise
Tools surrounding Plotly.js, such as the Workspace, already have analogous facilities, so it can be considered a natural absorption of useful features into Plotly.js.
API possibilities for grouping
Groups, in general, can be many things: nodes in a normalized relational star schema model; or calculated on the fly, such as specific bins; or, in the simplest case, just another dimension (denormalized representation). There’s ample precedent for this last option in Plotly, such as the current `transforms/groupBy` specification, or the use of ordinal or nominal dimensions (e.g. overplotting points with semitransparent markers).
Therefore, groups might be specified, quite verbosely, as
```
{
  "datasets": {
    "iris": {
      "dimensions": {
        "petalWidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
        "sepalWidth": [4, 5, 6],
        "petalLength": [50, 60, 70],
        "species": ["setosa", "versicolor", "versicolor"]
      }
    },
    "mySpeciesAggregate": {
      "dimensions": {
        "avgPetalLength": {
          "sources": ["iris"],
          "transforms": {
            "groupBy": [{
              "key": "iris.species", // or alternatively, a vector in place
              "value": "petalLength",
              "aggregates": {
                "average": "mean" // assuming there's a Plotly-defined set of aggregations like in SQL
              }
            }]
          }
        }
      }
    }
  },
  "data": [
    {
      "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
      "x": "iris.petalWidth", // just referencing the actual data
      "y": "iris.sepalWidth",
      "mode": "markers",
      "xaxis": "x",
      "yaxis": "y"
    },
    {
      "filtergroup": "myCrossfilterGroup1",
      "x": "mySpeciesAggregate.species",
      "y": "mySpeciesAggregate.avgPetalLength",
      "mode": "markers",
      "xaxis": "x2",
      "yaxis": "y2"
    }
  ],
  "layout": {...}
}
```
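As a plain-JS sketch of what such a declared groupBy/mean transform would compute (assuming `mean` comes from a Plotly-defined aggregation set; `groupByMean` is an invented helper name):

```javascript
// Group `values` by the parallel `keys` vector and average each group,
// i.e. the declared { groupBy, aggregates: { average: "mean" } } transform.
function groupByMean(keys, values) {
  const sums = {}, counts = {};
  keys.forEach((k, i) => {
    sums[k] = (sums[k] || 0) + values[i];
    counts[k] = (counts[k] || 0) + 1;
  });
  return Object.fromEntries(
    Object.keys(sums).map(k => [k, sums[k] / counts[k]])
  );
}

groupByMean(
  ['setosa', 'versicolor', 'versicolor'], // iris.species
  [50, 60, 70]                            // iris.petalLength
); // → { setosa: 50, versicolor: 65 }
```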
Recognizing that this is a lot of text for mundane aggregations, supposedly coming from a list of Plotly-implemented aggregation functions, it should either be made much briefer, or the reward should be a lot more power (for example, some plugin mechanism for custom, programmed aggregator and filter components), even if the API doesn’t go as far as Vega, which encourages infix and functional algebraic expressions represented as strings.
An example for the former option - much briefer notation - could be a simple reference to the aggregation in the view:
```
{
  "datasets": {
    "iris": {
      "petalwidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
      "sepalwidth": [4, 5, 6],
      "petallength": [50, 60, 70],
      "species": ["setosa", "setosa", "versicolor"]
    }
  },
  "data": [
    {
      "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
      "x": "iris.petalwidth", // just referencing the actual data
      "y": "iris.sepalwidth",
      "mode": "markers",
      "xaxis": "x",
      "yaxis": "y"
    },
    {
      "filtergroup": "myCrossfilterGroup1",
      "x": "iris.species",
      "y": "iris.petallength",
      "aggregation": "mean",
      "mode": "markers",
      "xaxis": "x2",
      "yaxis": "y2"
    }
  ],
  "layout": {...}
}
```
API for filter state; other API elements
Compared to representing data relations such as shared data and aggregations, the problem of representing and serializing filter states is quite trivial; it falls into place once these larger problems are resolved. The `crossfilter.js` API doc contains sensible options, such as using `[from, to]` filter domains, or `[elem1, elem2, ...]` enumerations for specifying filter state. Inspired by this, Plotly may add

`filtersets: [0, 2, 3, [7, 11], 15, [17, 20]]`

though some questions remain, such as whether the ranges, denoted with arrays, are right-open or right-closed.
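A sketch of how such a mixed enumeration/range filterset could be interpreted as a predicate. Closed ranges are assumed here, since the open/closed question is unresolved, and `filtersetToPredicate` is an invented name:

```javascript
// Interpret the proposed filterset notation: scalars are enumerated
// values, two-element arrays are (assumed closed) ranges. A value
// passes if it matches any item, i.e. the filterset is a union.
function filtersetToPredicate(filterset) {
  return v => filterset.some(item =>
    Array.isArray(item)
      ? v >= item[0] && v <= item[1] // closed range, by assumption
      : v === item                   // enumerated value
  );
}

const inSet = filtersetToPredicate([0, 2, 3, [7, 11], 15, [17, 20]]);
inSet(9);  // → true  (inside [7, 11])
inSet(12); // → false (between ranges, not enumerated)
```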
An alternative is to use relations similar to the current `filter` transforms, building up the filtered set more verbosely but perhaps giving more flexibility:
```
transforms: [
  {
    type: 'filter',
    operation: '>',
    value: 0
  },
  {
    type: 'filter',
    operation: '<',
    value: 100
  }
]
```
though there’ll need to be more algebra such as specifying unions and intersections.
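A sketch of what that extra algebra might look like: compiling filter transforms into predicates, with a hypothetical `combine` node (not part of the current transform spec) expressing unions and intersections:

```javascript
// Compile filter-transform-style nodes into predicates. Leaf nodes follow
// the { type: 'filter', operation, value } shape from the example above;
// composite nodes ({ combine, filters }) are an assumed extension.
const ops = {
  '>': (v, x) => v > x,
  '<': (v, x) => v < x,
  '=': (v, x) => v === x
};

function compile(node) {
  if (node.type === 'filter') {
    return v => ops[node.operation](v, node.value);
  }
  const parts = node.filters.map(compile);
  return node.combine === 'union'
    ? v => parts.some(p => p(v))    // OR
    : v => parts.every(p => p(v));  // AND (intersection is the default)
}

// The two transforms from the example, intersected: (> 0 AND < 100).
const between = compile({
  combine: 'intersection',
  filters: [
    { type: 'filter', operation: '>', value: 0 },
    { type: 'filter', operation: '<', value: 100 }
  ]
});
between(50);  // → true
between(200); // → false
```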
Draft conclusions
Adding crossfilter to Plotly is sought after, given its current level of interactivity, user expectations around data exploration, existing Plotly facilities, support for heterogeneous subplots/dashboards, and upcoming plot types that need crossfiltering, such as parcoords, small multiple charts, SPLOM and trellised plots.
A Plotly crossfilter would benefit from (depend on) concurrently introduced new concepts, such as
- a dataset concept, to retain relationship among disparate views to the same data, currently represented redundantly and with loss of association - this seems hard to avoid or work around
- declarative, bijective aggregations, even if they come from a list of predefined Plotly aggregations based on the most common usage (mean, median, IQR, count, domain, variance, bins, …) - provided aggregations are supported
- a data flow concept, for clarity, and to minimize unnecessary recalculations and rerendering - a useful first version may not need it, but a state of the art version likely would
As this list contains elements on which current libraries have iterated for years - such as LINQ.js for specifying aggregates and other derived queries, not to mention host language features and common libraries in R, Python etc., such as the very compact `dplyr` API - a question is where the boundaries should be drawn: whether the API of an existing tool should be adopted, or whether it’s possible to postpone the introduction of such concepts altogether.
Also, the listed changes may require some refactoring and API change (or addition) such as
- expose plot state via an API call rather than view-bound data such as `graphDiv.data`, so that internal representations aren’t committed to, and the code is free to delay data exports until needed
- separation of `data` and traces/plots in the JSON specification
- reduction of internal state manipulation in favor of pure functions that transform data predictably from given input to output
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 28
- Comments: 25 (16 by maintainers)
Top GitHub Comments
Such a great read!
I’ve created a small Dash app that implements Bostock’s crossfilter.js example (http://square.github.io/crossfilter) here: https://gist.github.com/nite/aff146e2b161c19f6d553dc0a4ce3622 - not quite the same level of realtime and slick UI/UX as the original, but good enough for a PoC. Currently hosted at https://crossfilter-dash.herokuapp.com; otherwise create a venv, `pip install -r requirements.txt` and run `app.py`.