Optimize data in specs?
Currently, when you use Altair's API to create layered or concatenated charts, it is quite common to end up with multiple copies of the data within the specification. For example:
import altair as alt
import pandas as pd
data = pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 0]})
base = alt.Chart(data).encode(x='x', y='y')
chart = base.mark_point() + base.mark_line()
print(chart.to_dict())
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
'config': {'view': {'height': 300, 'width': 400}},
'layer': [{'data': {'values': [{'x': 0, 'y': 0},
{'x': 1, 'y': 1},
{'x': 2, 'y': 0}]},
'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'point'},
{'data': {'values': [{'x': 0, 'y': 0},
{'x': 1, 'y': 1},
{'x': 2, 'y': 0}]},
'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'line'}]}
For such a small dataset it’s not an issue, but obviously as the size of the data grows the cost of the duplication could become significant.
We should be able to detect at the Python level when we are creating a compound chart in which each subchart has identical data, and then move the duplicate dataset to the top level, resulting in a spec that looks like this:
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
'config': {'view': {'height': 300, 'width': 400}},
'data': {'values': [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}]},
'layer': [{'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'point'},
{'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'line'}]}
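The hoisting step described above can be sketched as a plain function over the spec dictionary. This is an illustrative sketch only; hoist_common_data is a hypothetical name, not part of Altair:

```python
def hoist_common_data(spec):
    """Move identical per-layer 'data' entries to the top level of a spec.

    Illustrative sketch operating on a plain Vega-Lite spec dict;
    returns a new dict and leaves the input unmodified.
    """
    layers = spec.get('layer', [])
    datasets = [sub.get('data') for sub in layers]
    # Hoist only if every subchart has data and all copies are identical.
    if datasets and all(d is not None for d in datasets) \
            and all(d == datasets[0] for d in datasets):
        spec = dict(spec)  # shallow copy; avoid mutating the input
        spec['data'] = datasets[0]
        spec['layer'] = [{k: v for k, v in sub.items() if k != 'data'}
                         for sub in layers]
    return spec
```

Applied to the layered spec above, this produces the deduplicated form with a single top-level 'data' entry.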
Is this something we want to do by default?
@craigcitro suggested that this could be done by checking for object identity at the Python level (i.e. chart1.data is chart2.data), and that there could be a module-wide flag to turn this behavior off if it proves to cause problems in some edge case.
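A minimal sketch of the identity-based check with a module-wide opt-out flag, as suggested. All names here (DATA_OPTIMIZATION_ENABLED, charts_share_data) are hypothetical, not Altair API:

```python
# Hypothetical module-wide flag to disable the optimization; illustrative only.
DATA_OPTIMIZATION_ENABLED = True


def charts_share_data(*charts):
    """Return True if every chart references the very same data object.

    Uses 'is' (object identity) rather than '==', so two equal but
    distinct DataFrames are not treated as shared. This keeps the check
    O(n) and cheap, and avoids comparing large datasets element-wise.
    """
    if not DATA_OPTIMIZATION_ENABLED or not charts:
        return False
    first = charts[0].data
    return all(c.data is first for c in charts[1:])
```

One consequence of using identity rather than equality: two charts built from separately constructed but equal DataFrames would not be consolidated, which is exactly the conservative behavior the identity check is meant to provide.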
I like this idea. I would also propose that we do this kind of data-optimization check within the chart1 + chart2 and alt.layer(chart1, chart2) APIs, but not when directly defining the underlying object (i.e. alt.LayerChart([chart1, chart2])), and follow a similar pattern for hconcat/vconcat, etc.
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 9 (7 by maintainers)
Top GitHub Comments
I recently implemented a top-level datastore to reduce data redundancy. This may be helpful, as you don’t have to reason about the semantics of Vega-Lite. See https://github.com/vega/vega-lite/pull/3417
As of Altair 3, we consolidate data by hash so that multiple copies don’t appear. In Altair 3.1, we additionally add some logic to move datasets from subcharts to parent charts when possible (#1521). I think this issue has been suitably addressed.
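The hash-based consolidation can be illustrated with a sketch that stores each dataset once in a top-level 'datasets' mapping and has subcharts refer to it by name, which is how Vega-Lite's named data sources work. The function name and hashing details below are illustrative, not Altair's actual code:

```python
import hashlib
import json


def consolidate_datasets(spec):
    """Replace inline per-subchart data with named references.

    Sketch of hash-based consolidation: identical datasets hash to the
    same key and so collapse to a single entry in a top-level
    'datasets' mapping, which subcharts reference via {'name': key}.
    """
    datasets = {}

    def visit(node):
        if isinstance(node, dict):
            data = node.get('data')
            if isinstance(data, dict) and 'values' in data:
                # Hash the serialized values so identical data dedupes.
                key = hashlib.sha256(
                    json.dumps(data['values'], sort_keys=True).encode()
                ).hexdigest()[:12]
                datasets[key] = data['values']
                node = dict(node)
                node['data'] = {'name': key}
            return {k: (visit(v) if k != 'data' else v)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [visit(v) for v in node]
        return node

    out = visit(spec)
    if datasets:
        out['datasets'] = datasets
    return out
```

Unlike the identity-based check discussed earlier, hashing consolidates datasets that are equal in content even when they are distinct objects, at the cost of serializing the data once.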