Optimize data in specs?
Currently, when you use Altair's API to create layered or concatenated charts, it is quite common to end up with multiple copies of the data within the specification. For example:
import altair as alt
import pandas as pd
data = pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 0]})
base = alt.Chart(data).encode(x='x', y='y')
chart = base.mark_point() + base.mark_line()
print(chart.to_dict())
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
'config': {'view': {'height': 300, 'width': 400}},
'layer': [{'data': {'values': [{'x': 0, 'y': 0},
{'x': 1, 'y': 1},
{'x': 2, 'y': 0}]},
'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'point'},
{'data': {'values': [{'x': 0, 'y': 0},
{'x': 1, 'y': 1},
{'x': 2, 'y': 0}]},
'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'line'}]}
For such a small dataset it’s not an issue, but obviously as the size of the data grows the cost of the duplication could become significant.
We should be able to detect at the Python level when we are creating a compound chart in which each subchart has identical data, and then move the duplicate dataset to the top level, resulting in a spec that looks like this:
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
'config': {'view': {'height': 300, 'width': 400}},
'data': {'values': [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}]},
'layer': [{'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'point'},
{'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
'y': {'field': 'y', 'type': 'quantitative'}},
'mark': 'line'}]}
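The hoisting step described above can be sketched as a plain function over the spec dictionary. This is an illustrative sketch only; hoist_common_data is a hypothetical name, not part of Altair:

```python
def hoist_common_data(spec):
    """Move identical per-layer 'data' entries to the top level of a spec.

    Illustrative sketch operating on a plain Vega-Lite spec dict;
    returns a new dict and leaves the input unmodified.
    """
    layers = spec.get('layer', [])
    datasets = [sub.get('data') for sub in layers]
    # Hoist only if every subchart has data and all copies are identical.
    if datasets and all(d is not None for d in datasets) \
            and all(d == datasets[0] for d in datasets):
        spec = dict(spec)  # shallow copy; avoid mutating the input
        spec['data'] = datasets[0]
        spec['layer'] = [{k: v for k, v in sub.items() if k != 'data'}
                         for sub in layers]
    return spec
```

Applied to the layered spec above, this produces the deduplicated form with a single top-level 'data' entry.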
Is this something we want to do by default?
@craigcitro suggested that this could be done by checking for object identity at the Python level (i.e. chart1.data is chart2.data), and that there could be a module-wide flag to turn this behavior off if it proves to cause problems in some edge case.
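A minimal sketch of the identity-based check with a module-wide opt-out flag, as suggested. All names here (DATA_OPTIMIZATION_ENABLED, charts_share_data) are hypothetical, not Altair API:

```python
# Hypothetical module-wide flag to disable the optimization; illustrative only.
DATA_OPTIMIZATION_ENABLED = True


def charts_share_data(*charts):
    """Return True if every chart references the very same data object.

    Uses 'is' (object identity) rather than '==', so two equal but
    distinct DataFrames are not treated as shared. This keeps the check
    O(n) and cheap, and avoids comparing large datasets element-wise.
    """
    if not DATA_OPTIMIZATION_ENABLED or not charts:
        return False
    first = charts[0].data
    return all(c.data is first for c in charts[1:])
```

One consequence of using identity rather than equality: two charts built from separately constructed but equal DataFrames would not be consolidated, which is exactly the conservative behavior the identity check is meant to provide.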
I like this idea. I would also propose that we do this kind of data-optimization check within the chart1 + chart2 and alt.layer(chart1, chart2) APIs, but not when directly defining the underlying object (i.e. alt.LayerChart([chart1, chart2])), and follow a similar pattern for hconcat/vconcat, etc.
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 9 (7 by maintainers)
Top GitHub Comments
I recently implemented a top-level datastore to reduce data redundancy. This may be helpful, as you don’t have to reason about the semantics of Vega-Lite. See https://github.com/vega/vega-lite/pull/3417
As of Altair 3, we consolidate data by hash so that multiple copies don’t appear. In Altair 3.1, we additionally add some logic to move datasets from subcharts to parent charts when possible (#1521). I think this issue has been suitably addressed.
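The hash-based consolidation can be illustrated with a sketch that stores each dataset once in a top-level 'datasets' mapping and has subcharts refer to it by name, which is how Vega-Lite's named data sources work. The function name and hashing details below are illustrative, not Altair's actual code:

```python
import hashlib
import json


def consolidate_datasets(spec):
    """Replace inline per-subchart data with named references.

    Sketch of hash-based consolidation: identical datasets hash to the
    same key and so collapse to a single entry in a top-level
    'datasets' mapping, which subcharts reference via {'name': key}.
    """
    datasets = {}

    def visit(node):
        if isinstance(node, dict):
            data = node.get('data')
            if isinstance(data, dict) and 'values' in data:
                # Hash the serialized values so identical data dedupes.
                key = hashlib.sha256(
                    json.dumps(data['values'], sort_keys=True).encode()
                ).hexdigest()[:12]
                datasets[key] = data['values']
                node = dict(node)
                node['data'] = {'name': key}
            return {k: (visit(v) if k != 'data' else v)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [visit(v) for v in node]
        return node

    out = visit(spec)
    if datasets:
        out['datasets'] = datasets
    return out
```

Unlike the identity-based check discussed earlier, hashing consolidates datasets that are equal in content even when they are distinct objects, at the cost of serializing the data once.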