
Optimize data in specs?

See original GitHub issue

Currently, when you use Altair’s API to create layered or concatenated charts, it is quite common to end up with multiple copies of the data within the specification. For example:

import altair as alt
import pandas as pd

data = pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 0]})

base = alt.Chart(data).encode(x='x', y='y')

chart = base.mark_point() + base.mark_line()
print(chart.to_dict())
{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'layer': [{'data': {'values': [{'x': 0, 'y': 0},
                                {'x': 1, 'y': 1},
                                {'x': 2, 'y': 0}]},
            'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
                         'y': {'field': 'y', 'type': 'quantitative'}},
            'mark': 'point'},
           {'data': {'values': [{'x': 0, 'y': 0},
                                {'x': 1, 'y': 1},
                                {'x': 2, 'y': 0}]},
            'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
                         'y': {'field': 'y', 'type': 'quantitative'}},
            'mark': 'line'}]}

For such a small dataset this is not a problem, but as the data grows, the cost of this duplication can become significant.
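To make the cost concrete, here is a rough back-of-the-envelope measurement in plain Python (no Altair required), assuming the data is inlined once per layer as in the spec above:

```python
import json

# A modest dataset: 1000 rows of two columns.
rows = [{'x': i, 'y': i % 2} for i in range(1000)]

# Spec with the data stored once at the top level.
single = len(json.dumps({'data': {'values': rows}}))

# Two-layer spec with the same data inlined in each layer.
layered = len(json.dumps({'layer': [{'data': {'values': rows}},
                                    {'data': {'values': rows}}]}))

# The layered spec is close to twice the size of the single-data spec;
# with N layers the payload grows roughly N-fold.
```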

We should be able to detect, at the Python level, when we are creating a compound chart in which each subchart has identical data, and move the duplicated dataset to the top level, resulting in a spec that looks like this:

{'$schema': 'https://vega.github.io/schema/vega-lite/v2.json',
 'config': {'view': {'height': 300, 'width': 400}},
 'data': {'values': [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}]},
 'layer': [{'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
                         'y': {'field': 'y', 'type': 'quantitative'}},
            'mark': 'point'},
           {'encoding': {'x': {'field': 'x', 'type': 'quantitative'},
                         'y': {'field': 'y', 'type': 'quantitative'}},
            'mark': 'line'}]}
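As a rough sketch, the detection could operate directly on the emitted spec dict. The function name `extract_shared_data` is hypothetical, not part of Altair's API; this sketch only handles the `'layer'` case and compares data by equality:

```python
def extract_shared_data(spec):
    """Return a copy of a layered spec with identical subchart data
    hoisted to the top level; return the spec unchanged otherwise."""
    layers = spec.get('layer', [])
    if not layers or any('data' not in sub for sub in layers):
        return spec
    first = layers[0]['data']
    if any(sub['data'] != first for sub in layers[1:]):
        return spec
    out = dict(spec)
    out['data'] = first
    # Drop the now-redundant per-layer data entries.
    out['layer'] = [{k: v for k, v in sub.items() if k != 'data'}
                    for sub in layers]
    return out
```

The same idea would extend to `hconcat`/`vconcat` by walking those keys as well.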

Is this something we want to do by default?

@craigcitro suggested that this could be done by checking for object identity at the Python level (i.e. chart1.data is chart2.data), and that there could be a module-wide flag to turn this behavior off if it proves to cause problems in some edge case.

I like this idea… I would also propose that we perform this kind of data optimization within the chart1 + chart2 and alt.layer(chart1, chart2) APIs, but not when directly constructing the underlying object (i.e. alt.LayerChart([chart1, chart2])), and that we follow a similar pattern for hconcat/vconcat, etc.
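A minimal sketch of how the identity check and module-wide flag could fit into a layer-style combinator. Charts are represented as plain dicts here for illustration, and the flag name and `layer` helper are hypothetical, not Altair's actual API:

```python
# Module-wide escape hatch, per @craigcitro's suggestion.
data_optimization_enabled = True

def layer(*charts):
    """Combine chart dicts into a layered spec, sharing the data at the
    top level when every subchart holds the *same* data object."""
    spec = {'layer': [dict(c) for c in charts]}
    datasets = [c.get('data') for c in charts]
    if (data_optimization_enabled
            and datasets
            and all(d is datasets[0] and d is not None for d in datasets)):
        spec['data'] = datasets[0]
        for sub in spec['layer']:
            sub.pop('data', None)
    return spec
```

Note that the identity check (`is`) is deliberately stricter than equality: two equal-but-distinct DataFrames would not be consolidated, which keeps the check cheap and avoids comparing large datasets element-wise.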

Any thoughts on this? cc/@ellisonbg, @domoritz, @kanitw

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
domoritz commented, Mar 14, 2018

I recently implemented a top-level datastore to reduce data redundancy. This may be helpful, as you don’t have to reason about the semantics of Vega-Lite. See https://github.com/vega/vega-lite/pull/3417

0 reactions
jakevdp commented, May 27, 2019

As of Altair 3, we consolidate data by hash so that multiple copies don’t appear. In Altair 3.1, we additionally add some logic to move datasets from subcharts to parent charts when possible (#1521). I think this issue has been suitably addressed.
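The consolidate-by-hash idea can be sketched roughly as follows: each unique dataset is stored once under a top-level `datasets` mapping, keyed by a content hash, and subcharts refer to it by name (Vega-Lite supports named data via `{'name': ...}`). The function and naming scheme here are illustrative, not Altair's actual implementation:

```python
import hashlib
import json

def consolidate_datasets(spec):
    """Replace inline 'data' values in a layered spec with named
    references into a top-level 'datasets' mapping keyed by hash."""
    datasets = {}
    new_layers = []
    for sub in spec.get('layer', []):
        sub = dict(sub)
        data = sub.get('data')
        if data and 'values' in data:
            # Hash the serialized values so identical datasets collapse
            # to a single entry regardless of object identity.
            digest = hashlib.sha256(
                json.dumps(data['values'], sort_keys=True).encode()
            ).hexdigest()[:16]
            name = 'data-' + digest
            datasets[name] = data['values']
            sub['data'] = {'name': name}
        new_layers.append(sub)
    out = dict(spec)
    out['layer'] = new_layers
    out['datasets'] = datasets
    return out
```

Hashing by content (rather than checking object identity) also deduplicates equal datasets that happen to be distinct Python objects.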


