Improve performance of extremely large datasets

When I search around for plotting libraries that handle a large number of data points, people are talking about 50k; I'm talking about 10M. I made a simple React demo that generates 10 random sinusoids of 1M points each. 10 * 100k works fine, but 10 * 1M becomes unusable.

  import Plot from 'react-plotly.js';

  var numpoints = 1e6;

  // Shared x axis: 0 .. numpoints - 1
  var time: number[] = [];
  for (let i = 0; i < numpoints; i++) {
    time.push(i);
  }

  // Ten sinusoids with random frequencies, one per subplot
  var traces: object[] = [];
  for (let i = 0; i < 10; i++) {
    let points: number[] = [];
    let freq = Math.random() / 1000;
    for (let j = 0; j < numpoints; j++) {
      points.push(Math.sin(j * freq));
    }
    traces.push({
      x: time,
      y: points,
      type: 'scatter',
      mode: 'lines',
      yaxis: `y${i + 1}`,
      xaxis: `x${i + 1}`,
    });
  }

  // ...
  <Plot
    data={traces}
    layout={{
      width: 2000,
      height: 1000,
      grid: {rows: 5, columns: 2, pattern: 'independent'},
      title: 'A Fancy Plot',
    }}
  />

I did a bit of profiling, and there are two issues. The first one is relatively simple: hovering over the plot lags because it loops over every single data point, and the comment in the code already describes what needs to be done (a sketch follows below). https://github.com/plotly/plotly.js/blob/623fcd1fea9d9bfb86e5e0d44d8047cd8636881c/src/components/fx/helpers.js#L59-L62
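
For sorted x data, that loop can become a lower-bound binary search. A minimal sketch of the idea in standalone TypeScript; lowerBound and cursorX are names made up for illustration, not plotly.js internals:

  // Find the index of the first element with xs[i] >= target,
  // in O(log n) instead of scanning all n points.
  function lowerBound(xs: number[], target: number): number {
    let lo = 0;
    let hi = xs.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (xs[mid] < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Hover lookup: pick the nearer of the two candidates around the
  // insertion index instead of measuring the distance to all 10M points.
  const i = Math.min(lowerBound(time, cursorX), time.length - 1);
  const nearest =
    i > 0 && cursorX - time[i - 1] < time[i] - cursorX ? i - 1 : i;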

The second issue is the drawing of the plot itself after a drag or zoom action: it spends all its time in plot, plotOne and linePoints. What's interesting is that even when you zoom in, so that only a small subset of the line needs to be drawn, it's still just as slow. https://github.com/plotly/plotly.js/blob/623fcd1fea9d9bfb86e5e0d44d8047cd8636881c/src/traces/scatter/line_points.js#L346

So it seems like both problems could be solved with some sort of index to avoid looping over all the data points. Some suggestions to jumpstart the discussion:

  • For the common case of a monotonic x axis, implement a simple binary/interpolation search (monotonicity could be detected or specified).
  • Store points in a quadtree, to allow fast spatial indexing for any type of data (such as https://github.com/plotly/point-cluster).
  • Automatic downsampling. If I have 10M points, all the detail is lost anyway when zoomed out, but I still want to be able to zoom in and inspect it (see the decimation sketch after this list).
  • Offload operations to a Web Worker. At some point you're going to need to do a thing 10M times, but don't freeze the UI to do it (see the worker sketch after the next paragraph).
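
To make the downsampling bullet concrete, here is a minimal min/max bucket decimation sketch in plain TypeScript. The function is invented purely for illustration; production-grade algorithms such as LTTB are more sophisticated:

  // Keep each bucket's extremes so spikes survive, and cap the number
  // of points handed to the renderer at roughly maxPoints.
  function downsampleMinMax(
    xs: number[], ys: number[], maxPoints: number
  ): { x: number[]; y: number[] } {
    if (xs.length <= maxPoints) return { x: xs, y: ys };
    const bucketSize = Math.ceil(xs.length / (maxPoints / 2)); // 2 points per bucket
    const outX: number[] = [];
    const outY: number[] = [];
    for (let start = 0; start < xs.length; start += bucketSize) {
      const end = Math.min(start + bucketSize, xs.length);
      let minI = start;
      let maxI = start;
      for (let i = start + 1; i < end; i++) {
        if (ys[i] < ys[minI]) minI = i;
        if (ys[i] > ys[maxI]) maxI = i;
      }
      // Emit in x order so the decimated line doesn't double back on itself.
      const [a, b] = minI <= maxI ? [minI, maxI] : [maxI, minI];
      outX.push(xs[a], xs[b]);
      outY.push(ys[a], ys[b]);
    }
    return { x: outX, y: outY };
  }

Combined with the binary search above, a zoom would only need to decimate the slice between lowerBound(xs, x0) and lowerBound(xs, x1), so the cost scales with what is on screen rather than with the full 10M points.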

If I end up using Plotly in production, I'd be happy to try and contribute towards this, but for now these are just suggestions to see how the maintainers feel about the issue and what their preferred approach would be.
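
To sketch the Web Worker bullet as well, assuming a bundler that understands the new URL(...) worker pattern and that the downsampleMinMax helper above is bundled into the worker (the file name and message shape are invented):

  // downsample.worker.ts — the heavy loop runs off the main thread.
  self.onmessage = (e: MessageEvent) => {
    const { xs, ys, maxPoints } = e.data;
    (self as any).postMessage(downsampleMinMax(xs, ys, maxPoints));
  };

  // Main thread: the UI stays responsive while the worker crunches the points.
  // For 10M-point arrays, typed arrays plus transferables would avoid the copy.
  const worker = new Worker(new URL('./downsample.worker.ts', import.meta.url));
  worker.onmessage = (e) => setTraces(e.data); // hypothetical React state setter
  worker.postMessage({ xs: time, ys: points, maxPoints: 4000 });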

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
jonasvdd commented, Dec 4, 2022

Hi! We created the functionality that @szkjn describes, available for plotly.py, through the plotly-resampler toolkit!

1 reaction
szkjn commented, Jan 21, 2022

Following up on this.

Is there a straightforward way to perform dynamic downsampling depending on the zoom range? So far, selectedData (selection tool) provides both points and range, but relayoutData (zoom/pan tool) only returns the latter.

This has been mentioned in thread #145 but not yet solved as far as I know.

Would appreciate any lead on this!

🙏
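
For the plotly.js / react-plotly.js side of that question, the zoom range can be read from the relayout event and used to re-decimate. A rough sketch: the 'xaxis.range[0]' / 'xaxis.range[1]' keys are what plotly.js emits on an x-axis zoom, while resampleForRange and setTraces are hypothetical:

  <Plot
    data={traces}
    layout={layout}
    onRelayout={(e: any) => {
      const x0 = e['xaxis.range[0]'];
      const x1 = e['xaxis.range[1]'];
      if (x0 !== undefined && x1 !== undefined) {
        // Re-decimate only the visible slice, e.g. with downsampleMinMax above.
        setTraces(resampleForRange(x0, x1));
      }
    }}
  />

In Dash, the analogue would be a callback that takes relayoutData as input; as @jonasvdd mentions above, plotly-resampler packages exactly this pattern for plotly.py.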
