Plotting many lines as separate columns takes a long time

I am hoping this is a misunderstanding or misuse on my part, but plotting many lines with Datashader is taking much longer than I would expect, at least compared to plotting points. To provide an example case, I wrote the code below, which demonstrates what I am seeing. As a summary, this figure shows how many seconds it takes to plot n lines, each with 50 data points.

[Figure: ds_line_test]

While this shows it scales as n log(n), better than n^2, the large amount of time (over 4.5 hours for 8192 lines) makes me wonder whether it is actually running optimized numba code. Again, I hope this is simply misuse on my part, as I want to use this to plot over 100,000 lines. I got similar results when running on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I ran this in a Jupyter notebook; would it perform better in a Python script?

Here is the code I am running:

import pandas as pd
import numpy as np
import datashader
import bokeh.plotting
import collections
import xarray
import time
from bokeh.palettes import Colorblind7 as palette

bokeh.plotting.output_notebook()

# create some data worth plotting
x = np.linspace(0, np.pi * 2)
y = np.sin(x)
n = 100000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset
df = pd.DataFrame(data.T)
x_range = 0, 2*np.pi
y_range = -1.5, 1.5
y_cols = range(1, n+1)

# iterate over increasing number of lines
run_times = []
imgs = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
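for n in test:
    # keep the x column (0) plus the first n y columns
    this_df = df.iloc[:, :n+1]
    y_cols = range(1, n+1)
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=300, plot_width=300)
    # one aggregate per line: this is the slow step being timed
    aggs = collections.OrderedDict((c, canvas.line(this_df, 0, c)) for c in y_cols)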
    merged = xarray.concat(aggs.values(), dim=pd.Index(y_cols, name='cols'))
    img = datashader.transfer_functions.shade(merged.sum(dim='cols'), how='eq_hist')
    toc = time.time() - tic
    run_times.append(toc)
    imgs.append(img)

# plot the result
p = bokeh.plotting.figure(y_axis_label='time (s)', x_axis_label='n lines',
                          width=400, height=300, x_axis_type='log',
                          y_axis_type='log',
                          title='Run-time for Datashader line plot')
p.circle(test, run_times, legend='run times', color=palette[0])

# what is the slope of the rate of increase?
test = np.array(test)
n2 = test ** 2 # + run_times[0]
n3 = test ** 3 # + run_times[0]
nlogn = test * np.log(test) # + run_times[0]
p.line(test, n2, legend='n^2', color=palette[1])
p.line(test, n3, legend='n^3', color=palette[2])
p.line(test, nlogn, legend='n log(n)', color=palette[3])

p.legend.location = 'top_left'
bokeh.plotting.show(p)

Here is the last image (using 8192 of the total 100000 lines):

[Image: ds_8192_lines]
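A side note on reading the scaling off the plot: the reference curves above are not scaled to the measured times, so the comparison is only by eye. A rough alternative, sketched here, is to fit a straight line in log-log space and read the exponent off the slope. The helper name `loglog_slope` is made up for illustration, and the timings below are synthetic placeholders, not the measurements from this issue.

```python
import numpy as np


def loglog_slope(n_values, seconds):
    """Return the fitted exponent k, assuming seconds ~ c * n**k."""
    n_values = np.asarray(n_values, dtype=float)
    seconds = np.asarray(seconds, dtype=float)
    # a power law is a straight line in log-log space; its slope is k
    slope, _intercept = np.polyfit(np.log(n_values), np.log(seconds), 1)
    return slope


# same line counts as the benchmark loop: 2, 4, ..., 8192
n_values = 2 ** np.arange(1, 14)
# SYNTHETIC timings that grow quadratically, used only to exercise the helper
synthetic_seconds = 0.5 * n_values.astype(float) ** 2
slope = loglog_slope(n_values, synthetic_seconds)
```

With real data, pass `test` and `run_times` from the loop instead. A slope near 1 suggests roughly linear (or n log(n)) growth, and a slope near 2 suggests quadratic growth.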

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

jbednar commented, Oct 6, 2017 (1 reaction)

We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I’d be happy to see the above generalized and put into a function in ds.utils, along with an example somewhere in the notebooks of how to use it (e.g. in tseries.ipynb). Any hope of you creating a PR for that? 😃

With that in place we can then determine if there’s a big savings to be had by supporting a fixed-length array type directly.
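The conversion that "the above" refers to is not included in this excerpt. One approach that fits the description, offered here only as a sketch, is to stack all lines into a single long DataFrame with a NaN row after each line, so that Datashader can draw everything in one `Canvas.line` call instead of one call per column. The function names `lines_to_nan_separated` and `rasterize` are hypothetical, not part of `ds.utils`, and the sketch assumes that the line glyph treats NaN as a break between segments.

```python
import numpy as np
import pandas as pd


def lines_to_nan_separated(x, ys):
    """Stack lines that share one x grid into a single long DataFrame.

    x  : 1-D array of length m
    ys : 2-D array of shape (n_lines, m)
    A NaN row follows each line so consecutive lines are not joined.
    """
    x = np.asarray(x, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n_lines, m = ys.shape
    # one extra slot per line holds the NaN separator
    xs_long = np.full((n_lines, m + 1), np.nan)
    ys_long = np.full((n_lines, m + 1), np.nan)
    xs_long[:, :m] = x   # broadcast the shared x grid to every line
    ys_long[:, :m] = ys
    return pd.DataFrame({"x": xs_long.ravel(), "y": ys_long.ravel()})


def rasterize(long_df, x_range, y_range, size=300):
    """Aggregate all lines with a single Canvas.line call (needs datashader)."""
    import datashader  # imported lazily so the conversion works without it
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=size, plot_width=size)
    # count() tallies how many lines cross each pixel
    return canvas.line(long_df, "x", "y", agg=datashader.count())


# same kind of data as in the issue: noisy-offset sine curves, 50 points each
x = np.linspace(0, 2 * np.pi)
prng = np.random.RandomState(123)
ys = np.sin(x) + prng.normal(0, 0.1, 1000).reshape(-1, 1)
long_df = lines_to_nan_separated(x, ys)
```

Calling `rasterize(long_df, (0, 2 * np.pi), (-1.5, 1.5))` should then give one aggregate that can be passed to `datashader.transfer_functions.shade`, replacing the per-column loop and the `xarray.concat` step. Whether this is fast enough for 100,000 lines would need to be measured.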

narendramukherjee commented, Oct 6, 2017 (0 reactions)

Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon 😃
