Plotting many lines as separate columns takes a long time
I am hoping there is a misunderstanding or misuse on my part, but plotting lots of lines with Datashader takes much longer than I would expect, at least compared to plotting points. As an example, I wrote the code below, which demonstrates what I am seeing. In summary, the resulting figure shows how many seconds it takes to plot n lines, each with 50 data points.
While this shows the run time scaling roughly as n log(n), better than n^2, the sheer amount of time, over 4.5 hours for 8192 lines, makes me wonder whether it is actually running optimized numba code. Again, I hope this is simply misuse on my part, since I want to use this to plot over 100,000 lines. I got similar results on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I ran this in a Jupyter notebook; would it perform better in a Python script?
Here is the code I am running:
import pandas as pd
import numpy as np
import datashader
import bokeh.plotting
import collections
import xarray
import time
from bokeh.palettes import Colorblind7 as palette
bokeh.plotting.output_notebook()
# create some data worth plotting
x = np.linspace(0, np.pi * 2)
y = np.sin(x)
n = 100000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset
df = pd.DataFrame(data.T)
x_range = 0, 2*np.pi
y_range = -1.5, 1.5
y_cols = range(1, n+1)
# iterate over increasing number of lines
run_times = []
imgs = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n_lines in test:
    this_df = df.iloc[:, :n_lines + 1]  # x column plus the first n_lines y columns
    y_cols = range(1, n_lines + 1)
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=300, plot_width=300)
    # one canvas.line call per y column, then merge the per-line aggregates
    aggs = collections.OrderedDict((c, canvas.line(this_df, 0, c)) for c in y_cols)
    merged = xarray.concat(aggs.values(), dim=pd.Index(y_cols, name='cols'))
    img = datashader.transfer_functions.shade(merged.sum(dim='cols'), how='eq_hist')
    toc = time.time() - tic
    run_times.append(toc)
    imgs.append(img)
# plot the result
p = bokeh.plotting.figure(y_axis_label='time (s)', x_axis_label='n lines',
                          width=400, height=300, x_axis_type='log',
                          y_axis_type='log',
                          title='Run-time for Datashader line plot')
p.circle(test, run_times, legend='run times', color=palette[0])
# what is the slope of the rate of increase?
test = np.array(test)
n2 = test ** 2
n3 = test ** 3
nlogn = test * np.log(test)  # use log(test), not a leftover loop variable
p.line(test, n2, legend='n^2', color=palette[1])
p.line(test, n3, legend='n^3', color=palette[2])
p.line(test, nlogn, legend='n log(n)', color=palette[3])
p.legend.location = 'top_left'
bokeh.plotting.show(p)
Here is the last image (using 8192 of the total 100,000 lines):
Top GitHub Comments
We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I’d be happy to see the above generalized and put into a function in ds.utils, along with an example somewhere in the notebooks of how to use it (e.g. in tseries.ipynb). Any hope of you creating a PR for that? 😃
With that in place we can then determine if there’s a big savings to be had by supporting a fixed-length array type directly.
Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon 😃
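For reference, here is a minimal sketch of the kind of conversion utility being discussed, assuming the wide-format df from the benchmark code above (one x column, many y columns). It relies on Datashader's NaN-separator convention, where canvas.line skips segments containing NaN, so all lines can be drawn with a single call instead of one call per column. The name wide_to_single_line is hypothetical, not an existing ds.utils function:

import numpy as np
import pandas as pd
import datashader

def wide_to_single_line(df, x_col, y_cols):
    # Flatten a wide DataFrame into one long x/y DataFrame in which
    # consecutive lines are separated by a NaN row, so that a single
    # canvas.line call renders all of them in one aggregation pass.
    x = df[x_col].values
    npts = len(x)
    y_cols = list(y_cols)
    xs = np.empty(len(y_cols) * (npts + 1))
    ys = np.empty(len(y_cols) * (npts + 1))
    for i, c in enumerate(y_cols):
        start = i * (npts + 1)
        xs[start:start + npts] = x
        ys[start:start + npts] = df[c].values
        xs[start + npts] = np.nan  # break the path between lines
        ys[start + npts] = np.nan
    return pd.DataFrame({'x': xs, 'y': ys})

# usage with the data above: one aggregation for all 100,000 lines
flat = wide_to_single_line(df, 0, range(1, n + 1))
canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                           plot_height=300, plot_width=300)
agg = canvas.line(flat, 'x', 'y')
img = datashader.transfer_functions.shade(agg, how='eq_hist')

Since the aggregation then happens in a single numba-compiled pass rather than thousands of separate canvas.line calls, this should avoid most of the per-call overhead behind the timings above.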