Plotting many lines as separate columns takes a long time
I am hoping there is a misunderstanding or misuse on my part, but plotting lots of lines with Datashader takes much longer than I would expect, at least compared to plotting points. As an example, I wrote the code below, which demonstrates what I am seeing. In summary, the resulting figure shows how many seconds it takes to plot n lines, each with 50 data points.
While this shows the run time scaling roughly as n log(n), better than n^2, the sheer amount of time, over 4.5 hours for 8192 lines, makes me wonder whether it is actually running optimized numba code. Again, I hope this is simply misuse on my part, since I want to use this to plot over 100,000 lines. I got similar results on a Linux desktop (12 cores, 32 GB RAM) and on my Mac laptop. I ran this in a Jupyter notebook; would it perform better in a Python script?
Here is the code I am running:
import pandas as pd
import numpy as np
import datashader
import bokeh.plotting
import collections
import xarray
import time
from bokeh.palettes import Colorblind7 as palette
bokeh.plotting.output_notebook()
# create some data worth plotting
x = np.linspace(0, np.pi * 2)
y = np.sin(x)
n = 100000
data = np.empty([n+1, len(y)])
data[0] = x
prng = np.random.RandomState(123)
offset = prng.normal(0, 0.1, n).reshape(n, -1)
data[1:] = y
data[1:] += offset
df = pd.DataFrame(data.T)
x_range = 0, 2*np.pi
y_range = -1.5, 1.5
y_cols = range(1, n+1)
# iterate over increasing number of lines
run_times = []
imgs = []
test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
for n_lines in test:
    this_df = df.iloc[:, :n_lines + 1]  # x column plus the first n_lines y columns
    y_cols = range(1, n_lines + 1)
    tic = time.time()
    canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                               plot_height=300, plot_width=300)
    # one canvas.line call per y column, then merge the per-line aggregates
    aggs = collections.OrderedDict((c, canvas.line(this_df, 0, c)) for c in y_cols)
    merged = xarray.concat(aggs.values(), dim=pd.Index(y_cols, name='cols'))
    img = datashader.transfer_functions.shade(merged.sum(dim='cols'), how='eq_hist')
    toc = time.time() - tic
    run_times.append(toc)
    imgs.append(img)
# plot the result
p = bokeh.plotting.figure(y_axis_label='time (s)', x_axis_label='n lines',
                          width=400, height=300, x_axis_type='log',
                          y_axis_type='log',
                          title='Run-time for Datashader line plot')
p.circle(test, run_times, legend='run times', color=palette[0])
# what is the slope of the rate of increase?
test = np.array(test)
n2 = test ** 2
n3 = test ** 3
nlogn = test * np.log(test)  # use log(test), not a leftover loop variable
p.line(test, n2, legend='n^2', color=palette[1])
p.line(test, n3, legend='n^3', color=palette[2])
p.line(test, nlogn, legend='n log(n)', color=palette[3])
p.legend.location = 'top_left'
bokeh.plotting.show(p)
Here is the last image (using 8192 of the total 100,000 lines):
Top GitHub Comments
We can certainly provide utility functions for converting common data structures that are used in practice into ones that datashader can deal with directly. I’d be happy to see the above generalized and put into a function in ds.utils, along with an example somewhere in the notebooks of how to use it (e.g. in tseries.ipynb). Any hope of you creating a PR for that? 😃
With that in place we can then determine if there’s a big savings to be had by supporting a fixed-length array type directly.
Cool, I will have a look at ds.utils and tseries.ipynb and put in a PR soon 😃
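For reference, here is a minimal sketch of the kind of conversion utility being discussed, assuming the wide-format df from the benchmark code above (one x column, many y columns). It relies on Datashader's NaN-separator convention, where canvas.line skips segments containing NaN, so all lines can be drawn with a single call instead of one call per column. The name wide_to_single_line is hypothetical, not an existing ds.utils function:

import numpy as np
import pandas as pd
import datashader

def wide_to_single_line(df, x_col, y_cols):
    # Flatten a wide DataFrame into one long x/y DataFrame in which
    # consecutive lines are separated by a NaN row, so that a single
    # canvas.line call renders all of them in one aggregation pass.
    x = df[x_col].values
    npts = len(x)
    y_cols = list(y_cols)
    xs = np.empty(len(y_cols) * (npts + 1))
    ys = np.empty(len(y_cols) * (npts + 1))
    for i, c in enumerate(y_cols):
        start = i * (npts + 1)
        xs[start:start + npts] = x
        ys[start:start + npts] = df[c].values
        xs[start + npts] = np.nan  # break the path between lines
        ys[start + npts] = np.nan
    return pd.DataFrame({'x': xs, 'y': ys})

# usage with the data above: one aggregation for all 100,000 lines
flat = wide_to_single_line(df, 0, range(1, n + 1))
canvas = datashader.Canvas(x_range=x_range, y_range=y_range,
                           plot_height=300, plot_width=300)
agg = canvas.line(flat, 'x', 'y')
img = datashader.transfer_functions.shade(agg, how='eq_hist')

Since the aggregation then happens in a single numba-compiled pass rather than thousands of separate canvas.line calls, this should avoid most of the per-call overhead behind the timings above.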