Execution time of patch and stream operations on ColumnDataSource increases with the size of the data source.
Stream and patch operations on ColumnDataSource take progressively more time as the data source grows. In my application this causes a significant problem: during its run it accumulates several thousand records, at which point the stream operation becomes a bottleneck.
Expected behavior
To me it seems reasonable that a streaming operation that only appends to the data source should be implemented in such a way that it does not slow down with the size of the data source (or at least degrades in some reasonable manner).
System Info
Since this is a fairly general issue, I presume the following is the only relevant system information:
bokeh: 0.12.10
python: 3.5.2 (default, Nov 17 2016, 17:05:23)
system: Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
Minimal example
import time
from typing import Dict

import bokeh.models

class TestSource:
    def __init__(self):
        self._source = bokeh.models.ColumnDataSource(data=self._empty_source_dict())

    def _empty_source_dict(self) -> Dict:
        return {'x': [], 'text': [], 'color': [], 'price': [], 'ownership': []}

    def add(self, record):
        # stream() expects a dict mapping column names to sequences of new values,
        # so wrap each value of the record in a one-element list.
        record_list = {key: [value] for key, value in record.items()}
        self._source.stream(record_list)

def get_record():
    return {
        'x': 0,
        'text': 'text',
        'color': 'black',
        'price': 0,
        'ownership': 1,
    }

def test_add(iteration):
    source = TestSource()
    start = time.time()
    record = get_record()
    for _ in range(iteration):
        source.add(record)
    print("Streaming %s records took %s seconds." % (iteration, str(time.time() - start)))

if __name__ == '__main__':
    test_counts = [1000, 2000, 5000, 10000]
    for i in test_counts:
        test_add(i)
This example produces the following output:
Streaming 1000 records took 0.6277530193328857 seconds.
Streaming 2000 records took 2.4086313247680664 seconds.
Streaming 5000 records took 14.723567724227905 seconds.
Streaming 10000 records took 59.13327884674072 seconds.
As you can see, the run with 10,000 records takes almost 100 times longer than the run with 1,000 records instead of 10 times, which suggests that the cost of a single stream call grows with the current size of the data source.
Investigation
Using cProfile I found that almost the entire execution time is spent in value = self.property.prepare_value(obj, self.name, value), namely in its call to the method self.validate(value). After some examination it looks to me that the calls to the validate function have no meaningful effect.
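For reference, here is a minimal sketch of how a profile like the one shown in the "cProfile of example code" section below can be collected for the example above; cProfile and pstats are from the standard library, and the output file name is just an arbitrary choice:

import cProfile
import pstats

# Profile a single 10,000-record run of the example above and print the
# ten most expensive calls sorted by cumulative time.
cProfile.run("test_add(10000)", "stream_profile.prof")
pstats.Stats("stream_profile.prof").sort_stats("cumulative").print_stats(10)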
First, the call to super(Seq, self).validate(value) seems to have no effect here. The most expensive part, however, is the call to all(self.item_type.is_valid(item) for item in value), which invokes is_valid on every item already in the sequence, where an item may itself be a sequence.
This may well be down to my ignorance, but it seems to me that no actual validation is going on. If that is the case, the validation could be removed altogether.
In my particular case I have temporarily worked around the issue by monkey-patching the validate method of the Seq class with a no-op function.
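A minimal sketch of such a workaround, assuming Seq is importable from bokeh.core.properties (the exact module path may differ between Bokeh versions), might look like this:

from bokeh.core.properties import Seq

def _noop_validate(self, value, *args, **kwargs):
    # Skip per-item validation entirely; this trades safety for speed.
    pass

# Apply the monkey patch before any streaming takes place.
Seq.validate = _noop_validate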
The same code run with monkey patching produces:
Streaming 1000 records took 0.04285454750061035 seconds.
Streaming 2000 records took 0.09637665748596191 seconds.
Streaming 5000 records took 0.3601646423339844 seconds.
Streaming 10000 records took 0.8589320182800293 seconds.
This of course is not a great solution, but I would like to know whether anybody is aware of the idea behind this validation in the first place. What does it accomplish?
cProfile of example code
This is the cProfile output (times in seconds) for the worst offenders in the example code when 10,000 records are added:
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|---|---|---|---|---|---|
| 250075167 | 55.28 | 2.21e-07 | 113.9 | 4.555e-07 | properties.py:1170(<genexpr>) |
| 250125402/100365 | 46.47 | 0.000463 | 130.1 | 0.001297 | bases.py:248(is_valid) |
| 70424/20410 | 15.86 | 0.0007772 | 130.9 | 0.006412 | ~:0(<built-in method builtins.all>) |
| 250138574 | 12.22 | 4.886e-08 | 12.22 | 4.886e-08 | bases.py:231(validate) |
| 60067/10067 | 0.6478 | 6.434e-05 | 0.741 | 7.36e-05 | bases.py:168(matches) |
| 838508/838507 | 0.123 | 1.466e-07 | 0.2018 | 2.407e-07 | ~:0(<built-in method builtins.isinstance>) |
| 10000 | 0.1108 | 1.108e-05 | 131.6 | 0.01316 | sources.py:344(_stream) |
| 10000 | 0.1013 | 1.013e-05 | 131.5 | 0.01315 | containers.py:348(_stream) |
You can see that is_valid is called a little over 250 million times.
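That figure is consistent with quadratic behaviour: if every stream call re-validates all items already in a column, appending records one at a time costs roughly 1 + 2 + ... + N item checks per column. A quick back-of-the-envelope check (the column count of 5 comes from the example's data dict):

n = 10000
columns = 5  # x, text, color, price, ownership
expected_checks = columns * n * (n + 1) // 2
print(expected_checks)  # 250025000, in line with the ~250 million is_valid calls above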
Top GitHub Comments
cc @mattpap @mrocklin I'd like to come to a decision on this. I guess I would not have expected users to be streaming thousands of rows; I had much smaller marginal stream additions in mind when I implemented things. But evidently it's a thing people want to do. Would the best option be a settings flag to optionally turn off validation on CDS columns?
Turning off array validations as an optimization is also discussed in https://github.com/bokeh/bokeh/issues/6470, so I am closing this issue as a dupe.