Execution time of patch and stream operations on ColumnDataSource increases with the size of the data source.
Stream and patch operations on ColumnDataSource take progressively more time as the data source grows. In my application this causes a significant problem: during its run it accumulates several thousand records, at which point the stream operation becomes a bottleneck.
Expected behavior
To me it seems reasonable that a streaming operation that only appends to the data source should be implemented in such a way that it does not slow down with the size of the data source (or at least degrades in some reasonable manner).
System Info
Since this is a fairly general issue, I presume the following is the only relevant system information:
bokeh: 0.12.10
python: 3.5.2 (default, Nov 17 2016, 17:05:23)
system: Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
Minimal example
import time
from typing import Dict

import bokeh.models

class TestSource:
    def __init__(self):
        self._source = bokeh.models.ColumnDataSource(data=self._empty_source_dict())

    def _empty_source_dict(self) -> Dict:
        return {'x': [], 'text': [], 'color': [], 'price': [], 'ownership': []}

    def add(self, record):
        # stream() expects a dict mapping column names to sequences of new values,
        # so wrap each value of the record in a one-element list.
        record_list = {key: [value] for key, value in record.items()}
        self._source.stream(record_list)

def get_record():
    return {
        'x': 0,
        'text': 'text',
        'color': 'black',
        'price': 0,
        'ownership': 1,
    }

def test_add(iteration):
    source = TestSource()
    start = time.time()
    record = get_record()
    for _ in range(iteration):
        source.add(record)
    print("Streaming %s records took %s seconds." % (iteration, str(time.time() - start)))

if __name__ == '__main__':
    test_counts = [1000, 2000, 5000, 10000]
    for i in test_counts:
        test_add(i)
This example produces the following output:
Streaming 1000 records took 0.6277530193328857 seconds.
Streaming 2000 records took 2.4086313247680664 seconds.
Streaming 5000 records took 14.723567724227905 seconds.
Streaming 10000 records took 59.13327884674072 seconds.
As you can see, the run with 10,000 records takes almost 100 times longer than the run with 1,000 records instead of 10 times, which suggests that the cost of a single stream call grows with the current size of the data source.
Investigation
Using cProfile I found that almost the entire execution time is spent in value = self.property.prepare_value(obj, self.name, value), namely in its call to the method self.validate(value). After some examination it looks to me that the calls to the validate function have no meaningful effect.
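For reference, here is a minimal sketch of how a profile like the one shown in the "cProfile of example code" section below can be collected for the example above; cProfile and pstats are from the standard library, and the output file name is just an arbitrary choice:

import cProfile
import pstats

# Profile a single 10,000-record run of the example above and print the
# ten most expensive calls sorted by cumulative time.
cProfile.run("test_add(10000)", "stream_profile.prof")
pstats.Stats("stream_profile.prof").sort_stats("cumulative").print_stats(10)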
First, the call to super(Seq, self).validate(value) seems to have no effect here. The most expensive part, however, is the call to all(self.item_type.is_valid(item) for item in value), which invokes is_valid on every item already in the sequence, where an item may itself be a sequence.
This may well be down to my ignorance, but it seems to me that no actual validation is going on. If that is the case, the validation could be removed altogether.
In my particular case I have temporarily worked around the issue by monkey-patching the validate method of the Seq class with a no-op function.
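A minimal sketch of such a workaround, assuming Seq is importable from bokeh.core.properties (the exact module path may differ between Bokeh versions), might look like this:

from bokeh.core.properties import Seq

def _noop_validate(self, value, *args, **kwargs):
    # Skip per-item validation entirely; this trades safety for speed.
    pass

# Apply the monkey patch before any streaming takes place.
Seq.validate = _noop_validate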
The same code run with monkey patching produces:
Streaming 1000 records took 0.04285454750061035 seconds.
Streaming 2000 records took 0.09637665748596191 seconds.
Streaming 5000 records took 0.3601646423339844 seconds.
Streaming 10000 records took 0.8589320182800293 seconds.
This of course is not a great solution, but I would like to know whether anybody is aware of the idea behind this validation in the first place. What does it accomplish?
cProfile of example code
This is the cProfile output (times in seconds) for the worst offenders in the example code when 10,000 records are added:
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|---|---|---|---|---|---|
| 250075167 | 55.28 | 2.21e-07 | 113.9 | 4.555e-07 | properties.py:1170(<genexpr>) |
| 250125402/100365 | 46.47 | 0.000463 | 130.1 | 0.001297 | bases.py:248(is_valid) |
| 70424/20410 | 15.86 | 0.0007772 | 130.9 | 0.006412 | ~:0(<built-in method builtins.all>) |
| 250138574 | 12.22 | 4.886e-08 | 12.22 | 4.886e-08 | bases.py:231(validate) |
| 60067/10067 | 0.6478 | 6.434e-05 | 0.741 | 7.36e-05 | bases.py:168(matches) |
| 838508/838507 | 0.123 | 1.466e-07 | 0.2018 | 2.407e-07 | ~:0(<built-in method builtins.isinstance>) |
| 10000 | 0.1108 | 1.108e-05 | 131.6 | 0.01316 | sources.py:344(_stream) |
| 10000 | 0.1013 | 1.013e-05 | 131.5 | 0.01315 | containers.py:348(_stream) |
You can see that is_valid is called a little over 250 million times.
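That figure is consistent with quadratic behaviour: if every stream call re-validates all items already in a column, appending records one at a time costs roughly 1 + 2 + ... + N item checks per column. A quick back-of-the-envelope check (the column count of 5 comes from the example's data dict):

n = 10000
columns = 5  # x, text, color, price, ownership
expected_checks = columns * n * (n + 1) // 2
print(expected_checks)  # 250025000, in line with the ~250 million is_valid calls above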
Top GitHub Comments
cc @mattpap @mrocklin I'd like to come to a decision on this. I guess I would not have expected users to be streaming thousands of rows; I had much smaller marginal stream additions in mind when I implemented things. But evidently it's a thing people want to do. Would the best option be a settings flag to optionally turn off validation on CDS columns?
Turning off array validations as an optimization is also discussed in https://github.com/bokeh/bokeh/issues/6470, so I am closing this issue as a dupe.