
Execution time of patch and stream operations on ColumnDataSource increases with the size of the data source.

See original GitHub issue

Stream and patch operations on a ColumnDataSource take progressively more time as the data source grows. This is a significant problem in my application, which accumulates several thousand records over its run, at which point the stream operation becomes a bottleneck.

Expected behavior

It seems reasonable that a streaming operation that only appends to the data source should not slow down as the data source grows (or should at least degrade gracefully).

System Info

Since this is a very general issue, I presume this is the only relevant system information:

bokeh: 0.12.10
python: 3.5.2 (default, Nov 17 2016, 17:05:23)
system: Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial

Minimal example

import time
from typing import Dict

import bokeh.models  # ColumnDataSource lives in bokeh.models

class TestSource:
  def __init__(self):
    self._source = bokeh.models.ColumnDataSource(data=self._empty_source_dict())

  def _empty_source_dict(self) -> Dict:
    return {'x': [], 'text': [], 'color': [], 'price': [], 'ownership': []}

  def add(self, record):
    record_list = {key: [value] for key, value in record.items()}
    self._source.stream(record_list)

def get_record():
  return {
    'x': 0,
    'text': 'text',
    'color': 'black',
    'price': 0,
    'ownership': 1
  }

def test_add(iteration):
  source = TestSource()
  start = time.time()
  record = get_record()
  for _ in range(iteration):
    source.add(record)
  print("Streaming %s records took %s seconds." % (iteration, time.time() - start))

if __name__ == '__main__':
  test_counts = [1000, 2000, 5000, 10000]
  for i in test_counts:
    test_add(i)

This example produces the following output:

Streaming 1000 records took 0.6277530193328857 seconds.
Streaming 2000 records took 2.4086313247680664 seconds.
Streaming 5000 records took 14.723567724227905 seconds.
Streaming 10000 records took 59.13327884674072 seconds.

As you can see, going from 1,000 to 10,000 records takes almost 100 times as long instead of 10 times.
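This scaling is consistent with each stream() call re-validating the entire column: streaming the k-th row validates k items, so streaming n rows one at a time costs roughly n²/2 item validations per column. A quick back-of-the-envelope check (plain Python, no Bokeh required):

```python
# If every stream() call validates the whole column, streaming n rows
# costs 1 + 2 + ... + n = n(n+1)/2 item validations per column.
def validation_cost(n):
    return n * (n + 1) // 2

# Ratio of total work for 10,000 vs 1,000 streamed rows.
ratio = validation_cost(10_000) / validation_cost(1_000)
print(ratio)  # ~99.9, matching the observed ~100x slowdown

# With 5 columns, 10,000 streamed rows give roughly
# 5 * 50,005,000 = 250,025,000 item validations, which matches the
# ~250 million is_valid calls seen in the profile further below.
print(5 * validation_cost(10_000))
```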

Investigation

Using cProfile I found that almost the entire execution time is spent in value = self.property.prepare_value(obj, self.name, value), and specifically in its call to self.validate(value). After some examination, it looks to me like the calls to validate have no meaningful effect.
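For reference, a profile like this can be gathered with the standard library's cProfile and pstats modules. A minimal sketch (the profiled function here is just a stand-in for the streaming loop in the example above):

```python
import cProfile
import pstats

def hot_loop():
    # stand-in for the stream() loop in the minimal example
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# sort by cumulative time to surface the worst offenders
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```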

First, the call to super(Seq, self).validate(value) seems useless, since it just defers to the base implementation. But most expensive is the call all(self.item_type.is_valid(item) for item in value), where is_valid is called on every item in the sequence, and an item may itself be a sequence.

This is most likely caused by my ignorance, but it seems to me that there is no actual validation going on. If that is the case, the validation could be removed altogether.

In my particular case I have temporarily solved the issue by monkey patching the validate method of the Seq class with an empty function.
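The monkey patch looks roughly like the following. In Bokeh 0.12.x the class in question should be bokeh.core.properties.Seq (an assumption to verify against your Bokeh version); to keep this sketch runnable without Bokeh installed, a minimal stand-in class is used here instead:

```python
class Seq:
    """Stand-in for bokeh.core.properties.Seq (hypothetical, for illustration)."""
    def validate(self, value):
        # imagine the expensive per-item is_valid() loop here
        for item in value:
            if not isinstance(item, (int, float, str)):
                raise ValueError("invalid item")

# The workaround: replace validate with a no-op, skipping per-item checks.
Seq.validate = lambda self, value: None

s = Seq()
s.validate(list(range(1_000_000)))  # returns immediately, validating nothing
```

With Bokeh installed, the equivalent would be importing Seq from bokeh.core.properties and assigning the same no-op; note that this disables sequence-property validation process-wide, not just for the streamed columns.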

The same code, run with the monkey patch in place, produces:

Streaming 1000 records took 0.04285454750061035 seconds.
Streaming 2000 records took 0.09637665748596191 seconds.
Streaming 5000 records took 0.3601646423339844 seconds.
Streaming 10000 records took 0.8589320182800293 seconds.

This is of course not a great solution, but I would like to know the rationale behind validating in the first place. What does it accomplish?

cProfile of example code

[cProfile plot of the example run]

This is the cProfile output for the worst offenders in the example code when streaming 10,000 records:

ncalls tottime percall cumtime percall filename:lineno(function)
250075167 55.28 2.21e-07 113.9 4.555e-07 properties.py:1170(<genexpr>)
250125402/100365 46.47 0.000463 130.1 0.001297 bases.py:248(is_valid)
70424/20410 15.86 0.0007772 130.9 0.006412 ~:0(<built-in method builtins.all>)
250138574 12.22 4.886e-08 12.22 4.886e-08 bases.py:231(validate)
60067/10067 0.6478 6.434e-05 0.741 7.36e-05 bases.py:168(matches)
838508/838507 0.123 1.466e-07 0.2018 2.407e-07 ~:0(<built-in method builtins.isinstance>)
10000 0.1108 1.108e-05 131.6 0.01316 sources.py:344(_stream)
10000 0.1013 1.013e-05 131.5 0.01315 containers.py:348(_stream)

You can see that is_valid is called a little over 250 million times.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
bryevdv commented, Jan 5, 2018

cc @mattpap @mrocklin I’d like to come to a decision on this. I guess I would not have expected users to be streaming thousands of rows; I had much smaller marginal stream additions in mind when I implemented things. But evidently it’s a thing people want to do. Would the best option be a settings flag to optionally turn off validation on CDS columns?

0 reactions
bryevdv commented, May 11, 2018

Turning off array validation as an optimization is also discussed in https://github.com/bokeh/bokeh/issues/6470, so I am closing this issue as a duplicate.
