Smart Config
Missing functionality
Configuration is always a big problem for me.
When I was a beginner with PP, I didn’t know how to set the various parameters for my data set and my ML use case, and the run time and memory usage were sometimes unacceptable.
Only recently, after reading all of the code and becoming familiar with all the implementations and configuration options, could I choose an efficient configuration. But you can’t expect every user to learn to configure their own case this way.
Some friends of mine always complain about how slow PP is, but I find that PP itself is actually not that slow; it’s the one-size-fits-all default configuration that makes PP slow.
What’s worse, some of the default config items become problems when I try to tune performance. For example, here are some running-time results from performance tests on a dual-core server (the benchmark generates HTML reports on some commonly used data sets):
Branch | Use_dask | bayesian_blocks | Repeat | Benchmark1(ms) | Benchmark2(ms) | Benchmark3(ms) | Benchmark4(ms) | Benchmark5(ms) |
---|---|---|---|---|---|---|---|---|
loopy-patch-fast | True | False | 10 | 5361 | 10151 | 12342 | 6013 | 1804 |
loopy-patch-fast | False | False | 10 | 16089 | 12734 | 16799 | 9680 | 1802 |
loopy-patch-fast | True | True | 10 | 35903 | 78999 | 92227 | 6945 | 1906 |
master | False | False | 10 | 17098 | 12697 | 16287 | 11783 | 1742 |
master(Default) | False | True | 10 | 39990 | 73714 | 86397 | 13032 | 1863 |
As the table above shows, `bayesian_blocks` (which defaults to True) accounts for more than 60% of the running time while producing an almost identical histogram on large data sets. What’s worse, the problem grows as the data set grows: on some data sets the ratio rises above 90%.
Different data sets should be handled differently to be both fast and effective. Otherwise, user experience and ease of use are greatly affected, especially for beginners; even complete and detailed documentation is not enough in this case.
In fact, when running on some large data sets, tweaking the config parameters and using parallel scheduling can save about 75%~95% of the time and produce an almost identical report.
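To make that concrete, here is a minimal sketch of the kind of tweaking meant above, assuming PP is pandas-profiling and that `ProfileReport` accepts config overrides such as `pool_size` and `minimal` as keyword arguments (names and availability vary between versions, so treat this as a sketch rather than the exact API):

```python
import multiprocessing

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("large_dataset.csv")

# Override the defaults that dominate running time on large data:
# use all available cores and skip the most expensive computations.
report = ProfileReport(
    df,
    title="Tuned report",
    pool_size=multiprocessing.cpu_count(),  # parallel scheduling
    minimal=True,                           # cheap, fast default profile
)
report.to_file("report.html")
```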
So I propose these two features below.
Proposed features
- Interactive config widget: since PP is usually used in notebooks, why not create an interactive config widget to help users (especially beginners) build their own config? It would improve the user experience a lot and make PP more convenient (see the first sketch after this list).
- Auto config: when no configuration is specified, generate a config according to the input data (see the second sketch after this list).
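A minimal sketch of the widget idea, assuming ipywidgets is available in the notebook; the two config fields shown (`pool_size`, `bayesian_blocks`) are illustrative, not PP’s actual config schema:

```python
import multiprocessing

import ipywidgets as widgets

# Illustrative config fields; PP's real keys may differ.
pool_size = widgets.IntSlider(
    value=1, min=1, max=multiprocessing.cpu_count(),
    description="pool_size",
)
bayesian_blocks = widgets.Checkbox(value=False, description="bayesian_blocks")

def current_config() -> dict:
    """Collect the widget state into a plain config dict."""
    return {
        "pool_size": pool_size.value,
        "bayesian_blocks": bayesian_blocks.value,
    }

# Display the widget in the notebook cell output.
widgets.VBox([pool_size, bayesian_blocks])
```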
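And a sketch of the ‘Auto config’ idea: derive the config from simple properties of the input data. The 100k-row threshold and the key names are placeholders; choosing real strategies and thresholds is exactly the testing work mentioned below:

```python
import multiprocessing

import pandas as pd

def auto_config(df: pd.DataFrame) -> dict:
    """Pick config values from the shape of the input data."""
    large = len(df) > 100_000  # placeholder threshold
    return {
        # Expensive adaptive binning only pays off on small data sets.
        "bayesian_blocks": not large,
        # Parallelism only pays off when there is enough work per core.
        "pool_size": multiprocessing.cpu_count() if large else 1,
    }
```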
Neither feature is difficult to implement, but ‘Auto config’ requires experience: someone has to run tests and carefully choose the strategies and thresholds.
Additional context
Recently I have been focusing on pipelining the project, tuning performance, and fixing related bugs, so I may not be able to implement these two features for now. That is also why I opened an issue here instead of sending a PR.
Hi, @neomatrix369!
The ‘Config Recommendation System’ you proposed is very promising, and I have had similar thoughts before, which I called ‘auto-config’. The key problem is exactly the one you mention: how do we find out what to recommend? I have thought about it for a long time without finding a proper solution. You have proposed a nice one, but I think it may still not be quite right.
As far as I know, PP is essentially a report-generation tool, and most of the configuration items describe user needs, not run-time parameters. So from the user’s perspective, the config is fixed for a given demand. For example, if I need correlations between variables and want to use A as the rejection threshold, I will still need them no matter how much time or memory the computation takes, and the config should not change.
As a result, a `recommended_configs` strategy may only be applicable to some runtime-related configuration items like `pool_size`. If we add more run-time control parameters later, it may become a nice move. (BTW, I think the root problem with the `bayesian_blocks` issue I mentioned earlier is that the third-party package implementing that feature does not scale on big data sets.)
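To illustrate that restriction, here is a sketch with hypothetical key names: only keys classified as runtime-related are eligible for recommendation, while keys encoding user intent are never overridden:

```python
# Hypothetical classification; PP's real config schema differs.
RUNTIME_KEYS = {"pool_size", "use_dask"}

def apply_recommendations(user_config: dict, recommended: dict) -> dict:
    """Merge recommended values into a config, touching only
    runtime-related items the user did not set explicitly."""
    merged = dict(user_config)
    for key, value in recommended.items():
        if key in RUNTIME_KEYS and key not in user_config:
            merged[key] = value
    return merged
```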
I have found two approaches that may improve the user experience around config:
The PR and the task-graph work are WIP and currently on hold. I am sorry this work is stalled, partly because of some mechanism-selection questions; I am occupied with other work related to computational graphs at the moment. Once I have some time, I will continue the previous work.
Initially we will have to collect data, and the right data; both could come through iterations. The data does not have to come from others: at first it will come from our own setups (machines, environments, etc…). When things mature, we can also gather samples from others to help fine-tune the internal model.
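A sketch of what that data collection could look like, with all names hypothetical: after each run, log the data set’s shape, the config used, and the observed run time, building up the training set for the internal model:

```python
import json
import time
from pathlib import Path

import pandas as pd

LOG_PATH = Path("config_runs.jsonl")  # hypothetical local log file

def log_run(df: pd.DataFrame, config: dict, runtime_s: float) -> None:
    """Append one (data stats, config, runtime) sample to the log."""
    sample = {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "config": config,
        "runtime_s": runtime_s,
        "logged_at": time.time(),
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(sample) + "\n")
```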
I have just put these ideas together after reading your post, so they need further thinking and experimentation, but I have a feeling the outline of the path is more or less right to walk on.
After reading your task-graph resource, I am more of the opinion that it is smart optimisation(s) at the pipeline end that we might need, as opposed to suggesting a single suitable configuration or a list of them.
I am still thinking that the system (whatever we call it, recommender or auto-config) can make suggestions/predictions about: