
Missing functionality

Configuration is always a big problem for me.

When I was a beginner with PP, I didn’t know how to set the various parameters for my data set and ML use case, and the time and memory consumed were sometimes unacceptable.

Eventually I read all of the code and became familiar with every implementation and configuration option, so now I can choose an efficient configuration. But you can’t expect every user to take this approach just to configure their own case.

Some friends of mine always complain about how slow PP is, but I find that PP itself is actually not that slow; it’s the constant configuration that makes PP slow.

What’s worse, some of the default config values become problems when I try to tune performance. For example, here are some running-time results from performance tests on a dual-core server (the benchmark generates HTML reports on some commonly used data sets):

| Branch           | Use_dask | bayesian_blocks | Repeat | Benchmark1 (ms) | Benchmark2 (ms) | Benchmark3 (ms) | Benchmark4 (ms) | Benchmark5 (ms) |
|------------------|----------|-----------------|--------|-----------------|-----------------|-----------------|-----------------|-----------------|
| loopy-patch-fast | True     | False           | 10     | 5361            | 10151           | 12342           | 6013            | 1804            |
| loopy-patch-fast | False    | False           | 10     | 16089           | 12734           | 16799           | 9680            | 1802            |
| loopy-patch-fast | True     | True            | 10     | 35903           | 78999           | 92227           | 6945            | 1906            |
| master           | False    | False           | 10     | 17098           | 12697           | 16287           | 11783           | 1742            |
| master (default) | False    | True            | 10     | 39990           | 73714           | 86397           | 13032           | 1863            |

As the table above shows, bayesian_blocks (default: True) accounts for more than 60% of the running time while producing an almost identical histogram on large data sets. Worse, the problem grows as the data set grows; on some data sets the ratio exceeds 90%.
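The overhead can be checked directly from the two master rows of the table above, which differ only in the bayesian_blocks setting; a quick sketch:

```python
# Benchmark times (ms) from the two master rows in the table above.
without_bb = [17098, 12697, 16287, 11783, 1742]  # master, bayesian_blocks=False
with_bb    = [39990, 73714, 86397, 13032, 1863]  # master (default), bayesian_blocks=True

# Fraction of total run time attributable to bayesian_blocks, per benchmark.
shares = [(w - wo) / w for wo, w in zip(without_bb, with_bb)]
print([round(s, 2) for s in shares])
```

On the first three benchmarks (the larger data sets), the share is roughly 57–83%, matching the claim above; on the small ones it is negligible.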

Different data sets should be handled differently to be both fast and effective. Otherwise, user experience and ease of use suffer greatly, especially for beginners; even complete and detailed documentation is not enough in that case.

In fact, when running on some large data sets, tweaking the config parameters and using parallel scheduling can save about 75–95% of the time while producing an almost identical report.

So I propose these two features below.

Proposed feature

  • Interactive config widget: since PP is mostly used in notebooks, why not create an interactive config widget to help users (especially beginners) build their own config? It would improve the user experience a lot and make PP much more convenient.
  • Auto config: when no configuration is specified, generate one automatically from the input data.
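A minimal sketch of the widget idea, assuming ipywidgets; the function names and config keys are illustrative, not part of PP:

```python
def build_config(use_dask=False, bayesian_blocks=False, pool_size=2):
    """Map widget values to a PP-style config dict (keys are illustrative)."""
    return {"use_dask": use_dask, "bayesian_blocks": bayesian_blocks, "pool_size": pool_size}

def show_config_widget():
    """Render the interactive controls in a notebook (needs ipywidgets)."""
    from ipywidgets import interact  # imported lazily; notebook-only dependency
    interact(build_config, use_dask=False, bayesian_blocks=False, pool_size=(1, 16))
```

In a notebook, calling `show_config_widget()` would give checkboxes and a slider, and the resulting dict could be fed to the report generator.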

Neither feature is difficult to implement, but ‘Auto config’ requires some experience: one needs to run tests and carefully choose the strategies and thresholds.
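As a sketch of what ‘Auto config’ might look like, here is a simple size-based heuristic; the thresholds, keys, and function name are all illustrative assumptions that would need to be tuned by the tests just mentioned:

```python
def auto_config(n_rows, n_cols):
    """Choose run-time settings from the data shape.

    All thresholds and keys here are illustrative guesses.
    """
    large = n_rows * n_cols > 1_000_000  # guessed cut-off for "large" data
    return {
        "bayesian_blocks": not large,  # skip the slow binning on large data
        "use_dask": large,             # parallel scheduling only when it pays off
        "pool_size": 4 if large else 1,
    }

config = auto_config(n_rows=2_000_000, n_cols=20)
```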

Additional context

Recently I have been focusing on pipelining the project, performance tuning, and fixing related bugs, so I may not be able to implement these two features in the near term. That is also why I left an issue here instead of sending a PR.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
loopyme commented, Jul 22, 2020

Hi, @neomatrix369!

The ‘Config Recommendation System’ you proposed is very promising, and I’ve had similar thoughts before, which I called ‘auto-config’. The key problem, just as you mentioned (“how do we find out what to recommend?”), is one I’ve also thought about for a long time without finding a proper solution. You have proposed a nice one, but I think it may still not be quite the right one.

As far as I know, PP is essentially a report-generation tool, and most of its configuration items describe user needs rather than run-time parameters. So from the user’s perspective, the config is fixed for a given demand. For example, if I need correlations between variables and want to use A as the rejection threshold, I will still need them no matter how much time or memory the computation takes, and the config should not change.

As a result, a recommended_configs strategy may only be applicable to run-time-related configuration items like pool_size. If we add more run-time control parameters later, it could become a nice move.

(BTW, I think the root problem with the ‘bayesian_blocks’ issue I mentioned earlier is that the third-party package implementing this feature does not scale to big data sets.)


I have found two ways that may improve the user experience around config:

  • Config widget, as mentioned in this issue, which I think will guide users to express their needs accurately. (See #477)
  • Task scheduling system, which can avoid redundant calculations for a given config and support more fine-grained configuration options. (See Task-Graph; it may not apply to PP directly, but similar ideas can inform subsequent adjustments.)
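To illustrate the task-scheduling idea, here is a toy dependency-graph evaluator that caches node results so shared dependencies (like loading the data) are computed only once; the node names and structure are purely illustrative:

```python
def run_tasks(tasks, target, cache=None):
    """Evaluate `target` in a tiny dependency graph, caching every node's
    result so shared dependencies are computed only once."""
    if cache is None:
        cache = {}
    if target in cache:
        return cache[target]
    deps, fn = tasks[target]
    args = [run_tasks(tasks, dep, cache) for dep in deps]
    cache[target] = fn(*args)
    return cache[target]

calls = []  # records how often each node actually runs

def traced(name, fn):
    def wrapper(*args):
        calls.append(name)
        return fn(*args)
    return wrapper

tasks = {
    "load":      ([],                     traced("load", lambda: [1, 2, 3])),
    "stats":     (["load"],               traced("stats", sum)),
    "histogram": (["load"],               traced("histogram", len)),
    "report":    (["stats", "histogram"], traced("report", lambda s, h: (s, h))),
}

result = run_tasks(tasks, "report")  # "load" runs once despite two consumers
```

A real scheduler would add hashing of inputs and persistence across runs, but the dedup principle is the same.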

The PR and the task-graph work are WIP and currently on hold. I am sorry this work has stalled, partly because of some mechanism-selection questions and partly because I am occupied with other work related to computational graphs at the moment. Once I have some time, I will continue the previous work.

0 reactions
neomatrix369 commented, Jul 22, 2020

> how do we find out what to recommend? I’ve also thought about it for a long time and could not find a proper solution till now. You have proposed a nice one, but I think it may still not be quite the right one.

Interesting that you have meandered along a similar path. This is a very raw idea and needs some PoC and exploratory work before we can nail it down; hence a feature-flagged approach will help.

Initially we will have to collect data, and the right data at that; both could come through iterations. It does not have to come from others: at first it will come from our own setups (machines, environments, etc.). As things mature, we can also collect samples from others to help fine-tune the internal model.

I have just put together some ideas after reading your post, so they need further thinking and experimentation, but I have a feeling the outline of the path is more or less fine to walk.

After reading your task-graph resource, I am more of the opinion that it is smart optimisation(s) at the pipeline end that we might need to make, as opposed to suggesting a single configuration or a list of suitable ones.

I’m still thinking that the system (whatever we call it, recommender or auto-config) can make suggestions/predictions about:

  • the time it would take to generate the reports (with an acceptable level of error) per configuration (based on its past experience)
  • the accuracy of the reports produced (again, per config)
  • its own accuracy (for each assessment it makes)
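A first cut at the time prediction could be a nearest-sample estimate scaled linearly in row count; the function, its history format, and the linear-scaling assumption are all hypothetical:

```python
def predict_time(history, n_rows, config_key):
    """Estimate run time for (n_rows, config_key) from past runs.

    history: list of (n_rows, config_key, seconds) samples from earlier runs.
    Returns None when no sample with the same config exists.
    """
    same = [(rows, secs) for rows, key, secs in history if key == config_key]
    if not same:
        return None
    rows, secs = min(same, key=lambda sample: abs(sample[0] - n_rows))
    return secs * n_rows / rows  # naive linear scaling in row count

history = [(1_000, "fast", 2.0), (10_000, "fast", 18.0)]
estimate = predict_time(history, 5_000, "fast")
```

Comparing the estimate with the actual measured time after each run would also give the "its own accuracy" signal from the last bullet above.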