[Question] How to get the number of runs executed in SMAC for each pipeline?
I can use `runcount_limit` to limit the number of runs in SMAC for each pipeline:
```python
automl = AutoSklearnClassifier(
    smac_scenario_args={"runcount_limit": 1000},
)
```
Is it possible to get the number of executed runs in SMAC for each pipeline?
Or am I misunderstanding the meaning of runcount_limit?
Any comments are highly appreciated.
These initial configurations are just meta-learned configurations that we use to provide some initial points SMAC should evaluate, to get some information for its surrogate model. This places no constraint on the actual running of auto-sklearn, though; it is just a set of initial configurations to try.
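As a concrete illustration (a minimal sketch; `initial_configurations_via_metalearning` is the constructor argument that controls this):

```python
from autosklearn.classification import AutoSklearnClassifier

# Number of meta-learned configurations handed to SMAC as warm-start
# points; setting it to 0 makes SMAC start purely from its own sampling.
automl = AutoSklearnClassifier(initial_configurations_via_metalearning=25)
```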
For your second question, part b): no, there is not, but that would be a useful feature and I have added it as issue #1470.
For the second question, part a): you can use the dataframe provided by `leaderboard()` to extract all the information you need (see the sketch below). However, it would be good to have a handier solution for this! For any follow-up questions, please create a new issue 😃
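For illustration, a minimal sketch of that approach, assuming a fitted classifier named `automl` (the `ensemble_only` and `detailed` flags are as in recent auto-sklearn versions):

```python
# leaderboard() returns a pandas DataFrame with one row per model;
# ensemble_only=False includes every evaluated pipeline, not only the
# ones that made it into the final ensemble.
board = automl.leaderboard(ensemble_only=False, detailed=True)
print("Pipelines evaluated by SMAC:", len(board))
print(board.head())
```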
Hi @jmren168,
So the main categories are holdout and cv, each having an iterative flavour, which limits the available configurations to a subset of algorithms: those supporting an iterative method of fitting. Unless you have specified something else, the default is `"holdout"`; check out `resampling_strategy`. I think #428 was a misunderstanding; we do not have that feature. The way to run a single pipeline more than once is to use a `"cv"` resampling strategy.

In the case of a non-iterative flavour of the holdout resampling strategy (the default), each pipeline gets evaluated once by SMAC.
In the case of a non-iterative flavour of the cv resampling strategy, each pipeline will be evaluated on a certain number of folds, as in normal cross-validation. The pipeline will be trained on the different folds and the resulting pipelines accumulated together with something like sklearn's `VotingClassifier` or `VotingRegressor`. By default it is evaluated with `resampling_strategy_arguments={"folds": 5}`.

In the case of an *"iterative"* flavour of `resampling_strategy`, each pipeline can get evaluated more than once by SMAC. That is dependent upon SMAC and its scheduling, for which @mfeurer might be able to give a better answer.

Considering you haven't mentioned the "iterative" or "cv" resampling strategy, I will assume you are in case 1, where `runcount_limit` means how many pipelines are evaluated, each being evaluated once. (The three cases are sketched in code below.)
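To make the three cases concrete, a minimal sketch of how each strategy is selected (constructor arguments as documented by auto-sklearn; treat the exact values as version-dependent):

```python
from autosklearn.classification import AutoSklearnClassifier

# Case 1 (default): plain holdout. Each pipeline is evaluated exactly once,
# so runcount_limit caps the number of distinct pipelines tried.
case1 = AutoSklearnClassifier(
    resampling_strategy="holdout",
    smac_scenario_args={"runcount_limit": 1000},
)

# Case 2: non-iterative cv. Each configuration is fit on every fold and the
# per-fold models are combined, Voting-style, into one pipeline.
case2 = AutoSklearnClassifier(
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
)

# Case 3: an iterative flavour, where SMAC may revisit a pipeline and train
# it for further iterations.
case3 = AutoSklearnClassifier(resampling_strategy="holdout-iterative-fit")
```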
The `performance_over_time_` data is related to the final ensemble built by autosklearn. This ensemble building is interleaved with pipeline evaluation: evaluate a pipeline, build an ensemble, evaluate another pipeline, build an ensemble, and so on. The `runcount_limit` and the number of ensembles built are therefore not in one-to-one correspondence, which explains the difference between the `Number of target algorithm runs` and the `single best optimization score`.
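A quick way to see both quantities, assuming a fitted estimator named `automl` (a sketch; the exact columns of `performance_over_time_` can differ between versions):

```python
# sprint_statistics() reports, among other things, the
# "Number of target algorithm runs" line mentioned above.
print(automl.sprint_statistics())

# performance_over_time_ is a pandas DataFrame of scores over wall-clock
# time, reflecting the interleaved ensemble builds rather than one row
# per SMAC run.
perf = automl.performance_over_time_
print(len(perf), "score snapshots recorded")
```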
You are also correct that `num_run` is essentially a `config_id`; this is perhaps something we should change to be more reflective of what it is.

I hope this helped answer some questions 😃
Best, Eddie