Benchmarking design & implementation
Evaluation
- Extend the evaluation API and functionality to the single-dataset case; currently only the multiple-dataset case is supported
Performance metrics
A few implementation notes:
- Vectorised vs iterative computations
- Callable classes vs classes with methods for computation
- Use of jackknife by default for non point-wise metrics
- Computation of standard error as decorator/mix-in
- Have separate classes for point-wise metrics which can be wrapped by aggregation functions (e.g. mean); see the sketch below
Also see https://github.com/JuliaML/LossFunctions.jl.
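To illustrate the last few notes, here is a minimal sketch of a point-wise metric class wrapped by an aggregation function, with a jackknife estimate of the standard error. The names `PointwiseMetric`, `SquaredError` and `aggregate_with_se` are hypothetical, not part of the existing code base.

```python
import numpy as np

# Hypothetical sketch: class and function names are illustrative only.


class PointwiseMetric:
    """Base class for metrics computed per observation."""

    def __call__(self, y_true, y_pred):
        raise NotImplementedError


class SquaredError(PointwiseMetric):
    """Point-wise squared error."""

    def __call__(self, y_true, y_pred):
        return (np.asarray(y_true) - np.asarray(y_pred)) ** 2


def aggregate_with_se(metric, y_true, y_pred, aggregate=np.mean):
    """Aggregate a point-wise metric and jackknife its standard error."""
    losses = metric(y_true, y_pred)
    estimate = aggregate(losses)

    # leave-one-out jackknife over the point-wise losses
    n = len(losses)
    loo = np.array([aggregate(np.delete(losses, i)) for i in range(n)])
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return estimate, se


# usage: mean squared error with a jackknife standard error
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
mse, mse_se = aggregate_with_se(SquaredError(), y_true, y_pred)
```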
Orchestration
Should have
- Allow orchestrator to be persisted to replicate benchmarking studies
- add unit tests for `evaluator` methods
- update all methods on `evaluator` to work on the new internal data representation; also see https://www.statsmodels.org/stable/stats.html for some additional test implementations, e.g. the sign test; improve readability so that we can deprecate the `_get_metrics_per_estimator_dataset` and `_get_metrics_per_estimator` methods
- for saving results inside the `orchestrator` and for loading results in results classes, use `_ResultsWrapper` to simplify/unify the interface; `_ResultsWrapper` needs to have slots for at least: y_true, y_pred, y_proba, index, fit_time, predict_time, strategy_name, dataset_name, cv_fold, train_or_test (see the first sketch after this list)
- No timing of fit and predict is available, see https://docs.python.org/3/library/time.html#time.perf_counter; potentially add new `save_timings` and `load_timings` methods (also covered in the first sketch after this list)
- `orchestrator` cannot make probabilistic predictions: it tries to make probabilistic predictions using `predict_proba`, but (i) this will only work for some but not all classifiers and it won't work in regression, (ii) strategies currently don't even have a `predict_proba` (not even `TSCStrategy`), and (iii) the current computation of `y_proba` fails if `y_pred` contains strings instead of integers, which I believe is an accepted output format for classification; add `predict_proba` to `TSCStrategy` (see the second sketch after this list)
- handling of probabilistic metrics in `evaluator`
- no longer sure that saving the results object as a master file is a good idea, as it may cause problems when multiple processes try to update it and because it needs to reflect the state of the directory somehow; maybe better to have a method on the results object that allows inferring datasets, strategies and so on, something like a `register_results` method, instead of loading a fully specified dumped results object
- separate `predict` method on `orchestrator` which loads and uses already fitted strategies
- fix UEA results class
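As a first sketch for the `_ResultsWrapper` and timing items above: the field names follow the slots listed in the item, while the `_fit_and_predict` helper and the plain fit/predict interface it assumes are illustrative only, not an existing API.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch: field names follow the slots listed above; the helper
# below assumes a plain fit/predict interface purely for illustration.


@dataclass
class _ResultsWrapper:
    y_true: Any
    y_pred: Any
    y_proba: Optional[Any]
    index: Any
    fit_time: float
    predict_time: float
    strategy_name: str
    dataset_name: str
    cv_fold: int
    train_or_test: str


def _fit_and_predict(strategy, X_train, y_train, X_test):
    """Fit and predict, timing both steps with time.perf_counter."""
    start = time.perf_counter()
    strategy.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = strategy.predict(X_test)
    predict_time = time.perf_counter() - start

    return y_pred, fit_time, predict_time
```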
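And a second sketch of how probabilistic predictions could be guarded when an estimator lacks `predict_proba` or when `y_pred` contains string labels; the `_get_y_proba` helper is hypothetical.

```python
import numpy as np

# Hypothetical sketch: `_get_y_proba` is an illustrative helper, not an existing method.


def _get_y_proba(estimator, X_test, classes=None):
    """Return class probabilities, falling back to one-hot hard predictions."""
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X_test)

    y_pred = estimator.predict(X_test)
    if classes is None:
        classes = np.unique(y_pred)

    # map labels (possibly strings) to column indices instead of indexing with them
    class_to_col = {c: i for i, c in enumerate(classes)}
    y_proba = np.zeros((len(y_pred), len(classes)))
    for row, label in enumerate(y_pred):
        y_proba[row, class_to_col[label]] = 1.0
    return y_proba
```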
Could have
- allow for pre-defined cv splits in files
- allow for pre-defined tasks in files
- add `random_state` as an input arg to the orchestrator which is propagated to all strategies and cv
- perhaps also useful to catch exceptions and skip over them in the `orchestrator` instead of breaking execution? (see the first sketch after this list)
- currently only works for the ts data input format, add other use cases
- better user feedback, logging, keeping track of progress
- many docstrings still missing or outdated
- perhaps metrics shouldn't be wrapped in classes and the evaluator should take care of it internally, working with kwargs (e.g. `pointwise=True`)
- handling of multiple metrics in `evaluator`
- functionality for space-filling parameter grids for large hyper-parameter search spaces (e.g. Latin hypercube design), see this Python package: https://github.com/tirthajyoti/doepy (see the second sketch after this list)
- monitoring and comparison of memory usage of different estimators
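As a first sketch for the exception-handling item above, assuming a hypothetical `run_one` callable that fits and evaluates one strategy/dataset combination:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch: `run_one` stands in for whatever fits and evaluates a
# single strategy/dataset combination inside the orchestrator.


def _run_all(strategies, datasets, run_one):
    """Run all combinations, logging and skipping failures instead of aborting."""
    failures = []
    for strategy in strategies:
        for dataset in datasets:
            try:
                run_one(strategy, dataset)
            except Exception:  # deliberately broad: any failure is skipped
                logger.exception("skipping %r on %r", strategy, dataset)
                failures.append((strategy, dataset))
    return failures
```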
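And a second sketch of a space-filling parameter grid using SciPy's Latin hypercube sampler (`scipy.stats.qmc`, available in SciPy >= 1.7), as an alternative to the linked doepy package; parameter names and bounds are illustrative.

```python
from scipy.stats import qmc

# Illustrative parameter names and bounds; the doepy package linked above
# offers similar experimental designs.
param_bounds = {
    "learning_rate": (0.001, 0.3),
    "n_estimators": (50, 500),
}

sampler = qmc.LatinHypercube(d=len(param_bounds), seed=42)
unit_samples = sampler.random(n=20)  # 20 points in the unit hypercube

lower = [low for low, _ in param_bounds.values()]
upper = [high for _, high in param_bounds.values()]
scaled = qmc.scale(unit_samples, lower, upper)

# round integer-valued parameters as needed before passing them to estimators
param_grid = [dict(zip(param_bounds, row)) for row in scaled]
```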
Related issues/PRs: #132
Top GitHub Comments
Understood, I just wanted to provide more context on what I was trying to do so I could explain the rest, but it looks like it complicated things further 😃.
Is this not doing the same thing as what I proposed, just replacing the `workflow_func` with a `ModelSelect` class? If so, are you proposing converting the `workflow_func` into a `ModelSelect` class instead?
In the end, from pycaret's perspective, back testing needs to work with pycaret's flow which uses `create_model`, `compare_models`, `finalize_model`, etc., which takes care of avoiding overestimation and broken models 😃 This counter proposal may be too restrictive in that sense.
I am looping in @Yard1 and @pycaret so they are aware of this discussion. Maybe this is something to discuss in a future combined design session.
@TonyBagnall we could turn this into an enhancement proposal; that may be the better place for this. Any objections?