Benchmarking design & implementation
Evaluation
- Extend the evaluation API and functionality to the single-dataset case; currently only the multiple-dataset case is supported
Performance metrics
A few implementation notes:
- Vectorised vs iterative computations
- Callable classes vs classes with methods for computation
- Use of jackknife by default for non point-wise metrics
- Computation of standard error as decorator/mix-in
- Have separate classes for point-wise metrics which can be wrapped by aggregation functions (e.g. mean); see the sketch below
Also see https://github.com/JuliaML/LossFunctions.jl.
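To illustrate the last few notes, here is a minimal sketch of a point-wise metric class wrapped by an aggregation function, with a jackknife estimate of the standard error. The names `PointwiseMetric`, `SquaredError` and `aggregate_with_se` are hypothetical, not part of the existing code base.

```python
import numpy as np

# Hypothetical sketch: class and function names are illustrative only.


class PointwiseMetric:
    """Base class for metrics computed per observation."""

    def __call__(self, y_true, y_pred):
        raise NotImplementedError


class SquaredError(PointwiseMetric):
    """Point-wise squared error."""

    def __call__(self, y_true, y_pred):
        return (np.asarray(y_true) - np.asarray(y_pred)) ** 2


def aggregate_with_se(metric, y_true, y_pred, aggregate=np.mean):
    """Aggregate a point-wise metric and jackknife its standard error."""
    losses = metric(y_true, y_pred)
    estimate = aggregate(losses)

    # leave-one-out jackknife over the point-wise losses
    n = len(losses)
    loo = np.array([aggregate(np.delete(losses, i)) for i in range(n)])
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return estimate, se


# usage: mean squared error with a jackknife standard error
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
mse, mse_se = aggregate_with_se(SquaredError(), y_true, y_pred)
```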
Orchestration
Should have
- Allow orchestrator to be persisted to replicate benchmarking studies
- add unit tests for `evaluator` methods
- update all methods on `evaluator` to work on the new internal data representation; also see https://www.statsmodels.org/stable/stats.html for some additional test implementations, e.g. the sign test; improve readability so that we can deprecate the `_get_metrics_per_estimator_dataset` and `_get_metrics_per_estimator` methods
- for saving results inside the `orchestrator` and for loading results in results classes, use `_ResultsWrapper` to simplify/unify the interface; `_ResultsWrapper` needs to have slots for at least: y_true, y_pred, y_proba, index, fit_time, predict_time, strategy_name, dataset_name, cv_fold, train_or_test (see the first sketch after this list)
- No timing of fit and predict is available, see https://docs.python.org/3/library/time.html#time.perf_counter; potentially add new `save_timings` and `load_timings` methods (also covered in the first sketch after this list)
- `orchestrator` cannot make probabilistic predictions: it tries to make probabilistic predictions using `predict_proba`, but (i) this will only work for some but not all classifiers and it won't work in regression, (ii) strategies currently don't even have a `predict_proba` (not even `TSCStrategy`), and (iii) the current computation of `y_proba` fails if `y_pred` contains strings instead of integers, which I believe is an accepted output format for classification; add `predict_proba` to `TSCStrategy` (see the second sketch after this list)
- handling of probabilistic metrics in `evaluator`
- no longer sure that saving the results object as a master file is a good idea, as it may cause problems when multiple processes try to update it and because it needs to reflect the state of the directory somehow; maybe better to have a method on the results object that allows inferring datasets, strategies and so on, something like a `register_results` method, instead of loading a fully specified dumped results object
- separate `predict` method on `orchestrator` which loads and uses already fitted strategies
- fix UEA results class
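As a first sketch for the `_ResultsWrapper` and timing items above: the field names follow the slots listed in the item, while the `_fit_and_predict` helper and the plain fit/predict interface it assumes are illustrative only, not an existing API.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch: field names follow the slots listed above; the helper
# below assumes a plain fit/predict interface purely for illustration.


@dataclass
class _ResultsWrapper:
    y_true: Any
    y_pred: Any
    y_proba: Optional[Any]
    index: Any
    fit_time: float
    predict_time: float
    strategy_name: str
    dataset_name: str
    cv_fold: int
    train_or_test: str


def _fit_and_predict(strategy, X_train, y_train, X_test):
    """Fit and predict, timing both steps with time.perf_counter."""
    start = time.perf_counter()
    strategy.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = strategy.predict(X_test)
    predict_time = time.perf_counter() - start

    return y_pred, fit_time, predict_time
```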
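And a second sketch of how probabilistic predictions could be guarded when an estimator lacks `predict_proba` or when `y_pred` contains string labels; the `_get_y_proba` helper is hypothetical.

```python
import numpy as np

# Hypothetical sketch: `_get_y_proba` is an illustrative helper, not an existing method.


def _get_y_proba(estimator, X_test, classes=None):
    """Return class probabilities, falling back to one-hot hard predictions."""
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X_test)

    y_pred = estimator.predict(X_test)
    if classes is None:
        classes = np.unique(y_pred)

    # map labels (possibly strings) to column indices instead of indexing with them
    class_to_col = {c: i for i, c in enumerate(classes)}
    y_proba = np.zeros((len(y_pred), len(classes)))
    for row, label in enumerate(y_pred):
        y_proba[row, class_to_col[label]] = 1.0
    return y_proba
```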
Could have
- allow for pre-defined cv splits in files
- allow for pre-defined tasks in files
- add `random_state` as an input arg to the orchestrator which is propagated to all strategies and cv
- perhaps also useful to catch exceptions and skip over them in the `orchestrator` instead of breaking execution? (see the first sketch after this list)
- currently only works for the ts data input format, add other use cases
- better user feedback, logging, keeping track of progress
- many docstrings still missing or outdated
- perhaps metrics shouldn't be wrapped in classes and the evaluator should take care of it internally, working with kwargs (e.g. `pointwise=True`)
- handling of multiple metrics in `evaluator`
- functionality for space-filling parameter grids for large hyper-parameter search spaces (e.g. Latin hypercube design), see this Python package: https://github.com/tirthajyoti/doepy (see the second sketch after this list)
- monitoring and comparison of memory usage of different estimators
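As a first sketch for the exception-handling item above, assuming a hypothetical `run_one` callable that fits and evaluates one strategy/dataset combination:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch: `run_one` stands in for whatever fits and evaluates a
# single strategy/dataset combination inside the orchestrator.


def _run_all(strategies, datasets, run_one):
    """Run all combinations, logging and skipping failures instead of aborting."""
    failures = []
    for strategy in strategies:
        for dataset in datasets:
            try:
                run_one(strategy, dataset)
            except Exception:  # deliberately broad: any failure is skipped
                logger.exception("skipping %r on %r", strategy, dataset)
                failures.append((strategy, dataset))
    return failures
```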
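And a second sketch of a space-filling parameter grid using SciPy's Latin hypercube sampler (`scipy.stats.qmc`, available in SciPy >= 1.7), as an alternative to the linked doepy package; parameter names and bounds are illustrative.

```python
from scipy.stats import qmc

# Illustrative parameter names and bounds; the doepy package linked above
# offers similar experimental designs.
param_bounds = {
    "learning_rate": (0.001, 0.3),
    "n_estimators": (50, 500),
}

sampler = qmc.LatinHypercube(d=len(param_bounds), seed=42)
unit_samples = sampler.random(n=20)  # 20 points in the unit hypercube

lower = [low for low, _ in param_bounds.values()]
upper = [high for _, high in param_bounds.values()]
scaled = qmc.scale(unit_samples, lower, upper)

# round integer-valued parameters as needed before passing them to estimators
param_grid = [dict(zip(param_bounds, row)) for row in scaled]
```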
Related issues/PRs: #132
Top GitHub Comments
Understood, I just wanted to provide more context on what I was trying to do so I could explain the rest, but it looks like it complicated things further 😃.
Is this not doing the same thing as what I proposed, just replacing the `workflow_func` with a `ModelSelect` class? If so, are you proposing converting the `workflow_func` into a `ModelSelect` class instead?
In the end, from pycaret's perspective, back testing needs to work with pycaret's flow which uses `create_model`, `compare_models`, `finalize_model`, etc., which takes care of avoiding overestimation and broken models 😃 This counter proposal may be too restrictive in that sense.
I am looping in @Yard1 and @pycaret so they are aware of this discussion. Maybe this is something to discuss in a future combined design session.
@TonyBagnall we could turn this into an enhancement proposal; that may be the better place for this. Any objections?