
Benchmarking design & implementation


Evaluation

  • Extend the evaluation API and functionality to the single-dataset case; currently only the multiple-dataset case is supported (a minimal sketch follows below)
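
A minimal sketch of how the single-dataset case could be supported by normalising it to the existing multiple-dataset path; the function name and the evaluator.evaluate call are hypothetical, not the current API.

```python
def evaluate_datasets(evaluator, datasets, strategies, cv):
    """Run evaluation on a single dataset or on a collection of datasets."""
    # Normalise the single-dataset case to a one-element list so that the
    # multiple-dataset code path can be reused unchanged.
    if not isinstance(datasets, (list, tuple)):
        datasets = [datasets]
    results = {}
    for dataset in datasets:
        name = getattr(dataset, "name", repr(dataset))
        results[name] = evaluator.evaluate(dataset=dataset, strategies=strategies, cv=cv)
    return results
```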

Performance metrics

A few implementation notes:

  • Vectorised vs iterative computations
  • Callable classes vs classes with methods for computation
  • Use of the jackknife by default for non-point-wise metrics
  • Computation of standard error as decorator/mix-in
  • Have separate classes for point-wise metrics which can be wrapped by aggregation functions (e.g. mean); a sketch combining this with a jackknife standard error follows below

Also see https://github.com/JuliaML/LossFunctions.jl.
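
To make these notes concrete, here is a rough sketch of a point-wise metric implemented as a callable class, wrapped by an aggregation function that also reports a jackknife estimate of the standard error. All class and argument names are illustrative only, not the existing evaluator API.

```python
import numpy as np


class PointwiseMetric:
    """Base class for metrics that return one value per observation."""

    def __call__(self, y_true, y_pred):
        raise NotImplementedError


class AbsoluteError(PointwiseMetric):
    def __call__(self, y_true, y_pred):
        return np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))


class AggregateMetric:
    """Wrap a point-wise metric with an aggregation function (default: mean)."""

    def __init__(self, pointwise_metric, aggregate=np.mean):
        self.pointwise_metric = pointwise_metric
        self.aggregate = aggregate

    def __call__(self, y_true, y_pred):
        pointwise = self.pointwise_metric(y_true, y_pred)
        estimate = self.aggregate(pointwise)
        # Jackknife standard error: recompute the aggregate, leaving out one
        # observation at a time.
        n = len(pointwise)
        loo = np.array([self.aggregate(np.delete(pointwise, i)) for i in range(n)])
        stderr = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
        return estimate, stderr


# Usage: mae, mae_se = AggregateMetric(AbsoluteError())(y_true, y_pred)
```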

Orchestration

Should have

  • Allow orchestrator to be persisted to replicate benchmarking studies
  • add unit tests for evaluator methods
  • update all methods on the evaluator to work on the new internal data representation, so that we can deprecate the _get_metrics_per_estimator_dataset and _get_metrics_per_estimator methods and improve readability; also see https://www.statsmodels.org/stable/stats.html for some additional test implementations, e.g. the sign test
  • for saving results inside the orchestrator and for loading results in the results classes, use _ResultsWrapper to simplify/unify the interface; _ResultsWrapper needs slots for at least: y_true, y_pred, y_proba, index, fit_time, predict_time, strategy_name, dataset_name, cv_fold, train_or_test (a sketch follows after this list)
  • no timing of fit and predict is currently available, see https://docs.python.org/3/library/time.html#time.perf_counter; potentially add new save_timings and load_timings methods
  • the orchestrator cannot make probabilistic predictions: it tries to use predict_proba, but (i) this will only work for some but not all classifiers and it won’t work in regression, (ii) strategies currently don’t even have a predict_proba (not even TSCStrategy), and (iii) the current computation of y_proba fails if y_pred contains strings instead of integers, which I believe is an accepted output format for classification; add predict_proba to TSCStrategy (see the sketch after this list)
  • handling of probabilistic metrics in evaluator
  • no longer sure that saving the results object as a master file is a good idea, as it may cause problems when multiple processes try to update it and because it needs to somehow reflect the state of the directory; it may be better to have a method on the results object that allows inferring datasets, strategies and so on (something like a register_results method) instead of loading a fully specified dumped results object
  • separate predict method on orchestrator which loads and uses already fitted strategies
  • fix UEA results class
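
A hedged sketch of the _ResultsWrapper idea and of timing fit/predict via time.perf_counter, as referenced in the list above. The class name comes from this issue; the dataclass layout, the helper function and the fit(X, y)/predict(X) interface it assumes are assumptions, not the current implementation.

```python
from dataclasses import dataclass
from time import perf_counter
from typing import Any, Optional


@dataclass
class _ResultsWrapper:
    """Container for the results of a single strategy/dataset/fold run."""

    y_true: Any
    y_pred: Any
    y_proba: Optional[Any]
    index: Any
    fit_time: float
    predict_time: float
    strategy_name: str
    dataset_name: str
    cv_fold: int
    train_or_test: str


def _fit_predict_timed(strategy, X_train, y_train, X_test):
    """Fit a strategy and predict on the test set, recording wall-clock timings."""
    start = perf_counter()
    strategy.fit(X_train, y_train)
    fit_time = perf_counter() - start

    start = perf_counter()
    y_pred = strategy.predict(X_test)
    predict_time = perf_counter() - start
    return y_pred, fit_time, predict_time
```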
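
And a sketch of how the orchestrator could handle probabilistic predictions defensively, covering points (i)–(iii) above: call predict_proba only when the wrapped estimator actually provides it, otherwise fall back to a one-hot encoding of y_pred, which also works when the predicted labels are strings. The function name and the strategy/estimator attributes are assumptions.

```python
import numpy as np


def _predict_proba_or_fallback(strategy, X_test, y_pred):
    """Return predicted probabilities, or degenerate 0/1 probabilities from y_pred."""
    estimator = getattr(strategy, "estimator", strategy)
    if hasattr(estimator, "predict_proba"):
        try:
            return estimator.predict_proba(X_test)
        except (AttributeError, NotImplementedError):
            pass
    # Fallback: one-hot encode the hard predictions; np.unique returns the
    # sorted class labels, so this also works for string-valued y_pred.
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_pred)
    proba = np.zeros((len(y_pred), len(classes)))
    proba[np.arange(len(y_pred)), np.searchsorted(classes, y_pred)] = 1.0
    return proba
```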

Could have

  • allow for pre-defined cv splits in files
  • allow for pre-defined tasks in files
  • add random_state as input arg to orchestrator which is propagated to all strategies and cv
  • perhaps also useful to catch exceptions in the orchestrator and skip over them instead of breaking execution (see the sketch after this list)?
  • currently only works for the ts data input format; add other use cases
  • better user feedback, logging, keeping track of progress
  • many docstrings still missing or outdated
  • perhaps metrics shouldn’t be wrapped in classes and the evaluator should take care of it internally, working with kwargs (e.g. pointwise=True)
  • handling of multiple metrics in evaluator
  • functionality for space-filling parameter grids for large hyper-parameter search spaces (e.g. latin hypercube design); see this Python package: https://github.com/tirthajyoti/doepy (a sketch follows after this list)
  • monitoring and comparison of memory usage of different estimators
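
For the “catch exceptions and skip” idea above, a minimal sketch of what the orchestrator’s main loop could look like; the loop structure, the fit_predict method and the argument names are assumptions, not the existing code.

```python
import logging

logger = logging.getLogger(__name__)


def run_all(orchestrator, tasks, strategies, skip_exceptions=True):
    """Run all strategy/task combinations, optionally skipping over failures."""
    for task in tasks:
        for strategy in strategies:
            try:
                orchestrator.fit_predict(task, strategy)
            except Exception:
                if not skip_exceptions:
                    raise
                logger.exception("Skipping %s on %s after error", strategy, task)
```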
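
For the space-filling parameter grids, a small sketch using a Latin hypercube design via scipy.stats.qmc (SciPy >= 1.7) instead of the doepy package linked above; the parameter names and bounds are made up for illustration.

```python
from scipy.stats import qmc


def latin_hypercube_grid(param_bounds, n_samples, seed=None):
    """Sample n_samples parameter configurations from the box given by param_bounds."""
    names = list(param_bounds)
    lower = [param_bounds[name][0] for name in names]
    upper = [param_bounds[name][1] for name in names]
    sampler = qmc.LatinHypercube(d=len(names), seed=seed)
    sample = qmc.scale(sampler.random(n_samples), lower, upper)
    return [dict(zip(names, row)) for row in sample]


# Example: 10 candidate configurations for two hypothetical hyper-parameters.
grid = latin_hypercube_grid({"window_length": (5, 50), "alpha": (0.0, 1.0)}, n_samples=10, seed=42)
```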

Related issues/PRs: #132


Top GitHub Comments

ngupta23 commented on Jul 5, 2021

“Pointing to business needs is not a guarantee that whatever argument follows is correct.”

Understood, I just wanted to provide more context on what I was trying to do so I could explain the rest, but looks like it complicated things further 😃.

“Can you kindly explain in more detail why you think a design like evaluate(ModelSelect(models = [model1, model2, etc], cv), backtest_cv, fh) is not sufficient for your needs?”

Is this not doing the same thing as what I proposed, just replacing the workflow_func with a ModelSelect class? If so, are you proposing to convert the workflow_func into a ModelSelect class instead?

In the end, from pycaret’s perspective, backtesting needs to work with pycaret’s flow, which uses create_model, compare_models, finalize_model, etc., and which takes care of avoiding overestimation and broken models 😃 This counter-proposal may be too restrictive in that sense.

I am looping in @Yard1 and @pycaret so they are aware of this discussion. Maybe this is something to discuss in a future combined design session.

mloning commented on Jun 30, 2021

@TonyBagnall we could turn this into an enhancement proposal, that may be the better place for this. Any objections?
