Continuously log models with wandb
🚀 Feature request
The wandb integration currently logs only the last model (which can be the best one when using `TrainingArguments.load_best_model_at_end`).
It would be great to also allow continuous upload of the model as artifact versions with appropriate aliases.
Options would be (a parsing sketch follows the list):
* `WANDB_LOG_MODEL = True`, which just logs at the end as currently (not sure if we want to add the scheduler and optimizer)
* `WANDB_LOG_MODEL = 'all'`, which logs the model continuously
* `WANDB_LOG_MODEL = False`, which does not log the model
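A minimal sketch of how the callback could interpret the variable, assuming we keep the current environment-variable convention (the exact accepted values are up for discussion):

```python
import os

# Hypothetical parsing of WANDB_LOG_MODEL inside the wandb callback.
# "true"  -> log the model once at the end of training (current behavior)
# "all"   -> log a new artifact version at every checkpoint
# "false" -> never log the model (default)
value = os.getenv("WANDB_LOG_MODEL", "false").lower()
log_model_at_end = value == "true"
log_model_every_save = value == "all"
```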
Motivation
Training can be very long and it would be so sad to lose a model 😭
Your contribution
I can probably propose a PR but would love brainstorming on the ideal logic:
- should we leverage `Trainer.save_model` (as currently) or `Trainer._save_checkpoint`?
- should we consider an artifact version as containing only the model & config, or also the optimizer and scheduler? Or should these actually be 2 separate artifacts?
- if we leverage `on_save`, can we avoid the current logic (a fake trainer saving to a temporary directory that is then uploaded asynchronously) and just upload an actual copy of what has been saved? We would only need the path or the list of files that were saved (should be straightforward). See the sketch after this list.
- if we log the model continuously, should we upload it only when it has improved (when `metric_for_best_model` is defined)? If so, we need to be able to detect when that happens. If not, we still need to be able to know which version is the best.
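To make the `on_save` idea concrete, here is a minimal sketch of a standalone `TrainerCallback` that uploads the checkpoint the `Trainer` has just written. The artifact name and alias scheme are illustrative only, not a proposal for the final API:

```python
import os

import wandb
from transformers import TrainerCallback

class WandbCheckpointCallback(TrainerCallback):
    """Sketch: log every saved checkpoint as a new W&B artifact version."""

    def on_save(self, args, state, control, **kwargs):
        # on_save fires right after the Trainer has written the checkpoint,
        # so the files already exist on disk and can be uploaded directly
        # instead of re-saving through a fake trainer to a temp directory.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        artifact = wandb.Artifact(f"model-{wandb.run.id}", type="model")
        artifact.add_dir(ckpt_dir)
        aliases = [f"step-{state.global_step}"]
        # When metric_for_best_model is set, the Trainer already tracks the
        # best checkpoint, so improved versions can get a "best" alias.
        if args.metric_for_best_model and state.best_model_checkpoint == ckpt_dir:
            aliases.append("best")
        wandb.log_artifact(artifact, aliases=aliases)
```

Registered with `trainer.add_callback(WandbCheckpointCallback())`, this keeps the extra logic out of the main training loop.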
Top GitHub Comments
To do the same on the Hub, my idea was to leverage the versioning system and just push the saved checkpoint at every save with a commit message like “checkpoint step xxx”. Ideally inside a Callback to avoid adding more stuff to the main training loop. I’ll try to focus on this next week and see what we can easily do!
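A rough sketch of that Hub-side idea using today’s `huggingface_hub.upload_folder` helper (the comment predates this API, and the repo id below is hypothetical):

```python
from huggingface_hub import upload_folder
from transformers import TrainerCallback

class PushCheckpointCallback(TrainerCallback):
    """Sketch: push each saved checkpoint to the Hub as a new commit."""

    def on_save(self, args, state, control, **kwargs):
        upload_folder(
            repo_id="username/my-model",  # hypothetical target repo
            folder_path=f"{args.output_dir}/checkpoint-{state.global_step}",
            commit_message=f"checkpoint step {state.global_step}",
        )
```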
Yes, we can definitely save the URL somewhere! Would you like to make a PR with that?
I’m on another project right now that we will release soon, but I also plan to get back to this continuous-logging work afterwards (should be in two weeks!)