Log models continuously with wandb

See original GitHub issue

🚀 Feature request

The wandb integration currently logs only the last model (which can be the best one when TrainingArguments.load_best_model_at_end is used).
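For context, the current end-of-training behavior can be triggered with a configuration along these lines. This is a minimal sketch, assuming the existing WANDB_LOG_MODEL environment variable and report_to=["wandb"]; output_dir and the metric name are placeholders.

```python
# Minimal sketch of the current behavior: with WANDB_LOG_MODEL set, the model saved
# at the end of training (the best one, thanks to load_best_model_at_end) is what the
# wandb integration uploads. output_dir and metric names are placeholders.
import os
from transformers import TrainingArguments

os.environ["WANDB_LOG_MODEL"] = "true"  # enable the end-of-training model upload

args = TrainingArguments(
    output_dir="out",
    report_to=["wandb"],
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```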

It would be great to also allow continuous upload of the model, with appropriate aliases attached to each version.

The options would be (a minimal sketch of how they could be parsed follows):

  • WANDB_LOG_MODEL = True: log the model only at the end of training, as is done currently (not sure if we want to also add the scheduler and optimizer)
  • WANDB_LOG_MODEL = 'all': log the model continuously, at every save
  • WANDB_LOG_MODEL = False: do not log the model
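The helper below shows one way these three values could be interpreted; the function name and the accepted spellings are assumptions for illustration, not the actual integration code.

```python
# Hypothetical helper for the proposal above: map the WANDB_LOG_MODEL environment
# variable onto the three proposed modes. Name and accepted spellings are assumptions.
import os

def parse_wandb_log_model() -> str:
    """Return 'all', 'end', or 'false' based on WANDB_LOG_MODEL."""
    raw = os.getenv("WANDB_LOG_MODEL", "false").strip().lower()
    if raw == "all":
        return "all"    # log a new artifact version at every checkpoint save
    if raw in {"true", "end"}:
        return "end"    # log once at the end of training (current behavior)
    return "false"      # do not log the model
```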

Motivation

Training can be very long and it would be so sad to lose a model 😭

Your contribution

I can probably propose a PR but would love brainstorming on the ideal logic:

  1. Should we leverage Trainer.save_model (as we do currently) or Trainer._save_checkpoint?
  2. Should an artifact version contain only the model & config, or also the optimizer and scheduler? Or should these actually be two separate artifacts?
  3. If we leverage on_save, can we avoid the current logic (a fake trainer saving to a temporary directory that is then uploaded asynchronously) and just use an actual copy of what has been saved? We would only need the path or the list of files that were saved (should be straightforward). A rough sketch of this approach follows the list.
  4. If we log the model continuously, should we upload it only when it has improved (when metric_for_best_model is defined)? If so, we will need to be able to detect improvement; if not, we will still need to be able to know which checkpoint is the best.
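To make points 3 and 4 concrete, here is a rough sketch of an on_save-based callback that logs the checkpoint directory the Trainer just wrote as a new version of a single W&B artifact, and tags the best checkpoint with an alias. The class name and alias scheme are assumptions; this is not the actual transformers WandbCallback.

```python
# Rough sketch of the on_save idea: log the checkpoint directory the Trainer just
# wrote as a new version of one W&B artifact, with step/best aliases.
# Illustrative only; not the actual transformers WandbCallback.
import os
import wandb
from transformers import TrainerCallback

class ContinuousWandbModelLogger(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        if wandb.run is None:
            return
        # The Trainer saves checkpoints to {output_dir}/checkpoint-{global_step}.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if not os.path.isdir(ckpt_dir):
            return
        artifact = wandb.Artifact(name=f"model-{wandb.run.id}", type="model")
        artifact.add_dir(ckpt_dir)
        aliases = [f"step-{state.global_step}", "latest"]
        # Point 4: tag the best checkpoint when metric_for_best_model tracking is on.
        if state.best_model_checkpoint == ckpt_dir:
            aliases.append("best")
        wandb.run.log_artifact(artifact, aliases=aliases)
```

Whether an artifact version contains only the model & config or also the optimizer and scheduler (point 2) then mostly comes down to which files end up in the checkpoint directory that add_dir picks up.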

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Apr 15, 2021

To do the same on the Hub, my idea was to leverage the versioning system and just push the saved checkpoint at every save, with a commit message like “checkpoint step xxx”. Ideally this would live inside a Callback to avoid adding more stuff to the main training loop. I’ll try to focus on this next week and see what we can easily do!
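A minimal sketch of that idea, assuming the huggingface_hub upload_folder API rather than whatever the eventual implementation used; the callback name and repo_id are placeholders.

```python
# Sketch of the comment above: push each saved checkpoint to the Hub from a callback,
# with a "checkpoint step ..." commit message. Class name and repo_id are placeholders.
import os
from huggingface_hub import upload_folder
from transformers import TrainerCallback

class PushCheckpointToHubCallback(TrainerCallback):
    def __init__(self, repo_id: str):
        self.repo_id = repo_id  # e.g. "username/my-model" (placeholder)

    def on_save(self, args, state, control, **kwargs):
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if os.path.isdir(ckpt_dir):
            upload_folder(
                repo_id=self.repo_id,
                folder_path=ckpt_dir,
                commit_message=f"checkpoint step {state.global_step}",
            )
```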

1 reaction
sgugger commented, May 26, 2021

Yes, we can definitely save the URL somewhere! Would you like to make a PR with that?

I’m on another project right now that we will release soon, but I also plan to go back to the continuous integration work after that (should be in two weeks!)
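One possible reading of “the URL” mentioned above is the W&B run URL; here is a minimal sketch of grabbing and persisting it, assuming that reading and using a placeholder project name and file path.

```python
# One possible reading of "save the URL somewhere": persist the active W&B run URL
# so it can be referenced later (e.g. in the output directory or a model card).
import wandb

run = wandb.init(project="demo")  # placeholder project name
with open("wandb_run_url.txt", "w") as f:
    f.write(run.get_url())        # e.g. https://wandb.ai/<entity>/<project>/runs/<id>
run.finish()
```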

