Continuously log models with wandb
🚀 Feature request
The wandb integration currently logs only the last model (which can be the best one when using `TrainingArguments.load_best_model_at_end`).
It would be great to also allow continuous upload of the model as artifact versions with appropriate aliases.
Options would be (a parsing sketch follows the list):
* `WANDB_LOG_MODEL = True`, which just logs at the end as currently (not sure if we want to add the scheduler and optimizer)
* `WANDB_LOG_MODEL = 'all'`, which logs the model continuously
* `WANDB_LOG_MODEL = False`, which does not log the model
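A minimal sketch of how the callback could interpret the variable, assuming we keep the current environment-variable convention (the exact accepted values are up for discussion):

```python
import os

# Hypothetical parsing of WANDB_LOG_MODEL inside the wandb callback.
# "true"  -> log the model once at the end of training (current behavior)
# "all"   -> log a new artifact version at every checkpoint
# "false" -> never log the model (default)
value = os.getenv("WANDB_LOG_MODEL", "false").lower()
log_model_at_end = value == "true"
log_model_every_save = value == "all"
```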
Motivation
Training can be very long and it would be so sad to lose a model 😭
Your contribution
I can probably propose a PR but would love brainstorming on the ideal logic:
- should we leverage `Trainer.save_model` (as currently) or `Trainer._save_checkpoint`?
- should we consider an artifact version as containing only the model & config, or also the optimizer and scheduler? Or should these actually be 2 separate artifacts?
- if we leverage `on_save`, can we avoid the current logic (a fake trainer saving to a temporary directory that is then uploaded asynchronously) and just upload an actual copy of what has been saved? We would only need the path or the list of files that were saved (should be straightforward). See the sketch after this list.
- if we log the model continuously, should we upload it only when it has improved (when `metric_for_best_model` is defined)? If so, we need to be able to detect when that happens. If not, we still need to be able to know which version is the best.
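To make the `on_save` idea concrete, here is a minimal sketch of a standalone `TrainerCallback` that uploads the checkpoint the `Trainer` has just written. The artifact name and alias scheme are illustrative only, not a proposal for the final API:

```python
import os

import wandb
from transformers import TrainerCallback

class WandbCheckpointCallback(TrainerCallback):
    """Sketch: log every saved checkpoint as a new W&B artifact version."""

    def on_save(self, args, state, control, **kwargs):
        # on_save fires right after the Trainer has written the checkpoint,
        # so the files already exist on disk and can be uploaded directly
        # instead of re-saving through a fake trainer to a temp directory.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        artifact = wandb.Artifact(f"model-{wandb.run.id}", type="model")
        artifact.add_dir(ckpt_dir)
        aliases = [f"step-{state.global_step}"]
        # When metric_for_best_model is set, the Trainer already tracks the
        # best checkpoint, so improved versions can get a "best" alias.
        if args.metric_for_best_model and state.best_model_checkpoint == ckpt_dir:
            aliases.append("best")
        wandb.log_artifact(artifact, aliases=aliases)
```

Registered with `trainer.add_callback(WandbCheckpointCallback())`, this keeps the extra logic out of the main training loop.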
Top GitHub Comments
To do the same on the Hub, my idea was to leverage the versioning system and just push the saved checkpoint at every save with a commit message like “checkpoint step xxx”. Ideally inside a Callback to avoid adding more stuff to the main training loop. I’ll try to focus on this next week and see what we can easily do!
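A rough sketch of that Hub-side idea using today’s `huggingface_hub.upload_folder` helper (the comment predates this API, and the repo id below is hypothetical):

```python
from huggingface_hub import upload_folder
from transformers import TrainerCallback

class PushCheckpointCallback(TrainerCallback):
    """Sketch: push each saved checkpoint to the Hub as a new commit."""

    def on_save(self, args, state, control, **kwargs):
        upload_folder(
            repo_id="username/my-model",  # hypothetical target repo
            folder_path=f"{args.output_dir}/checkpoint-{state.global_step}",
            commit_message=f"checkpoint step {state.global_step}",
        )
```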
Yes, we can definitely save the URL somewhere! Would you like to make a PR with that?
I’m on another project right now that we will release soon, but I also plan to get back to this continuous-logging work afterwards (should be in two weeks!)