wandb sweeps accumulate GPU memory
See original GitHub issue

Describe the bug
When running a GPU sweep with any NERModel() or ClassificationModel() (bert, distilbert, roberta, etc.), the sweep retains some amount of GPU memory on every run. I tried adding statements to the run function passed to the agent to clear the GPU cache, but it did not work.
To Reproduce
import wandb

sweep_config = {
    'name': 'batch-16',
    'method': 'bayes',  # grid, random, bayes
    'metric': {
        'name': 'f1_score',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {'min': 0.0, 'max': 4e-4},  # both bounds as floats so wandb treats this as a continuous range
        'num_train_epochs': {'min': 1, 'max': 10},
        # 'train_batch_size': {'values': [16, 32]},
        'weight_decay': {'min': 0.0, 'max': 0.1}  # {'values': [0, 0.01, 0.03]}
    },
    'early_terminate': {'type': 'hyperband', 'min_iter': 1}
}
sweep_id = wandb.sweep(sweep_config, project=wandb_project)
import gc
import torch

def run_training():
    # Initialize a new wandb run
    wandb.init()
    # Create a TransformerModel
    model = NERModel(
        model_type=model_type,
        model_name=model_name,
        args=model_args,
        sweep_config=wandb.config,
    )
    # Train the model
    model.train_model(train_data=train, eval_data=val)
    # Sync wandb
    wandb.join()
    # Clear GPU and CPU RAM: drop the last reference to the model first,
    # otherwise gc.collect() and empty_cache() cannot release its CUDA tensors
    del model
    gc.collect()
    torch.cuda.empty_cache()
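The snippet above defines the run function but omits the agent launch; presumably the sweep is driven by something like the following minimal sketch, assuming the sweep_id created earlier:

wandb.agent(sweep_id, function=run_training)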
Expected behavior
GPU and CPU RAM should be fully released after every run of the sweep. Instead, as the screenshot below showed, allocated GPU memory increases incrementally with every run in a sweep.
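A common workaround when memory survives empty_cache() is to run each trial in a fresh child process, since the driver reclaims all device memory when that process exits. A sketch of the idea (untested here, and assuming the run_training and sweep_id defined above):

import multiprocessing as mp

def run_training_isolated():
    # Each trial runs in its own child process; when it exits, its CUDA
    # context is torn down and all of its GPU memory is returned.
    ctx = mp.get_context('spawn')  # 'spawn' avoids forking a CUDA-initialized parent
    p = ctx.Process(target=run_training)
    p.start()
    p.join()

wandb.agent(sweep_id, function=run_training_isolated)

Note that wandb.init() and wandb.join() stay inside run_training, so they execute in the child process that actually owns the GPU allocations.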
Screenshots
(screenshot omitted: a wandb system-metrics chart showing "GPU Memory Allocated" stepping upward with each successive sweep run)
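The trend in that chart can also be captured explicitly by logging allocator statistics at the end of each run; a small sketch using standard torch.cuda calls (the metric names are made up for illustration):

import torch
import wandb

def log_gpu_memory():
    # memory_allocated: bytes currently held by live tensors;
    # memory_reserved: bytes held by PyTorch's caching allocator.
    wandb.log({
        'gpu/allocated_mb': torch.cuda.memory_allocated() / 2**20,
        'gpu/reserved_mb': torch.cuda.memory_reserved() / 2**20,
    })

If reserved memory keeps climbing across runs while allocated memory drops back, the caching allocator (or a leaked reference) is the likely culprit.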
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:6 (3 by maintainers)
Top Results From Across the Web

System Metrics - Documentation - Weights & Biases - Wandb
Metrics automatically logged by wandb, including GPU Time Spent Accessing Memory (as a percentage of the sample time), GPU Memory Allocated, and TPU Utilization...

How to Implement Gradient Accumulation in PyTorch - Wandb
Learn how to implement gradient accumulation in PyTorch in this short tutorial complete with code and interactive visualizations (a minimal sketch of the pattern follows this list).

How to use 8 bit Optimizers in PyTorch - Wandb
Learn how to use 8 bit optimizers in PyTorch in this short tutorial complete with code and interactive visualizations.

Monitor & Improve GPU Usage for Model Training - Wandb
Nearly a third of our users are averaging less than 15% utilization. Average GPU memory usage is quite similar. Our users tend to...

Memory error cause failing processes - Weights & Biases
I'm using a sweep config for hyperparameters. Sometimes my run breaks due to some issue in the model's parameters, etc., and...
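As flagged in the gradient-accumulation item above, accumulation is the usual lever when GPU memory limits the batch size. A minimal, generic PyTorch sketch of the pattern (model, dataloader, criterion, and optimizer are hypothetical placeholders):

accumulation_steps = 4  # effective batch = accumulation_steps * micro-batch size
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per accumulation window
        optimizer.zero_grad()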
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @ThilinaRajapakse, I am getting the same memory allocation errors in Colab, Kaggle, and on on-premise servers. Is there anything you can do to help?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.