Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Google Colab GPUs and Sweeps

See original GitHub issue

Hi I’m new to Wandb and I was using it to optimize the hyperparameters of my deep learning model (the library is simpletransformers). I’m running the fine-tuning on Google colab.

The following is my code:

import logging
import pandas as pd
import wandb
from simpletransformers.classification import ClassificationArgs, ClassificationModel
from statistics import mean
from sklearn.metrics import accuracy_score

sweep_config = {
    #"namke":"vanilla-sweep-batch-16"
    "method": "bayes",  # grid, random
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min":1,"max":40},
        "learning_rate": {"min": 5e-5, "max": 1e-3},
        #"train_batch_size":{"values":[8,16,32,64,128]}
    },
    #W&B Sweeps can also speed up the hyperparameter optimization by terminating any poorly performing runs
    "early_terminate": {"type": "hyperband", "min_iter": 6,}
}

sweep_id = wandb.sweep(sweep_config, project="Simple Sweep")

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Optional model configuration
model_args = ClassificationArgs(sliding_window = True)
model_args.num_train_epochs = 5
model_args.learning_rate = 1e-4
#model_args.train_batch_size = 64
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.no_save = True
###########################################
model_args.use_early_stopping = False
model_args.early_stopping_delta = 0.01
model_args.early_stopping_metric = "mcc"
model_args.early_stopping_metric_minimize = False
model_args.early_stopping_patience = 5
model_args.evaluate_during_training_steps = 1000
model_args.evaluate_during_training = True
model_args.wandb_project = "Simple Sweep"
###########################################

def train():
    # Initialize a new wandb run
    wandb.init()
    print("HyperParams=>>", wandb.config.epochs)

    # Create a TransformerModel
    model = ClassificationModel(
        "electra",
        "google/electra-base-discriminator",
        use_cuda = True,
        args=model_args,
        sweep_config=wandb.config,
    )

    # Train the model
    model.train_model(train_data, eval_df=eval_data, 
                      accuracy=lambda truth, predictions: accuracy_score(truth, [round(p) for p in predictions]), 
    )

    # Evaluate the model
    model.eval_model(test_data,verbose=True)

    # Sync wandb
    wandb.join()


wandb.agent(sweep_id, train)

I received the following output

Create sweep with ID: qqdm6bfy
Sweep URL: https://app.wandb.ai/galvinator/Simple%20Sweep/sweeps/qqdm6bfy
INFO:wandb.wandb_agent:Running runs: []
INFO:wandb.wandb_agent:Agent received command: run
INFO:wandb.wandb_agent:Agent starting run with config:
	learning_rate: 0.0009937809505724256
	num_train_epochs: 29
wandb: Agent Starting Run: h46vr2qv with config:
	learning_rate: 0.0009937809505724256
	num_train_epochs: 29
INFO:wandb.wandb_agent:Running runs: ['h46vr2qv']

However, it remains as such even after 1 hour… is sweep meant to take this long? Should I wait longer - and how long? I’m expecting to see more permutations of learning_rate and num_train_epochs in the output but please let me know if the fine-tuning is much longer.

Thank you

Issue Analytics

State:
Created 3 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

1reaction

issue-label-bot[bot]commented, Jul 22, 2020

Issue-Label Bot is automatically applying the label question to this issue, with a confidence of 0.99. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

0reactions

vanpeltcommented, Jul 22, 2020

You may have to restart your kernel. If any code tries to connect to the gpu outside of the train function the notebook will hang. The other alternative is to shell out and call !wandb agent and have the code execute a python script instead of a functon. More details here: https://docs.wandb.com/sweeps/python-api#wandb-agent