
Multi-GPU CLI issue

See original GitHub issue

Hi, thanks for the great library, Sylvain!

The config file looks as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
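
Before suspecting the config, it is worth confirming that both GPUs are actually visible to the process that will be launched; if CUDA_VISIBLE_DEVICES hides one of them, a MULTI_GPU config with num_processes: 2 cannot use it. A minimal sanity check along those lines (not part of the original report) would be:

    # Hypothetical pre-flight check, not from the original issue: confirm that
    # both GPUs are visible to Python before launching with this config.
    import os

    import torch

    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))  # None means no restriction
    print("visible GPUs:", torch.cuda.device_count())  # expected: 2 for num_processes: 2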

The relevant part of the code is as follows:

    accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
    print(accelerator.device)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
    )
    valid_dataloader = DataLoader(
        validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )
    test_dataloader = DataLoader(
        test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

    # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    prepared = accelerator.prepare(
        model, optimizer, train_dataloader, valid_dataloader, test_dataloader
    )
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader = prepared


    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            #batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            accelerator.backward(loss)
            if step % gradient_accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
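
Note that the training loop calls lr_scheduler.step(), but the snippet never shows where lr_scheduler is created. Purely as an illustration (assuming the linear warmup schedule from transformers; this is not code from the issue), one way to define it would be:

    # Hypothetical scheduler setup, shown for illustration rather than taken from
    # the issue: a linear warmup/decay schedule stepped once per optimizer update.
    from transformers import get_linear_schedule_with_warmup

    num_update_steps = (len(train_dataloader) // gradient_accumulation_steps) * num_epochs
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=100,  # arbitrary example value
        num_training_steps=num_update_steps,
    )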

The script utilizes only a single GPU, even though there are 2 GPUs:

>>> torch.cuda.device_count()
2

Launching the script from the command line:

accelerate launch training.py

The print statement print(accelerator.device) returns the following (happy to add more debugging):

cuda

Any help is appreciated. Thank you!
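
For what it's worth, accelerator.device printing just cuda (rather than cuda:0 / cuda:1) is consistent with the script running as a single, non-distributed process. One way to see what was actually launched is to log the process index and world size from inside the script, using the process_index and num_processes attributes of the Accelerator (the print itself is added here for illustration and is not part of the original script):

    # Illustrative debugging line, not in the original script: under a correct
    # 2-process multi-GPU launch this should print twice, once per process,
    # with process_index 0 and 1 and devices cuda:0 and cuda:1.
    print(
        f"process {accelerator.process_index} of {accelerator.num_processes} "
        f"on device {accelerator.device}"
    )

If only a single line appears, the launcher is spawning one process, which would suggest the saved config is not being picked up by accelerate launch rather than a problem in the training code itself.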

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
dzorlu commented, Apr 22, 2021

This seems to be a false alarm, the process now sees both GPUs. Thank you for the quick turnaround. Can’t wait to use the library more. Deniz

0 reactions
sgugger commented, Apr 23, 2021

Closing the issue then, but feel free to reopen if you get the problem again!

Read more comments on GitHub >

Top Results From Across the Web

how to use multiple GPUs,the default is to use the first CUDA ...
You can do that by specifying jit=False, which is now the default in clip.load(). Once the non-JIT model is loaded, the...
Read more >
GPU Rendering — Blender Manual
Can multiple GPUs be used for rendering? Yes, go to Preferences ‣ System ‣ Compute Device Panel, and configure it as...
Read more >
Train 1 trillion+ parameter models - PyTorch Lightning
Train 1 trillion+ parameter models. When training large models, fitting larger batch sizes, or trying to increase throughput using multi-GPU compute, ...
Read more >
Multi-Process Service :: GPU Deployment and Management ...
Without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one...
Read more >
Running multi-instance GPUs | Google Kubernetes Engine ...
You can also use any of the supported GPU partition sizes mentioned earlier. To create a cluster with multi-instance GPUs enabled using the...
Read more >
