Multi-GPU CLI issue
Hi! Thanks for the great library, Sylvain!
The config file looks as follows:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
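(For context: this is the file written by `accelerate config`, which Accelerate reads from ~/.cache/huggingface/accelerate/default_config.yaml by default. The values Accelerate is actually picking up can be double-checked with:)

accelerate env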
The relevant part of the code is as follows:
# Imports reconstructed for completeness (the issue omits them); AdamW is
# assumed here to come from torch.optim.
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import Wav2Vec2ForCTC

# `config`, `MAX_GPU_BATCH_SIZE`, `EVAL_BATCH_SIZE`, the datasets and
# `collate_fn` are defined elsewhere in the full script.
accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
print(accelerator.device)

# Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
lr = config["lr"]
num_epochs = int(config["num_epochs"])
seed = int(config["seed"])
batch_size = int(config["batch_size"])

# If the batch size is too big we use gradient accumulation
gradient_accumulation_steps = 1
if batch_size > MAX_GPU_BATCH_SIZE:
    gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
    batch_size = MAX_GPU_BATCH_SIZE
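# Worked example (illustrative numbers, not from the issue): with
# batch_size = 64 and MAX_GPU_BATCH_SIZE = 16, gradient_accumulation_steps
# becomes 64 // 16 = 4 and each forward pass uses a batch of 16, so the
# effective per-process batch size stays 64.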
# Instantiate dataloaders.
train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
)
valid_dataloader = DataLoader(
    validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
test_dataloader = DataLoader(
    test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
)
# Instantiate the model (we build the model here so that the seed also
# controls the initialization of new weights)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Instantiate optimizer
optimizer = AdamW(params=model.parameters(), lr=lr)

# NOTE: `lr_scheduler` is stepped in the training loop below but is not
# defined in this snippet; it is presumably created elsewhere in the script.

# Prepare everything
# There is no specific order to remember, we just need to unpack the objects
# in the same order we gave them to the prepare method.
model, optimizer, train_dataloader, valid_dataloader, test_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader
)
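# For context (standard Accelerate behavior in a multi-GPU launch, not stated
# in the issue): `prepare` wraps the model in
# torch.nn.parallel.DistributedDataParallel and shards each dataloader so
# that every process only iterates over its own slice of the data.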
# Now we train the model
for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        # We could avoid this line since we set the accelerator with `device_placement=True`.
        # batch.to(accelerator.device)
        outputs = model(**batch)
        loss = outputs.loss
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if step % gradient_accumulation_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
The script only utilizes a single GPU, even though there are two GPUs available:

>>> torch.cuda.device_count()
2
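A quick way to see the raw CUDA visibility in a given process (a hypothetical debug snippet, not from the issue; all calls are standard PyTorch APIs):

import os
import torch

# Show which devices this process is allowed to see and what they are.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count =", torch.cuda.device_count())
print("devices:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])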
Launching the script in the command line:

accelerate launch training.py
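If the config file is not being picked up for some reason, the process count can also be forced on the command line; `--multi_gpu` and `--num_processes` are standard `accelerate launch` flags, suggested here purely as a debugging step rather than a confirmed fix:

accelerate launch --multi_gpu --num_processes 2 training.py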
The print statement print(accelerator.device) returns the following (happy to add more debugging):

cuda
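For what it's worth (an observation about Accelerate's behavior, not something stated in the issue): in a proper multi-GPU launch each process reports an indexed device such as cuda:0 or cuda:1, so a bare cuda suggests the script is running as a single non-distributed process. A minimal sketch to confirm how many processes Accelerate actually started:

from accelerate import Accelerator

accelerator = Accelerator()
# process_index, num_processes and device are standard Accelerator attributes;
# with the config above this should print two lines, one per GPU.
print(f"process {accelerator.process_index}/{accelerator.num_processes} "
      f"-> {accelerator.device}")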
Any help is appreciated. Thank you!
Top GitHub Comments
This seems to be a false alarm; the process now sees both GPUs. Thank you for the quick turnaround. Can't wait to use the library more. Deniz
Closing the issue then, but feel free to reopen if you get the problem again!