
Global accelerator.rng_types is passed and modified when preparing


Hi,

Correct me if I’m wrong but:

The seed for the sampler generator is currently set with generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item())) (ref). To my understanding, without setting an initial seed, the sampler generator differs across GPUs and should therefore be synchronized at every step (or at least the first time iter() is called). This holds provided rng_types contains ['generator'] when it is called the first time.
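For context, "synchronized" here just means that every process ends up with the same generator state before sampling. A minimal sketch of that idea (not accelerate's actual implementation, and assuming a CPU/gloo process group so the state tensor can be broadcast directly):

import torch
import torch.distributed as dist

def sync_generator(generator: torch.Generator):
    # broadcast rank 0's RNG state so every rank draws the same permutation;
    # assumes dist.init_process_group('gloo', ...) has already been called
    state = generator.get_state()        # ByteTensor holding the full RNG state
    dist.broadcast(state, src=0)
    generator.set_state(state)

generator = torch.Generator()
# seeded as in the accelerate source referenced above; this seed differs per process
generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
# without a broadcast like sync_generator(generator), shuffling diverges across GPUs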

However, when preparing a dataloader, the global accelerator.rng_types is passed around and may then be modified ('generator' removed from it) (ref). This happens when preparing a dataloader that doesn't shuffle (the eval loader).
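The modification is ordinary Python list aliasing: the prepare step receives a reference to the very same list object, so removing 'generator' inside it is visible through accelerator.rng_types afterwards. A toy illustration (the function name is hypothetical, not an accelerate internal):

def prepare_loader(rng_types):
    # an eval loader that doesn't shuffle has no generator to synchronize
    if 'generator' in rng_types:
        rng_types.remove('generator')   # mutates the shared list in place
    return rng_types

global_rng_types = ['generator']
prepare_loader(global_rng_types)
print(global_rng_types)                 # [] -- 'generator' is gone globally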

So I thought that after calling train_loader, eval_loader = accelerator.prepare(train_loader, eval_loader),

the rng_types list is now empty and the train_loader sampler generator is no longer synchronized.

Should this be an issue? If not, what is the logic?
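If the mutation is unintended, one possible remedy (a sketch only, not a claim about how accelerate handles or should handle it) is to give each prepared dataloader its own copy of the list so the accelerator-wide attribute stays intact:

def prepare_loader_with_copy(rng_types):
    rng_types = list(rng_types)          # defensive copy; the caller's list is untouched
    if 'generator' in rng_types:
        rng_types.remove('generator')
    return rng_types

global_rng_types = ['generator']
prepare_loader_with_copy(global_rng_types)
print(global_rng_types)                  # ['generator'] -- unchanged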

Note: if the same seed were required for all processes, then dropout wouldn't operate independently across GPUs (ref).

Example code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import Dataset, DataLoader
import random
from accelerate import Accelerator

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

class CustomDataset(Dataset):
    def __init__(self, size=10) -> None:
        super().__init__()
        self.size = size  # was hard-coded to 10, ignoring the argument

    def __getitem__(self, i):
        return torch.Tensor([i]), torch.Tensor([1, i, 2])

    def __len__(self):
        return self.size

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(1, 3)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        return self.dropout(self.linear1(x))

def main():
    accelerator = Accelerator()
    # different base seed per process so e.g. dropout masks differ across GPUs
    set_seed(66 + accelerator.process_index)
    device = torch.device('cuda:7')  # unused; accelerator handles device placement

    dataset = CustomDataset(size=10)
    dataloader = DataLoader(dataset, batch_size=5, num_workers=0, shuffle=True)
    eval_dataloader = DataLoader(dataset, batch_size=5, num_workers=0, shuffle=False)

    model = CustomModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.5)
    model, dataloader, optimizer, eval_dataloader = accelerator.prepare(
        model, dataloader, optimizer, eval_dataloader
    )
    model.train()
    for input_, output_ in dataloader:
        res = model(input_)
        print(f'{accelerator.process_index}, {input_}')
    
if __name__ == '__main__':
    main()
-----
Output:
1, tensor([[2.],
        [8.],
        [4.],
        [7.],
        [6.]], device='cuda:1')
0, tensor([[9.],
        [8.],
        [0.],
        [2.],
        [3.]], device='cuda:0')
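For completeness, the reported behaviour could be checked by printing the attribute around the prepare call in main() above (assuming accelerator.rng_types is readable as in the referenced source; the exact contents depend on the accelerate version and config):

print(accelerator.rng_types)   # expected to contain 'generator' before prepare
model, dataloader, optimizer, eval_dataloader = accelerator.prepare(
    model, dataloader, optimizer, eval_dataloader
)
print(accelerator.rng_types)   # per the report above, comes back empty once the eval loader is prepared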

Thanks.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Jun 29, 2021

No, it is, as the name indicates, a BatchSampler. It only computes indices (specifically the indices of the elements we want in each batch), the access to the Dataset is done later inside the DataLoader.
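To make the distinction concrete, a BatchSampler only yields lists of indices and never touches the samples themselves; the DataLoader later uses those indices to pull items from the Dataset. A small standalone illustration:

import torch
from torch.utils.data import BatchSampler, RandomSampler

data = list(range(10))
generator = torch.Generator().manual_seed(0)
sampler = RandomSampler(data, generator=generator)            # permutes indices
batch_sampler = BatchSampler(sampler, batch_size=5, drop_last=False)

for indices in batch_sampler:
    print(indices)   # lists of 5 indices; no Dataset element is accessed here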

0 reactions
github-actions[bot] commented, May 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
