reproducibility issue of DGL

🐛 Bug

I used DGL to build a GAT-like network, and I fixed the seeds of Python, NumPy, PyTorch and DGL for reproducibility. However, the results are still not deterministic, and they vary over a wide range between runs. Specifically, I used the following code to fix the seeds:

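import random
import numpy as np
import torch
import dgl

def set_seeds(seed):
    # Seed every random number generator the training script touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # all visible GPUs
    torch.backends.cudnn.deterministic = True  # restrict cuDNN to deterministic kernels
    dgl.seed(seed)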
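
A fuller seeding helper might look like the sketch below. It is not from the original issue: `seed_everything` and `_optional` are hypothetical names, and the guarded imports are an assumption made so the snippet runs even where NumPy, PyTorch or DGL is not installed. The extra switches (`cudnn.benchmark = False`, `CUBLAS_WORKSPACE_CONFIG`, `torch.use_deterministic_algorithms`) are standard PyTorch reproducibility options that the reporter's code does not set; whether they remove the variation described here is untested.

```python
import importlib
import os
import random


def _optional(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def seed_everything(seed, strict=False):
    """Seed Python, and NumPy / PyTorch / DGL when they are available."""
    # cuBLAS only behaves deterministically on CUDA >= 10.2 when this is set
    # before CUDA work starts; setdefault keeps a value the user already chose.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    random.seed(seed)

    np = _optional("numpy")
    if np is not None:
        np.random.seed(seed)

    torch = _optional("torch")
    if torch is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        # Autotuning can pick different kernels from run to run.
        torch.backends.cudnn.benchmark = False
        if strict:
            # Raises an error when an op has no deterministic implementation.
            torch.use_deterministic_algorithms(True)

    dgl = _optional("dgl")
    if dgl is not None:
        dgl.seed(seed)


seed_everything(42)
```

With `strict=True`, PyTorch reports the first operation that cannot run deterministically, which can help locate the source of run-to-run differences.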

To Reproduce

My GAT-like networks are like:

import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, hidden_size, alpha, beta, gamma=0.2, dropout=0.6):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.beta = beta

        self.hidden_size = hidden_size

        self.W_fc = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.attn_fc = nn.Linear(2 * hidden_size, 1, bias=False)
        self.leakyrelu = nn.LeakyReLU(self.gamma)
    
    def edge_attention(self, edges):
        z2 = torch.cat([edges.src['emb_attn'], edges.dst['emb_attn']], dim=1) # E x 2h (E = number of edges)
        a = self.attn_fc(z2) # E x 1
        return {'e': self.leakyrelu(a)} # E x 1

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['emb_crf'], 'e': edges.data['e']}
    
    def reduce_func(self, nodes):
        alpha = torch.softmax(nodes.mailbox['e'], dim=1) # N x D x 1, softmax over the D incoming edges
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1) # N x D x h -> N x h
        return {'h': h}

    def forward(self, embedding_input, h_input, graph):
        dv = 'cuda' if embedding_input.is_cuda else 'cpu'

        z = self.W_fc(h_input)
        graph.ndata['emb_crf'] = h_input
        graph.ndata['emb_attn'] = z
        graph.apply_edges(self.edge_attention)
        graph.update_all(self.message_func, self.reduce_func)
        
        gat_output = graph.ndata.pop('h')
        output = (self.alpha * embedding_input + self.beta * gat_output) / (self.alpha + self.beta)

        return output
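
One way to narrow down where the variation enters is to run the same computation several times from the same seed and compare the outputs exactly. The sketch below is a generic harness, not code from the issue: `max_run_to_run_diff` and `toy_run` are hypothetical names, and `toy_run` is only a stand-in for a seeded forward pass of the layer above.

```python
import random


def max_run_to_run_diff(run, seed, trials=3):
    """Call run(seed) `trials` times and return the largest absolute
    element-wise difference between the first run and any later run."""
    reference = run(seed)
    worst = 0.0
    for _ in range(trials - 1):
        result = run(seed)
        if len(result) != len(reference):
            raise ValueError("runs produced outputs of different lengths")
        for a, b in zip(reference, result):
            worst = max(worst, abs(a - b))
    return worst


def toy_run(seed):
    # Stand-in for: seed everything, build the model, run one forward pass,
    # and return the output flattened to a list of floats.
    rng = random.Random(seed)
    return [rng.random() for _ in range(8)]


# 0.0 means every run reproduced the first one exactly.
print(max_run_to_run_diff(toy_run, seed=0))
```

For a real model, `run` would seed all generators, construct the layer, and return something like `output.flatten().tolist()`. Comparing the result with and without the data loader in the loop is one way to separate sampling effects from the layer itself.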

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 0.6.x
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.9.0
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.9
  • CUDA/cuDNN version (if applicable): 10.2
  • GPU models and configuration (e.g. V100): P40
  • Any other relevant information:

Additional context

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 31 (4 by maintainers)

Top GitHub Comments

2 reactions
BarclayII commented, Mar 26, 2022

> @duncanriach Great comment! I get it. I’ve currently achieved determinism without using DGL’s dataloader and running on a single GPU.
>
> My problem may be the one mentioned in @BarclayII’s comment. I still suspect that dgl.dataloading.NeighborSampler introduces nondeterminism. If there are still problems, I’ll try to provide a minimal reproducible program.
>
> Thanks!

If you absolutely want to remove the non-determinism in neighbor sampling, you could try setting num_workers=1 (which disables OpenMP in neighbor sampling since the sampling happens in subprocesses, but only in DGL 0.8+), or setting the environment variable OMP_NUM_THREADS=1.

1 reaction
rickyxume commented, Mar 26, 2022

@BarclayII Thx! It’s done!!! Setting num_workers=1 works!

OMP_NUM_THREADS=1 does not seem to work. Anyway, my problem was finally solved and I learned a lot from you guys!
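
The two suggestions from this thread can be sketched as follows, assuming DGL 0.8 or later, where `dgl.dataloading.NeighborSampler` and `dgl.dataloading.DataLoader` are available. The function name `make_train_loader` and the fanout and batch-size values are hypothetical and chosen only for illustration; `graph` and `train_nids` stand for the user's own graph and training node IDs.

```python
import os

# OpenMP runtimes typically read this once when they initialize, so it is
# set here before NumPy, PyTorch or DGL are imported.
os.environ["OMP_NUM_THREADS"] = "1"


def make_train_loader(graph, train_nids, fanouts=(10, 10), batch_size=1024):
    """Build a neighbor-sampling loader that samples in one worker subprocess.

    Assumes DGL 0.8+; fanouts and batch_size are illustrative values.
    """
    # Imported lazily so the environment variable above is already in place.
    import dgl

    sampler = dgl.dataloading.NeighborSampler(list(fanouts))
    return dgl.dataloading.DataLoader(
        graph,
        train_nids,
        sampler,
        batch_size=batch_size,
        shuffle=True,
        drop_last=False,
        num_workers=1,  # the setting reported above to restore determinism
    )
```

In this thread, `num_workers=1` was the change that resolved the problem, while `OMP_NUM_THREADS=1` did not seem to help the reporter, so the environment-variable line may be unnecessary in a similar setup.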
