Reproducibility issue with DGL

🐛 Bug

I used DGL to build a GAT-like network, and I fixed the seeds of Python, NumPy, PyTorch, and DGL for reproducibility. However, the results are still not deterministic, and the variation between runs is very large. Specifically, I used the following code to fix the seeds:

import random

import dgl
import numpy as np
import torch

def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    dgl.seed(seed)
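
Beyond seeding, PyTorch exposes a few stricter determinism switches. The sketch below is an assumption about what might be worth trying, not a confirmed fix for this issue: `enable_strict_determinism` is a hypothetical helper name, and these flags only affect PyTorch's own operators, so they may not cover DGL's kernels or its samplers.

```python
import os

# cuBLAS asks for this setting to behave deterministically on CUDA 10.2+.
# It should be in place before the first CUDA call, so it is set at import time.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")


def enable_strict_determinism():
    """Hypothetical helper: turn on PyTorch's stricter determinism switches."""
    import torch  # imported lazily so the environment variable above is set first

    # Stop cuDNN from auto-tuning, which can pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Available in PyTorch 1.8+; raises an error when an operator has no
    # deterministic implementation instead of silently running it.
    torch.use_deterministic_algorithms(True)
```

Calling `enable_strict_determinism()` right after `set_seeds(seed)` would at least surface any PyTorch operator that cannot run deterministically.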

To Reproduce

My GAT-like layer is defined as follows:

import torch
import torch.nn as nn

class GATLayer(nn.Module):
    def __init__(self, hidden_size, alpha, beta, gamma=0.2, dropout=0.6):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.beta = beta

        self.hidden_size = hidden_size

        self.W_fc = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.attn_fc = nn.Linear(2 * hidden_size, 1, bias=False)
        self.leakyrelu = nn.LeakyReLU(self.gamma)
    
    def edge_attention(self, edges):
        # Concatenate source and destination embeddings for each edge (E = number of edges).
        z2 = torch.cat([edges.src['emb_attn'], edges.dst['emb_attn']], dim=1) # E x 2h
        a = self.attn_fc(z2) # E x 1
        return {'e': self.leakyrelu(a)} # E x 1

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['emb_crf'], 'e': edges.data['e']}
    
    def reduce_func(self, nodes):
        # Normalize attention scores over each node's D incoming messages.
        alpha = torch.softmax(nodes.mailbox['e'], dim=1) # N x D x 1
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1) # N x D x h -> N x h
        return {'h': h}

    def forward(self, embedding_input, h_input, graph):
        z = self.W_fc(h_input)
        graph.ndata['emb_crf'] = h_input
        graph.ndata['emb_attn'] = z
        graph.apply_edges(self.edge_attention)
        graph.update_all(self.message_func, self.reduce_func)
        
        gat_output = graph.ndata.pop('h')
        output = (self.alpha * embedding_input + self.beta * gat_output) / (self.alpha + self.beta)

        return output
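
One way to narrow down where the variation comes from is to run the same seeded step several times and compare the results. The sketch below is a generic harness under stated assumptions: `is_reproducible`, `toy_run`, and `DriftingRun` are hypothetical names, and the toy runs use Python's `random` module as a stand-in for a real training step.

```python
import random


def is_reproducible(run, seed, repeats=3):
    """Return True if run(seed) gives identical results on every call."""
    results = [run(seed) for _ in range(repeats)]
    return all(result == results[0] for result in results[1:])


def toy_run(seed):
    # Stand-in for "seed everything, run the model, return the outputs".
    random.seed(seed)
    return [random.random() for _ in range(5)]


class DriftingRun:
    """Toy run that ignores the seed and depends on hidden state."""

    def __init__(self):
        self.calls = 0

    def __call__(self, seed):
        self.calls += 1
        return self.calls


print(is_reproducible(toy_run, seed=0))        # True
print(is_reproducible(DriftingRun(), seed=0))  # False
```

For tensor outputs, the `==` comparison would need to be swapped for something like `torch.equal`. Applying the check first to a single forward pass of the layer and then to a full epoch with the dataloader would show which stage introduces the difference.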

Expected behavior

Environment

DGL Version (e.g., 1.0): 0.6.x
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.9.0
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): pip
Build command you used (if compiling from source):
Python version: 3.7.9
CUDA/cuDNN version (if applicable): 10.2
GPU models and configuration (e.g. V100): P40
Any other relevant information:

Additional context

Top GitHub Comments

BarclayII commented, Mar 26, 2022

@duncanriach Great comment! I get it now. I've achieved determinism when running on a single GPU without DGL's dataloader.

My problem may be the one mentioned in @BarclayII's comment. I still suspect that dgl.dataloading.NeighborSampler introduces nondeterminism. If problems remain, I'll try to provide a minimal reproducible program.

Thanks!

If you absolutely want to remove the nondeterminism in neighbor sampling, you could try setting num_workers=1 (sampling then happens in a subprocess, which disables OpenMP in neighbor sampling, but only in DGL 0.8+) or setting the environment variable OMP_NUM_THREADS=1.
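
Both suggestions can be sketched together as follows, assuming DGL 0.8+ where `dgl.dataloading.DataLoader` and `dgl.dataloading.NeighborSampler` are available. `build_train_loader`, the fanouts, and the batch size are hypothetical placeholders, and setting the environment variable before importing DGL is an assumption about when the OpenMP runtime reads it. Note that the reply below reports that only `num_workers=1` helped.

```python
import os

# Assumption: the OpenMP runtime reads this when it initializes, so it is
# set before torch/dgl are imported.
os.environ["OMP_NUM_THREADS"] = "1"


def build_train_loader(graph, train_nids, fanouts=(10, 10), batch_size=1024):
    """Hypothetical helper: neighbor-sampling dataloader with one worker."""
    import dgl  # imported lazily so the environment variable above is already set

    sampler = dgl.dataloading.NeighborSampler(list(fanouts))
    return dgl.dataloading.DataLoader(
        graph,
        train_nids,
        sampler,
        batch_size=batch_size,
        shuffle=True,     # batch order then depends on the PyTorch seed
        drop_last=False,
        num_workers=1,    # sampling runs in a single subprocess
    )
```

The seeds still need to be fixed before the loader is created, since shuffling and sampling both draw from seeded random number generators.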

rickyxume commented, Mar 26, 2022

@BarclayII Thx! It’s done!!! Setting num_workers=1 works!

OMP_NUM_THREADS=1 does not seem to work. Anyway, my problem was finally solved and I learned a lot from you guys!