Weird behavior when using GraphConv with norm=right
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Define a model with GraphConv layers and set norm='right'
- Train the model and evaluate error/metrics on the training data
- Metrics logged while training improve as expected, but the same data and model under model.eval() give near-random performance
- Re-run the same code, but with norm='right' removed
- As expected, evaluating metrics on the training data now shows improvement
From what I can gather, setting norm='right' somehow introduces an error (which doesn't make a lot of sense after a brief look at the implementation). The model itself does not have any sources of non-determinism like Dropout either, so that part is ruled out as well.
Also, the error goes away if I do not set the model to evaluation mode (and let it stay in train mode) while evaluating, which doesn't make any sense: the only difference between the two modes for this model should be gradient accumulation.
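For context, a minimal sanity check of what I mean (this is not part of my original snippet; the tiny graph here is a placeholder, and GCN refers to the class in the code below):

```python
import dgl
import torch as ch

# Placeholder graph/model just to make the check runnable on its own;
# in practice `model` is the trained GCN and `g` a training graph.
g = dgl.add_self_loop(dgl.graph(([0, 1, 2], [1, 2, 0])))
g.ndata['feat'] = ch.randn(g.num_nodes(), 8)
model = GCN(n_inp=8, n_hidden=16, n_layers=2)

model.train()
with ch.no_grad():
    logits_train_mode = model(g)

model.eval()
with ch.no_grad():
    logits_eval_mode = model(g)

# With no Dropout/BatchNorm in the model, these should match exactly.
print(ch.allclose(logits_train_mode, logits_eval_mode))
```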
Code snippet to reproduce
from dgl.nn.pytorch import GraphConv
import torch.nn as nn
import torch.optim as optim
import torch as ch
from tqdm import tqdm
class GCN(nn.Module):
    def __init__(self, n_inp, n_hidden, n_layers, n_classes=2, residual=False):
        super(GCN, self).__init__()
        self.layers = nn.ModuleList()
        self.residual = residual
        # input layer
        self.layers.append(
            GraphConv(n_inp, n_hidden, norm='right'))
            # GraphConv(n_inp, n_hidden))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(
                GraphConv(n_hidden, n_hidden, norm='right'))
                # GraphConv(n_hidden, n_hidden))
        # output layer
        self.final = GraphConv(n_hidden, n_classes, norm='right')
        # self.final = GraphConv(n_hidden, n_classes)
        self.activation = nn.ReLU()

    def forward(self, g, latent=None):
        if latent is not None:
            if latent < 0 or latent > len(self.layers):
                raise ValueError("Invalid internal layer requested")
        x = g.ndata['feat']
        for i, layer in enumerate(self.layers):
            xo = self.activation(layer(g, x))
            # Add prev layer directly, if requested
            if self.residual and i != 0:
                xo = self.activation(xo + x)
            x = xo
            # Return representation, if requested
            if i == latent:
                return x
        return self.final(g, x)

def true_positive(pred, target):
    return (target[pred == 1] == 1).sum().item()

def get_metrics(y, y_pred, threshold=0.5):
    y_ = 1 * (y_pred > threshold)
    tp = true_positive(y_, y)
    precision = tp / ch.sum(y_ == 1)
    recall = tp / ch.sum(y == 1)
    f1 = (2 * precision * recall) / (precision + recall)
    precision = precision.item()
    recall = recall.item()
    f1 = f1.item()
    # Check for NaNs
    if precision != precision:
        precision = 0
    if recall != recall:
        recall = 0
    if f1 != f1:
        f1 = 0
    return (precision, recall, f1)

# @ch.no_grad()
def lmao(model, loader, gpu):
    loss_func = nn.CrossEntropyLoss()
    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    iterator = tqdm(iterator, total=len(loader))
    for e, batch in iterator:
        # Shift graph to GPU
        if gpu:
            batch = batch.to('cuda')
        # Get model predictions and get loss
        labels = batch.ndata['y'].long()
        logits = model(batch)
        loss = loss_func(logits, labels)
        probs = ch.softmax(logits, dim=1)[:, 1]
        # Get metrics
        m = get_metrics(labels, probs)
        precision += m[0]
        recall += m[1]
        f1 += m[2]
        tot_loss += loss.item()
        iterator.set_description(
            "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
            (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def epoch(model, loader, gpu, optimizer=None, verbose=False):
    loss_func = nn.CrossEntropyLoss()
    is_train = True
    if optimizer is None:
        is_train = False
    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    if verbose:
        iterator = tqdm(iterator, total=len(loader))
    with ch.set_grad_enabled(is_train):
        for e, batch in iterator:
            if gpu:
                # Shift graph to GPU
                batch = batch.to('cuda')
            # Get model predictions and get loss
            labels = batch.ndata['y'].long()
            logits = model(batch)
            loss = loss_func(logits, labels)
            with ch.no_grad():
                probs = ch.softmax(logits, dim=1)[:, 1]
            # Backprop gradients if training
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Get metrics
            m = get_metrics(labels, probs)
            precision += m[0]
            recall += m[1]
            f1 += m[2]
            tot_loss += loss.detach().item()
            if verbose:
                iterator.set_description(
                    "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
                    (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def train_model(net, ds, args):
    train_loader, test_loader = ds.get_loaders(1, shuffle=False)
    optimizer = optim.Adam(net.parameters(), lr=args.lr)
    for e in range(args.epochs):
        # Train
        print("[Train]")
        net.train()
        epoch(net, train_loader, args.gpu, optimizer, verbose=args.verbose)
        # Test
        print("[Eval]")
        net.eval()
        epoch(net, train_loader, args.gpu, None, verbose=args.verbose)
        print()

Expected behavior
Loss/metrics keep improving as the model is trained, so re-evaluating them on the SAME data should show similar performance. Instead, the performance logged while training keeps improving, while checking performance on the same dataset and model in evaluation mode leads to near-random performance. Example of what I'm talking about (evaluation is also done on train data):
Environment
- DGL Version: 0.6.1
- Backend Library & Version: PyTorch 1.7.1
- OS: Linux
- How you installed DGL: pip
- Python version: 3.6.10
- CUDA/cuDNN version: 10.1/7.5.0
- GPU models and configuration: NVidia Quadro RTX 4000
Additional context
The error persists without the GPU as well (training on CPU).
Comments
The in-degree and out-degree are the same for the same node. However, the denominator for norm='both' is the square root of the product of the out-degree of the source node and the in-degree of the destination node, which are not necessarily the same. Before the code you showed, the output representation is computed by summing the incoming messages. Since the number of incoming messages of a node equals the node's in-degree, the output will be the same value.
The difference between their normalization and ours is that they divide the outgoing messages by out-degrees before message passing. That is OK.
If I write down the equations things will get clearer. Assuming that x is the same input feature for all nodes:

norm='both':  h_i = sum_{j in N(i)} x·W / sqrt(d_out(j) · d_in(i))
norm='right': h_i = (1 / d_in(i)) · sum_{j in N(i)} x·W = x·W
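A small self-contained sketch (not from the original thread; the toy graph and feature sizes are made up) illustrating the point: with identical input features, GraphConv(norm='right') produces the same row for every node, while norm='both' does not:

```python
import dgl
import torch as ch
from dgl.nn.pytorch import GraphConv

# Hypothetical small graph where nodes end up with different degrees.
g = dgl.graph(([0, 1, 2, 3], [1, 2, 0, 0]))
g = dgl.add_self_loop(g)
x = ch.ones(g.num_nodes(), 4)  # identical input feature for all nodes

conv_right = GraphConv(4, 2, norm='right', bias=False)
conv_both = GraphConv(4, 2, norm='both', bias=False)
conv_both.weight = conv_right.weight  # share weights for a fair comparison

with ch.no_grad():
    out_right = conv_right(g, x)  # every row is identical (equal to x @ W)
    out_both = conv_both(g, x)    # rows differ for nodes with different degrees

print(out_right)
print(out_both)
```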
With DGL 0.6+ you can specify your own normalization weights using the EdgeWeightNorm module, though I can add another normalization option in GraphConv if you want to.
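For reference, a rough sketch of how EdgeWeightNorm can be combined with GraphConv (the graph and feature shapes here are placeholders):

```python
import dgl
import torch as ch
from dgl.nn.pytorch import EdgeWeightNorm, GraphConv

g = dgl.add_self_loop(dgl.graph(([0, 1, 2], [1, 2, 0])))  # placeholder graph
feat = ch.randn(g.num_nodes(), 8)
edge_weight = ch.ones(g.num_edges())  # unweighted edges

# Normalize the edge weights (here with the symmetric 'both' scheme),
# then pass them to a GraphConv that does no normalization of its own.
norm = EdgeWeightNorm(norm='both')
norm_weight = norm(g, edge_weight)
conv = GraphConv(8, 4, norm='none')
out = conv(g, feat, edge_weight=norm_weight)
```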
DROPOUT, DROPOUT, DROPOUT
Finally, the issue could be reproduced on my side. The reason I could not reproduce it earlier is that no dropout is configured in the code snippet pasted at the top of this post; dropout is configured in the 'gist.py' you just shared.
As for the issue, I'd blame dropout, which is the main difference between model.train() and model.eval(). Why blame dropout? With norm='right' and dropout=0.0, calling model.train() already gives precision of ~0.000 (this is what I reproduced before), let alone model.eval() on train_loader and test_loader. In other words, with dropout=0.0, model.train() behaves almost the same as model.eval() because there is no dropout at all. But with dropout=0.5, dropout takes effect under model.train(), which obtains good precision (>0.7), while there is no dropout at all under model.eval(), which results in 0.000 precision.
In short, the model is vulnerable and sensitive to dropout if norm='right'. If norm='both', the model is more robust and less sensitive to dropout, even with dropout=0.0, according to my experiments.
I think we'd better train with GraphConv(norm='both') and dropout=0.5 to obtain a robust model in this scenario.
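For completeness, a sketch of that suggested configuration applied to the GCN from the original snippet (the Dropout placement here is an assumption, since the dropout-enabled gist.py is not included above):

```python
import torch.nn as nn
from dgl.nn.pytorch import GraphConv

class GCNWithDropout(nn.Module):
    """Variant of the original GCN using norm='both' plus Dropout between layers."""
    def __init__(self, n_inp, n_hidden, n_layers, n_classes=2, dropout=0.5):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(GraphConv(n_inp, n_hidden, norm='both'))
        for _ in range(n_layers - 1):
            self.layers.append(GraphConv(n_hidden, n_hidden, norm='both'))
        self.final = GraphConv(n_hidden, n_classes, norm='both')
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)  # only active under model.train()

    def forward(self, g):
        x = g.ndata['feat']
        for layer in self.layers:
            x = self.dropout(self.activation(layer(g, x)))
        return self.final(g, x)
```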