Weird behavior when using GraphConv with norm=right
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Define a model with GraphConv layers and set norm='right'
- Train the model and evaluate error/metrics on the training data
- Metrics logged while training improve as expected, but the same data and model under model.eval() give near-random performance
- Re-run the same code, but with norm='right' removed
- As expected, evaluating metrics on the training data now shows improvement
From what I can gather, setting norm='right' somehow introduces an error (which doesn't make a lot of sense after a brief look at the implementation). The model itself does not have any sources of non-determinism like Dropout either, so that part is ruled out as well.
Also, the error goes away if I do not set the model to evaluation mode (and let it stay in train mode) while evaluating, which doesn't make any sense: the only difference between the two modes for this model should be gradient accumulation.
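For context, a minimal sanity check of what I mean (this is not part of my original snippet; the tiny graph here is a placeholder, and GCN refers to the class in the code below):

```python
import dgl
import torch as ch

# Placeholder graph/model just to make the check runnable on its own;
# in practice `model` is the trained GCN and `g` a training graph.
g = dgl.add_self_loop(dgl.graph(([0, 1, 2], [1, 2, 0])))
g.ndata['feat'] = ch.randn(g.num_nodes(), 8)
model = GCN(n_inp=8, n_hidden=16, n_layers=2)

model.train()
with ch.no_grad():
    logits_train_mode = model(g)

model.eval()
with ch.no_grad():
    logits_eval_mode = model(g)

# With no Dropout/BatchNorm in the model, these should match exactly.
print(ch.allclose(logits_train_mode, logits_eval_mode))
```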
Code snippet to reproduce
from dgl.nn.pytorch import GraphConv
import torch.nn as nn
import torch.optim as optim
import torch as ch
from tqdm import tqdm
class GCN(nn.Module):
    def __init__(self, n_inp, n_hidden, n_layers, n_classes=2, residual=False):
        super(GCN, self).__init__()
        self.layers = nn.ModuleList()
        self.residual = residual
        # input layer
        self.layers.append(
            GraphConv(n_inp, n_hidden, norm='right'))
            # GraphConv(n_inp, n_hidden))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(
                GraphConv(n_hidden, n_hidden, norm='right'))
                # GraphConv(n_hidden, n_hidden))
        # output layer
        self.final = GraphConv(n_hidden, n_classes, norm='right')
        # self.final = GraphConv(n_hidden, n_classes)
        self.activation = nn.ReLU()

    def forward(self, g, latent=None):
        if latent is not None:
            if latent < 0 or latent > len(self.layers):
                raise ValueError("Invalid internal layer requested")
        x = g.ndata['feat']
        for i, layer in enumerate(self.layers):
            xo = self.activation(layer(g, x))
            # Add prev layer directly, if requested
            if self.residual and i != 0:
                xo = self.activation(xo + x)
            x = xo
            # Return representation, if requested
            if i == latent:
                return x
        return self.final(g, x)

def true_positive(pred, target):
    return (target[pred == 1] == 1).sum().item()

def get_metrics(y, y_pred, threshold=0.5):
    y_ = 1 * (y_pred > threshold)
    tp = true_positive(y_, y)
    precision = tp / ch.sum(y_ == 1)
    recall = tp / ch.sum(y == 1)
    f1 = (2 * precision * recall) / (precision + recall)
    precision = precision.item()
    recall = recall.item()
    f1 = f1.item()
    # Check for NaNs
    if precision != precision:
        precision = 0
    if recall != recall:
        recall = 0
    if f1 != f1:
        f1 = 0
    return (precision, recall, f1)

# @ch.no_grad()
def lmao(model, loader, gpu):
    loss_func = nn.CrossEntropyLoss()
    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    iterator = tqdm(iterator, total=len(loader))
    for e, batch in iterator:
        # Shift graph to GPU
        if gpu:
            batch = batch.to('cuda')
        # Get model predictions and get loss
        labels = batch.ndata['y'].long()
        logits = model(batch)
        loss = loss_func(logits, labels)
        probs = ch.softmax(logits, dim=1)[:, 1]
        # Get metrics
        m = get_metrics(labels, probs)
        precision += m[0]
        recall += m[1]
        f1 += m[2]
        tot_loss += loss.item()
        iterator.set_description(
            "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
            (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def epoch(model, loader, gpu, optimizer=None, verbose=False):
    loss_func = nn.CrossEntropyLoss()
    is_train = True
    if optimizer is None:
        is_train = False
    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    if verbose:
        iterator = tqdm(iterator, total=len(loader))
    with ch.set_grad_enabled(is_train):
        for e, batch in iterator:
            if gpu:
                # Shift graph to GPU
                batch = batch.to('cuda')
            # Get model predictions and get loss
            labels = batch.ndata['y'].long()
            logits = model(batch)
            loss = loss_func(logits, labels)
            with ch.no_grad():
                probs = ch.softmax(logits, dim=1)[:, 1]
            # Backprop gradients if training
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Get metrics
            m = get_metrics(labels, probs)
            precision += m[0]
            recall += m[1]
            f1 += m[2]
            tot_loss += loss.detach().item()
            if verbose:
                iterator.set_description(
                    "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
                    (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def train_model(net, ds, args):
    train_loader, test_loader = ds.get_loaders(1, shuffle=False)
    optimizer = optim.Adam(net.parameters(), lr=args.lr)
    for e in range(args.epochs):
        # Train
        print("[Train]")
        net.train()
        epoch(net, train_loader, args.gpu, optimizer, verbose=args.verbose)
        # Test
        print("[Eval]")
        net.eval()
        epoch(net, train_loader, args.gpu, None, verbose=args.verbose)
        print()

Expected behavior
Loss/metrics keep improving as the model is trained, so re-evaluating them on the SAME data should show similar performance. Instead, the performance logged while training keeps improving, while checking performance on the same dataset and model in evaluation mode leads to near-random performance. Example of what I'm talking about (evaluation is also done on train data):
Environment
- DGL Version: 0.6.1
- Backend Library & Version: PyTorch 1.7.1
- OS: Linux
- How you installed DGL: pip
- Python version: 3.6.10
- CUDA/cuDNN version: 10.1/7.5.0
- GPU models and configuration: NVidia Quadro RTX 4000
Additional context
The error persists without the GPU as well (training on CPU).
Comments
The in-degree and out-degree are the same for the same node. However, the denominator for norm='both' is the square root of the product of the out-degree of the source node and the in-degree of the destination node, which are not necessarily the same. Before the code you showed, the output representation is computed by summing the incoming messages. Since the number of incoming messages of a node equals the node's in-degree, the output will be the same value.
The difference between their normalization and ours is that they divide the outgoing messages by out-degrees before message passing. That is OK.
If I write down the equations things will get clearer. Assuming that x is the same input feature for all nodes:

norm='both':  h_i = sum_{j in N(i)} x·W / sqrt(d_out(j) · d_in(i))
norm='right': h_i = (1 / d_in(i)) · sum_{j in N(i)} x·W = x·W
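A small self-contained sketch (not from the original thread; the toy graph and feature sizes are made up) illustrating the point: with identical input features, GraphConv(norm='right') produces the same row for every node, while norm='both' does not:

```python
import dgl
import torch as ch
from dgl.nn.pytorch import GraphConv

# Hypothetical small graph where nodes end up with different degrees.
g = dgl.graph(([0, 1, 2, 3], [1, 2, 0, 0]))
g = dgl.add_self_loop(g)
x = ch.ones(g.num_nodes(), 4)  # identical input feature for all nodes

conv_right = GraphConv(4, 2, norm='right', bias=False)
conv_both = GraphConv(4, 2, norm='both', bias=False)
conv_both.weight = conv_right.weight  # share weights for a fair comparison

with ch.no_grad():
    out_right = conv_right(g, x)  # every row is identical (equal to x @ W)
    out_both = conv_both(g, x)    # rows differ for nodes with different degrees

print(out_right)
print(out_both)
```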
With DGL 0.6+ you can specify your own normalization weights using the EdgeWeightNorm module, though I can add another normalization option in GraphConv if you want to.
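For reference, a rough sketch of how EdgeWeightNorm can be combined with GraphConv (the graph and feature shapes here are placeholders):

```python
import dgl
import torch as ch
from dgl.nn.pytorch import EdgeWeightNorm, GraphConv

g = dgl.add_self_loop(dgl.graph(([0, 1, 2], [1, 2, 0])))  # placeholder graph
feat = ch.randn(g.num_nodes(), 8)
edge_weight = ch.ones(g.num_edges())  # unweighted edges

# Normalize the edge weights (here with the symmetric 'both' scheme),
# then pass them to a GraphConv that does no normalization of its own.
norm = EdgeWeightNorm(norm='both')
norm_weight = norm(g, edge_weight)
conv = GraphConv(8, 4, norm='none')
out = conv(g, feat, edge_weight=norm_weight)
```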
DROPOUT, DROPOUT, DROPOUT
Finally, the issue could be reproduced on my side. The reason I could not reproduce it earlier is that no dropout is configured in the code snippet pasted at the top of this post; dropout is configured in the 'gist.py' you just shared.
As for the issue, I'd blame dropout, which is the main difference between model.train() and model.eval(). Why blame dropout? With norm='right' and dropout=0.0, calling model.train() already gives precision of ~0.000 (this is what I reproduced before), let alone model.eval() on train_loader and test_loader. In other words, with dropout=0.0, model.train() behaves almost the same as model.eval() because there is no dropout at all. But with dropout=0.5, dropout takes effect under model.train(), which obtains good precision (>0.7), while there is no dropout at all under model.eval(), which results in 0.000 precision.
In short, the model is vulnerable and sensitive to dropout if norm='right'. If norm='both', the model is more robust and less sensitive to dropout, even with dropout=0.0, according to my experiments.
I think we'd better train with GraphConv(norm='both') and dropout=0.5 to obtain a robust model in this scenario.
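For completeness, a sketch of that suggested configuration applied to the GCN from the original snippet (the Dropout placement here is an assumption, since the dropout-enabled gist.py is not included above):

```python
import torch.nn as nn
from dgl.nn.pytorch import GraphConv

class GCNWithDropout(nn.Module):
    """Variant of the original GCN using norm='both' plus Dropout between layers."""
    def __init__(self, n_inp, n_hidden, n_layers, n_classes=2, dropout=0.5):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(GraphConv(n_inp, n_hidden, norm='both'))
        for _ in range(n_layers - 1):
            self.layers.append(GraphConv(n_hidden, n_hidden, norm='both'))
        self.final = GraphConv(n_hidden, n_classes, norm='both')
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)  # only active under model.train()

    def forward(self, g):
        x = g.ndata['feat']
        for layer in self.layers:
            x = self.dropout(self.activation(layer(g, x)))
        return self.final(g, x)
```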