Adding new datasets to dgl.data

🚀 Feature

Adding a new graph dataset related to node classification (fraud detection) and a new graph dataset related to graph classification (fake news detection) as the default datasets in dgl.data.

Motivation

The first graph dataset includes two homogeneous multi-relational graphs extracted from Yelp and Amazon, where nodes represent fraudulent reviews or fraudulent reviewers. It was first proposed in a CIKM’20 paper and has been used as a benchmark by a recent WWW’21 paper. Another paper uses the dataset as an example for studying non-homophilous graphs. The dataset is built on industrial data and has rich relational information along with distinctive properties such as class imbalance and feature inconsistency, which makes it a good test case for investigating how GNNs perform on noisy real-world graphs.

The second graph dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs in this dataset are trees: the root node represents the news item and the leaf nodes are the Twitter users who retweeted it. In addition, the node features encode the users' historical tweets using different pretrained language models. The dataset could help GNNs learn how to fuse multi-modal information and learn representations for tree-structured graphs. It would be a good addition to the current graph classification benchmarks.

Alternatives

N/A

Pitch

Adding the above two new datasets as default datasets in dgl.data.

Additional context

N/A

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

2 reactions
YingtongDou commented, May 12, 2021

Isn’t setting a random seed a better way to control randomness? From my experience, if you simply cut data into 3 sections, the validation accuracy and test accuracy would vary a lot.

A random seed works, but I think storing fixed train/val/test IDs is safer and more standard.
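The idea of storing fixed split IDs can be sketched as follows. This is a minimal illustration in plain Python, not DGL's implementation: the helper names (`make_split`, `save_split`, `load_split`), the 60/20/20 fractions, and the JSON file format are all assumptions made for the example.

```python
import json
import os
import random
import tempfile


def make_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node IDs once with a seeded RNG and cut them into three parts."""
    ids = list(range(num_nodes))
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return {
        "train": sorted(ids[:n_train]),
        "val": sorted(ids[n_train:n_train + n_val]),
        "test": sorted(ids[n_train + n_val:]),
    }


def save_split(split, path):
    """Write the split IDs to disk so every later run reuses them."""
    with open(path, "w") as f:
        json.dump(split, f)


def load_split(path):
    """Read back the stored split IDs without touching any RNG."""
    with open(path) as f:
        return json.load(f)


# Generate the split once, store it, and reload it for later experiments.
split = make_split(num_nodes=10)
path = os.path.join(tempfile.mkdtemp(), "split.json")
save_split(split, path)
reloaded = load_split(path)
```

Once the IDs are stored with the dataset, the split no longer depends on the seed, the RNG implementation, or the library version, which is the sense in which it is safer than relying on a seed alone.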

0 reactions
saharshleo commented, Jul 1, 2021

@BarclayII Thank you for your response!

A small update: I am now working on edge classification with imbalanced classes. I have modified RECT-L as follows:

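import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import GraphConv


class MLPPredictor(nn.Module):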
    def __init__(self, in_features, out_classes):
        super().__init__()
        self.W = nn.Linear(in_features * 2, out_classes)

    def apply_edges(self, edges):
        h_u = edges.src['h']
        h_v = edges.dst['h']
        score = self.W(torch.cat([h_u, h_v], 1))
        return {'score': score}

    def forward(self, graph, h):
        # h contains the node representations computed from the GNN defined
        # in the node classification section (Section 5.1).
        with graph.local_scope():
            graph.ndata['h'] = h
            graph.apply_edges(self.apply_edges)
            return graph.edata['score']

class RECT_L(nn.Module):
    def __init__(self, g, in_feats, n_hidden, activation, dropout=0.0):
        super(RECT_L, self).__init__()
        self.g = g
        self.gcn_1 = GraphConv(in_feats, n_hidden, activation=activation)
        self.fc = nn.Linear(n_hidden, in_feats)
        self.dropout = dropout
        nn.init.xavier_uniform_(self.fc.weight.data)

        self.pred = MLPPredictor(in_feats, 1)
        
    def forward(self, inputs):
        h_1 = self.gcn_1(self.g, inputs)
        h_1 = F.dropout(h_1, p=self.dropout, training=self.training)
        preds = self.fc(h_1)

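        # Use the graph stored on the module rather than a global `g`.
        preds = self.pred(self.g, preds)
        # After the sigmoid these are probabilities, not raw logits.
        preds = torch.sigmoid(preds)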
        return preds
    
    # Detach the return variables
    def embed(self, inputs):
        h_1 = self.gcn_1(self.g, inputs)
        return h_1.detach()
  • The MLPPredictor class is the same as given here
  • in_feats = g.ndata['features'].shape[1]
  • hidden_feats = 200
  • activation = nn.PReLU()

I am also using binary cross-entropy as the loss function:

# class_weights = [90.0]
loss = F.binary_cross_entropy_with_logits(logits[train_mask], edge_labels[train_mask], pos_weight=torch.FloatTensor(class_weights))

I have tried the loss function both with and without class weights, but it has no impact on the predictions. After a certain number of epochs the model predicts only the majority class.
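One thing that may be worth checking here is what the loss actually receives. The sketch below is plain Python, not PyTorch: it re-implements the weighted formula that `binary_cross_entropy_with_logits` documents for a single score, and the function names are made up for the example. It assumes that `logits` in the loss call above is the output of `RECT_L.forward`, which already applies `torch.sigmoid`. Under that assumption the sigmoid is applied twice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(x, y, pos_weight=1.0):
    """Weighted binary cross-entropy for one raw score x and a label y in {0, 1}."""
    p = sigmoid(x)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))


raw_scores = [-8.0, -2.0, 0.0, 2.0, 8.0]

# Probabilities implied when the loss is given raw scores.
implied_from_logits = [sigmoid(x) for x in raw_scores]

# Probabilities implied when the loss is given scores that were already
# passed through a sigmoid: they are confined to a narrow band above 0.5.
implied_from_probs = [sigmoid(sigmoid(x)) for x in raw_scores]

# pos_weight scales only the positive-label term of the loss.
loss_pos_unweighted = bce_with_logits(0.0, 1)
loss_pos_weighted = bce_with_logits(0.0, 1, pos_weight=90.0)
loss_neg_weighted = bce_with_logits(0.0, 0, pos_weight=90.0)
```

If the assumption holds, the probability the loss works with can never fall below 0.5 or rise above roughly 0.73, whatever the class weight is, so returning the raw scores from `forward` and applying the sigmoid only at prediction time might be a useful experiment.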
