Adding new datasets to dgl.data
🚀 Feature
Adding a new graph dataset for node classification (fraud detection) and a new graph dataset for graph classification (fake news detection) as default datasets in dgl.data.
Motivation
The first graph dataset includes two homogeneous multi-relational graphs extracted from Yelp and Amazon, where nodes represent reviews or reviewers, some of which are fraudulent. It was first proposed in a CIKM'20 paper and has been used as a benchmark by a recent WWW'21 paper. Another paper also takes the dataset as an example for studying non-homophilous graphs. This dataset is built from industrial data and has rich relational information and unique properties such as class imbalance and feature inconsistency, which make it a good testbed for investigating how GNNs perform on real-world noisy graphs.
The second graph dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs in this dataset are trees in which the root node represents the news item and the leaf nodes are Twitter users who retweeted it. In addition, the node features are users' historical tweets encoded with different pretrained language models. The dataset could help GNNs learn to fuse multi-modal information and to learn representations for tree-structured graphs, and it would be a good addition to current graph classification benchmarks.
Alternatives
N/A
Pitch
Adding the above two new datasets as default datasets in dgl.data.
Additional context
N/A
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 2
- Comments: 9 (1 by maintainers)
A random seed works, but I think storing fixed train-val-test ids is safer and more standard.
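The point above (shipping fixed split ids rather than relying on a seed) might be sketched roughly as follows; `make_split` and the `split.npz` filename are hypothetical names for illustration, not part of DGL's API. The split is generated once, saved, and reloaded, so every run and every user sees exactly the same partition:

```python
import numpy as np

# Hypothetical sketch: generate a train/val/test split once, store the ids,
# and reload them, instead of re-randomizing with a seed on every run.
def make_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    ids = rng.permutation(num_nodes)          # shuffle all node ids once
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return (ids[:n_train],                    # train ids
            ids[n_train:n_train + n_val],     # validation ids
            ids[n_train + n_val:])            # test ids

train_ids, val_ids, test_ids = make_split(100)
# Store the fixed ids so they can be distributed with the dataset.
np.savez('split.npz', train=train_ids, val=val_ids, test=test_ids)
loaded = np.load('split.npz')                 # later runs reload the same split
```

Storing the ids also makes results comparable across papers, since everyone evaluates on the identical test set rather than on a seed-dependent one.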
@BarclayII Thank you for your response!
There is a small update: I am now working on edge classification with imbalanced classes. I have modified RECT-L accordingly; the MLPPredictor class is the same as the one given here. I am also using binary cross entropy as the loss function.
I have tried the loss both with and without class weights, but there is no impact on predictions: after a certain number of epochs, the model predicts only the majority class.
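The setup described above (an MLP edge predictor trained with binary cross entropy, optionally class-weighted) might look roughly like the following. This is a hypothetical plain-PyTorch stand-in, not the poster's actual code: `MLPPredictor` here takes explicit edge endpoint indices instead of a DGL graph, and the `pos_weight` value is an assumed example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not the poster's actual code): score each edge from
# the concatenated embeddings of its two endpoint nodes.
class MLPPredictor(nn.Module):
    def __init__(self, in_feats, n_classes=1):
        super().__init__()
        self.W1 = nn.Linear(in_feats * 2, in_feats)
        self.W2 = nn.Linear(in_feats, n_classes)

    def forward(self, h, src, dst):
        # h: (num_nodes, in_feats); src/dst: endpoint indices of each edge
        e = torch.cat([h[src], h[dst]], dim=1)
        return self.W2(F.relu(self.W1(e))).squeeze(-1)  # raw logits per edge

h = torch.randn(6, 16)                   # toy node embeddings
src = torch.tensor([0, 1, 2, 3])
dst = torch.tensor([1, 2, 3, 4])
labels = torch.tensor([0., 0., 0., 1.])  # imbalanced binary edge labels
pred = MLPPredictor(16)
logits = pred(h, src, dst)

# Class-weighted BCE: pos_weight > 1 up-weights the rare positive class;
# the 3.0 here is an assumed example (roughly the negative/positive ratio).
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
loss = loss_fn(logits, labels)
```

If class weighting alone does not help, other common remedies for majority-class collapse worth trying are over-sampling minority-class edges in each batch or a focal-style loss; whether any of these fixes this particular model would need to be verified experimentally.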