Split Error in RandomLinkSplit
🐛 Bug
When I use RandomLinkSplit to split the MovieLens dataset, I found that the split sizes are wrong.
To Reproduce
The link prediction task is as follows:
```python
import torch_geometric.transforms as T

train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
```
I get the following result:
train: 80670 (correct), val: 80670 (wrong), test: 90753 (wrong)
Expected behavior
The dataset has 100836 edges of type ('user', 'rates', 'movie'). According to the ratio (0.8, 0.1, 0.1), the split should be:
train: 80670, val: 10083, test: 10083
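As a quick sanity check of those expected counts (pure Python, no PyG needed; only the edge count and ratios from this issue are used):

```python
# Expected split sizes for the 100836 ('user', 'rates', 'movie') edges,
# taking num_val=0.1 and num_test=0.1 as fractions of all edges.
num_edges = 100836

num_val = int(0.1 * num_edges)               # 10083
num_test = int(0.1 * num_edges)              # 10083
num_train = num_edges - num_val - num_test   # 80670

print(num_train, num_val, num_test)  # 80670 10083 10083
```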
Environment
- PyG version (torch_geometric.__version__): 2.0.2
- PyTorch version (torch.__version__): 1.10.0
- OS (e.g., Linux): macOS
- Python version (e.g., 3.9): 3.8
- CUDA/cuDNN version: CPU
- How you installed PyTorch and PyG (conda, pip, source): pip
- Any other relevant information (e.g., version of torch-scatter): Not yet.
Additional context
I reviewed the source code and found that the error may come from line 176 of RandomLinkSplit, which is called with the wrong parameters.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 2
- Comments: 29 (14 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
“Training message edges” are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. “Training supervision edges” are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.
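This distinction also explains the counts in the bug report: in RandomLinkSplit, each split's edge_index holds the message edges while edge_label_index holds the supervision edges, so val_data.edge_index reuses the 80670 training edges and test_data.edge_index reuses train + val edges. A pure-Python sketch of that bookkeeping (no torch_geometric; edge counts taken from this issue):

```python
# Sketch of how RandomLinkSplit distributes message vs. supervision edges.
all_edges = list(range(100836))  # stand-ins for ('user', 'rates', 'movie') edges

num_val, num_test = 10083, 10083
num_train = len(all_edges) - num_val - num_test          # 80670

train_sup = all_edges[:num_train]                        # training supervision edges
val_sup = all_edges[num_train:num_train + num_val]       # validation supervision edges
test_sup = all_edges[num_train + num_val:]               # test supervision edges

# Message edges available to the GNN in each split:
train_msg = train_sup             # train: message and supervision edges coincide
val_msg = train_sup               # val: only training edges carry messages
test_msg = train_sup + val_sup    # test: train + val edges carry messages

print(len(train_msg), len(val_msg), len(test_msg))  # 80670 80670 90753
```

These are exactly the "wrong" numbers reported above, which suggests the counts come from edge_index (message edges) rather than edge_label_index (supervision edges).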
This depends on the model and on validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.
In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the model's ability to exploit that leakage is limited as well.
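To illustrate the limited decoder power mentioned above: GAE scores a candidate link purely by the dot product of the two node embeddings. A minimal sketch (toy embeddings, not the GAE reference implementation):

```python
import math

def dot_product_decoder(z_u, z_v):
    """GAE-style link probability: sigmoid of the embedding dot product."""
    score = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-score))

# Toy node embeddings: similar vectors yield a high link probability,
# opposite vectors yield a low one.
print(dot_product_decoder([1.0, 2.0], [1.0, 2.0]))    # > 0.5
print(dot_product_decoder([1.0, 2.0], [-1.0, -2.0]))  # < 0.5
```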
Yes, this is correct. Validation and test edges always need to be disjoint.
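A simple way to check that property, sketched here with toy (src, dst) tuples standing in for the columns of each split's edge_label_index:

```python
# Verify that validation and test supervision edges never overlap.
# The edge sets below are toy data, not the actual MovieLens split.
val_edges = {(0, 10), (1, 11), (2, 12)}
test_edges = {(3, 13), (4, 14)}

assert val_edges.isdisjoint(test_edges), "val/test supervision edges overlap!"
print("val and test edges are disjoint")
```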