Split Error in RandomLinkSplit
🐛 Bug
When I use RandomLinkSplit to split the MovieLens dataset, I found that the split sizes are wrong.
To Reproduce
The link prediction task is as follows:
```python
import torch_geometric.transforms as T

train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
```
I get the following result:
train: 80670 (correct), val: 80670 (wrong), test: 90753 (wrong)
Expected behavior
The dataset has 100836 edges of type ('user', 'rates', 'movie'). According to the ratio (0.8, 0.1, 0.1), the split should be:
train: 80670, val: 10083, test: 10083
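As a quick sanity check of those expected counts (pure Python, no PyG needed; only the edge count and ratios from this issue are used):

```python
# Expected split sizes for the 100836 ('user', 'rates', 'movie') edges,
# taking num_val=0.1 and num_test=0.1 as fractions of all edges.
num_edges = 100836

num_val = int(0.1 * num_edges)               # 10083
num_test = int(0.1 * num_edges)              # 10083
num_train = num_edges - num_val - num_test   # 80670

print(num_train, num_val, num_test)  # 80670 10083 10083
```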
Environment
- PyG version (torch_geometric.__version__): 2.0.2
- PyTorch version (torch.__version__): 1.10.0
- OS (e.g., Linux): macOS
- Python version (e.g., 3.9): 3.8
- CUDA/cuDNN version: CPU
- How you installed PyTorch and PyG (conda, pip, source): pip
- Any other relevant information (e.g., version of torch-scatter): Not yet.
Additional context
I reviewed the source code and found that the error may come from line 176 of RandomLinkSplit, which is called with the wrong parameters.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 2
- Comments: 29 (14 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
“Training message edges” are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. “Training supervision edges” are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.
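This distinction also explains the counts in the bug report: in RandomLinkSplit, each split's edge_index holds the message edges while edge_label_index holds the supervision edges, so val_data.edge_index reuses the 80670 training edges and test_data.edge_index reuses train + val edges. A pure-Python sketch of that bookkeeping (no torch_geometric; edge counts taken from this issue):

```python
# Sketch of how RandomLinkSplit distributes message vs. supervision edges.
all_edges = list(range(100836))  # stand-ins for ('user', 'rates', 'movie') edges

num_val, num_test = 10083, 10083
num_train = len(all_edges) - num_val - num_test          # 80670

train_sup = all_edges[:num_train]                        # training supervision edges
val_sup = all_edges[num_train:num_train + num_val]       # validation supervision edges
test_sup = all_edges[num_train + num_val:]               # test supervision edges

# Message edges available to the GNN in each split:
train_msg = train_sup             # train: message and supervision edges coincide
val_msg = train_sup               # val: only training edges carry messages
test_msg = train_sup + val_sup    # test: train + val edges carry messages

print(len(train_msg), len(val_msg), len(test_msg))  # 80670 80670 90753
```

These are exactly the "wrong" numbers reported above, which suggests the counts come from edge_index (message edges) rather than edge_label_index (supervision edges).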
This depends on the model and on validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.
In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the model's ability to exploit that leakage is limited as well.
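To illustrate the limited decoder power mentioned above: GAE scores a candidate link purely by the dot product of the two node embeddings. A minimal sketch (toy embeddings, not the GAE reference implementation):

```python
import math

def dot_product_decoder(z_u, z_v):
    """GAE-style link probability: sigmoid of the embedding dot product."""
    score = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-score))

# Toy node embeddings: similar vectors yield a high link probability,
# opposite vectors yield a low one.
print(dot_product_decoder([1.0, 2.0], [1.0, 2.0]))    # > 0.5
print(dot_product_decoder([1.0, 2.0], [-1.0, -2.0]))  # < 0.5
```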
Yes, this is correct. Validation and test edges always need to be disjoint.
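A simple way to check that property, sketched here with toy (src, dst) tuples standing in for the columns of each split's edge_label_index:

```python
# Verify that validation and test supervision edges never overlap.
# The edge sets below are toy data, not the actual MovieLens split.
val_edges = {(0, 10), (1, 11), (2, 12)}
test_edges = {(3, 13), (4, 14)}

assert val_edges.isdisjoint(test_edges), "val/test supervision edges overlap!"
print("val and test edges are disjoint")
```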