Split Error in RandomLinkSplit

🐛 Bug

When I use RandomLinkSplit to split the MovieLens dataset, the resulting splits have the wrong number of edges.

To Reproduce

The link prediction task is as follows:

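import torch_geometric.transforms as T

# `data` is the heterogeneous MovieLens graph, loaded beforehand.
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)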

I get the following result:

train: 80670 (this is right), val: 80670 (wrong), test: 90753 (wrong)

Expected behavior

The number of edges of type ('user', 'rates', 'movie') in this dataset is 100836. With the ratio (0.8, 0.1, 0.1), the split should be as follows:

train: 80670, val: 10083, test: 10083
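
The expected sizes follow from simple arithmetic, sketched below. The helper `expected_split_sizes` is hypothetical and not part of PyG. It assumes that fractional counts are truncated and that the training split receives the remainder, which reproduces the numbers quoted above.

```python
def expected_split_sizes(num_edges, val_ratio, test_ratio):
    """Return (train, val, test) edge counts for a ratio-based split.

    Assumption: fractional counts are truncated, and the training split
    receives whatever remains.
    """
    num_val = int(val_ratio * num_edges)
    num_test = int(test_ratio * num_edges)
    num_train = num_edges - num_val - num_test
    return num_train, num_val, num_test


train, val, test = expected_split_sizes(100836, 0.1, 0.1)
print(train, val, test)  # 80670 10083 10083
print(train + val)       # 90753
```

Note that the reported val and test counts (80670 and 90753) equal train and train + val. This suggests, though it does not prove, that the reported numbers count the edges available for message passing rather than the held-out edges.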

Environment

  • PyG version (torch_geometric.__version__): 2.0.2
  • PyTorch version: (torch.__version__): 1.10.0
  • OS (e.g., Linux): MacOS
  • Python version (e.g., 3.9): 3.8
  • CUDA/cuDNN version: CPU
  • How you installed PyTorch and PyG (conda, pip, source): pip
  • Any other relevant information (e.g., version of torch-scatter): Not yet.

Additional context

After reviewing the source code, I suspect the error comes from line 176 of RandomLinkSplit, where the wrong parameters appear to be passed.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 29 (14 by maintainers)

Top GitHub Comments

3 reactions
rusty1s commented, Dec 25, 2021
  1. “Training message edges” are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. “Training supervision edges” are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.

  2. This depends on the model and validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.

    In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the data leakage capabilities of the model are limited as well.
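
The two ideas above can be illustrated with a toy sketch in plain Python: a disjoint split of training edges into message and supervision sets, and a dot-product decoder that turns two node embeddings into a link probability. The helpers `split_message_supervision` and `link_probability` are hypothetical names, and the embeddings are made-up values rather than GNN outputs. PyG's RandomLinkSplit has a `disjoint_train_ratio` argument that serves a similar purpose, but this sketch does not reproduce its implementation.

```python
import math
import random


def split_message_supervision(edges, supervision_ratio, seed=0):
    """Split training edges into disjoint message and supervision sets."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    num_supervision = int(supervision_ratio * len(shuffled))
    supervision = shuffled[:num_supervision]
    message = shuffled[num_supervision:]
    return message, supervision


def link_probability(z_src, z_dst):
    """Dot-product decoder: sigmoid of the inner product of two embeddings."""
    score = sum(a * b for a, b in zip(z_src, z_dst))
    return 1.0 / (1.0 + math.exp(-score))


# 20 toy (user, movie) edges; real edges would come from the dataset.
edges = [(u, m) for u in range(4) for m in range(5)]
message, supervision = split_message_supervision(edges, supervision_ratio=0.25)

# No edge is used both for message passing and for supervision.
assert not set(message) & set(supervision)

print(len(message), len(supervision))            # 15 5
print(link_probability([0.0, 0.0], [1.0, 2.0]))  # 0.5
```

With a ratio of 0.0 every training edge would serve only as a message edge in this toy helper. The GAE-style setup described above instead reuses the same edges for both roles, which this sketch does not model.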

1 reaction
rusty1s commented, Dec 26, 2021

Yes, this is correct. Validation and test edges need to always be disjoint.
