
Various malloc-related crashes with NeighborLoader

See original GitHub issue

🐛 Describe the bug

I am using NeighborLoader to iterate a dataset that fits entirely in memory.

    device = 'cpu'

    node_size = 3
    edge_size = 2
    global_size = node_size
    hidden_size = 64
    learning_rate = 1e-5
    batch_size = 1
    epochs = 2

    data = Data()
    name_to_id, id_to_name = parse_csv_as_torch_data(data=data, nodes_path=train_nodes_path, edges_path=train_edges_path, exclude_path=test_node_names_path, device=device)
    model = Graphnet(node_size=node_size, edge_size=edge_size, global_size=global_size, hidden_size=hidden_size)

    print("%s:\t[%d,%d]" % ("x", data.x.shape[0], data.x.shape[1]))
    print("%s:\t[%d,%d]" % ("y", data.y.shape[0], data.y.shape[1]))
    print("%s:\t[%d,%d]" % ("edge_index", data.edge_index.shape[0], data.edge_index.shape[1]))
    print("%s:\t[%d,%d]" % ("edge_attr", data.edge_attr.shape[0], data.edge_attr.shape[1]))

    # Move the model to the target device. (The original loop iterated over an
    # empty torch.nn.ModuleList() and therefore never executed.)
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='none')

    print("attempting to initialize data loaders...")
    data_loader_train = NeighborLoader(data=data, batch_size=batch_size, num_neighbors=[1] * 1, drop_last=True)
    print("Initialized train loader")

and the DataLoader never succeeds at initializing:

Moving to device:  cpu
done
x:      [26237,3]
y:      [26237,1]
edge_index:     [2,737180]
edge_attr:      [737180,2]
attempting to initialize data loaders...
free(): corrupted unsorted chunks

For reference, my CSV files which describe the node and edge data are these sizes:

24M edges.csv
1.2M nodes.csv

I get a variety of memory related errors when I attempt to run this:

free(): corrupted unsorted chunks

python3: malloc.c:3839: _int_malloc: Assertion 'chunk_main_arena (bck->bk)' failed.

corrupted double-linked list

corrupted size vs. prev_size

malloc_consolidate(): invalid chunk size

The last one is probably the most common. I’ve tried batch_size=1, drop_last=True, and reducing the sampling to a single hop and/or a single neighbor. It doesn’t seem to matter whether I use CPU or GPU.
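For what it’s worth, crashes like these inside NeighborLoader’s native sampling code are often triggered by out-of-bounds node indices in edge_index rather than by batch-size settings. Below is a minimal pure-Python sketch of that bounds check (the helper name is mine; in practice you would compare edge_index.max() against data.num_nodes on the tensors directly):

```python
def edge_index_in_bounds(edge_index, num_nodes):
    """Check that every endpoint in a COO edge list names a valid node.

    edge_index is [src_list, dst_list], mirroring PyG's 2 x num_edges layout.
    """
    src, dst = edge_index
    return all(0 <= i < num_nodes for i in src + dst)

# An edge pointing at node 5 in a 4-node graph is out of bounds:
print(edge_index_in_bounds([[0, 1, 2], [1, 5, 3]], num_nodes=4))  # False
print(edge_index_in_bounds([[0, 1], [1, 0]], num_nodes=2))        # True
```

If this check fails on your parsed CSV data, the crash is coming from invalid indices handed to the sampler, not from the loader configuration.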

Here are the stats when using the time -v command:

Command terminated by signal 6
        Command being timed: "python3 ../scripts/find_bubbles.py"
        User time (seconds): 7.07
        System time (seconds): 0.66
        Percent of CPU this job got: 110%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.98
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 608156
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 116061
        Voluntary context switches: 95
        Involuntary context switches: 137
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Environment

  • PyG version: 2.0.4
  • PyTorch version: 1.11.0+cu113
  • OS: Ubuntu 20.04
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.4.4 and 11.3.1
  • How you installed PyTorch and PyG (conda, pip, source): pip
  • Any other relevant information (e.g., version of torch-scatter):

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
rlorigro commented, Jul 18, 2022

Awesome, thanks. I will close the issue then.

I also have a question regarding the behavior of NeighborLoader… for each data object sampled, is the centroid of the subgraph (the starting node) always at index 0 of data.x? If I am training a node classifier, I don’t think I want to compute a prediction or loss for any node which doesn’t have sufficient neighbors in the graph for the number of layers my graphnet has.
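(For anyone landing here with the same question: per the NeighborLoader docs, the seed nodes of each mini-batch are placed first in the returned batch, so the first batch_size rows of batch.x belong to the sampled input nodes. A toy sketch of the slicing, with no PyG required and a stand-in Batch class:)

```python
# Stand-in for a sampled mini-batch: NeighborLoader places the seed
# (input) nodes first, so rows [0:batch_size] of x belong to them.
class FakeBatch:
    def __init__(self, x, batch_size):
        self.x = x                  # node rows, seed nodes first
        self.batch_size = batch_size  # number of seed nodes

batch = FakeBatch(x=["seed_0", "seed_1", "neighbor_a", "neighbor_b"],
                  batch_size=2)
seed_rows = batch.x[:batch.batch_size]
print(seed_rows)  # ['seed_0', 'seed_1']
```

In a training loop this is why predictions and loss are typically computed on out[:batch.batch_size] only, ignoring the neighbor nodes that were sampled purely for message passing.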

1 reaction
rusty1s commented, Jul 18, 2022

We just added a data.validate() function to master to check for this 😃
