Various malloc-related crashes with NeighborLoader

🐛 Describe the bug

I am using NeighborLoader to iterate a dataset that fits entirely in memory.

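    import torch
    from torch_geometric.data import Data
    from torch_geometric.loader import NeighborLoader

    # parse_csv_as_torch_data and Graphnet are the reporter's own helpers (not shown).
    device = 'cpu'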

    node_size = 3
    edge_size = 2
    global_size = node_size
    hidden_size = 64
    learning_rate = 1e-5
    batch_size = 1
    epochs = 2

    data = Data()
    name_to_id, id_to_name = parse_csv_as_torch_data(data=data, nodes_path=train_nodes_path, edges_path=train_edges_path, exclude_path=test_node_names_path, device=device)
    model = Graphnet(node_size=node_size, edge_size=edge_size, global_size=global_size, hidden_size=hidden_size)

    print("%s:\t[%d,%d]" % ("x", data.x.shape[0], data.x.shape[1]))
    print("%s:\t[%d,%d]" % ("y", data.y.shape[0], data.y.shape[1]))
    print("%s:\t[%d,%d]" % ("edge_index", data.edge_index.shape[0], data.edge_index.shape[1]))
    print("%s:\t[%d,%d]" % ("edge_attr", data.edge_attr.shape[0], data.edge_attr.shape[1]))

    # Move the model's parameters to the target device.
    # (Iterating a freshly constructed, empty torch.nn.ModuleList() does nothing.)
    model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.BCEWithLogitsLoss(reduction='none')

    print("attempting to initialize data loaders...")
    data_loader_train = NeighborLoader(data=data, batch_size=batch_size, num_neighbors=[1] * 1, drop_last=True)
    print("Initialized train loader")

The NeighborLoader never finishes initializing:

Moving to device:  cpu
done
x:      [26237,3]
y:      [26237,1]
edge_index:     [2,737180]
edge_attr:      [737180,2]
attempting to initialize data loaders...
free(): corrupted unsorted chunks

For reference, my CSV files which describe the node and edge data are these sizes:

24M edges.csv
1.2M nodes.csv

I get a variety of memory related errors when I attempt to run this:

free(): corrupted unsorted chunks

python3: malloc.c:3839: _int_malloc: Assertion 'chunk_main_arena (bck->bk)' failed.

corrupted double-linked list

corrupted size vs. prev_size

malloc_consolidate(): invalid chunk size

The most common is probably the last one. I’ve tried batch_size=1, drop_last=True, and reducing the neighbor sampling to one iteration and/or one neighbor. It doesn’t seem to matter whether I use the CPU or a GPU.

Here are the stats when using the time -v command:

Command terminated by signal 6
        Command being timed: "python3 ../scripts/find_bubbles.py"
        User time (seconds): 7.07
        System time (seconds): 0.66
        Percent of CPU this job got: 110%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.98
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 608156
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 116061
        Voluntary context switches: 95
        Involuntary context switches: 137
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Environment

PyG version: 2.0.4
PyTorch version: 1.11.0+cu113
OS: Ubuntu 20.04
Python version: 3.8.10
CUDA/cuDNN version: 11.4.4 and 11.3.1
How you installed PyTorch and PyG (conda, pip, source): pip
Any other relevant information (e.g., version of torch-scatter):

Top GitHub Comments

rlorigro commented, Jul 18, 2022

Awesome, thanks. I will close the issue then.

I also have a question regarding the behavior of NeighborLoader… for each data object sampled, is the centroid of the subgraph (the starting node) always at index 0 of data.x? If I am training a node classifier, I don’t think I want to compute a prediction or loss for any node which doesn’t have sufficient neighbors in the graph for the number of layers my graphnet has.
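For context on the indexing question: the PyG documentation for NeighborLoader describes sampled nodes as ordered by when they were sampled, so that the first batch_size nodes of each batch are the seed nodes. Assuming that ordering, a loss can be restricted to the seed rows with a plain slice. The sketch below is library-free; seed_only_loss and the toy numbers are hypothetical, not part of PyG.

```python
from typing import Sequence


def seed_only_loss(per_node_loss: Sequence[float], batch_size: int) -> float:
    """Average the loss over the first `batch_size` entries only.

    Assumes the seed nodes occupy the first `batch_size` positions of the
    sampled subgraph; every later entry belongs to a sampled neighbor.
    """
    if not 0 < batch_size <= len(per_node_loss):
        raise ValueError("batch_size must be between 1 and the number of nodes")
    seed_losses = per_node_loss[:batch_size]
    return sum(seed_losses) / len(seed_losses)


# Two seed nodes followed by three sampled neighbors whose losses are ignored.
per_node_loss = [0.5, 1.5, 9.0, 9.0, 9.0]
print(seed_only_loss(per_node_loss, batch_size=2))
```

With tensors the same idea is usually written as a slice such as out[:batch.batch_size], which fits a per-element loss like the reduction='none' setup in the report.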

rusty1s commented, Jul 18, 2022

We just added a data.validate() function to master to check for this 😃
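The excerpt does not include the comment that says what "this" refers to. Assuming it is edge_index entries that point outside the range [0, num_nodes), which is a plausible way for a native sampler to corrupt the heap, a manual pre-check could look like the sketch below. find_invalid_edges and the toy graph are hypothetical, and the code uses plain lists rather than tensors.

```python
from typing import List, Tuple


def find_invalid_edges(
    edge_index: List[List[int]], num_nodes: int
) -> List[Tuple[int, int, int]]:
    """Return (column, source, target) for each edge with an endpoint outside [0, num_nodes)."""
    if len(edge_index) != 2 or len(edge_index[0]) != len(edge_index[1]):
        raise ValueError("edge_index must have shape [2, num_edges]")
    bad = []
    for col, (src, dst) in enumerate(zip(edge_index[0], edge_index[1])):
        if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
            bad.append((col, src, dst))
    return bad


# Toy graph with 3 nodes; the last edge targets node id 3, which does not exist.
edge_index = [[0, 1, 2], [1, 2, 3]]
print(find_invalid_edges(edge_index, num_nodes=3))
```

On a PyG Data object, comparing int(data.edge_index.min()) and int(data.edge_index.max()) against data.num_nodes expresses the same check before the loader is constructed.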
