Various malloc-related crashes with NeighborLoader
🐛 Describe the bug
I am using NeighborLoader to iterate a dataset that fits entirely in memory.
device = 'cpu'
node_size = 3
edge_size = 2
global_size = node_size
hidden_size = 64
learning_rate = 1e-5
batch_size = 1
epochs = 2
data = Data()
name_to_id, id_to_name = parse_csv_as_torch_data(data=data, nodes_path=train_nodes_path, edges_path=train_edges_path, exclude_path=test_node_names_path, device=device)
model = Graphnet(node_size=node_size, edge_size=edge_size, global_size=global_size, hidden_size=hidden_size)
print("%s:\t[%d,%d]" % ("x", data.x.shape[0], data.x.shape[1]))
print("%s:\t[%d,%d]" % ("y", data.y.shape[0], data.y.shape[1]))
print("%s:\t[%d,%d]" % ("edge_index", data.edge_index.shape[0], data.edge_index.shape[1]))
print("%s:\t[%d,%d]" % ("edge_attr", data.edge_attr.shape[0], data.edge_attr.shape[1]))
for m in torch.nn.ModuleList():
    print(m.device)
    m.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='none')
print("attempting to initialize data loaders...")
data_loader_train = NeighborLoader(data=data, batch_size=batch_size, num_neighbors=[1] * 1, drop_last=True)
print("Initialized train loader")
and the DataLoader never succeeds at initializing:
Moving to device: cpu
done
x: [26237,3]
y: [26237,1]
edge_index: [2,737180]
edge_attr: [737180,2]
attempting to initialize data loaders...
free(): corrupted unsorted chunks
For reference, my CSV files which describe the node and edge data are these sizes:
24M edges.csv
1.2M nodes.csv
I get a variety of memory-related errors when I attempt to run this:
free(): corrupted unsorted chunks
python3: malloc.c:3839: _int_malloc: Assertion 'chunk_main_arena (bck->bk)' failed.
corrupted double-linked list
corrupted size vs. prev_size
malloc_consolidate(): invalid chunk size
The most common is probably the last one. I’ve tried batch_size=1, drop_last=True, and reducing the neighbors to one iteration and/or one neighbor. It doesn’t seem to matter whether I use cpu or gpu.
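Heap-corruption messages like the ones above usually mean native code wrote outside an allocated buffer. One thing worth ruling out before constructing the loader, offered as an assumption and not as a diagnosis confirmed in this report, is an edge list that refers to node ids that do not exist. The sketch below uses plain Python lists in place of tensors; the function name find_invalid_edges and the toy graph are hypothetical.

```python
def find_invalid_edges(edge_index, num_nodes):
    """Return (column, source, target) for each edge whose endpoints fall outside [0, num_nodes)."""
    sources, targets = edge_index
    if len(sources) != len(targets):
        raise ValueError("edge_index rows must have equal length")
    bad = []
    for col, (s, t) in enumerate(zip(sources, targets)):
        if not (0 <= s < num_nodes and 0 <= t < num_nodes):
            bad.append((col, s, t))
    return bad


# Toy graph with 3 nodes; the last edge points at node 3, which does not exist.
toy_edge_index = [[0, 1, 2], [1, 2, 3]]
invalid = find_invalid_edges(toy_edge_index, num_nodes=3)
print(invalid)
```

With real tensors, the equivalent check would be that data.edge_index.min() is at least 0 and data.edge_index.max() is less than data.num_nodes.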
Here are the stats when using the time -v command:
Command terminated by signal 6
Command being timed: "python3 ../scripts/find_bubbles.py"
User time (seconds): 7.07
System time (seconds): 0.66
Percent of CPU this job got: 110%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 608156
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 116061
Voluntary context switches: 95
Involuntary context switches: 137
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
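Signal 6 is SIGABRT, which glibc raises when it detects a corrupted heap. To see which Python line was executing at that moment, the standard-library faulthandler module can dump the Python stack when the process aborts. This is a minimal sketch; the helper name and the log path are hypothetical, and the dump shows only where the abort was detected, which may be later than where the memory was actually corrupted.

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_path):
    """Dump the Python stack to log_path on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL."""
    log_file = open(log_path, "w")  # must stay open for the life of the process
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path works.
crash_log_path = os.path.join(tempfile.gettempdir(), "neighbor_loader_crash.log")
crash_log = enable_crash_traceback(crash_log_path)
# ... construct the NeighborLoader and iterate it after this point ...
```

Running the script with python3 -X faulthandler has a similar effect without code changes, writing the dump to stderr.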
Environment
- PyG version: 2.0.4
- PyTorch version: 1.11.0+cu113
- OS: Ubuntu 20.04
- Python version: 3.8.10
- CUDA/cuDNN version: 11.4.4 and 11.3.1
- How you installed PyTorch and PyG (conda, pip, source): pip
- Any other relevant information (e.g., version of torch-scatter):
Issue Analytics
- State:
- Created a year ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
Awesome, thanks. I will close the issue then.
I also have a question regarding the behavior of NeighborLoader… for each data object sampled, is the centroid of the subgraph (the starting node) always at index 0 of data.x? If I am training a node classifier, I don’t think I want to compute a prediction or loss for any node which doesn’t have sufficient neighbors in the graph for the number of layers my graphnet has.
We just added a data.validate() function to master to check for this 😃
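On the question of where the starting nodes land: the PyG documentation for NeighborLoader states that the first batch_size nodes of each sampled batch are the original mini-batch (seed) nodes, so a loss restricted to the seeds can slice on that. A minimal sketch, with plain lists standing in for tensors and all names hypothetical:

```python
def seed_only(values, batch_size):
    """Keep the entries that belong to the seed nodes (the first batch_size entries)."""
    return values[:batch_size]


# Stand-ins for one sampled batch: 2 seed nodes followed by 3 sampled neighbors.
batch_size = 2
predictions = [0.9, 0.2, 0.5, 0.7, 0.1]
targets = [1.0, 0.0, 1.0, 1.0, 0.0]

seed_predictions = seed_only(predictions, batch_size)
seed_targets = seed_only(targets, batch_size)
# With tensors, the same idea would read:
#   loss_fn(out[:batch.batch_size], batch.y[:batch.batch_size])
```

The neighbor rows still take part in message passing; they are only left out of the loss.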