Bus error (core dumped) in EdgeDataLoader of unsupervised GraphSAGE example (20M edges)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- In the example code: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/advanced/train_sampling_unsupervised.py
- Feed a dataset with 10M nodes and 20M edges
- A bus error (core dumped) occurs in EdgeDataLoader before training, and in DataLoader during inference (see the loader sketch after this list).
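For context, the failing loader is set up roughly as below. This is a minimal sketch, assuming the dgl.dataloading.EdgeDataLoader API used by the example script at the time (deprecated in newer DGL releases); the graph, edge IDs, fan-outs, batch size, and worker count are placeholders, not the exact values from the script.

```python
import torch
import dgl

# Placeholder graph of roughly the reported size (~10M nodes, ~20M edges).
g = dgl.rand_graph(10_000_000, 20_000_000)
train_eids = torch.arange(g.num_edges())   # int64 edge IDs used as seeds

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_eids, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(1),
    batch_size=1024,
    shuffle=True,      # shuffling is what allocates the shared ID tensor
    drop_last=False,
    num_workers=4)     # worker subprocesses read the shared ID tensor

for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    break  # the reported bus error happens before the first batch is produced
```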
Expected behavior
Environment
- DGL Version (e.g., 1.0): 0.8.1cu101
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.8.1+cu101
- OS (e.g., Linux): Docker
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.6.13
- CUDA/cuDNN version (if applicable): 10.1
- GPU models and configuration (e.g. V100): V100
- Any other relevant information:
Additional context
Comments
This line of code in the dataloader creates a shared-memory array for shuffling: https://github.com/dmlc/dgl/blob/5ba5106acab6a642e9b790e5331ee519112a5623/python/dgl/dataloading/dataloader.py#L146-L149
When len(train_seeds) exceeds roughly 8M, the shared tensor overflows Docker's default shm size of 64 MB. @jermainewang @BarclayII

That sounds reasonable. I'm not sure how PyTorch shares the tensor with forked subprocesses, though: if PyTorch uses shared memory, then we are technically still copying the ID tensor into shared memory implicitly.
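To make the failure mode concrete, here is a back-of-envelope check of how much shared memory the shuffled seed-ID tensor needs for the reported dataset. This is my own sketch, assuming int64 edge IDs (8 bytes each), which is consistent with the ~8M-seed threshold mentioned above (8M IDs * 8 B ≈ 64 MB).

```python
# Rough estimate of the shared memory the shuffled ID tensor requires.
num_train_seeds = 20_000_000   # ~20M edges in the reported dataset
bytes_per_id = 8               # assuming int64 edge IDs
shm_needed_mb = num_train_seeds * bytes_per_id / 2**20
print(f"ID tensor needs ~{shm_needed_mb:.0f} MB of shared memory")
# -> ~153 MB, well above the 64 MB /dev/shm a Docker container gets by default.
```

A common workaround is to enlarge the container's shared memory so the tensor fits, e.g. starting the container with `docker run --shm-size=1g ...` (or the equivalent `shm_size` option in docker-compose).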