Bus error (core dumped) in EdgeDataLoader of unsupervised GraphSAGE example (20M edges)

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. In the example code: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/advanced/train_sampling_unsupervised.py
  2. Feed a dataset with 10M nodes and 20M edges
  3. A bus error (core dumped) occurs in EdgeDataLoader before training starts, and in DataLoader during inference.

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 0.8.1cu101
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.8.1+cu101
  • OS (e.g., Linux): Docker
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6.13
  • CUDA/cuDNN version (if applicable): 10.1
  • GPU models and configuration (e.g. V100): V100
  • Any other relevant information:

Additional context

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
yaox12 commented, Aug 8, 2022

This line of code in the dataloader creates a shared-memory array for shuffling: https://github.com/dmlc/dgl/blob/5ba5106acab6a642e9b790e5331ee519112a5623/python/dgl/dataloading/dataloader.py#L146-L149 When len(train_seeds) > 8M, the shared tensor exceeds Docker’s default shm size of 64 MB. @jermainewang @BarclayII
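
The arithmetic behind that threshold can be checked with a rough sketch like the one below. It assumes the seed IDs are stored as 8-byte (int64) values, which the comment does not state explicitly; the helper names `required_shm_bytes`, `fits_in_shm`, and `shm_total_bytes` are hypothetical and not part of DGL.

```python
import os
import shutil

# Docker's default /dev/shm size, as stated in the comment above (64 MB).
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024
# Assumption: each seed ID is an int64, i.e. 8 bytes.
BYTES_PER_ID = 8


def required_shm_bytes(num_seeds, bytes_per_id=BYTES_PER_ID):
    """Bytes needed to hold one ID per seed in a shared array."""
    return num_seeds * bytes_per_id


def fits_in_shm(num_seeds, shm_bytes=DOCKER_DEFAULT_SHM_BYTES):
    """True if an ID array of this length fits in the given shm budget."""
    return required_shm_bytes(num_seeds) <= shm_bytes


def shm_total_bytes(path="/dev/shm"):
    """Total size of the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


if __name__ == "__main__":
    # 20M edges is the dataset size reported in this issue.
    for n in (8_000_000, 20_000_000):
        print(n, required_shm_bytes(n), fits_in_shm(n))
    print("/dev/shm total bytes:", shm_total_bytes())
```

Under these assumptions, the 20M edge IDs from this report need about 160 MB, well past the 64 MB default. If that is the cause, starting the container with a larger shared-memory mount through Docker's `--shm-size` option (for example `docker run --shm-size=1g ...`) is one likely workaround; `shm_total_bytes()` can confirm what the container actually has.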

0 reactions
BarclayII commented, Aug 8, 2022

We can change the code to use shared tensors only when these conditions hold.

That sounds reasonable. I’m not sure how PyTorch shares the tensor to forked subprocesses though: if PyTorch uses shared memory then we are technically still copying the ID tensor into shared memory implicitly.
