Use of --max-ind-range

See original GitHub issue

I am having an issue with the --max-ind-range argument. It is applied when preprocessing the dataset, and if we set it to a lower value for the training run, training fails with a runtime error saying an index is out of range. Ideally, preprocessing should not use it at all; instead, the training run should apply the modulo operation in the data loader. What do you think?

Here is an example of the error. I preprocessed the Terabyte dataset using a 10M range. If I then run training with --max-ind-range=1000000 (1M), it produces this runtime error:

python dlrm_s_pytorch.py --arch-sparse-feature-size=128 --arch-mlp-bot="13-512-256-128" --arch-mlp-top="1024-1024-512-256-1" --data-generation=dataset --data-set=terabyte --raw-data-file=$HOME/dlrm_dataset/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=10 --num-batches=100 --print-time --test-freq=0 --test-mini-batch-size=16384 --test-num-workers=0 --memory-map --mlperf-logging  --max-ind-range=1000000 --numpy-rand-seed 4


time/loss/accuracy (if enabled):
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1055, in <module>
    Z = dlrm_wrap(X, lS_o, lS_i, use_gpu, device)
  File "dlrm_s_pytorch.py", line 948, in dlrm_wrap
    return dlrm(X, lS_o, lS_i)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in _call_
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 384, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 396, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l)
  File "dlrm_s_pytorch.py", line 338, in apply_emb
    V = E(sparse_index_group_batch.contiguous(), sparse_offset_group_batch)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in _call_
    result = self.forward(*input, **kwargs)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 281, in forward
    per_sample_weights)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/functional.py", line 1646, in embedding_bag
    per_sample_weights)
RuntimeError: [enforce fail at embedding_lookup_idx.cc:226] 0 <= idx && idx < data_size. Index 815 is out of bounds: 2147347, range 0 to 1000000
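
To see why the lookup fails, here is a minimal, hypothetical reproduction (not code from the DLRM repo). It assumes only that the training run sizes its torch.nn.EmbeddingBag tables from --max-ind-range (1M here) while the preprocessed indices still span the 10M range used during preprocessing:

import torch

# Hypothetical reproduction, not code from dlrm_s_pytorch.py: the table holds
# 1M rows, but the preprocessed dataset still contains raw indices up to ~10M.
max_ind_range = 1_000_000
emb = torch.nn.EmbeddingBag(num_embeddings=max_ind_range,
                            embedding_dim=128, mode="sum")

indices = torch.tensor([2_147_347])  # an index left over from 10M-range preprocessing
offsets = torch.tensor([0])

emb(indices, offsets)  # raises the same "index out of bounds" RuntimeError as above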

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
ddkalamk commented, Feb 1, 2020

What you cannot do is run preprocessing with one --max-ind-range and then run training with another value of --max-ind-range. Is this the case for your error above?

Yes. Currently there is no point in specifying --max-ind-range for a training run (unless preprocessing is done in the same run), so the behavior of --max-ind-range during training is confusing. It should either error out at the start, saying that the specified --max-ind-range is smaller than the actual embedding table size in the preprocessed dataset (avoiding the runtime error later), or the modulo operation should be performed in the data loader before the indices are returned. I suggest the latter option, as it allows preprocessing to be done once on a big machine and then different --max-ind-range values to be used for training runs depending on the resources available.
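
A sketch of that second option, assuming a small helper applied to the per-table sparse index tensors (lS_i) before they reach the model; the function name and its placement in the data loader are hypothetical, not the repo's actual code:

import torch

def remap_sparse_indices(lS_i, max_ind_range):
    # Hypothetical helper: fold raw categorical indices into [0, max_ind_range)
    # so a dataset preprocessed with a large range can be trained against
    # smaller embedding tables.
    if max_ind_range > 0:
        return [indices % max_ind_range for indices in lS_i]
    return lS_i

# Example: indices preprocessed with a 10M range, trained with --max-ind-range=1000000
lS_i = [torch.tensor([2_147_347, 815, 42])]
print(remap_sparse_indices(lS_i, 1_000_000))  # -> [tensor([147347, 815, 42])]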

It might also be helpful to save the --max-ind-range used during preprocessing and print it when the dataset is loaded, and to dump the actual MLP and embedding table sizes being used under the MLPerf logging option.
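
For that last point, a minimal sketch of what saving and checking such metadata could look like; the field names and file layout here are made up for illustration and are not the repo's actual processed-file format:

import numpy as np

# Hypothetical sketch: store the preprocessing-time --max-ind-range next to
# the processed arrays so a later training run can fail fast (or remap)
# instead of erroring deep inside embedding_bag.
def save_processed(path, X_int, X_cat, y, counts, max_ind_range):
    np.savez_compressed(path, X_int=X_int, X_cat=X_cat, y=y,
                        counts=counts, max_ind_range=max_ind_range)

def load_processed(path, training_max_ind_range):
    data = np.load(path)
    stored = int(data["max_ind_range"])
    print("dataset was preprocessed with --max-ind-range =", stored)
    if 0 < training_max_ind_range < stored:
        raise ValueError("--max-ind-range=%d is smaller than the preprocessing "
                         "range %d; raise it or remap indices in the data loader"
                         % (training_max_ind_range, stored))
    return data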

0 reactions
ddkalamk commented, Feb 24, 2020

Looks good. Thank you, closing the issue.

Read more comments on GitHub >

Top Results From Across the Web

Optimizing Deep Learning Recommender Systems Training ...
relevant application mix: 1) recommender systems (RecSys) and 2) language models, e.g. recurrent ... 256-1" --max-ind-range=40000000 --data-generation=dataset.
Read more >
DLRM - HackMD
--inference-only --save-onnx --save-proto-types-shapes --use-gpu --print-freq ... --max-ind-range=40000000 --data-generation=dataset --data-set=terabyte ...
Read more >
Problem saving nn.Module as a TorchScript module (DLRM ...
Hi, I am trying to create a TorchScript module of Facebook's deep learning recommendation model (DLRM) using torch.jit.script() method.
Read more >
DLRM (Deep Learning Recommendation Model) is a deep learning ...
Please do the following to prepare the dataset for use with DLRM code: ... --max-ind-range=40000000 --data-generation=random --loss-function=bce ...
Read more >
Inserting price series into the Economatica system ...
Accepted columns and codes ; MinIndRange. Min indicative range ; MaxIndRange. Max indicative ...
Read more >
