Use of --max-ind-range
See original GitHub issue.
I am having an issue with the --max-ind-range argument. It is applied when the dataset is preprocessed, and if it is set to a lower value for the training run, training fails with an index-out-of-range runtime error. Ideally, preprocessing should not use it at all; instead, the training run should apply the modulo operation in the data loader. What do you think?
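Roughly what I mean, as a minimal sketch (the helper name below is made up for illustration, it is not the actual code in dlrm_s_pytorch.py): the data loader would clamp the raw categorical indices with a modulo before handing them to the model, so the same preprocessed files work with any --max-ind-range.

import numpy as np

def cap_indices(sparse_indices, max_ind_range):
    """Map raw categorical indices into [0, max_ind_range) via modulo.

    sparse_indices: int64 array of embedding lookup indices as read
                    from the preprocessed dataset.
    max_ind_range:  value of --max-ind-range for this training run;
                    <= 0 means "no capping".
    """
    if max_ind_range > 0:
        return sparse_indices % max_ind_range
    return sparse_indices

# indices preprocessed with a 10M range still work with a 1M-row table
raw = np.array([815, 2147347, 9999999], dtype=np.int64)
capped = cap_indices(raw, 1000000)  # all values now fall in [0, 1000000)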
Here is an example of the error. I preprocessed the Terabyte dataset using a 10M range. If I then run training with --max-ind-range=1000000 (1M), it produces this runtime error:
python dlrm_s_pytorch.py --arch-sparse-feature-size=128 --arch-mlp-bot="13-512-256-128" --arch-mlp-top="1024-1024-512-256-1" --data-generation=dataset --data-set=terabyte --raw-data-file=$HOME/dlrm_dataset/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=10 --num-batches=100 --print-time --test-freq=0 --test-mini-batch-size=16384 --test-num-workers=0 --memory-map --mlperf-logging --max-ind-range=1000000 --numpy-rand-seed 4
time/loss/accuracy (if enabled):
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1055, in <module>
    Z = dlrm_wrap(X, lS_o, lS_i, use_gpu, device)
  File "dlrm_s_pytorch.py", line 948, in dlrm_wrap
    return dlrm(X, lS_o, lS_i)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 384, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 396, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l)
  File "dlrm_s_pytorch.py", line 338, in apply_emb
    V = E(sparse_index_group_batch.contiguous(), sparse_offset_group_batch)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 281, in forward
    per_sample_weights)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/functional.py", line 1646, in embedding_bag
    per_sample_weights)
RuntimeError: [enforce fail at embedding_lookup_idx.cc:226] 0 <= idx && idx < data_size. Index 815 is out of bounds: 2147347, range 0 to 1000000
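For reference, the failure mode can be reproduced outside DLRM with a plain nn.EmbeddingBag as soon as any lookup index is >= the table size. The sizes below are scaled-down stand-ins for illustration (1000 rows instead of 1M, index 2147 instead of 2147347):

import torch
import torch.nn as nn

# table sized as if --max-ind-range were smaller than the range
# used during preprocessing
emb = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode="sum")

indices = torch.tensor([815, 2147], dtype=torch.int64)  # 2147 >= 1000
offsets = torch.tensor([0], dtype=torch.int64)

emb(indices, offsets)  # raises an index-out-of-bounds RuntimeError
                       # (exact wording depends on the PyTorch version)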
Top GitHub Comments
Yes, currently there is no reason to specify --max-ind-range for a training run (unless preprocessing is done in the same run), so its behavior during training is confusing. Either the script should error out at startup, reporting that the specified --max-ind-range is smaller than the actual embedding table size in the preprocessed dataset (so the runtime error is avoided later), or the modulo operation should be performed in the data loader before the indices are returned. I suggest the latter, since it allows preprocessing once on a big machine and then using different --max-ind-range values for training runs depending on the resources available. A rough sketch of the fail-fast check is below.
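For the first option, the check could look roughly like this (variable names are illustrative; ln_emb here stands for the per-table cardinalities read from the preprocessed dataset, not necessarily the exact variable in dlrm_s_pytorch.py):

def check_max_ind_range(ln_emb, max_ind_range):
    """Fail fast if any preprocessed embedding table is larger than
    the --max-ind-range requested for this training run."""
    if max_ind_range <= 0:
        return
    too_big = [n for n in ln_emb if n > max_ind_range]
    if too_big:
        raise ValueError(
            "--max-ind-range=%d is smaller than %d preprocessed table(s); "
            "the largest has %d rows. Re-run preprocessing or apply the "
            "modulo in the data loader."
            % (max_ind_range, len(too_big), max(too_big))
        )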
Also, it might be helpful to save the --max-ind-range used for preprocessing and print it when the dataset is loaded. It would also help to dump the actual MLP and embedding table sizes being used under the MLPerf logging option.
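As a rough illustration only (the helper names and the .npz layout below are made up, not the repo's actual processed-data format or its MLPerf logger API):

import numpy as np

def save_processed(path, X_int, X_cat, y, counts, max_ind_range):
    # store the preprocessing-time --max-ind-range alongside the data
    np.savez_compressed(path, X_int=X_int, X_cat=X_cat, y=y,
                        counts=counts, max_ind_range=max_ind_range)

def load_processed(path):
    data = np.load(path)
    if "max_ind_range" in data:
        print("dataset was preprocessed with --max-ind-range=%d"
              % int(data["max_ind_range"]))
    print("embedding table sizes:", data["counts"])
    return data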
Looks good. Thank you, closing the issue.