Sampling chooses vocab index that does not exist with certain random seeds

I am running into the following error while sampling with certain seeds:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 269, in <module>
    main()
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 46, in main
    run_translate(args)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 155, in run_translate
    input_is_json=args.json_input)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 237, in read_and_translate
    chunk_time = translate(output_handler, chunk, translator)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/translate.py", line 260, in translate
    trans_outputs = translator.translate(trans_inputs)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 861, in translate
    results.append(self._make_result(trans_input, translation))
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 963, in _make_result
    target_tokens = [self.vocab_target_inv[target_id] for target_id in target_ids]
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/inference.py", line 963, in <listcomp>
    target_tokens = [self.vocab_target_inv[target_id] for target_id in target_ids]
KeyError: 7525

I am calling Sockeye with a script such as:

OMP_NUM_THREADS=1 python -m sockeye.translate \
                -i $data_sub/$corpus.pieces.src \
                -o $samples_sub_sub/$corpus.pieces.$seed.trg \
                -m $model_path \
                --sample \
                --seed $seed \
                --length-penalty-alpha 1.0 \
                --device-ids 0 \
                --batch-size 64 \
                --disable-device-locking

Sockeye and MXNet versions:

[2020-08-25:17:03:03:INFO:sockeye.utils:log_sockeye_version] Sockeye version 2.1.17, commit 92a020a25cbe75935c700ce2f29b286b31a87189, path /net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/sockeye/__init__.py
[2020-08-25:17:03:03:INFO:sockeye.utils:log_mxnet_version] MXNet version 1.6.0, path /net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3/lib/python3.5/site-packages/mxnet/__init__.py

Details that may be relevant:

This only happens for certain values of --seed.
Running on a Tesla V100.
OS: Ubuntu 16.04.6 LTS.
The MXNet version in the CUDA 10.2 requirements file (https://github.com/awslabs/sockeye/blob/master/requirements/requirements.gpu-cu102.txt) is no longer available on PyPI. I had to install mxnet-cu102mkl==1.6.0.post0.

The vocabulary does not have this index. Both vocabularies contain 7525 entries, so with 0-based ids the largest valid id is 7524:

[INFO:sockeye.vocab] Vocabulary (7525 words) loaded from "/net/cephfs/scratch/mathmu/map-volatility/models/bel-eng/baseline/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (7525 words) loaded from "/net/cephfs/scratch/mathmu/map-volatility/models/bel-eng/baseline/vocab.trg.0.json"

I suspect that the sampling procedure somehow assumes 1-based indexing, whereas the vocabulary is 0-indexed. This would mean that there is a small chance that max_vocab_id+1 (here 7525) is picked as the next token.

Looking at the inference code, I am not sure yet why this happens.
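
To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use Sockeye; the toy vocabulary, the token names, and the helper ids_to_tokens are hypothetical and only mirror the shape of the lookup in the traceback (a dict from id to token with 7525 entries and 0-based ids). It shows that id 7525 is one past the last valid id, and how a lookup could report the offending ids instead of raising a bare KeyError.

```python
# Hypothetical toy vocabulary: 7525 entries with 0-based ids (0..7524),
# mirroring the size reported in the log above. Not Sockeye's real vocabulary.
vocab_size = 7525
vocab_target_inv = {i: "tok%d" % i for i in range(vocab_size)}


def ids_to_tokens(target_ids, vocab_inv):
    """Map ids to tokens, naming any id that falls outside the vocabulary."""
    bad = [i for i in target_ids if i not in vocab_inv]
    if bad:
        raise ValueError(
            "ids outside vocabulary of size %d: %s" % (len(vocab_inv), bad)
        )
    return [vocab_inv[i] for i in target_ids]


# The largest valid id is vocab_size - 1; vocab_size itself is not a key.
print(max(vocab_target_inv))           # largest valid id
print(vocab_size in vocab_target_inv)  # whether 7525 is a valid id
print(ids_to_tokens([0, 7524], vocab_target_inv))
```

A check like this only makes the symptom easier to read; it says nothing about why the sampler would produce the out-of-range id in the first place.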

Top GitHub Comments

bricksdont commented, Sep 18, 2020

I still believe this is an MXNet bug, but I don’t know how to reduce the problem to the single RNG state and input that cause random.multinomial to misbehave. As @fhieber said, that would be possible if the RNG state could be saved somehow.

@KellenSunderland we could use some MXNet expertise here, if you are interested in tackling this.
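
As an illustration of how a multinomial sampler can, in principle, return an index one past the end only for rare random draws, here is a minimal sketch in plain Python. It is an assumption made for illustration, not a confirmed diagnosis and not MXNet's implementation: the functions sample_index and sample_index_safe are hypothetical, and the probabilities are chosen to sum to slightly less than 1 to stand in for floating-point shortfall in a cumulative sum.

```python
import bisect
import itertools


def sample_index(probs, u):
    """Inverse-CDF sampling: first index whose cumulative probability exceeds u."""
    cdf = list(itertools.accumulate(probs))
    # If u is at or above the last cumulative value, this returns len(probs),
    # which is one past the last valid index.
    return bisect.bisect_right(cdf, u)


def sample_index_safe(probs, u):
    """Same as sample_index, but clamped to the last valid index."""
    return min(sample_index(probs, u), len(probs) - 1)


# Hypothetical distribution whose mass sums to 0.999 instead of 1.0.
probs = [0.5, 0.25, 0.249]

print(sample_index(probs, 0.6))          # a typical draw lands inside the range
print(sample_index(probs, 0.9995))       # a rare draw lands one past the end
print(sample_index_safe(probs, 0.9995))  # clamped variant stays in range
```

In a sampler of this shape, only draws in the small uncovered tail go out of range, which would be consistent with the error appearing for some seeds and not others. Whether anything similar happens inside random.multinomial is not established in this thread.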

mjdenkowski commented, Mar 30, 2022

Closing for now as this applies to an older version of Sockeye.