
fairseq with torch 1.8.0, ROCm 4.0.1 and MI50 AMD GPUs

See original GitHub issue

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to run fairseq on an AMD cluster. Installation was smooth. I have a trained checkpoint and run this command to perform validation:

fairseq-validate data-bin/iwslt14.tokenized.de-en.eostask/ --task translation_eos --path ckpt_numeos_1/checkpoint_best.pt --user-dir ./fairseq_module/ --max-tokens 4096

This works on an NVIDIA/CUDA machine, but fails under ROCm with the following error:

Traceback (most recent call last):
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 302, in batch_by_size
    from fairseq.data.data_utils_fast import (
  File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ext3/miniconda3/bin/fairseq-validate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-validate')())
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 145, in cli_main
    distributed_utils.call_main(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/distributed/utils.py", line 364, in call_main
    main(cfg, **kwargs)
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 89, in main
    itr = task.get_batch_iterator(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/tasks/fairseq_task.py", line 285, in get_batch_iterator
    batch_sampler = dataset.batch_by_size(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
    return data_utils.batch_by_size(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 313, in batch_by_size
    raise ValueError(
ValueError: Please build (or rebuild) Cython components with: `pip install  --editable .` or `python setup.py build_ext --inplace`.

Code

What have you tried?

I have tried to build the Cython components as suggested above; the build succeeds, but the error still occurs. From my understanding, the Cython extensions for fast data processing may not work well with ROCm. Is there an environment variable that disables the fast data batching (and similar fast paths) so that I have a higher chance of success with ROCm?
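For context, the numbers in the error can be inspected directly from Python. The following is a small diagnostic sketch (not part of fairseq) showing the two quantities Cython's import-time check compares:

```python
# The "Expected 88 from C header, got 80 from PyObject" message comes from
# Cython's import-time ABI check: it compares sizeof(ndarray) as recorded
# when data_utils_fast was compiled against the tp_basicsize of the
# numpy.ndarray type imported at runtime. numpy 1.20 enlarged this struct,
# so building the extension against one numpy and running against another
# triggers exactly this ValueError.
import numpy as np

print("runtime numpy version:", np.__version__)
# __basicsize__ exposes the C-level tp_basicsize that the check inspects.
print("runtime sizeof(ndarray):", np.ndarray.__basicsize__)
```

If rebuilding does not clear the error, the rebuild is likely picking up a different numpy than the one imported at runtime (e.g. a build-isolation numpy vs. the conda one).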

What’s your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): Ubuntu 20.04.2 LTS, used within Singularity container
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ., then python setup.py build_ext --inplace after I got the error message.
  • Python version: 3.8.5
  • CUDA/cuDNN version: None
  • GPU models and configuration: 8 x AMD MI50 (gfx906, 32gb)
  • Any other relevant information: ROCm 4.0.1 (used within a Singularity container with the --rocm argument).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
compwiztobe commented, Mar 23, 2021

What versions of numpy and fairseq do you have installed? I’ve seen that binary incompatibility error with numpy 1.19.5 and fairseq 0.10.0, but not with fairseq 0.10.2, or with numpy >= 1.20.0.

I’m not sure, however, why training should succeed while only validation shows this problem (I saw it when running fairseq-train).
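A quick way to collect the two versions this comment asks about (a sketch; assumes Python ≥ 3.8 for importlib.metadata, and an editable checkout may not register a fairseq distribution):

```python
# Print the numpy and fairseq versions to check against the known-bad
# combination mentioned above (numpy 1.19.5 with fairseq 0.10.0).
from importlib.metadata import PackageNotFoundError, version

import numpy

print("numpy:", numpy.__version__)
try:
    print("fairseq:", version("fairseq"))
except PackageNotFoundError:
    # source/editable checkouts may not be registered as a distribution
    print("fairseq: no installed distribution found")
```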

2 reactions
uralik commented, Mar 27, 2021

The issue was the MKL-linked numpy that conda installed together with torch; it appears to have problems on AMD EPYC CPUs. Installing a non-MKL numpy from pip solved the issue.
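One way to check whether the installed numpy is an MKL build, as this comment suggests (a sketch; the exact show_config output format varies across numpy versions):

```python
# numpy.show_config() reports the BLAS/LAPACK libraries numpy was linked
# against: an MKL build mentions "mkl" there, while the default pip wheel
# is typically linked against OpenBLAS.
import io
from contextlib import redirect_stdout

import numpy

buf = io.StringIO()
with redirect_stdout(buf):
    numpy.show_config()  # prints the build/link configuration to stdout
config = buf.getvalue().lower()

print("MKL-linked numpy" if "mkl" in config else "non-MKL numpy (e.g. OpenBLAS)")
```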


