
fairseq with torch 1.8.0, ROCm 4.0.1 and MI50 AMD GPUs

See original GitHub issue

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to run fairseq on an AMD cluster. Installation was smooth. I have a trained checkpoint and run this command to perform validation:

fairseq-validate data-bin/iwslt14.tokenized.de-en.eostask/ --task translation_eos --path ckpt_numeos_1/checkpoint_best.pt --user-dir ./fairseq_module/ --max-tokens 4096

This works on an NVIDIA/CUDA machine, but fails under ROCm with the following error:

Traceback (most recent call last):
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 302, in batch_by_size
    from fairseq.data.data_utils_fast import (
  File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ext3/miniconda3/bin/fairseq-validate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-validate')())
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 145, in cli_main
    distributed_utils.call_main(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/distributed/utils.py", line 364, in call_main
    main(cfg, **kwargs)
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 89, in main
    itr = task.get_batch_iterator(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/tasks/fairseq_task.py", line 285, in get_batch_iterator
    batch_sampler = dataset.batch_by_size(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
    return data_utils.batch_by_size(
  File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 313, in batch_by_size
    raise ValueError(
ValueError: Please build (or rebuild) Cython components with: `pip install  --editable .` or `python setup.py build_ext --inplace`.

Code

What have you tried?

I have tried to build the Cython components as suggested above; the build succeeds, but the error still occurs. From my understanding, the Cython extensions for fast data processing may not work well with ROCm. Is there an environment variable that disables the fast data batching (and similar fast paths) so that I have a higher chance of success with ROCm?
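For context, the numbers in the error can be inspected directly from Python. The following is a small diagnostic sketch (not part of fairseq) showing the two quantities Cython's import-time check compares:

```python
# The "Expected 88 from C header, got 80 from PyObject" message comes from
# Cython's import-time ABI check: it compares sizeof(ndarray) as recorded
# when data_utils_fast was compiled against the tp_basicsize of the
# numpy.ndarray type imported at runtime. numpy 1.20 enlarged this struct,
# so building the extension against one numpy and running against another
# triggers exactly this ValueError.
import numpy as np

print("runtime numpy version:", np.__version__)
# __basicsize__ exposes the C-level tp_basicsize that the check inspects.
print("runtime sizeof(ndarray):", np.ndarray.__basicsize__)
```

If rebuilding does not clear the error, the rebuild is likely picking up a different numpy than the one imported at runtime (e.g. a build-isolation numpy vs. the conda one).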

What’s your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): Ubuntu 20.04.2 LTS, used within Singularity container
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ., then python setup.py build_ext --inplace after I got the error message.
  • Python version: 3.8.5
  • CUDA/cuDNN version: None
  • GPU models and configuration: 8 x AMD MI50 (gfx906, 32gb)
  • Any other relevant information: ROCm 4.0.1 (used within a Singularity container with the --rocm argument).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
compwiztobe commented, Mar 23, 2021

What versions of numpy and fairseq do you have installed? I’ve seen that binary incompatibility error with numpy 1.19.5 and fairseq 0.10.0, but not with fairseq 0.10.2, or with numpy >= 1.20.0.

I’m not sure, however, why training should succeed while only validation shows this problem (I saw it when running fairseq-train).
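A quick way to collect the two versions this comment asks about (a sketch; assumes Python ≥ 3.8 for importlib.metadata, and an editable checkout may not register a fairseq distribution):

```python
# Print the numpy and fairseq versions to check against the known-bad
# combination mentioned above (numpy 1.19.5 with fairseq 0.10.0).
from importlib.metadata import PackageNotFoundError, version

import numpy

print("numpy:", numpy.__version__)
try:
    print("fairseq:", version("fairseq"))
except PackageNotFoundError:
    # source/editable checkouts may not be registered as a distribution
    print("fairseq: no installed distribution found")
```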

2 reactions
uralik commented, Mar 27, 2021

The issue was the MKL-linked numpy that conda installed together with torch; it appears to have problems on AMD EPYC CPUs. Installing a non-MKL numpy from pip solved the issue.
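One way to check whether the installed numpy is an MKL build, as this comment suggests (a sketch; the exact show_config output format varies across numpy versions):

```python
# numpy.show_config() reports the BLAS/LAPACK libraries numpy was linked
# against: an MKL build mentions "mkl" there, while the default pip wheel
# is typically linked against OpenBLAS.
import io
from contextlib import redirect_stdout

import numpy

buf = io.StringIO()
with redirect_stdout(buf):
    numpy.show_config()  # prints the build/link configuration to stdout
config = buf.getvalue().lower()

print("MKL-linked numpy" if "mkl" in config else "non-MKL numpy (e.g. OpenBLAS)")
```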


