fairseq with torch 1.8.0, ROCm 4.0.1 and MI50 AMD GPUs
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
I am trying to run fairseq on an AMD cluster. Installation was smooth. I have a trained checkpoint and run this command to perform validation:
fairseq-validate data-bin/iwslt14.tokenized.de-en.eostask/ --task translation_eos --path ckpt_numeos_1/checkpoint_best.pt --user-dir ./fairseq_module/ --max-tokens 4096
This works on an NVIDIA/CUDA machine, but fails with ROCm with the following error:
Traceback (most recent call last):
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 302, in batch_by_size
from fairseq.data.data_utils_fast import (
File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ext3/miniconda3/bin/fairseq-validate", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-validate')())
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 145, in cli_main
distributed_utils.call_main(
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/distributed/utils.py", line 364, in call_main
main(cfg, **kwargs)
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq_cli/validate.py", line 89, in main
itr = task.get_batch_iterator(
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/tasks/fairseq_task.py", line 285, in get_batch_iterator
batch_sampler = dataset.batch_by_size(
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size
return data_utils.batch_by_size(
File "/state/partition1/ik1147/nmt_multiple_eos/fairseq/fairseq/data/data_utils.py", line 313, in batch_by_size
raise ValueError(
ValueError: Please build (or rebuild) Cython components with: `pip install --editable .` or `python setup.py build_ext --inplace`.
Code
What have you tried?
I have tried to build the Cython components as suggested above; the build is successful, but the error still occurs. From my understanding, the Cython extensions for fast data processing may not work well with ROCm. Is there some env var that will disable all fast data batching etc., so that I have a higher chance of success with ROCm?
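For reference, a minimal way to check whether the rebuilt extension actually loads against the runtime numpy (the module path fairseq.data.data_utils_fast comes from the traceback above; the rest is a generic sketch, not fairseq-specific tooling):

```python
import numpy as np

# The "numpy.ndarray size changed" error means the Cython extension was
# compiled against different numpy headers than the numpy loaded at runtime.
print("runtime numpy:", np.__version__)

# Try importing the compiled extension directly; a ValueError here is the
# same ABI mismatch that fairseq-validate hits.
try:
    import fairseq.data.data_utils_fast  # noqa: F401
    print("fairseq Cython extensions import cleanly")
except (ImportError, ValueError) as exc:
    print("extension does not load:", exc)
```

If the direct import fails with the same ValueError, the rebuild picked up stale build artifacts or a different numpy than the one active at runtime.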
What’s your environment?
- fairseq Version (e.g., 1.0 or master): master
- PyTorch Version (e.g., 1.0): 1.8.0
- OS (e.g., Linux): Ubuntu 20.04.2 LTS, used within Singularity container
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install --editable . , then I also ran python setup.py build_ext --inplace after I got the error message.
- Python version: 3.8.5
- CUDA/cuDNN version: None
- GPU models and configuration: 8 x AMD MI50 (gfx906, 32gb)
- Any other relevant information: ROCm 4.0.1 (used within a Singularity container with the --rocm argument).
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:5 (3 by maintainers)
Top GitHub Comments
What versions of numpy and fairseq do you have installed? I’ve seen that binary incompatibility error with numpy 1.19.5 and fairseq 0.10.0, but not with fairseq 0.10.2, or with numpy >= 1.20.0.
I’m not sure, however, why training should succeed and only validation shows this problem (I saw it when running fairseq-train).

The issue was the numpy built with MKL, which was installed together with torch through conda and which seems to have issues with AMD EPYC CPUs; installing numpy without MKL from pip solved the issue.
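A quick way to confirm which BLAS backend the installed numpy links against (a sketch; the MKL-linked numpy from conda mentions "mkl" in its build info, while pip wheels typically link OpenBLAS):

```python
import numpy as np

# show_config() prints the BLAS/LAPACK libraries this numpy build links
# against: look for "mkl" (conda default) vs "openblas" (pip wheel).
np.show_config()
```

After reinstalling numpy from pip, the output should no longer mention MKL.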