
[BUG] deepspeed-inference does not work correctly with torch.half on Pascal GPUs

See original GitHub issue

Describe the bug

Thanks for releasing deepspeed-inference. I'm following the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and want to run inference in half precision by setting dtype=torch.half. However, on a Tesla P40 it does not work correctly and generates meaningless text such as [{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]. As a side note, when I switched the GPU to a Tesla T4 with the same environment and script, the issue did not occur (log attached in Additional context). Could it be that Pascal GPUs are not supported by deepspeed-inference?

To Reproduce

# Filename: gpt-neo-2.7b-generation-float16.py
import os
import deepspeed
import torch
from transformers import pipeline

# Rank and world size are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Wrap the Hugging Face model with DeepSpeed's inference engine in fp16.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:04,866] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:35:04,866] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:35:06,221] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:35:06,221] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:35:06,221] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:35:06,221] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:35:59,433] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:35:59,434] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25583362579345703 seconds
[2022-08-08 13:36:00,342] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]
[2022-08-08 13:36:14,299] [INFO] [launch.py:318:main] Process 33 exits successfully.

Expected behavior

generated_text should be some meaningful text.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8     9W / 250W |      0MiB / 24451MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additional context

When I switched the GPU to a Tesla T4, this issue was not observed.

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:14,922] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:40:14,923] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:40:16,025] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:40:16,025] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:40:16,025] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:40:16,025] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:42:26,151] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:42:26,152] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25422143936157227 seconds
[2022-08-08 13:42:26,994] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is the result of his experiences with the U.S. Army. He served from 2000 to 2004 as a Combat Medic in Special Forces with the 2nd Platoon, 1st Sustainment Brigade. He also has served as a Fire'}]
[2022-08-08 13:42:39,178] [INFO] [launch.py:318:main] Process 29 exits successfully.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
mrwyattii commented, Aug 11, 2022

@wkkautas Thanks for reporting your issue. I’ll try to reproduce what you’re seeing and report back

0 reactions
cmikeh2 commented, Dec 9, 2022

If you are still seeing this issue, please reopen.

