
[BUG] Host Memory Efficiency


Describe the bug

I'm not sure whether this is a bug or a consequence of the architecture.

With the current implementation, inference consumes more host memory than the size of the model checkpoint. In my experiments, consumption is roughly double the expected amount.

GPT-Neo 2.7B on 10 GPUs:

  • Estimated host memory consumption: 99 GB = 10 GPUs × 9.9 GB (checkpoint size)
  • Measured host memory consumption: 221 GB
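
As a sanity check on those numbers, the back-of-envelope arithmetic can be written out as a tiny Python sketch (the sizes are the approximate figures reported above; the ~2x gap is the observation, not an explanation):

# Rough host-memory estimate when every rank loads its own copy of the
# checkpoint (approximate numbers from this report).
checkpoint_gb = 9.9                     # GPT-Neo 2.7B checkpoint on disk
num_gpus = 10                           # one worker process per GPU
expected_gb = checkpoint_gb * num_gpus  # ~99 GB if each rank holds one copy
observed_gb = 221                       # measured total host memory
print(f"expected ~{expected_gb:.0f} GB, observed ~{observed_gb} GB "
      f"(~{observed_gb / expected_gb:.1f}x the per-rank estimate)")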

Also, since host memory consumption scales with num_gpus, using more GPUs requires proportionally more host memory.

For small models host memory is not a problem, but large models come with large checkpoints, so loading them across many GPUs can trigger a host-memory OOM.

It seems the model checkpoint should be shared between processes rather than loaded separately by each one. Do you have any comments or plans to improve this?
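
To make "shared between processes" concrete, here is a minimal toy sketch using plain torch.multiprocessing shared memory, with a small random tensor standing in for the real state dict; this is only an illustration of the idea, not DeepSpeed's loading path or API:

import torch
import torch.multiprocessing as mp

def worker(rank, shared_state_dict):
    # All workers reference the same shared-memory storage, so host RAM holds
    # one copy of the weights instead of one copy per process.
    total_bytes = sum(t.numel() * t.element_size() for t in shared_state_dict.values())
    print(f"[rank {rank}] sharing {total_bytes / 1e6:.1f} MB of host memory")

if __name__ == '__main__':
    # Toy "checkpoint"; in the real case this would be the full model state dict.
    state_dict = {'weight': torch.randn(1024, 1024)}
    for tensor in state_dict.values():
        tensor.share_memory_()  # move the tensor's storage into shared memory
    mp.spawn(worker, args=(state_dict,), nprocs=2)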

To Reproduce

Steps to reproduce the behavior:

import os
import time
import datetime

import torch
import deepspeed
from transformers import pipeline

def init():
    # One worker process per GPU; the deepspeed launcher sets LOCAL_RANK and WORLD_SIZE.
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    # Each process loads the full checkpoint into host memory here.
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator

def predict(text, max_len):
    # Uses the module-level `generator` created in __main__.
    torch.distributed.barrier()
    with torch.no_grad():
        string = generator(text, do_sample=True,
                           min_length=max_len,
                           max_length=max_len,
                           top_k=50,
                           temperature=1.0,
                           top_p=1.0,
                           num_return_sequences=1,
                           pad_token_id=3)
    return string

if __name__ == '__main__':
    generator = init()
    torch.cuda.empty_cache()
    text = 'a'
    seq_lengths = [50, 100, 300, 1000, 2048]
    for max_len in seq_lengths:
        total_time = 0.0
        for _ in range(5):
            start_time = time.time()
            string = predict(text, max_len)
            torch.distributed.barrier()
            total_time += time.time() - start_time
        avg_time = str(datetime.timedelta(seconds=total_time / 5))
        print(f'[{torch.distributed.get_rank()}] ##### seq: {max_len}, avg_spend_time: {avg_time}')
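
A small helper along the lines of the sketch below (it assumes the optional psutil package is installed) can be called at the end of init() to report each rank's resident host memory; this is just one way to obtain the per-rank numbers behind the totals above:

import os
import psutil  # optional dependency, used only for this measurement

def log_host_memory(tag):
    # Resident set size of this worker process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.getenv('LOCAL_RANK', '0')
    print(f'[rank {rank}] {tag}: host RSS = {rss_gb:.1f} GB')

# e.g. call log_host_memory('after init_inference') just before init() returns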

Expected behavior

  • Host memory consumption should be roughly the size of the model checkpoint.

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
/tmp/io_submithubcvos0.c: In function ‘main’:
/tmp/io_submithubcvos0.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
    2 |     io_submit();
      |     ^~~~~~~~~
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.4+3e4dd96, 3e4dd96, reyazda/large-model-inference
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: 16x A100
  • Python version: 3.8

Launcher context

  • deepspeed --num_gpus [2,4,5,10] test.py

Docker context

  • NGC docker 21.6


Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

3 reactions
switiz commented on Aug 25, 2021

@hyunwoongko

Cool, awesome!

This is a much-needed feature, and I really hope it gets merged.

I also experimented with your amazing open-source Parallelformers a while ago. There was a small issue in my Docker environment, so I didn't take it further, but if this work is included in a large open-source project like DeepSpeed, it will shine even more.

I look forward to that day. Thank you!

1 reaction
RezaYazdaniAminabadi commented on Aug 26, 2021

@hyunwoongko, I will try to repro this again and open an issue on your side. Thanks, Reza
