
[BUG] Host Memory Efficiency


Describe the bug

I'm not sure whether this is a bug or a consequence of the architecture.

With the current implementation, inference consumes more host memory than the size of the model checkpoint. In my experiments, consumption is roughly double the expected amount.

GPT-Neo 2.7B on 10 GPUs:

  • Estimated host memory consumption: 99 GB = 10 GPUs × 9.9 GB (checkpoint size)
  • Measured host memory consumption: 221 GB
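
As a sanity check on those numbers, the back-of-envelope arithmetic can be written out as a tiny Python sketch (the sizes are the approximate figures reported above; the ~2x gap is the observation, not an explanation):

# Rough host-memory estimate when every rank loads its own copy of the
# checkpoint (approximate numbers from this report).
checkpoint_gb = 9.9                     # GPT-Neo 2.7B checkpoint on disk
num_gpus = 10                           # one worker process per GPU
expected_gb = checkpoint_gb * num_gpus  # ~99 GB if each rank holds one copy
observed_gb = 221                       # measured total host memory
print(f"expected ~{expected_gb:.0f} GB, observed ~{observed_gb} GB "
      f"(~{observed_gb / expected_gb:.1f}x the per-rank estimate)")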

Also, since host memory consumption scales with num_gpus, using more GPUs requires proportionally more host memory.

For small models host memory is not a problem, but large models come with large checkpoints, so loading them across many GPUs can trigger a host-memory OOM.

It seems the model checkpoint should be shared between processes rather than loaded separately by each one. Do you have any comments or plans to improve this?
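
To make "shared between processes" concrete, here is a minimal toy sketch using plain torch.multiprocessing shared memory, with a small random tensor standing in for the real state dict; this is only an illustration of the idea, not DeepSpeed's loading path or API:

import torch
import torch.multiprocessing as mp

def worker(rank, shared_state_dict):
    # All workers reference the same shared-memory storage, so host RAM holds
    # one copy of the weights instead of one copy per process.
    total_bytes = sum(t.numel() * t.element_size() for t in shared_state_dict.values())
    print(f"[rank {rank}] sharing {total_bytes / 1e6:.1f} MB of host memory")

if __name__ == '__main__':
    # Toy "checkpoint"; in the real case this would be the full model state dict.
    state_dict = {'weight': torch.randn(1024, 1024)}
    for tensor in state_dict.values():
        tensor.share_memory_()  # move the tensor's storage into shared memory
    mp.spawn(worker, args=(state_dict,), nprocs=2)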

To Reproduce

Steps to reproduce the behavior:

import os
import time
import datetime

import torch
import deepspeed
from transformers import pipeline

def init():
    # One worker process per GPU; the deepspeed launcher sets LOCAL_RANK and WORLD_SIZE.
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    # Each process loads the full checkpoint into host memory here.
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator

def predict(text, max_len):
    # Uses the module-level `generator` created in __main__.
    torch.distributed.barrier()
    with torch.no_grad():
        string = generator(text, do_sample=True,
                           min_length=max_len,
                           max_length=max_len,
                           top_k=50,
                           temperature=1.0,
                           top_p=1.0,
                           num_return_sequences=1,
                           pad_token_id=3)
    return string

if __name__ == '__main__':
    generator = init()
    torch.cuda.empty_cache()
    text = 'a'
    seq_lengths = [50, 100, 300, 1000, 2048]
    for max_len in seq_lengths:
        total_time = 0.0
        for _ in range(5):
            start_time = time.time()
            string = predict(text, max_len)
            torch.distributed.barrier()
            total_time += time.time() - start_time
        avg_time = str(datetime.timedelta(seconds=total_time / 5))
        print(f'[{torch.distributed.get_rank()}] ##### seq: {max_len}, avg_spend_time: {avg_time}')
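
A small helper along the lines of the sketch below (it assumes the optional psutil package is installed) can be called at the end of init() to report each rank's resident host memory; this is just one way to obtain the per-rank numbers behind the totals above:

import os
import psutil  # optional dependency, used only for this measurement

def log_host_memory(tag):
    # Resident set size of this worker process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.getenv('LOCAL_RANK', '0')
    print(f'[rank {rank}] {tag}: host RSS = {rss_gb:.1f} GB')

# e.g. call log_host_memory('after init_inference') just before init() returns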

Expected behavior

  • Host memory consumption should be roughly the size of the model checkpoint.

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
/tmp/io_submithubcvos0.c: In function ‘main’:
/tmp/io_submithubcvos0.c:2:5: warning: implicit declaration of function ‘io_submit’ [-Wimplicit-function-declaration]
    2 |     io_submit();
      |     ^~~~~~~~~
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.4+3e4dd96, 3e4dd96, reyazda/large-model-inference
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: 16x A100
  • Python version: 3.8

Launcher context

  • deepspeed --num_gpus [2,4,5,10] test.py

Docker context

  • NGC docker 21.6


Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

3 reactions
switiz commented on Aug 25, 2021

@hyunwoongko

Cool, awesome!

This is a much-needed feature, and I really hope it gets merged.

I also experimented with your amazing open-source Parallelformers a while ago. There was a small issue in my Docker environment, so I didn't take it further, but if this work is included in a large open-source project like DeepSpeed, it will shine even more.

I look forward to that day. Thank you!

1 reaction
RezaYazdaniAminabadi commented on Aug 26, 2021

@hyunwoongko, I will try to repro this again and open an issue on your side. Thanks, Reza
