
Seems to be hitting a GPU memory leak problem

See original GitHub issue

I wrap 'BertModel' as a persistent object and initialize it once, then use it iteratively as a feature extractor to generate features for each data batch, and it looks like I have run into a GPU memory leak. After the program starts, GPU memory usage keeps increasing until it hits 'out of memory'. The key code is below. GPU memory grows every time 'self.bert_model.get_bert_feature()' executes. From some simple debugging, the problem may be caused by 'BertEmbeddings.forward()'. My PyTorch version is 0.4.0 on Python 3. Waiting for your reply, thanks very much!

class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        #logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
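
As a side note on the forward() above: the comments describe turning the 2D padding mask into a broadcastable 4D additive mask. A tiny standalone check (not from the issue) of what that transform produces:

import torch

# 2D padding mask: 1 = attend, 0 = masked; shape (batch_size=1, to_seq_length=4)
attention_mask = torch.tensor([[1, 1, 1, 0]])

# -> shape (1, 1, 1, 4), broadcastable over heads and query positions
extended = attention_mask.unsqueeze(1).unsqueeze(2).float()

# Added to the raw attention scores before the softmax:
# attended positions become 0.0, the padded position becomes -10000.0
extended = (1.0 - extended) * -10000.0
print(extended.shape)  # torch.Size([1, 1, 1, 4])
print(extended)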

class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        #tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
     
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print ('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()

        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list is a list to store the sentences, length is the sentence_number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list, each dict element key is (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)

        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)

        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
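
One note on get_feature(): the constructor freezes all parameters with requires_grad = False, so autograd history should not be accumulating here, but wrapping the forward pass in torch.no_grad() is a cheap way to rule that possibility out on PyTorch 0.4+. A minimal, hypothetical helper showing the idea (not the issue author's code):

import torch

def extract_bert_features(model, input_ids, attention_mask):
    # Inference-only forward pass; no_grad() guarantees that no autograd
    # graph or intermediate activations are kept alive after the call.
    with torch.no_grad():
        return model(input_ids,
                     token_type_ids=None,
                     attention_mask=attention_mask)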


class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name].keys():
            #res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
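
One thing that stands out in get_bert_feature() above: the cache keeps the GPU tensors returned by get_feature() for every distinct batch_id, so allocated GPU memory grows with the number of cached batches even if nothing else leaks. A hedged sketch of a drop-in variant that caches on the CPU instead (the detach()/cpu() round-trip is my addition, and it assumes get_feature() returns plain tensors rather than a list of layer outputs):

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name]:
            feats, mask = self.cache[cache_name][batch_id]
            # Cache hit: move the stored CPU copies back to the GPU on demand.
            return feats.to(self.device), mask.to(self.device)
        feats, mask = self.bert_model.get_feature(text_list, max_seq_length, layer)
        # Store CPU copies only, so GPU memory does not grow with the cache size.
        self.cache[cache_name][batch_id] = (feats.detach().cpu(), mask.detach().cpu())
        return feats, mask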

class Experiment(object):
    def __init__(self):
        # load training data   
        with open(DIR+"data/train_data", "rb") as fin1, \
             open(DIR+"data/emb","rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _ , _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()

        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)

        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr = this_lr)
        self.best = {"sum":0.0, "test_f":0.0, "best_test_f":0.0}
        self.dropout = nnargs["dropout"]


    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)

        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
        .....
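
Since the report is that GPU memory grows on every get_bert_feature() call, a quick way to confirm where the growth happens is to log the allocator state around the call. A small diagnostic sketch (PyTorch 0.4+; the commented-out usage is hypothetical):

import torch

def log_gpu_memory(tag, device=0):
    # Bytes currently held by live tensors on the CUDA device.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes reserved by the caching allocator (includes freed-but-cached blocks).
    cached = torch.cuda.memory_cached(device)
    print('%s: allocated=%.1f MiB, cached=%.1f MiB'
          % (tag, allocated / 2 ** 20, cached / 2 ** 20))

# Hypothetical usage around the suspected call:
# log_gpu_memory('before get_bert_feature')
# zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
# log_gpu_memory('after get_bert_feature')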

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 17 (4 by maintainers)

Top GitHub Comments

1 reaction
RomanTeucher commented, Dec 20, 2019

I have the newest version of pytorch and transformers, yes.

I have been monitoring the memory usage over 24h, during which I made ~300,000 requests. It seems that memory increases constantly for quite some time but also stabilizes at a certain maximum: the application started out using ~2.5 GB of RAM and now stays at ~4.3 GB.

Maybe it has something to do with the varying lengths of the texts I process? The longest texts are processed at a later point in time and require the most RAM; after that, no subsequent text can need more, so usage stabilizes. Though this is just a thought.

Thanks already for your help. I'm off on Christmas vacation for now and will have another look at the issue in January. I'll see whether memory usage has increased by then.
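
If varying text lengths are indeed the explanation, one way to test it is to make the allocator hit its high-water mark right away instead of after hours of traffic, e.g. by processing the longest texts first (or by capping the maximum length up front). A rough, hypothetical sketch:

def order_longest_first(texts):
    # Feeding the longest inputs first forces the worst-case allocation early,
    # so memory that merely ramps up to a plateau is distinguishable from a true leak.
    return sorted(texts, key=len, reverse=True)

# Hypothetical usage:
# for text in order_longest_first(pending_requests):
#     run_inference(text)   # placeholder for the actual model call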

1 reaction
RomanTeucher commented, Dec 19, 2019

So I tried it with bert-base-multilingual-uncased as well and the behavior is the same. I do not understand why memory constantly grows during inference. To my understanding, I am only pushing data through the network and then using the resulting layer's output. Before switching to transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
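
Since the growth described here is host RAM rather than GPU memory, it may also help to log the process's peak resident set size every few thousand requests and check whether it really plateaus. A small sketch using only the standard library (the commented-out loop is a placeholder for the real inference call):

import resource

def peak_rss_mib():
    # Peak resident set size of this process so far; Linux reports kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Hypothetical monitoring loop:
# for i, text in enumerate(incoming_requests):
#     output = run_inference(text)              # placeholder for the actual model call
#     if i % 1000 == 0:
#         print('request %d: peak RSS = %.1f MiB' % (i, peak_rss_mib()))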

Read more comments on GitHub >
