
Seems to be hitting a GPU memory leak problem

See original GitHub issue

I wrap 'BertModel' as a persistent object and initialize it once, then use it iteratively as a feature extractor to generate features for each data batch, and it looks like I have run into a GPU memory leak. After the program starts, GPU memory usage keeps increasing until it hits 'out of memory'. The key code is below. GPU memory grows every time 'self.bert_model.get_bert_feature()' executes. From some simple debugging, the problem may be caused by 'BertEmbeddings.forward()'. My PyTorch version is 0.4.0 on Python 3. Waiting for your reply, thanks very much!

class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        #logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
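
As a side note on the forward() above: the comments describe turning the 2D padding mask into a broadcastable 4D additive mask. A tiny standalone check (not from the issue) of what that transform produces:

import torch

# 2D padding mask: 1 = attend, 0 = masked; shape (batch_size=1, to_seq_length=4)
attention_mask = torch.tensor([[1, 1, 1, 0]])

# -> shape (1, 1, 1, 4), broadcastable over heads and query positions
extended = attention_mask.unsqueeze(1).unsqueeze(2).float()

# Added to the raw attention scores before the softmax:
# attended positions become 0.0, the padded position becomes -10000.0
extended = (1.0 - extended) * -10000.0
print(extended.shape)  # torch.Size([1, 1, 1, 4])
print(extended)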

class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        #tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
     
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print ('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()

        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list is a list to store the sentences, length is the sentence_number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list, each dict element key is (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)

        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)

        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
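
One note on get_feature(): the constructor freezes all parameters with requires_grad = False, so autograd history should not be accumulating here, but wrapping the forward pass in torch.no_grad() is a cheap way to rule that possibility out on PyTorch 0.4+. A minimal, hypothetical helper showing the idea (not the issue author's code):

import torch

def extract_bert_features(model, input_ids, attention_mask):
    # Inference-only forward pass; no_grad() guarantees that no autograd
    # graph or intermediate activations are kept alive after the call.
    with torch.no_grad():
        return model(input_ids,
                     token_type_ids=None,
                     attention_mask=attention_mask)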


class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name].keys():
            #res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
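
One thing that stands out in get_bert_feature() above: the cache keeps the GPU tensors returned by get_feature() for every distinct batch_id, so allocated GPU memory grows with the number of cached batches even if nothing else leaks. A hedged sketch of a drop-in variant that caches on the CPU instead (the detach()/cpu() round-trip is my addition, and it assumes get_feature() returns plain tensors rather than a list of layer outputs):

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name]:
            feats, mask = self.cache[cache_name][batch_id]
            # Cache hit: move the stored CPU copies back to the GPU on demand.
            return feats.to(self.device), mask.to(self.device)
        feats, mask = self.bert_model.get_feature(text_list, max_seq_length, layer)
        # Store CPU copies only, so GPU memory does not grow with the cache size.
        self.cache[cache_name][batch_id] = (feats.detach().cpu(), mask.detach().cpu())
        return feats, mask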

class Experiment(object):
    def __init__(self):
        # load training data   
        with open(DIR+"data/train_data", "rb") as fin1, \
             open(DIR+"data/emb","rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _ , _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()

        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)

        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr = this_lr)
        self.best = {"sum":0.0, "test_f":0.0, "best_test_f":0.0}
        self.dropout = nnargs["dropout"]


    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)

        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
        .....
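
Since the report is that GPU memory grows on every get_bert_feature() call, a quick way to confirm where the growth happens is to log the allocator state around the call. A small diagnostic sketch (PyTorch 0.4+; the commented-out usage is hypothetical):

import torch

def log_gpu_memory(tag, device=0):
    # Bytes currently held by live tensors on the CUDA device.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes reserved by the caching allocator (includes freed-but-cached blocks).
    cached = torch.cuda.memory_cached(device)
    print('%s: allocated=%.1f MiB, cached=%.1f MiB'
          % (tag, allocated / 2 ** 20, cached / 2 ** 20))

# Hypothetical usage around the suspected call:
# log_gpu_memory('before get_bert_feature')
# zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
# log_gpu_memory('after get_bert_feature')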

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 17 (4 by maintainers)

Top GitHub Comments

1 reaction
RomanTeucher commented, Dec 20, 2019

I have the newest version of pytorch and transformers, yes.

I have been monitoring the memory usage over 24h, during which I made ~300,000 requests. It seems that memory increases constantly for quite some time but also stabilizes at a certain maximum: the application started out using ~2.5 GB of RAM and now stays at ~4.3 GB.

Maybe it has something to do with the varying lengths of the texts I process? The longest texts are processed at a later point in time and require the most RAM; after that, no subsequent text can need more, so usage stabilizes. Though this is just a thought.

Thanks already for your help. I'm off on Christmas vacation for now and will have another look at the issue in January. I'll see whether memory usage has increased by then.
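
If varying text lengths are indeed the explanation, one way to test it is to make the allocator hit its high-water mark right away instead of after hours of traffic, e.g. by processing the longest texts first (or by capping the maximum length up front). A rough, hypothetical sketch:

def order_longest_first(texts):
    # Feeding the longest inputs first forces the worst-case allocation early,
    # so memory that merely ramps up to a plateau is distinguishable from a true leak.
    return sorted(texts, key=len, reverse=True)

# Hypothetical usage:
# for text in order_longest_first(pending_requests):
#     run_inference(text)   # placeholder for the actual model call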

1 reaction
RomanTeucher commented, Dec 19, 2019

So I tried it with bert-base-multilingual-uncased as well and the behavior is the same. I do not understand why memory constantly grows during inference. To my understanding, I am only pushing data through the network and then using the resulting layer's output. Before switching to transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
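
Since the growth described here is host RAM rather than GPU memory, it may also help to log the process's peak resident set size every few thousand requests and check whether it really plateaus. A small sketch using only the standard library (the commented-out loop is a placeholder for the real inference call):

import resource

def peak_rss_mib():
    # Peak resident set size of this process so far; Linux reports kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Hypothetical monitoring loop:
# for i, text in enumerate(incoming_requests):
#     output = run_inference(text)              # placeholder for the actual model call
#     if i % 1000 == 0:
#         print('request %d: peak RSS = %.1f MiB' % (i, peak_rss_mib()))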

Read more comments on GitHub >
