Error in making prediction on CPU after training the model on GPU

Hi, I trained the model on GPU following the tutorial.

reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=256,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                verbose_logging=True,
                output_dir='./temp')

reader.fit(X=(train_examples, train_features))

Before dumping the model, I moved it to the CPU.

reader.model.to('cpu')
reader.device = torch.device('cpu')

But when I try to make a prediction on the CPU, the following error occurs:

query = 'some sample query...'
prediction = cdqa_pipeline.predict(X=query)

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-79-c881b3585457> in <module>
      1 query = 'some sample query...'
----> 2 prediction = cdqa_pipeline.predict(X=query)

~/anaconda3/lib/python3.7/site-packages/cdqa/pipeline/cdqa_sklearn.py in predict(self, X, return_logit)
    158                                                      metadata=self.metadata)
    159             examples, features = self.processor_predict.fit_transform(X=squad_examples)
--> 160             prediction = self.reader.predict((examples, features), return_logit)
    161             return prediction
    162 

~/anaconda3/lib/python3.7/site-packages/cdqa/reader/bertqa_sklearn.py in predict(self, X, return_logit)
   1220             with torch.no_grad():
   1221                 batch_start_logits, batch_end_logits = self.model(
-> 1222                     input_ids, segment_ids, input_mask)
   1223             for i, example_index in enumerate(example_indices):
   1224                 start_logits = batch_start_logits[i].detach().cpu().tolist()

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    144                 raise RuntimeError("module must have its parameters and buffers "
    145                                    "on device {} (device_ids[0]) but found one of "
--> 146                                    "them on device: {}".format(self.src_device_obj, t.device))
    147 
    148         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

Is there something else I need to do?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
deo4kyo commented, Sep 24, 2019

Hi, andrelmfarias. I used 8 GPUs (distributed training).

But now I understand what’s wrong. Thanks for your help, andrelmfarias.

0 reactions
andrelmfarias commented, Sep 20, 2019

I just tried to train a new model, and when I print type(model.model) I get

pytorch_pretrained_bert.modeling.BertForQuestionAnswering

Not torch.nn.parallel.data_parallel.DataParallel

Did you train the model on multiple GPUs with distributed training?

Thanks
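
The traceback and the type check above both point at the same cause: after multi-GPU training, reader.model is a DataParallel wrapper pinned to cuda:0, so calling .to('cpu') on it moves the weights but leaves the wrapper in place. A minimal sketch of one way to unwrap it before moving to CPU follows. The helper name unwrap_parallel is hypothetical (it is not part of cdqa or PyTorch), and it assumes only that PyTorch's parallel wrappers keep the underlying network in their .module attribute.

```python
def unwrap_parallel(model):
    """Return the network inside a DataParallel-style wrapper, else the model itself.

    Hypothetical helper, not part of cdqa or PyTorch. It relies on the fact that
    torch.nn.DataParallel and DistributedDataParallel expose the wrapped
    network as `.module`. The check is by class name so that this sketch
    does not need to import torch.
    """
    if type(model).__name__ in ("DataParallel", "DistributedDataParallel"):
        return model.module
    return model


# Assumed usage with the reader from the question (requires torch and cdqa):
#
#   reader.model = unwrap_parallel(reader.model)
#   reader.model.to('cpu')
#   reader.device = torch.device('cpu')
```

Whether this is sufficient for a given cdqa version is an assumption; the reader may hold other GPU-related state, so it is worth checking type(reader.model) again before dumping the model.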
