Error when making predictions on CPU after training the model on GPU
Hi, I trained the model on GPU according to the tutorial:
reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=256,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                verbose_logging=True,
                output_dir='./temp')
reader.fit(X=(train_examples, train_features))
And before dumping the model, I sent it to CPU:
reader.model.to('cpu')
reader.device = torch.device('cpu')
But when I try to make a prediction on CPU, the following error occurs:
query = 'some sample query...'
prediction = cdqa_pipeline.predict(X=query)
--------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-79-c881b3585457> in <module>
1 query = 'some sample query...'
----> 2 prediction = cdqa_pipeline.predict(X=query)
~/anaconda3/lib/python3.7/site-packages/cdqa/pipeline/cdqa_sklearn.py in predict(self, X, return_logit)
158 metadata=self.metadata)
159 examples, features = self.processor_predict.fit_transform(X=squad_examples)
--> 160 prediction = self.reader.predict((examples, features), return_logit)
161 return prediction
162
~/anaconda3/lib/python3.7/site-packages/cdqa/reader/bertqa_sklearn.py in predict(self, X, return_logit)
1220 with torch.no_grad():
1221 batch_start_logits, batch_end_logits = self.model(
-> 1222 input_ids, segment_ids, input_mask)
1223 for i, example_index in enumerate(example_indices):
1224 start_logits = batch_start_logits[i].detach().cpu().tolist()
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)
~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144 raise RuntimeError("module must have its parameters and buffers "
145 "on device {} (device_ids[0]) but found one of "
--> 146 "them on device: {}".format(self.src_device_obj, t.device))
147
148 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
Is there something else I need to do?
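For context on where this check comes from: torch.nn.DataParallel.forward verifies that the wrapped module's parameters and buffers still live on device_ids[0] before scattering inputs across GPUs. A minimal standalone repro, independent of cdQA and assuming a machine with at least one CUDA device (net is an illustrative name, not a cdQA attribute):

import torch
import torch.nn as nn

# Wrap a tiny module the same way multi-GPU training wraps the reader;
# DataParallel records cuda:0 as its source device at construction time.
net = nn.DataParallel(nn.Linear(4, 2).cuda())

# Moving the wrapper to CPU moves the parameters, but forward() still
# compares them against the recorded cuda:0 source device...
net.to('cpu')

# ...so this call raises: "module must have its parameters and buffers
# on device cuda:0 (device_ids[0]) but found one of them on device: cpu"
net(torch.randn(1, 4))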
Top GitHub Comments
Hi, andrelmfarias. I used 8 GPUs (distributed training).
But now I understand what's wrong. Thanks for your help, andrelmfarias.
I just tried to train a new model, and when I print
type(model.model)
I do not get torch.nn.parallel.data_parallel.DataParallel.
…Did you train the model on multiple GPUs with distributed training?
Thanks
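For anyone hitting the same error: multi-GPU training leaves reader.model wrapped in torch.nn.DataParallel, so moving the wrapper to CPU trips the device check shown in the traceback. A minimal sketch of the fix, assuming the reader object from the snippets in this issue, is to unwrap the underlying module before leaving CUDA:

import torch

# The real network lives in the .module attribute of the DataParallel
# wrapper; unwrap it so predict() calls the bare module directly.
if isinstance(reader.model, torch.nn.DataParallel):
    reader.model = reader.model.module

reader.model.to('cpu')              # move all parameters and buffers to CPU
reader.device = torch.device('cpu')

Once unwrapped, DataParallel.forward (and its cuda:0 device check) is no longer in the call path, so CPU inference proceeds normally.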