RuntimeError: CUDA error: device-side assert triggered - during inference stage
Describe the bug
I am running into RuntimeError: CUDA error: device-side assert triggered when making predictions on my test data.
What is the current behavior?
After 10 epochs of training, both train_auc and valid_auc are awful compared to other models. I nevertheless decided to apply the trained classifier to my test data, and that is when the CUDA error occurred.
If the current behavior is a bug, please provide the steps to reproduce.
I am afraid I am using private data and cannot share it. My features are all categorical, which are derived either from binning or one-hot encoding (so many are counts).
Below is what I did following the example jupyter notebooks to define categorical features and dims:
categorical_columns = feature_names
categorical_dims = {}
for i in range(X_train.shape[1]):
    categorical_dims[feature_names[i]] = len(np.unique(X_train.getcol(i).toarray()))
cat_idxs = [i for i, f in enumerate(feature_names) if f in categorical_columns]
cat_dims = [categorical_dims[f] for i, f in enumerate(feature_names) if f in categorical_columns]
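Note that the snippet above sets each dimension to the number of distinct values seen in training. As far as I understand, that only matches what an embedding lookup expects if the column already holds contiguous integer codes 0..n-1: a count column with values {0, 3, 250} has three distinct values but a maximum of 250. Below is a minimal, library-free sketch of contiguous encoding with one reserved code for values unseen in training. The helper names fit_codes and encode_column are hypothetical and are not part of pytorch-tabnet.

```python
def fit_codes(train_values):
    """Map each distinct training value to a contiguous code 0..n-1."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode_column(values, codes):
    """Encode values with a fitted mapping; unseen values share one extra code."""
    unknown = len(codes)  # reserved code for values absent from training
    return [codes.get(value, unknown) for value in values]


train_col = [0, 3, 250, 3, 0]
test_col = [3, 7, 250]  # 7 never appeared in training

codes = fit_codes(train_col)
cat_dim = len(codes) + 1  # +1 for the reserved "unknown" code
encoded_train = encode_column(train_col, codes)
encoded_test = encode_column(test_col, codes)
```

With this scheme every encoded value, in train or test, stays strictly below cat_dim. The same idea applies column by column to a real feature matrix.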
I am using a GPU (Device used : cuda), and the hyperparameter choices are simply copied from an example here:
clf = TabNetClassifier(
    n_d=64, n_a=64, n_steps=5,
    gamma=1.5, n_independent=2, n_shared=2,
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=1,
    lambda_sparse=1e-4, momentum=0.3, clip_value=2.,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"gamma": 0.95, "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    epsilon=1e-15,
    device_name="auto"
)
Training performance is bad:
clf.fit(
    X_train=x_train, y_train=y_train,
    eval_set=[(x_train, y_train), (x_val, y_val)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=100, patience=10,
    batch_size=1024, virtual_batch_size=128,
)
epoch 0 | loss: 0.02333 | train_auc: 0.52572 | valid_auc: 0.55781 | 0:21:36s
epoch 1 | loss: 0.01288 | train_auc: 0.49525 | valid_auc: 0.45546 | 0:43:08s
epoch 2 | loss: 0.01246 | train_auc: 0.48608 | valid_auc: 0.54022 | 1:04:40s
epoch 3 | loss: 0.01218 | train_auc: 0.49686 | valid_auc: 0.51621 | 1:25:17s
epoch 4 | loss: 0.01209 | train_auc: 0.48224 | valid_auc: 0.50605 | 1:46:04s
epoch 5 | loss: 0.01225 | train_auc: 0.50008 | valid_auc: 0.51382 | 2:07:08s
epoch 6 | loss: 0.01189 | train_auc: 0.50827 | valid_auc: 0.51544 | 2:28:04s
epoch 7 | loss: 0.01185 | train_auc: 0.5211 | valid_auc: 0.47366 | 2:48:44s
epoch 8 | loss: 0.01197 | train_auc: 0.49405 | valid_auc: 0.50044 | 3:09:45s
epoch 9 | loss: 0.01163 | train_auc: 0.50984 | valid_auc: 0.46645 | 3:30:41s
epoch 10 | loss: 0.01173 | train_auc: 0.47489 | valid_auc: 0.48854 | 3:51:27s
Early stopping occurred at epoch 10 with best_epoch = 0 and best_valid_auc = 0.55781
Best weights from best epoch are automatically used!
And finally, this is where the error occurred, during inference:
preds = clf.predict_proba(X_test.toarray()[:1000])
test_auc = roc_auc_score(y_score=preds[:,1], y_true=np.array(y_test))
preds_valid = clf.predict_proba(x_val)
valid_auc = roc_auc_score(y_score=preds_valid[:,1], y_true=y_val)
print(f"BEST VALID SCORE: {clf.best_cost}")
print(f"FINAL TEST SCORE: {test_auc}")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16220/1787765779.py in <module>
----> 1 preds = clf.predict_proba(X_test.toarray()[:1000])
      2 test_auc = roc_auc_score(y_score=preds[:,1], y_true=np.array(y_test))
      3
      4
      5 preds_valid = clf.predict_proba(x_val)

D:\bo\envs\bd\lib\site-packages\pytorch_tabnet\tab_model.py in predict_proba(self, X)
     98         results = []
     99         for batch_nb, data in enumerate(dataloader):
--> 100             data = data.to(self.device).float()
    101
    102             output, M_loss = self.network(data)

RuntimeError: CUDA error: device-side assert triggered
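A side note on reading this traceback: CUDA kernels run asynchronously, so the Python line reported for a device-side assert is not necessarily the line that launched the failing operation. One common way to get a more precise location is to set the CUDA_LAUNCH_BLOCKING environment variable before CUDA is initialized. Another is to rerun the same prediction on CPU (for example with device_name="cpu"), where an out-of-range index is usually raised as an ordinary Python exception. A small sketch, assuming it runs at the very top of the notebook, before torch is imported:

```python
import os

# Force synchronous kernel launches so the traceback points at the failing op.
# This must be set before the first CUDA call (safest: before importing torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

This slows execution down, so it is best kept to debugging sessions.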
Expected behavior
The model makes predictions, and the AUC for the test data is calculated.
Other relevant information:
poetry version: not using poetry
python version: Python 3.7.11
Operating System: Windows 10
Additional tools: I am running the experiment in a Jupyter notebook
Additional context
I installed tabnet using pip, and did not run poetry install or make notebook.
Many thanks @Optimox !
I made a huge mistake by treating my count features as categorical features. Counts are ordinal by nature. A simple yet detrimental mistake.
I have rerun my code after making sure the correct categorical feature indices are fed to the model, and the CUDA error disappeared. Thanks again.
Numerical features won’t use any embedding scheme. Counts should be fine, so I don’t know what is going wrong then.
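For anyone hitting the same error, a pre-flight check along these lines would likely flag the problem before it reaches the GPU. It is only a sketch, under the assumption that every column listed in cat_idxs is looked up in an embedding table of size cat_dims[k], so each value must be an integer in [0, cat_dims[k]). The helper find_out_of_range is hypothetical and is not a pytorch-tabnet function.

```python
def find_out_of_range(rows, cat_idxs, cat_dims):
    """Return (column, min, max, dim) for each categorical column whose
    values fall outside the valid index range [0, dim)."""
    problems = []
    for col, dim in zip(cat_idxs, cat_dims):
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        if lo < 0 or hi >= dim:
            problems.append((col, lo, hi, dim))
    return problems


# Column 0 holds proper category codes; column 1 is a count that was
# wrongly declared categorical with a dimension taken from training data.
test_rows = [[0, 2], [1, 17], [2, 5]]
problems = find_out_of_range(test_rows, cat_idxs=[0, 1], cat_dims=[3, 4])
for col, lo, hi, dim in problems:
    print(f"column {col}: values span [{lo}, {hi}] but cat_dim is {dim}")
```

On real data the rows would come from the dense test matrix. Count columns can simply be left out of cat_idxs so that they are handled as numerical features.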