
RuntimeError: CUDA error: device-side assert triggered - during inference stage

See original GitHub issue

Describe the bug
I am running into RuntimeError: CUDA error: device-side assert triggered when making predictions on my test data.

What is the current behavior?
After 10 epochs of training (although both train_auc and valid_auc are awful compared to other models), I decided to apply the trained classifier to my test data, and that is when the CUDA error occurred.

If the current behavior is a bug, please provide the steps to reproduce.
I am afraid I am using private data and cannot share it. My features are all categorical, derived either from binning or one-hot encoding (so many of them are counts).

Below is what I did following the example jupyter notebooks to define categorical features and dims:

# Every column is declared categorical; cardinalities come from X_train
# (a sparse matrix) only.
categorical_columns = feature_names
categorical_dims = {}
for i in range(X_train.shape[1]):
    categorical_dims[feature_names[i]] = len(np.unique(X_train.getcol(i).toarray()))

cat_idxs = [i for i, f in enumerate(feature_names) if f in categorical_columns]
cat_dims = [categorical_dims[f] for i, f in enumerate(feature_names) if f in categorical_columns]
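
Because of how these dims are consumed later (each one becomes the size of an embedding table), it is worth verifying that the test data never exceeds them. A minimal check (a sketch, assuming X_test is a scipy sparse matrix like X_train and its values are non-negative integer codes; check_categorical_range is not part of the library):

import numpy as np

def check_categorical_range(X, cat_idxs, cat_dims):
    # Flag any declared categorical column whose values exceed the embedding
    # size the model was built with.
    for idx, dim in zip(cat_idxs, cat_dims):
        col = X.getcol(idx).toarray().ravel()
        max_val = int(col.max())
        if max_val >= dim:
            print(f"column {idx}: max value {max_val} >= embedding size {dim} "
                  f"(would raise an index-out-of-range error / device-side assert)")

check_categorical_range(X_test, cat_idxs, cat_dims)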

I am using a GPU (Device used : cuda), and the hyperparameter choices are simply copied from an example here:

clf = TabNetClassifier(
    n_d=64, n_a=64, n_steps=5,
    gamma=1.5, n_independent=2, n_shared=2,
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=1,
    lambda_sparse=1e-4, momentum=0.3, clip_value=2.,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params = {"gamma": 0.95,
                     "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, 
    epsilon=1e-15,
    device_name="auto"
)
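
The part of this configuration that matters for the error later on is cat_idxs/cat_dims: pytorch-tabnet gives each declared categorical column its own torch.nn.Embedding(cat_dim, cat_emb_dim), and an embedding lookup is only valid for indices strictly below cat_dim. A two-line illustration of the failure mode in plain PyTorch (not TabNet-specific):

import torch

emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=1)  # like cat_dim=5, cat_emb_dim=1
out = emb(torch.tensor([7]))  # 7 >= 5: IndexError on CPU, "device-side assert triggered" on CUDA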

Training performance is bad:

clf.fit(
    X_train=x_train, y_train=y_train,
    eval_set=[(x_train, y_train), (x_val, y_val)],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=100 , patience=10,
    batch_size=1024, virtual_batch_size=128,
)
epoch 0  | loss: 0.02333 | train_auc: 0.52572 | valid_auc: 0.55781 |  0:21:36s
epoch 1  | loss: 0.01288 | train_auc: 0.49525 | valid_auc: 0.45546 |  0:43:08s
epoch 2  | loss: 0.01246 | train_auc: 0.48608 | valid_auc: 0.54022 |  1:04:40s
epoch 3  | loss: 0.01218 | train_auc: 0.49686 | valid_auc: 0.51621 |  1:25:17s
epoch 4  | loss: 0.01209 | train_auc: 0.48224 | valid_auc: 0.50605 |  1:46:04s
epoch 5  | loss: 0.01225 | train_auc: 0.50008 | valid_auc: 0.51382 |  2:07:08s
epoch 6  | loss: 0.01189 | train_auc: 0.50827 | valid_auc: 0.51544 |  2:28:04s
epoch 7  | loss: 0.01185 | train_auc: 0.5211  | valid_auc: 0.47366 |  2:48:44s
epoch 8  | loss: 0.01197 | train_auc: 0.49405 | valid_auc: 0.50044 |  3:09:45s
epoch 9  | loss: 0.01163 | train_auc: 0.50984 | valid_auc: 0.46645 |  3:30:41s
epoch 10 | loss: 0.01173 | train_auc: 0.47489 | valid_auc: 0.48854 |  3:51:27s

Early stopping occurred at epoch 10 with best_epoch = 0 and best_valid_auc = 0.55781
Best weights from best epoch are automatically used!

And finally, this is where the error occurred, during inference:

preds = clf.predict_proba(X_test.toarray()[:1000])
test_auc = roc_auc_score(y_score=preds[:,1], y_true=np.array(y_test))

preds_valid = clf.predict_proba(x_val)
valid_auc = roc_auc_score(y_score=preds_valid[:,1], y_true=y_val)

print(f"BEST VALID SCORE: {clf.best_cost}")
print(f"FINAL TEST SCORE: {test_auc}")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16220/1787765779.py in <module>
----> 1 preds = clf.predict_proba(X_test.toarray()[:1000])
      2 test_auc = roc_auc_score(y_score=preds[:,1], y_true=np.array(y_test))
      3 
      4 
      5 preds_valid = clf.predict_proba(x_val)

D:\bo\envs\bd\lib\site-packages\pytorch_tabnet\tab_model.py in predict_proba(self, X)
     98         results = []
     99         for batch_nb, data in enumerate(dataloader):
--> 100             data = data.to(self.device).float()
    101 
    102             output, M_loss = self.network(data)

RuntimeError: CUDA error: device-side assert triggered
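
The assert message itself does not say which operation failed. Two common ways to surface the underlying error (a sketch, not taken from this issue; reassigning the device on the fitted wrapper is an assumption about pytorch-tabnet's attributes, not a documented API):

import os

# Must be set before any CUDA work happens, so in a notebook this usually
# means restarting the kernel first; kernels then run synchronously and the
# traceback points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternative: run the same prediction on CPU, where an out-of-range
# embedding index raises a plain IndexError that names the bad index.
clf.device = "cpu"       # assumption: the fitted wrapper exposes .device
clf.network.to("cpu")    # and .network, both used by predict_proba
preds = clf.predict_proba(X_test.toarray()[:1000])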

Expected behavior
The model makes predictions and the AUC for the test data is calculated.

Screenshots

Other relevant information:

  • poetry version: not using poetry
  • python version: Python 3.7.11
  • Operating System: Windows 10
  • Additional tools: I am running the experiment on a jupyter notebook

Additional context

I installed tabnet using pip, and did not run poetry install or make notebook.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8

Top GitHub Comments

1 reaction
bwang482 commented, Apr 20, 2022

Many thanks @Optimox !

I made a huge mistake by treating my count features as categorical features. Counts are ordinal by nature. A simple yet detrimental mistake.

I have rerun my code after making sure the correct categorical feature indices are fed to the model, and the CUDA error disappeared. Thanks again.
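
In practice the fix described above amounts to passing only the genuinely categorical columns through cat_idxs/cat_dims and leaving the count features as plain numerical inputs, which TabNet uses without any embedding. A rough sketch with hypothetical column names (true_categorical and its contents are not from the issue):

from pytorch_tabnet.tab_model import TabNetClassifier
import numpy as np

# Hypothetical list of the columns that really are categorical codes.
true_categorical = ["channel_bin", "region_code"]

cat_idxs = [i for i, f in enumerate(feature_names) if f in true_categorical]
# Ideally derive each embedding size from every code the column can take
# (not just the training split) so test-time values stay in range.
cat_dims = [len(np.unique(X_train.getcol(i).toarray())) for i in cat_idxs]

clf = TabNetClassifier(cat_idxs=cat_idxs, cat_dims=cat_dims, cat_emb_dim=1)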

0 reactions
Optimox commented, Apr 22, 2022

Numerical features won’t use any embedding scheme. Counts should be fine; I don’t know what is going wrong then.

Read more comments on GitHub >

Top Results From Across the Web

RuntimeError: CUDA error: device-side assert triggered #3200
When I modified the fast_rcnn_inference_single_image, an exception occurred. This problem has troubled me for a long time. I printed out the ...
Read more >
How to fix “CUDA error: device-side assert triggered” error?
I use huggingface Transformer to fine-tune a binary classification model. When I do inference job on big data. In rare case, it will...
Read more >
CUDA error: device-side assert triggered" in PyTorch mean ...
When I shifted my code to work on CPU instead of GPU, I got the following error: IndexError: index 128 is out of...
Read more >
A walk with fastai2 - Vision - Study Group and Online Lectures ...
Without preprocessing, I can't start training because of RuntimeError: CUDA error: device-side assert triggered error. Is there a nice way to handle this ......
Read more >
[Solved] RuntimeError: CUDA error: device-side assert triggered
Today when I using PyTorch framework to train a simple classifier, I got an error message like following: "RuntimeError: CUDA error: ...
Read more >
