
Running out of memory during training

See original GitHub issue

When training with a custom eval metric (Pearson correlation), my Colab session runs out of memory after the first evaluation.

What is the current behavior? Training of TabNetRegressor starts fine, but after the first evaluation round I run out of memory. I am training the model on a 16 GB GPU with roughly 40 GB of free RAM. RAM consumption increases steadily during training. I am training on a fairly large dataset (11 GB).

Expected behavior

I would expect RAM consumption to stay more or less constant during training once the model is initialized.

Screenshots

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetRegressor
from pytorch_tabnet.metrics import Metric


def corr_score(y_true, y_pred):
    return "score", np.corrcoef(y_true, y_pred)[0, 1], True


class PearsonCorrMetric(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        return corr_score(y_true, y_score)[1]


max_epochs = 2
batch_size = 1028

# factors_train / factors_test: pandas DataFrames with the feature columns and a 'target' column (defined elsewhere)
model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-2),
)

model.fit(
    X_train=factors_train[features].to_numpy(),
    y_train=factors_train.target.to_numpy().reshape((-1, 1)),
    eval_set=[(factors_test[features].to_numpy(),
               factors_test.target.to_numpy().reshape((-1, 1)))],
    eval_name=['test'],
    eval_metric=[PearsonCorrMetric],
    max_epochs=max_epochs,
    patience=5,
    batch_size=batch_size,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)

Other relevant information:
  • poetry version: ?
  • Python version: 3.8
  • Operating System: Ubuntu
  • Additional tools:

Additional context

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 13

Top GitHub Comments

1 reaction
Kayne88 commented, Sep 6, 2022

TRAIN (1914562, 1214) - TEST (476390, 1214) RMSE actually works 😃
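
For reference, a minimal sketch of what the fit call might look like with the library's built-in RMSE metric ('rmse' is a built-in pytorch-tabnet metric name) in place of the custom class:

model.fit(
    X_train=factors_train[features].to_numpy(),
    y_train=factors_train.target.to_numpy().reshape((-1, 1)),
    eval_set=[(factors_test[features].to_numpy(),
               factors_test.target.to_numpy().reshape((-1, 1)))],
    eval_name=['test'],
    eval_metric=['rmse'],  # built-in metric name instead of the custom Metric class
    max_epochs=max_epochs,
    patience=5,
    batch_size=batch_size,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)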

0 reactions
Optimox commented, Sep 12, 2022

@Kayne88 thank you very much for sharing your results.

The model learns to pay attention to specific features in order to minimize the loss function. Some features might end up masked out if they correlate too strongly with a better feature; however, you have no guarantee that this will be the case. You could simply remove those features before training.

However, you can play with hyperparameters to get closer to what you want:

  • lambda_sparse: the bigger this is, the sparser your masks will be, so setting it to a value > 0 might keep the model from looking at two correlated features.
  • gamma: a large gamma (I'd recommend staying between 1 and 5 at most) prevents the model from reusing the same features at different steps, so if you don't want weakly correlated features to be used by the model you can set a high gamma.
  • n_steps: the more steps, the more features your model will be able to pick at some point.

None of these recommendations is guaranteed to work; this is just my general understanding, so you should experiment with them and see how it goes.

Good luck!
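
These three knobs correspond directly to TabNetRegressor constructor arguments. A minimal sketch with purely illustrative values (assumptions, not recommendations from the thread):

from pytorch_tabnet.tab_model import TabNetRegressor

# Illustrative values only -- tune per dataset.
model = TabNetRegressor(
    n_steps=5,           # more steps: more features can be attended to overall
    gamma=1.5,           # values > 1 discourage reusing the same feature across steps
    lambda_sparse=1e-3,  # larger values push the attention masks to be sparser
)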

Read more comments on GitHub >

Top Results From Across the Web

  • Running out of memory while training machine learning model
    You could do a few things: (1) reduce the size of training set by randomly selecting rows, assuming you have a randomly selected...
  • Out of memory error during evaluation but training works fine!
    Surprisingly my old programs are throwing an out of memory error during evaluation (in eval() mode) but training works just fine.
  • Running out of memory during training #40 - GitHub
    The memory issue might be due to the multi-scale training. We change the input resolution during training randomly and with the increasing ...
  • Out of memory error when using validation while training a ...
    This issue is not a result of the increased training set size. One workaround is to train by splitting the training set into...
  • Cuda out of memory during evaluation but training is fine
    By default the Trainer accumulated all predictions on the host before sending them to the CPU (because it's faster) but if you run...
