Running out of memory during training
When training with a custom eval metric (Pearson correlation), my Colab session runs out of memory after the first evaluation.
What is the current behavior?
Training of TabNetRegressor starts fine, and after the first evaluation round I run out of memory. I am training the model on a 16 GB GPU and free RAM is approximately 40 GB. RAM consumption steadily increases during training. I am training on a fairly large dataset (11 GB).
Expected behavior
I would expect that the RAM consumption is more or less constant during training, once the model is initialized.
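To confirm the steady RAM growth, one option (a minimal sketch, not from the original report; it assumes psutil is available in the Colab runtime) is to log the resident memory of the process before and after the calls shown under Screenshots:

import os
import psutil  # assumed to be available in the Colab runtime

def log_rss(tag=""):
    # Print the resident set size (RAM) of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[{tag}] RSS: {rss_gb:.2f} GB")

log_rss("before fit")
# model.fit(...)  # the call shown below
log_rss("after fit")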
Screenshots
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

max_epochs = 2
batch_size = 1028

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-2),
)

model.fit(
    X_train=factors_train[features].to_numpy(),
    y_train=factors_train.target.to_numpy().reshape((-1, 1)),
    eval_set=[(factors_test[features].to_numpy(),
               factors_test.target.to_numpy().reshape((-1, 1)))],
    eval_name=['test'],
    eval_metric=[PearsonCorrMetric],
    max_epochs=max_epochs,
    patience=5,
    batch_size=batch_size,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)
import numpy as np
from pytorch_tabnet.metrics import Metric

class PearsonCorrMetric(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        return corr_score(y_true, y_score)[1]

def corr_score(y_true, y_pred):
    # Flatten to 1-D so np.corrcoef sees two variables with n observations;
    # with (n, 1)-shaped inputs it would otherwise allocate a (2n, 2n) matrix.
    return "score", np.corrcoef(y_true.ravel(), y_pred.ravel())[0, 1], True
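As a quick sanity check (a sketch added for illustration, not part of the original issue), the metric can be exercised on small arrays with the same (n, 1) shape produced by the reshape((-1, 1)) calls above:

import numpy as np

# Small (n, 1)-shaped arrays mimicking y_true / y_score at evaluation time.
y_true = np.random.rand(1000, 1)
y_score = y_true + 0.1 * np.random.rand(1000, 1)

metric = PearsonCorrMetric()
print(metric(y_true, y_score))  # scalar correlation close to 1.0

Without the flattening inside corr_score, np.corrcoef treats each row of a 2-D input as a separate variable, so on the 476,390-row eval set it would try to build a correlation matrix with roughly (2 × 476,390)² entries.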
Other relevant information:
- poetry version: ?
- python version: 3.8
- Operating System: Ubuntu
- Additional tools:
Additional context
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 24W / 300W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
TRAIN (1914562, 1214) - TEST (476390, 1214)
RMSE actually works 😃
@Kayne88 thank you very much for sharing your results.
The model learns to pay attention to specific features in order to minimize the loss function. Some features might end up masked out if they correlate too strongly with a better feature, but you have no guarantee that this will be the case. You could simply remove those features before training.
However, you can play with the hyperparameters to get closer to what you want (see the sketch below):
- lambda_sparse: the larger this is, the sparser your masks will be, so setting it to a value > 0 might keep the model from looking at two correlated features.
- gamma: a large gamma (I'd recommend staying between 1 and 5 at most) discourages the model from reusing the same features at different steps, so if you don't want weakly correlated features to be used you can set a high gamma.
- n_steps: the more steps, the more features your model will be able to pick at some point.
None of these recommendations is guaranteed to work; this is just my general understanding, so you should experiment with them and see how it goes.
Good luck!
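For illustration, a minimal sketch of passing these hyperparameters to TabNetRegressor; the values shown are hypothetical starting points, not recommendations from this thread:

import torch
from pytorch_tabnet.tab_model import TabNetRegressor

# Hypothetical values for illustration only; tune them on your own data.
model = TabNetRegressor(
    n_steps=5,            # more steps -> more features can be picked across steps
    gamma=1.5,            # keep between 1 and 5; higher discourages reusing features
    lambda_sparse=1e-3,   # > 0 pushes the attention masks to be sparser
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-2),
)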