
Memory error in official kaggle tutorial

See original GitHub issue

TabularPredictor().fit from this tutorial fails with a memory error both in Colab and in Kaggle. Kaggle has 16.81 GB of available RAM, and the AutoGluon log shows that the data needs much less memory: Train Data (Original) Memory Usage: 2715.97 MB. What is the problem?
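For reference, a minimal sketch of the kind of call the tutorial makes (the file path and label column below are placeholders, not the tutorial's exact values):

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('train.csv')  # placeholder path for the training CSV
predictor = TabularPredictor(label='target').fit(  # 'target' is a placeholder label column
    train_data,
    time_limit=600,  # optional cap on training time
)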

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11

Top GitHub Comments

1 reaction
qo4on commented, Apr 11, 2021

Kaggle doesn’t offer an option to restart the runtime while keeping the environment

You can do this. First run this:

[screenshot]

Then restart the runtime while keeping the environment:

[screenshot]

After that, AutoGluon imports without any error. But the second import shows a warning:

[screenshot]

You can also restart the runtime while keeping the environment from code:

import os
os.kill(os.getpid(), 9)  # SIGKILL the kernel process; the runtime restarts with the installed packages intact
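Once the runtime is back, it can also be worth confirming how much RAM is actually free before calling fit. A quick check (report_memory is a hypothetical helper, not from the issue; psutil is typically preinstalled on Kaggle and Colab):

import psutil

def report_memory():
    # Print total and currently available RAM in GiB
    vm = psutil.virtual_memory()
    print(f"total RAM: {vm.total / 1024**3:.2f} GiB, available: {vm.available / 1024**3:.2f} GiB")

report_memory()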
0 reactions
Innixma commented, Apr 14, 2021

Can we get the features after the preprocessing step? They are all numerical, and they are supposed to take less memory. If we do the preprocessing in a separate process while the data is loading and then skip the preprocessing step in fit(), this will help save memory.

Unfortunately, to fit the preprocessing stage we need all of the data present. There is no ‘skipping’ it.

However, you can fit the preprocessing prior to calling fit (useful for isolating the memory error):

This example shows how feature generators work: https://github.com/awslabs/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py

To get an identical generator to the one used in fit by default:

from autogluon.features.generators import AutoMLPipelineFeatureGenerator

# Fit the default preprocessing pipeline on the training features (label column dropped first)
feature_generator = AutoMLPipelineFeatureGenerator()
train_data_transformed = feature_generator.fit_transform(X=train_data.drop(columns=[LABEL]))
train_data_transformed[LABEL] = train_data[LABEL]

# Apply the already-fitted pipeline to the test data
test_data_transformed = feature_generator.transform(test_data)
test_data_transformed[LABEL] = test_data[LABEL]
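To confirm how much smaller the transformed frame actually is, a quick check with plain pandas (frame_size_mb is a hypothetical helper, not part of the issue or the AutoGluon API):

def frame_size_mb(df):
    # In-memory size of a DataFrame in megabytes, counting object columns accurately
    return df.memory_usage(deep=True).sum() / 1024 ** 2

print(f"raw train data:         {frame_size_mb(train_data):.1f} MB")
print(f"transformed train data: {frame_size_mb(train_data_transformed):.1f} MB")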

Then, to fit AutoGluon with the transformed data and avoid performing any preprocessing during fit, replace the default feature generator with a no-op generator (Identity):

from autogluon.features.generators import IdentityFeatureGenerator

predictor = TabularPredictor(
    label=LABEL,
    verbosity=2,
).fit(
    train_data=train_data_transformed,
    # IdentityFeatureGenerator is a no-op: the already-transformed features pass through unchanged
    feature_generator=IdentityFeatureGenerator(),
    time_limit=60,
)

leaderboard = predictor.leaderboard(test_data_transformed)
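If the runtime later dies with an out-of-memory error anyway, persisting the transformed frames avoids repeating the preprocessing after a restart. A sketch using plain pandas parquet I/O (not something from the issue; it assumes pyarrow or fastparquet is installed, which Kaggle images typically include):

# Cache the transformed data to disk
train_data_transformed.to_parquet('train_transformed.parquet')
test_data_transformed.to_parquet('test_transformed.parquet')

# ...after the runtime restarts, reload instead of re-running the feature generator
import pandas as pd
train_data_transformed = pd.read_parquet('train_transformed.parquet')
test_data_transformed = pd.read_parquet('test_transformed.parquet')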
Read more comments on GitHub >

Top Results From Across the Web

Memory error in official kaggle tutorial · Issue #1051 - GitHub
It is likely that there is too little memory, 2.7 GB dataset is very large for 15 GB memory, and depending on how...
Read more >
Tutorial on reading large datasets - Kaggle
read_csv will result in an out-of-memory error on Kaggle Notebooks. It has over 100 million rows and 10 columns. Different packages have their...
Read more >
RDD Programming Guide - Spark 3.3.1 Documentation
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines...
Read more >
Load - Hugging Face
This guide will show you how to load a dataset from: The Hub without a dataset loading script; Local loading script; Local files;...
Read more >
How do I load the CelebA dataset on Google Colab, using ...
I did not manage to find a solution to the memory problem. However, I came up with a workaround, custom dataset. Here is...
Read more >
