Tabular: Support Sparse Pandas DataFrames
Consider the following MWE, where I create a sparse 10,000 × 1,500 matrix and make the first column the label. The entire dataset takes only a few MB in memory. The task.fit
step encounters two major errors.
import autogluon as ag
import pandas as pd
from scipy import sparse
from autogluon import TabularPrediction as task
n = int(1e4)
m = sparse.eye(n, format='csc')[:, :1500]
m[::2, 0] = 1 # make it a balanced binary prediction problem.
columns=['label', *range(1499)]
dataset = task.Dataset(df=pd.DataFrame.sparse.from_spmatrix(m, columns=columns))
predictor = task.fit(train_data=dataset,
                     label='label',
                     output_directory='/tmp/tmp')
First, the main issue is that AutoGluon logs the following error 1499 times, presumably once for each feature column:
ERROR:autogluon.utils.tabular.features.utils:Warning: dtype Sparse[float64, 0.0] is not recognized as a valid dtype by numpy! AutoGluon may incorrectly handle this feature...
Traceback (most recent call last):
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
elif np.issubdtype(dtype, np.integer):
File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
arg1 = dtype(arg1).type
TypeError: data type not understood
ERROR:autogluon.utils.tabular.features.utils:data type not understood
Traceback (most recent call last):
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
elif np.issubdtype(dtype, np.integer):
File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
arg1 = dtype(arg1).type
TypeError: data type not understood
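The traceback above shows np.issubdtype being handed a pandas Sparse dtype, which NumPy cannot interpret. A minimal sketch of one way a dtype check could handle this is to unwrap the sparse dtype to the NumPy dtype of its stored values first. The function name and the returned family labels below are hypothetical and are not AutoGluon's actual code.

```python
import numpy as np
import pandas as pd


def get_type_family_sparse_aware(dtype):
    """Hypothetical helper (not AutoGluon's implementation).

    Maps a pandas/NumPy dtype to a coarse family label, unwrapping
    pandas sparse dtypes before asking NumPy about them.
    """
    # A SparseDtype wraps a NumPy dtype; `subtype` is the dtype of the
    # values that are actually stored.
    if isinstance(dtype, pd.SparseDtype):
        dtype = dtype.subtype
    # Other pandas extension dtypes (e.g. category) are not NumPy dtypes,
    # so avoid passing them to np.issubdtype at all.
    if not isinstance(dtype, np.dtype):
        return "other"
    if np.issubdtype(dtype, np.integer):
        return "int"
    if np.issubdtype(dtype, np.floating):
        return "float"
    return "other"
```

With a check like this, a column of dtype Sparse[float64, 0.0] would be classified by its underlying float64 values rather than triggering the TypeError.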
Second, it outputs the following error when fitting the NeuralNetClassifier model:
INFO:root:Fitting model: NeuralNetClassifier ...
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
self._fit(**kwargs)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
output = self.model(data_batch)
File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
out = self.forward(*args)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
return self.output_block(input_activations)
ERROR:autogluon.utils.tabular.ml.trainer.abstract_trainer:Warning: Exception caused NeuralNetClassifier to fail during training... Skipping this model.
Traceback (most recent call last):
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
self._fit(**kwargs)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
output = self.model(data_batch)
File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
out = self.forward(*args)
File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
return self.output_block(input_activations)
UnboundLocalError: local variable 'input_activations' referenced before assignment
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:local variable 'input_activations' referenced before assignment
However, the tree-based classifiers and the KNN classifiers seem to work:
INFO:root:Fitting model: RandomForestClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 7.98s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.42s = Validation runtime
INFO:root:Fitting model: RandomForestClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 7.54s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.25s = Validation runtime
INFO:root:Fitting model: ExtraTreesClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 13.05s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.45s = Validation runtime
INFO:root:Fitting model: ExtraTreesClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 13.41s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.62s = Validation runtime
INFO:root:Fitting model: KNeighborsClassifierUnif ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.23s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.57s = Validation runtime
INFO:root:Fitting model: KNeighborsClassifierDist ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.21s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.48s = Validation runtime
INFO:root:Fitting model: LightGBMClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 2.13s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.68s = Validation runtime
INFO:root:Fitting model: CatboostClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.62s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.04s = Validation runtime
INFO:root:Fitting model: LightGBMClassifierCustom ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 1.67s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.65s = Validation runtime
INFO:root:Fitting model: weighted_ensemble_k0_l1 ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.5 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.84s = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer: 0.0s = Validation runtime
INFO:autogluon.utils.tabular.ml.learner.default_learner:AutoGluon training complete, total runtime = 98.8s ...
It would be great if the two errors above could be resolved.
Issue Analytics
- State:
- Created 3 years ago
- Comments:12
Focusing on the case where the input is an array could also help motivate a flag that lets preprocessing be skipped. If the user passes in an array of float values, they may have already done all the preprocessing themselves, and should be allowed to bypass feature preprocessing at that point. A benefit is that AutoGluon would then not have to deal with sparse-feature preprocessing, which may be very messy.
@albert-ying Sparse matrices are not yet supported. However, a great deal of work has been done to make adding this functionality easier, via a total refactor of feature preprocessing in AutoGluon.
If anyone wants to try implementing this feature, first I’d recommend taking a close look at this example of building a custom feature generator:
https://github.com/awslabs/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py
From this script, you’ll be able to test any new functionality you add and pass sparse inputs to it to see if it is able to handle the sparse data without crashing.
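Until sparse inputs are supported, one possible workaround is to convert the sparse columns to dense ones before handing the DataFrame to AutoGluon. The sketch below uses only the pandas sparse accessor; the helper name is hypothetical, and it assumes the dense data fits in memory.

```python
import pandas as pd


def densify_sparse_columns(df):
    """Hypothetical helper: return a new DataFrame in which every
    sparse column has been converted to a regular dense column."""
    dense = {
        # Series.sparse.to_dense() materializes the fill values;
        # columns that are already dense are passed through unchanged.
        col: (s.sparse.to_dense() if isinstance(s.dtype, pd.SparseDtype) else s)
        for col, s in df.items()
    }
    return pd.DataFrame(dense, index=df.index)
```

The trade-off is memory: a dense 10,000 × 1,500 float64 frame needs 10,000 × 1,500 × 8 bytes, about 120 MB, compared with the few MB of the sparse version in the MWE.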