Tabular: Support Sparse Pandas DataFrames

Consider the following MWE, in which I create a sparse 10,000 × 1,500 matrix and make the first column the label. The entire dataset takes only a few MB in memory. The task.fit step encounters two major errors.

import pandas as pd
from scipy import sparse
from autogluon import TabularPrediction as task

n = int(1e4)
# Sparse 10,000 x 1,500 matrix: the first 1,500 columns of an identity matrix.
m = sparse.eye(n, format='csc')[:, :1500]
m[::2, 0] = 1  # make it a balanced binary prediction problem
# Column 0 is the label; the remaining 1,499 columns are features.
columns = ['label', *range(1499)]

dataset = task.Dataset(df=pd.DataFrame.sparse.from_spmatrix(m, columns=columns))
predictor = task.fit(train_data=dataset,
                     label='label',
                     output_directory='/tmp/tmp')

First, the main issue is that AutoGluon outputs the following error 1499 times, presumably once for each feature column:

ERROR:autogluon.utils.tabular.features.utils:Warning: dtype Sparse[float64, 0.0] is not recognized as a valid dtype by numpy! AutoGluon may incorrectly handle this feature...
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
    elif np.issubdtype(dtype, np.integer):
  File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
    arg1 = dtype(arg1).type
TypeError: data type not understood
ERROR:autogluon.utils.tabular.features.utils:data type not understood
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
    elif np.issubdtype(dtype, np.integer):
  File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
    arg1 = dtype(arg1).type
TypeError: data type not understood
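The TypeError arises because np.issubdtype needs something NumPy can interpret as a dtype, and the pandas extension dtype Sparse[float64, 0.0] is not. Below is a minimal sketch of one possible guard, assuming the check only needs the element type: unwrap pd.SparseDtype to its subtype before calling NumPy. The helper names numpy_dtype_of and type_family are hypothetical stand-ins and are not AutoGluon's actual code.

```python
import numpy as np
import pandas as pd


def numpy_dtype_of(dtype):
    """Return the underlying NumPy dtype, unwrapping pandas sparse dtypes."""
    if isinstance(dtype, pd.SparseDtype):
        # SparseDtype.subtype is the NumPy dtype of the stored values.
        return dtype.subtype
    return dtype


def type_family(dtype):
    """Hypothetical stand-in for a dtype classifier such as get_type_family."""
    dtype = numpy_dtype_of(dtype)
    try:
        if np.issubdtype(dtype, np.integer):
            return "int"
        if np.issubdtype(dtype, np.floating):
            return "float"
    except TypeError:
        # Other extension dtypes that NumPy cannot interpret fall through here.
        pass
    return "object"


print(type_family(pd.SparseDtype("float64", 0.0)))
print(type_family(np.dtype("int32")))
```

This only classifies the column; whether the downstream preprocessing can then consume sparse columns is a separate question.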

Second, it outputs the following error with the NeuralNetClassifier:

INFO:root:Fitting model: NeuralNetClassifier ...
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
    model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
    model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
    self._fit(**kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
    self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
    output = self.model(data_batch)
  File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
    out = self.forward(*args)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
    return self.output_block(input_activations)
ERROR:autogluon.utils.tabular.ml.trainer.abstract_trainer:Warning: Exception caused NeuralNetClassifier to fail during training... Skipping this model.
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
    model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
    model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
    self._fit(**kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
    self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
    output = self.model(data_batch)
  File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
    out = self.forward(*args)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
    return self.output_block(input_activations)
UnboundLocalError: local variable 'input_activations' referenced before assignment
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:local variable 'input_activations' referenced before assignment
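An UnboundLocalError like this one means the function reached its return statement without ever assigning the variable. I have not inspected embednet.py, so the following is only a guess at the mechanism: if input_activations is assigned solely inside branches that depend on which feature types were recognized, and none of the sparse columns were recognized, no branch runs. The function names and arguments below are hypothetical and only illustrate that pattern and one way to fail more clearly.

```python
def forward_unguarded(vector_features, embed_features):
    """Assigns the result only inside branches, like the suspected pattern."""
    if vector_features:
        input_activations = list(vector_features)
    if embed_features:
        embedded = list(embed_features)
        if vector_features:
            input_activations = input_activations + embedded
        else:
            input_activations = embedded
    # If both inputs are empty, the name was never bound.
    return input_activations


def forward_guarded(vector_features, embed_features):
    """Same result, but fails with an explicit message when nothing is usable."""
    input_activations = list(vector_features) + list(embed_features)
    if not input_activations:
        raise ValueError("no usable input features were recognized")
    return input_activations


try:
    forward_unguarded([], [])
except UnboundLocalError as err:
    print("unguarded:", type(err).__name__)
```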

However, the tree-based classifiers and the KNN classifier models seem to work:

INFO:root:Fitting model: RandomForestClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	7.98s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.42s	 = Validation runtime

INFO:root:Fitting model: RandomForestClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	7.54s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.25s	 = Validation runtime

INFO:root:Fitting model: ExtraTreesClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	13.05s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.45s	 = Validation runtime

INFO:root:Fitting model: ExtraTreesClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	13.41s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.62s	 = Validation runtime

INFO:root:Fitting model: KNeighborsClassifierUnif ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.23s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.57s	 = Validation runtime

INFO:root:Fitting model: KNeighborsClassifierDist ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.21s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.48s	 = Validation runtime

INFO:root:Fitting model: LightGBMClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	2.13s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.68s	 = Validation runtime

INFO:root:Fitting model: CatboostClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.62s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.04s	 = Validation runtime

INFO:root:Fitting model: LightGBMClassifierCustom ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.67s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.65s	 = Validation runtime

INFO:root:Fitting model: weighted_ensemble_k0_l1 ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.84s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.0s	 = Validation runtime

INFO:autogluon.utils.tabular.ml.learner.default_learner:AutoGluon training complete, total runtime = 98.8s ...

It would be great if these two errors could be resolved.

Issue Analytics

Created 3 years ago
Comments: 12

Top GitHub Comments

pushpendre commented, Jul 30, 2020

Focusing on the case where the input is an array could also motivate adding a flag that skips preprocessing. If the user passes in an array of float values, they may have already done all the preprocessing themselves, and they should be allowed to bypass feature preprocessing at that point. A benefit is that AutoGluon would then not have to deal with sparse-feature preprocessing, which may be very messy.

Innixma commented, Jun 30, 2021

@albert-ying Sparse matrices are not yet supported. However, a great deal of work has been done to make adding this functionality easier, via a total refactor of feature preprocessing in AutoGluon.

If anyone wants to try implementing this feature, first I’d recommend taking a close look at this example of building a custom feature generator:

https://github.com/awslabs/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py

With this script, you can test any new functionality you add by passing sparse inputs to it and checking whether it handles the sparse data without crashing.
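Until sparse inputs are supported, one possible workaround for moderately sized data is to densify the DataFrame before handing it to AutoGluon. The sketch below uses only pandas and SciPy, and assumes every column is sparse, as in the MWE above. The helper names make_sparse_frame and densify are hypothetical, and I have not verified the result against AutoGluon itself.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def make_sparse_frame(n_rows, n_cols):
    """Build a sparse DataFrame shaped like the MWE: column 0 is the label."""
    m = sparse.eye(n_rows, n_cols, format="lil")
    m[::2, 0] = 1  # balanced binary label in the first column
    columns = ["label", *range(n_cols - 1)]
    return pd.DataFrame.sparse.from_spmatrix(m.tocsc(), columns=columns)


def densify(df):
    """Convert an all-sparse DataFrame to ordinary NumPy-backed columns."""
    return df.sparse.to_dense()


sparse_df = make_sparse_frame(1000, 150)
dense_df = densify(sparse_df)

# Compare the memory footprint before and after densifying.
print("sparse bytes:", sparse_df.memory_usage(index=False).sum())
print("dense bytes: ", dense_df.memory_usage(index=False).sum())
```

The dense frame has plain float64 columns, so it could then be passed to task.Dataset(df=...) as in the original example. The cost is memory: the full 10,000 × 1,500 matrix needs 10,000 × 1,500 × 8 bytes = 120 MB as dense float64, so this only helps when the dense form still fits comfortably in RAM.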
