Tabular: Support Sparse Pandas DataFrames

See original GitHub issue

Consider the following MWE, where I create a sparse 10,000 × 1,500 matrix and make the first column the label. The entire dataset takes only a few MB in memory. The task.fit step encounters two major errors.

import autogluon as ag
import pandas as pd
from scipy import sparse
from autogluon import TabularPrediction as task

n = int(1e4)
m = sparse.eye(n, format='csc')[:, :1500]  # sparse 10,000 x 1,500 matrix
m[::2, 0] = 1                              # make column 0 a balanced binary label
columns = ['label', *range(1499)]          # first column is the label, the rest are features

dataset = task.Dataset(df=pd.DataFrame.sparse.from_spmatrix(m, columns=columns))
predictor = task.fit(train_data=dataset,
                     label='label',
                     output_directory='/tmp/tmp')
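
The memory claim is easy to verify (a quick sketch, assuming pandas >= 1.0 and reusing m and columns from above):

df = pd.DataFrame.sparse.from_spmatrix(m, columns=columns)
# The sparse frame is a small fraction of the ~120 MB a dense 10,000 x 1,500 float64 frame would need.
print(df.memory_usage(deep=True).sum() / 1e6, 'MB (sparse)')
print(df.sparse.to_dense().memory_usage(deep=True).sum() / 1e6, 'MB (dense)')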

First, the main issue is that autogluon outputs the following error 1499 times, presumably once for each feature column!

ERROR:autogluon.utils.tabular.features.utils:Warning: dtype Sparse[float64, 0.0] is not recognized as a valid dtype by numpy! AutoGluon may incorrectly handle this feature...
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
    elif np.issubdtype(dtype, np.integer):
  File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
    arg1 = dtype(arg1).type
TypeError: data type not understood
ERROR:autogluon.utils.tabular.features.utils:data type not understood
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/features/utils.py", line 16, in get_type_family
    elif np.issubdtype(dtype, np.integer):
  File "/home/prastogi/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py", line 393, in issubdtype
    arg1 = dtype(arg1).type
TypeError: data type not understood
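
For reference, the failing check can be reproduced outside AutoGluon. A minimal sketch (assuming pandas >= 1.0 and numpy) of why np.issubdtype rejects a pandas sparse dtype, and one way such columns could be detected instead:

import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame.sparse.from_spmatrix(sparse.eye(3, format='csc'))
dtype = df.dtypes.iloc[0]                          # Sparse[float64, 0.0]

# np.issubdtype(dtype, np.integer) raises "TypeError: data type not understood"
# because SparseDtype is a pandas extension dtype, not a numpy dtype.

print(isinstance(dtype, pd.SparseDtype))           # True
print(np.issubdtype(dtype.subtype, np.floating))   # True: inspect the underlying subtype instead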

Second, it outputs the following error with the NeuralNetClassifier.

INFO:root:Fitting model: NeuralNetClassifier ...
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
    model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
    model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
    self._fit(**kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
    self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
    output = self.model(data_batch)
  File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
    out = self.forward(*args)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
    return self.output_block(input_activations)
ERROR:autogluon.utils.tabular.ml.trainer.abstract_trainer:Warning: Exception caused NeuralNetClassifier to fail during training... Skipping this model.
Traceback (most recent call last):
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 255, in train_and_save
    model = self.train_single(X_train, y_train, X_val, y_val, model, kfolds=kfolds, k_fold_start=k_fold_start, k_fold_end=k_fold_end, n_repeats=n_repeats, n_repeat_start=n_repeat_start, level=level, time_limit=time_limit)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/trainer/abstract_trainer.py", line 240, in train_single
    model.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, time_limit=time_limit, **model_fit_kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/abstract/abstract_model.py", line 264, in fit
    self._fit(**kwargs)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 206, in _fit
    self.train_net(train_dataset=train_dataset, params=params, val_dataset=val_dataset, initialize=True, setup_trainer=True, time_limit=time_limit, reporter=reporter)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py", line 328, in train_net
    output = self.model(data_batch)
  File "/home/prastogi/conda/lib/python3.7/site-packages/mxnet/gluon/block.py", line 693, in __call__
    out = self.forward(*args)
  File "/home/prastogi/conda/lib/python3.7/site-packages/autogluon/utils/tabular/ml/models/tabular_nn/embednet.py", line 210, in forward
    return self.output_block(input_activations)
UnboundLocalError: local variable 'input_activations' referenced before assignment
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:local variable 'input_activations' referenced before assignment
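
For context, this kind of UnboundLocalError usually means the variable is only assigned inside a conditional branch that was never taken; presumably none of the branches in the network's forward pass match a purely sparse input. A generic sketch of the failure pattern (illustration only, not AutoGluon's actual embednet.py code):

# A variable assigned only inside an untaken branch is unbound when the function returns it.
def forward(data_batch, has_vector_features=False):
    if has_vector_features:
        input_activations = [x * 2 for x in data_batch]  # stand-in for the real feature processing
    return input_activations

try:
    forward([1.0, 2.0])  # branch skipped
except UnboundLocalError as e:
    print(e)             # local variable 'input_activations' referenced before assignment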

However, the tree-based classifiers and the KNN classifiers seem to work.

INFO:root:Fitting model: RandomForestClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	7.98s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.42s	 = Validation runtime

INFO:root:Fitting model: RandomForestClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	7.54s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.25s	 = Validation runtime

INFO:root:Fitting model: ExtraTreesClassifierGini ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	13.05s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.45s	 = Validation runtime

INFO:root:Fitting model: ExtraTreesClassifierEntr ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	13.41s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.62s	 = Validation runtime

INFO:root:Fitting model: KNeighborsClassifierUnif ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.23s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.57s	 = Validation runtime

INFO:root:Fitting model: KNeighborsClassifierDist ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.21s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.48s	 = Validation runtime

INFO:root:Fitting model: LightGBMClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	2.13s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.68s	 = Validation runtime

INFO:root:Fitting model: CatboostClassifier ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.62s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.04s	 = Validation runtime

INFO:root:Fitting model: LightGBMClassifierCustom ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	1.67s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.65s	 = Validation runtime

INFO:root:Fitting model: weighted_ensemble_k0_l1 ...
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.5	 = Validation accuracy score
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.84s	 = Training runtime
INFO:autogluon.utils.tabular.ml.trainer.abstract_trainer:	0.0s	 = Validation runtime

INFO:autogluon.utils.tabular.ml.learner.default_learner:AutoGluon training complete, total runtime = 98.8s ...

It would be great if these two errors could be resolved.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 12

Top GitHub Comments

1 reaction
pushpendre commented, Jul 30, 2020

Focusing on the case where the input is an array could also help by adding a flag to skip preprocessing. If the user is passing in an array of float values, they may have already done all the preprocessing themselves and should be allowed to bypass feature preprocessing at that point. A benefit is that AutoGluon would then not have to deal with sparse-feature preprocessing, which may be very messy.

0 reactions
Innixma commented, Jun 30, 2021

@albert-ying Sparse matrices are not yet supported; however, a great deal of work has been done to make adding this functionality easier via a complete refactor of feature preprocessing in AutoGluon.

If anyone wants to try implementing this feature, first I’d recommend taking a close look at this example of building a custom feature generator:

https://github.com/awslabs/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py

From this script, you’ll be able to test any new functionality you add and pass sparse inputs to it to see if it is able to handle the sparse data without crashing.
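
Independent of AutoGluon's generator API, the core transform such a generator would need is roughly the following (a pandas-only sketch, assuming pandas >= 1.0; converting to dense here purely for illustration):

import pandas as pd
from scipy import sparse

# A small sparse frame standing in for the generator's input.
X = pd.DataFrame.sparse.from_spmatrix(sparse.random(100, 5, density=0.05, format='csc'))

# Detect pandas sparse columns and convert them so that downstream numpy dtype
# checks succeed; a real generator might instead keep them sparse and route them
# to models that accept scipy matrices.
sparse_cols = [c for c in X.columns if isinstance(X.dtypes[c], pd.SparseDtype)]
X[sparse_cols] = X[sparse_cols].sparse.to_dense()
print(X.dtypes.unique())  # [dtype('float64')]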


Top Results From Across the Web

Sparse data structures — pandas 1.5.2 documentation
In a SparseDataFrame, all columns were sparse. A DataFrame can have a mixture of sparse and dense columns. As a consequence, assigning...

Working with sparse data sets in pandas and sklearn
It is possible to create a sparse data frame directly, using the sparse parameter in pandas get_dummies. This parameter defaults to False.

Making sparse DataFrames efficiently - Minot Lab
What I wanted to see was what the most efficient method would be to make this into a sparse table in wide format,...

How to build sparse matrix based on pandas table?
How to efficiently build and fill sparse matrix (from scipy.sparse) by values using i and j as row and column indices and d...

sparse data Python Pandas - Scaler Topics
Some dedicated attributes and methods of pandas sparse accessor allow the creation of sparse data frames and series from scipy sparse matrices...
