LSA component errors due to `nan`s in natural language feature
Related to #1999
Repro
import evalml
import pandas as pd
import woodwork as ww
df = pd.read_csv('/Users/dylan.sherry/Downloads/bos_311_balanced.csv')
dt = ww.DataTable(df, logical_types={'reason': 'Categorical'})
dt = dt.drop('closed_dt') # ignore datetime feature because nans in it produce another bug
# NOTE: `y` (the target) is not defined in this snippet as posted
automl = evalml.automl.AutoMLSearch(X_train=dt, y_train=y, problem_type='multiclass')
automl.search()
Produces:
File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/text_featurizer.py", line 121, in transform
X_lsa = self._lsa.transform(X[self._text_columns]).to_dataframe()
...
File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/lsa.py", line 63, in transform
transformed = self._lsa_pipeline.transform(X[col])
...
File "/Users/dylan.sherry/.pyenv/versions/evalml/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 219, in decode
raise ValueError("np.nan is an invalid document, expected byte or "
...
Fold 0: Exception during automl search: np.nan is an invalid document, expected byte or unicode string.
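The last frame of the traceback can be reproduced without evalml. This is a minimal sketch, assuming scikit-learn is installed: `TfidfVectorizer` stands in for whatever vectorizer the LSA pipeline uses internally, and the example documents and the `try_vectorize` helper are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; the middle entry mimics a missing text value.
docs = ["pothole on main street", np.nan, "streetlight out"]


def try_vectorize(documents):
    """Return the ValueError message raised while vectorizing, or None on success."""
    try:
        TfidfVectorizer().fit_transform(documents)
    except ValueError as err:
        return str(err)
    return None


error_message = try_vectorize(docs)
print(error_message)
```

The vectorizer rejects the `np.nan` entry while decoding documents, before any tokenization happens, which is consistent with the failure surfacing in `sklearn/feature_extraction/text.py`.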
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Had discussion with @jeremyliweishih @boopboopbeepboop @cmancuso
The short-term plan (tracked by this issue) is to add a data check error when natural language features contain `nan`s, asking users to either drop those rows or fill in the missing values on their own through some other means.

Long-term we will discuss a) options for handling missing natural language features beyond simply dropping rows and b) ways our transformers and/or estimator can still benefit from the information contained in the rows with missing values for the natural language features in question.
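The short-term plan could look roughly like the sketch below. This is not evalml's actual data check API: `check_natural_language_nans`, the column names, and the message wording are all hypothetical, and it uses plain pandas only. The last two lines show the two user-side workarounds mentioned above (dropping rows, or filling values, here with an empty string as an assumed placeholder).

```python
import numpy as np
import pandas as pd


def check_natural_language_nans(X, text_columns):
    """Return one error message per text column that contains missing values.

    Hypothetical helper for illustration; not part of evalml.
    """
    errors = []
    for col in text_columns:
        n_missing = int(X[col].isna().sum())
        if n_missing > 0:
            errors.append(
                f"Natural language column '{col}' contains {n_missing} NaN "
                "value(s); drop those rows or fill them in before searching."
            )
    return errors


# Hypothetical data: one text column with a missing entry, one numeric column.
X = pd.DataFrame(
    {
        "description": ["pothole on main street", np.nan, "streetlight out"],
        "ward": [1, 2, 3],
    }
)

errors = check_natural_language_nans(X, ["description"])

# Workaround 1: drop the rows whose text value is missing.
X_dropped = X.dropna(subset=["description"])

# Workaround 2: fill missing text with a placeholder (empty string assumed here).
X_filled = X.fillna({"description": ""})
```

Dropping rows also requires dropping the matching entries of the target, so in practice the same row mask would need to be applied to `y`.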
Awesome, closing in favor of https://github.com/alteryx/evalml/issues/3240 !