LSA component errors due to `nan`s in natural language feature

Related to #1999

Repro

import evalml
import pandas as pd
import woodwork as ww
df = pd.read_csv('/Users/dylan.sherry/Downloads/bos_311_balanced.csv')
dt = ww.DataTable(df, logical_types={'reason': 'Categorical'})
dt = dt.drop('closed_dt')  # ignore the datetime feature because NaNs in it produce another bug
# NOTE: `y` (the target) is not defined in the original report
automl = evalml.automl.AutoMLSearch(X_train=dt, y_train=y, problem_type='multiclass')
automl.search()

Produces:

  File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/text_featurizer.py", line 121, in transform
    X_lsa = self._lsa.transform(X[self._text_columns]).to_dataframe()
...
  File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/lsa.py", line 63, in transform
    transformed = self._lsa_pipeline.transform(X[col])
...
  File "/Users/dylan.sherry/.pyenv/versions/evalml/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 219, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
...
Fold 0: Exception during automl search: np.nan is an invalid document, expected byte or unicode string.
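
The last frame of the traceback is in scikit-learn's text vectorizer code, which rejects a missing value as a document. The following is a minimal sketch of the same failure outside evalml. It assumes (the issue does not confirm this) that the LSA pipeline begins with a `TfidfVectorizer`; the sample strings and the `try_vectorize` helper are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a natural language column with one missing entry.
docs = ["street light out", np.nan, "pothole on main street"]


def try_vectorize(documents):
    """Return the error message raised by the vectorizer, or None on success."""
    try:
        TfidfVectorizer().fit_transform(documents)
    except ValueError as err:
        return str(err)
    return None


message = try_vectorize(docs)
print(message)
```

If this reproduces the message shown in the traceback, the problem is in the input data rather than in the AutoML search itself.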

Issue Analytics

Created: 3 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

dsherry commented, Mar 23, 2021

Had a discussion with @jeremyliweishih @boopboopbeepboop @cmancuso

The short-term plan (tracked by this issue) is to add a data check error when natural language features contain nans, asking users to either drop those rows or fill in the missing values on their own through some other means.

Long-term we will discuss a) options for handling missing natural language features beyond simply dropping rows and b) ways our transformers and/or estimator can still benefit from the information contained in the rows with missing values for the natural language features in question.
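
Until such a data check exists, the short-term advice (drop the affected rows or fill in the missing values) can be applied by hand before calling `AutoMLSearch`. The following is a rough sketch in plain pandas, not evalml's data check API. The helper `find_text_columns_with_nans`, the column names, and the data are all hypothetical.

```python
import pandas as pd


def find_text_columns_with_nans(df, text_columns):
    """Hypothetical helper: return the text columns that contain missing values."""
    return [col for col in text_columns if df[col].isna().any()]


# Made-up data; the column names are illustrative only.
df = pd.DataFrame({
    "description": ["street light out", None, "pothole on main street"],
    "ward": ["1", "2", "3"],
})

bad_columns = find_text_columns_with_nans(df, ["description"])

# Option 1: drop rows where the text feature is missing.
dropped = df.dropna(subset=bad_columns)

# Option 2: fill missing text with an empty string so every document is a str.
filled = df.copy()
filled[bad_columns] = filled[bad_columns].fillna("")
```

If rows are dropped, the target needs to be filtered with the same index so that features and labels stay aligned. Filling with an empty string keeps every row but gives the featurizer no text signal for those rows.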

freddyaboulton commented, Jan 13, 2022

Awesome, closing in favor of https://github.com/alteryx/evalml/issues/3240 !
