LSA component errors due to `nan`s in natural language feature

Related to #1999

Repro

import evalml
import pandas as pd
import woodwork as ww

df = pd.read_csv('/Users/dylan.sherry/Downloads/bos_311_balanced.csv')
dt = ww.DataTable(df, logical_types={'reason': 'Categorical'})
# Ignore the datetime feature: nans in it produce a separate bug.
dt = dt.drop('closed_dt')
# NOTE: `y` (the target) is not defined in the original report; it must be set before this call.
automl = evalml.automl.AutoMLSearch(X_train=dt, y_train=y, problem_type='multiclass')
automl.search()

Produces:

  File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/text_featurizer.py", line 121, in transform
    X_lsa = self._lsa.transform(X[self._text_columns]).to_dataframe()
...
  File "/Users/dylan.sherry/development/evalml/evalml/pipelines/components/transformers/preprocessing/lsa.py", line 63, in transform
    transformed = self._lsa_pipeline.transform(X[col])
...
  File "/Users/dylan.sherry/.pyenv/versions/evalml/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 219, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
...
Fold 0: Exception during automl search: np.nan is an invalid document, expected byte or unicode string.
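
The last frame of the traceback is in scikit-learn's text vectorizer code, so the same error can be triggered without evalml. The snippet below is a minimal sketch, assuming the LSA component hands each text column to a scikit-learn vectorizer; `TfidfVectorizer`, the sample documents, and the helper `try_vectorize` are illustrative choices, not taken from the evalml source.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents; the middle entry mimics a missing text value.
docs = ["pothole on main street", np.nan, "street light is out"]


def try_vectorize(documents):
    """Hypothetical helper: return the error message if vectorizing fails, else None."""
    try:
        TfidfVectorizer().fit_transform(documents)
    except ValueError as err:
        return str(err)
    return None


# With the NaN present, scikit-learn rejects the document.
error_message = try_vectorize(docs)

# With only string documents, vectorizing succeeds.
clean_message = try_vectorize([d for d in docs if isinstance(d, str)])

print(error_message)
```

The same documents vectorize without error once the missing entry is filtered out, which points at the `nan` values rather than the text content as the cause.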

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
dsherry commented, Mar 23, 2021

Had a discussion with @jeremyliweishih @boopboopbeepboop @cmancuso

The short-term plan (tracked by this issue) is to add a data check error when natural language features contain nans, asking users to either drop those rows or fill in the missing values on their own through some other means.

Long-term we will discuss a) options for handling missing natural language features beyond simply dropping rows and b) ways our transformers and/or estimator can still benefit from the information contained in the rows with missing values for the natural language features in question.
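
Until such a data check exists, the short-term workaround described above (find the natural language columns that contain nans, then either drop those rows or fill the values) can be done by hand in pandas before the data reaches AutoMLSearch. This is a rough sketch under stated assumptions: the helper `find_text_columns_with_nans`, the column name `description`, and the sample rows are hypothetical, and this is not evalml's data check API.

```python
import pandas as pd


def find_text_columns_with_nans(df, text_columns):
    """Hypothetical helper: map each text column that has NaNs to its NaN count."""
    counts = df[text_columns].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Illustrative data; "description" stands in for a natural language feature.
df = pd.DataFrame({
    "description": ["pothole on main street", None, "street light is out"],
    "reason": ["Street", "Street", "Lights"],
})
text_columns = ["description"]  # assumed column name, not from the issue

# Report which text columns would trigger the error.
nan_report = find_text_columns_with_nans(df, text_columns)
print(nan_report)

# Option 1: drop the rows whose text value is missing.
dropped = df.dropna(subset=text_columns)

# Option 2: keep the rows and fill the missing text with an empty string.
filled = df.copy()
filled[text_columns] = filled[text_columns].fillna("")
```

Dropping rows discards whatever the other columns in those rows contain, while filling with an empty string keeps them; which is preferable depends on the dataset, and that trade-off is what the long-term discussion is about.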

1 reaction
freddyaboulton commented, Jan 13, 2022

Awesome, closing in favor of https://github.com/alteryx/evalml/issues/3240 !
