
help: How do I replicate the processed version of the data from the pipeline found by AutoMLSearch.search()?

See original GitHub issue

It is still not clear to me how I can get the version of the data that was transformed and passed to the pipeline, and the pipeline descriptions are very vague as well.

Here’s all that I have done:

>>> from evalml import AutoMLSearch
>>> automl = AutoMLSearch(X_train=Xy_3.drop(columns='target'), y_train=Xy_3.loc[:, 'target'],
...                       problem_type='regression', objective='root mean squared error', n_jobs=-1)
>>> automl.search()
>>> print(automl.describe_pipeline(automl.rankings.iloc[0]["id"]))

*********************************************************************************
* XGBoost Regressor w/ Imputer + Text Featurization Component + One Hot Encoder *
*********************************************************************************

Problem Type: regression
Model Family: XGBoost

Pipeline Steps
==============
1. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : median
	 * categorical_fill_value : None
	 * numeric_fill_value : None
2. Text Featurization Component
3. One Hot Encoder
	 * top_n : 10
	 * features_to_encode : None
	 * categories : None
	 * drop : if_binary
	 * handle_unknown : ignore
	 * handle_missing : error
4. XGBoost Regressor
	 * eta : 0.053613563977506225
	 * max_depth : 20
	 * min_child_weight : 6.470576304373694
	 * n_estimators : 688
	 * n_jobs : -1
>>> best_pipeline = automl.best_pipeline
>>> type(best_pipeline)
evalml.pipelines.regression_pipeline.RegressionPipeline

There are columns like the ones below, which I have no idea how to replicate.

 'DIVERSITY_SCORE(original_column_name)', 'LSA(original_column_name)[0]', 'LSA(original_column_name)[1]', 'POLARITY_SCORE(original_column_name)'

There is an API available to get all the column names, but that still doesn’t help me understand things like which imputation method was used, how columns were encoded, etc.

If I am missing something here, please tell me. Otherwise, please include an API to make it possible to reproduce the pre-processing.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
angela97lin commented, Jul 7, 2021

Glad to help, @naveen-9697 !

To answer your first question: those columns were created via our TextFeaturization component, which generates features via the nlp-primitives package. You can learn more about that package here, but EvalML leverages the LSA, Polarity Score, and Diversity Score primitives from that package to create those columns.

RE your second question: one way to determine that could be to dig further into the OHE. For example, you could use the categories attribute (ex: pipeline.component_graph.component_instances["One Hot Encoder"].categories('Country')) to get the different categories for the “Country” column that were created, or pipeline.component_graph.component_instances["One Hot Encoder"]._get_feature_provenance() to get a mapping of the original column to the columns created (note: this is available only after fitting).
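The `_get_feature_provenance()` call above is an EvalML internal, but the idea it implements can be sketched with plain pandas: one-hot columns are named after their source column, so they can be grouped back to it. Everything below (the data and the `Country` column) is invented for illustration:

```python
import pandas as pd

# Toy data standing in for one categorical column of the training set.
df = pd.DataFrame({"Country": ["US", "DE", "US", "FR"]})

# One-hot encode; pandas names each new column "<original>_<category>".
encoded = pd.get_dummies(df, columns=["Country"])

# Rebuild an original-column -> encoded-columns mapping, similar in spirit
# to what the encoder's feature provenance reports.
provenance = {
    "Country": [c for c in encoded.columns if c.startswith("Country_")]
}
print(provenance)  # -> {'Country': ['Country_DE', 'Country_FR', 'Country_US']}
```

This is only an analogy: EvalML's One Hot Encoder additionally applies `top_n`, `drop`, and unknown-value handling, so the exact set of columns can differ from a plain `get_dummies`.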

To answer your third question, hopefully this documentation helps: https://evalml.alteryx.com/en/stable/generated/methods/evalml.pipelines.components.Imputer.__init__.html#evalml.pipelines.components.Imputer.__init__

Specifically, “numeric_impute_strategy” determines the strategy to use (ex: most_frequent, mean, median, etc.). As our docs say: numeric_fill_value (int, float) – When numeric_impute_strategy == “constant”, fill_value is used to replace missing data. The default value of None will fill with 0. We’re filling in NaNs in both cases, it’s a matter of how we’re filling them in!
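To make the two strategies concrete, here is a minimal pandas sketch of what "most_frequent" imputation on a categorical column and "median" imputation on a numeric column amount to (the data and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", None, "DE"],  # categorical, one missing value
    "age": [30.0, None, 40.0, 50.0],      # numeric, one missing value
})

# most_frequent: fill missing categoricals with the mode ("US" here)
df["country"] = df["country"].fillna(df["country"].mode()[0])

# median: fill missing numerics with the column median (40.0 here)
df["age"] = df["age"].fillna(df["age"].median())
```

With `numeric_impute_strategy="constant"` the median line would instead be a `fillna(numeric_fill_value)`, which is the case the docs quoted above describe.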

1 reaction
angela97lin commented, Jul 7, 2021

Hi @naveen-9697! Thanks for your questions.

RE figuring out which imputation method was used, how columns were encoded: What you’re currently looking at is our pipeline description which lists out the different components in the pipelines and their parameters in a readable format. For example, the imputer has categorical_impute_strategy set to most_frequent, meaning that it uses the most frequent value to impute categorical columns. If you’d like to dive into the parameters specifically, you can also call automl.get_pipeline(automl.rankings.iloc[0]["id"]).parameters.
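`.parameters` returns a plain dict keyed by component name, so you can inspect it programmatically. The dict below is hand-written to mirror the pipeline description earlier in this issue; the exact keys in your version of EvalML may differ:

```python
# Hand-copied from the pipeline description above, for illustration only.
parameters = {
    "Imputer": {
        "categorical_impute_strategy": "most_frequent",
        "numeric_impute_strategy": "median",
    },
    "One Hot Encoder": {
        "top_n": 10,
        "drop": "if_binary",
        "handle_unknown": "ignore",
    },
    "XGBoost Regressor": {"max_depth": 20, "n_estimators": 688},
}

# e.g. pull out how numeric columns were imputed:
print(parameters["Imputer"]["numeric_impute_strategy"])  # -> median
```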

This probably isn’t ideal, but you can see the input feature names to each component via the input_feature_names attribute on a pipeline after it’s been fit. So in your example:

pipeline = automl.get_pipeline(automl.rankings.iloc[0]["id"])
# fit pipeline on your original training data:
pipeline.fit(X_train, y_train)
pipeline.input_feature_names # returns a dictionary mapping component names to columns

If you see something like:

{'TextFeaturization': ['original_column_name'],
 'XGBoost Regressor': ['LSA(original_column_name)[0]']}

from calling .input_feature_names, that means the XGBoost Regressor got the LSA(original_column_name)[0] column as an output of the TextFeaturization component.
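If the goal is the transformed matrix itself, the general pattern is to run every component except the final estimator and keep the result. The snippet below is a generic sketch with made-up stand-in transformers, not EvalML code; newer EvalML releases also added a `pipeline.transform_all_but_final(X, y)` helper that does this directly, though check whether your version has it:

```python
class Doubler:
    """Stand-in transformer with a scikit-learn-style fit_transform."""
    def fit_transform(self, X):
        return [x * 2 for x in X]

class AddTen:
    """Another stand-in transformer."""
    def fit_transform(self, X):
        return [x + 10 for x in X]

# All components except the final estimator, in pipeline order.
transformers = [Doubler(), AddTen()]

X = [1, 2, 3]
for t in transformers:
    X = t.fit_transform(X)

# X is now the "processed" data the final estimator would be trained on.
print(X)  # -> [12, 14, 16]
```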

I hope this helps! Let us know if you have any other questions 😃
