
help: How do I replicate the processed version of the data from the pipeline found by AutoMLSearch.search()?

See original GitHub issue

It is still not clear to me how I can get the version of the data that was transformed and passed to the pipeline, and the pipeline descriptions are very vague as well.

Here’s all that I have done:

>>> from evalml import AutoMLSearch
>>> automl = AutoMLSearch(X_train=Xy_3.drop(columns='target'), y_train=Xy_3.loc[:, 'target'],
...                       problem_type='regression', objective='root mean squared error', n_jobs=-1)
>>> automl.search()
>>> print(automl.describe_pipeline(automl.rankings.iloc[0]["id"]))

*********************************************************************************
* XGBoost Regressor w/ Imputer + Text Featurization Component + One Hot Encoder *
*********************************************************************************

Problem Type: regression
Model Family: XGBoost

Pipeline Steps
==============
1. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : median
	 * categorical_fill_value : None
	 * numeric_fill_value : None
2. Text Featurization Component
3. One Hot Encoder
	 * top_n : 10
	 * features_to_encode : None
	 * categories : None
	 * drop : if_binary
	 * handle_unknown : ignore
	 * handle_missing : error
4. XGBoost Regressor
	 * eta : 0.053613563977506225
	 * max_depth : 20
	 * min_child_weight : 6.470576304373694
	 * n_estimators : 688
	 * n_jobs : -1
>>> best_pipeline = automl.best_pipeline
>>> type(best_pipeline)
evalml.pipelines.regression_pipeline.RegressionPipeline

There are columns like the ones below, which I have no idea how to replicate.

 'DIVERSITY_SCORE(original_column_name)', 'LSA(original_column_name)[0]', 'LSA(original_column_name)[1]', 'POLARITY_SCORE(original_column_name)'

There is an API available to get all the column names, but that still doesn’t help me understand things like which imputation method was used, how columns were encoded, etc.

If I am missing something here, please tell me. Otherwise, please include an API to make it possible to reproduce the pre-processing.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
angela97lin commented, Jul 7, 2021

Glad to help, @naveen-9697 !

To answer your first question: those columns were created via our TextFeaturization component, which generates features via the nlp-primitives package. You can learn more about that package here, but EvalML leverages the LSA, Polarity Score, and Diversity Score primitives from that package to create those columns.

RE your second question: one way to determine that could be to dig further into the OHE. For example, you could use the categories attribute (ex: pipeline.component_graph.component_instances["One Hot Encoder"].categories('Country')) to get the different categories for the “Country” column that were created, or pipeline.component_graph.component_instances["One Hot Encoder"]._get_feature_provenance() to get a mapping of the original column to the columns created (note: this is available only after fitting).
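The `_get_feature_provenance()` call above is an EvalML internal, but the idea it implements can be sketched with plain pandas: one-hot columns are named after their source column, so they can be grouped back to it. Everything below (the data and the `Country` column) is invented for illustration:

```python
import pandas as pd

# Toy data standing in for one categorical column of the training set.
df = pd.DataFrame({"Country": ["US", "DE", "US", "FR"]})

# One-hot encode; pandas names each new column "<original>_<category>".
encoded = pd.get_dummies(df, columns=["Country"])

# Rebuild an original-column -> encoded-columns mapping, similar in spirit
# to what the encoder's feature provenance reports.
provenance = {
    "Country": [c for c in encoded.columns if c.startswith("Country_")]
}
print(provenance)  # -> {'Country': ['Country_DE', 'Country_FR', 'Country_US']}
```

This is only an analogy: EvalML's One Hot Encoder additionally applies `top_n`, `drop`, and unknown-value handling, so the exact set of columns can differ from a plain `get_dummies`.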

To answer your third question, hopefully this documentation helps: https://evalml.alteryx.com/en/stable/generated/methods/evalml.pipelines.components.Imputer.__init__.html#evalml.pipelines.components.Imputer.__init__

Specifically, “numeric_impute_strategy” determines the strategy to use (ex: most_frequent, mean, median, etc.). As our docs say: numeric_fill_value (int, float) – When numeric_impute_strategy == “constant”, fill_value is used to replace missing data. The default value of None will fill with 0. We’re filling in NaNs in both cases, it’s a matter of how we’re filling them in!
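To make the two strategies concrete, here is a minimal pandas sketch of what "most_frequent" imputation on a categorical column and "median" imputation on a numeric column amount to (the data and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", None, "DE"],  # categorical, one missing value
    "age": [30.0, None, 40.0, 50.0],      # numeric, one missing value
})

# most_frequent: fill missing categoricals with the mode ("US" here)
df["country"] = df["country"].fillna(df["country"].mode()[0])

# median: fill missing numerics with the column median (40.0 here)
df["age"] = df["age"].fillna(df["age"].median())
```

With `numeric_impute_strategy="constant"` the median line would instead be a `fillna(numeric_fill_value)`, which is the case the docs quoted above describe.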

1 reaction
angela97lin commented, Jul 7, 2021

Hi @naveen-9697! Thanks for your questions.

RE figuring out which imputation method was used, how columns were encoded: What you’re currently looking at is our pipeline description which lists out the different components in the pipelines and their parameters in a readable format. For example, the imputer has categorical_impute_strategy set to most_frequent, meaning that it uses the most frequent value to impute categorical columns. If you’d like to dive into the parameters specifically, you can also call automl.get_pipeline(automl.rankings.iloc[0]["id"]).parameters.
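`.parameters` returns a plain dict keyed by component name, so you can inspect it programmatically. The dict below is hand-written to mirror the pipeline description earlier in this issue; the exact keys in your version of EvalML may differ:

```python
# Hand-copied from the pipeline description above, for illustration only.
parameters = {
    "Imputer": {
        "categorical_impute_strategy": "most_frequent",
        "numeric_impute_strategy": "median",
    },
    "One Hot Encoder": {
        "top_n": 10,
        "drop": "if_binary",
        "handle_unknown": "ignore",
    },
    "XGBoost Regressor": {"max_depth": 20, "n_estimators": 688},
}

# e.g. pull out how numeric columns were imputed:
print(parameters["Imputer"]["numeric_impute_strategy"])  # -> median
```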

This probably isn’t ideal, but you can see the input feature names to each component via the input_feature_names attribute on a pipeline after it’s been fit. So in your example:

pipeline = automl.get_pipeline(automl.rankings.iloc[0]["id"])
# fit pipeline on your original training data:
pipeline.fit(X_train, y_train)
pipeline.input_feature_names # returns a dictionary mapping component names to columns

If you see something like:

{'TextFeaturization': ['original_column_name'],
 'XGBoost Regressor': ['LSA(original_column_name)[0]']}

from calling .input_feature_names, that means the XGBoost Regressor got the LSA(original_column_name)[0] column as an output of the TextFeaturization component.
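If the goal is the transformed matrix itself, the general pattern is to run every component except the final estimator and keep the result. The snippet below is a generic sketch with made-up stand-in transformers, not EvalML code; newer EvalML releases also added a `pipeline.transform_all_but_final(X, y)` helper that does this directly, though check whether your version has it:

```python
class Doubler:
    """Stand-in transformer with a scikit-learn-style fit_transform."""
    def fit_transform(self, X):
        return [x * 2 for x in X]

class AddTen:
    """Another stand-in transformer."""
    def fit_transform(self, X):
        return [x + 10 for x in X]

# All components except the final estimator, in pipeline order.
transformers = [Doubler(), AddTen()]

X = [1, 2, 3]
for t in transformers:
    X = t.fit_transform(X)

# X is now the "processed" data the final estimator would be trained on.
print(X)  # -> [12, 14, 16]
```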

I hope this helps! Let us know if you have any other questions 😃
