help: How do I replicate the processed version of the data from the pipeline found by AutoMLSearch.search()?
It is still not clear to me how I can get the version of the data that was transformed and passed to the pipeline's estimator, and the pipeline descriptions are vague as well.
here’s all that I have done:
>>> from evalml import AutoMLSearch
>>> automl = AutoMLSearch(X_train=Xy_3.drop(columns='target'), y_train=Xy_3.loc[:,'target'],
problem_type='regression', objective='root mean squared error', n_jobs=-1)
>>> automl.search()
>>> print(automl.describe_pipeline(automl.rankings.iloc[0]["id"]))
*********************************************************************************
* XGBoost Regressor w/ Imputer + Text Featurization Component + One Hot Encoder *
*********************************************************************************
Problem Type: regression
Model Family: XGBoost
Pipeline Steps
==============
1. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : median
* categorical_fill_value : None
* numeric_fill_value : None
2. Text Featurization Component
3. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
4. XGBoost Regressor
* eta : 0.053613563977506225
* max_depth : 20
* min_child_weight : 6.470576304373694
* n_estimators : 688
* n_jobs : -1
>>> best_pipeline = automl.best_pipeline
>>> type(best_pipeline)
evalml.pipelines.regression_pipeline.RegressionPipeline
There are columns like those below, which I have no idea how to replicate:
'DIVERSITY_SCORE(original_column_name)', 'LSA(original_column_name)[0]', 'LSA(original_column_name)[1]', 'POLARITY_SCORE(original_column_name)'
There is an API available to get all the column names, but that still does not help me understand things like which imputation method was used, how columns were encoded, etc.
If I am missing something here, please tell me. Otherwise, please consider adding an API to make it possible to reproduce the preprocessing.
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Glad to help, @naveen-9697 !
To understand your first question, those columns were created via our TextFeaturization component which generates features via the nlp-primitives package. You can learn more about that package here, but EvalML leverages the LSA, Polarity Score, and Diversity Score primitives from that package to create those columns.
RE your second question: one way to determine that could be to dig further into the OHE. For example, you could use the `categories` attribute (ex: `pipeline.component_graph.component_instances["One Hot Encoder"].categories('Country')`) to get the different categories that were created for the "Country" column, or `pipeline.component_graph.component_instances["One Hot Encoder"]._get_feature_provenance()` to get a mapping from each original column to the columns created from it (note: this is available only after fitting).

To answer your third question, hopefully this documentation helps: https://evalml.alteryx.com/en/stable/generated/methods/evalml.pipelines.components.Imputer.__init__.html#evalml.pipelines.components.Imputer.__init__
Specifically, `numeric_impute_strategy` determines the strategy to use (ex: `most_frequent`, `mean`, `median`, etc.). As our docs say:

> numeric_fill_value (int, float) – When numeric_impute_strategy == "constant", fill_value is used to replace missing data. The default value of None will fill with 0.
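A plain-pandas illustration of the difference between those two strategies on a numeric column (this mirrors the semantics described above rather than calling the EvalML `Imputer` itself):

```python
import pandas as pd

col = pd.Series([1.0, None, 3.0, 5.0])

# numeric_impute_strategy="median": fill NaNs with the column median (3.0 here)
median_filled = col.fillna(col.median())

# numeric_impute_strategy="constant" with the default fill_value of None: fill with 0
constant_filled = col.fillna(0)

print(median_filled.tolist())    # [1.0, 3.0, 3.0, 5.0]
print(constant_filled.tolist())  # [1.0, 0.0, 3.0, 5.0]
```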
We’re filling in NaNs in both cases; it’s a matter of how we’re filling them in!

Hi @naveen-9697! Thanks for your questions.
RE figuring out which imputation method was used and how columns were encoded: what you’re currently looking at is our pipeline description, which lists out the different components in the pipeline and their parameters in a readable format. For example, the imputer has `categorical_impute_strategy` set to `most_frequent`, meaning that it uses the most frequent value to impute categorical columns. If you’d like to dive into the parameters specifically, you can also call `automl.get_pipeline(automl.rankings.iloc[0]["id"]).parameters`.

This probably isn’t ideal, but you can see the input feature names to each component via the `input_feature_names` attribute on a pipeline after it’s been fit. So in your example:

If you see something like:
from calling `.input_feature_names`, that means the XGBoost Regressor got the `LSA(original_column_name)[0]` column as an output of the Text Featurization component.

I hope this helps! Let us know if you have any other questions 😃