Cannot get feature names after ColumnTransformer

When I use ColumnTransformer to preprocess different columns (numeric, categorical, text) with pipelines, I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

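import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical columns: fill missing values with a constant, then one-hot encode.
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
# Text column: bag-of-words counts.
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    ('text', text_transformer, text_columns[0])  # only the first text column
])

preprocesser.fit_transform(data)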
  1. preprocesser.get_feature_names() raises an error: AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
  2. In ColumnTransformer, text_transformer can only process a single column given as a string (e.g. ‘Sex’), not a list of strings such as text_columns.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 21
  • Comments: 20 (4 by maintainers)

Top GitHub Comments

24 reactions
pjgao commented, Nov 6, 2018

This is not an issue about ColumnTransformer.

Re 1. This is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you’re right that it’s unfriendly that we don’t have a clean way to apply a text vectorizer to each column. I’m not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.
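
One possible workaround for the second point, sketched here as an assumption rather than an officially recommended pattern, is to register a separate CountVectorizer for each text column. The toy DataFrame and the `text_` name prefix are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the two text columns (hypothetical values).
toy = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Braund, Mr Owen"],
    "home.dest": ["St Louis, MO", "New York, NY"],
})
text_columns = ["name", "home.dest"]

# One (name, transformer, column) entry per text column. Each column is given
# as a plain string, so every vectorizer receives the 1-D input it expects.
text_preprocessor = ColumnTransformer(transformers=[
    (f"text_{col}", CountVectorizer(), col) for col in text_columns
])

counts = text_preprocessor.fit_transform(toy)
print(counts.shape)
```

Each column gets its own vocabulary, and the resulting count matrices are concatenated side by side. On the real Titanic data, home.dest contains missing values, which would need to be filled with a string first because CountVectorizer does not accept NaN documents.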

Thanks for your kind reply! As far as I know, when I preprocess a column with a method that turns one column into several, such as OneHotEncoder or CountVectorizer, I can get the new column names from the get_feature_names function of the transformer in the pipeline’s last step. For methods that do not create new columns, I can just reuse the raw column names.

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        # The 'remainder' entry is only present when some columns are left over,
        # so skip it by name instead of always dropping the last entry.
        if name == 'remainder':
            continue
        if isinstance(transformer, Pipeline):
            # For a pipeline, the last step determines the output columns.
            transformer = transformer.steps[-1][1]
        try:
            # Note: get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2.
            names = transformer.get_feature_names()
        except AttributeError:  # no get_feature_names: fall back to the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. OneHotEncoder returns an array
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

Using the code above, I can get my preprocesser’s column names. Does this code solve the problem? As for eli5, I could not find that function. Can you give me a link to an explicit example or the API for eli5?

8 reactions
kylegilde commented, Sep 10, 2020

FYI, I wrote some code and a blog post about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

Top Results From Across the Web

Sklearn Pipeline: Get feature names after OneHotEncode In ...
To complete Venkatachalam's answer with what Paul asked in his comment, the order of feature names as it appears in the ColumnTransformer ....

Extracting Column Names from the ColumnTransformer
scikit-learn's ColumnTransformer is a great tool for data preprocessing but returns a numpy array without column names.

sklearn.compose.ColumnTransformer
Fit all transformers, transform the data and concatenate results. get_feature_names_out ([input_features]). Get output feature names for transformation.

Extracting Feature Names from the ColumnTransformer
Get feature names from ColumnTransformer in scikit-learn. ... After transforming the features, they do not have names in the new numpy array ...

Extracting Scikit Feature Names & Importances - Kaggle
Extracting & Plotting Feature Names & Importance from Scikit-Learn Pipelines ... verbose=None): """ Get the column names from the a ColumnTransformer ...