Cannot get feature names after ColumnTransformer


When I use ColumnTransformer with pipelines to preprocess different kinds of columns (numeric, categorical, and text), I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

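import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)
target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical columns: fill missing values with a constant, then one-hot encode.
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
# Text column: bag-of-words counts.
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    ('text', text_transformer, text_columns[0])  # a single column name (string), not the list
])

preprocesser.fit_transform(data)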

1. preprocesser.get_feature_names() raises an error: AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
2. In ColumnTransformer, text_transformer can only process a single column given as a string (e.g. 'Sex'), not a list of strings such as text_columns.
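The second point may come down to how the column selector shapes the input: per the scikit-learn documentation, a scalar string passes a 1-D vector to the transformer, while a list passes a 2-D table. The following pandas-only sketch illustrates the difference with a made-up two-row frame (the data is an assumption, not taken from the Titanic file). A text vectorizer iterates over its input as documents, so the two selections look very different to it.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({"name": ["Allen, Miss Elisabeth", "Allison, Master Hudson"]})

as_string = df["name"]    # scalar selector: 1-D Series
as_list = df[["name"]]    # list selector: 2-D DataFrame

# Iterating a Series yields the cell values (the documents).
print(as_string.ndim, list(as_string))
# Iterating a DataFrame yields the column labels, not the rows.
print(as_list.ndim, list(as_list))
```

This is why the snippet in the issue passes text_columns[0] rather than text_columns.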


Top GitHub Comments

24 reactions

pjgao commented, Nov 6, 2018

The first problem is not an issue with ColumnTransformer; it is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you’re right that it’s unfriendly that we don’t have a clean way to apply a text vectorizer to each column. I’m not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.
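One possible workaround, sketched here under assumptions rather than taken from the thread, is to register a separate CountVectorizer entry for each text column, each selected by a scalar column name. The transformer names (text_name, text_home.dest) and the two-row sample frame are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample data; real data would need missing values filled first,
# since CountVectorizer expects strings.
df = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Allison, Master Hudson"],
    "home.dest": ["St Louis, MO", "Montreal, PQ"],
})
text_columns = ["name", "home.dest"]

# One vectorizer per column; the scalar selector `col` gives each one 1-D input.
per_column_text = ColumnTransformer(transformers=[
    (f"text_{col}", CountVectorizer(), col) for col in text_columns
])

X = per_column_text.fit_transform(df)
print(X.shape)  # rows x (sum of the per-column vocabulary sizes)
```

Each column gets its own vocabulary this way, so the same word appearing in two columns produces two separate features.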

Thanks for your kind reply! As far as I know, when I preprocess a column with a transformer that expands one column into several, such as OneHotEncoder or CountVectorizer, I can get the new column names by calling get_feature_names on the last step of the pipeline. For transformers that do not create new columns, I can simply reuse the raw column names.

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        if name == 'remainder':  # skip ColumnTransformer's 'remainder' entry, if present
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]  # use the pipeline's last step
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # names returned as a NumPy array
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

Using the code above, I can get my preprocesser's column names. Does this code solve the problem? As for eli5, I could not find that function. Can you give me a link to an explicit example or the API for eli5?

8 reactions

kylegilde commented, Sep 10, 2020

FYI, I wrote some code and a blog post about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4
