Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings

Description

Steps/Code to Reproduce

I was puzzled by a large drop in model performance after applying FeatureHasher(input_type='string') to some of my features, even when n_features was higher than the number of unique strings in those features. Digging into it, I found that FeatureHasher was hashing the individual characters of each value instead of the string values themselves. Gist demonstrating the output of the transformations

Expected Results

ColumnTransformer(
    transformers=[('', FeatureHasher(n_features=8, input_type='string', alternate_sign=False), 'x')])

should hash string values.

Actual Results

Individual characters get hashed.
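
The difference between the expected and actual results can be sketched with FeatureHasher alone, outside of ColumnTransformer. With input_type='string', each sample is treated as an iterable of strings, and a bare Python string is an iterable of its characters. The sample values below are made up for illustration, and the per-character case is spelled out explicitly with list(v) because newer scikit-learn releases may reject bare strings as samples instead of hashing them silently.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up sample column, not taken from the original report.
values = ["apple", "banana", "apple"]

hasher = FeatureHasher(n_features=8, input_type="string", alternate_sign=False)

# Wrapping every value in a one-element list makes the whole string
# a single token, which is the behaviour the report expected.
per_value = hasher.transform([[v] for v in values]).toarray()

# A string iterates as its characters; list(v) makes that explicit and
# reproduces the per-character hashing described in the report.
per_char = hasher.transform([list(v) for v in values]).toarray()

# With alternate_sign=False every hashed token adds 1 to its bucket,
# so the row sum equals the number of tokens hashed for that sample.
print(per_value.sum(axis=1))
print(per_char.sum(axis=1))
```

The row sums make the problem visible: one token per row when values are wrapped, one token per character otherwise.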

Versions

System
------
    python: 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]
executable: {HOME}/anaconda3/bin/python
   machine: Linux-4.4.0-142-generic-x86_64-with-debian-stretch-sid

BLAS
----
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: {HOME}/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps
-----------
       pip: 19.0.1
setuptools: 40.8.0
   sklearn: 0.20.0
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.29.5
    pandas: 0.23.4

Issue Analytics

State:
Created 5 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions

mathijs81 commented, Feb 20, 2019

I agree with @jorisvandenbossche that ['x'] would make much more sense. However, this doesn’t work: the ColumnTransformer / FeatureHasher combination then returns a single row instead of one row per row of the input dataset.

Maybe the problem lies in that FeatureHasher (and HashingVectorizer) are very text-focused, and I’m basically just looking for a transformer that takes an existing feature (possibly string, but can also be int) and reduces the number of one-hot values by using a hash function. Maybe make_pipeline(FunctionTransformer(own_hash_function), OneHotEncoder) is better for my purpose instead of trying to use FeatureHasher directly.

If FeatureHasher had printed a warning or refused to hash single characters, that would have saved me a few hours. If you think that would be useful to add, I can try to propose a PR for it.
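
The alternative mentioned in this comment, hashing each value into a small number of buckets and then one-hot encoding the bucket, might look roughly like the sketch below. It is written in plain Python to stay self-contained; hash_bucket and one_hot_hashed are hypothetical names standing in for own_hash_function and the one-hot step, and MD5 is an arbitrary choice made here because Python's built-in hash() for strings is randomized per process by default. FeatureHasher itself uses a different hash function.

```python
import hashlib


def hash_bucket(value, n_buckets=8):
    """Map one value (str or int) to a stable bucket index in [0, n_buckets).

    Hypothetical helper: values are converted with str(), so 42 and "42"
    land in the same bucket.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def one_hot_hashed(values, n_buckets=8):
    """Return one row per input value with a single 1 at its bucket index."""
    rows = []
    for v in values:
        row = [0] * n_buckets
        row[hash_bucket(v, n_buckets)] = 1
        rows.append(row)
    return rows


# Made-up sample values; equal values always map to identical rows.
encoded = one_hot_hashed(["apple", "banana", "apple", 42])
for row in encoded:
    print(row)
```

Because the whole value is hashed at once, the number of output rows always matches the number of input values, and distinct values only share a column when their hashes collide modulo n_buckets.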

0 reactions

JnsLns commented, Nov 19, 2022

I ran into that problem as well. I think this (still) has a lot of potential for silent failure.
