Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings

Description

Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings

Steps/Code to Reproduce

I was puzzled by a large drop in model performance after applying FeatureHasher(input_type='string') to some of the features, even though n_features was higher than the number of unique strings in those features. Digging into it, I found that FeatureHasher was hashing the individual characters instead of the string values. A gist demonstrating the output of the transformations accompanied the original issue.

Expected Results

ColumnTransformer(
    transformers=[('', FeatureHasher(n_features=8, input_type='string', alternate_sign=False), 'x')])

should hash string values.

Actual Results

Individual characters get hashed.
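
To make the behavior concrete, here is a minimal sketch with made-up sample values. With input_type='string', FeatureHasher treats every sample as an iterable of tokens, so a bare Python string is walked character by character. The sketch reproduces that by passing the characters explicitly, then contrasts it with wrapping each value in a one-element list. Recent scikit-learn releases appear to reject bare-string samples with an error, which is why the characters are spelled out here instead of passing the strings directly.

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string", alternate_sign=False)

values = ["apple", "banana", "apple"]  # illustrative data, not from the issue

# What the hasher effectively sees when a sample is a bare string:
# iterating "apple" yields 'a', 'p', 'p', 'l', 'e', so five tokens are hashed.
per_character = hasher.transform([list(v) for v in values]).toarray()

# Wrapping each value in a one-element list makes the whole string one token.
per_value = hasher.transform([[v] for v in values]).toarray()

# With alternate_sign=False every token adds +1 to its bucket, so the row sum
# equals the number of tokens hashed for that sample.
tokens_per_row_chars = per_character.sum(axis=1)
tokens_per_row_values = per_value.sum(axis=1)
```

In the per-value case each row contains a single 1, and identical strings land in the same bucket, which is the behavior the issue expected.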

Versions

System
------
    python: 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]
executable: {HOME}/anaconda3/bin/python
   machine: Linux-4.4.0-142-generic-x86_64-with-debian-stretch-sid

BLAS
----
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: {HOME}/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps
-----------
       pip: 19.0.1
setuptools: 40.8.0
   sklearn: 0.20.0
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: 0.29.5
    pandas: 0.23.4

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

mathijs81 commented, Feb 20, 2019 (2 reactions)

I agree with @jorisvandenbossche that ['x'] would make much more sense. However, this doesn't work: the ColumnTransformer / FeatureHasher combination then returns a single row instead of one row per row of the input dataset.
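
One likely explanation for the single row, offered as a reading of the behavior and not a confirmed diagnosis: selecting ['x'] hands the transformer a one-column DataFrame, and iterating a DataFrame yields its column labels, not its rows. A hasher that iterates its input would then see exactly one "sample", the label 'x'. The small sketch below only shows the iteration difference, using made-up data.

```python
import pandas as pd

df = pd.DataFrame({"x": ["apple", "banana", "apple"]})  # illustrative data

# Selecting with a list keeps a DataFrame; iterating it yields column labels.
samples_from_frame = list(df[["x"]])

# Selecting with a scalar gives a Series; iterating it yields the cell values.
samples_from_series = list(df["x"])
```

Here samples_from_frame holds one item (the column name), while samples_from_series holds one item per input row.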

Maybe the problem is that FeatureHasher (and HashingVectorizer) are very text-focused, whereas I'm basically just looking for a transformer that takes an existing feature (possibly a string, but it can also be an int) and reduces the number of one-hot values by using a hash function. Maybe make_pipeline(FunctionTransformer(own_hash_function), OneHotEncoder) is better for my purpose than trying to use FeatureHasher directly.

If FeatureHasher had printed a warning or refused to hash single characters, that would have saved me a few hours. If you think that would be useful to add, I can try to propose a PR for it.
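
The make_pipeline(FunctionTransformer(own_hash_function), OneHotEncoder) idea from the comment above could look roughly like the sketch below. The comment never defines own_hash_function, so hash_to_bucket and N_BUCKETS are hypothetical stand-ins, and SHA-256 is an arbitrary choice of stable hash. The sketch assumes a single input column, because OneHotEncoder's categories argument needs one entry per column.

```python
import hashlib

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

N_BUCKETS = 8  # hypothetical number of hash buckets


def hash_to_bucket(X):
    """Map every cell of a 2-D array to a stable bucket index in [0, N_BUCKETS)."""
    X = np.asarray(X, dtype=object)

    def bucket(value):
        # str() lets the same function handle both string and int features.
        digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % N_BUCKETS

    return np.array([[bucket(v) for v in row] for row in X], dtype=np.int64)


hashed_one_hot = make_pipeline(
    FunctionTransformer(hash_to_bucket, validate=False),
    # Fixing the categories keeps the output width at N_BUCKETS even when
    # some buckets never occur in the training data.
    OneHotEncoder(categories=[list(range(N_BUCKETS))]),
)

X = np.array([["apple"], ["banana"], ["apple"]], dtype=object)  # illustrative data
encoded = hashed_one_hot.fit_transform(X).toarray()
```

Each input row becomes one row with a single 1 in its bucket column. Distinct values can still collide in the same bucket, as with any hashing scheme.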

JnsLns commented, Nov 19, 2022 (0 reactions)

I ran into that problem as well. I think this (still) has a lot of potential for silent failure.

Top Results From Across the Web

Feature Hashing on multiple categorical features (columns)
The reason is that if you use the feature hasher with input type 'string' it expects a list of strings. If you just...

sklearn.feature_extraction.FeatureHasher
Implements feature hashing, aka the hashing trick. This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash ......

Extracting, transforming and selecting features - Apache Spark
This is done using the hashing trick to map features to indices in the feature vector. The FeatureHasher transformer operates on multiple columns....

dask_ml.feature_extraction.text.FeatureHasher - Dask-ML
Unicode strings are converted to UTF-8 first, but no Unicode normalization is done. ... Vectorizes string-valued features using a hash table.

The power of Shapes, Hashing, and Column Transformers in ...
Excitement for me comes in many forms. ... pipeline and Column Transformer (numerical, categorical and hashing) — the holy grail.
