Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings
Description
Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings
Steps/Code to Reproduce
I was puzzled by a large drop in model performance after applying FeatureHasher(input_type='string') to some of the features, even though n_features was higher than the number of unique strings in those features. Digging into it, I found that FeatureHasher was hashing the individual characters of each value instead of the string values themselves. Gist demonstrating the output of transformations
Expected Results
ColumnTransformer(
transformers=[('', FeatureHasher(n_features=8, input_type='string', alternate_sign=False), 'x')])
should hash string values.
Actual Results
Individual characters get hashed.
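The mechanism can be illustrated without scikit-learn. The sketch below is a toy stand-in, not FeatureHasher's actual implementation: `toy_hash_samples` is a hypothetical helper that assumes the same input contract (each sample is an iterable of string tokens) and uses MD5 only to get a stable bucket index. Because a Python string is itself an iterable of characters, passing a flat column of strings makes every character a token.

```python
import hashlib


def toy_hash_samples(samples, n_features=8):
    """Toy hashing trick: count tokens per bucket for each sample.

    Hypothetical helper for illustration only; it mimics the
    "iterable of iterables of strings" input contract.
    """
    rows = []
    for sample in samples:
        row = [0] * n_features
        for token in sample:  # iterating over a str yields its characters
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            row[int(digest, 16) % n_features] += 1
        rows.append(row)
    return rows


column = ["cat", "dog"]

# Flat column of strings: each string is treated as a sample of characters.
per_character = toy_hash_samples(column)

# Each value wrapped in its own list: each string is a single token.
per_string = toy_hash_samples([[value] for value in column])

print([sum(row) for row in per_character])  # [3, 3] -> three tokens per row
print([sum(row) for row in per_string])     # [1, 1] -> one token per row
```

Under this assumption, wrapping each value in a one-element list is what makes the whole string the unit that gets hashed.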
Versions
System
------
python: 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
executable: {HOME}/anaconda3/bin/python
machine: Linux-4.4.0-142-generic-x86_64-with-debian-stretch-sid
BLAS
----
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: {HOME}/anaconda3/lib
cblas_libs: mkl_rt, pthread
Python deps
-----------
pip: 19.0.1
setuptools: 40.8.0
sklearn: 0.20.0
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.29.5
pandas: 0.23.4
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
I agree with @jorisvandenbossche that ['x'] would make much more sense. However, this doesn't work: the ColumnTransformer / FeatureHasher combination then returns a single row instead of one row per row of the input dataset.
Maybe the problem is that FeatureHasher (and HashingVectorizer) are very text-focused, whereas I'm basically just looking for a transformer that takes an existing feature (possibly a string, but it can also be an int) and reduces the number of one-hot values by using a hash function. Maybe make_pipeline(FunctionTransformer(own_hash_function), OneHotEncoder) would suit my purpose better than trying to use FeatureHasher directly.
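A minimal sketch of that pipeline idea, under stated assumptions: `hash_to_bucket` and `N_BUCKETS` are hypothetical names standing in for own_hash_function, MD5 is used only as a stable hash (Python's built-in `hash()` is salted per process for strings), and the input is a 2D single-column array. It is not a recommendation from the scikit-learn maintainers.

```python
import hashlib

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

N_BUCKETS = 8  # hypothetical bucket count, plays the role of n_features


def hash_to_bucket(X):
    """Map every cell of a 2D array-like to a bucket id in [0, N_BUCKETS)."""
    X = np.asarray(X, dtype=object)

    def bucket(value):
        # str() lets the same function handle both strings and ints.
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % N_BUCKETS

    return np.array([[bucket(v) for v in row] for row in X], dtype=int)


hashed_one_hot = make_pipeline(
    FunctionTransformer(hash_to_bucket, validate=False),
    # Fixing the categories keeps the output width at N_BUCKETS
    # even if some buckets never occur in the training data.
    OneHotEncoder(categories=[list(range(N_BUCKETS))], handle_unknown="ignore"),
)

X = np.array([["apple"], ["banana"], ["apple"], [42]], dtype=object)
encoded = hashed_one_hot.fit_transform(X)
print(encoded.shape)  # (4, 8): one row per input row, one column per bucket
```

Each input row produces exactly one active column, and equal values always land in the same bucket; distinct values may collide, as with any hashing scheme.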
If FeatureHasher had printed a warning or refused to hash single characters, that would have saved me a few hours. If you think that would be useful to add, I can try to propose a PR for it.
I ran into that problem as well. I think this (still) has a lot of potential for silent failure.