TfidfVectorizer ngrams does not work when vocabulary provided

Description

The TfidfVectorizer does not honor the ngram_range argument when the vocabulary is provided.

Steps/Code to Reproduce

Example 1, vocabulary is not provided, this works as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

Example 2, vocabulary is provided, this does not work as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['a', 'b', 'c', 'd', 'e']

Note that it works if the vocabulary I provide consists of the ngrams themselves:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=['ab', 'bc', 'cd', 'de'],
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

But that seems kind of silly, since I can’t possibly know all of the ngrams a priori for a large dataset.
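One possible workaround, sketched here under the assumption that the character set (rather than the full list of ngrams) is known ahead of time, is to enumerate every bigram over that alphabet and pass the result as the vocabulary. The names alphabet and bigram_vocabulary are illustrative only and do not come from the issue, and the vocabulary grows as len(alphabet) ** n, so this is only practical for small alphabets and short ngrams:

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc', 'bcd', 'cde']

# Assumption: the character set is known up front (hypothetical for this sketch).
alphabet = 'abcde'

# Every possible character bigram over the alphabet: 5 * 5 = 25 entries.
bigram_vocabulary = [''.join(pair) for pair in product(alphabet, repeat=2)]

tfidf = TfidfVectorizer(analyzer='char',
                        vocabulary=bigram_vocabulary,
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)

# Column j corresponds to bigram_vocabulary[j]; only bigrams that actually
# occur in X receive non-zero weights.
seen = sorted({bigram_vocabulary[j] for j in sps.nonzero()[1]})
print(seen)
# ['ab', 'bc', 'cd', 'de']
```

The resulting matrix has one column per candidate bigram, so most columns stay empty; that is the price of fixing the column layout before seeing the data.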

Expected Results

I expected to still get ngrams when a vocabulary is provided, but did not.

Actual Results

See steps to reproduce above.

Versions

System:
    python: 3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/envs/myenv/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python dependencies:
       pip: 19.3.1
setuptools: 42.0.2.post20191203
   sklearn: 0.22
     numpy: 1.17.4
     scipy: 1.4.0
    Cython: 0.29.14
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

State:
Created 4 years ago
Reactions: 1
Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction

rth commented, Jan 3, 2020

Thanks @tgsmith61591, a PR would be welcome to fix it…

0 reactions

chrispe commented, Apr 18, 2020

I had a look at this and I’m not sure if this is an issue or not. Can you validate it @tgsmith61591/@rth?

When you set the vocabulary argument, it is considered fixed, so any words or ngrams that are not in it are ignored and not processed by the CountVectorizer.
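A minimal sketch of that behaviour, reusing the data from the report: the analyzer still produces character bigrams according to ngram_range, but none of them is a key of the fixed single-character vocabulary, so nothing is counted. The variable names are illustrative, and this is only a way to observe the effect, not a description of the internal code path:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc', 'bcd', 'cde']

tfidf = TfidfVectorizer(analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

# The analyzer honours ngram_range and emits bigrams for a document.
analyzer = tfidf.build_analyzer()
bigrams = analyzer('abc')
print(bigrams)
# ['ab', 'bc']

# The fixed vocabulary only contains single characters, so no bigram matches
# a column and the resulting matrix has no stored non-zero entries.
sps = tfidf.fit_transform(X)
print(sps.shape, sps.nnz)
# (3, 5) 0
```

So the feature names in Example 2 are simply the keys of the supplied vocabulary, while the matrix behind them is empty.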

This is the line where that is defined: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L1118

I also made a ‘fix’ which does achieve the behaviour that was asked for, but it doesn’t pass all the tests. You can check my changes here: https://github.com/chrispe92/scikit-learn/commit/33fa122590245435e09ff5688142250c5d2dec2a
