TfidfVectorizer ngrams does not work when vocabulary provided

See original GitHub issue

Description

The TfidfVectorizer does not honor the ngram_range argument when the vocabulary is provided.

Steps/Code to Reproduce

Example 1: vocabulary is not provided. This works as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

Example 2: vocabulary is provided. This does not work as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['a', 'b', 'c', 'd', 'e']

Note that it works if the vocabulary I provide consists of the n-grams themselves:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=['ab', 'bc', 'cd', 'de'],
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

But that seems kind of silly, since I can’t possibly know all of the ngrams a priori for a large dataset.
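
One possible workaround, offered only as a sketch: if the character alphabet is small and known in advance, every candidate n-gram can be enumerated and passed as the vocabulary. The helper `char_ngram_vocabulary` below is hypothetical (it is not part of scikit-learn), and the approach assumes the alphabet is known up front; the vocabulary grows as len(alphabet) ** n, so it is only practical for small alphabets and short n-grams.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer


def char_ngram_vocabulary(alphabet, n):
    """Return every length-n string over `alphabet` (hypothetical helper)."""
    return [''.join(chars) for chars in product(sorted(set(alphabet)), repeat=n)]


X = ['abc',
     'bcd',
     'cde']

# 5 characters -> 5 ** 2 = 25 candidate bigrams, most of which never occur in X.
vocabulary = char_ngram_vocabulary('abcde', 2)

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=vocabulary,
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)

# vocabulary_ maps each term to its column index; invert it to name the columns.
features = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Keep only the bigrams that actually received a non-zero weight.
observed = sorted({features[j] for j in sps.nonzero()[1]})

print(sps.shape)
# (3, 25)
print(observed)
# ['ab', 'bc', 'cd', 'de']
```

The columns for bigrams that never occur stay empty, so the matrix remains sparse even though the vocabulary is larger than necessary.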

Expected Results

I expected to still get n-grams when a vocabulary is provided, but got the single-character vocabulary entries instead.

Actual Results

See steps to reproduce above.

Versions

System:
    python: 3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/envs/myenv/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python dependencies:
       pip: 19.3.1
setuptools: 42.0.2.post20191203
   sklearn: 0.22
     numpy: 1.17.4
     scipy: 1.4.0
    Cython: 0.29.14
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

rth commented, Jan 3, 2020 (1 reaction)

Thanks @tgsmith61591, a PR would be welcome to fix it…

chrispe commented, Apr 18, 2020 (0 reactions)

I had a look at this and I’m not sure if this is an issue or not. Can you validate it, @tgsmith61591/@rth?

When you set the vocabulary argument, the vocabulary is considered fixed, so any words or n-grams that are not in it are ignored and are not counted by the CountVectorizer.
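
A small check of this explanation, reusing the data from Example 2 in the issue (a sketch, not a statement about how the maintainers intend the API to behave): the analyzer still emits character bigrams, but because none of them is a key of the fixed unigram vocabulary, nothing is counted and the resulting matrix should be empty.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

# The analyzer still honors ngram_range and emits character bigrams.
analyze = tfidf.build_analyzer()
bigrams = analyze('abc')
print(bigrams)
# ['ab', 'bc']

# None of those bigrams is in the fixed vocabulary, so no term is counted:
# the matrix has one column per vocabulary entry but no stored values.
sps = tfidf.fit_transform(X)
print(sps.shape)
# (3, 5)
print(sps.nnz)
# 0
```

So ngram_range is not ignored outright here; the n-grams are generated and then filtered out because they do not match any vocabulary entry.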

This is the line where that is defined: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L1118

I also made a ‘fix’ which achieves the requested behaviour but doesn’t pass all the tests. You can check my changes here: https://github.com/chrispe92/scikit-learn/commit/33fa122590245435e09ff5688142250c5d2dec2a
