TfidfVectorizer ngrams does not work when vocabulary provided
Description
The TfidfVectorizer does not honor the ngram_range argument when the vocabulary is provided.
Steps/Code to Reproduce
Example 1, vocabulary is not provided, this works as expected:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc', 'bcd', 'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']
Example 2, when vocabulary is provided. This does not work as expected:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc', 'bcd', 'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['a', 'b', 'c', 'd', 'e']
Note that it works if the vocabulary I provide are the ngrams themselves:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc', 'bcd', 'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=['ab', 'bc', 'cd', 'de'],
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']
But that seems kind of silly, since I can’t possibly know all of the ngrams a priori for a large dataset.
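One possible workaround, sketched here under an assumption about the intent: if the goal is to keep only the n-grams built from a known set of characters, the n-grams can be discovered from the corpus first and then filtered before being passed as the vocabulary. The name allowed_chars and the filtering rule are hypothetical and not part of the original report.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

X = ['abc', 'bcd', 'cde']
# Hypothetical restriction: only n-grams made of these characters are kept.
allowed_chars = {'a', 'b', 'c', 'd'}

# Step 1: let an unrestricted vectorizer discover every 2-gram in the corpus.
discover = CountVectorizer(analyzer='char', ngram_range=(2, 2))
discover.fit(X)

# Step 2: keep the learned n-grams whose characters are all allowed.
# vocabulary_ maps each learned n-gram to its column index.
ngrams = sorted(ng for ng in discover.vocabulary_ if set(ng) <= allowed_chars)

# Step 3: pass the filtered n-grams as the fixed vocabulary.
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2, 2), vocabulary=ngrams)
sps = tfidf.fit_transform(X)
print(ngrams)
```

This costs an extra pass over the data, but it avoids having to enumerate the n-grams by hand.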
Expected Results
Expected to still get ngrams when vocabulary is provided, but did not.
Actual Results
See steps to reproduce above.
Versions
System:
python: 3.7.5 (default, Oct 25 2019, 10:52:18) [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/envs/myenv/bin/python
machine: Darwin-18.6.0-x86_64-i386-64bit
Python dependencies:
pip: 19.3.1
setuptools: 42.0.2.post20191203
sklearn: 0.22
numpy: 1.17.4
scipy: 1.4.0
Cython: 0.29.14
pandas: 0.25.3
matplotlib: 3.1.2
joblib: 0.14.1
Built with OpenMP: True
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:9 (4 by maintainers)
Thanks @tgsmith61591, a PR would be welcome to fix it…
I had a look at this and I’m not sure if this is an issue or not. Can you validate it, @tgsmith61591/@rth?
When you set the vocabulary argument, it is considered fixed, so any new words or ngrams will be ignored and not be processed by the CountVectorizer. This is the line where that is defined: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L1118
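A small sketch of that behaviour, reusing the data from the report: the analyzer still emits 2-grams, but none of them appear in the fixed single-character vocabulary, so every count is zero. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['abc', 'bcd', 'cde']

# Fixed vocabulary of single characters combined with a 2-gram analyzer.
vec = CountVectorizer(analyzer='char',
                      ngram_range=(2, 2),
                      vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4})

# The analyzer honors ngram_range and produces character 2-grams.
tokens = vec.build_analyzer()('abc')

# Tokens missing from the fixed vocabulary are silently skipped,
# so the resulting matrix has one column per vocabulary entry and no counts.
counts = vec.fit_transform(docs)
print(tokens, counts.shape, counts.sum())
```

So ngram_range is applied when tokenizing; the columns simply come from the supplied vocabulary rather than from the generated n-grams.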
I also made a ‘fix’ which achieves the requested behaviour but doesn’t pass all the tests. You can check my changes here: https://github.com/chrispe92/scikit-learn/commit/33fa122590245435e09ff5688142250c5d2dec2a