Memoize tokenisation in CountVectorizer
IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping (s -> tokenize(preprocess(s))) can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage).
Challenges to memoization:
- it might make sense to use in-memory caching rather than on-disk caching, given the relatively small blobs being cached.
- it has to be conditioned on all relevant constructor parameters to the CountVectorizer (i.e. anything that goes into build_analyzer up to and including tokenization)
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:8 (8 by maintainers)
I think you could work on it @mina1987.
One tricky decision is how to handle the cache if the vectorizer is cloned.
I’m not sure what you mean. A pull request or code snippet would make your proposal more concrete.