Memoize tokenisation in CountVectorizer

IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping (s -> tokenize(preprocess(s))) can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage).

Challenges to memoization:

It might make sense to use in-memory caching rather than on-disk caching, given the relatively small blobs being cached.
It has to be conditioned on all relevant constructor parameters of the CountVectorizer (i.e. anything that goes into build_analyzer up to and including tokenization).
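
The two points above can be illustrated with a minimal, standard-library-only sketch. It is not scikit-learn code: preprocess and make_memoized_tokenizer are hypothetical names, and the lowercase and token_pattern arguments only mimic the corresponding CountVectorizer constructor parameters. The cache lives in memory (functools.lru_cache) and is created per parameter set, so a result computed under one configuration is never returned under another.

```python
import re
from functools import lru_cache


def preprocess(doc, lowercase=True):
    # Hypothetical stand-in for the vectorizer's preprocessing step.
    return doc.lower() if lowercase else doc


def make_memoized_tokenizer(lowercase=True,
                            token_pattern=r"(?u)\b\w\w+\b",
                            maxsize=100_000):
    """Return a function s -> tokenize(preprocess(s)) with an in-memory cache.

    The cache belongs to this closure, so it is implicitly conditioned on
    the parameters passed here (assumed to be the only relevant ones).
    """
    pattern = re.compile(token_pattern)

    @lru_cache(maxsize=maxsize)
    def _cached(doc):
        # Store an immutable tuple so cached results cannot be modified.
        return tuple(pattern.findall(preprocess(doc, lowercase)))

    def tokenize(doc):
        # Hand out a fresh list so callers may mutate the result safely.
        return list(_cached(doc))

    tokenize.cache_info = _cached.cache_info  # expose hit/miss statistics
    return tokenize


tokenize = make_memoized_tokenizer()
tokens = tokenize("Hello big World")   # computed and stored
tokens_again = tokenize("Hello big World")  # served from the cache
```

A callable like this could in principle be passed as the analyzer argument of CountVectorizer, but note that with a callable analyzer the vectorizer no longer applies its own n-gram extraction or stop-word filtering, so this is only a partial substitute for caching inside build_analyzer.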

Issue Analytics

State:
Created 5 years ago
Reactions: 1
Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction

jnothman commented, Oct 31, 2019

I think you could work on it @mina1987.

One tricky decision is how to handle the cache if the vectorizer is cloned.

0 reactions

jnothman commented, Nov 4, 2019

I’m not sure what you mean. A pull request or code snippet would make your proposal more concrete.
