
Memoize tokenisation in CountVectorizer

See original GitHub issue

IIRC, previous attempts to improve efficiency in CountVectorizer have found that tokenization is the primary bottleneck. In my private extensions, memoizing the mapping (s -> tokenize(preprocess(s))) can improve runtime greatly, especially when the vectorizer is in a cross-validation pipeline (which ColumnTransformer etc. helps encourage).

Challenges to memoization:

  • it might make sense to use in-memory caching rather than on-disk, given the relatively small blobs being cached.
  • it has to be conditioned on all relevant constructor parameters to the CountVectorizer (i.e. anything that goes into build_analyzer up to and including tokenization)
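
A minimal sketch of the idea, assuming a CountVectorizer subclass and a process-wide dict as the in-memory cache (the MemoizedCountVectorizer, _TOKEN_CACHE and _analysis_key names are illustrative, not scikit-learn API): the analyzer returned by build_analyzer is wrapped so that tokenization results are looked up by (relevant parameters, document).

    from sklearn.feature_extraction.text import CountVectorizer

    # Process-wide cache: (parameter key, document) -> token list. Shared across
    # instances, so repeated fits (e.g. every cross-validation fold) re-use the
    # tokenization of documents they have in common. Unbounded for simplicity;
    # a real implementation would want an eviction policy.
    _TOKEN_CACHE = {}


    class MemoizedCountVectorizer(CountVectorizer):
        """Sketch only: memoizes tokenize(preprocess(s)) per document."""

        def _analysis_key(self):
            # Everything that influences build_analyzer() up to and including
            # tokenization must be part of the key (second bullet above).
            stop_words = self.stop_words
            if stop_words is not None and not isinstance(stop_words, str):
                stop_words = frozenset(stop_words)
            return (self.strip_accents, self.lowercase, self.preprocessor,
                    self.tokenizer, stop_words, self.token_pattern,
                    self.ngram_range, self.analyzer)

        def build_analyzer(self):
            analyzer = super().build_analyzer()
            key = self._analysis_key()

            def cached_analyzer(doc):
                # Assumes documents are hashable (plain str input).
                cache_key = (key, doc)
                tokens = _TOKEN_CACHE.get(cache_key)
                if tokens is None:
                    tokens = analyzer(doc)
                    _TOKEN_CACHE[cache_key] = tokens
                return tokens

            return cached_analyzer

Keying the cache on the parameters rather than on the instance is what lets it survive cloning during cross-validation, which is where the question in the comments below comes from.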

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Oct 31, 2019

I think you could work on it @mina1987.

One tricky decision is how to handle the cache if the vectorizer is cloned

0 reactions
jnothman commented, Nov 4, 2019

I’m not sure what you mean. A pull request or code snippet would make your proposal more concrete
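
On the cloning question raised above: sklearn.base.clone reconstructs an estimator from get_params() alone, so a cache stored as a plain instance attribute is dropped on the clone, while a module-level cache keyed on the parameters (as in the sketch under the issue description) carries over. A quick illustration, with a hypothetical _token_cache attribute standing in for a per-instance cache:

    from sklearn.base import clone
    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(lowercase=False)
    vec._token_cache = {"some doc": ["some", "doc"]}  # hypothetical per-instance cache

    vec2 = clone(vec)
    print(vec2.get_params()["lowercase"])  # False: constructor params are copied
    print(hasattr(vec2, "_token_cache"))   # False: non-parameter state is not carried over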

Read more comments on GitHub >

Top Results From Across the Web

10+ Examples for Using CountVectorizer - Kavita Ganesan, PhD
Learn how to correctly use Scikit-learn's CountVectorizer. Custom tokenization, custom preprocessing, working with n-grams, word counts and more.
CountVectorizer takes too long to fit_transform - Stack Overflow
I am using stop words from nltk.corpus. Note: tokenize function works fine, using separately for any text input.
Counting words in Python with scikit-learn's CountVectorizer
Using CountVectorizer to count words in multiple documents. ... It converts a collection of text documents to a matrix of token counts.
rasa.nlu.featurizers.sparse_featurizer.count_vectors_featurizer
Creates a sequence of token counts features based on sklearn's CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 …
Text Analysis (NLP), Classification - Antonino Furnari
To perform tokenization, we will use the CountVectorizer object. ... In this case, it is used to provide (and memorize) the training set…
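
For context, the cross-validation setting mentioned in the issue description looks roughly like this (illustrative toy data, standard scikit-learn pipeline): each of the five folds re-tokenizes the training documents it shares with the other folds, which is exactly the repeated work a tokenization cache would avoid.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    docs = ["the cat sat on the mat", "the dog ate my homework"] * 50
    labels = [0, 1] * 50

    # A plain CountVectorizer in a cross-validated pipeline: every fold calls
    # fit_transform/transform again, so the same documents are tokenized
    # several times over.
    pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
    print(cross_val_score(pipe, docs, labels, cv=5).mean())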
