
TfidfTransformer: idea to avoid unnecessary copy of csr matrices


Describe the workflow you want to enable

TfidfTransformer.transform(X, copy=False) shouldn’t make copies of X, but it does. By “shouldn’t” I mean I didn’t expect it to happen.
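
A quick way to see this (a minimal sketch, assuming default TfidfTransformer parameters and a toy CSR matrix standing in for real counts):

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Toy "count" matrix; any CSR matrix works for demonstrating the copy.
X = sp.random(3, 4, density=0.5, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)
Xt = transformer.transform(X, copy=False)
print(Xt is X)  # False: a new matrix was allocated despite copy=False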

Describe your proposed solution

If both X and self._idf_diag are CSR matrices, one can do:

# Scale each stored nonzero of X in place by the idf weight of its column;
# for a CSR diagonal matrix, self._idf_diag.data[j] is the j-th diagonal entry.
for i in range(X.data.shape[0]):
    X.data[i] *= self._idf_diag.data[X.indices[i]]
X.eliminate_zeros()

…that is, if I’m not mistaken.

This can of course be implemented efficiently in Cython (there might even be a specialized function for it already?).

I will gladly submit a PR for this (currently I’m hacking around it) if somebody experienced in sklearn can guide me on where to put the code. I’ve never contributed to this project but would love to.

Describe alternatives you’ve considered, if relevant

Additional context

I’m working with big sparse matrices and this would help a lot.

See the comment at https://github.com/scikit-learn/scikit-learn/blob/05ce8141bc71ad21e55be4d1b3f6609f65e91e49/sklearn/feature_extraction/text.py#L1499

Line 1500 creates a copy.
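
For context, the operation on that line is (paraphrasing the linked source) a product with the sparse diagonal idf matrix, which allocates a fresh result no matter what copy is set to:

# Paraphrased from TfidfTransformer.transform at the linked revision:
X = X * self._idf_diag  # sparse matrix product; always builds a new CSR matrix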

EDIT: the eliminate_zeros call is probably not even required, I guess, since the idf weights are strictly positive (at least 1 with the default formulation), so the multiplication can’t introduce new zeros.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
rth commented, Nov 15, 2020

In the end the issue is that scipy sparse doesn’t have broadcasting support (https://github.com/scipy/scipy/issues/2128). What we want to do is,

X = X * idf[None, :]

where idf is a 1D array. Because broadcasting isn’t supported, we have to fall back to that multiplication-by-a-sparse-diagonal-matrix hack.
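
For illustration, the workaround looks roughly like this on toy data (a sketch; the names X and idf are assumptions):

import numpy as np
import scipy.sparse as sp

X = sp.random(5, 10, density=0.3, format="csr", random_state=0)  # toy counts
idf = np.ones(10) * 1.5                                          # toy idf weights

# No broadcasting for sparse X, so columns are scaled via a product with
# a sparse diagonal matrix, which allocates an entirely new CSR matrix:
X_scaled = X * sp.diags(idf)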

A more efficient workaround here, assuming X is in CSR format, would be,

if copy:
    X = X.copy()
# In-place: multiply each stored nonzero by the idf of its column.
X.data *= idf[X.indices]

This would both use less memory (when copy=False) and be much faster (4x faster in the benchmarks I did).
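
A self-contained sketch of that workaround on toy data, checked against the diagonal-matrix product (the sizes and names here are illustrative):

import numpy as np
import scipy.sparse as sp

X = sp.random(1000, 500, density=0.05, format="csr", random_state=0)
idf = np.random.default_rng(0).random(500) + 1.0

expected = (X * sp.diags(idf)).toarray()

# In-place column scaling: every stored value is multiplied by the idf
# of its column, looked up through the CSR indices array.
X.data *= idf[X.indices]

assert np.allclose(X.toarray(), expected)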

I prototyped a generic implementation in safe_multiply_broadcast, which would be equivalent to np.multiply but with some broadcasting support for scipy.sparse (we have the same issue in other parts of the code base, e.g. in Ridge), though I’m not really sure what to do with it. It’s still a bit complex, so I’m not sure we would want to maintain it in sklearn.utils.extmath (and handling all edge cases would take more work). And since scipy.sparse is kind of reaching end of life, I’m not sure about trying to contribute it there (in any case I don’t have the availability to do that).
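
Not rth’s actual prototype, but a minimal sketch of what such a helper could look like for the common row-broadcast case (the name multiply_row_broadcast and its exact behavior are assumptions):

import numpy as np
import scipy.sparse as sp

def multiply_row_broadcast(X, v, copy=True):
    # Hypothetical helper: np.multiply(X, v) with the 1D array v broadcast
    # over rows, keeping X sparse when it is sparse.
    v = np.asarray(v).ravel()
    if sp.issparse(X):
        X = X.tocsr(copy=copy)
        X.data *= v[X.indices]  # scale stored nonzeros by their column's weight
        return X
    return X * v                # dense case: NumPy broadcasting just works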

So maybe we could already make a PR with the above code snippet to improve TfidfTransformer, if you are interested @thebabush?

0 reactions
jnothman commented, Nov 16, 2020

If this is a good speedup, then don’t worry about it too much. I think @thebabush can implement an initial PR which avoids the matrix product. Then additional PRs, or commits in the same one, can be used to move the copy parameter and to limit overall memory cost.

On Mon, 16 Nov 2020 at 17:53, Roman Yurchak notifications@github.com wrote:

should I go ahead and change everywhere _idf_diag is used to _idf as a simple ndarray?

We should do that as well. It’s a private attribute. And indeed, move the copy definition to __init__.

I see that your workaround still requires a copy of idf due to indexing

Is it still the case with,

np.take(idf, X.indices, out=X.data)

? Otherwise, yes, we can use batching.
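
For reference, batching here could look something like this (an illustrative sketch; the batch size is arbitrary), which bounds the size of the fancy-indexing temporary:

# Process the stored values in fixed-size chunks so the temporary
# idf[X.indices[a:b]] never exceeds `batch` elements.
batch = 2 ** 20
for a in range(0, X.data.shape[0], batch):
    b = a + batch
    X.data[a:b] *= idf[X.indices[a:b]]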

