
TfidfTransformer: idea to avoid unnecessary copy of csr matrices


Describe the workflow you want to enable

TfidfTransformer.transform(X, copy=False) shouldn’t make copies of X, but it does. By “shouldn’t” I mean I didn’t expect it to happen.
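
A quick way to see this (a minimal sketch, assuming default TfidfTransformer parameters and a toy CSR matrix standing in for real counts):

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Toy "count" matrix; any CSR matrix works for demonstrating the copy.
X = sp.random(3, 4, density=0.5, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)
Xt = transformer.transform(X, copy=False)
print(Xt is X)  # False: a new matrix was allocated despite copy=False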

Describe your proposed solution

If both X and self._idf_diag are CSR matrices, one can do:

# Scale each stored nonzero of X in place by the idf weight of its column;
# for a CSR diagonal matrix, self._idf_diag.data[j] is the j-th diagonal entry.
for i in range(X.data.shape[0]):
    X.data[i] *= self._idf_diag.data[X.indices[i]]
X.eliminate_zeros()

…that is, if I’m not mistaken.

This can of course be implemented efficiently in Cython (there might even be a specialized function for it already?).

I will gladly submit a PR for this (currently I’m hacking around it) if somebody experienced in sklearn can guide me on where to put the code. I’ve never contributed to this project but would love to.

Describe alternatives you’ve considered, if relevant

Additional context

I’m working with big sparse matrices and this would help a lot.

See the comment at https://github.com/scikit-learn/scikit-learn/blob/05ce8141bc71ad21e55be4d1b3f6609f65e91e49/sklearn/feature_extraction/text.py#L1499

Line 1500 creates a copy.
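
For context, the operation on that line is (paraphrasing the linked source) a product with the sparse diagonal idf matrix, which allocates a fresh result no matter what copy is set to:

# Paraphrased from TfidfTransformer.transform at the linked revision:
X = X * self._idf_diag  # sparse matrix product; always builds a new CSR matrix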

EDIT: the eliminate_zeros call is probably not even required, I guess, since the idf weights are strictly positive (at least 1 with the default formulation), so the multiplication can’t introduce new zeros.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
rth commented, Nov 15, 2020

In the end the issue is that scipy sparse doesn’t have broadcasting support (https://github.com/scipy/scipy/issues/2128). What we want to do is,

X = X * idf[None, :]

where idf is a 1D array. Because broadcasting isn’t supported, we have to fall back to that multiplication-by-a-sparse-diagonal-matrix hack.
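
For illustration, the workaround looks roughly like this on toy data (a sketch; the names X and idf are assumptions):

import numpy as np
import scipy.sparse as sp

X = sp.random(5, 10, density=0.3, format="csr", random_state=0)  # toy counts
idf = np.ones(10) * 1.5                                          # toy idf weights

# No broadcasting for sparse X, so columns are scaled via a product with
# a sparse diagonal matrix, which allocates an entirely new CSR matrix:
X_scaled = X * sp.diags(idf)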

A more efficient workaround here, assuming X is in CSR format, would be,

if copy:
    X = X.copy()
# In-place: multiply each stored nonzero by the idf of its column.
X.data *= idf[X.indices]

This would both use less memory (when copy=False) and be much faster (4x faster in the benchmarks I did).
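
A self-contained sketch of that workaround on toy data, checked against the diagonal-matrix product (the sizes and names here are illustrative):

import numpy as np
import scipy.sparse as sp

X = sp.random(1000, 500, density=0.05, format="csr", random_state=0)
idf = np.random.default_rng(0).random(500) + 1.0

expected = (X * sp.diags(idf)).toarray()

# In-place column scaling: every stored value is multiplied by the idf
# of its column, looked up through the CSR indices array.
X.data *= idf[X.indices]

assert np.allclose(X.toarray(), expected)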

I prototyped a generic implementation in safe_multiply_broadcast, which would be equivalent to np.multiply but with some broadcasting support for scipy.sparse (we have the same issue in other parts of the code base, e.g. in Ridge), though I’m not really sure what to do with it. It’s still a bit complex, so I’m not sure we would want to maintain it in sklearn.utils.extmath (and handling all edge cases would take more work). And since scipy.sparse is kind of reaching end of life, I’m not sure about trying to contribute it there (in any case I don’t have the availability to do that).
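
Not rth’s actual prototype, but a minimal sketch of what such a helper could look like for the common row-broadcast case (the name multiply_row_broadcast and its exact behavior are assumptions):

import numpy as np
import scipy.sparse as sp

def multiply_row_broadcast(X, v, copy=True):
    # Hypothetical helper: np.multiply(X, v) with the 1D array v broadcast
    # over rows, keeping X sparse when it is sparse.
    v = np.asarray(v).ravel()
    if sp.issparse(X):
        X = X.tocsr(copy=copy)
        X.data *= v[X.indices]  # scale stored nonzeros by their column's weight
        return X
    return X * v                # dense case: NumPy broadcasting just works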

So maybe we could already make a PR with the above code snippet to improve TfidfTransformer, if you are interested @thebabush?

0 reactions
jnothman commented, Nov 16, 2020

If this is a good speedup, then don’t worry about it too much. I think @thebabush can implement an initial PR which avoids the matrix product. Then additional PRs, or commits in the same one, can be used to move the copy parameter and to limit overall memory cost.

On Mon, 16 Nov 2020 at 17:53, Roman Yurchak notifications@github.com wrote:

should I go ahead and change everywhere _idf_diag is used to _idf as a simple ndarray?

We should do that as well. It’s a private attribute. And indeed, move the copy definition to __init__.

I see that your workaround still requires a copy of idf due to indexing

Is it still the case with,

np.take(idf, X.indices, out=X.data)

? Otherwise, yes, we can use batching.
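
For reference, batching here could look something like this (an illustrative sketch; the batch size is arbitrary), which bounds the size of the fancy-indexing temporary:

# Process the stored values in fixed-size chunks so the temporary
# idf[X.indices[a:b]] never exceeds `batch` elements.
batch = 2 ** 20
for a in range(0, X.data.shape[0], batch):
    b = a + batch
    X.data[a:b] *= idf[X.indices[a:b]]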

