TfidfTransformer: idea to avoid unnecessary copies of CSR matrices
Describe the workflow you want to enable
`TfidfTransformer.transform(X, copy=False)` shouldn't make copies of `X`, but it does. By "shouldn't" I mean I didn't expect it to happen.
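To make the report concrete, a minimal way to observe the behaviour could look like this (a sketch; the shared-memory check is just one way to detect the copy):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

X = sp.random(1000, 500, density=0.01, format="csr")
tfidf = TfidfTransformer().fit(X)

data_before = X.data
Xt = tfidf.transform(X, copy=False)

# If transform really worked in place, Xt.data would share memory with the
# original buffer; as reported in this issue, it does not.
print(np.shares_memory(Xt.data, data_before))
```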
Describe your proposed solution
If both `X` and `self._idf_diag` are CSR matrices, one can do:

```python
# X.indices[i] is the column index of X.data[i] in CSR format, and
# self._idf_diag.data[j] holds the idf weight of feature j, so this scales
# each stored value by the idf of its column, in place.
for i in range(X.data.shape[0]):
    X.data[i] *= self._idf_diag.data[X.indices[i]]
X.eliminate_zeros()
```

…that is, if I'm not mistaken.
This can of course be implemented efficiently in Cython (there might even be a specialized function for it already?).

I will gladly submit a PR for this (currently I'm hacking around it) if somebody experienced in sklearn can guide me on where to put the code. I've never contributed to this project but would love to.
Describe alternatives you’ve considered, if relevant
Additional context
I’m working with big sparse matrices and this would help a lot.
See the comment at https://github.com/scikit-learn/scikit-learn/blob/05ce8141bc71ad21e55be4d1b3f6609f65e91e49/sklearn/feature_extraction/text.py#L1499
Line 1500 creates a copy.
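For context, the code around that line scales columns by multiplying with a sparse diagonal matrix, roughly like this (a sketch, not the verbatim scikit-learn source):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(4, 3, density=0.5, format="csr")
idf = np.array([1.0, 2.0, 3.0])

# The product with a diagonal matrix scales column j by idf[j], but it always
# allocates a fresh CSR matrix, so a caller's copy=False cannot be honored.
idf_diag = sp.diags(idf, offsets=0, format="csr")
X = X * idf_diag
```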
EDIT: the `eliminate_zeros` call is probably not even required, since the idf weights are strictly positive and multiplying by them cannot create new zeros.
In the end the issue is that scipy.sparse doesn't have broadcasting support (https://github.com/scipy/scipy/issues/2128). What we want is the dense-style elementwise product `X * idf`, where `idf` is a 1D array of per-feature weights. Because broadcasting isn't supported, we have to fall back to that multiplication-by-a-sparse-diagonal-matrix hack.
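To see why the `*` operator doesn't help here (assuming spmatrix semantics):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 3, density=0.5, format="csr")
idf = np.array([1.0, 2.0, 3.0])

# Dense arrays broadcast: column j of the result is X[:, j] * idf[j].
expected = X.toarray() * idf

# For sparse matrices `*` is matrix multiplication, so X * idf is a
# matrix-vector product yielding a dense vector of shape (5,), not the
# (5, 3) elementwise result we want.
print((X * idf).shape)
```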
A more efficient workaround here, assuming X is in CSR format, would be the following.
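Presumably this is the vectorized, in-place form of the loop from the issue description, along these lines (a sketch; the exact snippet isn't shown above):

```python
# In CSR format, X.indices[k] is the column index of X.data[k], so this scales
# every stored value by the idf of its column, in place, without building any
# intermediate matrix.
X.data *= idf[X.indices]
```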
This would both use less memory (when `copy=False`) and be much faster (about 4x faster in the benchmarks I did).

I prototyped a generic implementation in safe_multiply_broadcast, which would be equivalent to `np.multiply` but with some broadcasting support for scipy.sparse (we have the same issue in other parts of the code base, e.g. in Ridge), though I'm not really sure what to do with it. It's still a bit complex, so I'm not sure we would want to maintain it in `sklearn.utils.extmath` (and handling all edge cases would take more work). Meanwhile `scipy.sparse` is kind of reaching end of life, so I'm not sure about trying to contribute it there either (in any case I don't have the availability to do that). So maybe we could already make a PR with the above code snippet to improve `TfidfTransformer`, if you are interested @thebabush?
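For illustration, the CSR-times-row-vector case of such a helper could look like this (a hypothetical sketch, not the actual safe_multiply_broadcast prototype):

```python
import numpy as np
import scipy.sparse as sp

def multiply_broadcast(X, w):
    """Hypothetical sketch: np.multiply(X, w) with row-vector broadcasting.

    Scales column j of the CSR matrix X by w[j] in place, matching the dense
    behaviour of X * w for a 1D w of length X.shape[1].
    """
    if not sp.isspmatrix_csr(X):
        raise TypeError("X must be a CSR matrix")
    w = np.asarray(w)
    if w.shape != (X.shape[1],):
        raise ValueError("w must be 1D with length n_features")
    X.data *= w[X.indices]
    return X
```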
If this is a good speedup, then don't worry about it too much. I think @thebabush can implement an initial PR which avoids the matrix product. Then additional PRs, or commits in the same PR, can be used to move the copy parameter and to limit the overall memory cost.