MultiLabelBinarizer constructs pathological CSR matrices
Description
MultiLabelBinarizer can be made to construct pathological CSR matrices.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
test_data = np.array([
['a', 'c'],
['b', 'a', 'c']
], dtype=object)  # rows differ in length, so recent NumPy needs an explicit object dtype
mlb = MultiLabelBinarizer(classes=list('hhhhhaaafffgggeeeeeeddddddcccccccccccbb'), sparse_output=True)
result = mlb.fit_transform(test_data)
result.tocoo()
Expected Results
Result is converted to COO matrix.
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-121-81b248265d3a> in <module>()
6 mlb = MultiLabelBinarizer(classes=list('hhhhhaaafffgggeeeeeeddddddcccccccccccbb'), sparse_output=True)
7
----> 8 result = mlb.fit_transform(test_data).tocoo()
~\Projects\derived-rating-attributes\conda-env\lib\site-packages\scipy\sparse\compressed.py in tocoo(self, copy)
938 from .coo import coo_matrix
939 return coo_matrix((self.data, (row, col)), self.shape, copy=copy,
--> 940 dtype=self.dtype)
941
942 tocoo.__doc__ = spmatrix.tocoo.__doc__
~\Projects\derived-rating-attributes\conda-env\lib\site-packages\scipy\sparse\coo.py in __init__(self, arg1, shape, dtype, copy)
190 self.data = self.data.astype(dtype, copy=False)
191
--> 192 self._check()
193
194 def reshape(self, *args, **kwargs):
~\Projects\derived-rating-attributes\conda-env\lib\site-packages\scipy\sparse\coo.py in _check(self)
272 raise ValueError('row index exceeds matrix dimensions')
273 if self.col.max() >= self.shape[1]:
--> 274 raise ValueError('column index exceeds matrix dimensions')
275 if self.row.min() < 0:
276 raise ValueError('negative row index found')
ValueError: column index exceeds matrix dimensions
The problem is evident when you look at the indices:
result.indices
# array([36, 7, 36, 38, 7], dtype=int32)
result.shape
# (2, 8)
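The mismatch between the indices and the shape can be reproduced without scikit-learn. The sketch below is a simplified model and not the library's actual code: it assumes the label-to-column mapping is built as a plain dict from each label to its position in `classes`, and that the column count is taken from the size of that dict. The names `class_to_index`, `n_columns` and `indices` are illustrative only.

```python
# Simplified model of the label-to-column mapping (an assumption about
# the internals, not the library's actual code).
classes = list('hhhhhaaafffgggeeeeeeddddddcccccccccccbb')
samples = [['a', 'c'], ['b', 'a', 'c']]

# With duplicates present, each label keeps the position of its LAST
# occurrence in the list, e.g. 'c' -> 36 and 'b' -> 38.
class_to_index = dict(zip(classes, range(len(classes))))

# The column count is the number of distinct labels...
n_columns = len(class_to_index)

# ...while the column indices are positions in the full 39-element list.
indices = [sorted(class_to_index[label] for label in set(row)) for row in samples]

print(n_columns)  # 8
print(indices)    # [[7, 36], [7, 36, 38]]
```

Under these assumptions every index except 7 lies outside the 8 available columns, which is consistent with the `column index exceeds matrix dimensions` error raised by `tocoo()`.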
Versions
Windows-7-6.1.7601-SP1
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.15.1
SciPy 1.1.0
Scikit-Learn 0.19.1
Top GitHub Comments
@rth There’s already a PR at #12195 and I’ve proposed the same suggestion there. With your +1, let’s raise an error unless someone provides a use case.
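As a rough illustration of the "raise an error" option, a duplicate check could look like the sketch below. `check_unique_classes` is a hypothetical helper written for this example; it is not the code from the PR mentioned above, and the error message is invented.

```python
def check_unique_classes(classes):
    """Hypothetical helper: reject a classes list that contains duplicates."""
    classes = list(classes)
    seen, duplicates = set(), []
    for label in classes:
        # Record each repeated label once, in order of first repetition.
        if label in seen and label not in duplicates:
            duplicates.append(label)
        seen.add(label)
    if duplicates:
        raise ValueError(
            "classes contains duplicate labels: %r" % (duplicates,)
        )
    return classes


try:
    check_unique_classes('hhhhhaaafffgggeeeeeeddddddcccccccccccbb')
except ValueError as exc:
    print(exc)
```

Failing early like this leaves the decision about how to deduplicate with the caller, rather than silently reordering or dropping labels.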
The problem (I think) is in the fit() function. If the user specifies a classes argument, uniqueness is not imposed, which presumably causes the problems. I modified that code to pass the classes through a set, and the error no longer occurs.
Of course, this is not a real solution, since a set will not maintain the order, which is the whole purpose of the classes argument. I can take this issue if that’s okay and hopefully find an efficient way to do this while preserving order.
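One possible way to deduplicate while preserving order, shown here only as a sketch and not as the fix that was eventually adopted, is to rely on dict insertion order. `unique_in_order` is a hypothetical helper; it assumes hashable labels and Python 3.7 or later, where dict ordering is guaranteed by the language.

```python
def unique_in_order(classes):
    """Hypothetical helper: drop repeated labels, keeping first-seen order."""
    # dict.fromkeys keeps one key per label, in insertion order.
    return list(dict.fromkeys(classes))


classes = list('hhhhhaaafffgggeeeeeeddddddcccccccccccbb')
unique_classes = unique_in_order(classes)
print(unique_classes)  # ['h', 'a', 'f', 'g', 'e', 'd', 'c', 'b']

# Every label now maps to a column index below the number of columns.
class_to_index = {label: i for i, label in enumerate(unique_classes)}
print(max(class_to_index.values()) < len(unique_classes))  # True
```

Passing the deduplicated list as `classes` is also a workaround a caller can apply before constructing the binarizer.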