Plea to not use datasets with duplicate samples in SVC tests
Description
Support vector classification solves a quadratic programming problem; see sklearn's user guide section on SVC.
The quadratic form Q[i, j] is constructed from kernel function values at pairs (i, j) of input samples.
Duplicate samples therefore cause the quadratic form to have zero eigenvalues, and in that case the solution of the quadratic problem is not unique.
It is important for SVM code to handle such cases, and libSVM resorts to a regularization parameter tau (see src/libsvm/svm.cpp#L89): any eigenvalue smaller than tau is replaced with tau, which resolves the non-uniqueness in a way that is fragile under machine arithmetic.
A change in implementation may therefore result in a different set of support vectors being found.
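To make the point concrete, here is a minimal sketch (not taken from the original report) showing that duplicating a sample makes the kernel matrix rank-deficient:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
A = rng.rand(5, 4)
A_dup = np.vstack([A, A[:1]])        # repeat the first sample
K = rbf_kernel(A_dup, gamma=0.5)     # two identical rows -> singular kernel matrix
print(np.linalg.eigvalsh(K).min())   # ~0 up to round-off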
Inputs with duplicates are pervasive in sklearn's test suite. Duplicates arise from the use of grouped data, where members of a group are represented by the same value. In the iris dataset, petal lengths and widths are recorded with only one decimal place of precision, which creates duplicates:
In [17]: iris.data[[101, 142]]
Out[17]:
array([[5.8, 2.7, 5.1, 1.9],
       [5.8, 2.7, 5.1, 1.9]])
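One quick way to see how pervasive this is (a sketch, not part of the original report) is to count exact duplicate rows with np.unique:

import numpy as np
from sklearn.datasets import load_iris

data = load_iris().data
n_unique = np.unique(data, axis=0).shape[0]
print(data.shape[0] - n_unique)   # number of rows that duplicate another row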
Steps/Code to Reproduce
Consider a subset of iris with 0/1 targets, where we introduce duplicates on purpose:
import numpy as np
import scipy.sparse as sparse
from sklearn.datasets import load_iris

# indices 18 and 81 each appear twice, introducing duplicate samples
perm = np.array([18, 8, 18, 93, 77, 81, 81])
X = load_iris().data[perm]
y = load_iris().target[perm]
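As a sanity check (illustrative, not in the original report), positions 0 and 2 both pick sample 18 and positions 5 and 6 both pick sample 81, so X contains two pairs of identical rows:

assert np.array_equal(X[0], X[2])
assert np.array_equal(X[5], X[6])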
Define a function that evaluates the quadratic programming objective at a fitted classifier's solution:
import sklearn.metrics.pairwise
import sklearn.svm

def compute_obj_func(clf, X, y):
    # map the 0/1 targets to -1/+1 labels
    y_c = y.copy()
    y_c[y == 0] = -1
    Kmat = sklearn.metrics.pairwise.rbf_kernel(clf.support_vectors_, gamma=clf._gamma)
    # dual_coef_ stores alpha * y for the support vectors
    dc = clf.dual_coef_
    if sparse.issparse(dc):
        dc = dc.toarray()
    ys = y_c[clf.support_]
    # box constraints: 0 <= alpha_i <= C
    assert np.all(dc * ys >= 0)
    assert np.all(dc * ys <= clf.C), "{}, {}".format(dc * ys, clf.C)
    # equality constraint: sum of alpha_i * y_i is zero
    assert np.allclose(np.dot(dc, np.ones_like(clf.support_)), 0)
    aQa = np.dot(np.dot(dc, Kmat), dc.T)
    eTa = np.dot(dc, ys)
    return 0.5 * aQa - eTa
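For reference, this evaluates the standard SVC dual objective 0.5 * alpha^T Q alpha - e^T alpha, where Q[i, j] = y_i y_j K(x_i, x_j). Since dual_coef_ stores alpha_i * y_i, the products np.dot(np.dot(dc, Kmat), dc.T) and np.dot(dc, ys) equal alpha^T Q alpha and e^T alpha respectively, and the asserts check the box constraint 0 <= alpha_i <= C and the equality constraint sum(alpha_i * y_i) = 0.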
Now run a comparison of the dense and sparse fits:
idx = np.arange(len(y))
pos_all = np.arange(len(y))
rng = np.random.RandomState(1234)
kw_args = {'gamma': 0.12682018203120274, 'tol': 1e-10, 'decision_function_shape': 'ovo'}
for i, p in enumerate([pos_all]):
    ii = idx[p].copy()
    XX = X[p].copy()
    yy = y[p].copy()
    # fit the same SVC on sparse and on dense inputs
    clf_sp = sklearn.svm.SVC(**kw_args).fit(sparse.csr_matrix(XX), yy)
    clf = sklearn.svm.SVC(**kw_args).fit(XX, yy)
    if not (len(clf.support_) == len(clf_sp.support_) and
            np.allclose(clf.support_, clf_sp.support_)):
        print("failed for index {}. Supports: {} and {}, objective functions: {}".format(
            i, ii[clf.support_], ii[clf_sp.support_],
            (compute_obj_func(clf, XX, yy).tolist(),
             compute_obj_func(clf_sp, XX, yy).tolist())))
When run with the daal4py patches, the dense result is computed by Intel DAAL, while the sparse result is computed by libSVM. DAAL finds [0 1 2 3 4 6] as support vectors, whereas libSVM gives [0 1 2 3 4 5 6], with objective function values of ([[-2.249185019516638]], [[-2.2491850195166347]]) respectively.
As you can see, the objective function value at DAAL's solution is a few machine epsilons (14 * np.finfo(np.double).eps, to be exact) lower than at libSVM's solution, so the two solutions are essentially equivalent.
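A rough back-of-the-envelope check of that gap (illustrative only; the printed values above are rounded, so the ratio will not be exactly 14):

obj_daal = -2.249185019516638
obj_libsvm = -2.2491850195166347
print((obj_libsvm - obj_daal) / np.finfo(np.double).eps)   # close to the quoted 14 eps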
Expected Results
I hope to have convinced the community either not to use inputs with duplicate samples in SVC tests, or to modify such tests to compare the values of the objective function at the solutions, or the results of predict.
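A hedged sketch of what such a comparison could look like, reusing clf, clf_sp, XX, yy and compute_obj_func from the reproducer above (this is not an existing sklearn test):

pred_dense = clf.predict(XX)
pred_sparse = clf_sp.predict(sparse.csr_matrix(XX))
assert np.array_equal(pred_dense, pred_sparse)

obj_dense = compute_obj_func(clf, XX, yy)
obj_sparse = compute_obj_func(clf_sp, XX, yy)
assert np.allclose(obj_dense, obj_sparse)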
Versions
I used scikit-learn 0.20.1 with NumPy 1.15.4 on a 64-bit Linux machine. I also used Intel DAAL 2019.1.
Top GitHub Comments
How about we do both? We test predict / decision_function on iris and check support vectors on something without duplicate data points? That way we test both things?
It’s an implementation detail, but one that makes it easier to write the tests 😉. We could also make sure the sets of support vectors are equal, if you prefer that. We don’t have an “implementation independent” version of the dual coefficients, though, right?
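A sketch of the duplicate-free option mentioned above (illustrative only; make_classification produces continuous features, so repeated rows are essentially impossible):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
assert np.unique(X, axis=0).shape[0] == X.shape[0]   # no duplicate rows

clf_dense = SVC(gamma='scale').fit(X, y)
# ... fit the alternative implementation on the same data and compare its
# support_ indices against clf_dense.support_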