Plea to not use datasets with duplicate samples in SVC tests
Description
Support vector classification solves a quadratic programming problem; see sklearn's user guide section on SVC.
The quadratic form Q[i, j] is constructed from kernel function values at pairs (i, j) of input samples.
Duplicate samples therefore cause the quadratic form to have zero eigenvalues, and in that case the solution of the quadratic problem is not unique.
It is important for SVM code to handle such cases, and libSVM resorts to a regularization parameter tau (see src/libsvm/svm.cpp#L89): any eigenvalue smaller than tau is replaced with tau, which resolves the non-uniqueness in a way that is fragile under machine arithmetic.
A change in implementation may therefore result in a different set of support vectors being found.
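To make the point concrete, here is a minimal sketch (not taken from the original report) showing that duplicating a sample makes the kernel matrix rank-deficient:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
A = rng.rand(5, 4)
A_dup = np.vstack([A, A[:1]])        # repeat the first sample
K = rbf_kernel(A_dup, gamma=0.5)     # two identical rows -> singular kernel matrix
print(np.linalg.eigvalsh(K).min())   # ~0 up to round-off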
Inputs with duplicates are pervasive in sklearn's test suite. Duplicates arise from the use of grouped data, where members of a group are represented by the same value. In the iris dataset, petal lengths and widths are recorded with only one decimal place of precision, which creates duplicates:
In [17]: iris.data[[101, 142]]
Out[17]:
array([[5.8, 2.7, 5.1, 1.9],
       [5.8, 2.7, 5.1, 1.9]])
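One quick way to see how pervasive this is (a sketch, not part of the original report) is to count exact duplicate rows with np.unique:

import numpy as np
from sklearn.datasets import load_iris

data = load_iris().data
n_unique = np.unique(data, axis=0).shape[0]
print(data.shape[0] - n_unique)   # number of rows that duplicate another row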
Steps/Code to Reproduce
Consider a subset of iris with 0/1 targets, where we introduce duplicates on purpose:
import numpy as np
import scipy.sparse as sparse
from sklearn.datasets import load_iris

# indices 18 and 81 each appear twice, introducing duplicate samples
perm = np.array([18, 8, 18, 93, 77, 81, 81])
X = load_iris().data[perm]
y = load_iris().target[perm]
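As a sanity check (illustrative, not in the original report), positions 0 and 2 both pick sample 18 and positions 5 and 6 both pick sample 81, so X contains two pairs of identical rows:

assert np.array_equal(X[0], X[2])
assert np.array_equal(X[5], X[6])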
Define a function that evaluates the quadratic programming objective at a fitted classifier's solution:
import sklearn.metrics.pairwise
import sklearn.svm

def compute_obj_func(clf, X, y):
    # map the 0/1 targets to -1/+1 labels
    y_c = y.copy()
    y_c[y == 0] = -1
    Kmat = sklearn.metrics.pairwise.rbf_kernel(clf.support_vectors_, gamma=clf._gamma)
    # dual_coef_ stores alpha * y for the support vectors
    dc = clf.dual_coef_
    if sparse.issparse(dc):
        dc = dc.toarray()
    ys = y_c[clf.support_]
    # box constraints: 0 <= alpha_i <= C
    assert np.all(dc * ys >= 0)
    assert np.all(dc * ys <= clf.C), "{}, {}".format(dc * ys, clf.C)
    # equality constraint: sum of alpha_i * y_i is zero
    assert np.allclose(np.dot(dc, np.ones_like(clf.support_)), 0)
    aQa = np.dot(np.dot(dc, Kmat), dc.T)
    eTa = np.dot(dc, ys)
    return 0.5 * aQa - eTa
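For reference, this evaluates the standard SVC dual objective 0.5 * alpha^T Q alpha - e^T alpha, where Q[i, j] = y_i y_j K(x_i, x_j). Since dual_coef_ stores alpha_i * y_i, the products np.dot(np.dot(dc, Kmat), dc.T) and np.dot(dc, ys) equal alpha^T Q alpha and e^T alpha respectively, and the asserts check the box constraint 0 <= alpha_i <= C and the equality constraint sum(alpha_i * y_i) = 0.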
Now run a comparison of the dense and sparse fits:
idx = np.arange(len(y))
pos_all = np.arange(len(y))
rng = np.random.RandomState(1234)
kw_args = {'gamma': 0.12682018203120274, 'tol': 1e-10, 'decision_function_shape': 'ovo'}
for i, p in enumerate([pos_all]):
    ii = idx[p].copy()
    XX = X[p].copy()
    yy = y[p].copy()
    # fit the same SVC on sparse and on dense inputs
    clf_sp = sklearn.svm.SVC(**kw_args).fit(sparse.csr_matrix(XX), yy)
    clf = sklearn.svm.SVC(**kw_args).fit(XX, yy)
    if not (len(clf.support_) == len(clf_sp.support_) and
            np.allclose(clf.support_, clf_sp.support_)):
        print("failed for index {}. Supports: {} and {}, objective functions: {}".format(
            i, ii[clf.support_], ii[clf_sp.support_],
            (compute_obj_func(clf, XX, yy).tolist(),
             compute_obj_func(clf_sp, XX, yy).tolist())))
When run with the daal4py patches, the dense result is computed by Intel DAAL, while the sparse result is computed by libSVM. DAAL finds [0 1 2 3 4 6] as support vectors, whereas libSVM gives [0 1 2 3 4 5 6], with objective function values of ([[-2.249185019516638]], [[-2.2491850195166347]]) respectively.
As you can see, the objective function value at DAAL's solution is a few machine epsilons (14 * np.finfo(np.double).eps, to be exact) lower than at libSVM's solution, so the two solutions are essentially equivalent.
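A rough back-of-the-envelope check of that gap (illustrative only; the printed values above are rounded, so the ratio will not be exactly 14):

obj_daal = -2.249185019516638
obj_libsvm = -2.2491850195166347
print((obj_libsvm - obj_daal) / np.finfo(np.double).eps)   # close to the quoted 14 eps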
Expected Results
I hope to have convinced the community either not to use inputs with duplicate samples in SVC tests, or to modify such tests to compare the values of the objective function at the solutions, or the results of predict.
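A hedged sketch of what such a comparison could look like, reusing clf, clf_sp, XX, yy and compute_obj_func from the reproducer above (this is not an existing sklearn test):

pred_dense = clf.predict(XX)
pred_sparse = clf_sp.predict(sparse.csr_matrix(XX))
assert np.array_equal(pred_dense, pred_sparse)

obj_dense = compute_obj_func(clf, XX, yy)
obj_sparse = compute_obj_func(clf_sp, XX, yy)
assert np.allclose(obj_dense, obj_sparse)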
Versions
I used scikit-learn 0.20.1 with NumPy 1.15.4 on a 64-bit Linux machine. I also used Intel DAAL 2019.1.
Top GitHub Comments
How about we do both? We test predict / decision_function on iris and check support vectors on something without duplicate data points? That way we test both things?
It’s an implementation detail, but one that makes it easier to write the tests 😉. We could also make sure the sets of support vectors are equal, if you prefer that. We don’t have an “implementation independent” version of the dual coefficients, though, right?
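A sketch of the duplicate-free option mentioned above (illustrative only; make_classification produces continuous features, so repeated rows are essentially impossible):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
assert np.unique(X, axis=0).shape[0] == X.shape[0]   # no duplicate rows

clf_dense = SVC(gamma='scale').fit(X, y)
# ... fit the alternative implementation on the same data and compare its
# support_ indices against clf_dense.support_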