
Plea to not use datasets with duplicate samples in SVC tests


Description

Support vector classification solves a quadratic programming problem; see sklearn’s user guide section on SVC.

The matrix Q of the quadratic form has entries Q[i, j] given by kernel function values at pairs (i, j) of input samples.

Duplicate samples therefore cause Q to have zero eigenvalues, in which case the solution of the quadratic problem is not unique.

It is important for SVM code to handle such cases, and libSVM resorts to a regularization parameter tau (see src/libsvm/svm.cpp#L89): any eigenvalue smaller than tau is replaced with tau. This resolves the non-uniqueness, but in a way that is fragile under machine arithmetic.
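The effect of duplicates on the kernel (Gram) matrix can be seen directly. This is a minimal illustration, not part of the original report; the data values are taken from the iris rows shown below:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two identical rows make the kernel (Gram) matrix rank-deficient.
X = np.array([[5.8, 2.7, 5.1, 1.9],
              [5.8, 2.7, 5.1, 1.9],   # exact duplicate of row 0
              [6.3, 3.3, 6.0, 2.5]])
K = rbf_kernel(X, gamma=0.1)

# Duplicate rows of X produce duplicate rows of K, so K is singular:
# its smallest eigenvalue is (numerically) zero.
eigvals = np.linalg.eigvalsh(K)
print(eigvals)
```

The smallest eigenvalue comes out at machine-noise level, which is exactly the situation tau is meant to regularize.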

A change in implementation may result in a different set of support vectors found.

Inputs with duplicates are pervasive in sklearn’s test suite. Duplicates arise from the use of grouped data, where members of a group are represented by the same value. In the iris dataset, petal lengths and widths are recorded to only a single decimal place, creating duplicates:

In [17]: iris.data[[101, 142]]
Out[17]:
array([[5.8, 2.7, 5.1, 1.9],
       [5.8, 2.7, 5.1, 1.9]])
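How common such duplicates are can be checked in one line. This is a quick verification I am adding for illustration, assuming a standard scikit-learn install:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Collapse exact duplicate rows; iris has 150 samples but fewer unique ones
# (e.g. rows 101 and 142 shown above are identical).
unique_rows = np.unique(X, axis=0)
print(len(X), len(unique_rows))
```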

Steps/Code to Reproduce

Consider a subset of iris restricted to targets 0 and 1, where we introduce a duplicate on purpose:

import numpy as np
import scipy.sparse as sparse
from sklearn.datasets import load_iris

# Index array with intentional repeats (18 and 81 each appear twice)
perm = np.array([18, 8, 18, 93, 77, 81, 81])
iris = load_iris()
X = iris.data[perm]
y = iris.target[perm]

Define a function to evaluate the quadratic programming objective:

import sklearn.metrics  # for sklearn.metrics.pairwise.rbf_kernel
import sklearn.svm

def compute_obj_func(clf, X, y):
    # Recode targets from {0, 1} to {-1, +1}
    y_c = y.copy()
    y_c[y == 0] = -1
    Kmat = sklearn.metrics.pairwise.rbf_kernel(clf.support_vectors_, gamma=clf._gamma)
    # dual_coef_ stores alpha_i * y_i for the support vectors
    dc = clf.dual_coef_
    if sparse.issparse(dc):
        dc = dc.toarray()
    ys = y_c[clf.support_]
    # Box constraints: alpha_i = dual_coef_ * y_i must satisfy 0 <= alpha_i <= C
    assert np.all(dc * ys >= 0)
    assert np.all(dc * ys <= clf.C), "{}, {}".format(dc * ys, clf.C)
    # Equality constraint: sum_i alpha_i * y_i == 0
    assert np.allclose(np.dot(dc, np.ones_like(clf.support_)), 0)
    aQa = np.dot(np.dot(dc, Kmat), dc.T)
    eTa = np.dot(dc, ys)
    return 0.5 * aQa - eTa

Now run:

idx = np.arange(len(y))
pos_all = np.arange(len(y))

kw_args = {'gamma': 0.12682018203120274, 'tol': 1e-10,
           'decision_function_shape': 'ovo'}

for i, p in enumerate([pos_all]):
    ii = idx[p].copy()
    XX = X[p].copy()
    yy = y[p].copy()

    # Fit on sparse and dense representations of the same data
    clf_sp = sklearn.svm.SVC(**kw_args).fit(sparse.csr_matrix(XX), yy)
    clf = sklearn.svm.SVC(**kw_args).fit(XX, yy)

    if not (len(clf.support_) == len(clf_sp.support_) and
            np.allclose(clf.support_, clf_sp.support_)):
        print("failed for index {}. Supports: {} and {}, objective functions: {}".format(
            i, ii[clf.support_], ii[clf_sp.support_],
            (compute_obj_func(clf, XX, yy).tolist(),
             compute_obj_func(clf_sp, XX, yy).tolist())))

When run with the daal4py patches, the dense result is computed by Intel DAAL, while the sparse result is computed by libSVM.

DAAL finds [0 1 2 3 4 6] as support vectors, while libSVM gives [0 1 2 3 4 5 6], with objective functions ([[-2.249185019516638]], [[-2.2491850195166347]]) respectively.

As you can see, the objective function at DAAL’s solution is several machine epsilons (14*np.finfo(np.double).eps, to be exact) lower than at libSVM’s solution, so the two solutions are essentially equivalent.
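The size of that gap can be sanity-checked directly from the reported values (a quick check on the numbers above, not part of the original report):

```python
import numpy as np

obj_daal = -2.249185019516638     # objective at DAAL's solution
obj_libsvm = -2.2491850195166347  # objective at libSVM's solution

# The gap between the two objective values is a handful of machine
# epsilons -- far below any meaningful optimization tolerance.
gap = abs(obj_daal - obj_libsvm)
eps = np.finfo(np.double).eps
print(gap, gap / eps)
```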

Expected Results

I hope to have convinced the community to either avoid inputs with duplicate samples in SVC tests, or to modify such tests to compare the values of the objective function at the solutions, or the results of predict.
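The predict-based comparison could look as follows. This is a sketch of the proposed test shape, reusing the reproducer’s subset and hyperparameters; with stock scikit-learn both fits go through libSVM, so the assertion is expected to hold:

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Same subset with an intentional duplicate as in the reproducer above
perm = np.array([18, 8, 18, 93, 77, 81, 81])
iris = load_iris()
X, y = iris.data[perm], iris.target[perm]

kw = {'gamma': 0.12682018203120274, 'tol': 1e-10,
      'decision_function_shape': 'ovo'}
clf = SVC(**kw).fit(X, y)
clf_sp = SVC(**kw).fit(sparse.csr_matrix(X), y)

# Implementation-independent check: the two fits agree on predictions,
# even if their support-vector sets were to differ.
assert np.array_equal(clf.predict(X), clf_sp.predict(X))
```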

Versions

I used scikit-learn 0.20.1 with NumPy 1.15.4 on a 64-bit Linux machine, together with Intel DAAL 2019.1.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

amueller commented, Dec 8, 2018 (1 reaction)

How about we do both? We test predict / decision_function on iris and check support vectors on something without duplicate data points? That way we test both things?

amueller commented, Dec 10, 2018

It’s an implementation detail, but one that makes it easier to write the tests 😉 We could also make sure the sets of support vectors are equal, if you prefer that. We don’t have an “implementation-independent” version of the dual coefficients, though, right?
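A duplicate-free test input, as suggested above, could be obtained by keeping one representative per distinct row. This is an illustrative sketch, not code from the thread:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Keep the first occurrence of each distinct row, preserving order
_, keep = np.unique(X, axis=0, return_index=True)
keep = np.sort(keep)
X_uniq, y_uniq = X[keep], y[keep]

# The reduced dataset has no duplicate samples
assert len(np.unique(X_uniq, axis=0)) == len(X_uniq)
```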
