
unique() needlessly slow

See original GitHub issue

np.unique has an axis option that allows, for example, calling unique on the rows of a matrix. I noticed, however, that it's quite slow: creating a view of the data and calling unique on the view is faster by a factor of 3.

MVCE:

import numpy
import perfplot


def unique_axis(data):
    return numpy.unique(data, axis=0)


def unique_row_view(data):
    b = numpy.ascontiguousarray(data).view(
        numpy.dtype((numpy.void, data.dtype.itemsize * data.shape[1]))
    )
    u = numpy.unique(b).view(data.dtype).reshape(-1, data.shape[1])
    return u


def unique_scikit(ar):
    if ar.ndim != 2:
        raise ValueError("unique_rows() only makes sense for 2D arrays, "
                         "got %dd" % ar.ndim)
    # the view in the next line only works if the array is C-contiguous
    ar = numpy.ascontiguousarray(ar)
    # np.unique() finds identical items in a raveled array. To make it
    # see each row as a single item, we create a view of each row as a
    # byte string of length itemsize times number of columns in `ar`
    ar_row_view = ar.view('|S%d' % (ar.itemsize * ar.shape[1]))
    _, unique_row_indices = numpy.unique(ar_row_view, return_index=True)
    ar_out = ar[unique_row_indices]
    return ar_out


perfplot.save(
    "unique.png",
    setup=lambda n: numpy.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_scikit],
    n_range=[2 ** k for k in range(20)],
)

(performance plot: unique.png)
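For readers unfamiliar with the view trick being benchmarked: it reinterprets each row as a single opaque element, so the ordinary 1D np.unique path can be used, and then reinterprets the result back into rows. A tiny sketch of the round trip (values are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [1, 2], [3, 4]])

# reinterpret each row as one opaque (void) element of row-sized bytes
v = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
)

# 1D unique on the void view, then reinterpret back into rows
u = np.unique(v).view(a.dtype).reshape(-1, a.shape[1])
print(u)  # [[1 2], [3 4]]
```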

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
eric-wieser commented, May 22, 2018

Note that your approach also only works for types where equality ⇔ binary equality, which is not true for floating point numbers.
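Eric's point can be checked directly: 0.0 and -0.0 compare equal as values but differ in their sign bit, so a byte-level view treats them as distinct rows (conversely, rows containing the same NaN byte pattern would be merged even though NaN != NaN). A minimal sketch:

```python
import numpy as np

# 0.0 and -0.0 are equal under value comparison but differ in the sign bit
a = np.array([[0.0, 1.0], [-0.0, 1.0]])
assert np.array_equal(a[0], a[1])  # value equality: the rows match

# the byte-level view used by the fast path sees two distinct rows
b = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
)
print(len(np.unique(b)))  # 2 -- the byte patterns differ
```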

0 reactions
pdemarti commented, Apr 18, 2021

Here are three more ways:

  • unique_rows creates a 1D view (like unique_row_view), then uses pd.unique on it.
  • unique_via_pd_drop_duplicates simply uses pd.drop_duplicates.
  • unique_rows_via_pd_internals uses some internals of pd.drop_duplicates, in order to avoid creating a DataFrame and other unnecessary operations.

Interestingly, compared to unique_row_view, these methods are only faster for large arrays. In the case of unique_rows, the relative speedup vanishes when the result itself is large, i.e. when there are not many duplicates.

Case 1: Lots of duplicates

BTW, I didn’t know about perfplot. It is awesome!

perfplot.show(
    setup=lambda n: np.random.randint(0, 100, (n, 2)),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,  # the alt methods return results in the original order (not sorted)
)

(performance plot: lots of duplicates)

Case 2: No duplicates

perfplot.show(
    setup=lambda n: np.stack((np.arange(n), np.arange(n)), axis=1),
    kernels=[unique_axis, unique_row_view, unique_rows, unique_via_pd_drop_duplicates, unique_rows_via_pd_internals],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
)

(performance plot: no duplicates)

Implementation

import numpy as np
import pandas as pd

def unique_rows(a):
    # np.unique() is slow, in part because it sorts;
    # pd.unique() is much faster, but only handles 1D arrays.
    # Inspired by https://github.com/numpy/numpy/issues/11136#issue-325345618
    # Creates a 1D view where each element is a byte-encoding of a row, runs
    # pd.unique() on it, then reconstructs the original dtype.
    if a.ndim != 2:
        raise ValueError(f'bad array dimension {a.ndim}; should be 2')
    b = np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    )[:, 0]
    return pd.unique(b).view(a.dtype).reshape(-1, a.shape[1])
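As background for the comment at the top of unique_rows: np.unique sorts its output, while pd.unique returns values in first-occurrence order, which is part of where the speed difference comes from. A quick 1D illustration:

```python
import numpy as np
import pandas as pd

x = np.array([3, 1, 3, 2, 1])
print(np.unique(x))  # sorted: [1 2 3]
print(pd.unique(x))  # first-occurrence order: [3 1 2]
```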

def unique_via_pd_drop_duplicates(a):
    return pd.DataFrame(a).drop_duplicates().values

from pandas._libs.hashtable import SIZE_HINT_LIMIT, duplicated_int64
from pandas.core.sorting import get_group_index
from pandas.core import algorithms

def duplicated_rows(a, keep='first'):
    size_hint = min(a.shape[0], SIZE_HINT_LIMIT)
    def f(vals):
        labels, shape = algorithms.factorize(vals, size_hint=size_hint)
        return labels.astype("i8", copy=False), len(shape)

    vals = (col for col in a.T)
    labels, shape = map(list, zip(*map(f, vals)))

    ids = get_group_index(labels, shape, sort=False, xnull=False)
    return duplicated_int64(ids, keep)

def unique_rows_via_pd_internals(a):
    return a[~duplicated_rows(a)]
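If one wants first-occurrence order without reaching into pandas internals (which are private and version-sensitive), the view trick can be combined with np.unique's return_index and a re-sort of the indices. The helper name below is hypothetical, not from the thread:

```python
import numpy as np

def unique_rows_keep_order(a):
    # order-preserving variant using only NumPy: take the first occurrence
    # of each distinct byte pattern, then restore the original input order
    a = np.ascontiguousarray(a)
    v = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))[:, 0]
    _, idx = np.unique(v, return_index=True)
    return a[np.sort(idx)]

a = np.array([[3, 4], [1, 2], [3, 4], [1, 2]])
print(unique_rows_keep_order(a))  # [[3 4], [1 2]]
```

This keeps the sort of np.unique on the critical path, so it is not expected to beat the pandas-based variants; it simply trades some speed for stability against pandas API churn.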