
numpy lacks memory and speed efficiency for Booleans

See original GitHub issue

Using plain Boolean values is quite limiting when dealing with very large data due to wasted memory: np.array stores each Boolean in one byte, which, although better than 4 or 8 bytes, is still 8 times more than the single bit actually needed.
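The factor of 8 is easy to verify directly (a minimal sketch; the 8-million-element size is an arbitrary illustration):

```python
import numpy as np

n = 8_000_000  # 8 million flags
bools = np.zeros(n, dtype=np.bool_)   # 1 byte per element
packed = np.packbits(bools)           # 1 bit per element

print(bools.nbytes)   # 8000000
print(packed.nbytes)  # 1000000
```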

It is easy enough to pack a Boolean np.array into bytes (np.uint8):

import numpy as np

def boolarr_tobytes(arr):
    # pad with False so the length is a multiple of 8
    rem = len(arr) % 8
    if rem != 0:
        arr = np.concatenate((arr, np.zeros(8 - rem, dtype=np.bool_)))
    # packbits translates Booleans to bytes, high bits first
    return np.packbits(arr)
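The reverse direction also exists in NumPy already: np.unpackbits recovers the Boolean array, and its count= argument (available since NumPy 1.17) trims the zero padding. A minimal round-trip sketch:

```python
import numpy as np

def boolarr_frombytes(packed, n):
    # count= trims the zero padding added when n is not a multiple of 8
    return np.unpackbits(packed, count=n).astype(np.bool_)

arr = np.array([True, False, True, True, False], dtype=np.bool_)
packed = np.packbits(arr)               # pads to a whole byte internally
restored = boolarr_frombytes(packed, len(arr))
print(np.array_equal(arr, restored))    # True
```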

Then & and | already provide the bitwise operations, and np.sum can be replaced by a per-byte popcount via table lookup:

bytebitcounts = np.array([bin(x).count("1") for x in range(256)])  # popcount lookup table
def totalbits_bytearr(arr):
    return np.sum(bytebitcounts[arr])
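A quick check that the packed popcount agrees with np.sum on the unpacked Booleans (self-contained sketch; the sample array is arbitrary):

```python
import numpy as np

bytebitcounts = np.array([bin(x).count("1") for x in range(256)], dtype=np.uint8)

def totalbits_bytearr(arr):
    return np.sum(bytebitcounts[arr])

bools = np.array([True, False, True, True, False, False, True, False, True])
packed = np.packbits(bools)

print(totalbits_bytearr(packed))  # 5, same as np.sum(bools)
```

(If you are on NumPy 2.0 or later, np.bitwise_count offers a native popcount ufunc that makes the lookup table unnecessary.)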

I am assuming here that the fancy-indexing table lookup (the same kind of operation used for image palette translation) is properly vectorized; given how heavily it is used in image processing, I would expect it is. Counting then costs two vector operations (the table lookup and np.sum) instead of one np.sum. PSHUFB (packed shuffle bytes) is the processor intrinsic that performs a byte-table lookup within SIMD registers. And because AVX/SSE2-style instructions process a fixed number of bytes at a time, the packed representation covers 8 times as many Booleans per vector operation, so even two operations instead of one leaves a net speedup of roughly 4 times.

So if numpy dared to add a whole new data type using packed-byte representations of Booleans (admittedly a major change to implement), it would cut memory use by 8 times and increase vector throughput by 8 times, except where bit twiddling like the above is needed, and even there the packed form would, depending on the specific operation, still tend to be faster.

I can see no reason why this would not be highly desirable for the library, especially since large datasets are pretty typical.
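To make the proposal concrete, here is a minimal sketch of what such a packed-Boolean type could look like as a thin wrapper over np.uint8 (a hypothetical class for illustration, not a proposed NumPy API):

```python
import numpy as np

class PackedBools:
    """Hypothetical 1-bit-per-element Boolean array (illustration only)."""
    _counts = np.array([bin(x).count("1") for x in range(256)], dtype=np.uint8)

    def __init__(self, bools):
        self.n = len(bools)
        self.bits = np.packbits(np.asarray(bools, dtype=np.bool_))

    def __and__(self, other):
        out = PackedBools([])
        out.n, out.bits = self.n, self.bits & other.bits
        return out

    def __or__(self, other):
        out = PackedBools([])
        out.n, out.bits = self.n, self.bits | other.bits
        return out

    def sum(self):
        # per-byte popcount via table lookup; padding bits are always zero
        return int(self._counts[self.bits].sum())

    def unpack(self):
        return np.unpackbits(self.bits, count=self.n).astype(np.bool_)

a = PackedBools([True, True, False, False])
b = PackedBools([True, False, True, False])
print((a & b).sum())  # 1
print((a | b).sum())  # 3
```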

Yes, the primary problem would be endless indexing oddities, e.g. reading a bit with (1 << bitoffset) & value and setting one with value |= (1 << bitoffset). But a lot of things are already implicitly supported.
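The bit twiddling for single-element access looks something like this in practice (a sketch; index 0 maps to the high bit of the first byte because np.packbits packs high bits first):

```python
import numpy as np

def get_bit(packed, i):
    # np.packbits stores the first element in the high bit of each byte
    return bool((packed[i // 8] >> (7 - i % 8)) & 1)

def set_bit(packed, i, value):
    mask = np.uint8(1 << (7 - i % 8))
    if value:
        packed[i // 8] |= mask
    else:
        packed[i // 8] &= ~mask

packed = np.packbits(np.array([False, False, False, False]))
set_bit(packed, 2, True)
print(get_bit(packed, 2))  # True
print(get_bit(packed, 0))  # False
```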

Half of the operations are probably trivial, like multiplication and addition as shown, while a few would require some real thinking.

It would make these Python libraries as flexibly scalable as C, which would be impressive, and would further positively affect a great many downstream libraries, giving dramatic potential gains on large data sets.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
eric-wieser commented, Nov 1, 2019

Also surprised to see the numba can take Python code with initialization and figure out how to correctly vectorize

I think you have the wrong model of how numba works, like I used to. I think the things you need to know about numba are that:

  • it translates python into unoptimized LLVM IR
  • it treats any closures (e.g. the lookup table) as compile-time constants
  • LLVM does all the heavy lifting regarding optimization, the same as is used by clang

which has a bunch of numba code for all the ufuncs.

Numpy cannot take a dependency on numba, it would make everything far too cyclic.

I suppose bitwise & and | are already implemented using numba

They’re implemented using native loops in C. The problem with np.sum(bytebitcounts[arr]) is that it uses two loops, an intermediate array, and has no compile-time knowledge of the lookup table.

1 reaction
eric-wieser commented, Nov 1, 2019

Note that you can probably get 90% of the performance you want with numba:

import numba
import numpy as np

lookup_table = np.array([bin(x).count("1") for x in range(256)], np.uint8)

@numba.guvectorize([(numba.uint8[:], numba.int64[:])], '(n)->()', nopython=True)
def sum_bits(x, res):
    res[0] = 0
    for xi in x:
        res[0] += lookup_table[xi]

In [41]: a = np.array([0b1100, 0b11011011], np.uint8)

In [42]: sum_bits(a)
Out[42]: 8

In [43]: %timeit sum_bits(a)
1.83 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [44]: def totalbits_bytearr(arr):
    ...:     return np.sum(bytebitcounts[arr])
    ...:

In [45]: bytebitcounts = lookup_table

In [46]: %timeit totalbits_bytearr(a)
12.7 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
