Feature Request: BloomFilter.approximateSize()

A ten-year-old paper describes a function that approximates the number of items in a Bloom filter.

All it needs is BloomFilter's bits.bitSize(), bits.bitCount(), and numHashFunctions. I played around with it and found it to be remarkably accurate for large and small, empty and full filters alike. I was seeing accuracy between two and five nines (99% to 99.999%), depending on the size of the filter.
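As a rough illustration, here is a minimal sketch of such an estimator, assuming the function in question is the standard fill-ratio estimate n ≈ -(m / k) · ln(1 - X / m), where m is the bit size, X the number of set bits, and k the number of hash functions. The class and method names are hypothetical and are not part of Guava's API; the three parameters stand in for the internal values named above.

```java
/** Hypothetical helper for illustration only; not Guava's API. */
final class BloomSizeEstimate {

    /**
     * Estimates the number of distinct insertions from the fill ratio:
     * n ~= -(m / k) * ln(1 - X / m), with m = bitSize, X = bitCount,
     * k = numHashFunctions.
     */
    static double approximateSize(long bitSize, long bitCount, int numHashFunctions) {
        if (bitSize <= 0 || numHashFunctions <= 0 || bitCount < 0 || bitCount > bitSize) {
            throw new IllegalArgumentException("invalid filter parameters");
        }
        double fractionSet = (double) bitCount / bitSize;
        // log1p(-f) evaluates ln(1 - f) without losing precision when f is tiny.
        // A fully saturated filter (f == 1) yields positive infinity.
        return -Math.log1p(-fractionSet) * bitSize / numHashFunctions;
    }

    public static void main(String[] args) {
        // Assumed example values: 1,000,000 bits, 7 hash functions, 295,312 bits set.
        System.out.println(approximateSize(1_000_000L, 295_312L, 7));
    }
}
```

An empty filter gives an estimate of zero, and the estimate grows without bound as the filter saturates, which is one reason the result should be documented as approximate.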

I’d be happy to implement it (properly) if needed. I think there is value in this addition, and since it’s just a calculation with some internal values, I don’t see any major downsides. I suppose there isn’t a way to quantify the accuracy of the results, but between Javadoc and a descriptive method name (like in “mightContain”) the approximate nature of the result could be clear.

Issue Analytics

Created 7 years ago
Comments: 17 (9 by maintainers)

Top GitHub Comments

tcbeutler commented, Mar 22, 2017

Oh! Sorry about that. I agree that works for most users, although the limitation also extends to situations where the filters are being deserialized. Being a probabilistic data structure, the approximate count is intrinsic to the object the same way mightContain() is, so this would allow the object to be passed around without having to pair it with a count.

tcbeutler commented, Mar 27, 2017

Glad to see this added - thank you!

Bit of rambling ahead…

For fun, @kevinb9n, regarding the ln(1 - k/m) => -k/m substitution: it's just stopping after the first term of the Taylor series expansion of ln(1 - x). Since k/m is constrained to being so close to 0, it's still a very accurate approximation.

If k hashes and m bits are calculated from the usual optimal formulas for a false positive probability p and n expected insertions, the accuracy of the first-term Taylor approximation depends almost entirely on n. At n=100, it's already 99.65% accurate. At n=1000, it's 99.965%, and we get an extra 9 for every order of magnitude n grows. Pretty nifty.
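The figures above can be checked numerically. The sketch below assumes the textbook optimal formulas m = -n · ln(p) / (ln 2)² and k = (m / n) · ln 2, left unrounded (a real implementation would round both to integers, so its numbers may differ slightly). Under that assumption k/m reduces to ln 2 / n, which is why p drops out. The class and method names are hypothetical.

```java
/** Hypothetical check of the first-term Taylor approximation; illustration only. */
final class TaylorAccuracy {

    /**
     * Returns the ratio (-x) / ln(1 - x) for x = k / m, i.e. how close the
     * first Taylor term -x comes to the exact logarithm (1.0 = exact).
     */
    static double firstTermAccuracy(long n, double p) {
        double ln2 = Math.log(2);
        double m = -n * Math.log(p) / (ln2 * ln2); // optimal bit count, unrounded
        double k = (m / n) * ln2;                  // optimal hash count, unrounded
        double x = k / m;                          // equals ln2 / n, independent of p
        return -x / Math.log1p(-x);
    }

    public static void main(String[] args) {
        // 0.03 is an arbitrary example fpp; the output does not depend on it.
        for (long n = 100; n <= 1_000_000L; n *= 10) {
            System.out.printf("n=%d accuracy=%.7f%n", n, firstTermAccuracy(n, 0.03));
        }
    }
}
```

Because the relative error of the first term is roughly x / 2 = ln 2 / (2n), each tenfold increase in n shrinks the error tenfold, matching the "extra 9" observation.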
