Feature Request: BloomFilter.approximateSize()
From a ten-year-old paper, there’s a function that approximates the number of items in a Bloom filter.
All it needs is BloomFilter’s bits.bitSize(), bits.bitCount(), and numHashFunctions. I played around with it and found it to be remarkably accurate for large and small, empty and full filters alike. I was seeing accuracy between two and five nines (roughly 99% to 99.999%), depending on the size of the filter.
I’d be happy to implement it (properly) if needed. I think there is value in this addition, and since it’s just a calculation with some internal values, I don’t see any major downsides. I suppose there isn’t a way to quantify the accuracy of the results, but between Javadoc and a descriptive method name (like in “mightContain”) the approximate nature of the result could be clear.
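A minimal sketch of the calculation, assuming the function in question is the standard fill-ratio estimate n ≈ -(m/k) * ln(1 - X/m), where m is bits.bitSize(), X is bits.bitCount(), and k is numHashFunctions. The class and method names here are hypothetical and are not Guava's API; the three inputs are passed in as plain arguments so the example is self-contained.

```java
// Hypothetical helper, not part of Guava: estimates how many distinct
// items were inserted into a Bloom filter from its fill ratio.
final class BloomSizeEstimate {

    /**
     * @param bitSize          total number of bits in the filter (m)
     * @param bitCount         number of bits currently set to 1 (X)
     * @param numHashFunctions number of hash functions per insertion (k)
     * @return estimated item count, or +Infinity for a saturated filter
     */
    static double approximateSize(long bitSize, long bitCount, int numHashFunctions) {
        if (bitCount >= bitSize) {
            // Every bit is set: the filter is saturated and carries no
            // information about how many items went in.
            return Double.POSITIVE_INFINITY;
        }
        double fractionSet = (double) bitCount / bitSize;
        // log1p(-x) computes ln(1 - x) accurately when x is close to 0.
        return -Math.log1p(-fractionSet) * bitSize / numHashFunctions;
    }

    public static void main(String[] args) {
        // An empty filter should estimate zero items.
        System.out.println(approximateSize(1000, 0, 5));
        // A half-full filter with m = 1000 and k = 5.
        System.out.println(approximateSize(1000, 500, 5));
    }
}
```

Returning a double keeps the saturated case representable; a real implementation would need to decide whether to round, clamp, or throw there.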
Issue Analytics
- Created 7 years ago
- Comments: 17 (9 by maintainers)
Top GitHub Comments
Oh! Sorry about that. I agree that works for most users, although the limitation also extends to situations where the filters are being deserialized. Being a probabilistic data structure, the approximate count is intrinsic to the object the same way mightContain() is, so this would allow the object to be passed around without having to pair it with a count.
Glad to see this added - thank you!
Bit of rambling ahead…
For fun, @kevinb9n, regarding the ln(1 - k/m) => -k/m substitution - it’s just stopping after the first term of the Taylor series expansion for ln. Since k/m is constrained to being so close to 0, it’s still a very accurate approximation. If calculating k hashes and m bits from the given optimal formulas using p fpp and n expected insertions, the accuracy of the first-term Taylor approximation depends almost entirely on n. At n=100, it’s already 99.65% accurate. At n=1000, it’s 99.965%, and we get an extra 9 for every order of magnitude n grows. Pretty nifty.
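The accuracy figures above can be checked numerically. This is a rough sketch that assumes "the given optimal formulas" are the usual sizing rules m = -n * ln(p) / (ln 2)^2 and k = round((m / n) * ln 2), with m truncated to a whole number of bits. The class and method names are hypothetical. Accuracy is taken as the ratio of the first Taylor term, k/m, to the exact value -ln(1 - k/m).

```java
// Hypothetical check of the first-term Taylor approximation
// ln(1 - k/m) ~ -k/m for optimally sized Bloom filters.
final class TaylorAccuracy {

    // Assumed sizing rule: m = -n * ln(p) / (ln 2)^2, truncated to a long.
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Assumed sizing rule: k = round((m / n) * ln 2), at least 1.
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Ratio of the approximation (k/m) to the exact value -ln(1 - k/m).
    static double firstTermAccuracy(long n, double p) {
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        double x = (double) k / m;
        return x / -Math.log1p(-x);
    }

    public static void main(String[] args) {
        double p = 0.03; // example false positive probability
        for (long n = 100; n <= 1_000_000; n *= 10) {
            System.out.printf("n=%d accuracy=%.6f%%%n", n, 100 * firstTermAccuracy(n, p));
        }
    }
}
```

Without rounding, k/m works out to ln(2)/n, which is why p drops out and each tenfold increase in n adds roughly one more nine.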