2007-04-12
| Table of Contents: |
| Rate This Article: | Add This Article To: |
( Page 2 of 2 )
Bloom Filters
The Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter), developed by Burton H. Bloom in 1970, is a more space-efficient way of testing whether an element is a member of a set. As with a hash table, false positives are possible and false negatives are not. Also like a hash table, a Bloom filter works by computing hash codes. The difference is that the Bloom filter computes multiple hash codes for each string, and sets multiple bits (one bit per hash code) in the table for each item that is inserted. The result is counter-intuitive: by setting multiple bits per item, the Bloom filter can store many more items in a table of given size while maintaining the same false positive rate.
For example, in the previous example of a hash table with one million items, I showed that we could insert 104,166 items with a false positive rate of 10 percent. A Bloom filter using the same number of bits and three hash functions can store approximately 207,972 itemsalmost twice as many. Better yet, with seven hash functions, the Bloom filter can store 104,243 with a false positive rate of only one percent!
The explanation of why the Bloom filter works this way is beyond the scope of this article. The linked Wikipedia article above will give you a good introduction, and provides ample sources of further information. The important part is the result, which is that the false positive rate (p) for a given Bloom filter depends on three things:
m the number bits in the hash table
n the number of items that you want to insert into the table
k the number of hash keys
The optimum false positive rate for a given m and n is approximately 0.6185m/n. The optimum number of keys, k, for a given m and n (i.e. that will give you the optimum false positive rate) is m/n * ln 2, or approximately 0.7 * m/n.
The Bloom Filter Calculator (http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html) is a handy tool that you can use to experiment with your own settings. For example, if you know you want to insert 100,000 items into a table and maintain a false positive rate of one percent or less, just enter 100,000 for n and 0.01 for p. The calculator will tell you that you need at least 959,296 bits and 7 hashing functions.
Creating the Hash Functions
The biggest difference between a regular hash table (which is, in effect, a Bloom filter with a k value of 1), and the generalized Bloom filter is that you need to define a variable number of hashing functions. At first it seems like an onerous task: coming up with potentially dozens of unique hashing functions. As it turns out, you can achieve very close to the theoretical false positive performance by using only two hash functions (f1 and f2) and then combining them using this formula:
hi(x) = (f1(x) + if2(x)) mod m
Since the .NET Framework computes a hash code for every object already, you only need to come up with one other. And that's relatively easy to do. The function shown in Listing 3 computes a hash code for the passed string.
Listing 3 computing a hash code from a passed string
private int HashString(string s)
{
int hash = 0;
for (int i = 0; i < s.Length; i++)
{
hash += s[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
Given two computed hash codes, you can then create the other hash keys with the simple bit of code shown in Listing 4.
Listing 4 here computing additional hash codes
private void CreateHashes(string str)
{
int hash1 = str.GetHashCode();
int hash2 = HashString(str);
hashKeys[0] = Math.Abs(hash1 % hashbits.Count);
if (numKeys > 1)
{
for (int i = 1; i < numKeys; i++)
{
hashKeys[i] = Math.Abs((hash1 + (i * hash2)) % hashbits.Count);
}
}
}
The BloomFilter Class
The BloomFilter class is a simple extension of the SimpleHashCode class. The constructor takes the size of the hash table and the number of keys, and creates the empty BitArray. Externally, the Add and Test methods work identically to their SimpleHashTable counterparts. Internally, they work much differently.
The Test function computes the hash codes and then tests each bit in the bit table to see if it is set. Test will return true only if all of the corresponding bits are set. If any one of the corresponding bits is not set, Test returns false.
Like Test, Add also computes the hash codes. It then examines each corresponding bit and sets the bit to true if it was not set previously. If any of the bits was not set, Add returns false. It only returns true if all of the bits were already set.
The complete BloomFilter class is shown in Listing 5.
Listing 5 The BloomFilter class
using System;
using System.Collections;
namespace Mischel.Hashing
{
public class BloomFilter
{
private BitArray hashbits;
private int numKeys;
private int[] hashKeys;
public BloomFilter(int tableSize, int nKeys)
{
numKeys = nKeys;
hashKeys = new int[numKeys];
hashbits = new BitArray(tableSize);
}
private int HashString(string s)
{
int hash = 0;
for (int i = 0; i < s.Length; i++)
{
hash += s[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
private void CreateHashes(string str)
{
int hash1 = str.GetHashCode();
int hash2 = HashString(str);
hashKeys[0] = Math.Abs(hash1 % hashbits.Count);
if (numKeys > 1)
{
for (int i = 1; i < numKeys; i++)
{
hashKeys[i] = Math.Abs((hash1 + (i * hash2))
% hashbits.Count);
}
}
}
public bool Test(string str)
{
CreateHashes(str);
// Test each hash key. Return false if any
// one of the bits is not set.
foreach (int hash in hashKeys)
{
if (!hashbits[hash])
return false;
}
// All bits set. The item is there.
return true;
}
public bool Add(string str)
{
// Initially assume that the item is in the table
bool rslt = true;
CreateHashes(str);
foreach (int hash in hashKeys)
{
if (!hashbits[hash])
{
// One of the bits wasn't set, so show that
// the item wasn't in the table, and set that bit.
rslt = false;
hashbits[hash] = true;
}
}
return rslt;
}
}
}
I modified the hash table test program so that it also inserts the items into a Bloom filter and compares the results. The Bloom filter calculator tells me that the optimum size if I want to insert 100,000 items with an error rate of 10 percent is 480,833 bits. So that's the size I used. The complete program is shown in Listing 6.
Listing 6 Testing the hash table and the Bloom filter
using System;
using System.IO;
using Mischel.Hashing;
namespace testhash
{
class Program
{
static void Main(string[] args)
{
int urlsRead = 0;
int hashCollisions = 0;
int bloomCollisions = 0;
SimpleHashTable hashTable = new SimpleHashTable(1000000);
BloomFilter bloom = new BloomFilter(480833, 3);
using (StreamReader sr = new StreamReader("urls.txt"))
{
string url;
while ((url = sr.ReadLine()) != null)
{
urlsRead++;
bool rslt = hashTable.Add(url);
if (rslt)
hashCollisions++;
rslt = bloom.Add(url);
if (rslt)
bloomCollisions++;
if ((urlsRead % 10000) == 0)
{
Console.WriteLine("{0} {1} {2}% {3} {4}%", urlsRead,
hashCollisions, 100*(double)hashCollisions / urlsRead,
bloomCollisions, 100*(double)bloomCollisions / urlsRead);
}
}
}
Console.WriteLine("{0} urls read", urlsRead);
Console.WriteLine("{0} hash collisions", hashCollisions);
Console.WriteLine("False positive rate (hash) = {0}%",
100*(double)hashCollisions / urlsRead);
Console.WriteLine("{0} Bloom collisions", bloomCollisions);
Console.WriteLine("False positive rate (Bloom) = {0}%",
100*(double)bloomCollisions / urlsRead);
}
}
}
The results, shown below, are a little better than you might expect, again because of the way that false positive rates are computed.
10000 44 0.44% 1 0.01%
20000 187 0.935% 10 0.05%
30000 423 1.41% 38 0.126666666666667%
40000 753 1.8825% 118 0.295%
50000 1200 2.4% 262 0.524%
60000 1753 2.92166666666667% 517 0.861666666666667%
70000 2375 3.39285714285714% 866 1.23714285714286%
80000 3123 3.90375% 1352 1.69%
90000 3946 4.38444444444444% 2118 2.35333333333333%
100000 4834 4.834% 2966 2.966%
100000 urls read
4834 hash collisions
False positive rate (hash) = 4.834%
2966 Bloom collisions
False positive rate (Bloom) = 2.966%
A Generic BloomFilter Class
I used strings in my BloomFilter example class above because strings are commonly used, easy to hash, and because I didn't want to complicate the code needlessly while explaining how Bloom filters work. But strings aren't the only kinds of objects one might want to use Bloom filters for. It's conceivable that you'd want to use some other objectone for which it's difficult to create a string that can be hashed. Fortunately, the C# support for generics makes it relatively easy to create a generalized Bloom filter class that will let you work with any kind of object. The only catch is that you'll have to create your own hashing functions.
The generic BloomFilter class is very similar to the one shown above. It's an abstract class, meaning that you'll have to inherit from it in order to create your own type-specific Bloom filter. When you inherit, you must implement the two methods CreateHash1 and CreateHash2. In addition, you may override the CreateHashes method if you want to change the way that the additional hash keys are created--either by creating more hashing functions or by changing the way that the hash codes are combined to create the keys. The complete BloomFilter class is shown in Listing 7, and the StringBloomFilter descendent class is shown in Listing 8.
Listing 7 - The generic BloomFilter class
public abstract class BloomFilter
{
private BitArray hashbits;
private int numKeys;
protected int[] hashKeys;
public BloomFilter(int tableSize, int nKeys)
{
numKeys = nKeys;
hashKeys = new int[numKeys];
hashbits = new BitArray(tableSize);
}
public bool Test(TValue val)
{
CreateHashes(val);
// Test each hash key. Return false
// if any one of the bits is not set.
foreach (int hash in hashKeys)
{
if (!hashbits[hash])
return false;
}
// All bits set. The item is there.
return true;
}
public bool Add(TValue val)
{
// Initially assume that the item is in the table
bool rslt = true;
CreateHashes(val);
foreach (int hash in hashKeys)
{
if (!hashbits[hash])
{
// One of the bits wasn't set, so show that
// the item wasn't in the table, and set that bit.
rslt = false;
hashbits[hash] = true;
}
}
return rslt;
}
protected virtual void CreateHashes(TValue val)
{
int hash1 = CreateHash1(val);
int hash2 = CreateHash2(val);
hashKeys[0] = Math.Abs(hash1 % hashbits.Count);
if (numKeys > 1)
{
for (int i = 1; i < numKeys; i++)
{
hashKeys[i] = Math.Abs((hash1 + (i * hash2)) %
hashbits.Count);
}
}
}
protected abstract int CreateHash1(TValue val);
protected abstract int CreateHash2(TValue val);
}
Listing 8 - The StringBloomFilter class
class StringBloomFilter : BloomFilter
{
public StringBloomFilter(int tableSize, int nKeys)
: base(tableSize, nKeys)
{
}
protected override int CreateHash1(string val)
{
return val.GetHashCode();
}
protected override int CreateHash2(string val)
{
int hash = 0;
for (int i = 0; i < val.Length; i++)
{
hash += val[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
}
Final Thoughts
Bloom filters take somewhat more processing time than do simple hash tables, because you have to compute multiple hash keys for each item that you will add to or test against the filter. The time requirement is somewhat alleviated by using the hash code combining method that I demonstrate, with little to no increase in the false positive rate.
The only real drawback to Bloom filters is that their false positive rate increases much faster than that of hash tables when the bit array begins to get full. However, you can tailor the Bloom filter's size to reduce the false positive rate.
You cannot remove an item from a Bloom filter. In a simple hash table, where one hash code corresponds to a single bit, you can remove an item by clearing that bit. But because a single bit in a Bloom filter might correspond to many different items, clearing one bit can remove many items from the set. Removing an item from a Bloom filter could (most likely would) result in false negatives, which are not allowed.
The problem of keeping track of things you have seen will only get larger as we continue to work with larger data sets. In applications where you can tolerate some percentage of false positives--reporting that you have seen an item that you in fact have not actually processed--Bloom filters provide a much more space-efficient way to test for set inclusion than do hash tables, at the cost of slightly more processing time. As with many other applications, the tradeoff is one of time versus space. In the case of Bloom filters versus hash tables, the tradeoff is usually worthwhile.
![]() |
|


