What if you're organizing an enormous amount of data (say tens of millions of records or more), and you have to test if an item is a member of the data? This could be time consuming and, possibly, require more memory than the amount present. Jim Mischel t
The Problem
Imagine that you want to take a random sample from a very large number of items, and you want to guarantee that you don't somehow end up sampling the same item more than once. You're collecting data in real time and there's no way to prevent duplicates from showing up in the input stream. Let's assume that the items you're sampling are names.
ADVERTISEMENT
Determining which names to sample is fairly simple. If you wanted to sample ten percent of the incoming data, you could generate a random number between 1 and 100 (inclusive) as each name is received and then use the name if the random number is less than 11. But then you need to determine if you've seen that name before.
If the number of names is small, you can keep a simple list of names and search it as each name is selected. Searching a sequential list gets expensive, though, when the list grows, and you'll likely want to move to a sorted list, dictionary, or hash table. Whatever data structure you choose, it's a simple matter to check the list to see if the name has been seen before, and insert it if not.
But what happens when the number or size of items begins to exceed available memory? Tens of millions of names start to occupy hundreds of megabytes. Inserting an item into a sorted list of ten million items takes significant time. A hash table helps with the speed issue, but at the cost of having to store the hash codes along with the strings, and the additional runtime overhead of having to handle hash collisions.
When you get to hundreds of millions of items, it's unlikely that you'll have enough memory to hold the entire table in RAM. At that point you need to make a choice: find an efficient way to store and lookup items on disk, or find a faster and more memory-efficient way to do it within the available memory.
There are databases that can handle tens or hundreds of millions of records, but they take enormous amounts of disk space and aren't especially fast. You can do better than a commercial database by creating your own custom indexing and storage scheme, but you'll still find yourself bumping up against IO limitations. A disk head can only seek so fast. You can increase the speed of lookups by allocating more memory for disk cache, but if you have that much memory to work with you're better off keeping the whole thing in memory, anyway.
When you run into a problem like this, it's time to sit back and revisit your assumptions. In this particular case, we're trying to make sure that we don't sample the same name more than once. But, does it matter if our "name seen" test says that we have seen a name that we have not, in fact, actually seen? That is, can our sampling tolerate a certain number of false positives? We know that we can't tolerate false negativesthe test telling us that we haven't seen a name that we actually did see. But false positives shouldn't affect the outcome, provided that we don't get too many.
Hash Tables
If you allow some percentage of false positives, then your memory requirements go way down. Rather than storing the names themselves, you only need to compute and store a hash code for each name. Checking to see if an item has been seen is a simple matter of computing the hash code and then looking to see if that hash code is already in the table. You just need to make a hash table that's big enough to hold all your items and also keep the false positive rate to an acceptable level.
The false positive rate of a hash table depends on two things: the quality of the hashing function and the size of the hash table relative to the number of items being inserted. The best hashing functions evenly distribute items throughout the table. Hashing is a well-known problem and much research has gone into developing good hashing functions. You can easily create a good hashing function.
The other problem is a bit more difficult to solve. The best way to explain why is to use the birthday paradox (http://en.wikipedia.org/wiki/Birthday_paradox)--a problem well known to anybody who's taken a college statistics class--as an example. Briefly, the birthday paradox shows that, given a random sample of 23 people, the probability of any two of them having the same birthday (month and day) is better than 50%. With 50 people, the probability is 97%.
If you assume that birthdays are randomly distributed, then it's easy to view the birthday as a hashing function. If you then replace the birthday hashing function with one that computes a hash code based on the names, you'll find similar results: the likelihood of getting a hash collision (two different names that return the same hash code) increases very quickly as the number of items to be hashed increases.
The false positive rate for a simple hash table is very close to m/n, where m is the size of the hash table and n is the number of items inserted into the table. So, if you want to hash 100,000 items and maintain a false positive rate of less than 10%, you'll need a hash table that can hold 1,000,000 items. To reduce your false positive rate, you increase the size of your hash table.
Fortunately, a simple hash table can be represented in memory as an array of bits, with each bit representing the status of each hash table item: seen (set to 1) or not seen (set to 0). Using this representation, a hash table capable of holding M items will take m/8 bytes of memory. To hold 10^20 items (one megabit) would require 128 kilobytes. The .NET Framework has a class, BitArray that makes creating this kind of hash table very easy, as you can see in Listing 1.
Listing 1 a simple hash table using BitArray
using System;
using System.Collections;
namespace Mischel.Hashing
{
public class SimpleHashTable
{
private BitArray hashbits = null;
public SimpleHashTable(int tableSize)
{
hashbits = new BitArray(tableSize, false);
}
public bool Test(string str)
{
int hash = Math.Abs(str.GetHashCode()) % hashbits.Count;
return hashbits[hash];
}
public bool Add(string str)
{
int hash = Math.Abs(str.GetHashCode()) % hashbits.Count;
bool rslt = hashbits[hash];
if (!rslt)
hashbits[hash] = true;
return rslt;
}
}
}
The public API consists of a constructor that takes a single int parameter that defines the size of the hash table, and two methods: Test, which will tell you if an item is in the hash table; and Add, which will add an item to the hash table if it's not already there, and will tell you if the item was there.
The program in Listing 2 reads 100,000 strings (guaranteed unique URLs from a web crawl) from a file. For each string, the program computes the hash code and checks the value in the hash table. Collisions are counted and results are displayed as each 10,000 items are read.
Listing 2 using the simple hash table.
using System;
using System.IO;
using Mischel.Hashing;
namespace testhash
{
class Program
{
static void Main(string[] args)
{
int urlsRead = 0;
int collisions = 0;
SimpleHashTable hashTable = new SimpleHashTable(1000000);
using (StreamReader sr = new StreamReader("urls.txt"))
{
string url;
while ((url = sr.ReadLine()) != null)
{
urlsRead++;
bool rslt = hashTable.Add(url);
if (rslt)
collisions++;
if ((urlsRead % 10000) == 0)
{
Console.WriteLine("{0} {1} {2}%",
urlsRead, collisions, 100*
(double)collisions / urlsRead);
}
}
}
Console.WriteLine("{0} urls read", urlsRead);
Console.WriteLine("{0} collisions", collisions);
Console.WriteLine("False positive rate = {0}%",
100*(double)collisions / urlsRead);
}
}
}
The output from that program, run against the 100,000 unique URLs in my file, is:
Note how the false positive rate increases as the number of items inserted into the table increases. Note also that the overall false positive rate of 4.834 percent is much lower than the computed false positive rate of 9.516 percent. The reason is that the computed false positive rate is the probability of a collision when inserting a value into a table of size m after n items have already been added. It does not represent the number of false positives you can expect when inserting n items.
A simple hash table is fine if you have the memory to adjust the size for your desired false positive rate. A good rule of thumb is to set the size of the hash table (in bits) equal to the number of items you want to insert, divided by the false positive rate:
m = n /p
Where m is the size of the table in bits, n is the number of items to be inserted, and p is the false positive rate. To insert 100,000 items with a false positive rate of 10%, you would need 100,000/0.1, or one million bits. If you want to lower the false positive rate to one percent, you have to increase the table size to ten million bits.
A simple bit-addressed hash table will allow you to process more items, but even it breaks down as the number of items increases and/or your desired false positive rate decreases. If you want to insert a million items with an error rate of 0.1 percent (one in a thousand), you need a billion bitsalmost 128 megabytes of RAM. You can use hash tables to process millions of items, but you'll need something better if you want tens or hundreds of millions.