BloomFilter set;
Iterator loop_beginning = null;
float prob = 1.0;
//go through the entire list
for(Iterator iter = list.beginning();
iter.hasNext(); ++iter) {
//is the node in the set
if(set.contains(*iter)) {
//it is! we may be in a loop.
//is this the first node in this possible loop?
if(loop_beginning == null) {
prob = set.falsePositiveRate();
loop_beginning = iter;
} else {
prob *= set.falsePositiveRate();
} //end if
//is the probability low enough?
if(prob < errorThreshold) {
//good enough for us return the
//first node in the loop.
return loop_beginning;
} //end if
} else {
//nope not in the set. We have not seen
//this node before so add it to the set.
set.add(*iter);
//no loop
loop_beginning = null;
} //end if else
} //end for
return null;
The algorithm is pretty easy it goes around the list until the node being examined is in the bloom filter. We could stop there but it could be that we just found a false positive. Bloom filters never say something is not in the set when it is so we check to see that the next few nodes are in the set also. Every contiguous positive result reduces the probability that we are examining a false positive. So we just go until the probability that this is a false positive drops to an acceptable level.
So when would you use this? You probably wouldn't; not unless you have a known upper bound on the number of nodes and you can't have two iterators walking the sequence for some contrived reason. The size of the bloom filter will grow linearly with the upper bound of links but it will be considerably smaller then storing the keys themselves. Even with that space advantage the bloom filter might not fit in memory. But like I said it was fun coming up with this.
edit: I must have been really fried when I wrote that algorithm the first time. So its better now.
edit-edit: Argh! one day ill get that pseudo-code right.
Monday, October 5, 2009
Betting Against the Tortoise and Hare
So apparently one question that comes up pretty often in programmer interviews is "how do you find a loop in a singly linked list?" I've never personally been asked this question but thats the hearsay. The common answer is Floyd's cycle detection algorithm (also called the tortoise and the hare algorithm). It works by having two iterators the tortoise and the hare, where the hare moves through the list twice as fast as the tortoise. The algorithm can find not only a cycle in the list but where it starts (check Wikipedia's article for details). The algorithm takes a few passes through the list to find the information though.
That got me thinking about a slightly different problem. Can we find the first element of the loop with only a single pass in a list that wont fit in memory. The answer that I came up with is: Sort of. At least we can find it with some small probability that we are wrong. Is this optimal ...I doubt it but it was fun to devise. The algorithm relies on a data structure called a bloom filter.
Bloom filters have the ability to keep a tab on things in a set with very little space overhead. However bloom filters have a non-zero probability of saying something is in a set when it is not. So here comes the algorithm (I'm glossing over the exact details of the filter here).
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment