## Buckets of Things

Splitting a data structure into smaller pieces, even just a little, can greatly increase performance with a minute increase in memory.

When I was earning my CS degree, my classes steered me toward thinking there are only a few data structures worth using, chosen based on their worst-case bounds. In the real world, hybrid structures and solutions exist. But without getting to know them or feeling comfortable with them, too many engineers fall back on the simple structures and pain themselves and their applications.

Trees aim to give better lookup and insertion times than a simple list while using a bit more memory, but those times can still feel harsh given a large enough tree. Hashes give fast insertions and lookups, but they can soak up memory like a sponge (even though memory can be cheap today, we still need to be mindful of its use). Each structure has various optimizations to help with its drawbacks, but there isn't a go-to solution that balances time and memory between trees and hashes.

Why not blend the two together?  Bucket or hash part of the lookup, and then dive through a tree?

That’s exactly what we did while analyzing IPv4 space for a publication in graduate school.

We had a problem: inserting IPv4 prefixes into a Patricia tree, in order to study BGP coverage by analyzing entries before they were incorporated into the router's Forwarding Information Base, was taking too long. Wow, a lot of words there! Allow me to step back and explain the problem a bit.

Essentially a company, or Autonomous System (AS), can own a chunk of contiguous IPv4 addresses. That chunk can be represented as a 32-bit IPv4 address plus a mask length, a combination called a "prefix". For example, 10.0.0.0/8 (or 10/8 for short) accounts for all addresses between 10.0.0.0 – 10.255.255.255, inclusive. Incrementing the mask by 1 splits the chunk into two equally-sized, disjoint chunks, e.g., 10/8 splits into 10.0/9 (10.0.0.0 – 10.127.255.255) and 10.128/9 (10.128.0.0 – 10.255.255.255).

Comparing these values is no easy feat, as a prefix can "cover" another. With the numbers above, 10/8 covers both 10.0/9 and 10.128/9, so an AS can announce its 10/8, and the router can drop the others since they're all owned by the same AS. Computationally, this is considerably more involved than a typical, speedy integer comparison.

There is a protocol on the internet called the Border Gateway Protocol (BGP) that lets ASes announce which prefixes they and their customers own, as well as where and how to pass traffic destined for those addresses. At the time, there were over 500,000 BGP entries in a common Routing Information Base table. Trees gave us O(log(n)) insertions and searches, and memory (something that is extremely limited in routers) wasn't too impacted. The cost of each comparison was quite high, though, due to the nature of prefixes.

Even using powerful and efficient simulation environments, and a lower-level language (C) for the task, it took roughly 3 hours just to load the tree, let alone run analysis and swap entries as they came in. Completely inefficient.

Most people here would reach for a HashMap of some sort. But 500,000 entries can take a lot of memory (something we didn't have), and the tree gave us the ability to compare prefixes as insertions and updates were made (something we wanted, since we were looking for caching improvements).

Searching for optimizations, we realized no prefix had a mask shorter than /8, and since IPv4 space was quickly depleting, no such prefix would ever exist. This meant an address would never need to be compared against an address with a different first octet.

There was our bucket! Instead of one tree, we made 256 trees, one for each possible first-octet value, and an array indexed by the first octet resolved to the respective tree. The array holds only pointers to the trees' root nodes, so it took up just 1KB.

Now initially, with one tree, the total time to insert all of the elements would be:

$\int_{1}^{500,000} \log_2(x) \, dx \approx 8,744,438$

For the bucket approach, assuming each of the 256 trees takes in the same number of prefixes, the time it would take to insert all of the elements would be:

$256 \times \int_{1}^{500,000/256} \log_2(x) \, dx \approx 4,744,448$
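For the curious, both totals come from the antiderivative of $\log_2$ (assuming base-2 logarithms, which is consistent with the totals above):

$$\int_{1}^{N} \log_2(x)\,dx \;=\; \left[\frac{x\ln x - x}{\ln 2}\right]_{1}^{N} \;=\; N\log_2(N) - \frac{N-1}{\ln 2}$$

Plugging in $N = 500{,}000$ gives roughly 8.74 million operations; replacing it with 256 trees of $N/256 \approx 1{,}953$ entries each gives roughly 4.74 million.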

As you can see, the total worst-case insertion time needed for these prefixes dropped by roughly half.

The operations saved come mostly from capping how large any single tree can grow. It also helps that early insertions into a fresh tree require far fewer comparisons.

People don’t usually like theoretics, though, especially managers.  They need practical, reproducible, and timed results.  So what were our results?

Instead of waiting 3 hours to load, we waited 3 minutes.  Say what?  180 minutes to 3?  A 60x speedup at the cost of a 1KB array in memory? How is that possible?

As the integrals above show, we cut the number of search and insert operations roughly in half, but the biggest time saver was the comparisons themselves. Remember when I said they were computationally intensive? I wasn't kidding.

Reducing the number of comparisons needed means spending less time on comparing. This should come as a, “Well, Duh!” moment, but you’d be surprised how often people forget to account for the time for comparisons and just focus on the time for building the data structure.  Comparisons can quickly compound.

Now, the OpenMP library is easy to use from C, so those inserts could be parallelized with the addition of a single line of code. That 3 minutes went down to an average of 27 seconds.

Needless to say, we were back in business!

In total, the time to prepare the data dropped from 180 minutes to 27 seconds, a 400x speedup, just because a single tree was split into 256 disjoint trees; the trade-off was 1KB more memory for the array.

All of this to find a better FIB-caching algorithm for routers, one robust against attacks and scalable to the increasing demand and size of the Internet.

Now what does a graduate student do when achieving such a victory?  They go and eat Chipotle!  And it was good!