A short note on random load balancing
by Mateusz
Update: I’ve updated wording a little bit, as I did not express myself clearly when it came to distribution of balls into bins and few readers pointed it out. I have clarified the meaning, by explaining things by using number of balls. In the next post I will show how it is almost sure that we can have an empty bin when distributing balls with uniform distribution, which is not the property we want. The goal of this note was to educate the reader on a possibility of improving the distribution of balls into bins so that we have almost equal number of balls into bins.
The technique presented in this post is little known outside of academic circles but is really significant result in random allocations. Let’s say you have a problem of allocating balls into bins randomly, but you want the number of balls in bins to be distributed as evenly as possible. How would you do that?
Very frequently I see people picking a bin at random with a uniform probability distribution and dropping a ball there hoping that the balls would be distributed evenly. The problem is that this is not the case. Running a short program that does the allocations according to this scheme with 100 bins and 1000000 balls produces distributes the balls as in the picture below. Let’s look at that picture. The X-axis is a bin number, the Y-axis is the number of balls in it.
You can clearly see that the minimum ball count in one of the bins is just below 9700 and the maximum ball count is above 10200 for some bins (for the particular realization). The number of balls in each of the bins however varies a lot. Is there anything that can be done that could help solve the problem? Yes, there is. Instead of picking a single bin to throw the ball into, pick two bins at random and throw the ball into the one that is least filled. Let’s look at the distribution generated by this scheme.
You can now clearly see that the difference between the number of balls in bins is very small. Most of the bins have 10000 balls in them, the minimum number of balls in a bin for this particular realization of a random experiment is 9997 and the maximum number of balls in any bin for this particular realization is 10002. This is much closer to a uniform distribution than one would expect. I found this result very enlightening and significant.
Next time you perform random allocations think of the power of two random choices.


Comments
Smells like a good patch to your favorite LB sw.
[...] This post was mentioned on Twitter by . said: [...]
interesting, but if you maintain state about how full the bins are, you don’t really need/want randomness anyway.
let’s simplify the case: we have 2 bins.
First request went to bin #1. With your algorithm, the 2nd goes to bin #2, while with random it has 50% chance of ending in #1 or #2.
What you suggest is a combination of random and round-robin balancing.
I was reminded of this nice series of posts on random load balancing
http://www.stdlib.net/~colmmacc/2009/11/26/period-pain-3/
BTW, please don’t let browsers resize your graphs, they look bad. Instead try:
convert -filter catrom -resize ‘520x>’ load1.png load1_web.png
Thanks for the link and a hint on resizing!
This is silly. If you know how many balls there are in a bin, just put the ball in the least used bin. No randomness needed.
If counting balls is expensive and unbalanced bins are expensive, why limit the algorithm to 2 candidate bins ? Why not a percentage of the number of bins ?
1)The mathematical explanation will follow in the next post. Picking 2 bins is enough.
2) Looking for least used bin requires searching…
> You can now clearly see that the difference between the number of balls in bins is very small.
Actually you can’t, because the Y-axis scale has stealthily changed. At first glance, it looks like it distributes them in weird layers with wide gaps between them.
@randomwhiner – Perhaps the author probably should have mentioned that the graph is a cartesian coordinate system. The y-axis changed due to the fact the standard deviation of ball counts in each bin was much smaller. I can totally see why you debunk his wording of “clearly” for these reasons, but it really wasn’t too difficult to follow.
@mateusz – Thanks for the post, good to see a quick solution to a common load balancing problem when you don’t have a local copy of the frequencies stored.
One reason you shouldn’t necessarily always go for the least utilized server in your cluster is that it might get bogged down with requests.
Let’s say you have 20 servers with 200 connected clients per server, and then add a new server into the cluster. Now the new server will have the lowest load of the bunch and it’ll (however momentarily) get hit with a LOT of connections since every new connection now goes to it. If this flood of connections makes the least utilized server beg for mercy, the whole cluster will suffer as it can’t deal with new connections.
Not believing the results I decided to write a little app to do exactly this. I couldnt believe how much more evenly it was spread out! I cant wait to read your next post!
As some commenters have pointed out, one doesn’t always know the existing distribution. I’d like to see load-balancing systems support a shuffling approach, where the load balancer makes an array of the bin IDs, shuffles them, and uses them in series, and then goes back to the shuffle step. This would not require knowledge of the history for each bin, and would still be quite flat.
@Preston:
Assuming that there are many more incoming connections than there are servers, would you expect there to be a difference between the shuffle technique and simple round-robin?
nice, i actually have a problem for which this is a possible solution. here are a few further things to think about:
how does adding more bins during the run affect the balance in both the short term and the long term? how much faster will the bins become rebalanced as the number of bins sampled increases beyond 2? is it a linear, polynomial, or exponential speedup? (i guess i have to define a metric for “the bins are now rebalanced”, lets say the lowest bin is within 10% of the highest)
what if instead of all the events using a weight of one, they use a random (but known) weight w between 1 and n? should you sample more bins for larger values of w? does the number of bins you need to sample to achieve the same balance increase logarithmically or linearly with w? how does this change if the distribution is not random but normal?
Given how well this works, I wonder if we can also do acceptably well with a smaller fraction of a 2-bin check. For instance, what if 1/2 of the time, we just pick a bin at random, and the other 1/2 of the time, we select the least used of 2 randomly chosen bins?
This could be helpful when the cost of querying a bin is high.
For some background: http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf
[...] This post was Twitted by syscomet [...]
Steve Jorgensen:
For the sake of doing it I decided to try out your idea. The normal total deviation from the average with 100 bins and 300000 iterations is around 60-80. When choosing a completely random number half the time and comparing the other half, it yields about 120-160 deviation. Still much better than the 1500-1800 when choosing the bin at random.
i posted some thoughts about the speed at which the bins will rebalance when adding empty bins during a run here: http://www.jeffplaisance.com/2010/06/automaticly-balancing-data-when-growing.html
@Jeff : cool, thanks for sharing the link.
@Jim Danz: yes this is the seminal paper.
non-intuitive and brilliant.
found your site on del.icio.us today and really liked it.. i bookmarked it and will be back to check it out some more later