Bhaskar Maddala | 10e26de | 2013-11-20 12:55:21 -0500 | [diff] [blame] | 1 | 2013/11/20 - How hashing works internally in haproxy - maddalab@gmail.com |
| 2 | |
| 3 | This document describes how Haproxy implements hashing both map-based and |
| 4 | consistent hashing, both prior to versions 1.5 and the motivation and tests |
| 5 | that were done when providing additional options starting in version 1.5. |
| 6 | |
| 7 | A note on hashing in general, hash functions strive to have little |
| 8 | correlation between input and output. The heart of a hash function is its |
| 9 | mixing step. The behavior of the mixing step largely determines whether the |
| 10 | hash function is collision-resistant. Hash functions that are collision |
| 11 | resistant are more likely to have an even distribution of load. |
| 12 | |
| 13 | The purpose of the mixing function is to spread the effect of each message |
| 14 | bit throughout all the bits of the internal state. Ideally every bit in the |
| 15 | hash state is affected by every bit in the message. And we want to do that |
| 16 | as quickly as possible simply for the sake of program performance. A |
| 17 | function is said to satisfy the strict avalanche criterion if, whenever a |
| 18 | single input bit is complemented (toggled between 0 and 1), each of the |
| 19 | output bits should change with a probability of one half for an arbitrary |
| 20 | selection of the remaining input bits. |
| 21 | |
| 22 | To guard against a combination of hash function and input that results in |
| 23 | high rate of collisions, haproxy implements an avalanche algorithm on the |
| 24 | result of the hashing function. In all versions 1.4 and prior avalanche is |
| 25 | always applied when using the consistent hashing directive. It is intended |
| 26 | to provide quite a good distribution for little input variations. The result |
| 27 | is quite suited to fit over a 32-bit space with enough variations so that |
| 28 | a randomly picked number falls equally before any server position, which is |
| 29 | ideal for consistently hashed backends, a common use case for caches. |
| 30 | |
| 31 | In all versions 1.4 and prior Haproxy implements the SDBM hashing function. |
| 32 | However tests show that alternatives to SDBM have a better cache |
| 33 | distribution on different hashing criteria. Additional tests involving |
| 34 | alternatives for hash input and an option to trigger avalanche, we found |
| 35 | different algorithms perform better on different criteria. DJB2 performs |
| 36 | well when hashing ascii text and is a good choice when hashing on host |
| 37 | header. Other alternatives perform better on numbers and are a good choice |
| 38 | when using source ip. The results also vary by use of the avalanche flag. |
| 39 | |
| 40 | The results of the testing can be found under the tests folder. Here is |
| 41 | a summary of the discussion on the results on 1 input criteria and the |
| 42 | methodology used to generate the results. |
| 43 | |
| 44 | A note of the setup when validating the results independently, one |
| 45 | would want to avoid backend server counts that may skew the results. As |
| 46 | an example with DJB2 avoid 33 servers. Please see the implementations of |
| 47 | the hashing function, which can be found in the links under references. |
| 48 | |
| 49 | The following was the set up used |
| 50 | |
| 51 | (a) hash-type consistent/map-based |
| 52 | (b) avalanche on/off |
| 53 | (c) balanche host(hdr) |
| 54 | (d) 3 criteria for inputs |
| 55 | - ~ 10K requests, including duplicates |
| 56 | - ~ 46K requests, unique requests from 1 MM requests were obtained |
| 57 | - ~ 250K requests, including duplicates |
| 58 | (e) 17 servers in backend, all servers were assigned the same weight |
| 59 | |
| 60 | Result of the hashing were obtained across the server via monitoring log |
| 61 | files for haproxy. Population Standard deviation was used to evaluate the |
| 62 | efficacy of the hashing algorithm. Lower standard deviation, indicates |
| 63 | a better distribution of load across the backends. |
| 64 | |
| 65 | On 10K requests, when using consistent hashing with avalanche on host |
| 66 | headers, DJB2 significantly out performs SDBM. Std dev on SDBM was 48.95 |
| 67 | and DJB2 was 26.29. This relationship is inverted with avalanche disabled, |
| 68 | however DJB2 with avalanche enabled out performs SDBM with avalanche |
| 69 | disabled. |
| 70 | |
| 71 | On map-based hashing SDBM out performs DJB2 irrespective of the avalanche |
| 72 | option. SDBM without avalanche is marginally better than with avalanche. |
| 73 | DJB2 performs significantly worse with avalanche enabled. |
| 74 | |
| 75 | Summary: The results of the testing indicate that there isn't a hashing |
| 76 | algorithm that can be applied across all input criteria. It is necessary |
| 77 | to support alternatives to SDBM, which is generally the best option, with |
| 78 | algorithms that are better for different inputs. Avalanche is not always |
| 79 | applicable and may result in less smooth distribution. |
| 80 | |
| 81 | References: |
| 82 | Mixing Functions/Avalanche: http://home.comcast.net/~bretm/hash/3.html |
| 83 | Hash Functions: http://www.cse.yorku.ca/~oz/hash.html |