Blame - doc/internals/hashing.txt - haproxy

blob: 281dcf6b9142986ca33e2bab3e5f75e3e3f22f7d [file] [log] [blame]

Bhaskar Maddala	10e26de	2013-11-20 12:55:21 -0500	[diff] [blame]	1	2013/11/20 - How hashing works internally in haproxy - maddalab@gmail.com
				2
				3	This document describes how Haproxy implements hashing both map-based and
				4	consistent hashing, both prior to versions 1.5 and the motivation and tests
				5	that were done when providing additional options starting in version 1.5.
				6
				7	A note on hashing in general, hash functions strive to have little
				8	correlation between input and output. The heart of a hash function is its
				9	mixing step. The behavior of the mixing step largely determines whether the
				10	hash function is collision-resistant. Hash functions that are collision
				11	resistant are more likely to have an even distribution of load.
				12
				13	The purpose of the mixing function is to spread the effect of each message
				14	bit throughout all the bits of the internal state. Ideally every bit in the
				15	hash state is affected by every bit in the message. And we want to do that
				16	as quickly as possible simply for the sake of program performance. A
				17	function is said to satisfy the strict avalanche criterion if, whenever a
				18	single input bit is complemented (toggled between 0 and 1), each of the
				19	output bits should change with a probability of one half for an arbitrary
				20	selection of the remaining input bits.
				21
				22	To guard against a combination of hash function and input that results in
				23	high rate of collisions, haproxy implements an avalanche algorithm on the
				24	result of the hashing function. In all versions 1.4 and prior avalanche is
				25	always applied when using the consistent hashing directive. It is intended
				26	to provide quite a good distribution for little input variations. The result
				27	is quite suited to fit over a 32-bit space with enough variations so that
				28	a randomly picked number falls equally before any server position, which is
				29	ideal for consistently hashed backends, a common use case for caches.
				30
				31	In all versions 1.4 and prior Haproxy implements the SDBM hashing function.
				32	However tests show that alternatives to SDBM have a better cache
				33	distribution on different hashing criteria. Additional tests involving
				34	alternatives for hash input and an option to trigger avalanche, we found
				35	different algorithms perform better on different criteria. DJB2 performs
				36	well when hashing ascii text and is a good choice when hashing on host
				37	header. Other alternatives perform better on numbers and are a good choice
				38	when using source ip. The results also vary by use of the avalanche flag.
				39
				40	The results of the testing can be found under the tests folder. Here is
				41	a summary of the discussion on the results on 1 input criteria and the
				42	methodology used to generate the results.
				43
				44	A note of the setup when validating the results independently, one
				45	would want to avoid backend server counts that may skew the results. As
				46	an example with DJB2 avoid 33 servers. Please see the implementations of
				47	the hashing function, which can be found in the links under references.
				48
				49	The following was the set up used
				50
				51	(a) hash-type consistent/map-based
				52	(b) avalanche on/off
				53	(c) balanche host(hdr)
				54	(d) 3 criteria for inputs
				55	- ~ 10K requests, including duplicates
				56	- ~ 46K requests, unique requests from 1 MM requests were obtained
				57	- ~ 250K requests, including duplicates
				58	(e) 17 servers in backend, all servers were assigned the same weight
				59
				60	Result of the hashing were obtained across the server via monitoring log
				61	files for haproxy. Population Standard deviation was used to evaluate the
				62	efficacy of the hashing algorithm. Lower standard deviation, indicates
				63	a better distribution of load across the backends.
				64
				65	On 10K requests, when using consistent hashing with avalanche on host
				66	headers, DJB2 significantly out performs SDBM. Std dev on SDBM was 48.95
				67	and DJB2 was 26.29. This relationship is inverted with avalanche disabled,
				68	however DJB2 with avalanche enabled out performs SDBM with avalanche
				69	disabled.
				70
				71	On map-based hashing SDBM out performs DJB2 irrespective of the avalanche
				72	option. SDBM without avalanche is marginally better than with avalanche.
				73	DJB2 performs significantly worse with avalanche enabled.
				74
				75	Summary: The results of the testing indicate that there isn't a hashing
				76	algorithm that can be applied across all input criteria. It is necessary
				77	to support alternatives to SDBM, which is generally the best option, with
				78	algorithms that are better for different inputs. Avalanche is not always
				79	applicable and may result in less smooth distribution.
				80
				81	References:
				82	Mixing Functions/Avalanche: http://home.comcast.net/~bretm/hash/3.html
				83	Hash Functions: http://www.cse.yorku.ca/~oz/hash.html