62e8aaa1bd5ca96089eaa88487c700c4af4617f4 - haproxy

commit	62e8aaa1bd5ca96089eaa88487c700c4af4617f4
author	Willy Tarreau <w@1wt.eu>	Wed Jan 27 17:22:29 2021 +0100
committer	Willy Tarreau <w@1wt.eu>	Thu Jan 28 16:48:01 2021 +0100
tree	20af5383f6584021cb148cb76c3a9250d2329f64
parent	405f05465252498429fb9dc38db40f0803f7cb69

BUG/MEDIUM: listener: do not accept connections faster than we can process them

In GitHub issue #822, user @ngaugler reported some performance problems when
dealing with many concurrent SSL connections on restarts, after migrating
from 1.6 to 2.2, indicating a long time required to re-establish connections.

The Run_queue metric in the traces showed an abnormally high number of tasks
in the run queue, likely indicating we were accepting faster than we could
process. This is indeed one of the differences between 1.6 and 2.2: in 2.2
the accept I/O loop and the TLS handshakes are totally independent, so much
so that they can even run on different threads. In 1.6 the SSL handshake was
handled almost immediately after the accept(), so this was limiting the
input rate.

With large maxconn values, as long as there are incoming connections, new
I/Os are scheduled and many of them pass before the handshake, being tagged
for low latency processing. The result is that handshakes get postponed, and
are further postponed as new connections are accepted. When they can finally
be processed, some of them fail because the client is gone, and the client
has already queued new ones. This causes an excess number of apparent
connections and of handshakes to be processed, just because we were
accepting connections on a temporarily saturated machine.

The solution is to temporarily pause new incoming connections when the load
indicates that more tasks are already queued than will be handled in a poll
loop. The usual difficulty with this approach is being able to come back and
re-enable the operation, but given that the metric is the run queue, we just
have to queue the global_listener_queue task so that it gets picked up by
any thread once the run queues get flushed.

Before this patch, injecting with SSL reneg with 10000 concurrent
connections resulted in 350k tasks in the run queue, and a majority of
handshake timeouts noticed by the client. With the patch, the run queue
fluctuates between 1-3x runqueue-depth, the process is constantly busy, the
accept rate is maximized and clients no longer observe any errors.

It would be desirable to backport this patch to 2.3 and 2.2 after some more
testing, provided the accept loop there is compatible.
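
The throttling idea described above can be illustrated with a small
standalone model. This is only a sketch under stated assumptions, not the
code from this patch: the struct and the model_* functions are hypothetical,
the "run queue" is a plain counter, and re-queuing the listener task is
approximated by retrying the accept step after each simulated poll loop. Because
of these simplifications the modelled run queue never exceeds 1x
runqueue_depth, whereas the real process was observed at 1-3x.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; these are NOT HAProxy's real structures. */
struct sched_model {
    unsigned int run_queue;       /* tasks currently waiting to run */
    unsigned int runqueue_depth;  /* max tasks handled per poll loop */
    unsigned int backlog;         /* connections waiting to be accepted */
    bool listener_paused;         /* set when accepting stopped due to load */
};

/* Accept pending connections, but stop as soon as the run queue already
 * holds as many tasks as one poll loop can handle. Returns the number of
 * connections accepted during this call.
 */
static unsigned int model_accept(struct sched_model *s)
{
    unsigned int accepted = 0;

    s->listener_paused = false;
    while (s->backlog > 0) {
        if (s->run_queue >= s->runqueue_depth) {
            /* Saturated: pause, and rely on being called again once
             * the queued tasks have been processed.
             */
            s->listener_paused = true;
            break;
        }
        s->backlog--;
        s->run_queue++;   /* each accepted connection queues one handshake task */
        accepted++;
    }
    return accepted;
}

/* Simulate one poll loop: run at most runqueue_depth queued tasks.
 * Returns the number of tasks processed.
 */
static unsigned int model_process(struct sched_model *s)
{
    unsigned int n = s->run_queue < s->runqueue_depth
                   ? s->run_queue : s->runqueue_depth;

    s->run_queue -= n;
    return n;
}

/* Alternate accept and processing until everything is drained.
 * Returns the highest run queue length observed.
 */
static unsigned int model_run(struct sched_model *s)
{
    unsigned int peak = 0;

    assert(s->runqueue_depth > 0);   /* a zero depth would never make progress */
    while (s->backlog > 0 || s->run_queue > 0) {
        model_accept(s);
        if (s->run_queue > peak)
            peak = s->run_queue;
        model_process(s);
    }
    return peak;
}
```

For example, initializing the model with backlog = 10000 and an illustrative
runqueue_depth of 200, then calling model_run(), drains all connections while
the modelled run queue stays bounded by runqueue_depth instead of growing
with the number of pending connections.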