MAJOR: session: only wake up as many sessions as available buffers permit

We've already experimented with three wake-up algorithms when releasing
buffers: the first, naive one used to wake up far too many sessions,
causing many of them not to get any buffer. The second approach, which
was still in use prior to this patch, consisted in waking up either 1
or 2 sessions depending on the number of FDs we had released, and this
was still inaccurate. The third one tried to cover the accuracy issues
of the second and took into consideration the number of FDs the sessions
would be willing to use, but most of the time we ended up waking up too
many of them for nothing, or deadlocking by lack of buffers.

This patch completely removes the need to allocate two buffers at once.
Instead it splits allocations into critical and non-critical ones and
implements a reserve in the pool for this purpose. The deadlock situation
happens when all buffers are allocated for requests pending in a
maxconn-limited server queue, because then there's no way left to allocate
buffers for responses, and these responses are critical to releasing the
server's connection in order to release the pending requests. In fact,
maxconn on a server creates a dependency between sessions, and particularly
between the oldest sessions' responses and the latest sessions' requests.
Thus, it is mandatory to get a free buffer for a response in order to
release a server connection, which in turn permits a request buffer to
be released.

Since we definitely have non-symmetrical buffers, we need to implement
this logic in the buffer allocation mechanism. What this commit does is
implement a reserve of buffers which can only be allocated for responses
and will never be allocated for requests. This is made possible by
the requester indicating how much margin it wants to leave after the
allocation succeeds. Thus it is a cooperative allocation mechanism: the
requester (process_session() in general) prefers not to get a buffer in
order to respect others' need for response buffers. The session management
code always knows whether a buffer will be used for requests or responses,
so this is not difficult:

  - either there's an applet on the initiator side and we really need
    the request buffer (since currently the applet is called in the
    context of the session)

  - or we have a connection and we really need the response buffer (in
    order to support building and sending an error message back)

This reserve ensures that we don't take all allocatable buffers for
requests waiting in a queue. The downside is that all the extra buffers
are really allocated up front to guarantee they will be available. But
with small values this is not an issue.

With this change, we don't observe any more deadlocks even when running
with maxconn 1 on a server under severely constrained memory conditions.

The code becomes a bit tricky: it relies on the scheduler's run queue to
estimate how many sessions are already expected to run so that it doesn't
wake up everyone with too few resources. A better solution would probably
consist in having two queues, one for urgent requests and one for normal
requests. A failed allocation for a session dealing with an error, a
connection event, or the need for a response (or request when there's an
applet on the left) would go to the urgent request queue, while other
requests would go to the other queue. Urgent requests would be served
from 1 entry in the pool, while the regular ones would be served only
according to the reserve. Despite not yet having this, it works
remarkably well.

This mechanism is quite efficient: we don't perform too many wake-up calls
anymore. Over 1 million sessions run under massive memory contention,
we observe about 4.5M calls to process_session() compared to 4.0M without
memory constraints. Previously we used to observe up to 16M calls, which
roughly means 12M failures.

During a test run under high memory constraints (limit enforced to 27 MB
instead of the 58 MB normally needed), performance used to drop by 53% prior
to this patch. Now with this patch instead it *increases* by about 1.5%.

The best effect of this change is that by limiting the memory usage to about
2/3 to 3/4 of what is needed by default, it's possible to increase performance
by up to about 18% mainly due to the fact that pools are reused more often
and remain hot in the CPU cache (observed on regular HTTP traffic with 20k
objects, buffers.limit = maxconn/10, buffers.reserve = limit/2).
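The tuning used in that test can be expressed as a global configuration
sketch (assuming the tune.buffers.* directives introduced with this work;
the maxconn value of 20000 is purely illustrative):

```
global
    maxconn 20000
    tune.buffers.limit   2000    # ~ maxconn/10
    tune.buffers.reserve 1000    # ~ limit/2
```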

Below is an example of a scenario which previously used to cause a deadlock:
  - connection is received
  - two buffers are allocated in process_session() then released
  - one is allocated when receiving an HTTP request
  - the second buffer is allocated then released in process_session()
    for request parsing then connection establishment.
  - poll() says we can send, so the request buffer is sent and released
  - process_session() gets notified that the connection is now established
    and allocates two buffers then releases them
  - all other sessions do the same till one cannot get the request buffer
    without hitting the margin
  - and now the server responds. stream_interface allocates the response
    buffer and manages to get it since it's higher priority being for a
    response.
  - but process_session() cannot allocate the request buffer anymore

  => We could end up with all buffers used by responses so that none may
     be allocated for a request in process_session().

When the applet processing leaves the session context, the test will have
to be changed so that we always allocate a response buffer regardless of
the left side (eg: H2->H1 gateway). A final improvement would consist in
being able to only retry the failed I/O operation without waking up a
task, but to date all experiments to achieve this have proven not to be
reliable enough.
diff --git a/include/common/defaults.h b/include/common/defaults.h
index cc1fbbd..eec0cfd 100644
--- a/include/common/defaults.h
+++ b/include/common/defaults.h
@@ -35,6 +35,20 @@
 #define BUFSIZE	        16384
 #endif
 
+/* certain buffers may only be allocated for responses in order to avoid
+ * deadlocks caused by request queuing. 2 buffers is the absolute minimum
+ * acceptable to ensure that a request gaining access to a server can get
+ * a response buffer even if it doesn't completely flush the request buffer.
+ * The worst case is an applet making use of a request buffer that cannot
+ * completely be sent while the server starts to respond, and all unreserved
+ * buffers are allocated by request buffers from pending connections in the
+ * queue waiting for this one to flush. Both reserved buffers may thus be
+ * used at the same time.
+ */
+#ifndef RESERVED_BUFS
+#define RESERVED_BUFS   2
+#endif
+
 // reserved buffer space for header rewriting
 #ifndef MAXREWRITE
 #define MAXREWRITE      (BUFSIZE / 2)
diff --git a/include/proto/session.h b/include/proto/session.h
index 5e26edc..1b3f603 100644
--- a/include/proto/session.h
+++ b/include/proto/session.h
@@ -25,8 +25,10 @@
 #include <common/config.h>
 #include <common/memory.h>
 #include <types/session.h>
+#include <proto/fd.h>
 #include <proto/freq_ctr.h>
 #include <proto/stick_table.h>
+#include <proto/task.h>
 
 extern struct pool_head *pool2_session;
 extern struct list sessions;
@@ -54,8 +56,9 @@
 
 /* Update the session's backend and server time stats */
 void session_update_time_stats(struct session *s);
-void session_offer_buffers(int count);
-int session_alloc_buffers(struct session *s);
+void __session_offer_buffers(int rqlimit);
+static inline void session_offer_buffers();
+int session_alloc_work_buffer(struct session *s);
 void session_release_buffers(struct session *s);
 int session_alloc_recv_buffer(struct session *s, struct buffer **buf);
 
@@ -272,6 +275,27 @@
 	LIST_INIT(&sess->by_srv);
 }
 
+static inline void session_offer_buffers()
+{
+	int avail;
+
+	if (LIST_ISEMPTY(&buffer_wq))
+		return;
+
+	/* all sessions will need 1 buffer, so we can stop waking up sessions
+	 * once we have enough of them to eat all the buffers. Note that we
+	 * don't really know if they are sessions or just other tasks, but
+	 * that's a rough estimate. Similarly, for each cached event we'll need
+	 * 1 buffer. If no buffer is currently used, always wake up the number
+	 * of tasks we can offer a buffer based on what is allocated, and in
+	 * any case at least one task per two reserved buffers.
+	 */
+	avail = pool2_buffer->allocated - pool2_buffer->used - global.tune.reserved_bufs / 2;
+
+	if (avail > (int)run_queue)
+		__session_offer_buffers(avail);
+}
+
 #endif /* _PROTO_SESSION_H */
 
 /*
diff --git a/include/types/global.h b/include/types/global.h
index 97767e1..143b864 100644
--- a/include/types/global.h
+++ b/include/types/global.h
@@ -128,6 +128,7 @@
 		int recv_enough;   /* how many input bytes at once are "enough" */
 		int bufsize;       /* buffer size in bytes, defaults to BUFSIZE */
 		int maxrewrite;    /* buffer max rewrite size in bytes, defaults to MAXREWRITE */
+		int reserved_bufs; /* how many buffers can only be allocated for response */
 		int client_sndbuf; /* set client sndbuf to this value if not null */
 		int client_rcvbuf; /* set client rcvbuf to this value if not null */
 		int server_sndbuf; /* set server sndbuf to this value if not null */
diff --git a/src/buffer.c b/src/buffer.c
index d9301bf..c7326dc 100644
--- a/src/buffer.c
+++ b/src/buffer.c
@@ -34,8 +34,27 @@
 /* perform minimal intializations, report 0 in case of error, 1 if OK. */
 int init_buffer()
 {
+	void *buffer;
+
 	pool2_buffer = create_pool("buffer", sizeof (struct buffer) + global.tune.bufsize, MEM_F_SHARED);
-	return pool2_buffer != NULL;
+	if (!pool2_buffer)
+		return 0;
+
+	/* The reserved buffer is what we leave behind us. Thus we always need
+	 * at least one extra buffer in minavail otherwise we'll end up waking
+	 * up tasks with no memory available, causing a lot of useless wakeups.
+	 * That means that we always want to have at least 3 buffers available
+	 * (2 for current session, one for next session that might be needed to
+	 * release a server connection).
+	 */
+	pool2_buffer->minavail = MAX(global.tune.reserved_bufs, 3);
+
+	buffer = pool_refill_alloc(pool2_buffer, pool2_buffer->minavail - 1);
+	if (!buffer)
+		return 0;
+
+	pool_free2(pool2_buffer, buffer);
+	return 1;
 }
 
 /* This function writes the string <str> at position <pos> which must be in
diff --git a/src/haproxy.c b/src/haproxy.c
index a306952..8a8e5f5 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -142,6 +142,7 @@
 		.bufsize = BUFSIZE,
 		.maxrewrite = MAXREWRITE,
 		.chksize = BUFSIZE,
+		.reserved_bufs = RESERVED_BUFS,
 #ifdef USE_OPENSSL
 		.sslcachesize = SSLCACHESIZE,
 		.ssl_default_dh_param = SSL_DEFAULT_DH_PARAM,
diff --git a/src/session.c b/src/session.c
index fc5b886..ad65963 100644
--- a/src/session.c
+++ b/src/session.c
@@ -621,7 +621,7 @@
 	b_drop(&s->req->buf);
 	b_drop(&s->rep->buf);
 	if (!LIST_ISEMPTY(&buffer_wq))
-		session_offer_buffers(1);
+		session_offer_buffers();
 
 	pool_free2(pool2_channel, s->req);
 	pool_free2(pool2_channel, s->rep);
@@ -679,14 +679,22 @@
 
 /* Allocates a single buffer for session <s>, but only if it's guaranteed that
  * it's not the last available buffer. To be called at the beginning of recv()
- * callbacks to ensure that the required buffers are properly allocated.
- * Returns 0 in case of failure, non-zero otherwise.
+ * callbacks to ensure that the required buffers are properly allocated. If the
+ * buffer is the session's request buffer, an extra control is made so that we
+ * always keep <tune.buffers.reserved> buffers available after this allocation.
+ * In all circumstances we leave at least 2 buffers so that any later call from
+ * process_session() has a chance to succeed. The response buffer is not bound
+ * to this control. Returns 0 in case of failure, non-zero otherwise.
  */
 int session_alloc_recv_buffer(struct session *s, struct buffer **buf)
 {
 	struct buffer *b;
+	int margin = 0;
 
-	b = b_alloc_margin(buf, 2);
+	if (buf == &s->req->buf)
+		margin = global.tune.reserved_bufs;
+
+	b = b_alloc_margin(buf, margin);
 	if (b)
 		return 1;
 
@@ -695,23 +703,39 @@
 	return 0;
 }
 
-/* Allocates up to two buffers for session <s>. Only succeeds if both buffers
- * are properly allocated. It is meant to be called inside process_session() so
- * that both request and response buffers are allocated. Returns 0 in case of
- * failure, non-zero otherwise.
+/* Allocates a work buffer for session <s>. It is meant to be called inside
+ * process_session(). It will only allocate the side needed for the function
+ * to work fine. For a regular connection, only the response is needed so that
+ * an error message may be built and returned. In case where the initiator is
+ * an applet (eg: peers), then we need to allocate the request buffer for the
+ * applet to be able to send its data (in this case a response is not needed).
+ * Request buffers are never picked from the reserved pool, but response
+ * buffers may be allocated from the reserve. This is critical to ensure that
+ * a response may always flow and will never block a server from releasing a
+ * connection. Returns 0 in case of failure, non-zero otherwise.
  */
-int session_alloc_buffers(struct session *s)
+int session_alloc_work_buffer(struct session *s)
 {
+	int margin;
+	struct buffer **buf;
+
+	if (objt_appctx(s->si[0].end)) {
+		buf = &s->req->buf;
+		margin = global.tune.reserved_bufs;
+	}
+	else {
+		buf = &s->rep->buf;
+		margin = 0;
+	}
+
 	if (!LIST_ISEMPTY(&s->buffer_wait)) {
 		LIST_DEL(&s->buffer_wait);
 		LIST_INIT(&s->buffer_wait);
 	}
 
-	if ((s->req->buf->size || b_alloc(&s->req->buf)) &&
-	    (s->rep->buf->size || b_alloc(&s->rep->buf)))
+	if (b_alloc_margin(buf, margin))
 		return 1;
 
-	session_release_buffers(s);
 	LIST_ADDQ(&buffer_wq, &s->buffer_wait);
 	return 0;
 }
@@ -724,10 +748,6 @@
  */
 void session_release_buffers(struct session *s)
 {
-	int release_count = 0;
-
-	release_count = !!s->req->buf->size + !!s->rep->buf->size;
-
 	if (s->req->buf->size && buffer_empty(s->req->buf))
 		b_free(&s->req->buf);
 
@@ -737,27 +757,28 @@
 	/* if we're certain to have at least 1 buffer available, and there is
 	 * someone waiting, we can wake up a waiter and offer them.
 	 */
-	if (release_count >= 1 && !LIST_ISEMPTY(&buffer_wq))
-		session_offer_buffers(release_count);
+	if (!LIST_ISEMPTY(&buffer_wq))
+		session_offer_buffers();
 }
 
-/* run across the list of pending sessions waiting for a buffer and wake
- * one up if buffers are available.
+/* Runs across the list of pending sessions waiting for a buffer and wakes one
+ * up if buffers are available. Will stop when the run queue reaches <rqlimit>.
+ * Should not be called directly, use session_offer_buffers() instead.
  */
-void session_offer_buffers(int count)
+void __session_offer_buffers(int rqlimit)
 {
 	struct session *sess, *bak;
 
 	list_for_each_entry_safe(sess, bak, &buffer_wq, buffer_wait) {
+		if (rqlimit <= run_queue)
+			break;
+
 		if (sess->task->state & TASK_RUNNING)
 			continue;
 
 		LIST_DEL(&sess->buffer_wait);
 		LIST_INIT(&sess->buffer_wait);
-
 		task_wakeup(sess->task, TASK_WOKEN_RES);
-		if (--count <= 0)
-			break;
 	}
 }
 
@@ -1797,7 +1818,7 @@
 	/* below we may emit error messages so we have to ensure that we have
 	 * our buffers properly allocated.
 	 */
-	if (!session_alloc_buffers(s)) {
+	if (!session_alloc_work_buffer(s)) {
 		/* No buffer available, we've been subscribed to the list of
 		 * buffer waiters, let's wait for our turn.
 		 */