MEDIUM: protocol: implement a "drain" function in protocol layers Since commit cfd97c6f was merged into 1.5-dev14 (BUG/MEDIUM: checks: prevent TIME_WAITs from appearing also on timeouts), some valid health checks sometimes used to show some TCP resets. For example, this HTTP health check sent to a local server : 19:55:15.742818 IP 127.0.0.1.16568 > 127.0.0.1.8000: S 3355859679:3355859679(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 19:55:15.742841 IP 127.0.0.1.8000 > 127.0.0.1.16568: S 1060952566:1060952566(0) ack 3355859680 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 19:55:15.742863 IP 127.0.0.1.16568 > 127.0.0.1.8000: . ack 1 win 257 19:55:15.745402 IP 127.0.0.1.16568 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257 19:55:15.745488 IP 127.0.0.1.8000 > 127.0.0.1.16568: FP 1:146(145) ack 23 win 257 19:55:15.747109 IP 127.0.0.1.16568 > 127.0.0.1.8000: R 23:23(0) ack 147 win 257 After some discussion with Chris Huang-Leaver, it appeared clear that what we want is to only send the RST when we have no other choice, which means when the server has not closed. So we still keep SYN/SYN-ACK/RST for pure TCP checks, but don't want to see an RST emitted as above when the server has already sent the FIN. The solution against this consists in implementing a "drain" function at the protocol layer, which, when defined, causes as much as possible of the input socket buffer to be flushed to make recv() return zero so that we know that the server's FIN was received and ACKed. On Linux, we can make use of MSG_TRUNC on TCP sockets, which has the benefit of draining everything at once without even copying data. On other platforms, we read up to one buffer of data before the close. If recv() manages to get the final zero, we don't disable lingering. Same for hard errors. Otherwise we do. In practice, on HTTP health checks we generally find that the close was pending and is returned upon first recv() call. The network trace becomes cleaner : 19:55:23.650621 IP 127.0.0.1.16561 > 127.0.0.1.8000: S 3982804816:3982804816(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 19:55:23.650644 IP 127.0.0.1.8000 > 127.0.0.1.16561: S 4082139313:4082139313(0) ack 3982804817 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 19:55:23.650666 IP 127.0.0.1.16561 > 127.0.0.1.8000: . ack 1 win 257 19:55:23.651615 IP 127.0.0.1.16561 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257 19:55:23.651696 IP 127.0.0.1.8000 > 127.0.0.1.16561: FP 1:146(145) ack 23 win 257 19:55:23.652628 IP 127.0.0.1.16561 > 127.0.0.1.8000: F 23:23(0) ack 147 win 257 19:55:23.652655 IP 127.0.0.1.8000 > 127.0.0.1.16561: . ack 24 win 257 This change should be backported to 1.4 which is where Chris encountered this issue. The code is different, so probably the tcp_drain() function will have to be put in the checks only.

commit: 2b57cb8f30208e4d1d003590e141c9bef03bb279 [log] [tgz]
author: Willy Tarreau <w@1wt.eu> Mon Jun 10 19:56:38 2013 +0200
committer: Willy Tarreau <w@1wt.eu> Mon Jun 10 20:33:23 2013 +0200
tree: 0a63ea9f8ba4d38fb392a9c2afbac6d73f2d902a
parent: 04ff9f105f441a4542456b47d17abe0a006917cf [diff] [blame]
diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index e1b5d8b..049988c 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c

@@ -77,6 +77,7 @@
 	.enable_all = enable_all_listeners,
 	.get_src = tcp_get_src,
 	.get_dst = tcp_get_dst,
+	.drain = tcp_drain,
 	.listeners = LIST_HEAD_INIT(proto_tcpv4.listeners),
 	.nb_listeners = 0,
 };
@@ -98,6 +99,7 @@
 	.enable_all = enable_all_listeners,
 	.get_src = tcp_get_src,
 	.get_dst = tcp_get_dst,
+	.drain = tcp_drain,
 	.listeners = LIST_HEAD_INIT(proto_tcpv6.listeners),
 	.nb_listeners = 0,
 };
@@ -513,6 +515,41 @@
 		return getsockname(fd, sa, &salen);
 }
 
+/* Tries to drain any pending incoming data from the socket to reach the
+ * receive shutdown. Returns non-zero if the shutdown was found, otherwise
+ * zero. This is useful to decide whether we can close a connection cleanly
+ * are we must kill it hard.
+ */
+int tcp_drain(int fd)
+{
+	int turns = 2;
+	int len;
+
+	while (turns) {
+#ifdef MSG_TRUNC_CLEARS_INPUT
+		len = recv(fd, NULL, INT_MAX, MSG_DONTWAIT | MSG_NOSIGNAL | MSG_TRUNC);
+		if (len == -1 && errno == EFAULT)
+#endif
+			len = recv(fd, trash.str, trash.size, MSG_DONTWAIT | MSG_NOSIGNAL);
+
+		if (len == 0)                /* cool, shutdown received */
+			return 1;
+
+		if (len < 0) {
+			if (errno == EAGAIN) /* connection not closed yet */
+				return 0;
+			if (errno == EINTR)  /* oops, try again */
+				continue;
+			/* other errors indicate a dead connection, fine. */
+			return 1;
+		}
+		/* OK we read some data, let's try again once */
+		turns--;
+	}
+	/* some data are still present, give up */
+	return 0;
+}
+
 /* This is the callback which is set when a connection establishment is pending
  * and we have nothing to send, or if we have an init function we want to call
  * once the connection is established. It updates the FD polling status. It
commit	2b57cb8f30208e4d1d003590e141c9bef03bb279	[log] [tgz]
author	Willy Tarreau <w@1wt.eu>	Mon Jun 10 19:56:38 2013 +0200
committer	Willy Tarreau <w@1wt.eu>	Mon Jun 10 20:33:23 2013 +0200
tree	0a63ea9f8ba4d38fb392a9c2afbac6d73f2d902a
parent	04ff9f105f441a4542456b47d17abe0a006917cf [diff] [blame]