MEDIUM: streams: Add the ability to retry a request on L7 failure.

When running in HTX mode, if we sent the request but failed to get the answer, because the server just closed its socket, because we hit a server timeout, or because we got a 404, 408, 425, 500, 501, 502, 503 or 504 error, attempt to retry the request, exactly as if we had just failed to connect to the server.

To do so, add a new backend keyword, "retry-on". It accepts a list of keywords, which can be "none" (never retry), "conn-failure" (we failed to connect, or to do the SSL handshake), "empty-response" (the server closed the connection without answering), "response-timeout" (we timed out while waiting for the server response), or "404", "408", "425", "500", "501", "502", "503" and "504".

The default is "conn-failure".
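
As an illustration of the keyword described above, here is a minimal configuration sketch. It is not part of the commit: the section name, server names and addresses are hypothetical placeholders, the timeouts are arbitrary, and it assumes that the keywords are separated by spaces and that HTX mode still has to be enabled explicitly with "option http-use-htx" in this version.

```haproxy
# Hypothetical example; names, addresses and timeouts are placeholders.
defaults
    mode http
    option http-use-htx          # assumption: HTX must be enabled for L7 retries
    timeout connect 5s
    timeout client  30s
    timeout server  30s

backend static_files
    retries 3                    # upper bound on retry attempts
    # Replaces the default ("conn-failure"); the list is not cumulative.
    retry-on conn-failure empty-response 503
    option redispatch            # allow a retry to go to another server
    server static1 192.0.2.10:80 check
    server static2 192.0.2.11:80 check
```

A static file backend is used here because the documentation added by this commit describes static file servers and caches as generally safe against any type of retry. For applications without replay protection, anything beyond "conn-failure" and "none" is described as dangerous.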

commit: a254a37ad70e8ff266ff3d69ab16eb5303163f7b
author: Olivier Houchard <ohouchard@haproxy.com> Fri Apr 05 15:30:12 2019 +0200
committer: Willy Tarreau <w@1wt.eu> Sat May 04 10:19:56 2019 +0200
tree: b3425aac0574f40e746499efa14c378add1d800b
parent: f4bda993dde4f7414a59d48cb3b009a29456d4ba
diff --git a/doc/configuration.txt b/doc/configuration.txt
index 4088291..f6269d7 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt

@@ -2352,6 +2352,7 @@
 -- keyword -------------------------- defaults - frontend - listen -- backend -
 reqtarpit                                 -          X         X         X
 retries                                   X          -         X         X
+retry-on                                  X          -         X         X
 rspadd                                    -          X         X         X
 rspdel                                    -          X         X         X
 rspdeny                                   -          X         X         X
@@ -8004,6 +8005,70 @@
   See also : "option redispatch"
 
 
+retry-on [list of keywords]
+  Specify when to attempt to automatically retry a failed request
+  May be used in sections:    defaults | frontend | listen | backend
+                                 yes   |    no    |   yes  |   yes
+  Arguments :
+    <keywords>  is a list of keywords or HTTP status codes, each representing a
+                type of failure event on which an attempt to retry the request
+                is desired. Please read the notes at the bottom before changing
+                this setting. The following keywords are supported :
+
+      none              never retry
+
+      conn-failure      retry when the connection or the SSL handshake failed
+                        and the request could not be sent. This is the default.
+
+      empty-response    retry when the server connection was closed after part
+                        of the request was sent, and nothing was received from
+                        the server. This type of failure may be caused by the
+                        request timeout on the server side, poor network
+                        condition, or a server crash or restart while
+                        processing the request.
+
+      response-timeout  the server timeout struck while waiting for the server
+                        to respond to the request. This may be caused by poor
+                        network condition, the reuse of an idle connection
+                        which has expired on the path, or by the request being
+                        extremely expensive to process. It generally is a bad
+                        idea to retry on such events on servers dealing with
+                        heavy database processing (full scans, etc) as it may
+                        amplify denial of service attacks.
+
+      <status>          any HTTP status code among "404" (Not Found), "408"
+                        (Request Timeout), "425" (Too Early), "500" (Server
+                        Error), "501" (Not Implemented), "502" (Bad Gateway),
+                        "503" (Service Unavailable), "504" (Gateway Timeout).
+
+  Using this directive replaces any previous settings with the new ones; it is
+  not cumulative.
+
+  Please note that using anything other than "none" and "conn-failure" requires
+  allocating a buffer and copying the whole request into it, so it has memory
+  and performance impacts. Requests not fitting in a single buffer will never be
+  retried (see the global tune.bufsize setting).
+
+  You have to make sure the application has a replay protection mechanism built
+  in, such as unique transaction IDs passed in requests, or that replaying the
+  same request has no consequence, or it is very dangerous to use any retry-on
+  value besides "conn-failure" and "none". Static file servers and caches are
+  generally considered safe against any type of retry. Using a status code can
+  be useful to quickly leave a server showing an abnormal behavior (out of
+  memory, file system issues, etc), but in this case it may be a good idea to
+  immediately redispatch the connection to another server (please see "option
+  redispatch" for this). Last, it is important to understand that most causes
+  of failures are the requests themselves and that retrying a request causing a
+  server to misbehave will often make the situation even worse for this server,
+  or for the whole service in case of redispatch.
+
+  Unless you know exactly how the application deals with replayed requests, you
+  should not use this directive.
+
+  The default is "conn-failure".
+
+  See also: "retries", "option redispatch", "tune.bufsize"
+
 rspadd <string> [{if | unless} <cond>]
   Add a header at the end of the HTTP response
   May be used in sections :   defaults | frontend | listen | backend

diff --git a/doc/internals/filters.txt b/doc/internals/filters.txt
index 09090e5..2cb0eed 100644
--- a/doc/internals/filters.txt
+++ b/doc/internals/filters.txt

@@ -1170,9 +1170,12 @@
 Then, to finish, there are 2 informational callbacks:
 
   * 'flt_ops.http_reset': This callback is called when a HTTP message is
-    reset. This only happens when a '100-continue' response is received. It
+    reset. This happens either when a '100-continue' response is received, or
+    if we're retrying to send the request to the server after it failed. It
     could be useful to reset the filter context before receiving the true
     response.
+    You can tell why the callback was called by checking s->txn->status. If it's
+    10X, we were called because of a '100-continue'; if not, it's an L7 retry.
 
   * 'flt_ops.http_reply': This callback is called when, at any time, HAProxy
     decides to stop the processing on a HTTP message and to send an internal