DOC: better document the config file format and escaping/quoting rules
It's always a pain to figure out how to proceed when special characters need
to be embedded inside arguments of an expression. Let's document the
configuration file format and how unquoting/unescaping works at each
level (top level and argument level) so that everyone hopefully finds
suitable reminders or examples for complex cases.
This is related to github issue #200 and addresses issues #712 and #966.
(cherry picked from commit 6f1129d14dace99687f8681bf825dfda2905502a)
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
diff --git a/doc/configuration.txt b/doc/configuration.txt
index a934850..c64e0cd 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -404,28 +404,137 @@
HAProxy's configuration process involves 3 major sources of parameters :
- the arguments from the command-line, which always take precedence
- - the "global" section, which sets process-wide parameters
- - the proxies sections which can take form of "defaults", "listen",
- "frontend" and "backend".
+ - the configuration file(s), whose format is described here
+ - the running process' environment, in case some environment variables are
+ explicitly referenced
-The configuration file syntax consists in lines beginning with a keyword
-referenced in this manual, optionally followed by one or several parameters
-delimited by spaces.
+The configuration file follows a fairly simple hierarchical format which obeys
+a few basic rules:
+ 1. a configuration file is an ordered sequence of statements
+
+ 2. a statement is a single non-empty line before any unprotected "#" (hash)
+
+ 3. a line is a series of tokens or "words" delimited by unprotected spaces or
+ tab characters
+
+ 4. the first word or sequence of words of a line is one of the keywords or
+ keyword sequences listed in this document
+
+ 5. all other words are all arguments of the first one, some being well-known
+ keywords listed in this document, others being values, references to other
+ parts of the configuration, or expressions
+
+ 6. certain keywords delimit a section inside which only a subset of keywords
+ are supported
+
+ 7. a section ends at the end of a file or on a special keyword starting a new
+ section
+
+This is all one needs to know to write a simple but reliable configuration
+generator, but it is not enough to reliably parse any configuration nor to
+figure out how to deal with certain corner cases.
+
+First, there are a few consequences of the rules above. Rules 6 and 7 imply
+that the keywords used to define a new section are valid everywhere and cannot
+have a different meaning in a specific section. These keywords are always a
+single word (as opposed to a sequence of words), and traditionally the section
+that follows them is designated using the same name. For example when speaking
+about the "global section", it designates the section of configuration that
+follows the "global" keyword. This convention is used a lot in error messages
+to help locate the parts that need to be addressed.
+
+A number of sections create an internal object or configuration space, which
+needs to be distinguished from other ones. In this case they will take an
+extra word which will set the name of this particular section. For some of them
+the section name is mandatory. For example "frontend foo" will create a new
+section of type "frontend" named "foo". Usually a name is specific to its
+section and two sections of different types may use the same name, but this is
+not recommended as it tends to complicate configuration management.
+
+A direct consequence of rule 7 is that when multiple files are read at once,
+each of them must start with a new section, and the end of each file will end
+a section. A file cannot contain sub-sections nor end an existing section and
+start a new one.
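+
+As an illustration only, assuming two hypothetical files named "global.cfg"
+and "www.cfg" loaded in this order using "-f global.cfg -f www.cfg", each of
+them has to open its own section, and nothing from the first file remains
+open when the second one starts:
+
+    # global.cfg
+    global
+        daemon
+
+    # www.cfg
+    frontend www
+        bind :80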
+
+Rule 1 mentioned that ordering matters. Indeed, some keywords create directives
+that can be repeated multiple times to create ordered sequences of rules to be
+applied in a certain order. For example "tcp-request" can be used to alternate
+"accept" and "reject" rules on varying criteria. As such, a configuration file
+processor must always preserve a section's ordering when editing a file. The
+ordering of sections usually does not matter except for the global section
+which must be placed before other sections, but it may be repeated if needed.
+In addition, some identifiers may automatically be assigned to some of the
+created objects (e.g. proxies), and by reordering sections, their identifiers
+will change. These appear in the statistics for example. As such, the
+configuration below will assign "foo" ID number 1 and "bar" ID number 2, which
+will be swapped if the two sections are reversed:
+
+ listen foo
+ bind :80
+
+ listen bar
+ bind :81
+
+Another important point is that according to rules 2 and 3 above, empty lines,
+spaces, tabs, and comments following an unprotected "#" character are not part
+of the configuration as they are just used as delimiters. This implies that the
+following configurations are strictly equivalent:
+
+ global#this is the global section
+ daemon#daemonize
+ frontend foo
+ mode http # or tcp
+
+and:
+
+ global
+ daemon
+
+ # this is the public web frontend
+ frontend foo
+ mode http
+
+The common practice is to align to the left only the keyword that initiates a
+new section, and indent (i.e. prepend a tab character or a few spaces) all
+other keywords so that it's instantly visible that they belong to the same
+section (as done in the second example above). Placing comments before a new
+section helps the reader decide if it's the desired one. Leaving a blank line
+at the end of a section also visually helps spotting the end when editing it.
+
+Tabs are very convenient for indent but they do not copy-paste well. If spaces
+are used instead, it is recommended to keep the indent small (2 to 4 spaces) so
+that editing in the field doesn't become a burden with limited editors that do
+not support automatic indent.
+
+In the early days it used to be common to see arguments split at fixed tab
+positions because most keywords would not take more than two arguments. With
+modern versions featuring complex expressions this practice does not stand
+anymore, and is not recommended.
+
2.2. Quoting and escaping
-------------------------
-HAProxy's configuration introduces a quoting and escaping system similar to
-many programming languages. The configuration file supports 3 types: escaping
-with a backslash, weak quoting with double quotes, and strong quoting with
-single quotes.
+In modern configurations, some arguments require the use of some characters
+that were previously considered as pure delimiters. In order to make this
+possible, HAProxy supports character escaping by prepending a backslash ('\')
+in front of the character to be escaped, weak quoting within double quotes
+('"') and strong quoting within single quotes ("'").
-If spaces have to be entered in strings, then they must be escaped by preceding
-them by a backslash ('\') or by quoting them. Backslashes also have to be
-escaped by doubling or strong quoting them.
+This is pretty similar to what is done in a number of programming languages and
+very close to what is commonly encountered in Bourne shell. The principle is
+the following: while the configuration parser cuts the lines into words, it
+also takes care of quotes and backslashes to decide whether a character is a
+delimiter or is the raw representation of this character within the current
+word. The escape character is then removed, the quotes are removed, and the
+remaining word is used as-is as a keyword or argument for example.
-Escaping is achieved by preceding a special character by a backslash ('\'):
+If a backslash is needed in a word, it must either be escaped using itself
+(i.e. double backslash) or be strongly quoted.
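+
+As a minimal sketch, the two lines below should produce the same word ending in
+"C:\www". The "description" keyword is only used here as a convenient statement
+taking a free-form string, and the path is purely illustrative:
+
+    # both lines set the description to: path is C:\www
+    description path\ is\ C:\\www
+    description 'path is C:\www'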
+
+Escaping outside quotes is achieved by preceding a special character by a
+backslash ('\'):
\ to mark a space and differentiate it from a delimiter
\# to mark a hash and differentiate it from a comment
@@ -433,39 +542,161 @@
\' to use a single quote and differentiate it from strong quoting
\" to use a double quote and differentiate it from weak quoting
+In addition, a few non-printable characters may be emitted using their usual
+C-language representation:
+
+ \n to insert a line feed (LF, character \x0a or ASCII 10 decimal)
+ \r to insert a carriage return (CR, character \x0d or ASCII 13 decimal)
+ \t to insert a tab (character \x09 or ASCII 9 decimal)
+ \xNN to insert character having ASCII code hex NN (e.g. \x0a for LF).
+
-Weak quoting is achieved by using double quotes (""). Weak quoting prevents
-the interpretation of:
+Weak quoting is achieved by placing double quotes ("") around the character or
+sequence of characters to protect. Weak quoting prevents the interpretation
+of:
- space as a parameter separator
+ space or tab as a word separator
' single quote as a strong quoting delimiter
# hash as a comment start
-Weak quoting permits the interpretation of variables, if you want to use a non
--interpreted dollar within a double quoted string, you should escape it with a
-backslash ("\$"), it does not work outside weak quoting.
+Weak quoting permits the interpretation of environment variables (which are not
+evaluated outside of quotes) by preceding them with a dollar sign ('$'). If a
+dollar character is needed inside double quotes, it must be escaped using a
+backslash.
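+
+For example, assuming a hypothetical environment variable LOGSRV holding the
+address of a log server, the first line below lets the parser expand it while
+the second one keeps a literal dollar character inside double quotes:
+
+    # expands to e.g. 192.0.2.10:514 if LOGSRV=192.0.2.10 is in the environment
+    log "${LOGSRV}:514" local0
+
+    # the escaped dollar is kept as-is, the resulting word is: costs $5
+    description "costs \$5"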
-Interpretation of escaping and special characters are not prevented by weak
-quoting.
+Strong quoting is achieved by placing single quotes ('') around the character
+or sequence of characters to protect. Inside single quotes, nothing is
+interpreted, which makes it an efficient way to quote regular expressions.
-Strong quoting is achieved by using single quotes (''). Inside single quotes,
-nothing is interpreted, it's the efficient way to quote regexes.
+As a result, here is the matrix indicating how special characters can be
+entered in different contexts (unprintable characters are replaced with their
+name within angle brackets). Note that some characters that may only be
+represented escaped have no possible representation inside single quotes,
+hence the '-' there:
-Quoted and escaped strings are replaced in memory by their interpreted
-equivalent, it allows you to perform concatenation.
+ Character | Unquoted | Weakly quoted | Strongly quoted
+ -----------+---------------+-----------------------------+-----------------
+ <TAB> | \<TAB>, \x09 | "<TAB>", "\<TAB>", "\x09" | '<TAB>'
+ <LF> | \n, \x0a | "\n", "\x0a" | -
+ <CR> | \r, \x0d | "\r", "\x0d" | -
+ <SPC> | \<SPC>, \x20 | "<SPC>", "\<SPC>", "\x20" | '<SPC>'
+ " | \", \x22 | "\"", "\x22" | '"'
+ # | \#, \x23 | "#", "\#", "\x23" | '#'
+ $ | $, \$, \x24 | "\$", "\x24" | '$'
+ ' | \', \x27 | "'", "\'", "\x27" | -
+ \ | \\, \x5c | "\\", "\x5c" | '\'
Example:
- # those are equivalents:
+ # those are all strictly equivalent:
log-format %{+Q}o\ %t\ %s\ %{-Q}r
log-format "%{+Q}o %t %s %{-Q}r"
log-format '%{+Q}o %t %s %{-Q}r'
log-format "%{+Q}o %t"' %s %{-Q}r'
log-format "%{+Q}o %t"' %s'\ %{-Q}r
+There is one particular case where a second level of quoting or escaping may be
+necessary. Some keywords take arguments within parentheses, sometimes delimited
+by commas. These arguments are commonly integers or predefined words, but when
+they are arbitrary strings, it may be required to perform a separate level of
+escaping to disambiguate the characters that belong to the argument from the
+characters that are used to delimit the arguments themselves. A pretty common
+case is the "regsub" converter. It takes a regular expression as an argument,
+and if a closing parenthesis is needed inside, it will need to have its own
+quotes.
+
+The keyword argument parser is exactly the same as the top-level one regarding
+quotes, except that it will not make special cases of backslashes. But what is
+not always obvious is that the delimiters used inside must first be escaped or
+quoted so that they are not resolved at the top level.
+
+Let's take this example making use of the "regsub" converter which takes 3
+arguments, one regular expression, one replacement string and one set of flags:
+
+ # replace all occurrences of "foo" with "blah" in the path:
+ http-request set-path %[path,regsub(foo,blah,g)]
+
+Here no special quoting was necessary. But if now we want to replace either
+"foo" or "bar" with "blah", we'll need the regular expression "(foo|bar)". We
+cannot write:
+
+ http-request set-path %[path,regsub((foo|bar),blah,g)]
+
+because we would like the string to cut like this:
+
+ http-request set-path %[path,regsub((foo|bar),blah,g)]
+ |---------|----|-|
+ arg1 _/ / /
+ arg2 __________/ /
+ arg3 ______________/
+
+but actually what is passed is a string between the opening and closing
+parentheses, then garbage:
+
+ http-request set-path %[path,regsub((foo|bar),blah,g)]
+ |--------|--------|
+ arg1=(foo|bar _/ /
+ trailing garbage _________/
+
+The obvious solution here seems to be that the closing parenthesis needs to be
+quoted, but alone this will not work, because as mentioned above, quotes are
+processed by the top-level parser which will resolve them before processing
+this word:
+
+ http-request set-path %[path,regsub("(foo|bar)",blah,g)]
+ ------------ -------- ----------------------------------
+ word1 word2 word3=%[path,regsub((foo|bar),blah,g)]
+
+So we didn't change anything for the argument parser at the second level which
+still sees a truncated regular expression as the only argument, and garbage at
+the end of the string. By escaping the quotes they will be passed unmodified to
+the second level:
+
+ http-request set-path %[path,regsub(\"(foo|bar)\",blah,g)]
+ ------------ -------- ------------------------------------
+ word1 word2 word3=%[path,regsub("(foo|bar)",blah,g)]
+ |---------||----|-|
+ arg1=(foo|bar) _/ / /
+ arg2=blah ___________/ /
+ arg3=g _______________/
+
+Another approach consists in using single quotes outside the whole string and
+double quotes inside (so that the double quotes are not stripped again):
+
+ http-request set-path '%[path,regsub("(foo|bar)",blah,g)]'
+ ------------ -------- ----------------------------------
+ word1 word2 word3=%[path,regsub("(foo|bar)",blah,g)]
+ |---------||----|-|
+ arg1=(foo|bar) _/ / /
+ arg2 ___________/ /
+ arg3 _______________/
+
+When using regular expressions, it can happen that the dollar ('$') character
+appears in the expression or that a backslash ('\') is used in the replacement
+string. In this case these will also be processed inside double quotes, thus
+single quotes are preferred (or double escaping). Example:
+
+ http-request set-path '%[path,regsub("^/(here)(/|$)","my/\1",g)]'
+ ------------ -------- -----------------------------------------
+ word1 word2 word3=%[path,regsub("^/(here)(/|$)","my/\1",g)]
+ |-------------| |-----||-|
+ arg1=^/(here)(/|$) _/ / /
+ arg2=my/\1 ________________/ /
+ arg3 ______________________/
+
+Remember that backslashes are not escape characters within single quotes and
+that the whole word3 above is already protected against them using the single
+quotes. Conversely, if double quotes had been used around the whole expression,
+the dollar character and the backslashes would have been resolved at top
+level, breaking the argument contents at the second level.
+
- # those are equivalents:
- reqrep "^([^\ :]*)\ /static/(.*)" \1\ /\2
- reqrep "^([^ :]*)\ /static/(.*)" '\1 /\2'
- reqrep "^([^ :]*)\ /static/(.*)" "\1 /\2"
- reqrep "^([^ :]*)\ /static/(.*)" "\1\ /\2"
+When in doubt, simply do not use quotes anywhere, and start to place single or
+double quotes around arguments that require a comma or a closing parenthesis,
+and think about escaping these quotes using a backslash if the string contains
+a dollar or a backslash. Again, this is pretty similar to what is used under
+a Bourne shell when double-escaping a command passed to "eval". For API writers
+the best is probably to place escaped quotes around each and every argument,
+regardless of their contents. Users will probably find that using single quotes
+around the whole expression and double quotes around each argument provides
+more readable configurations.
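+
+As a sketch of these two styles applied to the "regsub" example used earlier
+(quoting the "blah" and "g" arguments is not required here and is only done to
+illustrate systematic quoting), both lines below are expected to pass the same
+three arguments to the converter:
+
+    # systematic style for generated configs: escaped quotes on every argument
+    http-request set-path %[path,regsub(\"(foo|bar)\",\"blah\",\"g\")]
+
+    # hand-written style: single quotes outside, double quotes inside
+    http-request set-path '%[path,regsub("(foo|bar)","blah","g")]'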
2.3. Environment variables