doc/internals/http-parsing.txt - haproxy - Gitiles

 --- Relevant portions of RFC2616 ---

 OCTET               = <any 8-bit sequence of data>
 CHAR                = <any US-ASCII character (octets 0 - 127)>
 UPALPHA             = <any US-ASCII uppercase letter "A".."Z">
 LOALPHA             = <any US-ASCII lowercase letter "a".."z">
 ALPHA               = UPALPHA | LOALPHA
 DIGIT               = <any US-ASCII digit "0".."9">
 CTL                 = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
 CR                  = <US-ASCII CR, carriage return (13)>
 LF                  = <US-ASCII LF, linefeed (10)>
 SP                  = <US-ASCII SP, space (32)>
 HT                  = <US-ASCII HT, horizontal-tab (9)>
 <">                 = <US-ASCII double-quote mark (34)>
 CRLF                = CR LF
 LWS                 = [CRLF] 1*( SP | HT )
 TEXT                = <any OCTET except CTLs, but including LWS>
 HEX                 = "A" | "B" | "C" | "D" | "E" | "F"
                       | "a" | "b" | "c" | "d" | "e" | "f" | DIGIT
 separators          = "(" | ")" | "<" | ">" | "@"
                     | "," | ";" | ":" | "\" | <">
                     | "/" | "[" | "]" | "?" | "="
                     | "{" | "}" | SP | HT
 token               = 1*<any CHAR except CTLs or separators>

 quoted-pair         = "\" CHAR
 ctext               = <any TEXT excluding "(" and ")">
 qdtext              = <any TEXT except <">>
 quoted-string       = ( <"> *(qdtext | quoted-pair ) <"> )
 comment             = "(" *( ctext | quoted-pair | comment ) ")"


 4 HTTP Message
 4.1 Message Types

 HTTP messages consist of requests from client to server and responses from
 server to client. Request (section 5) and Response (section 6) messages use the
 generic message format of RFC 822 [9] for transferring entities (the payload of
 the message). Both types of message consist of :

   - a start-line
   - zero or more header fields (also known as "headers")
   - an empty line (i.e., a line with nothing preceding the CRLF) indicating the
     end of the header fields
   - and possibly a message-body.


 HTTP-message        = Request | Response

 start-line          = Request-Line | Status-Line
 generic-message     = start-line
                       *(message-header CRLF)
                       CRLF
                       [ message-body ]

 In the interest of robustness, servers SHOULD ignore any empty line(s) received
 where a Request-Line is expected. In other words, if the server is reading the
 protocol stream at the beginning of a message and receives a CRLF first, it
 should ignore the CRLF.


 4.2 Message headers

 - Each header field consists of a name followed by a colon (":") and the field
   value.
 - Field names are case-insensitive.
 - The field value MAY be preceded by any amount of LWS, though a single SP is
   preferred.
 - Header fields can be extended over multiple lines by preceding each extra
   line with at least one SP or HT.


 message-header      = field-name ":" [ field-value ]
 field-name          = token
 field-value         = *( field-content | LWS )
 field-content       = <the OCTETs making up the field-value and consisting of
                        either *TEXT or combinations of token, separators, and
                        quoted-string>


 The field-content does not include any leading or trailing LWS occurring before
 the first non-whitespace character of the field-value or after the last
 non-whitespace character of the field-value. Such leading or trailing LWS MAY
 be removed without changing the semantics of the field value. Any LWS that
 occurs between field-content MAY be replaced with a single SP before
 interpreting the field value or forwarding the message downstream.


 => format des headers = 1*(CHAR & !ctl & !sep) ":" *(OCTET & (!ctl | LWS))
 => les regex de matching de headers s'appliquent sur field-content, et peuvent
    utiliser field-value comme espace de travail (mais de préférence après le
    premier SP).

 (19.3) The line terminator for message-header fields is the sequence CRLF.
 However, we recommend that applications, when parsing such headers, recognize
 a single LF as a line terminator and ignore the leading CR.


 message-body    = entity-body
                 | <entity-body encoded as per Transfer-Encoding>


 5 Request

 Request         = Request-Line
                   *(( general-header
                     | request-header
                     | entity-header ) CRLF)
                   CRLF
                   [ message-body ]


 5.1 Request line

 The elements are separated by SP characters. No CR or LF is allowed except in
 the final CRLF sequence.

 Request-Line = Method SP Request-URI SP HTTP-Version CRLF

 (19.3) Clients SHOULD be tolerant in parsing the Status-Line and servers
 tolerant when parsing the Request-Line. In particular, they SHOULD accept any
 amount of SP or HT characters between fields, even though only a single SP is
 required.

 4.5 General headers
 Apply to MESSAGE.

 general-header  = Cache-Control
                 | Connection
                 | Date
                 | Pragma
                 | Trailer
                 | Transfer-Encoding
                 | Upgrade
                 | Via
                 | Warning

 General-header field names can be extended reliably only in combination with a
 change in the protocol version. However, new or experimental header fields may
 be given the semantics of general header fields if all parties in the
 communication recognize them to be general-header fields. Unrecognized header
 fields are treated as entity-header fields.


 5.3 Request Header Fields

 The request-header fields allow the client to pass additional information about
 the request, and about the client itself, to the server. These fields act as
 request modifiers, with semantics equivalent to the parameters on a programming
 language method invocation.

 request-header  = Accept
                 | Accept-Charset
                 | Accept-Encoding
                 | Accept-Language
                 | Authorization
                 | Expect
                 | From
                 | Host
                 | If-Match
                 | If-Modified-Since
                 | If-None-Match
                 | If-Range
                 | If-Unmodified-Since
                 | Max-Forwards
                 | Proxy-Authorization
                 | Range
                 | Referer
                 | TE
                 | User-Agent

 Request-header field names can be extended reliably only in combination with a
 change in the protocol version. However, new or experimental header fields MAY
 be given the semantics of request-header fields if all parties in the
 communication recognize them to be request-header fields. Unrecognized header
 fields are treated as entity-header fields.


 7.1 Entity header fields

 Entity-header fields define metainformation about the entity-body or, if no
 body is present, about the resource identified by the request. Some of this
 metainformation is OPTIONAL; some might be REQUIRED by portions of this
 specification.

 entity-header   = Allow
                 | Content-Encoding
                 | Content-Language
                 | Content-Length
                 | Content-Location
                 | Content-MD5
                 | Content-Range
                 | Content-Type
                 | Expires
                 | Last-Modified
                 | extension-header
 extension-header = message-header

 The extension-header mechanism allows additional entity-header fields to be
 defined without changing the protocol, but these fields cannot be assumed to be
 recognizable by the recipient. Unrecognized header fields SHOULD be ignored by
 the recipient and MUST be forwarded by transparent proxies.

 ----------------------------------

 The format of Request-URI is defined by RFC3986 :

    URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

    hier-part     = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

    URI-reference = URI / relative-ref

    absolute-URI  = scheme ":" hier-part [ "?" query ]

    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty

    scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

    authority     = [ userinfo "@" ] host [ ":" port ]
    userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
    host          = IP-literal / IPv4address / reg-name
    port          = *DIGIT

    IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"

    IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

    IPv6address   =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

    h16           = 1*4HEXDIG
    ls32          = ( h16 ":" h16 ) / IPv4address
    IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet
    dec-octet     = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

    reg-name      = *( unreserved / pct-encoded / sub-delims )

    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"

    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

    query         = *( pchar / "/" / "?" )

    fragment      = *( pchar / "/" / "?" )

    pct-encoded   = "%" HEXDIG HEXDIG

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

 => so the list of allowed characters in a URI is :

    uri-char      = unreserved / gen-delims / sub-delims / "%"
                  = ALPHA / DIGIT / "-" / "." / "_" / "~"
                  / ":" / "/" / "?" / "#" / "[" / "]" / "@"
                  / "!" / "$" / "&" / "'" / "(" / ")" /
                  / "*" / "+" / "," / ";" / "=" / "%"

 Note that non-ascii characters are forbidden ! Spaces and CTL are forbidden.
 Unfortunately, some products such as Apache allow such characters :-/

 ---- The correct way to do it ----

 - one http_session
   It is basically any transport session on which we talk HTTP. It may be TCP,
   SSL over TCP, etc... It knows a way to talk to the client, either the socket
   file descriptor or a direct access to the client-side buffer. It should hold
   information about the last accessed server so that we can guarantee that the
   same server can be used during a whole session if needed. A first version
   without optimal support for HTTP pipelining will have the client buffers tied
   to the http_session. It may be possible that it is not sufficient for full
   pipelining, but this will need further study. The link from the buffers to
   the backend should be managed by the http transaction (http_txn), provided
   that they are serialized. Each http_session, has 0 to N http_txn. Each
   http_txn belongs to one and only one http_session.

 - each http_txn has 1 request message (http_req), and 0 or 1 response message
   (http_rtr). Each of them has 1 and only one http_txn. An http_txn holds
   information such as the HTTP method, the URI, the HTTP version, the
   transfer-encoding, the HTTP status, the authorization, the req and rtr
   content-length, the timers, logs, etc... The backend and server which process
   the request are also known from the http_txn.

 - both request and response messages hold header and parsing information, such
   as the parsing state, start of headers, start of message, captures, etc...
	--- Relevant portions of RFC2616 ---

	OCTET = <any 8-bit sequence of data>
	CHAR = <any US-ASCII character (octets 0 - 127)>
	UPALPHA = <any US-ASCII uppercase letter "A".."Z">
	LOALPHA = <any US-ASCII lowercase letter "a".."z">
	ALPHA = UPALPHA \| LOALPHA
	DIGIT = <any US-ASCII digit "0".."9">
	CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
	CR = <US-ASCII CR, carriage return (13)>
	LF = <US-ASCII LF, linefeed (10)>
	SP = <US-ASCII SP, space (32)>
	HT = <US-ASCII HT, horizontal-tab (9)>
	<"> = <US-ASCII double-quote mark (34)>
	CRLF = CR LF
	LWS = [CRLF] 1*( SP \| HT )
	TEXT = <any OCTET except CTLs, but including LWS>
	HEX = "A" \| "B" \| "C" \| "D" \| "E" \| "F"
	\| "a" \| "b" \| "c" \| "d" \| "e" \| "f" \| DIGIT
	separators = "(" \| ")" \| "<" \| ">" \| "@"
	\| "," \| ";" \| ":" \| "\" \| <">
	\| "/" \| "[" \| "]" \| "?" \| "="
	\| "{" \| "}" \| SP \| HT
	token = 1*<any CHAR except CTLs or separators>

	quoted-pair = "\" CHAR
	ctext = <any TEXT excluding "(" and ")">
	qdtext = <any TEXT except <">>
	quoted-string = ( <"> *(qdtext \| quoted-pair ) <"> )
	comment = "(" *( ctext \| quoted-pair \| comment ) ")"





	4 HTTP Message
	4.1 Message Types

	HTTP messages consist of requests from client to server and responses from
	server to client. Request (section 5) and Response (section 6) messages use the
	generic message format of RFC 822 [9] for transferring entities (the payload of
	the message). Both types of message consist of :

	- a start-line
	- zero or more header fields (also known as "headers")
	- an empty line (i.e., a line with nothing preceding the CRLF) indicating the
	end of the header fields
	- and possibly a message-body.


	HTTP-message = Request \| Response

	start-line = Request-Line \| Status-Line
	generic-message = start-line
	*(message-header CRLF)
	CRLF
	[ message-body ]

	In the interest of robustness, servers SHOULD ignore any empty line(s) received
	where a Request-Line is expected. In other words, if the server is reading the
	protocol stream at the beginning of a message and receives a CRLF first, it
	should ignore the CRLF.


	4.2 Message headers

	- Each header field consists of a name followed by a colon (":") and the field
	value.
	- Field names are case-insensitive.
	- The field value MAY be preceded by any amount of LWS, though a single SP is
	preferred.
	- Header fields can be extended over multiple lines by preceding each extra
	line with at least one SP or HT.


	message-header = field-name ":" [ field-value ]
	field-name = token
	field-value = *( field-content \| LWS )
	field-content = <the OCTETs making up the field-value and consisting of
	either *TEXT or combinations of token, separators, and
	quoted-string>


	The field-content does not include any leading or trailing LWS occurring before
	the first non-whitespace character of the field-value or after the last
	non-whitespace character of the field-value. Such leading or trailing LWS MAY
	be removed without changing the semantics of the field value. Any LWS that
	occurs between field-content MAY be replaced with a single SP before
	interpreting the field value or forwarding the message downstream.


	=> format des headers = 1(CHAR & !ctl & !sep) ":" (OCTET & (!ctl \| LWS))
	=> les regex de matching de headers s'appliquent sur field-content, et peuvent
	utiliser field-value comme espace de travail (mais de préférence après le
	premier SP).

	(19.3) The line terminator for message-header fields is the sequence CRLF.
	However, we recommend that applications, when parsing such headers, recognize
	a single LF as a line terminator and ignore the leading CR.





	message-body = entity-body
	\| <entity-body encoded as per Transfer-Encoding>



	5 Request

	Request = Request-Line
	*(( general-header
	\| request-header
	\| entity-header ) CRLF)
	CRLF
	[ message-body ]



	5.1 Request line

	The elements are separated by SP characters. No CR or LF is allowed except in
	the final CRLF sequence.

	Request-Line = Method SP Request-URI SP HTTP-Version CRLF

	(19.3) Clients SHOULD be tolerant in parsing the Status-Line and servers
	tolerant when parsing the Request-Line. In particular, they SHOULD accept any
	amount of SP or HT characters between fields, even though only a single SP is
	required.

	4.5 General headers
	Apply to MESSAGE.

	general-header = Cache-Control
	\| Connection
	\| Date
	\| Pragma
	\| Trailer
	\| Transfer-Encoding
	\| Upgrade
	\| Via
	\| Warning

	General-header field names can be extended reliably only in combination with a
	change in the protocol version. However, new or experimental header fields may
	be given the semantics of general header fields if all parties in the
	communication recognize them to be general-header fields. Unrecognized header
	fields are treated as entity-header fields.




	5.3 Request Header Fields

	The request-header fields allow the client to pass additional information about
	the request, and about the client itself, to the server. These fields act as
	request modifiers, with semantics equivalent to the parameters on a programming
	language method invocation.

	request-header = Accept
	\| Accept-Charset
	\| Accept-Encoding
	\| Accept-Language
	\| Authorization
	\| Expect
	\| From
	\| Host
	\| If-Match
	\| If-Modified-Since
	\| If-None-Match
	\| If-Range
	\| If-Unmodified-Since
	\| Max-Forwards
	\| Proxy-Authorization
	\| Range
	\| Referer
	\| TE
	\| User-Agent

	Request-header field names can be extended reliably only in combination with a
	change in the protocol version. However, new or experimental header fields MAY
	be given the semantics of request-header fields if all parties in the
	communication recognize them to be request-header fields. Unrecognized header
	fields are treated as entity-header fields.



	7.1 Entity header fields

	Entity-header fields define metainformation about the entity-body or, if no
	body is present, about the resource identified by the request. Some of this
	metainformation is OPTIONAL; some might be REQUIRED by portions of this
	specification.

	entity-header = Allow
	\| Content-Encoding
	\| Content-Language
	\| Content-Length
	\| Content-Location
	\| Content-MD5
	\| Content-Range
	\| Content-Type
	\| Expires
	\| Last-Modified
	\| extension-header
	extension-header = message-header

	The extension-header mechanism allows additional entity-header fields to be
	defined without changing the protocol, but these fields cannot be assumed to be
	recognizable by the recipient. Unrecognized header fields SHOULD be ignored by
	the recipient and MUST be forwarded by transparent proxies.

	----------------------------------

	The format of Request-URI is defined by RFC3986 :

	URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

	hier-part = "//" authority path-abempty
	/ path-absolute
	/ path-rootless
	/ path-empty

	URI-reference = URI / relative-ref

	absolute-URI = scheme ":" hier-part [ "?" query ]

	relative-ref = relative-part [ "?" query ] [ "#" fragment ]

	relative-part = "//" authority path-abempty
	/ path-absolute
	/ path-noscheme
	/ path-empty

	scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

	authority = [ userinfo "@" ] host [ ":" port ]
	userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
	host = IP-literal / IPv4address / reg-name
	port = *DIGIT

	IP-literal = "[" ( IPv6address / IPvFuture ) "]"

	IPvFuture = "v" 1HEXDIG "." 1( unreserved / sub-delims / ":" )

	IPv6address = 6( h16 ":" ) ls32
	/ "::" 5( h16 ":" ) ls32
	/ [ h16 ] "::" 4( h16 ":" ) ls32
	/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
	/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
	/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
	/ [ *4( h16 ":" ) h16 ] "::" ls32
	/ [ *5( h16 ":" ) h16 ] "::" h16
	/ [ *6( h16 ":" ) h16 ] "::"

	h16 = 1*4HEXDIG
	ls32 = ( h16 ":" h16 ) / IPv4address
	IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
	dec-octet = DIGIT ; 0-9
	/ %x31-39 DIGIT ; 10-99
	/ "1" 2DIGIT ; 100-199
	/ "2" %x30-34 DIGIT ; 200-249
	/ "25" %x30-35 ; 250-255

	reg-name = *( unreserved / pct-encoded / sub-delims )

	path = path-abempty ; begins with "/" or is empty
	/ path-absolute ; begins with "/" but not "//"
	/ path-noscheme ; begins with a non-colon segment
	/ path-rootless ; begins with a segment
	/ path-empty ; zero characters

	path-abempty = *( "/" segment )
	path-absolute = "/" [ segment-nz *( "/" segment ) ]
	path-noscheme = segment-nz-nc *( "/" segment )
	path-rootless = segment-nz *( "/" segment )
	path-empty = 0<pchar>

	segment = *pchar
	segment-nz = 1*pchar
	segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
	; non-zero-length segment without any colon ":"

	pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

	query = *( pchar / "/" / "?" )

	fragment = *( pchar / "/" / "?" )

	pct-encoded = "%" HEXDIG HEXDIG

	unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
	reserved = gen-delims / sub-delims
	gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
	sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
	/ "*" / "+" / "," / ";" / "="

	=> so the list of allowed characters in a URI is :

	uri-char = unreserved / gen-delims / sub-delims / "%"
	= ALPHA / DIGIT / "-" / "." / "_" / "~"
	/ ":" / "/" / "?" / "#" / "[" / "]" / "@"
	/ "!" / "$" / "&" / "'" / "(" / ")" /
	/ "*" / "+" / "," / ";" / "=" / "%"

	Note that non-ascii characters are forbidden ! Spaces and CTL are forbidden.
	Unfortunately, some products such as Apache allow such characters :-/

	---- The correct way to do it ----

	- one http_session
	It is basically any transport session on which we talk HTTP. It may be TCP,
	SSL over TCP, etc... It knows a way to talk to the client, either the socket
	file descriptor or a direct access to the client-side buffer. It should hold
	information about the last accessed server so that we can guarantee that the
	same server can be used during a whole session if needed. A first version
	without optimal support for HTTP pipelining will have the client buffers tied
	to the http_session. It may be possible that it is not sufficient for full
	pipelining, but this will need further study. The link from the buffers to
	the backend should be managed by the http transaction (http_txn), provided
	that they are serialized. Each http_session, has 0 to N http_txn. Each
	http_txn belongs to one and only one http_session.

	- each http_txn has 1 request message (http_req), and 0 or 1 response message
	(http_rtr). Each of them has 1 and only one http_txn. An http_txn holds
	information such as the HTTP method, the URI, the HTTP version, the
	transfer-encoding, the HTTP status, the authorization, the req and rtr
	content-length, the timers, logs, etc... The backend and server which process
	the request are also known from the http_txn.

	- both request and response messages hold header and parsing information, such
	as the parsing state, start of headers, start of message, captures, etc...