MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs

There finally is a way to support register pairs on aarch64 assembly
under gcc, it's just undocumented, like many of the options there :-(
As indicated below, it's possible to pass "%H" to mention the high
part of a register pair (e.g. "%H0" to go with "%0"):

  https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/

By making local variables from pairs of registers via a struct (as
is used in IST for example), we can let gcc choose the correct register
pairs and avoid a few moves in certain situations. The code is now slightly
more efficient than the previous one on AWS' Graviton2 platform, and
noticeably smaller (by 4.5kB approx). A few tests on older releases show
that even Linaro's gcc-4.7 used to support such register pairs and %H,
and by then ATOMICS were not supported so this should not cause build
issues, and as such this patch replaces the earlier implementation.

(cherry picked from commit f8fb4f75f147053fc939ab9bcc456c0e4e5b28a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>
1 file changed