BUG/MINOR: xxhash: make sure armv6 uses memcpy()

There was a special case made to allow ARMv6 to use unaligned accesses
via a cast in xxHash when __ARM_FEATURE_UNALIGNED is defined. But while
ARMv6 (and v7) does support unaligned accesses, it's only for 32-bit
pointers, not 64-bit ones, leading to bus errors when the compiler emits
an ldrd instruction and the input (e.g. a pattern) is not aligned, as in
issue #1035.

Note that v7 was properly using the packed approach here and was safe,
however haproxy versions 2.3 and older use the old r39 xxhash code which
has the same issue for armv7. A slightly different fix is required there,
by using a different definition of packed for 32 and 64 bits.

The problem is really visible when running v7 code on a v8 kernel because
such kernels do not implement alignment trap emulation, and the process
dies when this happens. This is why in the issue above it was only detected
under lxc. The emulation could have been disabled on v7 as well by writing
zero to /proc/cpu/alignment though.

This commit is a backport of xxhash commit a470f2ef ("update default memory
access for armv6").

Thanks to @srkunze for the report and tests, @stgraber for his help on
setting up an easy reproducer outside of lxc, and @Cyan4973 for the
discussion around the best way to fix this. Details and alternate patches
available on https://github.com/Cyan4973/xxHash/issues/490.
diff --git a/include/import/xxhash.h b/include/import/xxhash.h
index 826942f..24326bc 100644
--- a/include/import/xxhash.h
+++ b/include/import/xxhash.h
@@ -791,15 +791,15 @@
  * Method 1:
  *     `__attribute__((packed))` statement. It depends on compiler extensions
  *     and is therefore not portable.
- *     This method is safe if your compiler supports it, and *generally* as
- *     fast or faster than `memcpy`.
+ *     This method is safe _if_ your compiler supports it,
+ *     and *generally* as fast or faster than `memcpy`.
  * Method 2:
  *     Direct access via cast. This method doesn't depend on the compiler but
  *     violates the C standard.
  *     It can generate buggy code on targets which do not support unaligned
  *     memory accesses.
  *     But in some circumstances, it's the only known way to get the most
- *     performance (example: GCC + ARMv6)
+ *     performance.
  * Method 3:
  *     Byteshift. This can generate the best code on old compilers which don't
  *     inline small `memcpy()` calls, and it might also be faster on big-endian
@@ -808,10 +808,10 @@
  * Prefer these methods in priority order (0 > 1 > 2 > 3)
  */
 #ifndef XXH_FORCE_MEMORY_ACCESS   /* can be defined externally, on command line for example */
-#  if !defined(__clang__) && defined(__GNUC__) && defined(__ARM_FEATURE_UNALIGNED) && defined(__ARM_ARCH) && (__ARM_ARCH == 6)
-#    define XXH_FORCE_MEMORY_ACCESS 2
-#  elif !defined(__clang__) && ((defined(__INTEL_COMPILER) && !defined(_WIN32)) || \
-  (defined(__GNUC__) && (defined(__ARM_ARCH) && __ARM_ARCH >= 7)))
+   /* prefer __packed__ structures (method 1) for gcc on armv7 and armv8 */
+#  if !defined(__clang__) && ( \
+    (defined(__INTEL_COMPILER) && !defined(_WIN32)) || \
+    (defined(__GNUC__) && (defined(__ARM_ARCH) && __ARM_ARCH >= 7)) )
 #    define XXH_FORCE_MEMORY_ACCESS 1
 #  endif
 #endif