OPTIM: htx: inline the most common memcpy(8)

On high traffic benchmarks, it's visible that the CPU is dominated by
calls to memcpy(), and many of those come from htx functions. It was
measured that 63% of those coming from htx are made on 8-byte blocks
which really are not worth a call to the function since a single
read-write cycle does it fine.

This commit adds an inline htx_memcpy() function that explicitly
checks for this length and just copies the data without that call.
It's even likely that the 8-byte case could be detected at build time
when the size is a constant, though that was not done. This change
alone is already effective in reducing the number of calls to memcpy().
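
For illustration only, the idea can be sketched as below. This is not
the code from the patch: the signature and the way the 8-byte copy is
done are assumptions. The sketch relies on compilers usually turning a
fixed-size 8-byte memcpy() into a single load and a single store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of an inline wrapper; the signature is assumed
 * to mirror memcpy() and is not taken from the actual patch.
 */
static inline void *htx_memcpy(void *dst, const void *src, size_t len)
{
	if (len == 8) {
		uint64_t v;

		/* Constant-size copies: typically compiled to one 64-bit
		 * load and one 64-bit store, and safe on unaligned data.
		 */
		memcpy(&v, src, sizeof(v));
		memcpy(dst, &v, sizeof(v));
		return dst;
	}
	/* Any other length still goes through the regular function. */
	return memcpy(dst, src, len);
}
```

Callers would use it exactly like memcpy(), e.g.
htx_memcpy(dst, src, len), so only the 8-byte case changes path.
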
1 file changed