OPTIM: task: limit the impact of memory barriers in task_remove_from_task_list()

In this function we end up with successive locked operations followed by
a store barrier, and in addition the compiler has to emit less efficient
code due to a longer jump. There is no strict need to update the
tasks_run_queue counter before clearing the task's leaf pointer, so
let's swap the two operations and benefit from a single barrier as much
as possible. This code is on the hot path and shows about half a percent
of improvement with 8 threads.
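
For illustration only, the reordering can be sketched with portable C11
atomics instead of the HA_ATOMIC_* macros. Everything named demo_* below
is hypothetical and not part of the tree, and the release fence is only
an approximation of __ha_barrier_store(). The point is that in the new
order the pointer store is directly followed by the locked decrement, so
no separate barrier is needed after it.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the task's leaf pointer and the global counter. */
struct demo_task {
	_Atomic(void *) leaf_p;
	int is_tasklet;
};

static atomic_uint demo_run_queue;

/* Old order: locked decrement first, then a plain store followed by an
 * explicit barrier (approximated here with a release fence).
 */
static void demo_remove_old(struct demo_task *t)
{
	atomic_fetch_sub(&demo_run_queue, 1);
	if (!t->is_tasklet) {
		atomic_store_explicit(&t->leaf_p, NULL, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
	}
}

/* New order: atomic store of the pointer first, immediately followed by
 * the locked decrement, with no extra fence in between or after.
 */
static void demo_remove_new(struct demo_task *t)
{
	if (!t->is_tasklet)
		atomic_store(&t->leaf_p, NULL);
	atomic_fetch_sub(&demo_run_queue, 1);
}
```

Both variants leave the same final state (pointer cleared for real tasks,
counter decremented by one). Only the order of the two updates and the
number of barriers differ.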
diff --git a/include/proto/task.h b/include/proto/task.h
index c90a369..f11b445 100644
--- a/include/proto/task.h
+++ b/include/proto/task.h
@@ -273,11 +273,9 @@
 {
 	LIST_DEL_INIT(&((struct tasklet *)t)->list);
 	task_per_thread[tid].task_list_size--;
+	if (!TASK_IS_TASKLET(t))
+		HA_ATOMIC_STORE(&t->rq.node.leaf_p, NULL); // was 0x1
 	HA_ATOMIC_SUB(&tasks_run_queue, 1);
-	if (!TASK_IS_TASKLET(t)) {
-		t->rq.node.leaf_p = NULL; // was 0x1
-		__ha_barrier_store();
-	}
 }
 
 /*