BUG/MINOR: sched: properly account for the CPU time of dying tasks
When task profiling is enabled, the scheduler can measure and report
the cumulative time spent in each task and their respective latencies. But
this was wrong for tasks with few wakeups as well as for self-waking ones,
because the call date needed to measure how long it takes to process the
task was retrieved from the task itself (->wake_date was turned into the
call date), and we could face two conditions:
- a new wakeup while the task is executing would reset the ->wake_date
field before returning, causing abnormally low values to be reported;
that was likely the case for task_run_applet with self-waking applets;
- when the task dies, NULL is returned and the call date can no longer
be retrieved, so the CPU time is simply not accounted for. This was
particularly visible with process_stream(), which is usually called
only twice per request, and whose time was systematically halved (a
toy model of this case is sketched below).
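To make the second case concrete, here is a small standalone toy model
(hypothetical code, not taken from the tree) of the old pattern: once
the task frees itself, the call date stored inside it is lost and the
CPU time is silently dropped:

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <time.h>

  struct task {
          uint64_t wake_date;      /* set on wakeup, reused as call date */
          struct task *(*process)(struct task *t);
  };

  static uint64_t now_mono_time(void)
  {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  /* a dying task: frees itself and returns NULL, like process_stream()
   * at the end of a stream
   */
  static struct task *dying_process(struct task *t)
  {
          free(t);
          return NULL;
  }

  int main(void)
  {
          struct task *t = calloc(1, sizeof(*t));
          uint64_t cpu = 0;

          t->process = dying_process;
          t->wake_date = now_mono_time();  /* reused as the call date */
          t = t->process(t);

          /* the call date died with the task: nothing gets accounted */
          if (t && t->wake_date)
                  cpu += now_mono_time() - t->wake_date;

          printf("accounted cpu: %llu ns\n", (unsigned long long)cpu);
          return 0;
  }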
The cleanest solution here is to keep in mind that the scheduler already
uses quite a bit of local context in th_ctx, and to place the intermediary
values there so that they cannot vanish. The wake_date has to be reset
as soon as it is read, and only its copy is used for the rest of the
function. Note that this must be done both for tasks and tasklets, and
that until recently tasklets could also report wrong values because they
relied solely on TH_FL_TASK_PROFILING, which may change between the two
tests.
One nice benefit for future improvements is that such information will now
be available from within the task without having to be stored in the task
itself anymore.
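For readability, here is the new pattern, condensed from the hunks below
(the surrounding scheduler logic is elided and the handler call is
simplified; variable names are as in the patch):

  th_ctx->sched_wake_date = t->wake_date;
  if (unlikely(th_ctx->sched_wake_date)) {
          uint32_t now_ns = now_mono_time();

          t->wake_date = 0;                  /* reset as soon as read */
          th_ctx->sched_call_date = now_ns;  /* private copy, cannot be
                                              * overwritten by a wakeup */
          profile_entry = sched_activity_entry(sched_activity, t->process);
          HA_ATOMIC_ADD(&profile_entry->lat_time,
                        now_ns - th_ctx->sched_wake_date);
          HA_ATOMIC_INC(&profile_entry->calls);
  }

  t = process(t, ctx, state);                /* the task's handler; may
                                              * return NULL if it died */

  if (unlikely(th_ctx->sched_wake_date))     /* still valid if t == NULL */
          HA_ATOMIC_ADD(&profile_entry->cpu_time,
                        (uint32_t)now_mono_time() - th_ctx->sched_call_date);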
Since the tasklet part was computed using wrapping 32-bit arithmetic while
the task part used 64-bit, the values have now been consistently moved to
32-bit, as this is already largely sufficient (4s spent in a task is more
than twice what the watchdog would tolerate; see the illustration below).
Some further cleanups might be necessary, but the patch aimed at staying
minimal.
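As a reminder of why wrapping 32-bit arithmetic is safe here (standalone
toy example, not from the tree): unsigned subtraction is exact modulo
2^32, so the delta between two truncated nanosecond timestamps remains
correct as long as the measured interval stays below about 4.29 seconds:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint32_t start = 0xfffffff0u;  /* just before the 32-bit wrap */
          uint32_t end   = start + 100;  /* wraps past 2^32 */
          uint32_t delta = end - start;  /* still exactly 100 */

          printf("delta=%u\n", delta);   /* prints: delta=100 */
          return 0;
  }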
Task profiling output after 1 million HTTP requests previously looked like
this:
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2012338   4.850s    2.410us   12.91s    6.417us
    process_stream              2000136   9.594s    4.796us   34.26s    17.13us
    sc_conn_io_cb               2000135   1.973s    986.0ns   30.24s    15.12us
    h1_timeout_task                 137         -         -   2.649ms   19.34us
    accept_queue_process             49   152.3us   3.107us   321.7yr   6.564yr
    main+0x146430                     7   5.250us   750.0ns   25.92us   3.702us
    srv_cleanup_idle_conns            1   559.0ns   559.0ns   918.0ns   918.0ns
    task_run_applet                   1         -         -   2.162us   2.162us
Now it looks like this:
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2014194   4.794s    2.380us   13.75s    6.826us
    process_stream              2000151   20.01s    10.00us   36.04s    18.02us
    sc_conn_io_cb               2000148   2.167s    1.083us   32.27s    16.13us
    h1_timeout_task                 198   54.24us   273.0ns   3.487ms   17.61us
    accept_queue_process             52   158.3us   3.044us   409.9us   7.882us
    main+0x1466e0                    18   16.77us   931.0ns   63.98us   3.554us
    srv_cleanup_toremove_conns        8   282.1us   35.26us   546.8us   68.35us
    srv_cleanup_idle_conns            3   149.2us   49.73us   8.131us   2.710us
    task_run_applet                   3   268.1us   89.38us   11.61us   3.871us
Note the two-fold difference on process_stream().
This feature is essentially used for debugging, so it has extremely limited
impact. However it's used quite a bit in bug reports, and it would be
desirable for at least 2.6 to get this fix backported. It depends on at
least these two previous patches, which will then also have to be
backported:
MINOR: task: permanently enable latency measurement on tasklets
CLEANUP: task: rename ->call_date to ->wake_date
diff --git a/src/task.c b/src/task.c
index 44aac72..5cebbbb 100644
--- a/src/task.c
+++ b/src/task.c
@@ -564,19 +564,19 @@
_HA_ATOMIC_DEC(&th_ctx->rq_total);
if (t->state & TASK_F_TASKLET) {
- uint64_t before = 0;
-
LIST_DEL_INIT(&((struct tasklet *)t)->list);
__ha_barrier_store();
- if (unlikely(_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_TASK_PROFILING)) {
- profile_entry = sched_activity_entry(sched_activity, t->process);
- before = now_mono_time();
+ th_ctx->sched_wake_date = ((struct tasklet *)t)->wake_date;
+ if (th_ctx->sched_wake_date) {
+ uint32_t now_ns = now_mono_time();
+ uint32_t lat = now_ns - th_ctx->sched_wake_date;
- if (((struct tasklet *)t)->wake_date) {
- HA_ATOMIC_ADD(&profile_entry->lat_time, (uint32_t)(before - ((struct tasklet *)t)->wake_date));
- ((struct tasklet *)t)->wake_date = 0;
- }
+ ((struct tasklet *)t)->wake_date = 0;
+ th_ctx->sched_call_date = now_ns;
+ profile_entry = sched_activity_entry(sched_activity, t->process);
+ HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
+ HA_ATOMIC_INC(&profile_entry->calls);
}
state = _HA_ATOMIC_FETCH_AND(&t->state, TASK_PERSISTENT);
@@ -593,10 +593,8 @@
continue;
}
- if (unlikely(_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_TASK_PROFILING)) {
- HA_ATOMIC_INC(&profile_entry->calls);
- HA_ATOMIC_ADD(&profile_entry->cpu_time, now_mono_time() - before);
- }
+ if (th_ctx->sched_wake_date)
+ HA_ATOMIC_ADD(&profile_entry->cpu_time, (uint32_t)(now_mono_time() - th_ctx->sched_call_date));
done++;
th_ctx->current = NULL;
@@ -607,6 +605,21 @@
LIST_DEL_INIT(&((struct tasklet *)t)->list);
__ha_barrier_store();
+ th_ctx->sched_wake_date = t->wake_date;
+ if (unlikely(t->wake_date)) {
+ uint32_t now_ns = now_mono_time();
+ uint32_t lat = now_ns - t->wake_date;
+
+ t->lat_time += lat;
+ t->wake_date = 0;
+ th_ctx->sched_call_date = now_ns;
+ profile_entry = sched_activity_entry(sched_activity, t->process);
+ HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
+ HA_ATOMIC_INC(&profile_entry->calls);
+ }
+
+ __ha_barrier_store();
+
/* We must be the exclusive owner of the TASK_RUNNING bit, and
* have to be careful that the task is not being manipulated on
* another thread finding it expired in wake_expired_tasks().
@@ -627,18 +640,6 @@
/* OK then this is a regular task */
_HA_ATOMIC_DEC(&ha_thread_ctx[tid].tasks_in_list);
- if (unlikely(t->wake_date)) {
- uint64_t now_ns = now_mono_time();
- uint64_t lat = now_ns - t->wake_date;
-
- t->lat_time += lat;
- t->wake_date = now_ns;
- profile_entry = sched_activity_entry(sched_activity, t->process);
- HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
- HA_ATOMIC_INC(&profile_entry->calls);
- }
-
- __ha_barrier_store();
/* Note for below: if TASK_KILLED arrived before we've read the state, we
* directly free the task. Otherwise it will be seen after processing and
@@ -661,18 +662,20 @@
}
th_ctx->current = NULL;
__ha_barrier_store();
+
+ /* stats are only registered for non-zero wake dates */
+ if (unlikely(th_ctx->sched_wake_date)) {
+ uint32_t cpu = (uint32_t)now_mono_time() - th_ctx->sched_call_date;
+
+ if (t)
+ t->cpu_time += cpu;
+ HA_ATOMIC_ADD(&profile_entry->cpu_time, cpu);
+ }
+
/* If there is a pending state we have to wake up the task
* immediately, else we defer it into wait queue
*/
if (t != NULL) {
- if (unlikely(t->wake_date)) {
- uint64_t cpu = now_mono_time() - t->wake_date;
-
- t->cpu_time += cpu;
- t->wake_date = 0;
- HA_ATOMIC_ADD(&profile_entry->cpu_time, cpu);
- }
-
state = _HA_ATOMIC_LOAD(&t->state);
if (unlikely(state & TASK_KILLED)) {
task_unlink_wq(t);