schedstat false counting of domain load_balance() tried to move one or more tasks failed

* schedstat false counting of domain load_balance() tried to move one or more tasks failed
@ 2022-07-13  1:52 Steven Rostedt
  2022-07-18  9:51 ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Rostedt @ 2022-07-13  1:52 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Linus Torvalds

I've been tasked to analyze the /proc/schedstat file to determine
appropriate metrics to look after in production. So I'm looking at both the
documentation and the code that generates it.

From the documentation at https://docs.kernel.org/scheduler/sched-stats.html

(and Documentation/scheduler/sched-stats.rst for those of you that are
allergic to html)

   Domain statistics
   -----------------
   One of these is produced per domain for each cpu described. (Note that if
   CONFIG_SMP is not defined, *no* domains are utilized and these lines
   will not appear in the output.)

   domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

   The first field is a bit mask indicating what cpus this domain operates over.

   The next 24 are a variety of load_balance() statistics in grouped into types
   of idleness (idle, busy, and newly idle):

       1)  # of times in this domain load_balance() was called when the
           cpu was idle
       2)  # of times in this domain load_balance() checked but found
           the load did not require balancing when the cpu was idle
       3)  # of times in this domain load_balance() tried to move one or
           more tasks and failed, when the cpu was idle

I was looking at this #3 (which is also #11 and #19 for CPU_BUSY and
CPU_NEW_IDLE respectively). It states that it gets incremented when one or
more tasks were tried to be moved but failed. I found this is not always
the case.

We have:

	ld_moved = 0;
	/* Clear this flag as soon as we find a pullable task */
	env.flags |= LBF_ALL_PINNED;
	if (busiest->nr_running > 1) {
[..]
	}

	if (!ld_moved) {
		schedstat_inc(sd->lb_failed[idle]);

Where the lb_failed[] is that counter. I added the following code:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..4835ea4d9d01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9865,6 +9865,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct rq *busiest;
 	struct rq_flags rf;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	bool redo = false;
 
 	struct lb_env env = {
 		.sd		= sd,
@@ -10012,11 +10013,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = sched_nr_migrate_break;
+				redo = true;
 				goto redo;
 			}
 			goto out_all_pinned;
 		}
-	}
+	} else if (!redo)
+		trace_printk("Did not try to move! %d\n", idle);
 
 	if (!ld_moved) {
 		schedstat_inc(sd->lb_failed[idle]);


And sure enough that triggers on CPU_IDLE and CPU_NEW_IDLE calls (I haven't
seen it for CPU_BUSY yet, but I didn't try).

Thus, if we get to that check for (busiest->nr_running > 1) and fail, then
we will increment that counter incorrectly.

Do we care? Should it be fixed? Should it be documented?

Thoughts?

-- Steve


^ permalink raw reply related	[flat|nested] 4+ messages in thread