[RFC] sched/eevdf: avoid task starvation in cgroups

From: Tobias Huschle <huschle@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com,
	sshegde@linux.vnet.ibm.com, srikar@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, ankur.a.arora@oracle.com
Subject: [RFC] sched/eevdf: avoid task starvation in cgroups
Date: Wed, 28 Feb 2024 17:10:23 +0100	[thread overview]
Message-ID: <20240228161023.14310-1-huschle@linux.ibm.com> (raw)

When running update_curr, it is checked whether the current task has
missed its deadline (update_deadline). If the deadline has been crossed,
the task is set to be rescheduled if there are other tasks available on
its cfs_rq.
This can cause task starvation in some cgroup configurations.

Assume the following scenario:

           [ ]  rq of CPU
            |
           [ ]  cfs_rq1
          /   \
cfs_rq2 [ ]   othertask
         |
        curr

In this case, update_curr is called for all levels of the hierarchy,
starting at the leaf.
Assume that curr is a kernel task which does not give up the CPU
voluntarily, i.e. loops indefinitely unless forced to give up the CPU.
Assume further that curr has actually reached its deadline, the expected
behavior would be, that the scheduler determines that curr would now
start to overconsume and therefore should set the need_resched flag to
nudge the current task to allow a reschedule in favor of othertask.

To reach the expected behavior, it is not sufficient to check whether
other tasks are queued in the cfs_rq, which curr is assigned to.

In the configuration shown above, each run queue sees:
rq_cpu:  2 tasks
cfs_rq1: 2 tasks
cfs_rq2: 1 task

This means that cfs_rq2 will never reschedule on its own as it sees
no other tasks that would be worth rescheduling for. Hence, the
call to reschedule relies on the upper levels in the hierarchy.
cfs_rq1 sees 1 additional task and could therefore take the desired
decision. But, cfs_rq1 takes the additional task into account when
computing its own deadline, which means, its deadline lies further in
the future. This causes that its deadline is not being reached.
Therefore cfs_rq1 does not even check for other potential tasks to be
ran.

It could now be assumed that cfs_rq1 will reach its deadline at some
point. But, after curr has consumed its timeslice, all sched_entities
in the cgroup tree get reweighted. This causes the deadlines of all
entities in the hierarchy to be shifted further into the future.
This means especially, that the deadline of cfs_rq1 also gets pushed
into the future, causing it to never reach its deadline.
The decision whether a reweight needs to be done depends on the weight
of the sched_entity and a recalculated value of the shares the
sched_entity shall expect (see calc_group_shares in fair.c). These
values are, in the described scenario, always different, hence, reweight
is done.

The combination of these two circumstances can lead to curr
running indefinitely unless interrupted by an external entity.

Address this issue by rather checking for the nr_running value of the
rq of the CPU itself, i.e. if there is any other task, somewhere on the
CPU runqueue, give it a chance to execute.

The scenario that made me discover this is being discussed here:
https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/

Questions:
1. Is this behavior by design? I couldn't find an explanation why we
   only check for the local cfs_rq. Do we only want to allow take-overs
   from the same cgroup hierarchy level in that case?
2. This problem is referred to as the "hog"-problem by Ankur Arora here:
   https://lore.kernel.org/lkml/20240213055554.1802415-24-ankur.a.arora@oracle.com/
   Might there be a connection, it's the same piece of code in the end. I'd also
   prefer the boolean passing to avoid calling resched_curr too often.
3. Which are the criteria for which I should expect a reweight of the 
   cgroup hierarchy? It seems unintuitive too me, that the shares value
   keeps changing every time.

Feedback and opinions would be highly appreciated.

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c4ef20a2f8..e9733ef7964a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1015,7 +1015,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * The task has consumed its request, reschedule.
 	 */
-	if (cfs_rq->nr_running > 1) {
+	if (rq_of(cfs_rq)->nr_running > 1) {
 		resched_curr(rq_of(cfs_rq));
 		clear_buddies(cfs_rq, se);
 	}
-- 
2.34.1