Re: [PATCH] sched/rt: Fix double enqueue caused by rt_effective_prio

From: Juri Lelli <juri.lelli@redhat.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, vincent.guittot@linaro.org,
	rostedt@goodmis.org, bristot@redhat.com, bsegall@google.com,
	mgorman@suse.de, Mark Simmons <msimmons@redhat.com>
Subject: Re: [PATCH] sched/rt: Fix double enqueue caused by rt_effective_prio
Date: Wed, 7 Jul 2021 10:47:03 +0200
Message-ID: <YOVqB1XKdoZYnn4m@localhost.localdomain>
In-Reply-To: <29c071b5-5dd9-42df-9e00-f3df644eeccc@arm.com>

Hi,

On 06/07/21 16:48, Dietmar Eggemann wrote:
> On 01/07/2021 11:14, Juri Lelli wrote:
> > Double enqueues in rt runqueues (list) have been reported while running
> > a simple test that spawns a number of threads doing a short sleep/run
> > pattern while being concurrently setscheduled between rt and fair class.
> 
> I tried to recreate this in rt-app (with a `pi-mutex` resource and
> `pi_enabled=true`), but I can't get the system to hit this warning.

So, this is a bit hard to reproduce. I'm attaching the reproducer we
have been using to test the fix. Note that we have seen this on RT (which
is why the repro doesn't need to use mutexes explicitly), but I don't
see why this couldn't in principle happen on !RT as well.

> [...]
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0c22cd026440..c84ac1d675f4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6823,7 +6823,8 @@ static void __setscheduler_params(struct task_struct *p,
> >  
> >  /* Actually do priority change: must hold pi & rq lock. */
> >  static void __setscheduler(struct rq *rq, struct task_struct *p,
> > -			   const struct sched_attr *attr, bool keep_boost)
> > +			   const struct sched_attr *attr, bool keep_boost,
> > +			   int new_effective_prio)
> >  {
> >  	/*
> >  	 * If params can't change scheduling class changes aren't allowed
> > @@ -6840,7 +6841,7 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
> >  	 */
> >  	p->prio = normal_prio(p);
> >  	if (keep_boost)
> > -		p->prio = rt_effective_prio(p, p->prio);
> > +		p->prio = new_effective_prio;
> 
> So in case __sched_setscheduler() is called for p (SCHED_NORMAL, NICE0)
> you want to avoid that this 2. rt_effective_prio() call returns
> p->prio=120 in case the 1. call (in __sched_setscheduler()) did return 0
> (due to pi_task->prio=0 (FIFO rt_priority=99 task))?

Not sure I completely follow your question. But what I'm seeing is that
the top_task prio/class can change (by a concurrent setscheduler call,
for example) between two consecutive rt_effective_prio() calls and this
eventually causes the double enqueue in the rt list.

Now, what I'm not sure about is whether this is fine (since we always
eventually converge to correctness in the PI chain(s)), in which case the
proposed fix is enough, or whether we need to fix this differently.
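
To make the window concrete, here is a minimal user-space model of it.
This is only a sketch, not kernel code: struct task, effective_prio()
and change_prio() are made-up names, effective_prio() just mimics the
min() that rt_effective_prio() applies against the top PI task, and the
"concurrent" setscheduler call is simulated by a plain assignment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few task_struct fields that matter here. */
struct task {
        int prio;               /* lower value means higher priority */
        struct task *top_task;  /* top PI waiter boosting us, if any */
};

/* Mimics rt_effective_prio(): take the better of prio and the donor's prio. */
static int effective_prio(const struct task *p, int prio)
{
        if (p->top_task && p->top_task->prio < prio)
                return p->top_task->prio;
        return prio;
}

/*
 * Model one priority change of p to newprio.
 *
 * "decided" is the value the caller bases its queueing decision on;
 * p->prio is the value that finally gets stored. top_prio_after models
 * the top task being changed in between (assumption: simulated inline).
 * Returns true when the decision and the stored value agree.
 */
static bool change_prio(struct task *p, int newprio, int top_prio_after,
                        bool reuse_first_result)
{
        int decided = effective_prio(p, newprio);

        /* Simulated concurrent priority change of the top task. */
        if (p->top_task)
                p->top_task->prio = top_prio_after;

        /* Either recompute (two calls) or reuse the first result. */
        p->prio = reuse_first_result ? decided : effective_prio(p, newprio);

        return p->prio == decided;
}
```

In this model, with p at 120 boosted by a top task at 0 that drops to
120 inside the window, the recompute variant decides on 0 but stores
120, while the reuse variant stores the same 0 it decided on.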

> >  
> >  	if (dl_prio(p->prio))
> >  		p->sched_class = &dl_sched_class;
> > @@ -6873,7 +6874,7 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
> >  		      MAX_RT_PRIO - 1 - attr->sched_priority;
> >  	int retval, oldprio, oldpolicy = -1, queued, running;
> > -	int new_effective_prio, policy = attr->sched_policy;
> > +	int new_effective_prio = newprio, policy = attr->sched_policy;
> >  	const struct sched_class *prev_class;
> >  	struct callback_head *head;
> >  	struct rq_flags rf;
> > @@ -7072,6 +7073,9 @@ static int __sched_setscheduler(struct task_struct *p,
> >  	oldprio = p->prio;
> >  
> >  	if (pi) {
> > +		newprio = fair_policy(attr->sched_policy) ?
> > +			NICE_TO_PRIO(attr->sched_nice) : newprio;
> > +
> 
> Why is this necessary? p (SCHED_NORMAL) would get newprio=99 now and
> with this it gets [100...120...139] which is still greater than any RT
> (0-98)/DL (-1) prio?

It's needed because newprio (returned in new_effective_prio) might end
up being used by __setscheduler(), so it needs to be the "final"
nice-scaled value.
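
As a rough illustration of the arithmetic, the sketch below redoes the
newprio computation in user space. The constants mirror the kernel's
priority macros as I understand them; enum policy_kind and the two
helper functions are made up for this example and only stand in for
dl_policy()/fair_policy() and the code in __sched_setscheduler().

```c
/* Values mirroring the kernel's priority macros (user-space copy). */
#define MAX_DL_PRIO         0
#define MAX_RT_PRIO         100
#define NICE_WIDTH          40
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2)
#define NICE_TO_PRIO(nice)  ((nice) + DEFAULT_PRIO)

/* Hypothetical stand-in for the policy checks used in the patch. */
enum policy_kind { POLICY_FAIR, POLICY_RT, POLICY_DL };

/* newprio as initialised at the top of __sched_setscheduler(). */
static int initial_newprio(enum policy_kind kind, int sched_priority)
{
        return kind == POLICY_DL ? MAX_DL_PRIO - 1 :
               MAX_RT_PRIO - 1 - sched_priority;
}

/* newprio after the adjustment the patch adds in the pi branch. */
static int pi_newprio(enum policy_kind kind, int sched_priority,
                      int sched_nice)
{
        int newprio = initial_newprio(kind, sched_priority);

        return kind == POLICY_FAIR ? NICE_TO_PRIO(sched_nice) : newprio;
}
```

For a fair task (sched_priority 0) the initial value is 99 whatever the
nice level is, whereas the adjusted value lands in the 100..139 range
and so can be stored in p->prio directly.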

Reproducer (on RT) follows.

Best,
Juri

---
# cat load.c
#include <unistd.h>
#include <time.h>

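int main(){

        struct timespec t, t2;
        t.tv_sec = 0;
        t.tv_nsec = 100000;     /* 100us sleep between bursts */
        volatile int i;         /* volatile keeps the busy loop from being optimized out */
        while (1){
                // sleep(1);
                nanosleep(&t, &t2);
                /* short burst of CPU work */
                i = 0;
                while(i < 100000){
                        i++;
                }
        }
}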

--->8---

# cat setsched.c
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

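int main(int argc, char *argv[]){

        int ret;
        pid_t p;

        /* The target PID is the only argument. */
        if (argc < 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return EXIT_FAILURE;
        }
        p = atoi(argv[1]);
        struct sched_param spr = { .sched_priority = 50};
        struct sched_param spo = { .sched_priority = 0};

        /* Flip the target between SCHED_RR and SCHED_OTHER forever. */
        while(1){

                /* Return values are deliberately ignored. */
                ret = sched_setscheduler(p, SCHED_RR, &spr);
                ret = sched_setscheduler(p, SCHED_OTHER, &spo);
                (void)ret;
        }
}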

--->8---

# cat run.sh
#!/bin/bash

gcc -o load ./load.c
gcc -o setsched ./setsched.c
cp load rt_pid
mkdir -p TMP

for AUX in $(seq 36); do
    cp load TMP/load__${AUX}
    ./TMP/load__${AUX} &
done

sleep 1
for AUX in $(seq 18); do
    cp rt_pid TMP/rt_pid__${AUX}
    cp setsched TMP/setsched__${AUX}
    ./TMP/rt_pid__${AUX} &
    ./TMP/setsched__${AUX} $!&
done

--->8---

# cat destroy.sh
pkill load
pkill setsched
pkill rt_pid

rm load setsched rt_pid
rm -rf TMP