* [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
@ 2021-11-16  1:02 Douglas Anderson
  2021-11-30 16:30 ` Doug Anderson
       [not found] ` <20211201113052.2025-1-hdanton@sina.com>
  0 siblings, 2 replies; 8+ messages in thread
From: Douglas Anderson @ 2021-11-16  1:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt
  Cc: Joel Fernandes, Douglas Anderson, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Mel Gorman,
	linux-kernel

While testing RT_GROUP_SCHED, I found that my system would go bonkers
if my test RT tasks ever got throttled (even if my test RT tasks were
set to only get a tiny slice of CPU time). Specifically I found that
whenever my test RT tasks were throttled that all other RT tasks in
the system were being starved (!!). Several important RT tasks in the
kernel were suddenly getting almost no timeslices and my system became
unusable.

After some experimentation, I determined that this behavior only
happened when I gave my test RT tasks a high priority. If I gave my
test RT tasks a low priority then they were throttled as expected and
nothing was starved.

I managed to come up with a test case that hopefully anyone can run to
demonstrate the problem. The test case uses shell commands and python
but certainly you could reproduce in other ways:

echo "Allow 20 ms more of RT at system and top cgroup"
old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us

echo "Give 10 ms each to spinny and printy groups"
mkdir /sys/fs/cgroup/cpu/spinny
echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
mkdir /sys/fs/cgroup/cpu/printy
echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us

echo "Fork off a printy thing to be a nice RT citizen"
echo "Prints once per second. Priority only 1."
python -c "import time;
last_time = time.time()
while True:
  time.sleep(1)
  now_time = time.time()
  print('Time flies %f' % (now_time - last_time))
  last_time = now_time" &
pid=$!
echo "Give python a few seconds to get started"
sleep 3
echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
chrt -p -f 1 $pid

echo "Sleep to observe that everything is peachy"
sleep 3

echo "Fork off a bunch of evil spinny things"
echo "Chews CPU time. Priority 99."
for i in $(seq 13); do
  python -c "while True: pass"&
  pid=$!
  echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
  chrt -p -f 99 $pid
done

echo "Huh? Almost no more prints?"

I believe that the problem is an "if" test that's been in
push_rt_task() forever, where we just reschedule the current task if
the next task has slipped in at a higher priority. If I just remove
that special case then everything works for me. I tried making it
conditional on just `!rq->rt.rt_throttled`, but for whatever reason
that wasn't enough. The `if` test looks like an unlikely special-case
optimization and it seems like things ought to be fine without it.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
I know less than zero about the scheduler (so if I told you something,
there's a better than 50% chance that the opposite is true!). Here I'm
asserting that we totally don't need this special case and the system
will be fine without it, but I actually don't have any data to back
that up. If nothing else, hopefully my test case in the commit message
would let someone else reproduce and see what I'm talking about and
can come up with a better fix.
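
For reference, the `!rq->rt.rt_throttled` variant I mentioned above
would have looked roughly like this (a sketch of that attempt only;
as noted, it wasn't enough to fix the problem for me):

	/*
	 * Hypothetical tweak (not the change below): only reschedule
	 * current for a higher-priority next_task when this rq's RT
	 * queue isn't currently throttled.
	 */
	if (unlikely(!rq->rt.rt_throttled && next_task->prio < rq->curr->prio)) {
		resched_curr(rq);
		return 0;
	}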

 kernel/sched/rt.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b48baaba2fc2..93ea5de0f917 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2048,16 +2048,6 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (WARN_ON(next_task == rq->curr))
 		return 0;
 
-	/*
-	 * It's possible that the next_task slipped in of
-	 * higher priority than current. If that's the case
-	 * just reschedule current.
-	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
-		resched_curr(rq);
-		return 0;
-	}
-
 	/* We might release rq lock */
 	get_task_struct(next_task);
 
-- 
2.34.0.rc1.387.gb447b232ab-goog


* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-11-16  1:02 [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority Douglas Anderson
@ 2021-11-30 16:30 ` Doug Anderson
  2021-11-30 23:36   ` Joel Fernandes
       [not found] ` <20211201113052.2025-1-hdanton@sina.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Doug Anderson @ 2021-11-30 16:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt
  Cc: Joel Fernandes, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Mel Gorman, linux-kernel

Hi,

On Mon, Nov 15, 2021 at 5:03 PM Douglas Anderson <dianders@chromium.org> wrote:
>
> While testing RT_GROUP_SCHED, I found that my system would go bonkers
> if my test RT tasks ever got throttled (even if my test RT tasks were
> set to only get a tiny slice of CPU time). Specifically I found that
> whenever my test RT tasks were throttled that all other RT tasks in
> the system were being starved (!!). Several important RT tasks in the
> kernel were suddenly getting almost no timeslices and my system became
> unusable.
>
> After some experimentation, I determined that this behavior only
> happened when I gave my test RT tasks a high priority. If I gave my
> test RT tasks a low priority then they were throttled as expected and
> nothing was starved.
>
> I managed to come up with a test case that hopefully anyone can run to
> demonstrate the problem. The test case uses shell commands and python
> but certainly you could reproduce in other ways:
>
> echo "Allow 20 ms more of RT at system and top cgroup"
> old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
> echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
> old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
> echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
>
> echo "Give 10 ms each to spinny and printy groups"
> mkdir /sys/fs/cgroup/cpu/spinny
> echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
> mkdir /sys/fs/cgroup/cpu/printy
> echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us
>
> echo "Fork off a printy thing to be a nice RT citizen"
> echo "Prints once per second. Priority only 1."
> python -c "import time;
> last_time = time.time()
> while True:
>   time.sleep(1)
>   now_time = time.time()
>   print('Time flies %f' % (now_time - last_time))
>   last_time = now_time" &
> pid=$!
> echo "Give python a few seconds to get started"
> sleep 3
> echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
> chrt -p -f 1 $pid
>
> echo "Sleep to observe that everything is peachy"
> sleep 3
>
> echo "Fork off a bunch of evil spinny things"
> echo "Chews CPU time. Priority 99."
> for i in $(seq 13); do
>   python -c "while True: pass"&
>   pid=$!
>   echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
>   chrt -p -f 99 $pid
> done
>
> echo "Huh? Almost no more prints?"
>
> I believe that the problem is an "if" test that's been in
> push_rt_task() forever, where we just reschedule the current task if
> the next task has slipped in at a higher priority. If I just remove
> that special case then everything works for me. I tried making it
> conditional on just `!rq->rt.rt_throttled`, but for whatever reason
> that wasn't enough. The `if` test looks like an unlikely special-case
> optimization and it seems like things ought to be fine without it.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> I know less than zero about the scheduler (so if I told you something,
> there's a better than 50% chance that the opposite is true!). Here I'm
> asserting that we totally don't need this special case and the system
> will be fine without it, but I actually don't have any data to back
> that up. If nothing else, hopefully my test case in the commit message
> would let someone else reproduce and see what I'm talking about and
> can come up with a better fix.
>
>  kernel/sched/rt.c | 10 ----------
>  1 file changed, 10 deletions(-)

I know this isn't crazy urgent and I'm happy to sit and twiddle my
thumbs a bit longer if everyone is still sleepy from tryptophan, but
I'm curious if anyone had a chance to look at this. Can anyone confirm
that my script reproduces for them on something other than my system?
Does my patch seem sane?

Thanks!

-Doug

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-11-30 16:30 ` Doug Anderson
@ 2021-11-30 23:36   ` Joel Fernandes
  2021-12-02 11:11     ` Qais Yousef
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2021-11-30 23:36 UTC (permalink / raw)
  To: Doug Anderson, Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Mel Gorman, linux-kernel

On Tue, Nov 30, 2021 at 11:30 AM Doug Anderson <dianders@chromium.org> wrote:
> I know this isn't crazy urgent and I'm happy to sit and twiddle my
> thumbs a bit longer if everyone is still sleepy from tryptophan, but
> I'm curious if anyone had a chance to look at this. Can anyone confirm
> that my script reproduces for them on something other than my system?

Maybe +Qais Yousef can also chime in. Just to add, I think Doug's
issue is that under RT group scheduling, he expects *other* well
behaved RT tasks to not be throttled while the tasks that are
misbehaving to be *gracefully* throttled. Gracefully being- it is
probably WAI that they are exceeding their runtime and so *should
naturally* be bandwidth-limited similar to deadline. I am not sure if
the design of RT throttling considers throttling as an anomaly or
something that's expected to happen. If it is expected to happen, then
I see Doug's point that the mechanism shouldn't affect other RT tasks.

> Does my patch seem sane?

To be honest, I don't understand that bit enough :(

 -Joel

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
       [not found] ` <20211201113052.2025-1-hdanton@sina.com>
@ 2021-12-02  0:50   ` Doug Anderson
  0 siblings, 0 replies; 8+ messages in thread
From: Doug Anderson @ 2021-12-02  0:50 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Vincent Guittot, Steven Rostedt, Joel Fernandes,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Mel Gorman, linux-kernel

Hi,

On Wed, Dec 1, 2021 at 3:31 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 15 Nov 2021 17:02:45 -0800 Douglas Anderson wrote:
> > While testing RT_GROUP_SCHED, I found that my system would go bonkers
> > if my test RT tasks ever got throttled (even if my test RT tasks were
> > set to only get a tiny slice of CPU time). Specifically I found that
> > whenever my test RT tasks were throttled that all other RT tasks in
> > the system were being starved (!!). Several important RT tasks in the
> > kernel were suddenly getting almost no timeslices and my system became
> > unusable.
> >
> > After some experimentation, I determined that this behavior only
> > happened when I gave my test RT tasks a high priority. If I gave my
> > test RT tasks a low priority then they were throttled as expected and
> > nothing was starved.
> >
> > I managed to come up with a test case that hopefully anyone can run to
> > demonstrate the problem. The test case uses shell commands and python
> > but certainly you could reproduce in other ways:
> >
> > echo "Allow 20 ms more of RT at system and top cgroup"
> > old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
> > echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
> > old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
> > echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
> >
> > echo "Give 10 ms each to spinny and printy groups"
> > mkdir /sys/fs/cgroup/cpu/spinny
> > echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
> > mkdir /sys/fs/cgroup/cpu/printy
> > echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us
> >
> > echo "Fork off a printy thing to be a nice RT citizen"
> > echo "Prints once per second. Priority only 1."
> > python -c "import time;
> > last_time = time.time()
> > while True:
> >   time.sleep(1)
> >   now_time = time.time()
> >   print('Time flies %f' % (now_time - last_time))
> >   last_time = now_time" &
> > pid=$!
> > echo "Give python a few seconds to get started"
> > sleep 3
> > echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
> > chrt -p -f 1 $pid
> >
> > echo "Sleep to observe that everything is peachy"
> > sleep 3
> >
> > echo "Fork off a bunch of evil spinny things"
> > echo "Chews CPU time. Priority 99."
> > for i in $(seq 13); do
> >   python -c "while True: pass"&
> >   pid=$!
> >   echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
> >   chrt -p -f 99 $pid
> > done
> >
> > echo "Huh? Almost no more prints?"
> >
> > I believe that the problem is an "if" test that's been in
> > push_rt_task() forever, where we just reschedule the current task if
> > the next task has slipped in at a higher priority. If I just remove
> > that special case then everything works for me. I tried making it
> > conditional on just `!rq->rt.rt_throttled`, but for whatever reason
> > that wasn't enough. The `if` test looks like an unlikely special-case
> > optimization and it seems like things ought to be fine without it.
> >
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> > I know less than zero about the scheduler (so if I told you something,
> > there's a better than 50% chance that the opposite is true!). Here I'm
> > asserting that we totally don't need this special case and the system
> > will be fine without it, but I actually don't have any data to back
> > that up. If nothing else, hopefully my test case in the commit message
> > would let someone else reproduce and see what I'm talking about and
> > can come up with a better fix.
>
> Can you try to tune the knob down to somewhere like 1ms?
>
> Hillf
>
> /*
>  * period over which we measure -rt task CPU usage in us.
>  * default: 1s
>  */
> unsigned int sysctl_sched_rt_period = 1000000;

I could give it a shot, but that's a pretty big behavior change and
the Documentation (sched-rt-group.rst) warns me away from such a
thing. The default of 1 second seems crazy conservative, but tweaking
it all the way down to 1 ms seems a bit aggressive. It also feels like
this would only be working around the problem, not necessarily solving
it at the core?
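
For the record, the knob in question is exposed as the
sched_rt_period_us sysctl, so trying it would look roughly like this
(a sketch with illustrative values only; the period has to stay >= the
runtime, so the runtime would need to come down first or the write is
rejected):

echo 900 > /proc/sys/kernel/sched_rt_runtime_us
echo 1000 > /proc/sys/kernel/sched_rt_period_us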

-Doug

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-11-30 23:36   ` Joel Fernandes
@ 2021-12-02 11:11     ` Qais Yousef
  2021-12-02 18:05       ` Doug Anderson
  0 siblings, 1 reply; 8+ messages in thread
From: Qais Yousef @ 2021-12-02 11:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Doug Anderson, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Mel Gorman,
	linux-kernel

Hi Joel, Doug

On 11/30/21 18:36, Joel Fernandes wrote:
> On Tue, Nov 30, 2021 at 11:30 AM Doug Anderson <dianders@chromium.org> wrote:
> > I know this isn't crazy urgent and I'm happy to sit and twiddle my
> > thumbs a bit longer if everyone is still sleepy from tryptophan, but
> > I'm curious if anyone had a chance to look at this. Can anyone confirm
> > that my script reproduces for them on something other than my system?
> 
> Maybe +Qais Yousef can also chime in. Just to add, I think Doug's

I could try :-)

/me tries to find the original posting

> issue is that under RT group scheduling, he expects *other* well
> behaved RT tasks to not be throttled while the tasks that are
> misbehaving to be *gracefully* throttled. Gracefully being- it is
> probably WAI that they are exceeding their runtime and so *should
> naturally* be bandwidth-limited similar to deadline. I am not sure if
> the design of RT throttling considers throttling as an anomaly or
> something that's expected to happen. If it is expected to happen, then
> I see Doug's point that the mechanism shouldn't affect other RT tasks.

I don't know much about RT group throttling, but I am under the impression it
is to give other sched classes a breather, not other lower priority RT tasks.
RT still relies on admins knowing what they're doing when creating RT tasks and
manage their priorities.

What I think is happening in your case is that you're spawning 13 busyloops at
priority 99, assuming you're on a chromebook which probably has at best 8 CPUs,
you're basically killing the machine and there's nothing the scheduler can do
about it. You get what you asked for :-)

If you spawn nr_cpu_ids - 1 busyloops, the lower priority tasks should find
1 cpu to get their work done, and I don't expect a hang then. And the RT
throttling should allow CFS to make some progress every now and then on the
other CPUs. So you might end up in a slow system, but not hanging one
hopefully.
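
Concretely, that would be something like this tweak to the spawn loop
in your test case (a sketch, assuming nproc matches nr_cpu_ids on your
system):

for i in $(seq $(( $(nproc) - 1 ))); do
  python -c "while True: pass" &
  pid=$!
  echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
  chrt -p -f 99 $pid
done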

Note that Peter fixed the kernel so that it produces known RT priorities (as
opposed to developers setting random values in the past):

  * sched_set_fifo_low() ==> not really RT but needs to be above cfs. Runs at
    priority 1.
  * sched_set_fifo() ==> sets priority MAX_RT_PRIO/2 ==> 50

So most system RT tasks fall into the 2nd category, which is reasonably high
priority. And RT scheduling assumes if you set something higher than that,
then it's your responsibility to make sure they don't starve :-)
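
For reference, those helpers boil down to roughly this (paraphrased
from kernel/sched/core.c; check your tree for the exact code):

void sched_set_fifo(struct task_struct *p)
{
	struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 };

	WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
}

/* Not really RT, but above CFS: bottom of the RT priority range. */
void sched_set_fifo_low(struct task_struct *p)
{
	struct sched_param sp = { .sched_priority = 1 };

	WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0);
}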

Yep, it's easy to shoot yourself in the foot with RT, that's why it's
a privileged op ;-)

> 
> > Does my patch seem sane?
> 
> To be honest, I don't understand that bit enough :(

Nope. The patch breaks RT priority scheduling. You basically allow a lower
priority task to run before a higher priority one. Another code path will
probably detect that and try to correct it later, but still this is not
right.

HTH, and I hope I didn't gloss over some detail.

Thanks

--
Qais Yousef

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-12-02 11:11     ` Qais Yousef
@ 2021-12-02 18:05       ` Doug Anderson
  2021-12-13 13:08         ` Qais Yousef
  0 siblings, 1 reply; 8+ messages in thread
From: Doug Anderson @ 2021-12-02 18:05 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Mel Gorman,
	linux-kernel

Hi,

On Thu, Dec 2, 2021 at 3:11 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> Hi Joel, Doug
>
> On 11/30/21 18:36, Joel Fernandes wrote:
> > On Tue, Nov 30, 2021 at 11:30 AM Doug Anderson <dianders@chromium.org> wrote:
> > > I know this isn't crazy urgent and I'm happy to sit and twiddle my
> > > thumbs a bit longer if everyone is still sleepy from tryptophan, but
> > > I'm curious if anyone had a chance to look at this. Can anyone confirm
> > > that my script reproduces for them on something other than my system?
> >
> > Maybe +Qais Yousef can also chime in. Just to add, I think Doug's
>
> I could try :-)
>
> /me tries to find the original posting

Oops! I thought I had duplicated enough of the content in this patch.
Should have included a reference. It's:

https://lore.kernel.org/r/CAD=FV=UKyJLhDEKxhqrit16kvLfi+g0DyYKL6bLJ35fO7NCTsg@mail.gmail.com/


> > issue is that under RT group scheduling, he expects *other* well
> > behaved RT tasks to not be throttled while the tasks that are
> > misbehaving to be *gracefully* throttled. Gracefully being- it is
> > probably WAI that they are exceeding their runtime and so *should
> > naturally* be bandwidth-limited similar to deadline. I am not sure if
> > the design of RT throttling considers throttling as an anomaly or
> > something that's expected to happen. If it is expected to happen, then
> > I see Doug's point that the mechanism shouldn't affect other RT tasks.
>
> I don't know much about RT group throttling, but I am under the impression it
> is to give other sched classes a breather, not other lower priority RT tasks.
> RT still relies on admins knowing what they're doing when creating RT tasks and
> manage their priorities.

So I guess this is where my fundamental disagreement is. I would say
that one point of the throttling is to prevent an errant RT task (one
that's going crazy) from taking the system down with it. It seems a
bit strange to me that we'd want to protect the normal tasks (which
should, in general, be less urgent than _all_ the RT tasks) but not
lower priority RT tasks. In my case the lower priority RT tasks are
actually important enough that blocking them _does_ take the system
down.


> What I think is happening in your case is that you're spawning 13 busyloops at
> priority 99, assuming you're on a chromebook which probably has at best 8 CPUs,
> you're basically killing the machine and there's nothing the scheduler can do
> about it. You get what you asked for :-)

Well, I asked it to only let those 13 busyloops get 10 ms to run per 1
second at most (and then put them on ice). ...so in my mind I _didn't_
get what I asked for. ;-)


> If you spawn nr_cpu_ids - 1 busyloops, the lower priority tasks should find
> 1 cpu to get their work done, and I don't expect a hang then. And the RT
> throttling should allow CFS to make some progress every now and then on the
> other CPUs. So you might end up in a slow system, but not hanging one
> hopefully.
>
> Note that Peter fixed the kernel so that it produces known RT priorities (as
> opposed to developers setting random values in the past):
>
>   * sched_set_fifo_low() ==> not really RT but needs to be above cfs. Runs at
>     priority 1.
>   * sched_set_fifo() ==> sets priority MAX_RT_PRIO/2 ==> 50
>
> So most system RT tasks fall into the 2nd category, which is reasonably high
> priority. And RT scheduling assumes if you set something higher than that,
> then it's your responsibility to make sure they don't starve :-)
>
> Yep, it's easy to shoot yourself in the foot with RT, that's why it's
> a privileged op ;-)

For sure getting RT scheduling wrong will shoot you in the foot, but I
don't think what I did was wrong.

From my understanding of the throttling, one big goal (maybe not the
only goal, but a big one) is to make sure that if a process goes
haywire (presumably trying to take 100% of the CPU) that the system
doesn't die and corrective action can be taken on the errant process.
I think I have shown that this goal is really only achieved if the
errant process is lower priority than all the other RT tasks in the
system. That doesn't seem like a reasonable limitation...

I would also say that with Peter's fix above the problem is perhaps
_more_ urgent? You just said that there's a whole bunch of kernel
tasks that are now created with the lowest RT priority. From your
description above this means "not really RT but needs to be above
cfs". If a sysadmin uses RT_GROUP_SCHED and a priority other than the
lowest priority then it's pretty much guaranteed that if the process
goes haywire that it will take down all of these important kernel
tasks. That can't be right, can it?

Maybe the answer is simply that RT_GROUP_SCHED and RT priority are
incompatible? I could submit a patch that forces all RT tasks to have
the same effective priority if RT_GROUP_SCHED is defined. :-/


> > > Does my patch seem sane?
> >
> > To be honest, I don't understand that bit enough :(
>
> Nope. The patch breaks RT priority scheduling. You basically allow a lower
> priority task to run before a higher priority one. Another code path will
> probably detect that and try to correct it later, but still this is not
> right.

OK, I'll take your word for it. I'm still hoping someone can tell me a
better way to fix this?


-Doug

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-12-02 18:05       ` Doug Anderson
@ 2021-12-13 13:08         ` Qais Yousef
  2021-12-13 18:32           ` Doug Anderson
  0 siblings, 1 reply; 8+ messages in thread
From: Qais Yousef @ 2021-12-13 13:08 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Mel Gorman,
	linux-kernel

Hi Doug

Sorry for the delayed response.

On 12/02/21 10:05, Doug Anderson wrote:
> Hi,
> 
> On Thu, Dec 2, 2021 at 3:11 AM Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > Hi Joel, Doug
> >
> > On 11/30/21 18:36, Joel Fernandes wrote:
> > > On Tue, Nov 30, 2021 at 11:30 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > I know this isn't crazy urgent and I'm happy to sit and twiddle my
> > > > thumbs a bit longer if everyone is still sleepy from tryptophan, but
> > > > I'm curious if anyone had a chance to look at this. Can anyone confirm
> > > > that my script reproduces for them on something other than my system?
> > >
> > > Maybe +Qais Yousef can also chime in. Just to add, I think Doug's
> >
> > I could try :-)
> >
> > /me tries to find the original posting
> 
> Oops! I thought I had duplicated enough of the content in this patch.
> Should have included a reference. It's:
> 
> https://lore.kernel.org/r/CAD=FV=UKyJLhDEKxhqrit16kvLfi+g0DyYKL6bLJ35fO7NCTsg@mail.gmail.com/
> 
> 
> > > issue is that under RT group scheduling, he expects *other* well
> > > behaved RT tasks to not be throttled while the tasks that are
> > > misbehaving to be *gracefully* throttled. Gracefully being- it is
> > > probably WAI that they are exceeding their runtime and so *should
> > > naturally* be bandwidth-limited similar to deadline. I am not sure if
> > > the design of RT throttling considers throttling as an anomaly or
> > > something that's expected to happen. If it is expected to happen, then
> > > I see Doug's point that the mechanism shouldn't affect other RT tasks.
> >
> > I don't know much about RT group throttling, but I am under the impression it
> > is to give other sched classes a breather, not other lower priority RT tasks.
> > RT still relies on admins knowing what they're doing when creating RT tasks and
> > manage their priorities.
> 
> So I guess this is where my fundamental disagreement is. I would say
> that one point of the throttling is to prevent an errant RT task (one
> that's going crazy) from taking the system down with it. It seems a
> bit strange to me that we'd want to protect the normal tasks (which
> should, in general, be less urgent than _all_ the RT tasks) but not
> lower priority RT tasks. In my case the lower priority RT tasks are
> actually important enough that blocking them _does_ take the system
> down.

I've now tried to read Documentation/scheduler/sched-rt-group.rst and I'm
still not sure how to interpret it. It mentions

"""
Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.
"""

I interpret the example in The Solution as allocating enough time for the
realtime tasks to do their work while still giving the rest of the system a
chance to run and keep the pipeline filled, hence avoiding the underruns.

> 
> 
> > What I think is happening in your case is that you're spawning 13 busyloops at
> > priority 99, assuming you're on a chromebook which probably has at best 8 CPUs,
> > you're basically killing the machine and there's nothing the scheduler can do
> > about it. You get what you asked for :-)
> 
> Well, I asked it to only let those 13 busyloops get 10 ms to run per 1
> second at most (and then put them on ice). ...so in my mind I _didn't_
> get what I asked for. ;-)

The document has a warning that a short runtime could lead to instability.
Not sure if this can be considered the case here, but it's worth trying to
increase that to 100ms and see what happens.
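
Something along these lines, as a sketch (values are illustrative; the
system-wide and top-level cgroup budgets may need raising first to make
room for 2 x 100ms):

echo 100000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
echo 100000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us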

I still think your example is killing the system and I wouldn't be surprised if
you end up starving the lower priority tasks.

> 
> 
> > If you spawn nr_cpu_ids - 1 busyloops, the lower priority tasks should find
> > 1 cpu to get their work done, and I don't expect a hang then. And the RT
> > throttling should allow CFS to make some progress every now and then on the
> > other CPUs. So you might end up in a slow system, but not hanging one
> > hopefully.
> >
> > Note that Peter fixed the kernel so that it produces known RT priorities (as
> > opposed to developers setting random values in the past):
> >
> >   * sched_set_fifo_low() ==> not really RT but needs to be above cfs. Runs at
> >     priority 1.
> >   * sched_set_fifo() ==> sets priority MAX_RT_PRIO/2 ==> 50
> >
> > So most system RT tasks fall into the 2nd category, which is reasonably high
> > priority. And RT scheduling assumes if you set something higher than that,
> > then it's your responsibility to make sure they don't starve :-)
> >
> > Yep, it's easy to shoot yourself in the foot with RT, that's why it's
> > a privileged op ;-)
> 
> For sure getting RT scheduling wrong will shoot you in the foot, but I
> don't think what I did was wrong.
> 
> From my understanding of the throttling, one big goal (maybe not the
> only goal, but a big one) is to make sure that if a process goes
> haywire (presumably trying to take 100% of the CPU) that the system
> doesn't die and corrective action can be taken on the errant process.
> I think I have shown that this goal is really only achieved if the
> errant process is lower priority than all the other RT tasks in the
> system. That doesn't seem like a reasonable limitation...

My reading of Documentation/scheduler/sched-rt-group.rst is that it's about
ensuring the interaction between realtime and non-realtime elements doesn't
end up in scenarios like buffer underruns because the rest of the system
(SCHED_OTHER) doesn't get enough time to fill those buffers.

> 
> I would also say that with Peter's fix above the problem is perhaps
> _more_ urgent? You just said that there's a whole bunch of kernel

I still can't see the problem, tbh.

> tasks that are now created with the lowest RT priority. From your
> description above this means "not really RT but needs to be above
> cfs". If a sysadmin uses RT_GROUP_SCHED and a priority other than the
> lowest priority then it's pretty much guaranteed that if the process
> goes haywire that it will take down all of these important kernel
> tasks. That can't be right, can it?

Did you try something simpler like having 1 task in each group instead of
spawning 13 busy loops?

By the way, sched-rt-group.rst has a warning about a potential starvation
scenario in the 'Future plans' section. I don't think you're affected by it,
but I didn't spend much time thinking about it.

My point is that when using RT you can't expect not to end up with unstable
systems sometimes. Your scenario is one that falls into the unstable region
IMHO.

> 
> Maybe the answer is simply that RT_GROUP_SCHED and RT priority are
> incompatible? I could submit a patch that forces all RT tasks to have
> the same effective priority if RT_GROUP_SCHED is defined. :-/

Personally I can't reach this conclusion based on the scenario you presented.
I think you're doing something unreasonable and the results you're getting are
(likely) not that unreasonable :-)

> 
> 
> > > > Does my patch seem sane?
> > >
> > > To be honest, I don't understand that bit enough :(
> >
> > Nope. The patch breaks RT priority scheduling. You basically allow a lower
> > priority task to run before a higher priority one. Another code path will
> > probably detect that and try to correct it later, but still this is not
> > right.
> 
> OK, I'll take your word for it. I'm still hoping someone can tell me a
> better way to fix this?

Hopefully someone more knowledgeable from the RT community can offer better
help and explanation than myself. IMHO the question is whether your scenario
and expected outcome are correct. ie: I'm not sure if there's anything to fix
yet.

I could be wrong of course. So don't let my opinion deter you from pursuing
this further if you believe otherwise.

Cheers

--
Qais Yousef

* Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority
  2021-12-13 13:08         ` Qais Yousef
@ 2021-12-13 18:32           ` Doug Anderson
  0 siblings, 0 replies; 8+ messages in thread
From: Doug Anderson @ 2021-12-13 18:32 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Mel Gorman,
	linux-kernel

Hi,

On Mon, Dec 13, 2021 at 5:09 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> > > Note that Peter fixed the kernel so that it produces known RT priorities (as
> > > opposed to developers setting random values in the past):
> > >
> > >   * sched_set_fifo_low() ==> not really RT but needs to be above cfs. Runs at
> > >     priority 1.
> > >   * sched_set_fifo() ==> sets priority MAX_RT_PRIO/2 ==> 50
[ ... cut ... ]
> > I would also say that with Peter's fix above the problem is perhaps
> > _more_ urgent? You just said that there's a whole bunch of kernel
>
> I still can't see the problem, tbh.

I think this is the key, so let me explain more here. I don't think
it's worth nitpicking the documentation more to figure out what the
original author might have meant. Even if the current behavior is
expected, it's still a broken behavior that's worth fixing.

Here's my point of view:

1. The fact that the kernel has some of its threads running w/
sched_set_fifo_low() is not ABI to userspace, right? The kernel's
usage of this function on some tasks is an implementation detail and
not something that userspace should need to know about or worry about.

2. Presumably when the kernel is using sched_set_fifo_low() it's doing
so for tasks that are important for the running of the system. I
suppose you could say that _all_ kernel tasks are important to the
running of the system, but presumably these ones are higher priority
and thus more important.

3. If userspace is bothering with all the setup of RT_GROUP_SCHED, it
presumably is expecting it to do something useful. Presumably this
"useful" thing is to keep other parts of the system (those not in the
RT group) working normally. Specifically userspace wouldn't be
expecting the system to crash or a big chunk of kernel functionality
to just stop working if the scheduling allocation is exceeded.

4. Userspace expects priorities other than the "lowest" priority to be
useful for something. If they're not then they should be disallowed.

Maybe from the above points my argument is clear? Said another way:
Userspace is allowed to use a priority other than the lowest one.
Userspace wouldn't be setting up RT_GROUP_SCHED if it didn't think it
would be needed. Userspace doesn't want the kernel to crash / chunks
of functionality to fail when RT_GROUP_SCHED triggers. Userspace can't
know about / account for kernel tasks using sched_set_fifo_low().

As an actual example, on my system (which has important kernel tasks
using sched_set_fifo_low()) a big chunk of the kernel is unusable if I
run my testcase. We can't get keyboard input nor do any other
communication to one of the important components in our system.


-Doug
