linux-kernel.vger.kernel.org archive mirror
* RT scheduler is suboptimal when an RT thread preempts another RT in terms of choosing a core to migrate
From: Rafikov, Rustem @ 2019-11-15  0:43 UTC (permalink / raw)
  To: linux-kernel

Hi,

When an RT thread preempts another RT thread, the scheduler migrates the preempted thread to another core.
The way the RT scheduler chooses that core is quite suboptimal. Let me give an example from a "production" server with 32 physical cores in total.
There are SCHED_NORMAL threads (each affined to a particular core) and 2+ groups of RT threads (allowed to run everywhere).
The scheduler trace showed that in most cases the RT scheduler preempts a normal-priority thread to place the evicted RT thread on its core, rather than using one of the idle cores, of which the system had plenty according to the trace.

I reproduced the behavior on a vanilla 4.18.0 kernel with a micro test in which I created 10 SCHED_NORMAL threads affined to 10 cores,
3 RT/69 threads with 0xFFFFFFFF affinity and a few RT/79 threads kicking other RT threads off their CPUs every 5 msec.
The other cores were idle, but the RT/69 threads never migrated to them.

The problem seems to be in how the mapping in the cpupri structure is updated:
1) The fair scheduler neither updates nor reads it. So we don't know if a SCHED_NORMAL thread has left a CPU. Well, that may be OK.
2) The RT scheduler uses cpupri to find a core to migrate to, but it updates it incorrectly:
- RT->RT works fine [2]
- But RT->IDLE or RT->SCHED_NORMAL [1] is not right - in both cases it sets RT_MAX (100, MAX_RT_PRIO in the kernel source), which is the minimum NORMAL value!
It's totally okay to set it to RT_MAX for all NORMAL threads but not for IDLE. BTW - IDLE means swapper, which has pri=120 :)
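
For reference, the mapping from a 140-based task priority to the 102-entry cpupri scale can be sketched as below. This is a Python model of convert_prio() as it reads in kernel/sched/cpupri.c around 4.18, not the kernel code itself; the CPUPRI_INVALID case is left out, and the names only mirror the kernel's for readability.

```python
MAX_RT_PRIO = 100    # first non-RT kernel priority value
MAX_PRIO = 140       # the only value the mapping treats as "idle"
CPUPRI_IDLE = 0
CPUPRI_NORMAL = 1


def convert_prio(prio: int) -> int:
    """Map a 0..140 kernel priority to the 0..101 cpupri scale (model only)."""
    if prio == MAX_PRIO:
        return CPUPRI_IDLE
    if prio >= MAX_RT_PRIO:
        return CPUPRI_NORMAL
    # RT range: a lower kernel prio value yields a higher cpupri index.
    return MAX_RT_PRIO - prio + 1
```

In this model convert_prio(100) yields 1 (NORMAL) and only convert_prio(140) yields 0 (IDLE), so a CPU for which 100 is reported after its last RT thread leaves is recorded as NORMAL even if it is actually idle.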

See the kprobe traces of cpupri_set() below.

[1] IDLE->RT/79->IDLE
#1. <idle>-0     [001] d.h. 14717592.107294: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0001
#2. <...>-157332 [001] d... 14717592.107313: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=64 oldpri=0051

Decoding the output at #1: cpu=1 newp=14 oldpri=0001
- cpu=1 - it happens on core 1
- newp=14 - the priority of the thread being scheduled in is 0x14 (20), which is RT-79 (our test thread)
- oldpri=0001 - the priority previously recorded for that CPU. "1" means NORMAL on the 0-101 scale. This is incorrect by itself because the core was IDLE!
Let's try to figure out why it is not '0' (IDLE) by looking at line #2: cpu=1 newp=64 oldpri=0051
- newp=64 says that the priority of the thread being scheduled in is 0x64 (100), which is the minimum NORMAL value. So it is not 140, as we would expect when switching to the IDLE thread.
- oldpri=0051 - this is 81, the priority of our RT-79 thread on the 0-101 scale
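
The manual decoding above can be automated with a small helper. This is only a sketch: it assumes the probe prints newp and oldpri in hex (as decoded above) and cpu in decimal, and the function names are made up for illustration.

```python
import re

# The three fields printed by the myprobe3 kprobe on cpupri_set().
FIELDS = re.compile(r"cpu=(\d+) newp=([0-9a-f]+) oldpri=([0-9a-f]+)")


def describe_cpupri(idx: int) -> str:
    """Name a value on the 0-101 cpupri scale."""
    if idx == 0:
        return "IDLE"
    if idx == 1:
        return "NORMAL"
    return f"RT-{idx - 2}"          # 81 -> RT-79, 71 -> RT-69


def describe_newp(prio: int) -> str:
    """Name a 0-140 kernel priority passed to cpupri_set()."""
    if prio < 100:
        return f"RT-{99 - prio}"    # 0x14 = 20 -> RT-79
    return "IDLE" if prio == 140 else "NORMAL"


def decode(line: str) -> dict:
    """Extract and name the cpu/newp/oldpri fields of one trace line."""
    m = FIELDS.search(line)
    if m is None:
        raise ValueError("not a myprobe3 trace line")
    return {
        "cpu": int(m.group(1)),
        "new": describe_newp(int(m.group(2), 16)),
        "old": describe_cpupri(int(m.group(3), 16)),
    }
```

Fed with line #2 of trace [1], decode() reports new=NORMAL and old=RT-79, which is the RT->IDLE transition being recorded as NORMAL.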


[2] RT/69->RT/79->RT/69
#1. <...>-158253 [001] d.h. 14723119.396120: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0047
#2. <...>-158254 [001] d... 14723119.396122: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=1e oldpri=0051

Line #1 - "cpu=1 newp=14 oldpri=0047" - switching to 0x14, the RT-79 thread
- oldpri=0047 - the priority currently on the CPU is 0x47 (71) on the 0-101 scale, i.e. RT-69
Line #2 - "cpu=1 newp=1e oldpri=0051" - switching to 0x1e, RT-69. This is the correct value for the thread being scheduled in!
- oldpri=0051 - 81 on the 0-101 scale, i.e. the RT-79 thread being switched out

Thanks,
Rustem



* Re: RT scheduler is suboptimal when an RT thread preempts another RT in terms of choosing a core to migrate
From: Dietmar Eggemann @ 2019-11-27 15:27 UTC (permalink / raw)
  To: Rafikov, Rustem, linux-kernel


I have seen the same thing. cp->pri_to_cpu[CPUPRI_IDLE] (CPUPRI_IDLE=0)
is never used. So cpupri_find() always skips over it.

There was
https://lore.kernel.org/r/1415260327-30465-2-git-send-email-pang.xunlei@linaro.org
in 2014 but it didn't go mainline.
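
To illustrate the effect, here is a toy Python model of the lookup. It is loosely modeled on cpupri_find() but ignores the atomic counters, memory barriers and the lowest_mask output; the dictionaries and CPU numbers are made up for the example.

```python
def cpupri_find(pri_to_cpu: dict, task_pri: int, allowed: set) -> set:
    """Return the allowed CPUs from the lowest non-empty vector below task_pri."""
    for idx in range(task_pri):
        candidates = pri_to_cpu.get(idx, set()) & allowed
        if candidates:
            return candidates
    return set()


# CPUs 0-1 run SCHED_NORMAL threads and CPUs 2-3 are idle, yet all four
# are recorded at index 1 (NORMAL); index 0 (IDLE) stays empty.
observed = {0: set(), 1: {0, 1, 2, 3}}

# Hypothetical map if idle CPUs were recorded at index 0.
expected = {0: {2, 3}, 1: {0, 1}}
```

For an RT-69 task (cpupri 71) allowed on all four CPUs, the lookup on observed returns every CPU, so a busy core is as eligible as an idle one; on expected it returns only the idle CPUs 2 and 3.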
