linux-kernel.vger.kernel.org archive mirror
* sched/fair: scheduler not running high priority process on idle cpu
@ 2020-01-14 16:50 David Laight
  2020-01-14 16:59 ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2020-01-14 16:50 UTC (permalink / raw)
  To: 'Vincent Guittot', Peter Zijlstra
  Cc: Viresh Kumar, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

I've a test that uses four RT priority processes to process audio data every 10ms.
One process wakes up the other three; they all 'beaver away' clearing a queue of
jobs, and the last one to finish sleeps until the next tick.
Usually this takes about 0.5ms, but sometimes takes over 3ms.
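
Roughly, each worker looks like this (a simplified sketch with made-up helper
names - wait_for_tick(), queue_empty(), process_job() - not the real code):

  /* Simplified sketch of one worker thread: SCHED_FIFO, woken every 10ms,
   * drains a shared job queue.  The helpers are placeholders for the real
   * application code; error handling omitted; setting SCHED_FIFO needs
   * suitable privileges (root or RLIMIT_RTPRIO). */
  #include <pthread.h>
  #include <sched.h>

  extern void wait_for_tick(void);   /* blocks until the master signals the 10ms tick */
  extern int  queue_empty(void);
  extern void process_job(void);

  static void *worker(void *arg)
  {
          struct sched_param sp = { .sched_priority = 50 };

          pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

          for (;;) {
                  wait_for_tick();
                  while (!queue_empty())
                          process_job();   /* normally ~0.5ms of work across all workers */
          }
          return NULL;
  }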

AFAICT the processes are normally woken on the same cpu they last ran on.
There seems to be a problem when the selected cpu is running a (low priority)
process that is looping in kernel [1].
I'd expect my process to be picked up by one of the idle cpus, but this
doesn't happen.
Instead the process sits in state 'waiting' until the active process sleeps
(or calls cond_resched()).

Is this really the expected behaviour?????

This is a 5.4.0-rc7 kernel.
I could try the current 5.5-rc one if any recent changes might affect things.

Additionally (probably because cv_wait() is implemented with 'ticket locks')
none of the other processes waiting for the cv are woken either.

[1] Xorg seems to periodically request that the kernel workqueue run
drm_clflush_sg() to flush the display buffer cache.
For a 2560x1440 display this is 3600 4k pages and the flush loop
takes ~3.3ms.
However there are probably other places where a process can run in
kernel for significant lengths of time.

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-14 16:50 sched/fair: scheduler not running high priority process on idle cpu David Laight
@ 2020-01-14 16:59 ` Steven Rostedt
  2020-01-14 17:33   ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2020-01-14 16:59 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

On Tue, 14 Jan 2020 16:50:43 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> I've a test that uses four RT priority processes to process audio data every 10ms.
> One process wakes up the other three, they all 'beaver away' clearing a queue of
> jobs and the last one to finish sleeps until the next tick.
> Usually this takes about 0.5ms, but sometimes takes over 3ms.
> 
> AFAICT the processes are normally woken on the same cpu they last ran on.
> There seems to be a problem when the selected cpu is running a (low priority)
> process that is looping in kernel [1].
> I'd expect my process to be picked up by one of the idle cpus, but this
> doesn't happen.
> Instead the process sits in state 'waiting' until the active processes sleeps
> (or calls cond_resched()).
> 
> Is this really the expected behaviour?????

It is with CONFIG_PREEMPT_VOLUNTARY. I think you want to recompile your
kernel with CONFIG_PREEMPT. The idea is that the RT task will continue
to run on the CPU it last ran on, and would push off the lower priority
task to the idle CPU. But CONFIG_PREEMPT_VOLUNTARY means that this
will have to wait for the running task to not be in kernel context or
hit a cond_resched() which is the "voluntary" scheduling point.
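
For illustration (sketch only, not actual kernel code), a long in-kernel loop
such as the flush you describe only yields at an explicit point like this:

  /* Illustrative sketch, not real kernel code: a long loop in kernel context.
   * Under PREEMPT_VOLUNTARY the woken RT task on this CPU has to wait for the
   * cond_resched() (or for the loop to return to user space); under PREEMPT
   * it could preempt between any two iterations. */
  #include <linux/mm.h>
  #include <linux/sched.h>

  extern void flush_one_page(struct page *page);   /* placeholder for the real per-page work */

  static void flush_lots_of_pages(struct page **pages, int nr)
  {
          int i;

          for (i = 0; i < nr; i++) {
                  flush_one_page(pages[i]);
                  cond_resched();   /* voluntary preemption point */
          }
  }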

-- Steve


* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-14 16:59 ` Steven Rostedt
@ 2020-01-14 17:33   ` David Laight
  2020-01-14 17:48     ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2020-01-14 17:33 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

From: Steven Rostedt
> Sent: 14 January 2020 16:59
> 
> On Tue, 14 Jan 2020 16:50:43 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > I've a test that uses four RT priority processes to process audio data every 10ms.
> > One process wakes up the other three, they all 'beaver away' clearing a queue of
> > jobs and the last one to finish sleeps until the next tick.
> > Usually this takes about 0.5ms, but sometimes takes over 3ms.
> >
> > AFAICT the processes are normally woken on the same cpu they last ran on.
> > There seems to be a problem when the selected cpu is running a (low priority)
> > process that is looping in kernel [1].
> > I'd expect my process to be picked up by one of the idle cpus, but this
> > doesn't happen.
> > Instead the process sits in state 'waiting' until the active processes sleeps
> > (or calls cond_resched()).
> >
> > Is this really the expected behaviour?????
> 
> It is with CONFIG_PREEMPT_VOLUNTARY. I think you want to recompile your
> kernel with CONFIG_PREEMPT. The idea is that the RT task will continue
> to run on the CPU it last ran on, and would push off the lower priority
> task to the idle CPU. But CONFIG_PREEMPT_VOLUNTARY means that this
> will have to wait for the running task to not be in kernel context or
> hit a cond_resched() which is the "voluntary" scheduling point.

I have added a cond_resched() to the offending loop, but a closer look shows
that the code is also called with a lock held in another (less common) path, so
that change can't be committed as-is and CONFIG_PREEMPT won't help there either.
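
For the lock-held path something like the cond_resched_lock() helper might work
instead - a rough, untested sketch of the idea (struct work_ctx is hypothetical):

  /* Rough, untested sketch: a loop that must keep a spinlock held can still
   * yield via cond_resched_lock(), which drops the lock, schedules if needed,
   * and re-acquires it.  Anything protected by the lock may change across
   * that call. */
  #include <linux/sched.h>
  #include <linux/spinlock.h>

  struct work_ctx {                 /* hypothetical */
          spinlock_t lock;
          int remaining;
  };

  static void drain_locked(struct work_ctx *ctx)
  {
          spin_lock(&ctx->lock);
          while (ctx->remaining > 0) {
                  ctx->remaining--;               /* stand-in for one work item */
                  cond_resched_lock(&ctx->lock);  /* may drop and re-take the lock */
          }
          spin_unlock(&ctx->lock);
  }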

Indeed requiring CONFIG_PREEMPT doesn't help when customers are running
the application, nor (probably) on AWS since I doubt it is ever the default.

Does the same apply to non-RT tasks?
I can select almost any priority, but RT ones are otherwise a lot better.

I've also seen RT processes delayed by the network stack 'bh' that runs
in a softint from the hardware interrupt.
That can take a while (clearing up tx and refilling rx) and I don't think we
have any control over the cpu it runs on?

The cost of ftrace function call entry/exit (about 200 clocks) makes it
rather unsuitable for any performance measurements unless only
a very few functions are traced - which rather requires you know
what the code is doing :-(

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-14 17:33   ` David Laight
@ 2020-01-14 17:48     ` Steven Rostedt
  2020-01-15 12:44       ` David Laight
  2020-01-15 12:57       ` David Laight
  0 siblings, 2 replies; 16+ messages in thread
From: Steven Rostedt @ 2020-01-14 17:48 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

On Tue, 14 Jan 2020 17:33:50 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> I have added a cond_resched() to the offending loop, but a close look implies
> that code is called with a lock held in another (less common) path so that
> can't be directly committed and so CONFIG_PREEMPT won't help.
> 
> Indeed requiring CONFIG_PREEMPT doesn't help when customers are running
> the application, nor (probably) on AWS since I doubt it is ever the default.
> 
> Does the same apply to non-RT tasks?
> I can select almost any priority, but RT ones are otherwise a lot better.
> 
> I've also seen RT processes delayed by the network stack 'bh' that runs
> in a softint from the hardware interrupt.
> That can take a while (clearing up tx and refilling rx) and I don't think we
> have any control over the cpu it runs on?

Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
any task regardless of priority. If you have latency requirements, then
you need to apply the PREEMPT_RT patch (which may soon make it to
mainline this year!), with which spin locks and bh won't stop a task from
scheduling (unless they need the same lock).

> 
> The cost of ftrace function call entry/exit (about 200 clocks) makes it
> rather unsuitable for any performance measurements unless only
> a very few functions are traced - which rather requires you know
> what the code is doing :-(
> 

Well, when I use function tracing, I start with all functions enabled and
analyze the trace; then I add the functions I don't care about (usually spin
locks and other utils) to the set_ftrace_notrace file, which keeps them out
of the trace. I keep doing this until I find a set of functions that doesn't
hurt overhead as much and still gives me enough information to know what is
happening. It also helps to enable all or most events (at least the
scheduling events).

-- Steve


* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-14 17:48     ` Steven Rostedt
@ 2020-01-15 12:44       ` David Laight
  2020-01-15 13:18         ` Steven Rostedt
  2020-01-15 14:56         ` Peter Zijlstra
  2020-01-15 12:57       ` David Laight
  1 sibling, 2 replies; 16+ messages in thread
From: David Laight @ 2020-01-15 12:44 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

From: Steven Rostedt
> Sent: 14 January 2020 17:48
> 
> On Tue, 14 Jan 2020 17:33:50 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > I have added a cond_resched() to the offending loop, but a close look implies
> > that code is called with a lock held in another (less common) path so that
> > can't be directly committed and so CONFIG_PREEMPT won't help.
> >
> > Indeed requiring CONFIG_PREEMPT doesn't help when customers are running
> > the application, nor (probably) on AWS since I doubt it is ever the default.
> >
> > Does the same apply to non-RT tasks?
> > I can select almost any priority, but RT ones are otherwise a lot better.
> >
> > I've also seen RT processes delayed by the network stack 'bh' that runs
> > in a softint from the hardware interrupt.
> > That can take a while (clearing up tx and refilling rx) and I don't think we
> > have any control over the cpu it runs on?
> 
> Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
> any task regardless of priority. If you have latency requirements, then
> you need to apply the PREEMPT_RT patch (which may soon make it to
> mainline this year!), which spin locks and bh wont stop a task from
> scheduling (unless they need the same lock)

We're not trying to do anything life-threatening.
So the latency requirements are only moderate - failures mess up telephone
audio quality. There is also allowance for jitter elsewhere.
OTOH not running a high priority process when there are idle cpus seems 'sub-optimal'.

Code that runs with a spin-lock held (or otherwise disables preemption)
for significant periods probably ought to be detected and warned about.
I'm not sure of a suitable limit; 100us is probably excessive on x86.
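
Conceptually the check is cheap - something like this (hand-written sketch of
the idea, not a patch; the preemptoff tracer already does the real version of
this when it is compiled in, and threshold_ns is a made-up knob):

  /* Sketch only: stamp the time when preemption is disabled and complain if
   * it stayed disabled longer than some (sysctl-style) threshold. */
  #include <linux/ktime.h>
  #include <linux/percpu.h>
  #include <linux/printk.h>

  static DEFINE_PER_CPU(u64, preempt_off_start);

  static void note_preempt_disable(void)
  {
          __this_cpu_write(preempt_off_start, ktime_get_ns());
  }

  static void note_preempt_enable(u64 threshold_ns)
  {
          u64 delta = ktime_get_ns() - __this_cpu_read(preempt_off_start);

          if (delta > threshold_ns)
                  pr_warn("preemption disabled for %llu ns\n",
                          (unsigned long long)delta);
  }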

IIUC PREEMPT_RT adds overhead to quite a bit of code and is unlikely
to get enabled in 'distro' kernels.
Especially since they've not enabled CONFIG_PREEMPT which probably
has a lower impact - provided the cv+mutex wakeup has been arranged
to avoid the treble process switch.

Running the driver bh (which is often significant) from a high priority
worker thread instead of a softint (which isn't much different to the
'hardint' it is scheduled from) probably doesn't cost much (in-kernel
process switches shouldn't be much more than a stack switch).
That would benefit RT processes since they could be higher
priority than the bh code.
Although you'd probably want a 'strongly preferred' cpu for them.

	David


* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-14 17:48     ` Steven Rostedt
  2020-01-15 12:44       ` David Laight
@ 2020-01-15 12:57       ` David Laight
  2020-01-15 14:23         ` Steven Rostedt
  1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2020-01-15 12:57 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel


From: Steven Rostedt
> Sent: 14 January 2020 17:48
...
> > The cost of ftrace function call entry/exit (about 200 clocks) makes it
> > rather unsuitable for any performance measurements unless only
> > a very few functions are traced - which rather requires you know
> > what the code is doing :-(
> >
> 
> Well, when I use function tracing, I start all of them, analyze the
> trace, then the functions I don't care about (usually spin locks and
> other utils), I add to the set_ftrace_notrace file,  which keeps them
> from being part of the trace. I keep doing this until I find a set of
> functions that doesn't hurt overhead as much and gives me enough
> information to know what is happening. It also helps to enable all or
> most events (at least scheduling events).

I've been using schedviz - but have had to 'fixup' wrapped traces so that
all the cpu traces start at the same time to get it to load them.
I managed to find what the worker thread was running - but only
because it ran for the entire time 'echo t >/proc/sysrq-trigger' took
to finish. Then I looked at the sources to find the code...

I'm surprised the 'normal case' for tracing function entry isn't done
in assembler without saving all the registers (etc).
For tsc stamps I think it should be possible, saving just 3 registers,
in under 32 instructions. Scaling to ns is a bit harder.
It's a shame the ns scaling isn't left to the reading code.

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 12:44       ` David Laight
@ 2020-01-15 13:18         ` Steven Rostedt
  2020-01-15 14:43           ` David Laight
  2020-01-15 15:11           ` David Laight
  2020-01-15 14:56         ` Peter Zijlstra
  1 sibling, 2 replies; 16+ messages in thread
From: Steven Rostedt @ 2020-01-15 13:18 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

On Wed, 15 Jan 2020 12:44:19 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
> > any task regardless of priority. If you have latency requirements, then
> > you need to apply the PREEMPT_RT patch (which may soon make it to
> > mainline this year!), which spin locks and bh wont stop a task from
> > scheduling (unless they need the same lock)  

Every time you add something to allow higher priority processes to run
with less latency you add overhead. Just adding that spinlock check,
or migrating a process to an idle cpu, will add a measurable overhead,
and as you state, distros won't like that.

It's a constant game of give and take.

> 
> Running the driver bh (which is often significant) from a high priority
> worker thread instead of a softint (which isn't much different to the
> 'hardint' it is scheduled from) probably doesn't cost much (in-kernel
> process switches shouldn't be much more than a stack switch).
> That would benefit RT processes since they could be higher
> priority than the bh code.
> Although you'd probably want a 'strongly preferred' cpu for them.

BTW, I believe distros compile with CONFIG_IRQ_FORCED_THREADING, which
means that if you add "threadirqs" to the kernel command line the interrupt
handlers will run as threads, which allows for even more preemption.
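
For reference, the explicit form of that split in a driver looks roughly like
this (generic sketch, not from any particular driver; the my_* names are made
up):

  /* Sketch: the hard-irq handler just acks/defers, the bulk of the work runs
   * in a schedulable irq thread - the same split "threadirqs" forces for
   * ordinary handlers - so a higher priority RT task can preempt it. */
  #include <linux/interrupt.h>

  static irqreturn_t my_hardirq(int irq, void *dev_id)
  {
          /* hard interrupt context: keep it minimal */
          return IRQ_WAKE_THREAD;
  }

  static irqreturn_t my_irq_thread(int irq, void *dev_id)
  {
          /* runs in a kernel thread, preemptible and schedulable */
          return IRQ_HANDLED;
  }

  static int my_setup_irq(unsigned int irq, void *dev)
  {
          return request_threaded_irq(irq, my_hardirq, my_irq_thread,
                                      IRQF_ONESHOT, "mydev", dev);
  }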

-- Steve



* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 12:57       ` David Laight
@ 2020-01-15 14:23         ` Steven Rostedt
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2020-01-15 14:23 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

On Wed, 15 Jan 2020 12:57:10 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> I'm surprised the 'normal case' for tracing function entry isn't done
> in assembler without saving all the registers (etc).

Well, it doesn't save all registers unless you ask it to. It only saves
what the compiler mandates for "fentry" before calling C code.

> For tsc stamps I think it should be possible saving just 3 registers
> in under 32 instructions. Scaling to ns is a bit harder.
> It's a shame the ns scaling isn't left to the reading code.

Well, it could be done, as the ring buffer allows you to post process
timestamps. You could switch to using just tsc:

 echo x86-tsc > /sys/kernel/tracing/trace_clock

One reason that we do not post process the scaling to ns is that the
scaling can change over time depending on the clock source, which means
post processing will give you inaccurate results. But the
infrastructure is there to do it.
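
The post-processing itself is just the usual mult/shift scaling - a sketch,
where how the tool obtains mult and shift (calibration) is left out:

  /* Sketch of what a post-processor would do with raw x86-tsc timestamps:
   * the same cyc2ns-style scaling the kernel uses.  Note a real
   * implementation uses a wider intermediate (e.g. mul_u64_u32_shr()) so
   * the multiply cannot overflow 64 bits. */
  static inline unsigned long long tsc_to_ns(unsigned long long tsc,
                                             unsigned int mult,
                                             unsigned int shift)
  {
          return (tsc * mult) >> shift;
  }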

-- Steve




* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 13:18         ` Steven Rostedt
@ 2020-01-15 14:43           ` David Laight
  2020-01-15 15:11           ` David Laight
  1 sibling, 0 replies; 16+ messages in thread
From: David Laight @ 2020-01-15 14:43 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

From: Steven Rostedt
> Sent: 15 January 2020 13:19
...
> BTW, I believe distros compile with "CONFIG_IRQ_FORCED_THREADING" which
> means if you add to the kernel command line "threadirqs" the interrupts
> will be run as threads. Which allows for even more preemption.

So they do...
I guess that'll stop the bh code running 'on top of' my RT thread.
But it won't help getting the RT process running when the 'events_unbound'
kernel worker is running.

They also use grub2, which is so complicated that update-grub is always used,
which dumbs everything down to a default set that is hard to change.

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 12:44       ` David Laight
  2020-01-15 13:18         ` Steven Rostedt
@ 2020-01-15 14:56         ` Peter Zijlstra
  2020-01-15 15:09           ` David Laight
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-01-15 14:56 UTC (permalink / raw)
  To: David Laight
  Cc: 'Steven Rostedt', 'Vincent Guittot',
	Viresh Kumar, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Ben Segall, Mel Gorman, linux-kernel

On Wed, Jan 15, 2020 at 12:44:19PM +0000, David Laight wrote:

> Code that runs with a spin-lock held (or otherwise disables preemption)
> for significant periods probably ought to be detected and warned.
> I'm not sure of a suitable limit, 100us is probably excessive on x86.

Problem is, without CONFIG_PREEMPT_COUNT (basically only
PREEMPT/PREEMPT_RT) we can't even tell.

And I think we tried adding warnings to things like softirq, but then we
get into arguments with the pure performance people on how allowing it
longer will make their benchmarks go faster.

There really is no silver bullet here :/


* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 14:56         ` Peter Zijlstra
@ 2020-01-15 15:09           ` David Laight
  0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2020-01-15 15:09 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: 'Steven Rostedt', 'Vincent Guittot',
	Viresh Kumar, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Ben Segall, Mel Gorman, linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
> Sent: 15 January 2020 14:57
> On Wed, Jan 15, 2020 at 12:44:19PM +0000, David Laight wrote:
> 
> > Code that runs with a spin-lock held (or otherwise disables preemption)
> > for significant periods probably ought to be detected and warned.
> > I'm not sure of a suitable limit, 100us is probably excessive on x86.
> 
> Problem is, without CONFIG_PREEMPT_COUNT (basically only
> PREEMPT/PREEMPT_RT) we can't even tell.
> 
> And I think we tried adding warnings to things like softirq, but then we
> get into arguments with the pure performance people on how allowing it
> longer will make their benchmarks go faster.

The interval would have to be a sysctl - like the one for sleeping uninterruptibly.
(Although that one is a pain for some kernel threads. I'd like to be able to
mark some uninterruptible sleeps as 'long term' and as not affecting the load
average.)

I remember (a long time ago) adding code to an ethernet driver to limit it
to 90% of the bandwidth to allow other systems to transmit (10M HDX).
Someone said 'we can't do that, people expect 100%'; a week later he
asked me how to enable it because the AMD Lance could never transmit
if it was receiving back-to-back packets (e.g. in promiscuous mode).
Benchmarks are a PITA....

	David


* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 13:18         ` Steven Rostedt
  2020-01-15 14:43           ` David Laight
@ 2020-01-15 15:11           ` David Laight
  2020-01-15 15:30             ` Steven Rostedt
  1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2020-01-15 15:11 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

From: Steven Rostedt
> Sent: 15 January 2020 13:19
> On Wed, 15 Jan 2020 12:44:19 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > > Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
> > > any task regardless of priority. If you have latency requirements, then
> > > you need to apply the PREEMPT_RT patch (which may soon make it to
> > > mainline this year!), which spin locks and bh wont stop a task from
> > > scheduling (unless they need the same lock)
> 
> Every time you add something to allow higher priority processes to run
> with less latency you add overhead. By just adding that spinlock check
> or to migrate a process to a idle cpu will add a measurable overhead,
> and as you state, distros won't like that.
> 
> It's a constant game of give and take.

I know exactly how much effect innocuous changes can have...

Sorting out process migration on a 1024 cpu NUMA system must be a PITA.

For this case an idle cpu doing an unlocked check for a process that has
been waiting 'ages' to preempt the running process may not be too
expensive.
I presume the locks are in place for the migrate itself.
The only downside is that the process's data is likely to be in the wrong cache,
but unless the original cpu becomes available just after the migrate it is
probably still a win.
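
In pseudo-code, the sort of thing I mean (very hand-wavy; none of these
helpers exist as written, they just name the steps):

  /* Pseudo-code only: an idle CPU peeks locklessly at busy runqueues and
   * only takes the runqueue locks if it finds a task that has been
   * runnable for longer than some threshold. */
  for_each_online_cpu(cpu) {
          struct task_struct *p = peek_longest_waiter(cpu); /* unlocked, may be stale */

          if (p && runnable_ns(p) > steal_threshold_ns) {
                  double_lock_runqueues(this_cpu, cpu);      /* real locking for the migrate */
                  if (task_still_waiting(p))
                          migrate_task_to(p, this_cpu);
                  double_unlock_runqueues(this_cpu, cpu);
                  break;
          }
  }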

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 15:11           ` David Laight
@ 2020-01-15 15:30             ` Steven Rostedt
  2020-01-15 17:07               ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2020-01-15 15:30 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

On Wed, 15 Jan 2020 15:11:32 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 15 January 2020 13:19
> > On Wed, 15 Jan 2020 12:44:19 +0000
> > David Laight <David.Laight@ACULAB.COM> wrote:
> >   
> > > > Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
> > > > any task regardless of priority. If you have latency requirements, then
> > > > you need to apply the PREEMPT_RT patch (which may soon make it to
> > > > mainline this year!), which spin locks and bh wont stop a task from
> > > > scheduling (unless they need the same lock)  
> > 
> > Every time you add something to allow higher priority processes to run
> > with less latency you add overhead. By just adding that spinlock check
> > or to migrate a process to a idle cpu will add a measurable overhead,
> > and as you state, distros won't like that.
> > 
> > It's a constant game of give and take.  
> 
> I know exactly how much effect innocuous changes can have...
> 
> Sorting out process migration on a 1024 cpu NUMA system must be a PITA.
> 
> For this case an idle cpu doing a unlocked check for a processes that has
> been waiting 'ages' to preempt the running process may not be too
> expensive.

How do you measure a process waiting for ages on another CPU? And then
by the time you get the information to pull it, there's always the race
that the process will get the chance to run. And if you think about it,
by looking for a process waiting for a long time, it is likely it will
start to run because "ages" means it's probably close to being released.

> I presume the locks are in place for the migrate itself.

Note, grabbing locks on another CPU will incur overhead on that
other CPU. I've seen huge latency caused by doing just this.

> The only downside is that the process's data is likely to be in the wrong cache,
> but unless the original cpu becomes available just after the migrate it is
> probably still a win.

If you are doing this with just tasks that are waiting for the CPU to
be preemptable, then it is most likely not a win at all.

Now, the RT tasks do have an aggressive push / pull logic, that keeps
track of which CPUs are running lower priority tasks and will work hard
to keep all RT tasks running (and aggressively migrate them). But this
logic still only takes place at preemption points (cond_resched(), etc).

-- Steve



* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 15:30             ` Steven Rostedt
@ 2020-01-15 17:07               ` David Laight
  2020-01-20  9:39                 ` Dietmar Eggemann
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2020-01-15 17:07 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Mel Gorman, linux-kernel

From: Steven Rostedt
> Sent: 15 January 2020 15:31
...
> > For this case an idle cpu doing a unlocked check for a processes that has
> > been waiting 'ages' to preempt the running process may not be too
> > expensive.
> 
> How do you measure a process waiting for ages on another CPU? And then
> by the time you get the information to pull it, there's always the race
> that the process will get the chance to run. And if you think about it,
> by looking for a process waiting for a long time, it is likely it will
> start to run because "ages" means it's probably close to being released.

Without a CBU (Crystal Ball Unit) you can always be unlucky.
But once you get over the 'normal' delays for a system call you probably
get an exponential (or is it logarithmic) distribution and the additional
delay is likely to be at least some fraction of the time it has already waited.

It's not entirely the same thing, but it is something I still need to look at further.
This is a histogram of the time taken (in ns) to send on a raw IPv4 socket.
0k: 1874462617
96k: 260350
160k: 30771
224k: 14812
288k: 770
352k: 593
416k: 489
480k: 368
544k: 185
608k: 63
672k: 27
736k: 6
800k: 1
864k: 2
928k: 3
992k: 4
1056k: 1
1120k: 0
1184k: 1
1248k: 1
1312k: 2
1376k: 3
1440k: 1
1504k: 1
1568k: 1
1632k: 4
1696k: 0 (5 times)
2016k: 1
2080k: 0
2144k: 1
total: 1874771078, average 32k

I've improved it no end by using per-thread sockets and setting
the socket write queue size large.
But there are still some places where it takes > 600us.
The top end is rather more linear than one might expect.

> > I presume the locks are in place for the migrate itself.
> 
> Note, by grabbing locks on another CPU will incur overhead on that
> other CPU. I've seen huge latency caused by doing just this.

I'd have thought this would only be significant if the cache line
ends up being used by both cpus?

> > The only downside is that the process's data is likely to be in the wrong cache,
> > but unless the original cpu becomes available just after the migrate it is
> > probably still a win.
> 
> If you are doing this with just tasks that are waiting for the CPU to
> be preemptable, then it is most likely not a win at all.

You'd need a good guess that the wait would be long.

> Now, the RT tasks do have an aggressive push / pull logic, that keeps
> track of which CPUs are running lower priority tasks and will work hard
> to keep all RT tasks running (and aggressively migrate them). But this
> logic still only takes place at preemption points (cond_resched(), etc).

I guess this only 'gives away' extra RT processes.
Rather than 'stealing' them - which is what I need.

	David


* Re: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-15 17:07               ` David Laight
@ 2020-01-20  9:39                 ` Dietmar Eggemann
  2020-01-20 10:51                   ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2020-01-20  9:39 UTC (permalink / raw)
  To: David Laight, 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Ben Segall, Mel Gorman, linux-kernel

On 15/01/2020 18:07, David Laight wrote:
> From Steven Rostedt
>> Sent: 15 January 2020 15:31
> ...
>>> For this case an idle cpu doing a unlocked check for a processes that has
>>> been waiting 'ages' to preempt the running process may not be too
>>> expensive.
>>
>> How do you measure a process waiting for ages on another CPU? And then
>> by the time you get the information to pull it, there's always the race
>> that the process will get the chance to run. And if you think about it,
>> by looking for a process waiting for a long time, it is likely it will
>> start to run because "ages" means it's probably close to being released.
> 
> Without a CBU (Crystal Ball Unit) you can always be unlucky.
> But once you get over the 'normal' delays for a system call you probably
> get an exponential (or is it logarithmic) distribution and the additional
> delay is likely to be at least some fraction of the time it has already waited.
> 
> While not entirely the same, but something I still need to look at further.
> This is a histogram of time taken (in ns) to send on a raw IPv4 socket.
> 0k: 1874462617
> 96k: 260350
> 160k: 30771
> 224k: 14812
> 288k: 770
> 352k: 593
> 416k: 489
> 480k: 368
> 544k: 185
> 608k: 63
> 672k: 27
> 736k: 6
> 800k: 1
> 864k: 2
> 928k: 3
> 992k: 4
> 1056k: 1
> 1120k: 0
> 1184k: 1
> 1248k: 1
> 1312k: 2
> 1376k: 3
> 1440k: 1
> 1504k: 1
> 1568k: 1
> 1632k: 4
> 1696k: 0 (5 times)
> 2016k: 1
> 2080k: 0
> 2144k: 1
> total: 1874771078, average 32k
> 
> I've improved it no end by using per-thread sockets and setting
> the socket write queue size large.
> But there are still some places where it takes > 600us.
> The top end is rather more linear than one might expect.
> 
>>> I presume the locks are in place for the migrate itself.
>>
>> Note, by grabbing locks on another CPU will incur overhead on that
>> other CPU. I've seen huge latency caused by doing just this.
> 
> I'd have thought this would only be significant if the cache line
> ends up being used by both cpus?
> 
>>> The only downside is that the process's data is likely to be in the wrong cache,
>>> but unless the original cpu becomes available just after the migrate it is
>>> probably still a win.
>>
>> If you are doing this with just tasks that are waiting for the CPU to
>> be preemptable, then it is most likely not a win at all.
> 
> You'd need a good guess that the wait would be long.
> 
>> Now, the RT tasks do have an aggressive push / pull logic, that keeps
>> track of which CPUs are running lower priority tasks and will work hard
>> to keep all RT tasks running (and aggressively migrate them). But this
>> logic still only takes place at preemption points (cond_resched(), etc).
> 
> I guess this only 'gives away' extra RT processes.
> Rather than 'stealing' them - which is what I need.

Isn't part of the problem that RT doesn't maintain
cp->pri_to_cpu[CPUPRI_IDLE] (CPUPRI_IDLE = 0)?

So push/pull (find_lowest_rq()) never returns a mask of idle CPUs.

There was
https://lore.kernel.org/r/1415260327-30465-2-git-send-email-pang.xunlei@linaro.org
in 2014 but it didn't go mainline.

There was a similar question in Nov last year:

https://lore.kernel.org/r/CH2PR19MB3896AFE1D13AD88A17160860FC700@CH2PR19MB3896.namprd19.prod.outlook.com







* RE: sched/fair: scheduler not running high priority process on idle cpu
  2020-01-20  9:39                 ` Dietmar Eggemann
@ 2020-01-20 10:51                   ` David Laight
  0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2020-01-20 10:51 UTC (permalink / raw)
  To: 'Dietmar Eggemann', 'Steven Rostedt'
  Cc: 'Vincent Guittot',
	Peter Zijlstra, Viresh Kumar, Ingo Molnar, Juri Lelli,
	Ben Segall, Mel Gorman, linux-kernel

From: Dietmar Eggemann
> Sent: 20 January 2020 09:39
..
> > I guess this only 'gives away' extra RT processes.
> > Rather than 'stealing' them - which is what I need.
> 
> Isn't part of the problem that RT doesn't maintain
> cp->pri_to_cpu[CPUPRI_IDLE] (CPUPRI_IDLE = 0).
> 
> So push/pull (find_lowest_rq()) never returns a mask of idle CPUs.
> 
> There was
> https://lore.kernel.org/r/1415260327-30465-2-git-send-email-pang.xunlei@linaro.org
> in 2014 but it didn't go mainline.
> 
> There was a similar question in Nov last year:
> 
> https://lore.kernel.org/r/CH2PR19MB3896AFE1D13AD88A17160860FC700@CH2PR19MB3896.namprd19.prod.outlook.com

They are probably all related.
My brain doesn't have space to completely 'grok' the scheduler without something
else being pushed out of the main cache.

I bet there are other cases where it decides to run a process on a cpu that
is running a process that is bound to that cpu while there are other idle cpus.

Partly this is because the problem is 'hard': getting it anywhere near 'right'
for NUMA systems with lots of cpus, without using all the processing
power just deciding what to run, is probably impossible.
However faster decisions can probably be made with 'slightly stale' data that
get corrected later if they are incorrect.

	David


Thread overview: 16+ messages
2020-01-14 16:50 sched/fair: scheduler not running high priority process on idle cpu David Laight
2020-01-14 16:59 ` Steven Rostedt
2020-01-14 17:33   ` David Laight
2020-01-14 17:48     ` Steven Rostedt
2020-01-15 12:44       ` David Laight
2020-01-15 13:18         ` Steven Rostedt
2020-01-15 14:43           ` David Laight
2020-01-15 15:11           ` David Laight
2020-01-15 15:30             ` Steven Rostedt
2020-01-15 17:07               ` David Laight
2020-01-20  9:39                 ` Dietmar Eggemann
2020-01-20 10:51                   ` David Laight
2020-01-15 14:56         ` Peter Zijlstra
2020-01-15 15:09           ` David Laight
2020-01-15 12:57       ` David Laight
2020-01-15 14:23         ` Steven Rostedt
