All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] sched/rt: Add comments describing the RT IPI pull method
@ 2017-02-28 20:50 Steven Rostedt
  2017-03-01  9:44 ` Peter Zijlstra
  2017-03-16 11:16 ` [tip:sched/core] " tip-bot for Steven Rostedt (VMware)
  0 siblings, 2 replies; 4+ messages in thread
From: Steven Rostedt @ 2017-02-28 20:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Clark Williams, Daniel Bristot de Oliveira


From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---

Changes since v1.

  Added updates to hopefully handle Peter's comments.


 kernel/sched/rt.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

Index: linux-trace.git/kernel/sched/rt.c
===================================================================
--- linux-trace.git.orig/kernel/sched/rt.c
+++ linux-trace.git/kernel/sched/rt.c
@@ -1928,6 +1928,87 @@ static int find_next_push_cpu(struct rq
 #define RT_PUSH_IPI_EXECUTING		1
 #define RT_PUSH_IPI_RESTART		2
 
+/*
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * On large CPU boxes, there's the case that several CPUs could schedule
+ * a lower priority task at the same time, in which case it will look for
+ * any overloaded CPUs that it could pull a task from. To do this, the runqueue
+ * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
+ * for a single overloaded CPU's runqueue lock can produce a large latency.
+ * (This has actually been observed on large boxes running cyclictest).
+ * Instead of taking the runqueue lock of the overloaded CPU, each of the
+ * CPUs that scheduled a lower priority task simply sends an IPI to the
+ * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
+ * lots of contention. The overloaded CPU will look to push its non-running
+ * RT task off, and if it does, it can then ignore the other IPIs coming
+ * in, and just pass those IPIs off to any other overloaded CPU.
+ *
+ * When a CPU schedules a lower priority task, it only sends an IPI to
+ * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
+ * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
+ * RT overloaded tasks, would cause 100 IPIs to go out at once.
+ *
+ * The overloaded RT CPU, wher receiving an IPI, will try to push off its
+ * overloaded RT tasks and then send an IPI to the next CPU that has
+ * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
+ * have completed. Just because a CPU may have pushed off its own overloaded
+ * RT task does not mean it should stop sending the IPI around to other
+ * overloaded CPUs. There may be another RT task waiting to run on one of
+ * those CPUs that are of higher priority than the one that was just
+ * pushed.
+ *
+ * An optimization that could possibly be made is to make a CPU array similar
+ * to the cpupri array mask of all running RT tasks, but for the overloaded
+ * case, then the IPI could be sent to only the CPU with the highest priority
+ * RT task waiting, and that CPU could send off further IPIs to the CPU with
+ * the next highest waiting task. Since the overloaded case is much less likely
+ * to happen, the complexity of this implementation may not be worth it.
+ * Instead, just send an IPI around to all overloaded CPUs.
+ *
+ * The rq->rt.push_flags holds the status of the IPI that is going around.
+ * A run queue can only send out a single IPI at a time. The possible flags
+ * for rq->rt.push_flags are:
+ *
+ *    (None or zero):		No IPI is going around for the current rq
+ *    RT_PUSH_IPI_EXECUTING:	An IPI for the rq is being passed around
+ *    RT_PUSH_IPI_RESTART:	The priority of the running task for the rq
+ *				  has changed, and the IPI should restart
+ *				  circulating the overloaded CPUs again.
+ *
+ * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
+ * before sending to the next CPU.
+ *
+ * Instead of having all CPUs that schedule a lower priority task send
+ * an IPI to the same "first" CPU in the RT overload mask, they send it
+ * to the next overloaded CPU after their own CPU. This helps distribute
+ * the work when there's more than one overloaded CPU and multiple CPUs
+ * scheduling in lower priority tasks.
+ *
+ * When an rq schedules a lower priority task than what was currently
+ * running, the next CPU with overloaded RT tasks is examined first.
+ * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
+ * priority task, it will send an IPI first to CPU 5, then CPU 5 will
+ * send to CPU 1 if it is still overloaded. CPU 1 will clear the
+ * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
+ *
+ * The first CPU to notice IPI_RESTART is set, will clear that flag and then
+ * send an IPI to the next overloaded CPU after the rq->cpu and not the next
+ * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
+ * schedules a lower priority task, and the IPI_RESTART gets set while the
+ * handling is being done on CPU 5, it will clear the flag and send it back to
+ * CPU 4 instead of CPU 1.
+ *
+ * Note, the above logic can be disabled by turning off the sched_feature
+ *  RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
+ *  taken by the CPU requesting a pull and the waiting RT task will be pulled
+ *  by that CPU. This may be fine for machines with few CPUs.
+ */
 static void tell_cpu_to_push(struct rq *rq)
 {
 	int cpu;

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] sched/rt: Add comments describing the RT IPI pull method
  2017-02-28 20:50 [PATCH v2] sched/rt: Add comments describing the RT IPI pull method Steven Rostedt
@ 2017-03-01  9:44 ` Peter Zijlstra
  2017-03-01 13:40   ` Steven Rostedt
  2017-03-16 11:16 ` [tip:sched/core] " tip-bot for Steven Rostedt (VMware)
  1 sibling, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2017-03-01  9:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Clark Williams, Daniel Bristot de Oliveira

On Tue, Feb 28, 2017 at 03:50:30PM -0500, Steven Rostedt wrote:
> + * The overloaded RT CPU, wher receiving an IPI, will try to push off its

"wher" isn't in my dictionary, I'm thinking you mean: "when". Fixed that
for you.

> + * overloaded RT tasks and then send an IPI to the next CPU that has
> + * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
> + * have completed. Just because a CPU may have pushed off its own overloaded
> + * RT task does not mean it should stop sending the IPI around to other
> + * overloaded CPUs. There may be another RT task waiting to run on one of
> + * those CPUs that are of higher priority than the one that was just
> + * pushed.
> + *
> + * An optimization that could possibly be made is to make a CPU array similar
> + * to the cpupri array mask of all running RT tasks, but for the overloaded
> + * case, then the IPI could be sent to only the CPU with the highest priority
> + * RT task waiting, and that CPU could send off further IPIs to the CPU with
> + * the next highest waiting task. Since the overloaded case is much less likely
> + * to happen, the complexity of this implementation may not be worth it.
> + * Instead, just send an IPI around to all overloaded CPUs.

Yeah, not sure, I'll leave it in though.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] sched/rt: Add comments describing the RT IPI pull method
  2017-03-01  9:44 ` Peter Zijlstra
@ 2017-03-01 13:40   ` Steven Rostedt
  0 siblings, 0 replies; 4+ messages in thread
From: Steven Rostedt @ 2017-03-01 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Clark Williams, Daniel Bristot de Oliveira

On Wed, 1 Mar 2017 10:44:09 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Feb 28, 2017 at 03:50:30PM -0500, Steven Rostedt wrote:
> > + * The overloaded RT CPU, wher receiving an IPI, will try to push off its  
> 
> "wher" isn't in my dictionary, I'm thinking you mean: "when". Fixed that
> for you.

Heh, thanks. Hmm, I looked at all the highlighted words when I pulled
in the patch into claws-mail. I'm surprised I missed that.

> 
> > + * overloaded RT tasks and then send an IPI to the next CPU that has
> > + * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
> > + * have completed. Just because a CPU may have pushed off its own overloaded
> > + * RT task does not mean it should stop sending the IPI around to other
> > + * overloaded CPUs. There may be another RT task waiting to run on one of
> > + * those CPUs that are of higher priority than the one that was just
> > + * pushed.
> > + *
> > + * An optimization that could possibly be made is to make a CPU array similar
> > + * to the cpupri array mask of all running RT tasks, but for the overloaded
> > + * case, then the IPI could be sent to only the CPU with the highest priority
> > + * RT task waiting, and that CPU could send off further IPIs to the CPU with
> > + * the next highest waiting task. Since the overloaded case is much less likely
> > + * to happen, the complexity of this implementation may not be worth it.
> > + * Instead, just send an IPI around to all overloaded CPUs.  
> 
> Yeah, not sure, I'll leave it in though.

OK,

Thanks!

-- Steve

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [tip:sched/core] sched/rt: Add comments describing the RT IPI pull method
  2017-02-28 20:50 [PATCH v2] sched/rt: Add comments describing the RT IPI pull method Steven Rostedt
  2017-03-01  9:44 ` Peter Zijlstra
@ 2017-03-16 11:16 ` tip-bot for Steven Rostedt (VMware)
  1 sibling, 0 replies; 4+ messages in thread
From: tip-bot for Steven Rostedt (VMware) @ 2017-03-16 11:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, bristot, rostedt, linux-kernel, williams, akpm,
	peterz, hpa, efault, torvalds

Commit-ID:  3e777f9909483b603946685d88acfae89f31b07b
Gitweb:     http://git.kernel.org/tip/3e777f9909483b603946685d88acfae89f31b07b
Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
AuthorDate: Tue, 28 Feb 2017 15:50:30 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 16 Mar 2017 09:41:35 +0100

sched/rt: Add comments describing the RT IPI pull method

While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
[ Small typographical edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9f3e402..979b734 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1927,6 +1927,87 @@ static int find_next_push_cpu(struct rq *rq)
 #define RT_PUSH_IPI_EXECUTING		1
 #define RT_PUSH_IPI_RESTART		2
 
+/*
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * On large CPU boxes, there's the case that several CPUs could schedule
+ * a lower priority task at the same time, in which case it will look for
+ * any overloaded CPUs that it could pull a task from. To do this, the runqueue
+ * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
+ * for a single overloaded CPU's runqueue lock can produce a large latency.
+ * (This has actually been observed on large boxes running cyclictest).
+ * Instead of taking the runqueue lock of the overloaded CPU, each of the
+ * CPUs that scheduled a lower priority task simply sends an IPI to the
+ * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
+ * lots of contention. The overloaded CPU will look to push its non-running
+ * RT task off, and if it does, it can then ignore the other IPIs coming
+ * in, and just pass those IPIs off to any other overloaded CPU.
+ *
+ * When a CPU schedules a lower priority task, it only sends an IPI to
+ * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
+ * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
+ * RT overloaded tasks, would cause 100 IPIs to go out at once.
+ *
+ * The overloaded RT CPU, when receiving an IPI, will try to push off its
+ * overloaded RT tasks and then send an IPI to the next CPU that has
+ * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
+ * have completed. Just because a CPU may have pushed off its own overloaded
+ * RT task does not mean it should stop sending the IPI around to other
+ * overloaded CPUs. There may be another RT task waiting to run on one of
+ * those CPUs that are of higher priority than the one that was just
+ * pushed.
+ *
+ * An optimization that could possibly be made is to make a CPU array similar
+ * to the cpupri array mask of all running RT tasks, but for the overloaded
+ * case, then the IPI could be sent to only the CPU with the highest priority
+ * RT task waiting, and that CPU could send off further IPIs to the CPU with
+ * the next highest waiting task. Since the overloaded case is much less likely
+ * to happen, the complexity of this implementation may not be worth it.
+ * Instead, just send an IPI around to all overloaded CPUs.
+ *
+ * The rq->rt.push_flags holds the status of the IPI that is going around.
+ * A run queue can only send out a single IPI at a time. The possible flags
+ * for rq->rt.push_flags are:
+ *
+ *    (None or zero):		No IPI is going around for the current rq
+ *    RT_PUSH_IPI_EXECUTING:	An IPI for the rq is being passed around
+ *    RT_PUSH_IPI_RESTART:	The priority of the running task for the rq
+ *				has changed, and the IPI should restart
+ *				circulating the overloaded CPUs again.
+ *
+ * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
+ * before sending to the next CPU.
+ *
+ * Instead of having all CPUs that schedule a lower priority task send
+ * an IPI to the same "first" CPU in the RT overload mask, they send it
+ * to the next overloaded CPU after their own CPU. This helps distribute
+ * the work when there's more than one overloaded CPU and multiple CPUs
+ * scheduling in lower priority tasks.
+ *
+ * When a rq schedules a lower priority task than what was currently
+ * running, the next CPU with overloaded RT tasks is examined first.
+ * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
+ * priority task, it will send an IPI first to CPU 5, then CPU 5 will
+ * send to CPU 1 if it is still overloaded. CPU 1 will clear the
+ * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
+ *
+ * The first CPU to notice IPI_RESTART is set, will clear that flag and then
+ * send an IPI to the next overloaded CPU after the rq->cpu and not the next
+ * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
+ * schedules a lower priority task, and the IPI_RESTART gets set while the
+ * handling is being done on CPU 5, it will clear the flag and send it back to
+ * CPU 4 instead of CPU 1.
+ *
+ * Note, the above logic can be disabled by turning off the sched_feature
+ * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
+ * taken by the CPU requesting a pull and the waiting RT task will be pulled
+ * by that CPU. This may be fine for machines with few CPUs.
+ */
 static void tell_cpu_to_push(struct rq *rq)
 {
 	int cpu;

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-03-16 11:18 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 20:50 [PATCH v2] sched/rt: Add comments describing the RT IPI pull method Steven Rostedt
2017-03-01  9:44 ` Peter Zijlstra
2017-03-01 13:40   ` Steven Rostedt
2017-03-16 11:16 ` [tip:sched/core] " tip-bot for Steven Rostedt (VMware)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.