* [RFC][PATCH RT 0/4 v2] sched/rt: Lower rq lock contention latencies on many CPU boxes
@ 2012-12-12 19:27 Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 1/4 v2] sched/rt: Fix push_rt_task() to have the same checks as the caller did Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams, Ingo Molnar, Frank Rowand, Mike Galbraith

In this version I rearranged the patches a little so that the IPI patch
comes last, as the first three patches are less controversial and should
probably be added now.

The difference in this version is that I added more comments, but
more importantly I also added a sched feature called RT_PUSH_IPI.
When this feature is enabled, it switches the pull logic to send
an IPI to the RT overloaded CPU to do a push instead of doing the
pull locally.

When the feature is disabled, the push/pull logic stays the same as
it always has (with rq lock contention).

Now I've discussed this with Clark, and we noticed that the contention
didn't show up until we tried it on a 24 CPU machine. On a 16 CPU machine
it ran fine. Thus, by default, machines with 16 or fewer CPUs will
have the RT_PUSH_IPI feature disabled. Machines with 17 or more
possible CPUs will have the feature enabled at boot up. Note, we haven't
tried it on a machine with 17 to 23 CPUs.

It is safe to enable or disable this feature at run time, although
doing so may cause latencies; there shouldn't be any missed
wakeups or anything else that is serious. The worst that can happen
is that you miss a pull, and an RT task will stay on its CPU when it
could have migrated to another CPU that just lowered its priority.
Well, if you are worried about that, don't change it when you care :-)

The 16-CPU cutoff is just a heuristic, and people may debate it. And
perhaps you may not like how the push/pull default changes between
different machines. I could also add a command line switch to force
enable/disable at boot, and/or I could add a config option as well.

Right now, by default, machines with <= 16 CPUs behave as they always
have, and machines with 17 or more CPUs have the new logic enabled.
Either way, it can be changed at run time via the debugfs directory
(unfortunately it's not a sysctl).
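
For example, to flip it at run time (these are the same commands given
with patch 4 below):

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features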

Comments?

-- Steve



* [RFC][PATCH RT 1/4 v2] sched/rt: Fix push_rt_task() to have the same checks as the caller did
  2012-12-12 19:27 [RFC][PATCH RT 0/4 v2] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
@ 2012-12-12 19:27 ` Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 2/4 v2] sched/rt: Try to migrate task if preempting pinned rt task Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams, Ingo Molnar, Frank Rowand, Mike Galbraith

[-- Attachment #1: fix-push-rt-task-check.patch --]
[-- Type: text/plain, Size: 1976 bytes --]

Currently, push_rt_task() only pushes the task if it is lower
priority than the currently running task.

But that is not the only check we want. If the currently running task
is pinned, we may want to push as well. The wakeup path already makes
this check, but the push is then guaranteed to fail because
push_rt_task()'s internal check looks only at priority.

Make push_rt_task() use the same checks as the wakeup path. We could
remove the check from the wakeup path and just let push_rt_task() do
the work, but keeping it lets the wakeup path exit early in the likely
case that "ok_to_push_task()" will fail, so that we don't need to do the
iterative loop of checks on the pushable task list.
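
For reference, here is the combined check with comments spelling out each
clause (a commented restatement of the ok_to_push_task() helper added in
the patch below; migrate_disable is the per-task migration-disable count
from -rt):

	/*
	 * It is OK to push p off this runqueue if p itself may run
	 * elsewhere, and current is an RT task that p should not preempt
	 * here: either current cannot move (migrate disabled or pinned
	 * to a single CPU), or current's priority is higher than or equal
	 * to p's (a lower prio value means a higher priority).
	 */
	static int ok_to_push_task(struct task_struct *p, struct task_struct *curr)
	{
		return p->nr_cpus_allowed > 1 &&
			rt_task(curr) &&
			(curr->migrate_disable ||
			 curr->nr_cpus_allowed < 2 ||
			 curr->prio <= p->prio);
	}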

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1615,6 +1615,15 @@ static struct task_struct *pick_next_pus
 	return p;
 }
 
+static int ok_to_push_task(struct task_struct *p, struct task_struct *curr)
+{
+	return p->nr_cpus_allowed > 1 &&
+		rt_task(curr) &&
+		(curr->migrate_disable ||
+		 curr->nr_cpus_allowed < 2 ||
+		 curr->prio <= p->prio);
+}
+
 /*
  * If the current CPU has more than one RT task, see if the non
  * running task can migrate over to a CPU that is running a task
@@ -1649,7 +1658,7 @@ retry:
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (!ok_to_push_task(next_task, rq->curr)) {
 		resched_task(rq->curr);
 		return 0;
 	}
@@ -1814,10 +1823,7 @@ static void task_woken_rt(struct rq *rq,
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
-	    p->nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
-	    (rq->curr->nr_cpus_allowed < 2 ||
-	     rq->curr->prio <= p->prio))
+	    ok_to_push_task(p, rq->curr))
 		push_rt_tasks(rq);
 }
 



* [RFC][PATCH RT 2/4 v2] sched/rt: Try to migrate task if preempting pinned rt task
  2012-12-12 19:27 [RFC][PATCH RT 0/4 v2] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 1/4 v2] sched/rt: Fix push_rt_task() to have the same checks as the caller did Steven Rostedt
@ 2012-12-12 19:27 ` Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 3/4 v2] sched/rt: Initiate a pull when the priority of a task is lowered Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
  3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams, Ingo Molnar, Frank Rowand, Mike Galbraith

[-- Attachment #1: fix-post-sched-push.patch --]
[-- Type: text/plain, Size: 1546 bytes --]

If a higher priority task is about to preempt a task that has been
pinned to a CPU, first see if the higher priority task can instead
preempt a task on another CPU.

That is, when a high priority task wakes up on a CPU whose currently
running task can still migrate, we skip pushing the high priority task
to another CPU. But by the time the woken task actually schedules, the
task that it's about to preempt may have changed its affinity and
become pinned. At this time, it may be better to move the woken task to
another CPU, if one exists that is currently running a lower priority
task than the one about to be preempted.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1804,8 +1804,22 @@ skip:
 
 static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
 {
+	struct task_struct *p = prev;
+
+	/*
+	 * If we are preempting a pinned task see if we can push
+	 * the higher priority task first.
+	 */
+	if (prev->on_rq && (prev->nr_cpus_allowed <= 1 || prev->migrate_disable) &&
+	    has_pushable_tasks(rq) && rq->rt.highest_prio.next < prev->prio) {
+		p = _pick_next_task_rt(rq);
+
+		if (p != prev && p->nr_cpus_allowed > 1 && push_rt_task(rq))
+			p = _pick_next_task_rt(rq);
+	}
+
 	/* Try to pull RT tasks here if we lower this rq's prio */
-	if (rq->rt.highest_prio.curr > prev->prio)
+	if (rq->rt.highest_prio.curr > p->prio)
 		pull_rt_task(rq);
 }
 



* [RFC][PATCH RT 3/4 v2] sched/rt: Initiate a pull when the priority of a task is lowered
  2012-12-12 19:27 [RFC][PATCH RT 0/4 v2] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 1/4 v2] sched/rt: Fix push_rt_task() to have the same checks as the caller did Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 2/4 v2] sched/rt: Try to migrate task if preempting pinned rt task Steven Rostedt
@ 2012-12-12 19:27 ` Steven Rostedt
  2012-12-12 19:27 ` [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
  3 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams, Ingo Molnar, Frank Rowand, Mike Galbraith

[-- Attachment #1: rt-migrate-redo.patch --]
[-- Type: text/plain, Size: 898 bytes --]

If a task lowers its priority (say, by losing priority inheritance)
and a higher priority task is waiting on another CPU, initiate a pull.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -997,6 +997,8 @@ inc_rt_prio(struct rt_rq *rt_rq, int pri
 	inc_rt_prio_smp(rt_rq, prio, prev_prio);
 }
 
+static int pull_rt_task(struct rq *this_rq);
+
 static void
 dec_rt_prio(struct rt_rq *rt_rq, int prio)
 {
@@ -1021,6 +1023,10 @@ dec_rt_prio(struct rt_rq *rt_rq, int pri
 		rt_rq->highest_prio.curr = MAX_RT_PRIO;
 
 	dec_rt_prio_smp(rt_rq, prio, prev_prio);
+
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (prev_prio < rt_rq->highest_prio.curr)
+		pull_rt_task(rq_of_rt_rq(rt_rq));
 }
 
 #else



* [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2012-12-12 19:27 [RFC][PATCH RT 0/4 v2] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
                   ` (2 preceding siblings ...)
  2012-12-12 19:27 ` [RFC][PATCH RT 3/4 v2] sched/rt: Initiate a pull when the priority of a task is lowered Steven Rostedt
@ 2012-12-12 19:27 ` Steven Rostedt
  2012-12-12 20:44   ` Steven Rostedt
  2012-12-13 19:53   ` Steven Rostedt
  3 siblings, 2 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 19:27 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams, Ingo Molnar, Frank Rowand, Mike Galbraith

[-- Attachment #1: push-rt-task-ipi-v2.patch --]
[-- Type: text/plain, Size: 7379 bytes --]

When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU that would set its wakeup in intervals
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us, wakes up, and measures
its latencies.

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPUs' tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups or load balancing
on these CPUs would also block on these locks, and the wait time escalated.
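
Taking two rq locks has to follow a fixed order to avoid deadlock, which
is why the whole herd ends up serializing on the overloaded CPU's lock.
Below is a minimal userspace sketch of that take-locks-in-address-order
discipline, with pthread spinlocks standing in for rq locks (the names
are illustrative, not the kernel's):

	#include <pthread.h>

	struct rq {
		pthread_spinlock_t lock;
	};

	/*
	 * Always take the lower-addressed lock first. Any two CPUs
	 * locking the same pair agree on the order, so they cannot
	 * deadlock, but they all queue up on the contended lock.
	 */
	static void double_rq_lock(struct rq *rq1, struct rq *rq2)
	{
		if (rq1 == rq2) {
			pthread_spin_lock(&rq1->lock);
			return;
		}
		if (rq1 < rq2) {
			pthread_spin_lock(&rq1->lock);
			pthread_spin_lock(&rq2->lock);
		} else {
			pthread_spin_lock(&rq2->lock);
			pthread_spin_lock(&rq1->lock);
		}
	}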

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

Now, this issue only seems to apply to boxes with greater than 16 CPUs.
We noticed this on a 24 CPU box, and things got much worse on 40 (and
presumably would get even worse with more CPUs). But running with 16
CPUs and below, the lock contention caused by the pulling of RT tasks
is not noticeable.

I've created a new sched feature called RT_PUSH_IPI, which is disabled
by default on machines with 16 or fewer CPUs and enabled on machines with
17 or more. That seems to be the heuristic limit where the pulling logic
causes higher latencies than IPIs. Of course with all heuristics, things
could be different with different architectures.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enable or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/core.c
===================================================================
--- linux-rt.git.orig/kernel/sched/core.c
+++ linux-rt.git/kernel/sched/core.c
@@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+	if (sched_feat(RT_PULL_IPI))
+		sched_rt_push_check();
+
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
 		return;
 
@@ -7533,6 +7536,19 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+
+	/*
+	 * To avoid heavy contention on large CPU boxes,
+	 * when there is an RT overloaded CPU (two or more RT tasks
+	 * queued to run on a CPU and one of the waiting RT tasks
+	 * can migrate) and another CPU lowers its priority, instead
+	 * of grabbing both rq locks of the CPUs (as many CPUs lowering
+	 * their priority at the same time may create large latencies),
+	 * send an IPI to the CPU that is overloaded so that it can
+	 * do an efficient push.
+	 */
+	if (num_possible_cpus() > 16)
+		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
 }
 #else
 void __init sched_init_smp(void)
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1729,6 +1729,31 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+/**
+ * sched_rt_push_check - check if we can push waiting RT tasks
+ *
+ * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
+ *
+ * Checks if there is an RT task that can migrate and there exists
+ * a CPU in its affinity that only has tasks lower in priority than
+ * the waiting RT task. If so, then it will push the task off to that
+ * CPU.
+ */
+void sched_rt_push_check(void)
+{
+	struct rq *rq = cpu_rq(smp_processor_id());
+
+	if (WARN_ON_ONCE(!irqs_disabled()))
+		return;
+
+	if (!has_pushable_tasks(rq))
+		return;
+
+	raw_spin_lock(&rq->lock);
+	push_rt_tasks(rq);
+	raw_spin_unlock(&rq->lock);
+}
+
 static int pull_rt_task(struct rq *this_rq)
 {
 	int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1756,6 +1781,18 @@ static int pull_rt_task(struct rq *this_
 			continue;
 
 		/*
+		 * When the RT_PUSH_IPI sched feature is enabled, instead
+		 * of trying to grab the rq lock of the RT overloaded CPU
+		 * send an IPI to that CPU instead. This prevents heavy
+		 * contention from several CPUs lowering their priorities
+		 * and all trying to grab the rq lock of that overloaded CPU.
+		 */
+		if (sched_feat(RT_PUSH_IPI)) {
+			smp_send_reschedule(cpu);
+			continue;
+		}
+
+		/*
 		 * We can potentially drop this_rq's lock in
 		 * double_lock_balance, and another CPU could
 		 * alter this_rq
Index: linux-rt.git/kernel/sched/sched.h
===================================================================
--- linux-rt.git.orig/kernel/sched/sched.h
+++ linux-rt.git/kernel/sched/sched.h
@@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
 		__release(rq2->lock);
 }
 
+void sched_rt_push_check(void);
+
 #else /* CONFIG_SMP */
 
 /*
@@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
 	__release(rq2->lock);
 }
 
+void sched_rt_push_check(void)
+{
+}
 #endif
 
 extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
Index: linux-rt.git/kernel/sched/features.h
===================================================================
--- linux-rt.git.orig/kernel/sched/features.h
+++ linux-rt.git/kernel/sched/features.h
@@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
 # endif
 #endif
 
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPU's
+ * rq lock and possibly create large contention, sending an
+ * IPI to that CPU and letting that CPU push the RT task to where
+ * it should go may be a better scenario.
+ *
+ * This is default off for machines with <= 16 CPUs, and will
+ * be turned on at boot up for machines with > 16 CPUs.
+ */
+SCHED_FEAT(RT_PUSH_IPI, false)
+
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)



* Re: [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2012-12-12 19:27 ` [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
@ 2012-12-12 20:44   ` Steven Rostedt
  2012-12-13 19:53   ` Steven Rostedt
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-12 20:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-rt-users, Thomas Gleixner, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams, Ingo Molnar, Frank Rowand,
	Mike Galbraith

I'm doing these changes in quilt and not git. Thus, I'm hitting some of
the silly old quilt bugs, like forgetting to do a quilt refresh :-p

This one will actually compile and boot.

-- Steve

sched/rt: Use IPI to trigger RT task push migration instead of pulling

When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU that would set its wakeup in intervals
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us, wakes up, and measures
its latencies.

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPUs' tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups or load balancing
on these CPUs would also block on these locks, and the wait time escalated.

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

Now, this issue only seems to apply to boxes with greater than 16 CPUs.
We noticed this on a 24 CPU box, and things got much worse on 40 (and
presumably would get even worse with more CPUs). But running with 16
CPUs and below, the lock contention caused by the pulling of RT tasks
is not noticeable.

I've created a new sched feature called RT_PUSH_IPI, which is disabled
by default on machines with 16 or fewer CPUs and enabled on machines with
17 or more. That seems to be the heuristic limit where the pulling logic
causes higher latencies than IPIs. Of course with all heuristics, things
could be different with different architectures.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enable or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/core.c
===================================================================
--- linux-rt.git.orig/kernel/sched/core.c
+++ linux-rt.git/kernel/sched/core.c
@@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+	if (sched_feat(RT_PUSH_IPI))
+		sched_rt_push_check();
+
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
 		return;
 
@@ -7533,6 +7536,19 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+
+	/*
+	 * To avoid heavy contention on large CPU boxes,
+	 * when there is an RT overloaded CPU (two or more RT tasks
+	 * queued to run on a CPU and one of the waiting RT tasks
+	 * can migrate) and another CPU lowers its priority, instead
+	 * of grabbing both rq locks of the CPUs (as many CPUs lowering
+	 * their priority at the same time may create large latencies),
+	 * send an IPI to the CPU that is overloaded so that it can
+	 * do an efficient push.
+	 */
+	if (num_possible_cpus() > 16)
+		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
 }
 #else
 void __init sched_init_smp(void)
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1729,6 +1729,31 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+/**
+ * sched_rt_push_check - check if we can push waiting RT tasks
+ *
+ * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
+ *
+ * Checks if there is an RT task that can migrate and there exists
+ * a CPU in its affinity that only has tasks lower in priority than
+ * the waiting RT task. If so, then it will push the task off to that
+ * CPU.
+ */
+void sched_rt_push_check(void)
+{
+	struct rq *rq = cpu_rq(smp_processor_id());
+
+	if (WARN_ON_ONCE(!irqs_disabled()))
+		return;
+
+	if (!has_pushable_tasks(rq))
+		return;
+
+	raw_spin_lock(&rq->lock);
+	push_rt_tasks(rq);
+	raw_spin_unlock(&rq->lock);
+}
+
 static int pull_rt_task(struct rq *this_rq)
 {
 	int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1756,6 +1781,18 @@ static int pull_rt_task(struct rq *this_
 			continue;
 
 		/*
+		 * When the RT_PUSH_IPI sched feature is enabled, instead
+		 * of trying to grab the rq lock of the RT overloaded CPU
+		 * send an IPI to that CPU instead. This prevents heavy
+		 * contention from several CPUs lowering their priorities
+		 * and all trying to grab the rq lock of that overloaded CPU.
+		 */
+		if (sched_feat(RT_PUSH_IPI)) {
+			smp_send_reschedule(cpu);
+			continue;
+		}
+
+		/*
 		 * We can potentially drop this_rq's lock in
 		 * double_lock_balance, and another CPU could
 		 * alter this_rq
Index: linux-rt.git/kernel/sched/sched.h
===================================================================
--- linux-rt.git.orig/kernel/sched/sched.h
+++ linux-rt.git/kernel/sched/sched.h
@@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
 		__release(rq2->lock);
 }
 
+void sched_rt_push_check(void);
+
 #else /* CONFIG_SMP */
 
 /*
@@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
 	__release(rq2->lock);
 }
 
+void sched_rt_push_check(void)
+{
+}
 #endif
 
 extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
Index: linux-rt.git/kernel/sched/features.h
===================================================================
--- linux-rt.git.orig/kernel/sched/features.h
+++ linux-rt.git/kernel/sched/features.h
@@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
 # endif
 #endif
 
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPU's
+ * rq lock and possibly create large contention, sending an
+ * IPI to that CPU and letting that CPU push the RT task to where
+ * it should go may be a better scenario.
+ *
+ * This is default off for machines with <= 16 CPUs, and will
+ * be turned on at boot up for machines with > 16 CPUs.
+ */
+SCHED_FEAT(RT_PUSH_IPI, false)
+
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)




* Re: [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2012-12-12 19:27 ` [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
  2012-12-12 20:44   ` Steven Rostedt
@ 2012-12-13 19:53   ` Steven Rostedt
  2012-12-21 15:42     ` Mike Galbraith
  2013-02-13 16:49     ` John Kacur
  1 sibling, 2 replies; 9+ messages in thread
From: Steven Rostedt @ 2012-12-13 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-rt-users, Thomas Gleixner, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams, Ingo Molnar, Frank Rowand,
	Mike Galbraith

I didn't get a chance to test the latest IPI patch series on the 40 core
box, and only had my 4 way box to test on. But I was able to test it
last night and found some issues.

The RT_PUSH_IPI feature doesn't get automatically set because just doing
the sched_feat_enable() wasn't enough; the sysctl_sched_features bitmask
needs to be updated as well. Below is the corrected patch.

Also, for some reason patch 3 caused the box to hang. Perhaps it
requires RT_PUSH_IPI to be set, because it worked with the original patch
series, but that series only did the push IPI. I removed patch 3 on the
40 core box before noticing that RT_PUSH_IPI wasn't being automatically
enabled.

Here's an update of patch 4:

sched/rt: Use IPI to trigger RT task push migration instead of pulling

When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU that would set its wakeup in intervals
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us, wakes up, and measures
its latencies.

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPUs' tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups or load balancing
on these CPUs would also block on these locks, and the wait time escalated.

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

Now, this issue only seems to apply to boxes with greater than 16 CPUs.
We noticed this on a 24 CPU box, and things got much worse on 40 (and
presumably would get even worse with more CPUs). But running with 16
CPUs and below, the lock contention caused by the pulling of RT tasks
is not noticeable.

I've created a new sched feature called RT_PUSH_IPI, which is disabled
by default on machines with 16 or fewer CPUs and enabled on machines with
17 or more. That seems to be the heuristic limit where the pulling logic
causes higher latencies than IPIs. Of course with all heuristics, things
could be different with different architectures.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enable or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: rt-linux.git/kernel/sched/core.c
===================================================================
--- rt-linux.git.orig/kernel/sched/core.c
+++ rt-linux.git/kernel/sched/core.c
@@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+	if (sched_feat(RT_PUSH_IPI))
+		sched_rt_push_check();
+
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
 		return;
 
@@ -7541,6 +7544,21 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+
+	/*
+	 * To avoid heavy contention on large CPU boxes,
+	 * when there is an RT overloaded CPU (two or more RT tasks
+	 * queued to run on a CPU and one of the waiting RT tasks
+	 * can migrate) and another CPU lowers its priority, instead
+	 * of grabbing both rq locks of the CPUs (as many CPUs lowering
+	 * their priority at the same time may create large latencies),
+	 * send an IPI to the CPU that is overloaded so that it can
+	 * do an efficient push.
+	 */
+	if (num_possible_cpus() > 16) {
+		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
+		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
+	}
 }
 #else
 void __init sched_init_smp(void)
Index: rt-linux.git/kernel/sched/rt.c
===================================================================
--- rt-linux.git.orig/kernel/sched/rt.c
+++ rt-linux.git/kernel/sched/rt.c
@@ -1723,6 +1723,31 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+/**
+ * sched_rt_push_check - check if we can push waiting RT tasks
+ *
+ * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
+ *
+ * Checks if there is an RT task that can migrate and there exists
+ * a CPU in its affinity that only has tasks lower in priority than
+ * the waiting RT task. If so, then it will push the task off to that
+ * CPU.
+ */
+void sched_rt_push_check(void)
+{
+	struct rq *rq = cpu_rq(smp_processor_id());
+
+	if (WARN_ON_ONCE(!irqs_disabled()))
+		return;
+
+	if (!has_pushable_tasks(rq))
+		return;
+
+	raw_spin_lock(&rq->lock);
+	push_rt_tasks(rq);
+	raw_spin_unlock(&rq->lock);
+}
+
 static int pull_rt_task(struct rq *this_rq)
 {
 	int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1750,6 +1775,18 @@ static int pull_rt_task(struct rq *this_
 			continue;
 
 		/*
+		 * When the RT_PUSH_IPI sched feature is enabled, instead
+		 * of trying to grab the rq lock of the RT overloaded CPU
+		 * send an IPI to that CPU instead. This prevents heavy
+		 * contention from several CPUs lowering their priorities
+		 * and all trying to grab the rq lock of that overloaded CPU.
+		 */
+		if (sched_feat(RT_PUSH_IPI)) {
+			smp_send_reschedule(cpu);
+			continue;
+		}
+
+		/*
 		 * We can potentially drop this_rq's lock in
 		 * double_lock_balance, and another CPU could
 		 * alter this_rq
Index: rt-linux.git/kernel/sched/sched.h
===================================================================
--- rt-linux.git.orig/kernel/sched/sched.h
+++ rt-linux.git/kernel/sched/sched.h
@@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
 		__release(rq2->lock);
 }
 
+void sched_rt_push_check(void);
+
 #else /* CONFIG_SMP */
 
 /*
@@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
 	__release(rq2->lock);
 }
 
+void sched_rt_push_check(void)
+{
+}
 #endif
 
 extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
Index: rt-linux.git/kernel/sched/features.h
===================================================================
--- rt-linux.git.orig/kernel/sched/features.h
+++ rt-linux.git/kernel/sched/features.h
@@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
 # endif
 #endif
 
+/*
+ * In order to avoid a thundering herd attack of CPUs that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPU's
+ * rq lock and possibly create large contention, sending an
+ * IPI to that CPU and letting that CPU push the RT task to where
+ * it should go may be a better scenario.
+ *
+ * This is default off for machines with <= 16 CPUs, and will
+ * be turned on at boot up for machines with > 16 CPUs.
+ */
+SCHED_FEAT(RT_PUSH_IPI, false)
+
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)




* Re: [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2012-12-13 19:53   ` Steven Rostedt
@ 2012-12-21 15:42     ` Mike Galbraith
  2013-02-13 16:49     ` John Kacur
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2012-12-21 15:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
	John Kacur, Peter Zijlstra, Clark Williams, Ingo Molnar,
	Frank Rowand

On Thu, 2012-12-13 at 14:53 -0500, Steven Rostedt wrote: 
> I didn't get a chance to test the latest IPI patch series on the 40 core
> box, and only had my 4 way box to test on. But I was able to test it
> last night and found some issues.
> 
> The RT_PUSH_IPI feature doesn't get automatically set because just doing
> the sched_feat_enable() wasn't enough; the sysctl_sched_features bitmask
> needs to be updated as well. Below is the corrected patch.
> 
> Also, for some reason patch 3 caused the box to hang. Perhaps it

Yeah, I got to experience that, grabbed the wrong patch4, so didn't see
this warning, and got to fix the feature too :)

> requires RT_PUSH_IPI to be set, because it worked with the original patch
> series, but that series only did the push IPI. I removed patch 3 on the
> 40 core box before noticing that RT_PUSH_IPI wasn't being automatically
> enabled.
> 
> Here's an update of patch 4:
> 
> sched/rt: Use IPI to trigger RT task push migration instead of pulling
> 
> When debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found there was a huge contention on the
> runqueue locks.

Makes a WEE bit of difference on my dainbramaged 64 core DL980, using a
SCHED_RR massive_intr 3 for 5 minutes as the bouncable load.  This box
has only sched domains MC and CPU.. thinks it's a rather large laptop.

NO_RT_PUSH_IPI

T: 0 (14147) P:99 I:100 C:3059998 Min:      0 Act:    4 Avg:    3 Max:     368
T: 1 (14148) P:99 I:100 C:3059922 Min:      0 Act:   37 Avg:    3 Max:     372
T: 2 (14149) P:99 I:100 C:3059847 Min:      0 Act:    7 Avg:    3 Max:     361
T: 3 (14150) P:99 I:100 C:3059772 Min:      0 Act:   40 Avg:    2 Max:     346
T: 4 (14151) P:99 I:100 C:3059698 Min:      0 Act:    5 Avg:    2 Max:     307
T: 5 (14152) P:99 I:100 C:3059622 Min:      0 Act:    5 Avg:    2 Max:     379
T: 6 (14153) P:99 I:100 C:3059548 Min:      0 Act:    4 Avg:    3 Max:     341
T: 7 (14154) P:99 I:100 C:3059473 Min:      0 Act:    5 Avg:    4 Max:     379
T: 8 (14155) P:99 I:100 C:3059398 Min:      0 Act:   30 Avg:    5 Max:     552
T: 9 (14156) P:99 I:100 C:3059323 Min:      0 Act:    4 Avg:    3 Max:     760
T:10 (14157) P:99 I:100 C:3059248 Min:      0 Act:    3 Avg:    2 Max:     360
T:11 (14158) P:99 I:100 C:3059172 Min:      0 Act:   21 Avg:    4 Max:     583
T:12 (14159) P:99 I:100 C:3059097 Min:      0 Act:    4 Avg:    2 Max:     467
T:13 (14160) P:99 I:100 C:3059021 Min:      0 Act:    3 Avg:    2 Max:     350
T:14 (14161) P:99 I:100 C:3058946 Min:      0 Act:    5 Avg:    4 Max:     353
T:15 (14162) P:99 I:100 C:3058871 Min:      0 Act:    6 Avg:    2 Max:     481
T:16 (14163) P:99 I:100 C:3058794 Min:      0 Act:    3 Avg:    4 Max:     749
T:17 (14164) P:99 I:100 C:3058716 Min:      0 Act:    3 Avg:    5 Max:     632
T:18 (14165) P:99 I:100 C:3058642 Min:      0 Act:    4 Avg:    3 Max:     492
T:19 (14166) P:99 I:100 C:3058564 Min:      0 Act:    3 Avg:    3 Max:     500
T:20 (14167) P:99 I:100 C:3058486 Min:      0 Act:    4 Avg:    6 Max:     567
T:21 (14168) P:99 I:100 C:3058408 Min:      0 Act:    5 Avg:    4 Max:     444
T:22 (14169) P:99 I:100 C:3058330 Min:      0 Act:    4 Avg:    3 Max:     417
T:23 (14170) P:99 I:100 C:3058253 Min:      0 Act:    3 Avg:    5 Max:     591
T:24 (14171) P:99 I:100 C:3058175 Min:      0 Act:    4 Avg:    4 Max:     737
T:25 (14172) P:99 I:100 C:3058098 Min:      0 Act:    4 Avg:    4 Max:     628
T:26 (14173) P:99 I:100 C:3058019 Min:      0 Act:    4 Avg:    5 Max:     599
T:27 (14174) P:99 I:100 C:3057939 Min:      0 Act:    3 Avg:    4 Max:     370
T:28 (14175) P:99 I:100 C:3057858 Min:      0 Act:    3 Avg:    4 Max:     384
T:29 (14176) P:99 I:100 C:3057777 Min:      0 Act:    3 Avg:    4 Max:     440
T:30 (14177) P:99 I:100 C:3057696 Min:      0 Act:    4 Avg:    3 Max:     492
T:31 (14178) P:99 I:100 C:3057616 Min:      0 Act:    2 Avg:    3 Max:     383
T:32 (14179) P:99 I:100 C:3057534 Min:      0 Act:    4 Avg:    5 Max:     484
T:33 (14180) P:99 I:100 C:3057454 Min:      0 Act:    4 Avg:    5 Max:     622
T:34 (14181) P:99 I:100 C:3057373 Min:      0 Act:    3 Avg:    3 Max:     388
T:35 (14182) P:99 I:100 C:3057291 Min:      0 Act:    4 Avg:    3 Max:     447
T:36 (14183) P:99 I:100 C:3057209 Min:      0 Act:    4 Avg:    4 Max:     519
T:37 (14184) P:99 I:100 C:3057126 Min:      0 Act:    3 Avg:    2 Max:     484
T:38 (14185) P:99 I:100 C:3057043 Min:      0 Act:    4 Avg:    5 Max:     408
T:39 (14186) P:99 I:100 C:3056960 Min:      0 Act:    3 Avg:    4 Max:     405
T:40 (14187) P:99 I:100 C:3056876 Min:      0 Act:    3 Avg:    6 Max:     681
T:41 (14188) P:99 I:100 C:3056793 Min:      0 Act:    3 Avg:    3 Max:    1082
T:42 (14189) P:99 I:100 C:3056709 Min:      0 Act:    4 Avg:    4 Max:     445
T:43 (14190) P:99 I:100 C:3056625 Min:      0 Act:    4 Avg:    6 Max:     427
T:44 (14191) P:99 I:100 C:3056541 Min:      0 Act:    4 Avg:    4 Max:     501
T:45 (14192) P:99 I:100 C:3056457 Min:      0 Act:    4 Avg:    4 Max:     412
T:46 (14193) P:99 I:100 C:3056373 Min:      0 Act:    4 Avg:    5 Max:     438
T:47 (14194) P:99 I:100 C:3056289 Min:      0 Act:    4 Avg:    4 Max:     437
T:48 (14195) P:99 I:100 C:3056204 Min:      0 Act:    5 Avg:    8 Max:     626
T:49 (14196) P:99 I:100 C:3056120 Min:      0 Act:    2 Avg:    2 Max:     643
T:50 (14197) P:99 I:100 C:3056034 Min:      0 Act:    5 Avg:    4 Max:     502
T:51 (14198) P:99 I:100 C:3055949 Min:      0 Act:    4 Avg:    3 Max:     427
T:52 (14199) P:99 I:100 C:3055863 Min:      0 Act:    3 Avg:    3 Max:     515
T:53 (14200) P:99 I:100 C:3055778 Min:      0 Act:    4 Avg:    4 Max:     397
T:54 (14201) P:99 I:100 C:3055693 Min:      0 Act:    3 Avg:    5 Max:     866
T:55 (14202) P:99 I:100 C:3055607 Min:      0 Act:    4 Avg:    4 Max:     536
T:56 (14203) P:99 I:100 C:3055521 Min:      0 Act:    3 Avg:    6 Max:     611
T:57 (14204) P:99 I:100 C:3055435 Min:      0 Act:    4 Avg:    4 Max:     487
T:58 (14205) P:99 I:100 C:3055348 Min:      0 Act:    2 Avg:    4 Max:     647
T:59 (14206) P:99 I:100 C:3055261 Min:      0 Act:    3 Avg:    3 Max:     520
T:60 (14207) P:99 I:100 C:3055175 Min:      0 Act:    4 Avg:    4 Max:     686
T:61 (14208) P:99 I:100 C:3055088 Min:      0 Act:    4 Avg:    5 Max:     531
T:62 (14209) P:99 I:100 C:3055001 Min:      0 Act:    4 Avg:    5 Max:     435
T:63 (14210) P:99 I:100 C:3054914 Min:      0 Act:    4 Avg:    4 Max:     525

RT_PUSH_IPI

T: 0 (14065) P:99 I:100 C:3089627 Min:      1 Act:    3 Avg:    2 Max:      10
T: 1 (14066) P:99 I:100 C:3089574 Min:      2 Act:    4 Avg:    2 Max:      10
T: 2 (14067) P:99 I:100 C:3089521 Min:      1 Act:    3 Avg:    3 Max:      10
T: 3 (14068) P:99 I:100 C:3089468 Min:      1 Act:    4 Avg:    3 Max:       8
T: 4 (14069) P:99 I:100 C:3089415 Min:      1 Act:    2 Avg:    2 Max:      12
T: 5 (14070) P:99 I:100 C:3089361 Min:      1 Act:    2 Avg:    2 Max:       7
T: 6 (14071) P:99 I:100 C:3089308 Min:      1 Act:    3 Avg:    2 Max:      12
T: 7 (14072) P:99 I:100 C:3089255 Min:      2 Act:    3 Avg:    3 Max:      29
T: 8 (14073) P:99 I:100 C:3089201 Min:      2 Act:    4 Avg:    3 Max:      11
T: 9 (14074) P:99 I:100 C:3089140 Min:      1 Act:    3 Avg:    4 Max:      43
T:10 (14075) P:99 I:100 C:3089093 Min:      2 Act:    3 Avg:    3 Max:      14
T:11 (14076) P:99 I:100 C:3089038 Min:      2 Act:    4 Avg:    3 Max:      11
T:12 (14077) P:99 I:100 C:3088982 Min:      2 Act:    3 Avg:    3 Max:       8
T:13 (14078) P:99 I:100 C:3088927 Min:      2 Act:    4 Avg:    3 Max:      13
T:14 (14079) P:99 I:100 C:3088871 Min:      1 Act:    3 Avg:    4 Max:      13
T:15 (14080) P:99 I:100 C:3088817 Min:      1 Act:    3 Avg:    4 Max:      13
T:16 (14081) P:99 I:100 C:3088762 Min:      1 Act:    3 Avg:    2 Max:       9
T:17 (14082) P:99 I:100 C:3088707 Min:      1 Act:    4 Avg:    2 Max:       8
T:18 (14083) P:99 I:100 C:3088652 Min:      2 Act:    3 Avg:    2 Max:      11
T:19 (14084) P:99 I:100 C:3088597 Min:      2 Act:    4 Avg:    3 Max:      14
T:20 (14085) P:99 I:100 C:3088542 Min:      1 Act:    4 Avg:    3 Max:       8
T:21 (14086) P:99 I:100 C:3088487 Min:      2 Act:    3 Avg:    3 Max:      19
T:22 (14087) P:99 I:100 C:3088432 Min:      2 Act:    3 Avg:    3 Max:      18
T:23 (14088) P:99 I:100 C:3088377 Min:      2 Act:    4 Avg:    3 Max:      12
T:24 (14089) P:99 I:100 C:3088321 Min:      2 Act:    3 Avg:    3 Max:      14
T:25 (14090) P:99 I:100 C:3088265 Min:      2 Act:    4 Avg:    4 Max:      14
T:26 (14091) P:99 I:100 C:3088208 Min:      2 Act:    4 Avg:    3 Max:      14
T:27 (14092) P:99 I:100 C:3088151 Min:      2 Act:    3 Avg:    3 Max:       9
T:28 (14093) P:99 I:100 C:3088094 Min:      2 Act:    3 Avg:    3 Max:      23
T:29 (14094) P:99 I:100 C:3088038 Min:      2 Act:    4 Avg:    3 Max:      10
T:30 (14095) P:99 I:100 C:3087980 Min:      2 Act:    3 Avg:    3 Max:      19
T:31 (14096) P:99 I:100 C:3087924 Min:      1 Act:    4 Avg:    3 Max:      10
T:32 (14097) P:99 I:100 C:3087866 Min:      1 Act:    3 Avg:    3 Max:      11
T:33 (14098) P:99 I:100 C:3087807 Min:      1 Act:    3 Avg:    3 Max:      14
T:34 (14099) P:99 I:100 C:3087749 Min:      1 Act:    2 Avg:    2 Max:      13
T:35 (14100) P:99 I:100 C:3087690 Min:      2 Act:    3 Avg:    3 Max:      12
T:36 (14101) P:99 I:100 C:3087631 Min:      2 Act:    3 Avg:    3 Max:      13
T:37 (14102) P:99 I:100 C:3087572 Min:      1 Act:    5 Avg:    4 Max:      22
T:38 (14103) P:99 I:100 C:3087512 Min:      2 Act:    3 Avg:    3 Max:      12
T:39 (14104) P:99 I:100 C:3087453 Min:      2 Act:    3 Avg:    3 Max:      11
T:40 (14105) P:99 I:100 C:3087392 Min:      2 Act:    4 Avg:   13 Max:      50
T:41 (14106) P:99 I:100 C:3087333 Min:      2 Act:    4 Avg:   10 Max:      42
T:42 (14107) P:99 I:100 C:3087272 Min:      1 Act:    2 Avg:    4 Max:      20
T:43 (14108) P:99 I:100 C:3087211 Min:      1 Act:    2 Avg:    5 Max:      23
T:44 (14109) P:99 I:100 C:3087149 Min:      1 Act:    3 Avg:    6 Max:      38
T:45 (14110) P:99 I:100 C:3087088 Min:      1 Act:    6 Avg:    4 Max:      37
T:46 (14111) P:99 I:100 C:3087027 Min:      2 Act:    4 Avg:    4 Max:      44
T:47 (14112) P:99 I:100 C:3086965 Min:      1 Act:    2 Avg:    2 Max:      12
T:48 (14113) P:99 I:100 C:3086903 Min:      2 Act:    3 Avg:    3 Max:      12
T:49 (14114) P:99 I:100 C:3086841 Min:      1 Act:    3 Avg:    4 Max:      14
T:50 (14115) P:99 I:100 C:3086778 Min:      1 Act:    3 Avg:    3 Max:      11
T:51 (14116) P:99 I:100 C:3086715 Min:      2 Act:    3 Avg:    3 Max:      12
T:52 (14117) P:99 I:100 C:3086652 Min:      2 Act:    3 Avg:    3 Max:      17
T:53 (14118) P:99 I:100 C:3086589 Min:      1 Act:    2 Avg:    3 Max:      14
T:54 (14119) P:99 I:100 C:3086525 Min:      2 Act:    3 Avg:    3 Max:      10
T:55 (14120) P:99 I:100 C:3086462 Min:      1 Act:    3 Avg:    3 Max:       9
T:56 (14121) P:99 I:100 C:3086398 Min:      2 Act:    4 Avg:    3 Max:      14
T:57 (14122) P:99 I:100 C:3086335 Min:      2 Act:    4 Avg:    3 Max:      25
T:58 (14123) P:99 I:100 C:3086270 Min:      2 Act:    4 Avg:    3 Max:      12
T:59 (14124) P:99 I:100 C:3086207 Min:      2 Act:    3 Avg:    3 Max:      12
T:60 (14125) P:99 I:100 C:3086143 Min:      2 Act:    4 Avg:    3 Max:      13
T:61 (14126) P:99 I:100 C:3086079 Min:      2 Act:    3 Avg:    3 Max:      11
T:62 (14127) P:99 I:100 C:3086014 Min:      2 Act:    4 Avg:    3 Max:      12
T:63 (14128) P:99 I:100 C:3085949 Min:      2 Act:    3 Avg:    3 Max:      11




* Re: [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2012-12-13 19:53   ` Steven Rostedt
  2012-12-21 15:42     ` Mike Galbraith
@ 2013-02-13 16:49     ` John Kacur
  1 sibling, 0 replies; 9+ messages in thread
From: John Kacur @ 2013-02-13 16:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
	John Kacur, Peter Zijlstra, Clark Williams, Ingo Molnar,
	Frank Rowand, Mike Galbraith



On Thu, 13 Dec 2012, Steven Rostedt wrote:

> I didn't get a chance to test the latest IPI patch series on the 40 core
> box, and only had my 4 way box to test on. But I was able to test it
> last night and found some issues.
> 
> The RT_PUSH_IPI feature doesn't get automatically set because just doing
> the sched_feat_enable() wasn't enough; the sysctl_sched_features bitmask
> needs to be updated as well. Below is the corrected patch.
> 
> Also, for some reason patch 3 caused the box to hang. Perhaps it
> requires RT_PUSH_IPI to be set, because it worked with the original patch
> series, but that series only did the push IPI. I removed patch 3 on the
> 40 core box before noticing that RT_PUSH_IPI wasn't being automatically
> enabled.
> 
> Here's an update of patch 4:
> 
> sched/rt: Use IPI to trigger RT task push migration instead of pulling
> 
> When debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found there was a huge contention on the
> runqueue locks.
> 
> Investigating it further, running ftrace, I found that it was due to
> the pulling of RT tasks.
> 
> The test that was run was the following:
> 
>  cyclictest --numa -p95 -m -d0 -i100
> 
> This created a thread on each CPU that would set its wakeup in intervals
> of 100 microseconds. The -d0 means that all the threads had the same
> interval (100us). Each thread sleeps for 100us, wakes up, and measures
> its latencies.
> 
> What happened was another RT task would be scheduled on one of the CPUs
> that was running our test, when the other CPUs' tests went to sleep and
> scheduled idle. This caused the "pull" operation to execute on all
> these CPUs. Each one of these saw the RT task that was overloaded on
> the CPU of the test that was still running, and each one tried
> to grab that task in a thundering herd way.
> 
> To grab the task, each thread would do a double rq lock grab, grabbing
> its own lock as well as the rq of the overloaded CPU. As the sched
> domains on this box were rather flat for its size, I saw up to 12 CPUs
> block on this lock at once. This caused a ripple effect with the
> rq locks. As these locks were blocked, any wakeups or load balancing
> on these CPUs would also block on these locks, and the wait time escalated.
> 
> I've tried various methods to lessen the load, but things like an
> atomic counter to only let one CPU grab the task won't work, because
> the task may have a limited affinity, and we may pick the wrong
> CPU to take that lock and do the pull, to only find out that the
> CPU we picked isn't in the task's affinity.
> 
> Instead of doing the PULL, I now have the CPUs that want the pull to
> send over an IPI to the overloaded CPU, and let that CPU pick what
> CPU to push the task to. No more need to grab the rq lock, and the
> push/pull algorithm still works fine.
> 
> With this patch, the latency dropped to just 150us over a 20 hour run.
> Without the patch, the huge latencies would trigger in seconds.
> 
> Now, this issue only seems to apply to boxes with greater than 16 CPUs.
> We noticed this on a 24 CPU box, and things got much worse on 40 (and
> presumably would get even worse with more CPUs). But running with 16
> CPUs and below, the lock contention caused by the pulling of RT tasks
> is not noticeable.
> 
> I've created a new sched feature called RT_PUSH_IPI, which is disabled
> by default on machines with 16 or fewer CPUs and enabled on machines with
> 17 or more. That seems to be the heuristic limit where the pulling logic
> causes higher latencies than IPIs. Of course with all heuristics, things
> could be different with different architectures.
> 
> When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
> and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
> is enabled, the IPI is sent to the overloaded CPU to do a push.
> 
> To enable or disable this at run time:
> 
>  # mount -t debugfs nodev /sys/kernel/debug
>  # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
> or
>  # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: rt-linux.git/kernel/sched/core.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/core.c
> +++ rt-linux.git/kernel/sched/core.c
> @@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
>  
>  void scheduler_ipi(void)
>  {
> +	if (sched_feat(RT_PUSH_IPI))
> +		sched_rt_push_check();
> +
>  	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
>  		return;
>  
> @@ -7541,6 +7544,21 @@ void __init sched_init_smp(void)
>  	free_cpumask_var(non_isolated_cpus);
>  
>  	init_sched_rt_class();
> +
> +	/*
> +	 * To avoid heavy contention on large CPU boxes,
> +	 * when there is an RT overloaded CPU (two or more RT tasks
> +	 * queued to run on a CPU and one of the waiting RT tasks
> +	 * can migrate) and another CPU lowers its priority, instead
> +	 * of grabbing both rq locks of the CPUs (as many CPUs lowering
> +	 * their priority at the same time may create large latencies),
> +	 * send an IPI to the CPU that is overloaded so that it can
> +	 * do an efficient push.
> +	 */
> +	if (num_possible_cpus() > 16) {
> +		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
> +		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
> +	}
>  }
>  #else
>  void __init sched_init_smp(void)
> Index: rt-linux.git/kernel/sched/rt.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/rt.c
> +++ rt-linux.git/kernel/sched/rt.c
> @@ -1723,6 +1723,31 @@ static void push_rt_tasks(struct rq *rq)
>  		;
>  }
>  
> +/**
> + * sched_rt_push_check - check if we can push waiting RT tasks
> + *
> + * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
> + *
> + * Checks if there is an RT task that can migrate and there exists
> + * a CPU in its affinity that only has tasks lower in priority than
> + * the waiting RT task. If so, then it will push the task off to that
> + * CPU.
> + */
> +void sched_rt_push_check(void)
> +{
> +	struct rq *rq = cpu_rq(smp_processor_id());
> +
> +	if (WARN_ON_ONCE(!irqs_disabled()))
> +		return;
> +
> +	if (!has_pushable_tasks(rq))
> +		return;
> +
> +	raw_spin_lock(&rq->lock);
> +	push_rt_tasks(rq);
> +	raw_spin_unlock(&rq->lock);
> +}
> +
>  static int pull_rt_task(struct rq *this_rq)
>  {
>  	int this_cpu = this_rq->cpu, ret = 0, cpu;
> @@ -1750,6 +1775,18 @@ static int pull_rt_task(struct rq *this_
>  			continue;
>  
>  		/*
> +		 * When the RT_PUSH_IPI sched feature is enabled, instead
> +		 * of trying to grab the rq lock of the RT overloaded CPU
> +		 * send an IPI to that CPU instead. This prevents heavy
> +		 * contention from several CPUs lowering their priorities
> +		 * and all trying to grab the rq lock of that overloaded CPU.
> +		 */
> +		if (sched_feat(RT_PUSH_IPI)) {
> +			smp_send_reschedule(cpu);
> +			continue;
> +		}
> +
> +		/*
>  		 * We can potentially drop this_rq's lock in
>  		 * double_lock_balance, and another CPU could
>  		 * alter this_rq
> Index: rt-linux.git/kernel/sched/sched.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/sched.h
> +++ rt-linux.git/kernel/sched/sched.h
> @@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
>  		__release(rq2->lock);
>  }
>  
> +void sched_rt_push_check(void);
> +
>  #else /* CONFIG_SMP */
>  
>  /*
> @@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
>  	__release(rq2->lock);
>  }
>  
> +void sched_rt_push_check(void)
> +{
> +}
>  #endif
>  
>  extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
> Index: rt-linux.git/kernel/sched/features.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/features.h
> +++ rt-linux.git/kernel/sched/features.h
> @@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
>  # endif
>  #endif
>  
> +/*
> + * In order to avoid a thundering herd attack of CPUs that are
> + * lowering their priorities at the same time, and there being
> + * a single CPU that has an RT task that can migrate and is waiting
> + * to run, where the other CPUs will try to take that CPU's
> + * rq lock and possibly create large contention, sending an
> + * IPI to that CPU and letting that CPU push the RT task to where
> + * it should go may be a better scenario.
> + *
> + * This is default off for machines with <= 16 CPUs, and will
> + * be turned on at boot up for machines with > 16 CPUs.
> + */
> +SCHED_FEAT(RT_PUSH_IPI, false)
> +
>  SCHED_FEAT(FORCE_SD_OVERLAP, false)
>  SCHED_FEAT(RT_RUNTIME_SHARE, true)
>  SCHED_FEAT(LB_MIN, false)
> 

FWIW: Applying this to our latest test queue.

Thanks

John

