linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
@ 2017-04-24 15:47 Steven Rostedt
  2017-05-04 14:41 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Steven Rostedt @ 2017-04-24 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood


When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this, and if so
it will request to pull one of those tasks to itself if the non-running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task
push migration instead of pulling") changed the way to request a pull.
Instead of grabbing the lock of the overloaded CPU's runqueue, it simply
sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless of whether one or many CPUs are lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock.
		    The first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that are
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
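
For reference, the two halves of the loop in condensed form (a sketch
restating the patch below, with rd short for rq->rd):

	/* CPU that just scheduled in a lower priority task */
	tell_cpu_to_push(rq):
		atomic_inc_return(&rd->rto_loop_next);	/* force another sweep */

		/* only one CPU kicks off the loop at a time */
		if (!rto_start_trylock(&rd->rto_loop_start))
			return;

		raw_spin_lock(&rd->rto_lock);
		if (rd->rto_cpu >= nr_cpu_ids)		/* no IPI circulating */
			cpu = rto_next_cpu(rq);		/* first CPU in rto_mask */
		raw_spin_unlock(&rd->rto_lock);

		rto_start_unlock(&rd->rto_loop_start);

		if (cpu < nr_cpu_ids)
			irq_work_queue_on(&rd->rto_push_work, cpu);

	/* irq work handler, runs on an overloaded CPU */
	rto_push_irq_work_func():
		if (has_pushable_tasks(rq)) {
			raw_spin_lock(&rq->lock);
			push_rt_tasks(rq);	/* push off waiting RT tasks */
			raw_spin_unlock(&rq->lock);
		}

		raw_spin_lock(&rd->rto_lock);
		cpu = rto_next_cpu(rq);		/* next CPU in rto_mask, restarting
						 * the sweep if rto_loop_next moved,
						 * or nr_cpu_ids when all done */
		raw_spin_unlock(&rd->rto_lock);

		if (cpu < nr_cpu_ids)
			irq_work_queue_on(&rd->rto_push_work, cpu);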

This change removes the duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---

Changes from v1:

  o Addition of rto_start_trylock/unlock() helper functions
   (Peter Zijlstra)

  o Movement of rto variables next to the rto_lock that protects them
   (Peter Zijlstra)

From 4c1dea040b5d03a7e9cd57a3a40fa8c1821cafbd Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
Date: Fri, 24 Jun 2016 10:39:20 -0400
Subject: [PATCH] sched/rt: Simplify the IPI rt balancing logic

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this, and if so
it will request to pull one of those tasks to itself if the non-running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task
push migration instead of pulling") changed the way to request a pull.
Instead of grabbing the lock of the overloaded CPU's runqueue, it simply
sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless of whether one or many CPUs are lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock.
		    The first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that are
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes the duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/sched/rt.c       | 304 +++++++++++++++++-------------------------------
 kernel/sched/sched.h    |  24 ++--
 kernel/sched/topology.c |   6 +
 3 files changed, 128 insertions(+), 206 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 979b734..ba18593 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -73,10 +73,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
 void init_rt_rq(struct rt_rq *rt_rq)
 {
 	struct rt_prio_array *array;
@@ -96,13 +92,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	rt_rq->rt_nr_migratory = 0;
 	rt_rq->overloaded = 0;
 	plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
-	rt_rq->push_flags = 0;
-	rt_rq->push_cpu = nr_cpu_ids;
-	raw_spin_lock_init(&rt_rq->push_lock);
-	init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
 #endif /* CONFIG_SMP */
 	/* We start is dequeued state, because no RT tasks are queued */
 	rt_rq->rt_queued = 0;
@@ -1864,241 +1853,158 @@ static void push_rt_tasks(struct rq *rq)
 }
 
 #ifdef HAVE_RT_PUSH_IPI
+
 /*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has been shown to cause large latency when done on machines with
+ * many CPUs, an IPI is sent to the CPUs to have them push off the
+ * overloaded RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * CPU count machines, this can cause an IPI storm on a CPU, especially
+ * if it's the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs are scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tasks must be checked regardless of whether one or many CPUs are
+ * lowering their priority, there's a single irq work iterator that will
+ * try to push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rto_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rto_loop_next will cause the iterator to perform another scan.
  *
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
  */
 static int rto_next_cpu(struct rq *rq)
 {
-	int prev_cpu = rq->rt.push_cpu;
 	int cpu;
 
-	cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
 	/*
-	 * If the previous cpu is less than the rq's CPU, then it already
-	 * passed the end of the mask, and has started from the beginning.
-	 * We end if the next CPU is greater or equal to rq's CPU.
+	 * When starting the IPI RT pushing, the rto_cpu is set to nr_cpu_ids
+	 * or greater. rt_next_cpu() will simply return the first CPU found in
+	 * the rto_mask.
+	 *
+	 * If rto_next_cpu() is called with rto_cpu less than nr_cpu_ids, it
+	 * will return the next CPU found in the rto_mask.
+	 *
+	 * If there are no more CPUs left in the rto_mask, then a check is made
+	 * against rto_loop and rto_loop_next. rto_loop is only updated with
+	 * the rto_lock held, but any CPU may increment the rto_loop_next
+	 * without any locking.
 	 */
-	if (prev_cpu < rq->cpu) {
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
-
-	} else if (cpu >= nr_cpu_ids) {
-		/*
-		 * We passed the end of the mask, start at the beginning.
-		 * If the result is greater or equal to the rq's CPU, then
-		 * the loop is finished.
-		 */
+again:
+	if (rq->rd->rto_cpu >= nr_cpu_ids) {
 		cpu = cpumask_first(rq->rd->rto_mask);
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
+		rq->rd->rto_cpu = cpu;
+		/* If cpu is nr_cpu_ids, then there is no overloaded rqs */
+		return cpu;
 	}
-	rq->rt.push_cpu = cpu;
 
-	/* Return cpu to let the caller know if the loop is finished or not */
-	return cpu;
-}
+	cpu = cpumask_next(rq->rd->rto_cpu, rq->rd->rto_mask);
+	rq->rd->rto_cpu = cpu;
 
-static int find_next_push_cpu(struct rq *rq)
-{
-	struct rq *next_rq;
-	int cpu;
+	if (cpu < nr_cpu_ids)
+		return cpu;
 
-	while (1) {
-		cpu = rto_next_cpu(rq);
-		if (cpu >= nr_cpu_ids)
-			break;
-		next_rq = cpu_rq(cpu);
+	if (rq->rd->rto_loop == atomic_read(&rq->rd->rto_loop_next))
+		return cpu;
 
-		/* Make sure the next rq can push to this rq */
-		if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
-			break;
-	}
+	rq->rd->rto_loop = atomic_read(&rq->rd->rto_loop_next);
+	goto again;
+}
 
-	return cpu;
+static inline bool rto_start_trylock(atomic_t *v)
+{
+	return !atomic_cmpxchg(v, 0, 1);
 }
 
-#define RT_PUSH_IPI_EXECUTING		1
-#define RT_PUSH_IPI_RESTART		2
+static inline void rto_start_unlock(atomic_t *v)
+{
+	atomic_set_release(v, 0);
+}
 
-/*
- * When a high priority task schedules out from a CPU and a lower priority
- * task is scheduled in, a check is made to see if there's any RT tasks
- * on other CPUs that are waiting to run because a higher priority RT task
- * is currently running on its CPU. In this case, the CPU with multiple RT
- * tasks queued on it (overloaded) needs to be notified that a CPU has opened
- * up that may be able to run one of its non-running queued RT tasks.
- *
- * On large CPU boxes, there's the case that several CPUs could schedule
- * a lower priority task at the same time, in which case it will look for
- * any overloaded CPUs that it could pull a task from. To do this, the runqueue
- * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
- * for a single overloaded CPU's runqueue lock can produce a large latency.
- * (This has actually been observed on large boxes running cyclictest).
- * Instead of taking the runqueue lock of the overloaded CPU, each of the
- * CPUs that scheduled a lower priority task simply sends an IPI to the
- * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
- * lots of contention. The overloaded CPU will look to push its non-running
- * RT task off, and if it does, it can then ignore the other IPIs coming
- * in, and just pass those IPIs off to any other overloaded CPU.
- *
- * When a CPU schedules a lower priority task, it only sends an IPI to
- * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
- * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
- * RT overloaded tasks, would cause 100 IPIs to go out at once.
- *
- * The overloaded RT CPU, when receiving an IPI, will try to push off its
- * overloaded RT tasks and then send an IPI to the next CPU that has
- * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
- * have completed. Just because a CPU may have pushed off its own overloaded
- * RT task does not mean it should stop sending the IPI around to other
- * overloaded CPUs. There may be another RT task waiting to run on one of
- * those CPUs that are of higher priority than the one that was just
- * pushed.
- *
- * An optimization that could possibly be made is to make a CPU array similar
- * to the cpupri array mask of all running RT tasks, but for the overloaded
- * case, then the IPI could be sent to only the CPU with the highest priority
- * RT task waiting, and that CPU could send off further IPIs to the CPU with
- * the next highest waiting task. Since the overloaded case is much less likely
- * to happen, the complexity of this implementation may not be worth it.
- * Instead, just send an IPI around to all overloaded CPUs.
- *
- * The rq->rt.push_flags holds the status of the IPI that is going around.
- * A run queue can only send out a single IPI at a time. The possible flags
- * for rq->rt.push_flags are:
- *
- *    (None or zero):		No IPI is going around for the current rq
- *    RT_PUSH_IPI_EXECUTING:	An IPI for the rq is being passed around
- *    RT_PUSH_IPI_RESTART:	The priority of the running task for the rq
- *				has changed, and the IPI should restart
- *				circulating the overloaded CPUs again.
- *
- * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
- * before sending to the next CPU.
- *
- * Instead of having all CPUs that schedule a lower priority task send
- * an IPI to the same "first" CPU in the RT overload mask, they send it
- * to the next overloaded CPU after their own CPU. This helps distribute
- * the work when there's more than one overloaded CPU and multiple CPUs
- * scheduling in lower priority tasks.
- *
- * When a rq schedules a lower priority task than what was currently
- * running, the next CPU with overloaded RT tasks is examined first.
- * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
- * priority task, it will send an IPI first to CPU 5, then CPU 5 will
- * send to CPU 1 if it is still overloaded. CPU 1 will clear the
- * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
- *
- * The first CPU to notice IPI_RESTART is set, will clear that flag and then
- * send an IPI to the next overloaded CPU after the rq->cpu and not the next
- * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
- * schedules a lower priority task, and the IPI_RESTART gets set while the
- * handling is being done on CPU 5, it will clear the flag and send it back to
- * CPU 4 instead of CPU 1.
- *
- * Note, the above logic can be disabled by turning off the sched_feature
- * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
- * taken by the CPU requesting a pull and the waiting RT task will be pulled
- * by that CPU. This may be fine for machines with few CPUs.
- */
 static void tell_cpu_to_push(struct rq *rq)
 {
-	int cpu;
-
-	if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-		raw_spin_lock(&rq->rt.push_lock);
-		/* Make sure it's still executing */
-		if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-			/*
-			 * Tell the IPI to restart the loop as things have
-			 * changed since it started.
-			 */
-			rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
-			raw_spin_unlock(&rq->rt.push_lock);
-			return;
-		}
-		raw_spin_unlock(&rq->rt.push_lock);
-	}
+	int cpu = nr_cpu_ids;
 
-	/* When here, there's no IPI going around */
+	/* Keep the loop going if the IPI is currently active */
+	atomic_inc_return(&rq->rd->rto_loop_next);
 
-	rq->rt.push_cpu = rq->cpu;
-	cpu = find_next_push_cpu(rq);
-	if (cpu >= nr_cpu_ids)
+	/* Only one CPU can initiate a loop at a time */
+	if (!rto_start_trylock(&rq->rd->rto_loop_start))
 		return;
 
-	rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+	raw_spin_lock(&rq->rd->rto_lock);
+
+	/*
+	 * The rto_cpu is updated under the lock, if it has a valid cpu
+	 * then the IPI is still running and will continue due to the
+	 * update to loop_next, and nothing needs to be done here.
+	 * Otherwise it is finishing up and an ipi needs to be sent.
+	 */
+	if (rq->rd->rto_cpu >= nr_cpu_ids)
+		cpu = rto_next_cpu(rq);
 
-	irq_work_queue_on(&rq->rt.push_work, cpu);
+	raw_spin_unlock(&rq->rd->rto_lock);
+
+	rto_start_unlock(&rq->rd->rto_loop_start);
+
+	if (cpu < nr_cpu_ids)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 
 /* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
 {
-	struct rt_rq *rt_rq = arg;
-	struct rq *rq, *src_rq;
+	struct rq *rq;
 	int this_cpu;
 	int cpu;
 
-	this_cpu = rt_rq->push_cpu;
-
-	/* Paranoid check */
-	BUG_ON(this_cpu != smp_processor_id());
-
+	this_cpu = smp_processor_id();
 	rq = cpu_rq(this_cpu);
-	src_rq = rq_of_rt_rq(rt_rq);
 
-again:
+	/*
+	 * We do not need to grab the lock to check for has_pushable_tasks.
+	 * When it gets updated, a check is made if a push is possible.
+	 */
 	if (has_pushable_tasks(rq)) {
 		raw_spin_lock(&rq->lock);
-		push_rt_task(rq);
+		push_rt_tasks(rq);
 		raw_spin_unlock(&rq->lock);
 	}
 
-	/* Pass the IPI to the next rt overloaded queue */
-	raw_spin_lock(&rt_rq->push_lock);
-	/*
-	 * If the source queue changed since the IPI went out,
-	 * we need to restart the search from that CPU again.
-	 */
-	if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
-		rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
-		rt_rq->push_cpu = src_rq->cpu;
-	}
+	raw_spin_lock(&rq->rd->rto_lock);
 
-	cpu = find_next_push_cpu(src_rq);
+	/* Pass the IPI to the next rt overloaded queue */
+	cpu = rto_next_cpu(rq);
 
-	if (cpu >= nr_cpu_ids)
-		rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
-	raw_spin_unlock(&rt_rq->push_lock);
+	raw_spin_unlock(&rq->rd->rto_lock);
 
 	if (cpu >= nr_cpu_ids)
 		return;
 
-	/*
-	 * It is possible that a restart caused this CPU to be
-	 * chosen again. Don't bother with an IPI, just see if we
-	 * have more to push.
-	 */
-	if (unlikely(cpu == rq->cpu))
-		goto again;
-
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
-	struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
-	try_to_push_tasks(rt_rq);
+	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de4b934..17b81d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -480,7 +480,7 @@ static inline int rt_bandwidth_enabled(void)
 }
 
 /* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
 # define HAVE_RT_PUSH_IPI
 #endif
 
@@ -502,12 +502,6 @@ struct rt_rq {
 	unsigned long rt_nr_total;
 	int overloaded;
 	struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
-	int push_flags;
-	int push_cpu;
-	struct irq_work push_work;
-	raw_spinlock_t push_lock;
-#endif
 #endif /* CONFIG_SMP */
 	int rt_queued;
 
@@ -594,6 +588,19 @@ struct root_domain {
 	struct dl_bw dl_bw;
 	struct cpudl cpudl;
 
+#ifdef HAVE_RT_PUSH_IPI
+	/*
+	 * For IPI pull requests, loop across the rto_mask.
+	 */
+	struct irq_work rto_push_work;
+	raw_spinlock_t rto_lock;
+	/* These are only updated and read within rto_lock */
+	int rto_loop;
+	int rto_cpu;
+	/* These atomics are updated outside of a lock */
+	atomic_t rto_loop_next;
+	atomic_t rto_loop_start;
+#endif
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
@@ -613,6 +620,9 @@ extern void init_defrootdomain(void);
 extern int init_sched_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..c0d1472 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -253,6 +253,12 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+#ifdef HAVE_RT_PUSH_IPI
+	rd->rto_cpu = nr_cpu_ids;
+	raw_spin_lock_init(&rd->rto_lock);
+	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
 	init_dl_bw(&rd->dl_bw);
 	if (cpudl_init(&rd->cpudl) != 0)
 		goto free_rto_mask;
-- 
2.9.3


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-04-24 15:47 [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic Steven Rostedt
@ 2017-05-04 14:41 ` Peter Zijlstra
  2017-05-04 15:01   ` Steven Rostedt
  2017-05-04 15:32 ` Peter Zijlstra
  2017-10-10 11:02 ` [tip:sched/core] sched/rt: Simplify the IPI based RT " tip-bot for Steven Rostedt (Red Hat)
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-04 14:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Mon, Apr 24, 2017 at 11:47:32AM -0400, Steven Rostedt wrote:
> +	/* Keep the loop going if the IPI is currently active */
> +	atomic_inc_return(&rq->rd->rto_loop_next);

What's the return for?


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 14:41 ` Peter Zijlstra
@ 2017-05-04 15:01   ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2017-05-04 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Thu, 4 May 2017 16:41:28 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 24, 2017 at 11:47:32AM -0400, Steven Rostedt wrote:
> > +	/* Keep the loop going if the IPI is currently active */
> > +	atomic_inc_return(&rq->rd->rto_loop_next);  
> 
> What's the return for?

-ENOCONTEXT (too much cut from email)

/me goes and finds his email that he sent.

	/* Keep the loop going if the IPI is currently active */
	atomic_inc_return(&rq->rd->rto_loop_next);

	/* Only one CPU can initiate a loop at a time */
	if (!rto_start_trylock(&rq->rd->rto_loop_start))
		return;

Ah, it's not needed. I think I had that to supply a full memory
barrier for a previous version, but now that rto_start_trylock() is:

static inline bool rto_start_trylock(atomic_t *v)
{
	return !atomic_cmpxchg(v, 0, 1);
}

Which supplies its own memory barrier, it can now be a simple

   atomic_inc().
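
I.e. (a sketch of what that hunk becomes):

	/* Keep the loop going if the IPI is currently active */
	atomic_inc(&rq->rd->rto_loop_next);

	/* Only one CPU can initiate a loop at a time */
	if (!rto_start_trylock(&rq->rd->rto_loop_start))
		return;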

Thanks, will update.

-- Steve


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-04-24 15:47 [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic Steven Rostedt
  2017-05-04 14:41 ` Peter Zijlstra
@ 2017-05-04 15:32 ` Peter Zijlstra
  2017-05-04 17:25   ` Steven Rostedt
  2017-10-10 11:02 ` [tip:sched/core] sched/rt: Simplify the IPI based RT " tip-bot for Steven Rostedt (Red Hat)
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-04 15:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Mon, Apr 24, 2017 at 11:47:32AM -0400, Steven Rostedt wrote:
>  static int rto_next_cpu(struct rq *rq)
>  {
>  	int cpu;
>  
>  	/*
> +	 * When starting the IPI RT pushing, the rto_cpu is set to nr_cpu_ids
> +	 * or greater. rt_next_cpu() will simply return the first CPU found in
> +	 * the rto_mask.
> +	 *
> +	 * If rto_next_cpu() is called with rto_cpu less than nr_cpu_ids, it
> +	 * will return the next CPU found in the rto_mask.
> +	 *
> +	 * If there are no more CPUs left in the rto_mask, then a check is made
> +	 * against rto_loop and rto_loop_next. rto_loop is only updated with
> +	 * the rto_lock held, but any CPU may increment the rto_loop_next
> +	 * without any locking.
>  	 */
> +again:
> +	if (rq->rd->rto_cpu >= nr_cpu_ids) {
>  		cpu = cpumask_first(rq->rd->rto_mask);
> +		rq->rd->rto_cpu = cpu;
> +		/* If cpu is nr_cpu_ids, then there is no overloaded rqs */
> +		return cpu;
>  	}
>  
> +	cpu = cpumask_next(rq->rd->rto_cpu, rq->rd->rto_mask);
> +	rq->rd->rto_cpu = cpu;
>  
> +	if (cpu < nr_cpu_ids)
> +		return cpu;
>  
> +	if (rq->rd->rto_loop == atomic_read(&rq->rd->rto_loop_next))
> +		return cpu;
>  
> +	rq->rd->rto_loop = atomic_read(&rq->rd->rto_loop_next);
> +	goto again;
> +}

I think you want to write that as:

	struct root_domain *rd = rq->rd;
	int cpu, next;

	/* comment */
	for (;;) { 
		if (rd->rto_cpu >= nr_cpu_ids) {
			cpu = cpumask_first(rd->rto_mask);
			rd->rto_cpu = cpu;
			return cpu;
		}

		cpu = cpumask_next(rd->rto_mask);
		rd->rto_cpu = cpu;

		if (cpu < nr_cpu_ids)
			break;

//		rd->rto_cpu = -1;

		/*
		 * ACQUIRE ensures we see the @rto_mask changes
		 * made prior to the @next value observed.
		 * 
		 * Matches WMB in rt_set_overload().
		 */
		next = atomic_read_acquire(&rd->rto_loop_next);

		if (rd->rto_loop == next)
			break;

		rd->rto_loop = next;
	}

	return cpu;

And I don't fully understand the whole rto_cpu >= nr_cpu_ids thing,
can't you simply reset the thing to -1 and always use cpumask_next()?
As per the // comment above?

> +static inline bool rto_start_trylock(atomic_t *v)
> +{
> +	return !atomic_cmpxchg(v, 0, 1);

Arguably this could be: !atomic_cmpxchg_acquire(v, 0, 1);

>  }
>  
> +static inline void rto_start_unlock(atomic_t *v)
> +{
> +	atomic_set_release(v, 0);
> +}
>  

>  static void tell_cpu_to_push(struct rq *rq)
>  {
> +	int cpu = nr_cpu_ids;
>  
> +	/* Keep the loop going if the IPI is currently active */
> +	atomic_inc_return(&rq->rd->rto_loop_next);

Since rt_set_overload() already provides a WMB, we don't need an
ordered primitive here and atomic_inc() is fine.

>  
> +	/* Only one CPU can initiate a loop at a time */
> +	if (!rto_start_trylock(&rq->rd->rto_loop_start))
>  		return;
>  
> +	raw_spin_lock(&rq->rd->rto_lock);
> +
> +	/*
> +	 * The rto_cpu is updated under the lock, if it has a valid cpu
> +	 * then the IPI is still running and will continue due to the
> +	 * update to loop_next, and nothing needs to be done here.
> +	 * Otherwise it is finishing up and an ipi needs to be sent.
> +	 */
> +	if (rq->rd->rto_cpu >= nr_cpu_ids)
//	if (rq->rd->rto_cpu < 0)

> +		cpu = rto_next_cpu(rq);
>  
> +	raw_spin_unlock(&rq->rd->rto_lock);
> +
> +	rto_start_unlock(&rq->rd->rto_loop_start);
> +
> +	if (cpu < nr_cpu_ids)
> +		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
>  }
>  
>  /* Called from hardirq context */
> +void rto_push_irq_work_func(struct irq_work *work)
>  {
> +	struct rq *rq;
>  	int this_cpu;
>  	int cpu;
>  
> +	this_cpu = smp_processor_id();
>  	rq = cpu_rq(this_cpu);

	rq = this_rq();

>  
> +	/*
> +	 * We do not need to grab the lock to check for has_pushable_tasks.
> +	 * When it gets updated, a check is made if a push is possible.
> +	 */
>  	if (has_pushable_tasks(rq)) {
>  		raw_spin_lock(&rq->lock);
> +		push_rt_tasks(rq);
>  		raw_spin_unlock(&rq->lock);
>  	}
>  
> +	raw_spin_lock(&rq->rd->rto_lock);
>  
> +	/* Pass the IPI to the next rt overloaded queue */
> +	cpu = rto_next_cpu(rq);
>  
> +	raw_spin_unlock(&rq->rd->rto_lock);
>  
>  	if (cpu >= nr_cpu_ids)
>  		return;
>  
>  	/* Try the next RT overloaded CPU */
> +	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
>  }


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 15:32 ` Peter Zijlstra
@ 2017-05-04 17:25   ` Steven Rostedt
  2017-05-04 18:42     ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2017-05-04 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Thu, 4 May 2017 17:32:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 24, 2017 at 11:47:32AM -0400, Steven Rostedt wrote:
> >  static int rto_next_cpu(struct rq *rq)
> >  {
> >  	int cpu;
> >  
> >  	/*
> > +	 * When starting the IPI RT pushing, the rto_cpu is set to nr_cpu_ids
> > +	 * or greater. rt_next_cpu() will simply return the first CPU found in
> > +	 * the rto_mask.
> > +	 *
> > +	 * If rto_next_cpu() is called with rto_cpu less than nr_cpu_ids, it
> > +	 * will return the next CPU found in the rto_mask.
> > +	 *
> > +	 * If there are no more CPUs left in the rto_mask, then a check is made
> > +	 * against rto_loop and rto_loop_next. rto_loop is only updated with
> > +	 * the rto_lock held, but any CPU may increment the rto_loop_next
> > +	 * without any locking.
> >  	 */
> > +again:
> > +	if (rq->rd->rto_cpu >= nr_cpu_ids) {
> >  		cpu = cpumask_first(rq->rd->rto_mask);
> > +		rq->rd->rto_cpu = cpu;
> > +		/* If cpu is nr_cpu_ids, then there is no overloaded rqs */
> > +		return cpu;
> >  	}
> >  
> > +	cpu = cpumask_next(rq->rd->rto_cpu, rq->rd->rto_mask);
> > +	rq->rd->rto_cpu = cpu;
> >  
> > +	if (cpu < nr_cpu_ids)
> > +		return cpu;
> >  
> > +	if (rq->rd->rto_loop == atomic_read(&rq->rd->rto_loop_next))
> > +		return cpu;
> >  
> > +	rq->rd->rto_loop = atomic_read(&rq->rd->rto_loop_next);
> > +	goto again;
> > +}  
> 
> I think you want to write that as:
> 
> 	struct root_domain *rd = rq->rd;
> 	int cpu, next;
> 
> 	/* comment */
> 	for (;;) { 
> 		if (rd->rto_cpu >= nr_cpu_ids) {

If we go with your change, then this needs to be:

		if (rd->rto_cpu < 0) {

> 			cpu = cpumask_first(rd->rto_mask);
> 			rd->rto_cpu = cpu;
> 			return cpu;
> 		}
> 
> 		cpu = cpumask_next(rd->rto_mask);

cpumask_next() requires two parameters.

> 		rd->rto_cpu = cpu;
> 
> 		if (cpu < nr_cpu_ids)
> 			break;
> 
> //		rd->rto_cpu = -1;
> 
> 		/*
> 		 * ACQUIRE ensures we see the @rto_mask changes
> 		 * made prior to the @next value observed.
> 		 * 
> 		 * Matches WMB in rt_set_overload().
> 		 */
> 		next = atomic_read_acquire(&rd->rto_loop_next);
> 
> 		if (rd->rto_loop == next)
> 			break;
> 
> 		rd->rto_loop = next;
> 	}
> 
> 	return cpu;
> 
> And I don't fully understand the whole rto_cpu >= nr_cpu_ids thing,
> can't you simply reset the thing to -1 and always use cpumask_next()?
> As per the // comment above?
> 
> > +static inline bool rto_start_trylock(atomic_t *v)
> > +{
> > +	return !atomic_cmpxchg(v, 0, 1);  
> 
> Arguably this could be: !atomic_cmpxchg_acquire(v, 0, 1);

Yes agreed. But if you remember, at the time I was basing this off of
tip/sched/core, which didn't have atomic_cmpxchg_acquire() available.

> 
> >  }
> >  
> > +static inline void rto_start_unlock(atomic_t *v)
> > +{
> > +	atomic_set_release(v, 0);
> > +}
> >    
> 
> >  static void tell_cpu_to_push(struct rq *rq)
> >  {
> > +	int cpu = nr_cpu_ids;
> >  
> > +	/* Keep the loop going if the IPI is currently active */
> > +	atomic_inc_return(&rq->rd->rto_loop_next);  
> 
> Since rt_set_overload() already provides a WMB, we don't need an
> ordered primitive here and atomic_inc() is fine.

Agree, I mentioned this in my previous reply. It was leftover from
previous versions of the patch. I believe I also needed a memory
barrier with this and the check for rto_loop_start. Can't remember if
that was the case, but it doesn't matter now as loop_start is now
updated with a cmpxchg.

> 
> >  
> > +	/* Only one CPU can initiate a loop at a time */
> > +	if (!rto_start_trylock(&rq->rd->rto_loop_start))
> >  		return;
> >  
> > +	raw_spin_lock(&rq->rd->rto_lock);
> > +
> > +	/*
> > +	 * The rto_cpu is updated under the lock, if it has a valid cpu
> > +	 * then the IPI is still running and will continue due to the
> > +	 * update to loop_next, and nothing needs to be done here.
> > +	 * Otherwise it is finishing up and an ipi needs to be sent.
> > +	 */
> > +	if (rq->rd->rto_cpu >= nr_cpu_ids)  
> //	if (rq->rd->rto_cpu < 0)

This can be done; I was just being a bit more conservative and having
rto_cpu have fewer states (valid CPU or nr_cpu_ids). With a -1, we have
to manually set it to that. But I'm fine with doing it that way too.

This went through several iterations. There were times where using a -1
wasn't so simple.

> 
> > +		cpu = rto_next_cpu(rq);
> >  
> > +	raw_spin_unlock(&rq->rd->rto_lock);
> > +
> > +	rto_start_unlock(&rq->rd->rto_loop_start);
> > +
> > +	if (cpu < nr_cpu_ids)
> > +		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> >  }
> >  
> >  /* Called from hardirq context */
> > +void rto_push_irq_work_func(struct irq_work *work)
> >  {
> > +	struct rq *rq;
> >  	int this_cpu;
> >  	int cpu;
> >  
> > +	this_cpu = smp_processor_id();
> >  	rq = cpu_rq(this_cpu);  
> 
> 	rq = this_rq();

Heh, sure. I guess I was just keeping it with the previous logic.

Thanks for the review. I'll spin up a new patch. Unfortunately, I no
longer have access to the behemoth machine. I'll only be testing this
on 4 cores now, or 8 with HT.

-- Steve


> 
> >  
> > +	/*
> > +	 * We do not need to grab the lock to check for has_pushable_tasks.
> > +	 * When it gets updated, a check is made if a push is possible.
> > +	 */
> >  	if (has_pushable_tasks(rq)) {
> >  		raw_spin_lock(&rq->lock);
> > +		push_rt_tasks(rq);
> >  		raw_spin_unlock(&rq->lock);
> >  	}
> >  
> > +	raw_spin_lock(&rq->rd->rto_lock);
> >  
> > +	/* Pass the IPI to the next rt overloaded queue */
> > +	cpu = rto_next_cpu(rq);
> >  
> > +	raw_spin_unlock(&rq->rd->rto_lock);
> >  
> >  	if (cpu >= nr_cpu_ids)
> >  		return;
> >  
> >  	/* Try the next RT overloaded CPU */
> > +	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> >  }  


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 17:25   ` Steven Rostedt
@ 2017-05-04 18:42     ` Peter Zijlstra
  2017-05-04 19:03       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-04 18:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Thu, May 04, 2017 at 01:25:38PM -0400, Steven Rostedt wrote:
> > I think you want to write that as:
> > 
> > 	struct root_domain *rd = rq->rd;
> > 	int cpu, next;
> > 
> > 	/* comment */
> > 	for (;;) { 
> > 		if (rd->rto_cpu >= nr_cpu_ids) {
> 
> If we go with your change, then this needs to be:
> 
> 		if (rd->rto_cpu < 0) {
> 
> > 			cpu = cpumask_first(rd->rto_mask);
> > 			rd->rto_cpu = cpu;
> > 			return cpu;
> > 		}

No you can leave it out entirely.

> > 
> > 		cpu = cpumask_next(rd->rto_mask);
> 
> cpumask_next() requires two parameters.

Indeed it does:

		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);

will be cpumask_first() when rto_cpu == -1, see for example
for_each_cpu().
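
That is (sketch only), seeding the iteration with -1 makes the first
cpumask_next() call scan from bit 0, the same way for_each_cpu() does:

	int cpu = -1;

	cpu = cpumask_next(cpu, rd->rto_mask);	/* behaves like cpumask_first() */
	while (cpu < nr_cpu_ids) {
		/* visit cpu ... */
		cpu = cpumask_next(cpu, rd->rto_mask);
	}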

> > > +static inline bool rto_start_trylock(atomic_t *v)
> > > +{
> > > +	return !atomic_cmpxchg(v, 0, 1);  
> > 
> > Arguably this could be: !atomic_cmpxchg_acquire(v, 0, 1);
> 
> Yes agreed. But if you remember, at the time I was basing this off of
> tip/sched/core, which didn't have atomic_cmpxchg_acquire() available.

No that's the try_cmpxchg stuff, the _acquire stuff is long in.

> Thanks for the review. I'll spin up a new patch. Unfortunately, I no
> longer have access to the behemoth machine. I'll only be testing this
> on 4 cores now, or 8 with HT.

I have something with 144 CPUs in or thereabout, if you have the
testcase handy I can give it a spin.


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 18:42     ` Peter Zijlstra
@ 2017-05-04 19:03       ` Steven Rostedt
  2017-05-05  4:26         ` Mike Galbraith
  2017-05-05 11:05         ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2017-05-04 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Thu, 4 May 2017 20:42:00 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 04, 2017 at 01:25:38PM -0400, Steven Rostedt wrote:
> > > I think you want to write that as:
> > > 
> > > 	struct root_domain *rd = rq->rd;
> > > 	int cpu, next;
> > > 
> > > 	/* comment */
> > > 	for (;;) { 
> > > 		if (rd->rto_cpu >= nr_cpu_ids) {  
> > 
> > If we go with your change, then this needs to be:
> > 
> > 		if (rd->rto_cpu < 0) {
> >   
> > > 			cpu = cpumask_first(rd->rto_mask);
> > > 			rd->rto_cpu = cpu;
> > > 			return cpu;
> > > 		}  
> 
> No you can leave it out entirely.
> 
> > > 
> > > 		cpu = cpumask_next(rd->rto_mask);  
> > 
> > cpumask_next() requires two parameters.  
> 
> Indeed it does:
> 
> 		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
> 
> will be cpumask_first() when rto_cpu == -1, see for example
> for_each_cpu().

OK, I'll have to take a look.

> 
> > > > +static inline bool rto_start_trylock(atomic_t *v)
> > > > +{
> > > > +	return !atomic_cmpxchg(v, 0, 1);    
> > > 
> > > Arguably this could be: !atomic_cmpxchg_acquire(v, 0, 1);  
> > 
> > Yes agreed. But if you remember, at the time I was basing this off of
> > tip/sched/core, which didn't have atomic_cmpxchg_acquire() available.  
> 
> No that's the try_cmpxchg stuff, the _acquire stuff is long in.

OK, I was confused with the try stuff, as I remember that was what you
suggested before.

> 
> > Thanks for the review. I'll spin up a new patch. Unfortunately, I no
> > longer have access to the behemoth machine. I'll only be testing this
> > on 4 cores now, or 8 with HT.  
> 
> I have something with 144 CPUs in or thereabout, if you have the
> testcase handy I can give it a spin.

My test case is twofold. It basically just involves running rteval.

One is to run it on latest mainline to make sure it doesn't crash. The
other is to backport it to the latest -rt patch, and see how well it
helps with latency.

To get rteval:

 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rteval.git
 $ cd rteval
 $ git checkout origin/v2/master

-- Steve


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 19:03       ` Steven Rostedt
@ 2017-05-05  4:26         ` Mike Galbraith
  2017-05-05  5:16           ` Mike Galbraith
  2017-05-05 11:05         ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2017-05-05  4:26 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood


> To get rteval:
> 
>  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rteval.git
>  $ cd rteval
>  $ git checkout origin/v2/master

ImportError: No module named ethtool

Where does one find the ethtool bits it seems to depend upon?

	-Mike


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-05  4:26         ` Mike Galbraith
@ 2017-05-05  5:16           ` Mike Galbraith
  0 siblings, 0 replies; 30+ messages in thread
From: Mike Galbraith @ 2017-05-05  5:16 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Fri, 2017-05-05 at 06:26 +0200, Mike Galbraith wrote:
> > To get rteval:
> > 
> >  $ git clone
> > git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rteval.git
> >  $ cd rteval
> >  $ git checkout origin/v2/master
> 
> ImportError: No module named ethtool
> 
> Where does one find the ethtool bits it seems to depend upon?

Nevermind, google found a tarball.

	-Mike


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-04 19:03       ` Steven Rostedt
  2017-05-05  4:26         ` Mike Galbraith
@ 2017-05-05 11:05         ` Peter Zijlstra
  2017-05-05 12:02           ` Steven Rostedt
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-05 11:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Thu, May 04, 2017 at 03:03:55PM -0400, Steven Rostedt wrote:
> My test case is two fold. It basically just involves running rteval.
> 
> One is to run it on latest mainline to make sure it doesn't crash. The
> other is to backport it to the latest -rt patch, and see how well it
> helps with latency.
> 
> To get rteval:
> 
>  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rteval.git
>  $ cd rteval
>  $ git checkout origin/v2/master

Blergh, that thing wants a gazillion things installed.

Can't I run something simple like rt-migrate-test with some arguments to
stress the box out?


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-05 11:05         ` Peter Zijlstra
@ 2017-05-05 12:02           ` Steven Rostedt
  2017-05-05 17:39             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2017-05-05 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Fri, 5 May 2017 13:05:29 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 04, 2017 at 03:03:55PM -0400, Steven Rostedt wrote:
> > My test case is two fold. It basically just involves running rteval.
> > 
> > One is to run it on latest mainline to make sure it doesn't crash. The
> > other is to backport it to the latest -rt patch, and see how well it
> > helps with latency.
> > 
> > To get rteval:
> > 
> >  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rteval.git
> >  $ cd rteval
> >  $ git checkout origin/v2/master  
> 
> Blergh, that thing wants a gazillion things installed.

Blame Clark and friends ;-)

> 
> Can't I run something simple like rt-migrate-test with some arguments to
> stress the box out?

Actually what rteval does is basically 3 things. It runs cyclictest,
hackbench in a loop and a kernel build in a loop.

Note, rteval binds the hackbench and kernel builds to nodes. That
is, if you have 4 nodes, it will run four instances of loops of both
hackbench and kernel builds in each of the nodes. This is because
hackbench and access to the filesystem across nodes with a stress test
can cause exorbitant latency due to cross node memory access on spin
locks.

I usually run cyclictest with:

 cyclictest -p80 -i250 -n -a -t -q -d 0

Although I think rteval does it slightly differently, like adding --numa
to it.

Clark, want to explain more?

-- Steve


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-05 12:02           ` Steven Rostedt
@ 2017-05-05 17:39             ` Peter Zijlstra
  2017-05-05 18:59               ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-05 17:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Fri, May 05, 2017 at 08:02:38AM -0400, Steven Rostedt wrote:
> Actually what rteval does is basically 3 things. It runs cyclictest,
> hackbench in a loop and a kernel build in a loop.

Of those, only cyclictest uses RT tasks and would end up poking at the
bits you just changed.

So just running cyclictest should lock up a ~120 CPU machine?


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-05 17:39             ` Peter Zijlstra
@ 2017-05-05 18:59               ` Steven Rostedt
  2017-05-06  7:52                 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2017-05-05 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Fri, 5 May 2017 19:39:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 05, 2017 at 08:02:38AM -0400, Steven Rostedt wrote:
> > Actually what rteval does is basically 3 things. It runs cyclictest,
> > hackbench in a loop and a kernel build in a loop.  
> 
> Of those, only cyclictest uses RT tasks and would end up poking at the
> bits you just changed.
> 
> So just running cyclictest should lock up a ~120 CPU machine?

The other tools tend to trigger RT kernel threads as well, which causes
migration. cyclictest tasks don't migrate, but they do cause other
tasks to want to move around.

You can try rt-migrate-test too, because I used that to trigger some
bugs in previous versions.

-- Steve


* Re: [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic
  2017-05-05 18:59               ` Steven Rostedt
@ 2017-05-06  7:52                 ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-05-06  7:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Clark Williams,
	Daniel Bristot de Oliveira, John Kacur, Scott Wood

On Fri, May 05, 2017 at 02:59:16PM -0400, Steven Rostedt wrote:
> On Fri, 5 May 2017 19:39:49 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, May 05, 2017 at 08:02:38AM -0400, Steven Rostedt wrote:
> > > Actually what rteval does is basically 3 things. It runs cyclictest,
> > > hackbench in a loop and a kernel build in a loop.  
> > 
> > Of those, only cyclictest uses RT tasks and would end up poking at the
> > bits you just changed.
> > 
> > So just running cyclictest should lock up a ~120 CPU machine?
> 
> The other tools tend to trigger RT kernel threads as well, which causes
> migration. cyclictest tasks don't migrate, but they do cause other
> tasks to want to move around.

There aren't that many RT threads on a !RT kernel. All my laptop has for
example are:

$ ps -faxo pid,class,comm | grep -v TS
  PID CLS COMMAND
    9 FF   \_ migration/0
   10 FF   \_ watchdog/0
   11 FF   \_ watchdog/1
   12 FF   \_ migration/1
   16 FF   \_ watchdog/2
   17 FF   \_ migration/2
   21 FF   \_ watchdog/3
   22 FF   \_ migration/3
  714 FF   \_ irq/48-iwlwifi
24254 FF   \_ irq/47-mei_me
 2444 B            \_ baloo_file

> You can try rt-migrate-test too, because I used that to trigger some
> bugs in previous versions.

Just running rt-migrate-test 'works' on the 144 CPU box. But I ran it
while it was otherwise idle. That is, it ran in reasonable time and no
lockup messages were produced.

I can try and run it together with cyclictest, but overall this sounds
like we're missing a usable test-case. If I have a spare moment (lol) I
might poke at rt-migrate-test to see if I can make it more aggressive or
something.


* [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2017-04-24 15:47 [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic Steven Rostedt
  2017-05-04 14:41 ` Peter Zijlstra
  2017-05-04 15:32 ` Peter Zijlstra
@ 2017-10-10 11:02 ` tip-bot for Steven Rostedt (Red Hat)
  2018-01-19  9:23   ` Pavan Kondeti
  2 siblings, 1 reply; 30+ messages in thread
From: tip-bot for Steven Rostedt (Red Hat) @ 2017-10-10 11:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, williams, mingo, tglx, peterz, linux-kernel, jkacur,
	bristot, torvalds, efault, hpa, swood

Commit-ID:  4bdced5c9a2922521e325896a7bbbf0132c94e56
Gitweb:     https://git.kernel.org/tip/4bdced5c9a2922521e325896a7bbbf0132c94e56
Author:     Steven Rostedt (Red Hat) <rostedt@goodmis.org>
AuthorDate: Fri, 6 Oct 2017 14:05:04 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 10 Oct 2017 11:45:40 +0200

sched/rt: Simplify the IPI based RT balancing logic

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this, and if so
it will request to pull one of those tasks to itself if the non-running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:

  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless of whether one or many CPUs are lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock.
		    The first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is negative, then there's either no IPI being sent around, or one
is about to finish. In that case rto_cpu is set to the first CPU in rto_mask
and an IPI is sent to that CPU. If there are no CPUs set in rto_mask, then
there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that are
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
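
For readers skimming this, the iterator and restart logic described above can
be condensed into a small, self-contained, single-threaded model (purely
illustrative: the helper is named rto_next_cpu_model() to make clear it is not
the kernel function, and the rto_lock, atomics and memory ordering of the real
code are left out; the authoritative implementation is the patch below):

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Illustrative model of the per-root-domain state, not kernel code. */
static bool rto_mask[NR_CPUS] = { [1] = true, [5] = true }; /* "overloaded" CPUs */
static int  rto_cpu  = -1;	/* cursor: last CPU sent the IPI, -1 when idle */
static int  rto_loop;		/* generation the iterator has worked through  */
static int  rto_loop_next;	/* bumped whenever a CPU lowers its priority   */

/* Return the next overloaded CPU to IPI, or -1 once the scan is complete
 * and rto_loop_next has not moved since the scan started. */
static int rto_next_cpu_model(void)
{
	for (;;) {
		int cpu;

		for (cpu = rto_cpu + 1; cpu < NR_CPUS; cpu++) {
			if (rto_mask[cpu]) {
				rto_cpu = cpu;
				return cpu;
			}
		}
		rto_cpu = -1;

		if (rto_loop == rto_loop_next)
			return -1;		/* nothing changed: the loop is done  */
		rto_loop = rto_loop_next;	/* something changed: rescan the mask */
	}
}

int main(void)
{
	int cpu;

	rto_loop_next++;	/* a CPU just scheduled in a lower priority task */
	while ((cpu = rto_next_cpu_model()) >= 0)
		printf("IPI -> CPU %d\n", cpu);	/* stands in for irq_work_queue_on() */
	return 0;
}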

This change removes the duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/rt.c       | 316 ++++++++++++++++++------------------------------
 kernel/sched/sched.h    |  24 ++--
 kernel/sched/topology.c |   6 +
 3 files changed, 138 insertions(+), 208 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0af5ca9..fda2799 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -73,10 +73,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
 void init_rt_rq(struct rt_rq *rt_rq)
 {
 	struct rt_prio_array *array;
@@ -96,13 +92,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	rt_rq->rt_nr_migratory = 0;
 	rt_rq->overloaded = 0;
 	plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
-	rt_rq->push_flags = 0;
-	rt_rq->push_cpu = nr_cpu_ids;
-	raw_spin_lock_init(&rt_rq->push_lock);
-	init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
 #endif /* CONFIG_SMP */
 	/* We start is dequeued state, because no RT tasks are queued */
 	rt_rq->rt_queued = 0;
@@ -1875,241 +1864,166 @@ static void push_rt_tasks(struct rq *rq)
 }
 
 #ifdef HAVE_RT_PUSH_IPI
+
 /*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has shown to cause large latency when done on machines with many
+ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
+ * RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * count CPU machines, this can cause an IPI storm on a CPU, especially
+ * if its the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tassk must be checked if there's one or many CPUs that are lowering
+ * their priority, there's a single irq work iterator that will try to
+ * push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rt_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rt_loop_next will cause the iterator to perform another scan.
  *
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
  */
 static int rto_next_cpu(struct rq *rq)
 {
-	int prev_cpu = rq->rt.push_cpu;
+	struct root_domain *rd = rq->rd;
+	int next;
 	int cpu;
 
-	cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
 	/*
-	 * If the previous cpu is less than the rq's CPU, then it already
-	 * passed the end of the mask, and has started from the beginning.
-	 * We end if the next CPU is greater or equal to rq's CPU.
+	 * When starting the IPI RT pushing, the rto_cpu is set to -1,
+	 * rt_next_cpu() will simply return the first CPU found in
+	 * the rto_mask.
+	 *
+	 * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
+	 * will return the next CPU found in the rto_mask.
+	 *
+	 * If there are no more CPUs left in the rto_mask, then a check is made
+	 * against rto_loop and rto_loop_next. rto_loop is only updated with
+	 * the rto_lock held, but any CPU may increment the rto_loop_next
+	 * without any locking.
 	 */
-	if (prev_cpu < rq->cpu) {
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
+	for (;;) {
 
-	} else if (cpu >= nr_cpu_ids) {
-		/*
-		 * We passed the end of the mask, start at the beginning.
-		 * If the result is greater or equal to the rq's CPU, then
-		 * the loop is finished.
-		 */
-		cpu = cpumask_first(rq->rd->rto_mask);
-		if (cpu >= rq->cpu)
-			return nr_cpu_ids;
-	}
-	rq->rt.push_cpu = cpu;
+		/* When rto_cpu is -1 this acts like cpumask_first() */
+		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
 
-	/* Return cpu to let the caller know if the loop is finished or not */
-	return cpu;
-}
+		rd->rto_cpu = cpu;
 
-static int find_next_push_cpu(struct rq *rq)
-{
-	struct rq *next_rq;
-	int cpu;
+		if (cpu < nr_cpu_ids)
+			return cpu;
 
-	while (1) {
-		cpu = rto_next_cpu(rq);
-		if (cpu >= nr_cpu_ids)
-			break;
-		next_rq = cpu_rq(cpu);
+		rd->rto_cpu = -1;
+
+		/*
+		 * ACQUIRE ensures we see the @rto_mask changes
+		 * made prior to the @next value observed.
+		 *
+		 * Matches WMB in rt_set_overload().
+		 */
+		next = atomic_read_acquire(&rd->rto_loop_next);
 
-		/* Make sure the next rq can push to this rq */
-		if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+		if (rd->rto_loop == next)
 			break;
+
+		rd->rto_loop = next;
 	}
 
-	return cpu;
+	return -1;
 }
 
-#define RT_PUSH_IPI_EXECUTING		1
-#define RT_PUSH_IPI_RESTART		2
+static inline bool rto_start_trylock(atomic_t *v)
+{
+	return !atomic_cmpxchg_acquire(v, 0, 1);
+}
 
-/*
- * When a high priority task schedules out from a CPU and a lower priority
- * task is scheduled in, a check is made to see if there's any RT tasks
- * on other CPUs that are waiting to run because a higher priority RT task
- * is currently running on its CPU. In this case, the CPU with multiple RT
- * tasks queued on it (overloaded) needs to be notified that a CPU has opened
- * up that may be able to run one of its non-running queued RT tasks.
- *
- * On large CPU boxes, there's the case that several CPUs could schedule
- * a lower priority task at the same time, in which case it will look for
- * any overloaded CPUs that it could pull a task from. To do this, the runqueue
- * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
- * for a single overloaded CPU's runqueue lock can produce a large latency.
- * (This has actually been observed on large boxes running cyclictest).
- * Instead of taking the runqueue lock of the overloaded CPU, each of the
- * CPUs that scheduled a lower priority task simply sends an IPI to the
- * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
- * lots of contention. The overloaded CPU will look to push its non-running
- * RT task off, and if it does, it can then ignore the other IPIs coming
- * in, and just pass those IPIs off to any other overloaded CPU.
- *
- * When a CPU schedules a lower priority task, it only sends an IPI to
- * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
- * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
- * RT overloaded tasks, would cause 100 IPIs to go out at once.
- *
- * The overloaded RT CPU, when receiving an IPI, will try to push off its
- * overloaded RT tasks and then send an IPI to the next CPU that has
- * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
- * have completed. Just because a CPU may have pushed off its own overloaded
- * RT task does not mean it should stop sending the IPI around to other
- * overloaded CPUs. There may be another RT task waiting to run on one of
- * those CPUs that are of higher priority than the one that was just
- * pushed.
- *
- * An optimization that could possibly be made is to make a CPU array similar
- * to the cpupri array mask of all running RT tasks, but for the overloaded
- * case, then the IPI could be sent to only the CPU with the highest priority
- * RT task waiting, and that CPU could send off further IPIs to the CPU with
- * the next highest waiting task. Since the overloaded case is much less likely
- * to happen, the complexity of this implementation may not be worth it.
- * Instead, just send an IPI around to all overloaded CPUs.
- *
- * The rq->rt.push_flags holds the status of the IPI that is going around.
- * A run queue can only send out a single IPI at a time. The possible flags
- * for rq->rt.push_flags are:
- *
- *    (None or zero):		No IPI is going around for the current rq
- *    RT_PUSH_IPI_EXECUTING:	An IPI for the rq is being passed around
- *    RT_PUSH_IPI_RESTART:	The priority of the running task for the rq
- *				has changed, and the IPI should restart
- *				circulating the overloaded CPUs again.
- *
- * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
- * before sending to the next CPU.
- *
- * Instead of having all CPUs that schedule a lower priority task send
- * an IPI to the same "first" CPU in the RT overload mask, they send it
- * to the next overloaded CPU after their own CPU. This helps distribute
- * the work when there's more than one overloaded CPU and multiple CPUs
- * scheduling in lower priority tasks.
- *
- * When a rq schedules a lower priority task than what was currently
- * running, the next CPU with overloaded RT tasks is examined first.
- * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
- * priority task, it will send an IPI first to CPU 5, then CPU 5 will
- * send to CPU 1 if it is still overloaded. CPU 1 will clear the
- * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
- *
- * The first CPU to notice IPI_RESTART is set, will clear that flag and then
- * send an IPI to the next overloaded CPU after the rq->cpu and not the next
- * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
- * schedules a lower priority task, and the IPI_RESTART gets set while the
- * handling is being done on CPU 5, it will clear the flag and send it back to
- * CPU 4 instead of CPU 1.
- *
- * Note, the above logic can be disabled by turning off the sched_feature
- * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
- * taken by the CPU requesting a pull and the waiting RT task will be pulled
- * by that CPU. This may be fine for machines with few CPUs.
- */
-static void tell_cpu_to_push(struct rq *rq)
+static inline void rto_start_unlock(atomic_t *v)
 {
-	int cpu;
+	atomic_set_release(v, 0);
+}
 
-	if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-		raw_spin_lock(&rq->rt.push_lock);
-		/* Make sure it's still executing */
-		if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
-			/*
-			 * Tell the IPI to restart the loop as things have
-			 * changed since it started.
-			 */
-			rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
-			raw_spin_unlock(&rq->rt.push_lock);
-			return;
-		}
-		raw_spin_unlock(&rq->rt.push_lock);
-	}
+static void tell_cpu_to_push(struct rq *rq)
+{
+	int cpu = -1;
 
-	/* When here, there's no IPI going around */
+	/* Keep the loop going if the IPI is currently active */
+	atomic_inc(&rq->rd->rto_loop_next);
 
-	rq->rt.push_cpu = rq->cpu;
-	cpu = find_next_push_cpu(rq);
-	if (cpu >= nr_cpu_ids)
+	/* Only one CPU can initiate a loop at a time */
+	if (!rto_start_trylock(&rq->rd->rto_loop_start))
 		return;
 
-	rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+	raw_spin_lock(&rq->rd->rto_lock);
+
+	/*
+	 * The rto_cpu is updated under the lock, if it has a valid cpu
+	 * then the IPI is still running and will continue due to the
+	 * update to loop_next, and nothing needs to be done here.
+	 * Otherwise it is finishing up and an ipi needs to be sent.
+	 */
+	if (rq->rd->rto_cpu < 0)
+		cpu = rto_next_cpu(rq);
 
-	irq_work_queue_on(&rq->rt.push_work, cpu);
+	raw_spin_unlock(&rq->rd->rto_lock);
+
+	rto_start_unlock(&rq->rd->rto_loop_start);
+
+	if (cpu >= 0)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 
 /* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
 {
-	struct rt_rq *rt_rq = arg;
-	struct rq *rq, *src_rq;
-	int this_cpu;
+	struct rq *rq;
 	int cpu;
 
-	this_cpu = rt_rq->push_cpu;
+	rq = this_rq();
 
-	/* Paranoid check */
-	BUG_ON(this_cpu != smp_processor_id());
-
-	rq = cpu_rq(this_cpu);
-	src_rq = rq_of_rt_rq(rt_rq);
-
-again:
+	/*
+	 * We do not need to grab the lock to check for has_pushable_tasks.
+	 * When it gets updated, a check is made if a push is possible.
+	 */
 	if (has_pushable_tasks(rq)) {
 		raw_spin_lock(&rq->lock);
-		push_rt_task(rq);
+		push_rt_tasks(rq);
 		raw_spin_unlock(&rq->lock);
 	}
 
-	/* Pass the IPI to the next rt overloaded queue */
-	raw_spin_lock(&rt_rq->push_lock);
-	/*
-	 * If the source queue changed since the IPI went out,
-	 * we need to restart the search from that CPU again.
-	 */
-	if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
-		rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
-		rt_rq->push_cpu = src_rq->cpu;
-	}
+	raw_spin_lock(&rq->rd->rto_lock);
 
-	cpu = find_next_push_cpu(src_rq);
+	/* Pass the IPI to the next rt overloaded queue */
+	cpu = rto_next_cpu(rq);
 
-	if (cpu >= nr_cpu_ids)
-		rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
-	raw_spin_unlock(&rt_rq->push_lock);
+	raw_spin_unlock(&rq->rd->rto_lock);
 
-	if (cpu >= nr_cpu_ids)
+	if (cpu < 0)
 		return;
 
-	/*
-	 * It is possible that a restart caused this CPU to be
-	 * chosen again. Don't bother with an IPI, just see if we
-	 * have more to push.
-	 */
-	if (unlikely(cpu == rq->cpu))
-		goto again;
-
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
-	struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
-	try_to_push_tasks(rt_rq);
+	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a81c978..8aa24b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -505,7 +505,7 @@ static inline int rt_bandwidth_enabled(void)
 }
 
 /* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
 # define HAVE_RT_PUSH_IPI
 #endif
 
@@ -527,12 +527,6 @@ struct rt_rq {
 	unsigned long rt_nr_total;
 	int overloaded;
 	struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
-	int push_flags;
-	int push_cpu;
-	struct irq_work push_work;
-	raw_spinlock_t push_lock;
-#endif
 #endif /* CONFIG_SMP */
 	int rt_queued;
 
@@ -641,6 +635,19 @@ struct root_domain {
 	struct dl_bw dl_bw;
 	struct cpudl cpudl;
 
+#ifdef HAVE_RT_PUSH_IPI
+	/*
+	 * For IPI pull requests, loop across the rto_mask.
+	 */
+	struct irq_work rto_push_work;
+	raw_spinlock_t rto_lock;
+	/* These are only updated and read within rto_lock */
+	int rto_loop;
+	int rto_cpu;
+	/* These atomics are updated outside of a lock */
+	atomic_t rto_loop_next;
+	atomic_t rto_loop_start;
+#endif
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
@@ -658,6 +665,9 @@ extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f51d123..e50450c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -268,6 +268,12 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+#ifdef HAVE_RT_PUSH_IPI
+	rd->rto_cpu = -1;
+	raw_spin_lock_init(&rd->rto_lock);
+	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
 	init_dl_bw(&rd->dl_bw);
 	if (cpudl_init(&rd->cpudl) != 0)
 		goto free_rto_mask;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2017-10-10 11:02 ` [tip:sched/core] sched/rt: Simplify the IPI based RT " tip-bot for Steven Rostedt (Red Hat)
@ 2018-01-19  9:23   ` Pavan Kondeti
  2018-01-19 15:03     ` Steven Rostedt
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-19  9:23 UTC (permalink / raw)
  To: williams, Steven Rostedt, Ingo Molnar, LKML, Peter Zijlstra,
	Thomas Gleixner, bristot, jkacur, efault, hpa, torvalds, swood
  Cc: linux-tip-commits, Pavan Kondeti

Hi Steven,

>  /* Called from hardirq context */
> -static void try_to_push_tasks(void *arg)
> +void rto_push_irq_work_func(struct irq_work *work)
>  {
> -       struct rt_rq *rt_rq = arg;
> -       struct rq *rq, *src_rq;
> -       int this_cpu;
> +       struct rq *rq;
>         int cpu;
>
> -       this_cpu = rt_rq->push_cpu;
> +       rq = this_rq();
>
> -       /* Paranoid check */
> -       BUG_ON(this_cpu != smp_processor_id());
> -
> -       rq = cpu_rq(this_cpu);
> -       src_rq = rq_of_rt_rq(rt_rq);
> -
> -again:
> +       /*
> +        * We do not need to grab the lock to check for has_pushable_tasks.
> +        * When it gets updated, a check is made if a push is possible.
> +        */
>         if (has_pushable_tasks(rq)) {
>                 raw_spin_lock(&rq->lock);
> -               push_rt_task(rq);
> +               push_rt_tasks(rq);
>                 raw_spin_unlock(&rq->lock);
>         }
>
> -       /* Pass the IPI to the next rt overloaded queue */
> -       raw_spin_lock(&rt_rq->push_lock);
> -       /*
> -        * If the source queue changed since the IPI went out,
> -        * we need to restart the search from that CPU again.
> -        */
> -       if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
> -               rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
> -               rt_rq->push_cpu = src_rq->cpu;
> -       }
> +       raw_spin_lock(&rq->rd->rto_lock);
>
> -       cpu = find_next_push_cpu(src_rq);
> +       /* Pass the IPI to the next rt overloaded queue */
> +       cpu = rto_next_cpu(rq);
>
> -       if (cpu >= nr_cpu_ids)
> -               rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
> -       raw_spin_unlock(&rt_rq->push_lock);
> +       raw_spin_unlock(&rq->rd->rto_lock);
>
> -       if (cpu >= nr_cpu_ids)
> +       if (cpu < 0)
>                 return;

I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
stable kernel based system. This issue is observed only after
inclusion of this patch. It appears to me that rq->rd can change
between spinlock is acquired and released in rto_push_irq_work_func()
IRQ work if hotplug is in progress. It was only reported couple of
times during long stress testing. The issue can be easily reproduced
if an artificial delay is introduced between lock and unlock of
rto_lock. The rq->rd is changed under rq->lock, so we can protect this
race with rq->lock. The below patch solved the problem. we are taking
rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
here. Please let me know your thoughts on this.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d863d39..478192b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
                raw_spin_unlock(&rq->lock);
        }

+       raw_spin_lock(&rq->lock);
        raw_spin_lock(&rq->rd->rto_lock);

        /* Pass the IPI to the next rt overloaded queue */
@@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)

        raw_spin_unlock(&rq->rd->rto_lock);

-       if (cpu < 0)
-               return;
-
        /* Try the next RT overloaded CPU */
-       irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+       if (cpu >= 0)
+               irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+       raw_spin_unlock(&rq->lock);
 }
 #endif /* HAVE_RT_PUSH_IPI */


Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19  9:23   ` Pavan Kondeti
@ 2018-01-19 15:03     ` Steven Rostedt
  2018-01-19 15:44       ` Pavan Kondeti
  2018-01-19 17:46       ` Pavan Kondeti
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() tip-bot for Steven Rostedt (VMware)
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs tip-bot for Steven Rostedt (VMware)
  2 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2018-01-19 15:03 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, 19 Jan 2018 14:53:05 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:

> I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> stable kernel based system. This issue is observed only after
> inclusion of this patch. It appears to me that rq->rd can change
> between spinlock is acquired and released in rto_push_irq_work_func()
> IRQ work if hotplug is in progress. It was only reported couple of
> times during long stress testing. The issue can be easily reproduced
> if an artificial delay is introduced between lock and unlock of
> rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> race with rq->lock. The below patch solved the problem. we are taking
> rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
> here. Please let me know your thoughts on this.

Ah, so rq->rd can change. Interesting.

> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d863d39..478192b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
>                 raw_spin_unlock(&rq->lock);
>         }
> 
> +       raw_spin_lock(&rq->lock);


What about just saving the rd then?

	struct root_domain *rd;

	rd = READ_ONCE(rq->rd);

then use that. Then we don't need to worry about it changing.
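
A rough sketch of what that suggestion could look like (illustrative only;
note that rto_next_cpu() would also need to take the cached rd rather than
derive it from rq, which is what the patch that was eventually merged does):

	/* Sketch of the READ_ONCE() idea, not the patch that was eventually merged. */
	void rto_push_irq_work_func(struct irq_work *work)
	{
		struct rq *rq = this_rq();
		/* Due to CPU hotplug, rq->rd could possibly change */
		struct root_domain *rd = READ_ONCE(rq->rd);
		int cpu;

		if (has_pushable_tasks(rq)) {
			raw_spin_lock(&rq->lock);
			push_rt_tasks(rq);
			raw_spin_unlock(&rq->lock);
		}

		raw_spin_lock(&rd->rto_lock);

		/* Pass the IPI to the next rt overloaded queue */
		cpu = rto_next_cpu(rq);		/* would need to take rd, not rq */

		raw_spin_unlock(&rd->rto_lock);

		if (cpu < 0)
			return;

		/* Try the next RT overloaded CPU */
		irq_work_queue_on(&rd->rto_push_work, cpu);
	}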

-- Steve


>         raw_spin_lock(&rq->rd->rto_lock);
> 
>         /* Pass the IPI to the next rt overloaded queue */
> @@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
> 
>         raw_spin_unlock(&rq->rd->rto_lock);
> 
> -       if (cpu < 0)
> -               return;
> -
>         /* Try the next RT overloaded CPU */
> -       irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +       if (cpu >= 0)
> +               irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +       raw_spin_unlock(&rq->lock);
>  }
>  #endif /* HAVE_RT_PUSH_IPI */
> 
> 
> Thanks,
> Pavan
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 15:03     ` Steven Rostedt
@ 2018-01-19 15:44       ` Pavan Kondeti
  2018-01-19 15:53         ` Steven Rostedt
  2018-01-19 17:46       ` Pavan Kondeti
  1 sibling, 1 reply; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-19 15:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

Hi Steven,

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <pkondeti@codeaurora.org> wrote:
> 
> > I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after
> > inclusion of this patch. It appears to me that rq->rd can change
> > between spinlock is acquired and released in rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported couple of
> > times during long stress testing. The issue can be easily reproduced
> > if an artificial delay is introduced between lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The below patch solved the problem. we are taking
> > rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
> > here. Please let me know your thoughts on this.
> 
> As so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> >                 raw_spin_unlock(&rq->lock);
> >         }
> > 
> > +       raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
> 	struct root_domain *rd;
> 
> 	rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

Yeah, it should work. I will give it a try and send the patch
for review.

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 15:44       ` Pavan Kondeti
@ 2018-01-19 15:53         ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2018-01-19 15:53 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, 19 Jan 2018 21:14:30 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:


> Yeah, it should work. I will give it a try and send the patch
> for review.

Add a comment above the assignment saying something to the effect:


	/* Due to CPU hotplug, rq->rd could possibly change */

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 15:03     ` Steven Rostedt
  2018-01-19 15:44       ` Pavan Kondeti
@ 2018-01-19 17:46       ` Pavan Kondeti
  2018-01-19 18:11         ` Steven Rostedt
  1 sibling, 1 reply; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-19 17:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <pkondeti@codeaurora.org> wrote:
> 
> > I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after
> > inclusion of this patch. It appears to me that rq->rd can change
> > between spinlock is acquired and released in rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported couple of
> > times during long stress testing. The issue can be easily reproduced
> > if an artificial delay is introduced between lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The below patch solved the problem. we are taking
> > rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
> > here. Please let me know your thoughts on this.
> 
> As so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> >                 raw_spin_unlock(&rq->lock);
> >         }
> > 
> > +       raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
> 	struct root_domain *rd;
> 
> 	rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

I am thinking of another problem because of the race between
rto_push_irq_work_func() and rq_attach_root() where rq->rd is modified.

Lets say, we cache the rq->rd here and queued the IRQ work on a remote
CPU. In the mean time, the rq_attach_root() might drop all the references
to this cached (old) rd and wants to free it. The rq->rd is freed in
RCU-sched callback. If that remote CPU is in RCU quiescent state, the rq->rd
can get freed before the IRQ work is executed. This results in the corruption
of the remote  CPU's IRQ work list. Right?

Taking rq->lock in rto_push_irq_work_func() also does not help here. Probably
we have to wait for the IRQ work to finish before freeing the older root domain
in RCU-sched callback.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 17:46       ` Pavan Kondeti
@ 2018-01-19 18:11         ` Steven Rostedt
  2018-01-19 18:12           ` Steven Rostedt
  2018-01-19 18:54           ` Pavan Kondeti
  0 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2018-01-19 18:11 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, 19 Jan 2018 23:16:17 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:

> I am thinking of another problem because of the race between
> rto_push_irq_work_func() and rq_attach_root() where rq->rd is modified.
> 
> Lets say, we cache the rq->rd here and queued the IRQ work on a remote
> CPU. In the mean time, the rq_attach_root() might drop all the references
> to this cached (old) rd and wants to free it. The rq->rd is freed in
> RCU-sched callback. If that remote CPU is in RCU quiescent state, the rq->rd
> can get freed before the IRQ work is executed. This results in the corruption
> of the remote  CPU's IRQ work list. Right?
> 
> Taking rq->lock in rto_push_irq_work_func() also does not help here. Probably
> we have to wait for the IRQ work to finish before freeing the older root domain
> in RCU-sched callback.

I was wondering about this too. Yeah, it would require an RCU-like
update. Once the rd was unreferenced, it would need to wait for the
irq works to finish before freeing it.

The easy way to do this is to simply up the refcount when sending the
domain. Something like this:

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..89a086ed2b16 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-	struct root_domain *rd = rq->rd;
 	int next;
 	int cpu;
 
@@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
 	 * Otherwise it is finishing up and an ipi needs to be sent.
 	 */
 	if (rq->rd->rto_cpu < 0)
-		cpu = rto_next_cpu(rq);
+		cpu = rto_next_cpu(rq->rd);
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
 	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+	struct root_domain *rd =
+		container_of(work, struct root_domain, rto_push_work);
 	struct rq *rq;
 	int cpu;
 
@@ -2013,18 +2017,20 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
 		return;
+	}
 
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..fb5fc458547f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..519b024f4e94 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 		call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
 	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

-- Steve

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 18:11         ` Steven Rostedt
@ 2018-01-19 18:12           ` Steven Rostedt
  2018-01-19 18:57             ` Pavan Kondeti
  2018-01-19 18:54           ` Pavan Kondeti
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2018-01-19 18:12 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, 19 Jan 2018 13:11:21 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

>  void rto_push_irq_work_func(struct irq_work *work)
>  {
> +	struct root_domain *rd =
> +		container_of(work, struct root_domain, rto_push_work);
>  	struct rq *rq;

Notice that I also remove the dependency on rq from getting the rd.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 18:11         ` Steven Rostedt
  2018-01-19 18:12           ` Steven Rostedt
@ 2018-01-19 18:54           ` Pavan Kondeti
  1 sibling, 0 replies; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-19 18:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Fri, Jan 19, 2018 at 01:11:21PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 23:16:17 +0530
> Pavan Kondeti <pkondeti@codeaurora.org> wrote:
> 
> > I am thinking of another problem because of the race between
> > rto_push_irq_work_func() and rq_attach_root() where rq->rd is modified.
> > 
> > Lets say, we cache the rq->rd here and queued the IRQ work on a remote
> > CPU. In the mean time, the rq_attach_root() might drop all the references
> > to this cached (old) rd and wants to free it. The rq->rd is freed in
> > RCU-sched callback. If that remote CPU is in RCU quiescent state, the rq->rd
> > can get freed before the IRQ work is executed. This results in the corruption
> > of the remote  CPU's IRQ work list. Right?
> > 
> > Taking rq->lock in rto_push_irq_work_func() also does not help here. Probably
> > we have to wait for the IRQ work to finish before freeing the older root domain
> > in RCU-sched callback.
> 
> I was wondering about this too. Yeah, it would require an RCU-like
> update. Once the rd was unreferenced, it would need to wait for the
> irq works to finish before freeing it.
> 
> The easy way to do this is to simply up the refcount when sending the
> domain. Something like this:
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513adca3..89a086ed2b16 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
>   * the rt_loop_next will cause the iterator to perform another scan.
>   *
>   */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
>  {
> -	struct root_domain *rd = rq->rd;
>  	int next;
>  	int cpu;
>  
> @@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
>  	 * Otherwise it is finishing up and an ipi needs to be sent.
>  	 */
>  	if (rq->rd->rto_cpu < 0)
> -		cpu = rto_next_cpu(rq);
> +		cpu = rto_next_cpu(rq->rd);
>  
>  	raw_spin_unlock(&rq->rd->rto_lock);
>  
>  	rto_start_unlock(&rq->rd->rto_loop_start);
>  
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
> +		/* Make sure the rd does not get freed while pushing */
> +		sched_get_rd(rq->rd);
>  		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	}
>  }

Since this is covered by rq->lock, it is guaranteed that we increment the
refcount on the older rd before RCU-sched callback is queued in
rq_attach_root(). Either we keep older rd alive or use the updated rd.

We are good here, I think.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 18:12           ` Steven Rostedt
@ 2018-01-19 18:57             ` Pavan Kondeti
  2018-01-19 19:51               ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-19 18:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

Hi Steve,

Thanks for the patch.

On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 13:11:21 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> >  void rto_push_irq_work_func(struct irq_work *work)
> >  {
> > +	struct root_domain *rd =
> > +		container_of(work, struct root_domain, rto_push_work);
> >  	struct rq *rq;
> 
> Notice that I also remove the dependency on rq from getting the rd.
> 

Nice. This snippet itself solves the original problem I reported.
I will test your patch and let you know the results.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 18:57             ` Pavan Kondeti
@ 2018-01-19 19:51               ` Steven Rostedt
  2018-01-20  4:56                 ` Pavan Kondeti
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2018-01-19 19:51 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

On Sat, 20 Jan 2018 00:27:56 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:

> Hi Steve,
> 
> Thanks for the patch.
> 
> On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > On Fri, 19 Jan 2018 13:11:21 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > >  void rto_push_irq_work_func(struct irq_work *work)
> > >  {
> > > +	struct root_domain *rd =
> > > +		container_of(work, struct root_domain, rto_push_work);
> > >  	struct rq *rq;  
> > 
> > Notice that I also remove the dependency on rq from getting the rd.
> >   
> 
> Nice. This snippet itself solves the original problem I reported.
> I will test your patch and let you know the results.
> 
>

I'll break the patch up into two then. One with this snippet, and the
other with the rd ref counting.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
  2018-01-19 19:51               ` Steven Rostedt
@ 2018-01-20  4:56                 ` Pavan Kondeti
  0 siblings, 0 replies; 30+ messages in thread
From: Pavan Kondeti @ 2018-01-20  4:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: williams, Ingo Molnar, LKML, Peter Zijlstra, Thomas Gleixner,
	bristot, jkacur, efault, hpa, torvalds, swood, linux-tip-commits

Hi Steve,

On Fri, Jan 19, 2018 at 02:51:15PM -0500, Steven Rostedt wrote:
> On Sat, 20 Jan 2018 00:27:56 +0530
> Pavan Kondeti <pkondeti@codeaurora.org> wrote:
> 
> > Hi Steve,
> > 
> > Thanks for the patch.
> > 
> > On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > > On Fri, 19 Jan 2018 13:11:21 -0500
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >   
> > > >  void rto_push_irq_work_func(struct irq_work *work)
> > > >  {
> > > > +	struct root_domain *rd =
> > > > +		container_of(work, struct root_domain, rto_push_work);
> > > >  	struct rq *rq;  
> > > 
> > > Notice that I also remove the dependency on rq from getting the rd.
> > >   
> > 
> > Nice. This snippet itself solves the original problem I reported.
> > I will test your patch and let you know the results.
> > 
> >
> 
> I'll break the patch up into two then. One with this snippet, and the
> other with the rd ref counting.
> 

Yeah, this snippet fixed the original problem.

I have not seen "use after free" of rd in my testing. But I can see
we are operating on a rd for which refcount is 0. After applying your
refcount patch, it never happened. I also verified that we are freeing
the rd via IRQ work by dropping the last reference.

Thanks for your help with the patches. Please copy linux-stable for
the 1st patch.

Feel free to use
Tested-by: Pavankumar Kondeti <pkondeti@codeaurora.org>

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
  2018-01-19  9:23   ` Pavan Kondeti
  2018-01-19 15:03     ` Steven Rostedt
@ 2018-02-06 11:54     ` tip-bot for Steven Rostedt (VMware)
  2018-02-07  4:14       ` Steven Rostedt
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs tip-bot for Steven Rostedt (VMware)
  2 siblings, 1 reply; 30+ messages in thread
From: tip-bot for Steven Rostedt (VMware) @ 2018-02-06 11:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pkondeti, efault, rostedt, linux-kernel, hpa, akpm, peterz, tglx,
	mingo, torvalds

Commit-ID:  ad0f1d9d65938aec72a698116cd73a980916895e
Gitweb:     https://git.kernel.org/tip/ad0f1d9d65938aec72a698116cd73a980916895e
Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
AuthorDate: Tue, 23 Jan 2018 20:45:37 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:33 +0100

sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()

When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.

Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:

 container_of(work, struct root_domain, rto_push_work)

This keeps the root domain consistent.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/rt.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513..2fb627d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-	struct root_domain *rd = rq->rd;
 	int next;
 	int cpu;
 
@@ -1985,7 +1984,7 @@ static void tell_cpu_to_push(struct rq *rq)
 	 * Otherwise it is finishing up and an ipi needs to be sent.
 	 */
 	if (rq->rd->rto_cpu < 0)
-		cpu = rto_next_cpu(rq);
+		cpu = rto_next_cpu(rq->rd);
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
@@ -1998,6 +1997,8 @@ static void tell_cpu_to_push(struct rq *rq)
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+	struct root_domain *rd =
+		container_of(work, struct root_domain, rto_push_work);
 	struct rq *rq;
 	int cpu;
 
@@ -2013,18 +2014,18 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
 	if (cpu < 0)
 		return;
 
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs
  2018-01-19  9:23   ` Pavan Kondeti
  2018-01-19 15:03     ` Steven Rostedt
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() tip-bot for Steven Rostedt (VMware)
@ 2018-02-06 11:54     ` tip-bot for Steven Rostedt (VMware)
  2018-02-07  4:15       ` Steven Rostedt
  2 siblings, 1 reply; 30+ messages in thread
From: tip-bot for Steven Rostedt (VMware) @ 2018-02-06 11:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pkondeti, efault, tglx, mingo, linux-kernel, peterz, rostedt,
	hpa, akpm, torvalds

Commit-ID:  364f56653708ba8bcdefd4f0da2a42904baa8eeb
Gitweb:     https://git.kernel.org/tip/364f56653708ba8bcdefd4f0da2a42904baa8eeb
Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
AuthorDate: Tue, 23 Jan 2018 20:45:38 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Feb 2018 10:20:33 +0100

sched/rt: Up the root domain ref count when passing it around via IPIs

When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
which contains all the CPUs within the root domain that have more than one RT
task in the runnable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated with
the run queue could be freed while the IPIs are going around.

Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/rt.c       |  9 +++++++--
 kernel/sched/sched.h    |  2 ++
 kernel/sched/topology.c | 13 +++++++++++++
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2fb627d..89a086e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1990,8 +1990,11 @@ static void tell_cpu_to_push(struct rq *rq)
 
 	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
@@ -2021,8 +2024,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 
 	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
 		return;
+	}
 
 	/* Try the next RT overloaded CPU */
 	irq_work_queue_on(&rd->rto_push_work, cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505..fb5fc45 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..519b024 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 		call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
 	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() tip-bot for Steven Rostedt (VMware)
@ 2018-02-07  4:14       ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2018-02-07  4:14 UTC (permalink / raw)
  To: tip-bot for Steven Rostedt (VMware)
  Cc: pkondeti, hpa, linux-kernel, rostedt, efault, peterz, akpm,
	torvalds, mingo, tglx, linux-tip-commits, stable


I see this was just applied to Linus's tree. It probably should be
tagged for stable as well.

-- Steve


On Tue, 6 Feb 2018 03:54:16 -0800
"tip-bot for Steven Rostedt (VMware)" <tipbot@zytor.com> wrote:

> Commit-ID:  ad0f1d9d65938aec72a698116cd73a980916895e
> Gitweb:     https://git.kernel.org/tip/ad0f1d9d65938aec72a698116cd73a980916895e
> Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
> AuthorDate: Tue, 23 Jan 2018 20:45:37 -0500
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 6 Feb 2018 10:20:33 +0100
> 
> sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
> 
> When the rto_push_irq_work_func() is called, it looks at the RT overloaded
> bitmask in the root domain via the runqueue (rq->rd). The problem is that
> during CPU up and down, nothing here stops rq->rd from changing between
> taking the rq->rd->rto_lock and releasing it. That means the lock that is
> released is not the same lock that was taken.
> 
> Instead of using this_rq()->rd to get the root domain, as the irq work is
> part of the root domain, we can simply get the root domain from the irq work
> that is passed to the routine:
> 
>  container_of(work, struct root_domain, rto_push_work)
> 
> This keeps the root domain consistent.
> 
> Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
> Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/rt.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513..2fb627d 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
>   * the rt_loop_next will cause the iterator to perform another scan.
>   *
>   */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
>  {
> -	struct root_domain *rd = rq->rd;
>  	int next;
>  	int cpu;
>  
> @@ -1985,7 +1984,7 @@ static void tell_cpu_to_push(struct rq *rq)
>  	 * Otherwise it is finishing up and an ipi needs to be sent.
>  	 */
>  	if (rq->rd->rto_cpu < 0)
> -		cpu = rto_next_cpu(rq);
> +		cpu = rto_next_cpu(rq->rd);
>  
>  	raw_spin_unlock(&rq->rd->rto_lock);
>  
> @@ -1998,6 +1997,8 @@ static void tell_cpu_to_push(struct rq *rq)
>  /* Called from hardirq context */
>  void rto_push_irq_work_func(struct irq_work *work)
>  {
> +	struct root_domain *rd =
> +		container_of(work, struct root_domain, rto_push_work);
>  	struct rq *rq;
>  	int cpu;
>  
> @@ -2013,18 +2014,18 @@ void rto_push_irq_work_func(struct irq_work *work)
>  		raw_spin_unlock(&rq->lock);
>  	}
>  
> -	raw_spin_lock(&rq->rd->rto_lock);
> +	raw_spin_lock(&rd->rto_lock);
>  
>  	/* Pass the IPI to the next rt overloaded queue */
> -	cpu = rto_next_cpu(rq);
> +	cpu = rto_next_cpu(rd);
>  
> -	raw_spin_unlock(&rq->rd->rto_lock);
> +	raw_spin_unlock(&rd->rto_lock);
>  
>  	if (cpu < 0)
>  		return;
>  
>  	/* Try the next RT overloaded CPU */
> -	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	irq_work_queue_on(&rd->rto_push_work, cpu);
>  }
>  #endif /* HAVE_RT_PUSH_IPI */
>  
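
For readers unfamiliar with the container_of() idiom relied on above: given a
pointer to a member embedded inside a larger structure, it returns a pointer
to the enclosing structure. Below is a minimal, self-contained userspace
sketch of the idea; the simplified macro and the *_stub types are illustrative
stand-ins, not the kernel's definitions.

#include <stddef.h>
#include <stdio.h>

/* Simplified version of the kernel macro: subtract the member's offset
 * from the member's address to get back to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct irq_work_stub {			/* stand-in for struct irq_work */
	int pending;
};

struct root_domain_stub {		/* stand-in for struct root_domain */
	int rto_cpu;
	struct irq_work_stub rto_push_work;
};

static void work_func(struct irq_work_stub *work)
{
	/* Recover the root domain that owns this work item directly,
	 * instead of going through a run queue whose ->rd may have
	 * changed since the work was queued. */
	struct root_domain_stub *rd =
		container_of(work, struct root_domain_stub, rto_push_work);

	printf("rto_cpu = %d\n", rd->rto_cpu);
}

int main(void)
{
	struct root_domain_stub rd = { .rto_cpu = 3 };

	work_func(&rd.rto_push_work);
	return 0;
}

With this change, rto_push_irq_work_func() no longer depends on this_rq()->rd
naming the same root domain that owns the queued work.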


* Re: [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs
  2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs tip-bot for Steven Rostedt (VMware)
@ 2018-02-07  4:15       ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2018-02-07  4:15 UTC (permalink / raw)
  To: tip-bot for Steven Rostedt (VMware)
  Cc: torvalds, akpm, hpa, rostedt, peterz, mingo, linux-kernel, tglx,
	efault, pkondeti, linux-tip-commits, stable


I see this was just applied to Linus's tree. This one should probably be
tagged for stable as well.

-- Steve


On Tue, 6 Feb 2018 03:54:42 -0800
"tip-bot for Steven Rostedt (VMware)" <tipbot@zytor.com> wrote:

> Commit-ID:  364f56653708ba8bcdefd4f0da2a42904baa8eeb
> Gitweb:     https://git.kernel.org/tip/364f56653708ba8bcdefd4f0da2a42904baa8eeb
> Author:     Steven Rostedt (VMware) <rostedt@goodmis.org>
> AuthorDate: Tue, 23 Jan 2018 20:45:38 -0500
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 6 Feb 2018 10:20:33 +0100
> 
> sched/rt: Up the root domain ref count when passing it around via IPIs
> 
> When issuing an IPI RT push, where an IPI is sent to each CPU that has more
> than one RT task scheduled on it, the push logic references the root domain's
> rto_mask, which contains all the CPUs within the root domain that have more
> than one RT task in the runnable state. The problem is that, after the IPIs
> are initiated, the rq->lock is released. This means that the root domain
> associated with the run queue could be freed while the IPIs are going around.
> 
> Add a sched_get_rd() and a sched_put_rd() that increment and decrement the
> root domain's ref count, respectively. This way, when initiating the IPIs,
> the scheduler ups the root domain's ref count before releasing the rq->lock,
> ensuring that the root domain does not go away until the IPI round is
> complete.
> 
> Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
> Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/rt.c       |  9 +++++++--
>  kernel/sched/sched.h    |  2 ++
>  kernel/sched/topology.c | 13 +++++++++++++
>  3 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 2fb627d..89a086e 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1990,8 +1990,11 @@ static void tell_cpu_to_push(struct rq *rq)
>  
>  	rto_start_unlock(&rq->rd->rto_loop_start);
>  
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
> +		/* Make sure the rd does not get freed while pushing */
> +		sched_get_rd(rq->rd);
>  		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	}
>  }
>  
>  /* Called from hardirq context */
> @@ -2021,8 +2024,10 @@ void rto_push_irq_work_func(struct irq_work *work)
>  
>  	raw_spin_unlock(&rd->rto_lock);
>  
> -	if (cpu < 0)
> +	if (cpu < 0) {
> +		sched_put_rd(rd);
>  		return;
> +	}
>  
>  	/* Try the next RT overloaded CPU */
>  	irq_work_queue_on(&rd->rto_push_work, cpu);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e95505..fb5fc45 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
>  extern void init_defrootdomain(void);
>  extern int sched_init_domains(const struct cpumask *cpu_map);
>  extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
> +extern void sched_get_rd(struct root_domain *rd);
> +extern void sched_put_rd(struct root_domain *rd);
>  
>  #ifdef HAVE_RT_PUSH_IPI
>  extern void rto_push_irq_work_func(struct irq_work *work);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 034cbed..519b024 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
>  		call_rcu_sched(&old_rd->rcu, free_rootdomain);
>  }
>  
> +void sched_get_rd(struct root_domain *rd)
> +{
> +	atomic_inc(&rd->refcount);
> +}
> +
> +void sched_put_rd(struct root_domain *rd)
> +{
> +	if (!atomic_dec_and_test(&rd->refcount))
> +		return;
> +
> +	call_rcu_sched(&rd->rcu, free_rootdomain);
> +}
> +
>  static int init_rootdomain(struct root_domain *rd)
>  {
>  	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))


Thread overview: 30+ messages
2017-04-24 15:47 [PATCH tip/sched/core v2] sched/rt: Simplify the IPI rt balancing logic Steven Rostedt
2017-05-04 14:41 ` Peter Zijlstra
2017-05-04 15:01   ` Steven Rostedt
2017-05-04 15:32 ` Peter Zijlstra
2017-05-04 17:25   ` Steven Rostedt
2017-05-04 18:42     ` Peter Zijlstra
2017-05-04 19:03       ` Steven Rostedt
2017-05-05  4:26         ` Mike Galbraith
2017-05-05  5:16           ` Mike Galbraith
2017-05-05 11:05         ` Peter Zijlstra
2017-05-05 12:02           ` Steven Rostedt
2017-05-05 17:39             ` Peter Zijlstra
2017-05-05 18:59               ` Steven Rostedt
2017-05-06  7:52                 ` Peter Zijlstra
2017-10-10 11:02 ` [tip:sched/core] sched/rt: Simplify the IPI based RT " tip-bot for Steven Rostedt (Red Hat)
2018-01-19  9:23   ` Pavan Kondeti
2018-01-19 15:03     ` Steven Rostedt
2018-01-19 15:44       ` Pavan Kondeti
2018-01-19 15:53         ` Steven Rostedt
2018-01-19 17:46       ` Pavan Kondeti
2018-01-19 18:11         ` Steven Rostedt
2018-01-19 18:12           ` Steven Rostedt
2018-01-19 18:57             ` Pavan Kondeti
2018-01-19 19:51               ` Steven Rostedt
2018-01-20  4:56                 ` Pavan Kondeti
2018-01-19 18:54           ` Pavan Kondeti
2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() tip-bot for Steven Rostedt (VMware)
2018-02-07  4:14       ` Steven Rostedt
2018-02-06 11:54     ` [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs tip-bot for Steven Rostedt (VMware)
2018-02-07  4:15       ` Steven Rostedt
