linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] RT: balance rt tasks enhancements v6
@ 2007-10-25 14:46 Gregory Haskins
  2007-10-25 14:46 ` [PATCH 1/3] RT: cleanup some push-rt logic Gregory Haskins
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 14:46 UTC (permalink / raw)
  To: linux-rt-users
  Cc: linux-kernel, Gregory Haskins, Steven Rostedt, Dmitry Adamushko,
	Peter Zijlstra, Ingo Molnar, Darren Hart

This is a mini-release of my series, rebased on -rt2.  I have more changes
downstream which are not quite ready for primetime, but I need to work on some
other unrelated issues right now and I wanted to get what works out there. 

Changes since v5

*) Rebased to -rt2 - many of the functions from the original series are now
 included in the base -rt tree, so they have been dropped from this series.



* [PATCH 1/3] RT: cleanup some push-rt logic
  2007-10-25 14:46 [PATCH 0/3] RT: balance rt tasks enhancements v6 Gregory Haskins
@ 2007-10-25 14:46 ` Gregory Haskins
  2007-10-25 14:46 ` [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration Gregory Haskins
  2007-10-25 14:46 ` [PATCH 3/3] RT: CPU priority management Gregory Haskins
  2 siblings, 0 replies; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 14:46 UTC (permalink / raw)
  To: linux-rt-users
  Cc: linux-kernel, Gregory Haskins, Steven Rostedt, Dmitry Adamushko,
	Peter Zijlstra, Ingo Molnar, Darren Hart

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 55da7d0..b59dc20 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -292,7 +292,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 {
 	struct rq *lowest_rq = NULL;
 	cpumask_t cpu_mask;
-	int dst_cpu = -1;
 	int cpu;
 	int tries;
 
@@ -311,14 +310,12 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			/* We look for lowest RT prio or non-rt CPU */
 			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
 				lowest_rq = rq;
-				dst_cpu = cpu;
 				break;
 			}
 
 			/* no locking for now */
 			if (rq->rt.highest_prio > task->prio &&
 			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				dst_cpu = cpu;
 				lowest_rq = rq;
 			}
 		}
@@ -335,7 +332,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			 * Also make sure that it wasn't scheduled on its rq.
 			 */
 			if (unlikely(task_rq(task) != this_rq ||
-				     !cpu_isset(dst_cpu, task->cpus_allowed) ||
+				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
 				     task_running(this_rq, task) ||
 				     !task->se.on_rq)) {
 				spin_unlock(&lowest_rq->lock);
@@ -365,7 +362,6 @@ static int push_rt_task(struct rq *this_rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int dst_cpu;
 	int ret = 0;
 	int paranoid = RT_MAX_TRIES;
 
@@ -412,12 +408,10 @@ static int push_rt_task(struct rq *this_rq)
 		goto out;
 	}
 
-	dst_cpu = lowest_rq->cpu;
-
 	assert_spin_locked(&lowest_rq->lock);
 
 	deactivate_task(this_rq, next_task, 0);
-	set_task_cpu(next_task, dst_cpu);
+	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 
 	resched_task(lowest_rq->curr);



* [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 14:46 [PATCH 0/3] RT: balance rt tasks enhancements v6 Gregory Haskins
  2007-10-25 14:46 ` [PATCH 1/3] RT: cleanup some push-rt logic Gregory Haskins
@ 2007-10-25 14:46 ` Gregory Haskins
  2007-10-25 15:48   ` Steven Rostedt
  2007-10-25 19:58   ` Steven Rostedt
  2007-10-25 14:46 ` [PATCH 3/3] RT: CPU priority management Gregory Haskins
  2 siblings, 2 replies; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 14:46 UTC (permalink / raw)
  To: linux-rt-users
  Cc: linux-kernel, Gregory Haskins, Steven Rostedt, Dmitry Adamushko,
	Peter Zijlstra, Ingo Molnar, Darren Hart

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR, which schedules softirq_timer to run at RT50.
Then, during the handling of the timer, the system determines that it's
time to smp-rebalance the system, so it schedules softirq_sched to run
from within the softirq_timer kthread context.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into the
rt-overload condition to request help from the other CPUs.

The problem is that these tasks cannot ever be pulled away since they
are already running on their one and only valid RQ.  However, the other
CPUs cannot determine that the tasks are unpullable without going
through expensive checks/locking.  Therefore the helping CPUs
experience unnecessary overhead/latencies regardless, as they
ineffectively try to process the overload condition.

This patch tries to optimize the situation by utilizing the Hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated, which the overload-handling code can use
to eliminate unnecessary rebalance attempts.  We also introduce a
per-rq variable to count the number of migratable tasks that are
currently queued.  We only go into overload if we have more than one
RT task AND at least one of them is migratable.

Calculating the weight is relatively expensive, so it is only done
when the cpus_allowed mask is updated (which should be infrequent,
especially compared to scheduling frequency), and the result is cached
in the task_struct.
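
To make the overload rule above concrete, here is a minimal standalone
sketch (illustration only; the "sketch" names are stand-ins, and the real
accounting is done by inc_rt_migration()/dec_rt_migration() in the patch
below):

/*
 * Simplified model: a runqueue is considered rt-overloaded only when
 * more than one RT task is queued on it AND at least one of those
 * tasks is allowed to run somewhere else (cached weight > 1).
 */
struct rt_rq_sketch {
	unsigned long rt_nr_running;	/* RT tasks queued on this rq */
	unsigned long rt_nr_migratory;	/* subset with nr_cpus_allowed > 1 */
};

static int rt_rq_overloaded_sketch(const struct rt_rq_sketch *rt)
{
	return rt->rt_nr_running > 1 && rt->rt_nr_migratory > 0;
}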


Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    2 +
 kernel/fork.c         |    1 
 kernel/sched.c        |    9 +++-
 kernel/sched_rt.c     |  116 +++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 107 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a3829f..829de6f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1048,6 +1048,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
+	void (*set_cpus_allowed)(struct task_struct *p, cpumask_t newmask);
 };
 
 struct load_weight {
@@ -1144,6 +1145,7 @@ struct task_struct {
 
 	unsigned int policy;
 	cpumask_t cpus_allowed;
+	int nr_cpus_allowed;
 	unsigned int time_slice;
 
 #ifdef CONFIG_PREEMPT_RCU
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f11f23..f808e18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1257,6 +1257,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	preempt_disable();
 	p->cpus_allowed = current->cpus_allowed;
+	p->nr_cpus_allowed = current->nr_cpus_allowed;
 	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
 			!cpu_online(task_cpu(p))))
 		set_task_cpu(p, smp_processor_id());
diff --git a/kernel/sched.c b/kernel/sched.c
index 30fa531..6c90093 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -262,6 +262,7 @@ struct rt_rq {
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
 	unsigned long rt_nr_running;
+	unsigned long rt_nr_migratory;
 	unsigned long rt_nr_uninterruptible;
 	/* highest queued rt task prio */
 	int highest_prio;
@@ -5371,7 +5372,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 		goto out;
 	}
 
-	p->cpus_allowed = new_mask;
+	if (p->sched_class->set_cpus_allowed)
+		p->sched_class->set_cpus_allowed(p, new_mask);
+	else {
+		p->cpus_allowed    = new_mask;
+		p->nr_cpus_allowed = cpus_weight(new_mask);
+	}
+
 	/* Can the task run on the task's current CPU? If so, we're done */
 	if (cpu_isset(task_cpu(p), new_mask))
 		goto out;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b59dc20..ad35c89 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -50,6 +50,24 @@ static inline void update_curr_rt(struct rq *rq)
 	curr->se.sum_exec_runtime += delta_exec;
 	curr->se.exec_start = rq->clock;
 }
+#ifdef CONFIG_SMP
+static void inc_rt_migration(struct task_struct *p, struct rq *rq)
+{
+	rq->rt.rt_nr_migratory++;
+
+	if (rq->rt.rt_nr_running > 1)
+		rt_set_overload(p, rq->cpu);
+}
+
+static void dec_rt_migration(struct task_struct *p, struct rq *rq)
+{
+	WARN_ON(!rq->rt.rt_nr_migratory);
+	rq->rt.rt_nr_migratory--;
+
+	if ((rq->rt.rt_nr_running < 2) || !rq->rt.rt_nr_migratory)
+		rt_clear_overload(p, rq->cpu);
+}
+#endif
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 {
@@ -58,8 +76,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
-	if (rq->rt.rt_nr_running > 1)
-		rt_set_overload(p, rq->cpu);
+	if (p->nr_cpus_allowed > 1)
+		inc_rt_migration(p, rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -81,8 +99,8 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
-	if (rq->rt.rt_nr_running < 2)
-		rt_clear_overload(p, rq->cpu);
+	if (p->nr_cpus_allowed > 1)
+		dec_rt_migration(p, rq);
 #endif /* CONFIG_SMP */
 }
 
@@ -286,6 +304,27 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
 	return next;
 }
 
+static int lock_migration_target(struct task_struct *p, struct rq *target_rq)
+{
+	struct rq *this_rq = task_rq(p);
+
+	if (double_lock_balance(this_rq, target_rq)) {
+		/*
+		 * We had to unlock the run queue. In
+		 * the mean time, task could have
+		 * migrated already or had its affinity changed.
+		 */
+		if (unlikely(task_running(this_rq, p) ||
+			     task_rq(p) != this_rq ||
+			     !cpu_isset(target_rq->cpu, p->cpus_allowed))) {
+			spin_unlock(&target_rq->lock);
+			return 0;
+		}
+	}	
+	
+	return 1;
+}
+
 /* Will lock the rq it finds */
 static struct rq *find_lock_lowest_rq(struct task_struct *task,
 				      struct rq *this_rq)
@@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 	int cpu;
 	int tries;
 
+ 	/*
+ 	 * We can optimize if the hamming weight of the cpus_allowed mask
+ 	 * is 1 because the task has nowhere to go but one CPU.  So don't
+ 	 * waste the time trying to find the lowest RQ in this case.
+ 	 */
+ 	if (task->nr_cpus_allowed == 1) {
+ 		/* If the task is already on the RQ, we are done */
+ 		if (cpu_isset(this_rq->cpu, task->cpus_allowed))
+ 			return NULL;
+ 
+ 		cpu = first_cpu(task->cpus_allowed);
+ 
+ 		lowest_rq = cpu_rq(cpu);
+ 		BUG_ON(this_rq == lowest_rq);
+ 
+ 		/* Otherwise, we can simply grab the new RQ */
+ 		if (lock_migration_target(task, lowest_rq))
+ 			return lowest_rq;
+ 		else
+ 			return NULL;
+ 	}
+
 	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
 
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
@@ -324,22 +385,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			break;
 
 		/* if the prio of this runqueue changed, try again */
-		if (double_lock_balance(this_rq, lowest_rq)) {
-			/*
-			 * We had to unlock the run queue. In
-			 * the mean time, task could have
-			 * migrated already or had its affinity changed.
-			 * Also make sure that it wasn't scheduled on its rq.
-			 */
-			if (unlikely(task_rq(task) != this_rq ||
-				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
-				     task_running(this_rq, task) ||
-				     !task->se.on_rq)) {
-				spin_unlock(&lowest_rq->lock);
-				lowest_rq = NULL;
-				break;
-			}
-		}
+ 		if (!lock_migration_target(task, lowest_rq))
+ 			return NULL;
 
 		/* If this rq is still suitable use it. */
 		if (lowest_rq->rt.highest_prio > task->prio)
@@ -658,6 +705,31 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
 	}
 }
 
+#ifdef CONFIG_SMP
+static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
+{
+	int weight = cpus_weight(new_mask);
+
+	BUG_ON(!rt_task(p));
+
+	/*
+	 * Update the migration status of the RQ if we have an RT task
+	 * which is running AND changing its weight value.
+	 */
+	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
+		struct rq *rq = task_rq(p);
+
+		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
+			inc_rt_migration(p, rq);
+		else if((p->nr_cpus_allowed > 1) && (weight <= 1))
+			dec_rt_migration(p, rq);
+	}
+
+	p->cpus_allowed    = new_mask;
+	p->nr_cpus_allowed = weight;
+}
+#endif
+
 static struct sched_class rt_sched_class __read_mostly = {
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
@@ -671,4 +743,8 @@ static struct sched_class rt_sched_class __read_mostly = {
 	.load_balance		= load_balance_rt,
 
 	.task_tick		= task_tick_rt,
+
+#ifdef CONFIG_SMP
+	.set_cpus_allowed       = set_cpus_allowed_rt,
+#endif
 };



* [PATCH 3/3] RT: CPU priority management
  2007-10-25 14:46 [PATCH 0/3] RT: balance rt tasks enhancements v6 Gregory Haskins
  2007-10-25 14:46 ` [PATCH 1/3] RT: cleanup some push-rt logic Gregory Haskins
  2007-10-25 14:46 ` [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration Gregory Haskins
@ 2007-10-25 14:46 ` Gregory Haskins
  2007-10-25 15:27   ` Steven Rostedt
  2 siblings, 1 reply; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 14:46 UTC (permalink / raw)
  To: linux-rt-users
  Cc: linux-kernel, Gregory Haskins, Steven Rostedt, Dmitry Adamushko,
	Peter Zijlstra, Ingo Molnar, Darren Hart

This code tracks the priority of each CPU so that global migration
decisions are easy to calculate.  Each CPU can be in a state as follows:

               (INVALID), IDLE, NORMAL, RT1, ... RT99

going from the lowest priority to the highest.  CPUs in the INVALID state
are not eligible for routing.  The system maintains this state with
a 2 dimensional bitmap (the first for priority class, the second for cpus
in that class).  Therefore a typical application without affinity
restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
searches).  For tasks with affinity restrictions, the algorithm has a
worst case complexity of O(min(102, NR_CPUS)), though the scenario that
yields the worst case search is fairly contrived.
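
As a worked illustration of the mapping described above (a sketch only:
it assumes MAX_PRIO == 140 and MAX_RT_PRIO == 100 as in this tree, the
"sketch" names are stand-ins, and the authoritative version is
convert_prio() in sched_cpupri.c below):

/*
 * Standalone sketch of the prio -> cpupri conversion.
 */
#define SKETCH_MAX_RT_PRIO	100
#define SKETCH_MAX_PRIO		140

#define SKETCH_CPUPRI_IDLE	(-1)
#define SKETCH_CPUPRI_NORMAL	0

static int sketch_convert_prio(int prio)
{
	if (prio == SKETCH_MAX_PRIO)
		return SKETCH_CPUPRI_IDLE;	/* the idle task */
	if (prio >= SKETCH_MAX_RT_PRIO)
		return SKETCH_CPUPRI_NORMAL;	/* any non-RT task */
	return SKETCH_MAX_RT_PRIO - prio;	/* RT: prio 99 -> 1, prio 0 -> 100 */
}

/*
 * The bitmap index is cpupri + 1, so the 102 slots run from IDLE
 * (index 0) and NORMAL (index 1) up to the highest RT priority
 * (index 101).  Finding a CPU for a task is then essentially two bit
 * searches: find_first_bit() over pri_active selects the lowest
 * occupied state, and cpus_and()/first_cpu() over that state's mask
 * and the task's cpus_allowed picks a CPU from it (preferring the
 * default and current CPUs when they qualify).
 */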

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/Makefile       |    2 
 kernel/sched.c        |    4 +
 kernel/sched_cpupri.c |  201 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_cpupri.h |   10 ++
 kernel/sched_rt.c     |   34 ++------
 5 files changed, 224 insertions(+), 27 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
-	    utsname.o
+	    utsname.o sched_cpupri.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 6c90093..acfc75d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -68,6 +68,8 @@
 
 #include <asm/tlb.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -6955,6 +6957,8 @@ void __init sched_init(void)
 	fair_sched_class.next = &idle_sched_class;
 	idle_sched_class.next = NULL;
 
+	cpupri_init();
+
 	for_each_possible_cpu(i) {
 		struct rt_prio_array *array;
 		struct rq *rq;
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 0000000..7a226cb
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,201 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins <ghaskins@novell.com>
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ *                 (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <asm/idle.h>
+
+#include "sched_cpupri.h"
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+
+#define CPUPRI_INVALID -2
+#define CPUPRI_IDLE    -1
+#define CPUPRI_NORMAL   0
+/* values 1-99 are RT priorities */
+
+struct cpu_priority {
+	raw_spinlock_t lock;
+	cpumask_t      pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	long           pri_active[CPUPRI_NR_PRIORITIES/BITS_PER_LONG];
+	int            cpu_to_pri[NR_CPUS];
+};
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+	
+	if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio;
+    
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                   \
+  for( idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cpu: The recommended/default CPU
+ * @task_pri: The priority of the task being scheduled (IDLE-RT99)
+ * @p: The task being scheduled
+ *
+ * Note: This function returns the recommended CPU as calculated during the
+ * current invokation.  By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times.  While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)cpu - The recommended cpu to accept the task
+ */
+int cpupri_find(int def_cpu, struct task_struct *p)
+{
+	int                  idx      = 0;
+	struct cpu_priority *cp       = &cpu_priority;
+	int                  this_cpu = smp_processor_id();
+	int                  cpu      = def_cpu;
+	int                  task_pri = convert_prio(p->prio);
+
+	for_each_cpupri_active(cp->pri_active, idx) {
+		cpumask_t mask;
+		int       lowest_pri = idx-1;
+		
+		if (lowest_pri >= task_pri)
+			break;
+		
+		cpus_and(mask, p->cpus_allowed, cp->pri_to_cpu[idx]);
+		
+		if (!cpus_empty(mask)) {
+			/*
+			 * We select a CPU in the following priority:
+			 *
+			 *       def_cpu, this_cpu, first_cpu
+			 *
+			 * for efficiency.  def_cpu preserves cache
+			 * affinity, and this_cpu is cheaper to preempt
+			 * (note that sometimes they are the same).
+			 * Finally, we will take whatever is available
+			 * if the first two don't pan out.
+			 */
+			if (cpu_isset(def_cpu, mask))
+				break;
+			
+			if (cpu_isset(this_cpu, mask)) {
+				cpu = this_cpu;
+				break;
+			}
+			
+			cpu = first_cpu(mask);
+			break;
+		}
+	}
+
+	return cpu;
+}
+
+/**
+ * cpupri_set - update the cpu priority setting
+ * @cpu: The target cpu
+ * @pri: The priority (INVALID-RT99) to assign to this CPU
+ *
+ * Note: Assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpupri_set(int cpu, int newpri)
+{
+	struct cpu_priority *cp      = &cpu_priority;
+	int                 *currpri = &cp->cpu_to_pri[cpu];
+	int                  oldpri  = *currpri;
+	unsigned long        flags;
+
+	newpri = convert_prio(newpri);
+
+	if (newpri == oldpri)
+		return;
+
+	spin_lock_irqsave(&cp->lock, flags);
+	
+	/*
+	 * If the cpu was currently mapped to a different value, we
+	 * first need to unmap the old value
+	 */
+	if (likely(oldpri != CPUPRI_INVALID)) {
+		int        idx  = oldpri+1;
+		cpumask_t *mask = &cp->pri_to_cpu[idx];
+		
+		cpu_clear(cpu, *mask);
+		if (cpus_empty(*mask))
+			__clear_bit(idx, cp->pri_active);
+	}
+	
+	if (likely(newpri != CPUPRI_INVALID)) {
+		int        idx  = newpri+1;
+		cpumask_t *mask = &cp->pri_to_cpu[idx];
+		
+		cpu_set(cpu, *mask);
+		__set_bit(idx, cp->pri_active);
+	}
+
+	spin_unlock_irqrestore(&cp->lock, flags);
+	
+	*currpri = newpri;
+}
+
+/**
+ * cpupri_init - initialize the cpupri subsystem
+ *
+ * This must be called during the scheduler initialization before the 
+ * other methods may be used.
+ *
+ * Returns: (void)
+ */
+void cpupri_init(void)
+{
+	struct cpu_priority *cp = &cpu_priority;
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+
+	spin_lock_init(&cp->lock);
+
+	for_each_possible_cpu(i) {
+		cp->cpu_to_pri[i] = CPUPRI_INVALID;
+	}
+}
+
+
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
new file mode 100644
index 0000000..a58a4e8
--- /dev/null
+++ b/kernel/sched_cpupri.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_CPUPRI_H
+#define _LINUX_CPUPRI_H
+
+#include <linux/sched.h>
+
+int  cpupri_find(int cpu, struct task_struct *p);
+void cpupri_set(int cpu, int pri);
+void cpupri_init(void);
+
+#endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ad35c89..94ba496 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -74,8 +74,10 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 	WARN_ON(!rt_task(p));
 	rq->rt.rt_nr_running++;
 #ifdef CONFIG_SMP
-	if (p->prio < rq->rt.highest_prio)
+	if (p->prio < rq->rt.highest_prio) {
 		rq->rt.highest_prio = p->prio;
+		cpupri_set(rq->cpu, p->prio);
+	}
 	if (p->nr_cpus_allowed > 1)
 		inc_rt_migration(p, rq);
 #endif /* CONFIG_SMP */
@@ -96,6 +98,7 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 			array = &rq->rt.active;
 			rq->rt.highest_prio =
 				sched_find_first_bit(array->bitmap);
+			cpupri_set(rq->cpu, rq->rt.highest_prio);
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
@@ -330,7 +333,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 				      struct rq *this_rq)
 {
 	struct rq *lowest_rq = NULL;
-	cpumask_t cpu_mask;
 	int cpu;
 	int tries;
 
@@ -356,34 +358,14 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
  			return NULL;
  	}
 
-	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
-
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
-		/*
-		 * Scan each rq for the lowest prio.
-		 */
-		for_each_cpu_mask(cpu, cpu_mask) {
-			struct rq *rq = &per_cpu(runqueues, cpu);
-
-			if (cpu == this_rq->cpu)
-				continue;
+		cpu = cpupri_find(this_rq->cpu, task);
 
-			/* We look for lowest RT prio or non-rt CPU */
-			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
-				lowest_rq = rq;
-				break;
-			}
-
-			/* no locking for now */
-			if (rq->rt.highest_prio > task->prio &&
-			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				lowest_rq = rq;
-			}
-		}
-
-		if (!lowest_rq)
+		if (cpu == this_rq->cpu)
 			break;
 
+		lowest_rq = cpu_rq(cpu);
+
 		/* if the prio of this runqueue changed, try again */
  		if (!lock_migration_target(task, lowest_rq))
  			return NULL;



* Re: [PATCH 3/3] RT: CPU priority management
  2007-10-25 14:46 ` [PATCH 3/3] RT: CPU priority management Gregory Haskins
@ 2007-10-25 15:27   ` Steven Rostedt
  2007-10-25 17:26     ` Gregory Haskins
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2007-10-25 15:27 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--

> +
> +struct cpu_priority {
> +	raw_spinlock_t lock;
> +	cpumask_t      pri_to_cpu[CPUPRI_NR_PRIORITIES];
> +	long           pri_active[CPUPRI_NR_PRIORITIES/BITS_PER_LONG];
> +	int            cpu_to_pri[NR_CPUS];
> +};
> +
> +static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
> +

[...]

> +int cpupri_find(int def_cpu, struct task_struct *p)
> +{
> +	int                  idx      = 0;
> +	struct cpu_priority *cp       = &cpu_priority;
> +	int                  this_cpu = smp_processor_id();
> +	int                  cpu      = def_cpu;
> +	int                  task_pri = convert_prio(p->prio);
> +
> +	for_each_cpupri_active(cp->pri_active, idx) {

[...]

> +void cpupri_set(int cpu, int newpri)
> +{
> +	struct cpu_priority *cp      = &cpu_priority;
> +	int                 *currpri = &cp->cpu_to_pri[cpu];
> +	int                  oldpri  = *currpri;
> +	unsigned long        flags;
> +
> +	newpri = convert_prio(newpri);
> +
> +	if (newpri == oldpri)
> +		return;
> +
> +	spin_lock_irqsave(&cp->lock, flags);

The cpu_priority and the cp->lock will be absolutely horrible for
cacheline bouncing.  Ironically, this will kill performance for the very
machines this code is to help with.  The larger the number of CPUs you
have the more cacheline bouncing this code will create.

I still don't see the benefit from the cpupri code.

-- Steve



* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 14:46 ` [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration Gregory Haskins
@ 2007-10-25 15:48   ` Steven Rostedt
  2007-10-25 18:51     ` Gregory Haskins
  2007-10-25 19:58   ` Steven Rostedt
  1 sibling, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2007-10-25 15:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--
On Thu, 25 Oct 2007, Gregory Haskins wrote:

> Some RT tasks (particularly kthreads) are bound to one specific CPU.
> It is fairly common for one or more bound tasks to get queued up at the
> same time.  Consider, for instance, softirq_timer and softirq_sched.  A
> timer goes off in an ISR which schedules softirq_thread to run at RT50.
> Then during the handling of the timer, the system determines that it's
> time to smp-rebalance the system so it schedules softirq_sched to run
> from within the softirq_timer kthread context. So we are in a situation
> where we have two RT50 tasks queued, and the system will go into
> rt-overload condition to request other CPUs for help.
>
> The problem is that these tasks cannot ever be pulled away since they
> are already running on their one and only valid RQ.  However, the other
> CPUs cannot determine that the tasks are unpullable without going
> through expensive checks/locking.  Therefore the helping CPUS
> experience unecessary overhead/latencies regardless as they
> ineffectively try to process the overload condition.
>
> This patch tries to optimize the situation by utilizing the hamming
> weight of the task->cpus_allowed mask.  A weight of 1 indicates that
> the task cannot be migrated, which may be utilized by the overload
> handling code to eliminate uncessary rebalance attempts.  We also
> introduce a per-rq variable to count the number of migratable tasks
> that are currently running.  We only go into overload if we have more
> than one rt task, AND at least one of them is migratable.
>
> Calculating the weight is probably relatively expensive, so it is only
> done when the cpus_allowed mask is updated (which should be relatively
> infrequent, especially compared to scheduling frequency) and cached in
> the task_struct.
>
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
>  include/linux/sched.h |    2 +
>  kernel/fork.c         |    1
>  kernel/sched.c        |    9 +++-
>  kernel/sched_rt.c     |  116 +++++++++++++++++++++++++++++++++++++++++--------
>  4 files changed, 107 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7a3829f..829de6f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1048,6 +1048,7 @@ struct sched_class {
>  	void (*set_curr_task) (struct rq *rq);
>  	void (*task_tick) (struct rq *rq, struct task_struct *p);
>  	void (*task_new) (struct rq *rq, struct task_struct *p);
> +	void (*set_cpus_allowed)(struct task_struct *p, cpumask_t newmask);
>  };
>
>  struct load_weight {
> @@ -1144,6 +1145,7 @@ struct task_struct {
>
>  	unsigned int policy;
>  	cpumask_t cpus_allowed;
> +	int nr_cpus_allowed;
>  	unsigned int time_slice;
>
>  #ifdef CONFIG_PREEMPT_RCU
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5f11f23..f808e18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1257,6 +1257,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	 */
>  	preempt_disable();
>  	p->cpus_allowed = current->cpus_allowed;
> +	p->nr_cpus_allowed = current->nr_cpus_allowed;
>  	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
>  			!cpu_online(task_cpu(p))))
>  		set_task_cpu(p, smp_processor_id());
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 30fa531..6c90093 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -262,6 +262,7 @@ struct rt_rq {
>  	int rt_load_balance_idx;
>  	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
>  	unsigned long rt_nr_running;
> +	unsigned long rt_nr_migratory;
>  	unsigned long rt_nr_uninterruptible;
>  	/* highest queued rt task prio */
>  	int highest_prio;
> @@ -5371,7 +5372,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
>  		goto out;
>  	}
>
> -	p->cpus_allowed = new_mask;
> +	if (p->sched_class->set_cpus_allowed)

Not sure we need this optimization (not doing the nr_cpus_allowed
calculation).  Due to priority boosting, we will need to calculate it
then anyway, so calculating it here is better.

> +		p->sched_class->set_cpus_allowed(p, new_mask);
> +	else {
> +		p->cpus_allowed    = new_mask;
> +		p->nr_cpus_allowed = cpus_weight(new_mask);
> +	}
> +
>  	/* Can the task run on the task's current CPU? If so, we're done */
>  	if (cpu_isset(task_cpu(p), new_mask))
>  		goto out;
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index b59dc20..ad35c89 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -50,6 +50,24 @@ static inline void update_curr_rt(struct rq *rq)
>  	curr->se.sum_exec_runtime += delta_exec;
>  	curr->se.exec_start = rq->clock;
>  }
> +#ifdef CONFIG_SMP
> +static void inc_rt_migration(struct task_struct *p, struct rq *rq)
> +{
> +	rq->rt.rt_nr_migratory++;
> +
> +	if (rq->rt.rt_nr_running > 1)
> +		rt_set_overload(p, rq->cpu);
> +}
> +
> +static void dec_rt_migration(struct task_struct *p, struct rq *rq)
> +{
> +	WARN_ON(!rq->rt.rt_nr_migratory);
> +	rq->rt.rt_nr_migratory--;
> +
> +	if ((rq->rt.rt_nr_running < 2) || !rq->rt.rt_nr_migratory)
> +		rt_clear_overload(p, rq->cpu);
> +}
> +#endif
>
>  static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
>  {
> @@ -58,8 +76,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
>  #ifdef CONFIG_SMP
>  	if (p->prio < rq->rt.highest_prio)
>  		rq->rt.highest_prio = p->prio;
> -	if (rq->rt.rt_nr_running > 1)
> -		rt_set_overload(p, rq->cpu);
> +	if (p->nr_cpus_allowed > 1)
> +		inc_rt_migration(p, rq);
>  #endif /* CONFIG_SMP */
>  }
>
> @@ -81,8 +99,8 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
>  		} /* otherwise leave rq->highest prio alone */
>  	} else
>  		rq->rt.highest_prio = MAX_RT_PRIO;
> -	if (rq->rt.rt_nr_running < 2)
> -		rt_clear_overload(p, rq->cpu);
> +	if (p->nr_cpus_allowed > 1)
> +		dec_rt_migration(p, rq);
>  #endif /* CONFIG_SMP */
>  }
>
> @@ -286,6 +304,27 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
>  	return next;
>  }
>
> +static int lock_migration_target(struct task_struct *p, struct rq *target_rq)
> +{
> +	struct rq *this_rq = task_rq(p);
> +
> +	if (double_lock_balance(this_rq, target_rq)) {
> +		/*
> +		 * We had to unlock the run queue. In
> +		 * the mean time, task could have
> +		 * migrated already or had its affinity changed.
> +		 */
> +		if (unlikely(task_running(this_rq, p) ||
> +			     task_rq(p) != this_rq ||
> +			     !cpu_isset(target_rq->cpu, p->cpus_allowed))) {
> +			spin_unlock(&target_rq->lock);
> +			return 0;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
>  /* Will lock the rq it finds */
>  static struct rq *find_lock_lowest_rq(struct task_struct *task,
>  				      struct rq *this_rq)
> @@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
>  	int cpu;
>  	int tries;
>
> + 	/*
> + 	 * We can optimize if the hamming weight of the cpus_allowed mask
> + 	 * is 1 because the task has nowhere to go but one CPU.  So don't
> + 	 * waste the time trying to find the lowest RQ in this case.
> + 	 */

This code should be in pick_next_highest_task_rt() and not here.

> + 	if (task->nr_cpus_allowed == 1) {
> + 		/* If the task is already on the RQ, we are done */
> + 		if (cpu_isset(this_rq->cpu, task->cpus_allowed))
> + 			return NULL;
> +
> + 		cpu = first_cpu(task->cpus_allowed);
> +
> + 		lowest_rq = cpu_rq(cpu);
> + 		BUG_ON(this_rq == lowest_rq);
> +
> + 		/* Otherwise, we can simply grab the new RQ */
> + 		if (lock_migration_target(task, lowest_rq))
> + 			return lowest_rq;
> + 		else
> + 			return NULL;
> + 	}
> +
>  	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
>
>  	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> @@ -324,22 +385,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
>  			break;
>
>  		/* if the prio of this runqueue changed, try again */
> -		if (double_lock_balance(this_rq, lowest_rq)) {
> -			/*
> -			 * We had to unlock the run queue. In
> -			 * the mean time, task could have
> -			 * migrated already or had its affinity changed.
> -			 * Also make sure that it wasn't scheduled on its rq.
> -			 */
> -			if (unlikely(task_rq(task) != this_rq ||
> -				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
> -				     task_running(this_rq, task) ||
> -				     !task->se.on_rq)) {
> -				spin_unlock(&lowest_rq->lock);
> -				lowest_rq = NULL;
> -				break;
> -			}
> -		}
> + 		if (!lock_migration_target(task, lowest_rq))
> + 			return NULL;

I don't like this encapsulating of the double_lock_balance. There's a
reason I kept it out like this: mainly because this is the source of most
(if not all) of the race conditions I fought, so this is a very
sensitive area to touch. In fact, I see a few races already introduced by
this patch. One is that you took out the "!task->se.on_rq" test, which
means that a task could have run and deactivated itself
(TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that
race ;-)

So keep that code out in the open where it belongs. It's very sensitive.


>
>  		/* If this rq is still suitable use it. */
>  		if (lowest_rq->rt.highest_prio > task->prio)
> @@ -658,6 +705,31 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
>  	}
>  }
>
> +#ifdef CONFIG_SMP
> +static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
> +{
> +	int weight = cpus_weight(new_mask);
> +
> +	BUG_ON(!rt_task(p));
> +
> +	/*
> +	 * Update the migration status of the RQ if we have an RT task
> +	 * which is running AND changing its weight value.
> +	 */
> +	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
> +		struct rq *rq = task_rq(p);
> +
> +		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
> +			inc_rt_migration(p, rq);
> +		else if((p->nr_cpus_allowed > 1) && (weight <= 1))
> +			dec_rt_migration(p, rq);
> +	}
> +
> +	p->cpus_allowed    = new_mask;
> +	p->nr_cpus_allowed = weight;
> +}
> +#endif
> +
>  static struct sched_class rt_sched_class __read_mostly = {
>  	.enqueue_task		= enqueue_task_rt,
>  	.dequeue_task		= dequeue_task_rt,
> @@ -671,4 +743,8 @@ static struct sched_class rt_sched_class __read_mostly = {
>  	.load_balance		= load_balance_rt,
>
>  	.task_tick		= task_tick_rt,
> +
> +#ifdef CONFIG_SMP
> +	.set_cpus_allowed       = set_cpus_allowed_rt,
> +#endif

Hmm, I must have missed where you update the mask at time of boosting.
Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks.
Now you can have a handler to call for when a task changes class and
changes weight.

-- Steve



* Re: [PATCH 3/3] RT: CPU priority management
  2007-10-25 15:27   ` Steven Rostedt
@ 2007-10-25 17:26     ` Gregory Haskins
  2007-10-25 17:36       ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


Oh crap.  I just realized this is an older version of the patch... must've
forgotten to refresh... grr.  I'll send out the refreshed one.

But anyway, I digress.  

On Thu, 2007-10-25 at 11:27 -0400, Steven Rostedt wrote:

> 
> The cpu_priority and the cp->lock will be aboslutely horrible for
> cacheline bouncing.

In the form presented here in this email, perhaps.  I think you will see
some significant improvements in the refreshed version.  The big change
is that the global lock is gone.

>   Ironically, this will kill performance for the very
> machines this code is to help with.  The larger the number of CPUs you
> have the more cacheline bouncing this code will create.

Don't forget: the same is precisely true for the current -rt2
algorithm.  For instance, the -rt2 algorithm, aside from being linear in
general, scales cacheline bouncing linearly as well.  Each CPU is going
to thrash rq->rt.highest_prio, and then we will walk them for each scan.

The fact is, you can't maintain a global dynamic policy without bouncing
cachelines, period.  But hopefully we can minimize it, and I just want
to see the fastest code here.

> 
> I still don't see the benefit from the cpupri code.

I still owe you timing data, but at this juncture I think I can beat
linear (especially as we start throwing in big iron) ;)  I originally
got involved in this scheduler rework from observations of poor scaling
on our 8/16-ways, so I want it to scale as much as you do ;)  If this alg
doesn't pan out, that's cool.  But I think it will in the end.  Linear
algs in the fast path just make my skin crawl.  Perhaps it will still be
the best design in the end, but I am not giving up that easily until I
prove it to myself.

Regards,
-Greg



* Re: [PATCH 3/3] RT: CPU priority management
  2007-10-25 17:26     ` Gregory Haskins
@ 2007-10-25 17:36       ` Steven Rostedt
  2007-10-25 17:55         ` Gregory Haskins
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2007-10-25 17:36 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--
On Thu, 25 Oct 2007, Gregory Haskins wrote:

> Oh crap.  I just realized this is an older version of the patch..mustv'e
> forgot to refresh...grr.  Ill send out the refreshed one.

Been there, done that ;-)

>
> But anyway, I digress.
>
> On Thu, 2007-10-25 at 11:27 -0400, Steven Rostedt wrote:
>
> >
> > The cpu_priority and the cp->lock will be aboslutely horrible for
> > cacheline bouncing.
>
> In the form presented here in this email, perhaps.  I think you will see
> some significant improvements in the refreshed version.  The big change
> is that the global lock is gone.
>
> >   Ironically, this will kill performance for the very
> > machines this code is to help with.  The larger the number of CPUs you
> > have the more cacheline bouncing this code will create.
>
> Don't forget:  The same is precisely true for the current -rt2
> algorithm.  For instance, the -rt2 algorithm aside from being linear in
> general, scales cacheline bouncing linearly as well.  Each cpu is going
> to trash rq->rt.highest_prio and then we will walk them for each scan.

Yep, -rt2 (and -rt3) are both horrible too. That's why I'm working on a
sched-domain version now to handle that.

>
> The fact is, you can't maintain a global dynamic policy without bouncing
> cachelines, period.  But hopefully we can minimize it, and I just want
> to see the fastest code here.

Well, we need a balance between cacheline hits and where to pull from.
As performance goes, the two are inversely proportional. The secret is to
find a nice middle ground.

>
> >
> > I still don't see the benefit from the cpupri code.
>
> I still owe you timing data, but at this juncture I think I can beat
> linear (especially as we start throwing in big-iron) ;)  I originally
> got involved in this scheduler rework from observations of poor scaling
> on our 8/16-ways, so I want it to scale as much as you ;)  If this alg
> doesn't pan out, that's cool.  But I think it will in the end.  Linear
> algs in the fast path just make my skin crawl.  Perhaps it will still be
> the best design in the end, but I am not giving up that easy until I
> prove it to myself.

It's only linear with respect to the size of the domain or cpus_allowed. Any
8/16-way boxes worried about cacheline bouncing need to rein in the
cpus_allowed mask. Otherwise, we would simply have bouncing anyway with
the tasks themselves.

-- Steve



* Re: [PATCH 3/3] RT: CPU priority management
  2007-10-25 17:36       ` Steven Rostedt
@ 2007-10-25 17:55         ` Gregory Haskins
  0 siblings, 0 replies; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 17:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


On Thu, 2007-10-25 at 13:36 -0400, Steven Rostedt wrote:


> 
> Yep, -rt2 (and -rt3) are both horrible too. That's why I'm working on a
> sched-domain version now to handle that.

Excellent.  I'm not 100% sure I've got the "mingo lingo" ;) down enough
to know if sched_domains are the best fit, but I think in concept that
makes a lot of sense (as we have discussed on IRC).  I am thinking that
we want to apply the "cpuset" mentality to the rt.highest_prio/cpupri
stuff, just like Ingo suggested for the rto masks.  If sched_domains
are those beasts, then great.

> 
> >
> > The fact is, you can't maintain a global dynamic policy without bouncing
> > cachelines, period.  But hopefully we can minimize it, and I just want
> > to see the fastest code here.
> 
> Well, we need a balance between cacheline hits, and where to pull from.
> As performance goes, the two are inverse perportional. The secret is to
> find a nice middle ground.

Fully agree.

> 
> >
> > >
> > > I still don't see the benefit from the cpupri code.
> >
> > I still owe you timing data, but at this juncture I think I can beat
> > linear (especially as we start throwing in big-iron) ;)  I originally
> > got involved in this scheduler rework from observations of poor scaling
> > on our 8/16-ways, so I want it to scale as much as you ;)  If this alg
> > doesn't pan out, that's cool.  But I think it will in the end.  Linear
> > algs in the fast path just make my skin crawl.  Perhaps it will still be
> > the best design in the end, but I am not giving up that easy until I
> > prove it to myself.
> 
> It's only linear in respect to the size of the domain or cpus allowed.

True, but as of now it's a single domain ;)  And even so, I still think
a cpupri-like algorithm can be potentially useful even if we utilize
finer-grained domains.  That is, unless the membership becomes
sufficiently small (maybe < 4 CPUs?) that the lower-overhead linear
search will just always trounce the larger overhead of cpupri_set().

>  Any
> 8/16 way boxes worried about cacheline bouncing need to reign in on the
> cpus allowed mask. Otherwise, we would simply have bouncing anyway with
> the tasks themselves.

Agreed.   Looking forward to reviewing your work here. :)

Regards,
-Greg



* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 15:48   ` Steven Rostedt
@ 2007-10-25 18:51     ` Gregory Haskins
  2007-10-25 19:52       ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 18:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


On Thu, 2007-10-25 at 11:48 -0400, Steven Rostedt wrote:
> --
> On Thu, 25 Oct 2007, Gregory Haskins wrote:
> 
> > Some RT tasks (particularly kthreads) are bound to one specific CPU.
> > It is fairly common for one or more bound tasks to get queued up at the
> > same time.  Consider, for instance, softirq_timer and softirq_sched.  A
> > timer goes off in an ISR which schedules softirq_thread to run at RT50.
> > Then during the handling of the timer, the system determines that it's
> > time to smp-rebalance the system so it schedules softirq_sched to run
> > from within the softirq_timer kthread context. So we are in a situation
> > where we have two RT50 tasks queued, and the system will go into
> > rt-overload condition to request other CPUs for help.
> >
> > The problem is that these tasks cannot ever be pulled away since they
> > are already running on their one and only valid RQ.  However, the other
> > CPUs cannot determine that the tasks are unpullable without going
> > through expensive checks/locking.  Therefore the helping CPUS
> > experience unecessary overhead/latencies regardless as they
> > ineffectively try to process the overload condition.
> >
> > This patch tries to optimize the situation by utilizing the hamming
> > weight of the task->cpus_allowed mask.  A weight of 1 indicates that
> > the task cannot be migrated, which may be utilized by the overload
> > handling code to eliminate uncessary rebalance attempts.  We also
> > introduce a per-rq variable to count the number of migratable tasks
> > that are currently running.  We only go into overload if we have more
> > than one rt task, AND at least one of them is migratable.
> >
> > Calculating the weight is probably relatively expensive, so it is only
> > done when the cpus_allowed mask is updated (which should be relatively
> > infrequent, especially compared to scheduling frequency) and cached in
> > the task_struct.
> >
> >
> > Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> > ---
> >
> >  include/linux/sched.h |    2 +
> >  kernel/fork.c         |    1
> >  kernel/sched.c        |    9 +++-
> >  kernel/sched_rt.c     |  116 +++++++++++++++++++++++++++++++++++++++++--------
> >  4 files changed, 107 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 7a3829f..829de6f 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1048,6 +1048,7 @@ struct sched_class {
> >  	void (*set_curr_task) (struct rq *rq);
> >  	void (*task_tick) (struct rq *rq, struct task_struct *p);
> >  	void (*task_new) (struct rq *rq, struct task_struct *p);
> > +	void (*set_cpus_allowed)(struct task_struct *p, cpumask_t newmask);
> >  };
> >
> >  struct load_weight {
> > @@ -1144,6 +1145,7 @@ struct task_struct {
> >
> >  	unsigned int policy;
> >  	cpumask_t cpus_allowed;
> > +	int nr_cpus_allowed;
> >  	unsigned int time_slice;
> >
> >  #ifdef CONFIG_PREEMPT_RCU
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5f11f23..f808e18 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1257,6 +1257,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  	 */
> >  	preempt_disable();
> >  	p->cpus_allowed = current->cpus_allowed;
> > +	p->nr_cpus_allowed = current->nr_cpus_allowed;
> >  	if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
> >  			!cpu_online(task_cpu(p))))
> >  		set_task_cpu(p, smp_processor_id());
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 30fa531..6c90093 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -262,6 +262,7 @@ struct rt_rq {
> >  	int rt_load_balance_idx;
> >  	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
> >  	unsigned long rt_nr_running;
> > +	unsigned long rt_nr_migratory;
> >  	unsigned long rt_nr_uninterruptible;
> >  	/* highest queued rt task prio */
> >  	int highest_prio;
> > @@ -5371,7 +5372,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
> >  		goto out;
> >  	}
> >
> > -	p->cpus_allowed = new_mask;
> > +	if (p->sched_class->set_cpus_allowed)
> 
> Not sure we need this optimization (not doing the nr_cpus_allowed
> calculation). Since, due to priority boosting, we will need to calculate
> then. Calculating it here is better.

I think you are misunderstanding the code here.  The only optimization
is that I didn't want to force every sched_class to define a
set_cpus_allowed member-fn.  So instead, it first checks whether the hook
is defined and invokes it if so.  Otherwise, the default behavior is to
assign the mask and calculate the weight.

If you look at the one and only implementation of this hook
(sched_rt.c:set_cpus_allowed_rt()), it also performs the assignment and
calculates the weight.

Or did I misunderstand your objection?

> 
> > +		p->sched_class->set_cpus_allowed(p, new_mask);
> > +	else {
> > +		p->cpus_allowed    = new_mask;
> > +		p->nr_cpus_allowed = cpus_weight(new_mask);
> > +	}
> > +
> >  	/* Can the task run on the task's current CPU? If so, we're done */
> >  	if (cpu_isset(task_cpu(p), new_mask))
> >  		goto out;
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index b59dc20..ad35c89 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -50,6 +50,24 @@ static inline void update_curr_rt(struct rq *rq)
> >  	curr->se.sum_exec_runtime += delta_exec;
> >  	curr->se.exec_start = rq->clock;
> >  }
> > +#ifdef CONFIG_SMP
> > +static void inc_rt_migration(struct task_struct *p, struct rq *rq)
> > +{
> > +	rq->rt.rt_nr_migratory++;
> > +
> > +	if (rq->rt.rt_nr_running > 1)
> > +		rt_set_overload(p, rq->cpu);
> > +}
> > +
> > +static void dec_rt_migration(struct task_struct *p, struct rq *rq)
> > +{
> > +	WARN_ON(!rq->rt.rt_nr_migratory);
> > +	rq->rt.rt_nr_migratory--;
> > +
> > +	if ((rq->rt.rt_nr_running < 2) || !rq->rt.rt_nr_migratory)
> > +		rt_clear_overload(p, rq->cpu);
> > +}
> > +#endif
> >
> >  static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
> >  {
> > @@ -58,8 +76,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
> >  #ifdef CONFIG_SMP
> >  	if (p->prio < rq->rt.highest_prio)
> >  		rq->rt.highest_prio = p->prio;
> > -	if (rq->rt.rt_nr_running > 1)
> > -		rt_set_overload(p, rq->cpu);
> > +	if (p->nr_cpus_allowed > 1)
> > +		inc_rt_migration(p, rq);
> >  #endif /* CONFIG_SMP */
> >  }
> >
> > @@ -81,8 +99,8 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
> >  		} /* otherwise leave rq->highest prio alone */
> >  	} else
> >  		rq->rt.highest_prio = MAX_RT_PRIO;
> > -	if (rq->rt.rt_nr_running < 2)
> > -		rt_clear_overload(p, rq->cpu);
> > +	if (p->nr_cpus_allowed > 1)
> > +		dec_rt_migration(p, rq);
> >  #endif /* CONFIG_SMP */
> >  }
> >
> > @@ -286,6 +304,27 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
> >  	return next;
> >  }
> >
> > +static int lock_migration_target(struct task_struct *p, struct rq *target_rq)
> > +{
> > +	struct rq *this_rq = task_rq(p);
> > +
> > +	if (double_lock_balance(this_rq, target_rq)) {
> > +		/*
> > +		 * We had to unlock the run queue. In
> > +		 * the mean time, task could have
> > +		 * migrated already or had its affinity changed.
> > +		 */
> > +		if (unlikely(task_running(this_rq, p) ||
> > +			     task_rq(p) != this_rq ||
> > +			     !cpu_isset(target_rq->cpu, p->cpus_allowed))) {
> > +			spin_unlock(&target_rq->lock);
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  /* Will lock the rq it finds */
> >  static struct rq *find_lock_lowest_rq(struct task_struct *task,
> >  				      struct rq *this_rq)
> > @@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> >  	int cpu;
> >  	int tries;
> >
> > + 	/*
> > + 	 * We can optimize if the hamming weight of the cpus_allowed mask
> > + 	 * is 1 because the task has nowhere to go but one CPU.  So don't
> > + 	 * waste the time trying to find the lowest RQ in this case.
> > + 	 */
> 
> This code should be in the pick_next_highest_rt and not here.

I disagree, but I admit it is not apparent at this level of the series
why.  The reason has to do with optimizing the wakeup path.  Unless I am
missing something, I think this placement is optimal, and that will
hopefully become apparent when you see the rest of my series.


> 
> > + 	if (task->nr_cpus_allowed == 1) {
> > + 		/* If the task is already on the RQ, we are done */
> > + 		if (cpu_isset(this_rq->cpu, task->cpus_allowed))
> > + 			return NULL;
> > +
> > + 		cpu = first_cpu(task->cpus_allowed);
> > +
> > + 		lowest_rq = cpu_rq(cpu);
> > + 		BUG_ON(this_rq == lowest_rq);
> > +
> > + 		/* Otherwise, we can simply grab the new RQ */
> > + 		if (lock_migration_target(task, lowest_rq))
> > + 			return lowest_rq;
> > + 		else
> > + 			return NULL;
> > + 	}
> > +
> >  	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
> >
> >  	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> > @@ -324,22 +385,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> >  			break;
> >
> >  		/* if the prio of this runqueue changed, try again */
> > -		if (double_lock_balance(this_rq, lowest_rq)) {
> > -			/*
> > -			 * We had to unlock the run queue. In
> > -			 * the mean time, task could have
> > -			 * migrated already or had its affinity changed.
> > -			 * Also make sure that it wasn't scheduled on its rq.
> > -			 */
> > -			if (unlikely(task_rq(task) != this_rq ||
> > -				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
> > -				     task_running(this_rq, task) ||
> > -				     !task->se.on_rq)) {
> > -				spin_unlock(&lowest_rq->lock);
> > -				lowest_rq = NULL;
> > -				break;
> > -			}
> > -		}
> > + 		if (!lock_migration_target(task, lowest_rq))
> > + 			return NULL;
> 
> I don't like this encapsulating of the doubl_lock_balance. There's a
> reason I kept it out like this. Mainly because this is the source of most
> (if not all) of the race condititions I fought. So this is a very
> sensitive area to touch.

Reducing code duplication is an effort to mitigate error propagation,
not increase it ;)

> In fact, I see a few races already introduced by
> this patch. One is that you took out the "!task->se.or_rq" test.

This was intentional, but perhaps I should have been more explicit about
the reason why.  This is again related to future optimizations that I
have yet to reveal.  Namely, this code path can be invoked before the
task has been woken up, and it is therefore normal for it to not be on an
RQ.

>  Which
> means that a task could have ran and deactivated itself
> (TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that
> race ;-)

Hmm.. This is a good point.  Perhaps we should check that the task
doesn't change state, instead of checking whether it's on an RQ.

> 
> So keep that code out in the open where it belongs.

You aren't the first person to say something like that, and I always
scratch my head at that one.  It *is* open... it's right there, 40 lines
from where it used to be.  It's not like I gave you an obfuscated .o
blackbox ;)

>  Its very sensitive.

Agreed.  You weren't aware of my issue, as I wasn't aware of yours.  I
think that means we both failed to comment tricky code properly ;)

> 
> 
> >
> >  		/* If this rq is still suitable use it. */
> >  		if (lowest_rq->rt.highest_prio > task->prio)
> > @@ -658,6 +705,31 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
> >  	}
> >  }
> >
> > +#ifdef CONFIG_SMP
> > +static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
> > +{
> > +	int weight = cpus_weight(new_mask);
> > +
> > +	BUG_ON(!rt_task(p));
> > +
> > +	/*
> > +	 * Update the migration status of the RQ if we have an RT task
> > +	 * which is running AND changing its weight value.
> > +	 */
> > +	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
> > +		struct rq *rq = task_rq(p);
> > +
> > +		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
> > +			inc_rt_migration(p, rq);
> > +		else if((p->nr_cpus_allowed > 1) && (weight <= 1))
> > +			dec_rt_migration(p, rq);
> > +	}
> > +
> > +	p->cpus_allowed    = new_mask;
> > +	p->nr_cpus_allowed = weight;
> > +}
> > +#endif
> > +
> >  static struct sched_class rt_sched_class __read_mostly = {
> >  	.enqueue_task		= enqueue_task_rt,
> >  	.dequeue_task		= dequeue_task_rt,
> > @@ -671,4 +743,8 @@ static struct sched_class rt_sched_class __read_mostly = {
> >  	.load_balance		= load_balance_rt,
> >
> >  	.task_tick		= task_tick_rt,
> > +
> > +#ifdef CONFIG_SMP
> > +	.set_cpus_allowed       = set_cpus_allowed_rt,
> > +#endif
> 
> Hmm, I must have missed where you update the mask at time of boosting.

?  I don't understand what you mean.  Are you referring to PI boosting?
What does that have to do with the mask?

Are you referring to when we change the mask?  We already invoke this
handler from the sched.c:set_cpus_allowed().  Can you clarify this
point?

> Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks.

It is.  I'm just lazy and had the default case handled inside the
sched.c:set_cpus_allowed() call.  Perhaps I should just let all
sched_classes define a handler instead of such trickery.

> Now you can have a handler to call for when a task changes class

Well, at least for what I am designing here, we are already covered.  If
the task changes class, it would be dequeued in one class, and enqueued
in the new one.  This would properly update the migration state.
Likewise, if it had its mask changed the set_cpus_allowed_rt() would
update the migration state.

Regards,
-Greg

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 18:51     ` Gregory Haskins
@ 2007-10-25 19:52       ` Steven Rostedt
  2007-10-25 21:12         ` Gregory Haskins
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2007-10-25 19:52 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--
On Thu, 25 Oct 2007, Gregory Haskins wrote:

> > >
> > > -	p->cpus_allowed = new_mask;
> > > +	if (p->sched_class->set_cpus_allowed)
> >
> > Not sure we need this optimization (not doing the nr_cpus_allowed
> > calculation). Since, due to priority boosting, we will need to calculate
> > then. Calculating it here is better.
>
> I think you are misunderstanding the code here.  The only optimization
> is that I didn't want to force every sched_class to define a default
> set_cpus_allowed member-fn.  So instead, it first checks if it's defined
> and invokes it if true.  Else, the default behavior is to assign the
> mask and calculate the weight.
>
> If you look at the one and only implementation of this function
> (sched_rt.c:set_cpus_allowed_rt()), it also performs the assignment and
> calcs the weight.
>
> Or did I misunderstand your objection?

Actually, I say we do the calculation for all tasks regardless of class.
But you can have a function for each class (or NULL) that will do
something with the old and new values.

The reason is that we don't want the calculation to happen during the
boosting of a task (when it goes from one class to another).
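
(For illustration only -- a minimal C sketch of the shape being suggested
here.  The hook name cpus_allowed_changed() and its arguments are
assumptions for illustration, not code from either patch:)

	/* sched.c side: always compute the weight, then notify the class */
	int old_weight = p->nr_cpus_allowed;
	int new_weight = cpus_weight(new_mask);

	p->cpus_allowed    = new_mask;
	p->nr_cpus_allowed = new_weight;

	/* hypothetical per-class hook; NULL for classes that don't care */
	if (p->sched_class->cpus_allowed_changed)
		p->sched_class->cpus_allowed_changed(p, old_weight, new_weight);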

>
> >
> > > +		p->sched_class->set_cpus_allowed(p, new_mask);
> > > +	else {
> > > +		p->cpus_allowed    = new_mask;
> > > +		p->nr_cpus_allowed = cpus_weight(new_mask);
> > > +	}
> > > +
> > >  	/* Can the task run on the task's current CPU? If so, we're done */
> > >  	if (cpu_isset(task_cpu(p), new_mask))
> > >  		goto out;


> > >  /* Will lock the rq it finds */
> > >  static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > >  				      struct rq *this_rq)
> > > @@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > >  	int cpu;
> > >  	int tries;
> > >
> > > + 	/*
> > > + 	 * We can optimize if the hamming weight of the cpus_allowed mask
> > > + 	 * is 1 because the task has nowhere to go but one CPU.  So don't
> > > + 	 * waste the time trying to find the lowest RQ in this case.
> > > + 	 */
> >
> > This code should be in the pick_next_highest_rt and not here.
>
> I disagree, but I admit it is not apparent at this level of the series
> why.  The reason has to do with optimizing the wakeup path.  Unless I am
> missing something, I think this placement is optimal, and that will
> hopefully become apparent when you see the rest of my series.

Then move the code there. One can't analyze patches when the code that
determines the changes isn't visible.  Those changes need to be
together, especially since they may not be applied.

I would be happy to apply most of this patch but because of changes that
are here to help out to-be-announced patches, will keep this patch from
going into the tree.

IOW, break the patch up to the fixes that this patch is to do on its own.
It has plenty. Leave out the stuff that will help out later patches.
Add the stuff that is being recommended, and you can make a separate patch
to move things around in the patch series that does all the TBA stuff.

>
> >
> > > + 	if (task->nr_cpus_allowed == 1) {
> > > + 		/* If the task is already on the RQ, we are done */
> > > + 		if (cpu_isset(this_rq->cpu, task->cpus_allowed))
> > > + 			return NULL;
> > > +
> > > + 		cpu = first_cpu(task->cpus_allowed);
> > > +
> > > + 		lowest_rq = cpu_rq(cpu);
> > > + 		BUG_ON(this_rq == lowest_rq);
> > > +
> > > + 		/* Otherwise, we can simply grab the new RQ */
> > > + 		if (lock_migration_target(task, lowest_rq))
> > > + 			return lowest_rq;
> > > + 		else
> > > + 			return NULL;
> > > + 	}
> > > +
> > >  	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
> > >
> > >  	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> > > @@ -324,22 +385,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > >  			break;
> > >
> > >  		/* if the prio of this runqueue changed, try again */
> > > -		if (double_lock_balance(this_rq, lowest_rq)) {
> > > -			/*
> > > -			 * We had to unlock the run queue. In
> > > -			 * the mean time, task could have
> > > -			 * migrated already or had its affinity changed.
> > > -			 * Also make sure that it wasn't scheduled on its rq.
> > > -			 */
> > > -			if (unlikely(task_rq(task) != this_rq ||
> > > -				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
> > > -				     task_running(this_rq, task) ||
> > > -				     !task->se.on_rq)) {
> > > -				spin_unlock(&lowest_rq->lock);
> > > -				lowest_rq = NULL;
> > > -				break;
> > > -			}
> > > -		}
> > > + 		if (!lock_migration_target(task, lowest_rq))
> > > + 			return NULL;
> >
> > I don't like this encapsulating of the double_lock_balance. There's a
> > reason I kept it out like this. Mainly because this is the source of most
> > (if not all) of the race conditions I fought. So this is a very
> > sensitive area to touch.
>
> Reducing code duplication is an effort to mitigate error propagation,
> not increase it ;)

Actually that's not the point. The side effects of the double_lock_balance
are unique to each location. So it doesn't really make sense to modularize
this code. That was why I didn't do it.
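
(For context -- a simplified sketch of what double_lock_balance() does in
sched.c of this era; details are paraphrased rather than quoted, so treat
this as an approximation:)

	static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
	{
		int ret = 0;

		if (unlikely(!spin_trylock(&busiest->lock))) {
			if (busiest < this_rq) {
				/* respect lock ordering: drop ours first */
				spin_unlock(&this_rq->lock);
				spin_lock(&busiest->lock);
				spin_lock(&this_rq->lock);
				ret = 1;	/* caller must revalidate */
			} else
				spin_lock(&busiest->lock);
		}
		return ret;
	}

The non-obvious side effect is the return value of 1: this_rq->lock was
dropped and re-taken, so everything checked under it must be re-checked.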

>
> > In fact, I see a few races already introduced by
> > this patch. One is that you took out the "!task->se.on_rq" test.
>
> This was intentional, but perhaps I should have been more explicit on
> the reason why.  This is again related to future optimizations that I
> have yet to reveal.  Namely, this code path can be invoked before the
> task has been woken up, and it is therefore normal for it to not be on a
> RQ.
>
> >  Which
> > means that a task could have run and deactivated itself
> > (TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that
> > race ;-)
>
> > Hmm.. This is a good point.  Perhaps we should check that the
> > task doesn't change state instead of whether it's on a RQ.

Hence why I said the balance before activate_task is very racy.  Looking
at the state is not enough. An RT task may be in the TASK_UNINTERRUPTIBLE
state and about to check a condition that will have it wake itself up. So
just looking at the state is not enough to determine if we should skip it
or not.
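
(Generic illustration, not code from the patch: the standard wait-loop
idiom is why the task's state alone can't tell us whether to skip it.)

	/* the sleeping task's side, typical pattern: */
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (!condition)		/* condition may become true right here...  */
		schedule();	/* ...so schedule() may never be reached    */
	__set_current_state(TASK_RUNNING);

A pusher that samples p->state at the wrong moment sees
TASK_UNINTERRUPTIBLE even though the task is about to wake itself up.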

>
> >
> > So keep that code out in the open where it belongs.
>
> You aren't the first person to say something like that, and I always
> scratch my head at that one.  It *is* open.. it's right there -40 lines
> from where it used to be.  It's not like I gave you an obfuscated .o
> blackbox ;)

My point is that the side effect that the double_lock_balance causes is
not apparent in the encapsulated function.

>
> >  Its very sensitive.
>
> Agreed.  You weren't aware of my issue, as I wasn't aware of yours.  I
> think that means we both failed to comment tricky code properly ;)

But your issues are on patches not yet sent ;-)

> >
> >
> > >
> > >  		/* If this rq is still suitable use it. */
> > >  		if (lowest_rq->rt.highest_prio > task->prio)
> > > @@ -658,6 +705,31 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
> > >  	}
> > >  }
> > >
> > > +#ifdef CONFIG_SMP
> > > +static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
> > > +{
> > > +	int weight = cpus_weight(new_mask);
> > > +
> > > +	BUG_ON(!rt_task(p));
> > > +
> > > +	/*
> > > +	 * Update the migration status of the RQ if we have an RT task
> > > +	 * which is running AND changing its weight value.
> > > +	 */
> > > +	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
> > > +		struct rq *rq = task_rq(p);
> > > +
> > > +		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
> > > +			inc_rt_migration(p, rq);
> > > +		else if((p->nr_cpus_allowed > 1) && (weight <= 1))
> > > +			dec_rt_migration(p, rq);
> > > +	}
> > > +
> > > +	p->cpus_allowed    = new_mask;
> > > +	p->nr_cpus_allowed = weight;
> > > +}
> > > +#endif
> > > +
> > >  static struct sched_class rt_sched_class __read_mostly = {
> > >  	.enqueue_task		= enqueue_task_rt,
> > >  	.dequeue_task		= dequeue_task_rt,
> > > @@ -671,4 +743,8 @@ static struct sched_class rt_sched_class __read_mostly = {
> > >  	.load_balance		= load_balance_rt,
> > >
> > >  	.task_tick		= task_tick_rt,
> > > +
> > > +#ifdef CONFIG_SMP
> > > +	.set_cpus_allowed       = set_cpus_allowed_rt,
> > > +#endif
> >
> > Hmm, I must have missed where you update the mask at time of boosting.
>
> ?  I don't understand what you mean.  Are you referring to PI boosting?
> What does that have to do with the mask?

No, but boosting can change classes. And going from fair to rt we don't
have nr_cpus_allowed set up yet.

>
> Are you referring to when we change the mask?  We already invoke this
> handler from the sched.c:set_cpus_allowed().  Can you clarify this
> point?
>
> > Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks.
>
> It is.  I'm just lazy and had the default case handled inside the
> sched.c:set_cpus_allowed() call.  Perhaps I should just let all
> sched_classes define a handler instead of such trickery.
>
> > Now you can have a handler to call for when a task changes class
>
> Well, at least for what I am designing here, we are already covered.  If
> the task changes class, it would be dequeued in one class, and enqueued
> in the new one.  This would properly update the migration state.
> Likewise, if it had its mask changed the set_cpus_allowed_rt() would
> update the migration state.

That is what I think I'm missing. I have to go back and look at the patch
(I admit I may have missed it). Switching the migration state will update
the process's nr_cpus_allowed? Of course I think that's not a good idea.

/me looks.

-- Steve


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 14:46 ` [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration Gregory Haskins
  2007-10-25 15:48   ` Steven Rostedt
@ 2007-10-25 19:58   ` Steven Rostedt
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2007-10-25 19:58 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--
> @@ -5371,7 +5372,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
>  		goto out;
>  	}
>
> -	p->cpus_allowed = new_mask;
> +	if (p->sched_class->set_cpus_allowed)
> +		p->sched_class->set_cpus_allowed(p, new_mask);
> +	else {
> +		p->cpus_allowed    = new_mask;
> +		p->nr_cpus_allowed = cpus_weight(new_mask);
> +	}
> +

OK, forget everything I said on this topic. I must have replied while low
on caffeine.

I see here that you run the class's set_cpus_allowed code if it is
defined, _ELSE_ you just update the mask and weight.

So you do do what I asked.

/me slaps himself to wakeup.

-- Steve


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 19:52       ` Steven Rostedt
@ 2007-10-25 21:12         ` Gregory Haskins
  2007-10-26  0:03           ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Haskins @ 2007-10-25 21:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart

[-- Attachment #1: Type: text/plain, Size: 11447 bytes --]

On Thu, 2007-10-25 at 15:52 -0400, Steven Rostedt wrote:

> >
> > >
> > > > +		p->sched_class->set_cpus_allowed(p, new_mask);
> > > > +	else {
> > > > +		p->cpus_allowed    = new_mask;
> > > > +		p->nr_cpus_allowed = cpus_weight(new_mask);
> > > > +	}
> > > > +
> > > >  	/* Can the task run on the task's current CPU? If so, we're done */
> > > >  	if (cpu_isset(task_cpu(p), new_mask))
> > > >  		goto out;
> 
> 
> > > >  /* Will lock the rq it finds */
> > > >  static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > > >  				      struct rq *this_rq)
> > > > @@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > > >  	int cpu;
> > > >  	int tries;
> > > >
> > > > + 	/*
> > > > + 	 * We can optimize if the hamming weight of the cpus_allowed mask
> > > > + 	 * is 1 because the task has nowhere to go but one CPU.  So don't
> > > > + 	 * waste the time trying to find the lowest RQ in this case.
> > > > + 	 */
> > >
> > > This code should be in the pick_next_highest_rt and not here.
> >
> > I disagree, but I admit it is not apparent at this level of the series
> > why.  The reason has to do with optimizing the wakeup path.  Unless I am
> > missing something, I think this placement is optimal, and that will
> > hopefully become apparent when you see the rest of my series.
> 
> Then move the code there. One can't analyze patches when the code that
> determines the changes isn't visible.  Those changes need to be
> together, especially since they may not be applied.

The correctness of this particular change (IMO) is not predicated on the
later patches, otherwise I would have done just that.  It accomplishes
what I intended as is here in this patch.

Why do you think moving the logic to pick_next_highest is a better
design?  To be honest, I haven't really studied your new logic in
push_rt_tasks to understand why you might feel this way.  If you can
make the case that it is better in the other location then I agree with
you that we should move it there in this patch, and potentially adjust
it later.  Until then, I see no problem with it being here.

> 
> I would be happy to apply most of this patch but because of changes that
> are here to help out to-be-announced patches, will keep this patch from
> going into the tree.
> 
> IOW, break the patch up to the fixes that this patch is to do on its own.
> It has plenty. Leave out the stuff that will help out later patches.
> Add the stuff that is being recommended, and you can make a separate patch
> to move things around in the patch series that does all the TBA stuff.

I think the only real solid case of that is the removal of on_rq
from the double_lock conditions.  I agree with your point that it should
come later.  I'll put it back in and adjust it later.  I'll wait to hear
your argument on the pick_highest_task before moving the other one.

> 
> >
> > >
> > > > + 	if (task->nr_cpus_allowed == 1) {
> > > > + 		/* If the task is already on the RQ, we are done */
> > > > + 		if (cpu_isset(this_rq->cpu, task->cpus_allowed))
> > > > + 			return NULL;
> > > > +
> > > > + 		cpu = first_cpu(task->cpus_allowed);
> > > > +
> > > > + 		lowest_rq = cpu_rq(cpu);
> > > > + 		BUG_ON(this_rq == lowest_rq);
> > > > +
> > > > + 		/* Otherwise, we can simply grab the new RQ */
> > > > + 		if (lock_migration_target(task, lowest_rq))
> > > > + 			return lowest_rq;
> > > > + 		else
> > > > + 			return NULL;
> > > > + 	}
> > > > +
> > > >  	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
> > > >
> > > >  	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> > > > @@ -324,22 +385,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > > >  			break;
> > > >
> > > >  		/* if the prio of this runqueue changed, try again */
> > > > -		if (double_lock_balance(this_rq, lowest_rq)) {
> > > > -			/*
> > > > -			 * We had to unlock the run queue. In
> > > > -			 * the mean time, task could have
> > > > -			 * migrated already or had its affinity changed.
> > > > -			 * Also make sure that it wasn't scheduled on its rq.
> > > > -			 */
> > > > -			if (unlikely(task_rq(task) != this_rq ||
> > > > -				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
> > > > -				     task_running(this_rq, task) ||
> > > > -				     !task->se.on_rq)) {
> > > > -				spin_unlock(&lowest_rq->lock);
> > > > -				lowest_rq = NULL;
> > > > -				break;
> > > > -			}
> > > > -		}
> > > > + 		if (!lock_migration_target(task, lowest_rq))
> > > > + 			return NULL;
> > >
> > > I don't like this encapsulating of the double_lock_balance. There's a
> > > reason I kept it out like this. Mainly because this is the source of most
> > > (if not all) of the race conditions I fought. So this is a very
> > > sensitive area to touch.
> >
> > Reducing code duplication is an effort to mitigate error propagation,
> > not increase it ;)
> 
> Actually that's not the point. The side effects of the double_lock_balance
> are unique to each location. So it doesn't really make sense to modularize
> this code. That was why I didn't do it.

?

I generalized the core functionality so it worked as I believed it
should in the two places that I used it.  I don't follow your objection
here.  Did I break something (besides the on_rq thing?).  My feeling is
that both need the same locking semantics, but we don't need the check
for the priority shift in the optimized case as we do in the loop.  So
that is where I delineated the common code from the specialized.
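
(For reference, since lock_migration_target() itself isn't quoted anywhere
in this thread -- a rough sketch of what it presumably looks like,
reconstructed from the open-coded checks it replaced.  The exact signature
and the re-derivation of the source rq are assumptions; note that the
contested task->se.on_rq test is deliberately absent:)

	static int lock_migration_target(struct task_struct *task,
					 struct rq *lowest_rq)
	{
		struct rq *this_rq = task_rq(task); /* caller holds this lock */

		if (double_lock_balance(this_rq, lowest_rq)) {
			/*
			 * this_rq->lock was dropped: the task may have
			 * migrated, changed affinity, or started running.
			 */
			if (unlikely(task_rq(task) != this_rq ||
				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
				     task_running(this_rq, task))) {
				spin_unlock(&lowest_rq->lock);
				return 0;
			}
		}

		return 1;
	}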

> 
> >
> > > In fact, I see a few races already introduced by
> > > this patch. One is that you took out the "!task->se.on_rq" test.
> >
> > This was intentional, but perhaps I should have been more explicit on
> > the reason why.  This is again related to future optimizations that I
> > have yet to reveal.  Namely, this code path can be invoked before the
> > task has been woken up, and it is therefore normal for it to not be on a
> > RQ.
> >
> > >  Which
> > > means that a task could have run and deactivated itself
> > > (TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that
> > > race ;-)
> >
> > Hmm.. This is a good point.  Perhaps we should check that the
> > task doesn't change state instead of whether it's on a RQ.
> 
> Hence why I said the balance before activate_task is very racy.

If that's the case then you are already broken if we hit the wake_idle()
case, no?  We can drop the lock there too.

But I am not convinced it cannot be done in a race-free way.  I can be
convinced otherwise. 

>   Looking
> at the state is not enough. An RT task may be in the TASK_UNINTERRUPTIBLE
> state and about to check a condition that will have it wake itself up. So
> just looking at the state is not enough to determine if we should skip it
> or not.

Hmm...I'm not sure I follow.  Can you elaborate more?

> 
> >
> > >
> > > So keep that code out in the open where it belongs.
> >
> > You aren't the first person to say something like that, and I always
> > scratch my head at that one.  It *is* open.. it's right there -40 lines
> > from where it used to be.  It's not like I gave you an obfuscated .o
> > blackbox ;)
> 
> My point is that the side effect that the double_lock_balance causes is
> not apparent in the encapsulated function.
> 
> >
> > >  Its very sensitive.
> >
> > Agreed.  You weren't aware of my issue, as I wasn't aware of yours.  I
> > think that means we both failed to comment tricky code properly ;)
> 
> But your issues are on patches not yet sent ;-)

Fair enough...I'll fix it.  But my point still stands that a decent
comment from you would have stopped my "cleverness" ;)

> 
> > >
> > >
> > > >
> > > >  		/* If this rq is still suitable use it. */
> > > >  		if (lowest_rq->rt.highest_prio > task->prio)
> > > > @@ -658,6 +705,31 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
> > > >  	}
> > > >  }
> > > >
> > > > +#ifdef CONFIG_SMP
> > > > +static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t new_mask)
> > > > +{
> > > > +	int weight = cpus_weight(new_mask);
> > > > +
> > > > +	BUG_ON(!rt_task(p));
> > > > +
> > > > +	/*
> > > > +	 * Update the migration status of the RQ if we have an RT task
> > > > +	 * which is running AND changing its weight value.
> > > > +	 */
> > > > +	if (p->se.on_rq && (weight != p->nr_cpus_allowed)) {
> > > > +		struct rq *rq = task_rq(p);
> > > > +
> > > > +		if ((p->nr_cpus_allowed <= 1) && (weight > 1))
> > > > +			inc_rt_migration(p, rq);
> > > > +		else if((p->nr_cpus_allowed > 1) && (weight <= 1))
> > > > +			dec_rt_migration(p, rq);
> > > > +	}
> > > > +
> > > > +	p->cpus_allowed    = new_mask;
> > > > +	p->nr_cpus_allowed = weight;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static struct sched_class rt_sched_class __read_mostly = {
> > > >  	.enqueue_task		= enqueue_task_rt,
> > > >  	.dequeue_task		= dequeue_task_rt,
> > > > @@ -671,4 +743,8 @@ static struct sched_class rt_sched_class __read_mostly = {
> > > >  	.load_balance		= load_balance_rt,
> > > >
> > > >  	.task_tick		= task_tick_rt,
> > > > +
> > > > +#ifdef CONFIG_SMP
> > > > +	.set_cpus_allowed       = set_cpus_allowed_rt,
> > > > +#endif
> > >
> > > Hmm, I must have missed where you update the mask at time of boosting.
> >
> > ?  I don't understand what you mean.  Are you referring to PI boosting?
> > What does that have to do with the mask?
> 
> No, but boosting can change classes. And going from fair to rt we don't
> have nr_cpus_allowed set up yet.

Sure we do! (unless I introduced a bug)

task->cpus_allowed and task->nr_cpus_allowed are updated for *all*
classes.  It is only acted upon to change the migration state if the
task is of an RT class.

> 
> >
> > Are you referring to when we change the mask?  We already invoke this
> > handler from the sched.c:set_cpus_allowed().  Can you clarify this
> > point?
> >
> > > Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks.
> >
> > It is.  I'm just lazy and had the default case handled inside the
> > sched.c:set_cpus_allowed() call.  Perhaps I should just let all
> > sched_classes define a handler instead of such trickery.
> >
> > > Now you can have a handler to call for when a task changes class
> >
> > Well, at least for what I am designing here, we are already covered.  If
> > the task changes class, it would be dequeued in one class, and enqueued
> > in the new one.  This would properly update the migration state.
> > Likewise, if it had its mask changed the set_cpus_allowed_rt() would
> > update the migration state.
> 
> That is what I think I'm missing. I have to go back and look at the patch
> (I admit I may have missed it). Switching the migration state will update
> the process's nr_cpus_allowed? Of course I think that's not a good idea.

No, other way around.  (De)queuing an RT task, or changing an RT task's
mask, will change the migration state.  So if you have a normal task, it's
tracking its hamming weight... and then you convert it to RT, it would
update the migration state on the enqueue.

Likewise, if you alter the mask of an RT task, it would update the
migration state right away.  It never would flow the other way.
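
(A sketch of the enqueue-side bookkeeping being described; the actual
enqueue/dequeue hunks aren't quoted in this sub-thread, so the exact
placement is an assumption:)

	static void enqueue_task_rt(struct rq *rq, struct task_struct *p,
				    int wakeup)
	{
		struct rt_prio_array *array = &rq->rt.active;

		list_add_tail(&p->run_list, array->queue + p->prio);
		__set_bit(p->prio, array->bitmap);

		/* a task that may run elsewhere marks this rq as pushable */
		if (p->nr_cpus_allowed > 1)
			inc_rt_migration(p, rq);
	}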

Regards,
-Greg



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration
  2007-10-25 21:12         ` Gregory Haskins
@ 2007-10-26  0:03           ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2007-10-26  0:03 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-rt-users, linux-kernel, Dmitry Adamushko, Peter Zijlstra,
	Ingo Molnar, Darren Hart


--
On Thu, 25 Oct 2007, Gregory Haskins wrote:
> > > > >  /* Will lock the rq it finds */
> > > > >  static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > > > >  				      struct rq *this_rq)
> > > > > @@ -295,6 +334,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
> > > > >  	int cpu;
> > > > >  	int tries;
> > > > >
> > > > > + 	/*
> > > > > + 	 * We can optimize if the hamming weight of the cpus_allowed mask
> > > > > + 	 * is 1 because the task has nowhere to go but one CPU.  So don't
> > > > > + 	 * waste the time trying to find the lowest RQ in this case.
> > > > > + 	 */
> > > >
> > > > This code should be in the pick_next_highest_rt and not here.
> > >
> > > I disagree, but I admit it is not apparent at this level of the series
> > > why.  The reason has to do with optimizing the wakeup path.  Unless I am
> > > missing something, I think this placement is optimal, and that will
> > > hopefully become apparent when you see the rest of my series.
> >
> > Then move the code there then. One can't be analyzing patches when code
> > isn't apparent that determines changes.  Those changes need to be
> > together.  Especially since they may not be applied.
>
> The correctness of this particular change (IMO) is not predicated on the
> later patches, otherwise I would have done just that.  It accomplishes
> what I intended as is here in this patch.
>
> Why do you think moving the logic to pick_next_highest is a better
> design?  To be honest, I haven't really studied your new logic in
> push_rt_tasks to understand why you might feel this way.  If you can
> make the case that it is better in the other location then I agree with
> you that we should move it there in this patch, and potentially adjust
> it later.  Until then, I see no problem with it being here.

Ah, after reading the comment in your code, I might know where our
miscommunication is from.  When you hit a task that can't migrate, you
simply stop and don't bother looking for a lowest rq to place it on.

I'm saying to do one better. Put the code in the pick_next_highest_task_rt
and _skip_ rt tasks with nr_cpus_allowed == 1. So we can then look to
migrate another RT task that is lower in priority than a bounded RT task.

Does this clear up what I'm trying to say?
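
(A rough sketch of the skip being suggested; pick_next_highest_task_rt()
itself isn't quoted here, so the surrounding loop shape is an assumption:)

	/* while scanning a priority queue for the next pushable task: */
	list_for_each_entry(p, array->queue + idx, run_list) {
		if (p->nr_cpus_allowed == 1)
			continue;	/* pinned: skip, try a lower-prio task */
		if (!task_running(rq, p))
			return p;
	}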

BTW, stopping the search for a lowest_rq isn't really an optimization
here, since we only look at cpus that are in the task's cpu affinity mask
and we skip the cpu that it is currently on. So we don't even take a lock
in that case.

-- Steve


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-10-26  0:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-25 14:46 [PATCH 0/3] RT: balance rt tasks enhancements v6 Gregory Haskins
2007-10-25 14:46 ` [PATCH 1/3] RT: cleanup some push-rt logic Gregory Haskins
2007-10-25 14:46 ` [PATCH 2/3] RT: Cache cpus_allowed weight for optimizing migration Gregory Haskins
2007-10-25 15:48   ` Steven Rostedt
2007-10-25 18:51     ` Gregory Haskins
2007-10-25 19:52       ` Steven Rostedt
2007-10-25 21:12         ` Gregory Haskins
2007-10-26  0:03           ` Steven Rostedt
2007-10-25 19:58   ` Steven Rostedt
2007-10-25 14:46 ` [PATCH 3/3] RT: CPU priority management Gregory Haskins
2007-10-25 15:27   ` Steven Rostedt
2007-10-25 17:26     ` Gregory Haskins
2007-10-25 17:36       ` Steven Rostedt
2007-10-25 17:55         ` Gregory Haskins
