linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
@ 2011-08-02 20:36 Steven Rostedt
  2011-08-03 14:18 ` Hillf Danton
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Steven Rostedt @ 2011-08-02 20:36 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Luis Claudio R. Gonçalves, Matthew Hank Sabins,
	Gregory Haskins, Andrew Morton


Ingo,

I've been passing this patch (one with benchmark code enabled) around,
and this has shown nice improvement in the scheduling of RT tasks. The
original code in cpupri, uses a vec->lock where there's one vec per RT
priority, and these locks are global. With the RCU threads and IRQ
threads being RT tasks of the same priority, this lock has been showing
up at the top of the contention list. It has been causing some severe
performance hits in some cases on large boxes.

I passed this patch around to various users that have access to
different boxes ranging from 4 to 64 CPUs. In every case, this patch
showed a significant improvement, as it replaces the global vec->lock
with memory barriers. The added measurement tests is not the only
improvements, but so are jitter tests and cyclictest have improved when
this patch (without the benchmark recording) has been applied.

This patch is in my RT git repo based off of v3.0.


Please pull the latest tip/sched/cpupri tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt.git
tip/sched/cpupri

Head SHA1: 84d342f89dd0bf7d9a01c9802021f01a5ff1c453


Steven Rostedt (1):
      sched/cpupri: Remove the vec->lock

----
 kernel/sched_cpupri.c |   62 ++++++++++++++++++++++++++++++------------------
 kernel/sched_cpupri.h |    5 +--
 2 files changed, 41 insertions(+), 26 deletions(-)
---------------------------
commit 84d342f89dd0bf7d9a01c9802021f01a5ff1c453
Author: Steven Rostedt <srostedt@redhat.com>
Date:   Mon Aug 1 13:20:05 2011 -0400

    sched/cpupri: Remove the vec->lock
    
    The cpupri vec->lock has been showing up as a top contention
    lately. This is because of the RT push/pull logic takes an
    agressive approach for migrating RT tasks. The cpupri logic is
    in place to improve the performance of the push/pull when dealing
    with large number CPU machines.
    
    The problem though is a vec->lock is required, where a vec is a
    global per RT priority structure. That is, if there are lots of
    RT tasks at the same priority, every time they are added or removed
    from the RT queue, this global vec->lock is taken. Now that more
    kernel threads are becoming RT (RCU boost and threaded interrupts)
    this is becoming much more of an issue.
    
    There are two variables that are being synced by the vec->lock.
    The cpupri bitmask, and the vec->counter. The cpupri bitmask
    is one bit per priority. If a RT priority vec has a process queued,
    then the vec->count is > 0 and the cpupri bitmask is set for that
    RT priority.
    
    If the cpupri bitmask gets out of sync with the vec->counter, we could
    end up pushing a low proirity RT task to a high priority queue.
    That RT task that could have run immediately could be queued on a
    run queue with a higher priority task indefinitely.
    
    The solution is not to use the cpupri bitmask and just look at the
    vec->count directly when doing a pull. The cpupri bitmask is just
    a fast way to scan the RT priorities when a pull is made. Instead
    of using the bitmask, and just examine all RT priorities, and
    look at the vec->counts, we could eliminate the vec->lock. The
    scan of RT tasks is to find a run queue that we can push an RT task
    to, and we do not push to a high priority queue, thus the scan only
    needs to go from 1 to RT task->prio, and not all 100 RT priorities.
    
    The push algorithm, which does the scan of RT priorities (and
    scan of the bitmask) only happens when we have an overloaded RT run
    queue (more than one RT task queued). The grabbing of the vec->lock
    happens every time any RT task is queued or dequeued on the run
    queue for that priority. The slowing down of the scan by not using
    a bitmask is negligible by the speed up of removing the vec->lock
    contention, and replacing it with an atomic counter and memory barrier.
    
    To prove this, I wrote a patch that times both the loop and the code
    that grabs the vec->locks. I passed the patches to various people
    (and companies) to test and show the results. I let everyone choose
    their own load to test, giving different loads on the system,
    for various different setups.
    
    Here's some of the results: (snipping to a few CPUs to not make
    this change log huge, but the results were consistent across
    the entire system).
    
    System 1 (24 CPUs)
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    [...]
    cpu 20: loop    3057    1.766   0.061   0.642   1963.170
            vec     6782949 90.469  0.089   0.414   2811760.503
    cpu 21: loop    2617    1.723   0.062   0.641   1679.074
            vec     6782810 90.499  0.089   0.291   1978499.900
    cpu 22: loop    2212    1.863   0.063   0.699   1547.160
            vec     6767244 85.685  0.089   0.435   2949676.898
    cpu 23: loop    2320    2.013   0.062   0.594   1380.265
            vec     6781694 87.923  0.088   0.431   2928538.224
    
    After patch:
    cpu 20: loop    2078    1.579   0.061   0.533   1108.006
            vec     6164555 5.704   0.060   0.143   885185.809
    cpu 21: loop    2268    1.712   0.065   0.575   1305.248
            vec     6153376 5.558   0.060   0.187   1154960.469
    cpu 22: loop    1542    1.639   0.095   0.533   823.249
            vec     6156510 5.720   0.060   0.190   1172727.232
    cpu 23: loop    1650    1.733   0.068   0.545   900.781
            vec     6170784 5.533   0.060   0.167   1034287.953
    
    All times are in microseconds. The 'loop' is the amount of time spent
    doing the loop across the priorities (before patch uses bitmask).
    the 'vec' is the amount of time in the code that requires grabbing
    the vec->lock. The second patch just does not have the vec lock, but
    encompasses the same code.
    
    Amazingly the loop code even went down on average. The vec code went
    from .5 down to .18, that's more than half the time spent!
    
    Note, more than one test was run, but they all had the same results.
    
    System 2 (64 CPUs)
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 60: loop    0       0       0       0       0
            vec     5410840 277.954 0.084   0.782   4232895.727
    cpu 61: loop    0       0       0       0       0
            vec     4915648 188.399 0.084   0.570   2803220.301
    cpu 62: loop    0       0       0       0       0
            vec     5356076 276.417 0.085   0.786   4214544.548
    cpu 63: loop    0       0       0       0       0
            vec     4891837 170.531 0.085   0.799   3910948.833
    
    After patch:
    cpu 60: loop    0       0       0       0       0
            vec     5365118 5.080   0.021   0.063   340490.267
    cpu 61: loop    0       0       0       0       0
            vec     4898590 1.757   0.019   0.071   347903.615
    cpu 62: loop    0       0       0       0       0
            vec     5737130 3.067   0.021   0.119   687108.734
    cpu 63: loop    0       0       0       0       0
            vec     4903228 1.822   0.021   0.071   348506.477
    
    The test run during the measurement did not have any (very few,
    from other CPUs) RT tasks pushing. But this shows that it helped
    out tremendously with the contention, as the contention happens
    because the vec->lock is taken only on queuing at an RT priority,
    and different CPUs that queue tasks at the same priority will
    have contention.
    
    I tested on my own 4 CPU machine with the following results:
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 0:  loop    2377    1.489   0.158   0.588   1398.395
            vec     4484    770.146 2.301   4.396   19711.755
    cpu 1:  loop    2169    1.962   0.160   0.576   1250.110
            vec     4425    152.769 2.297   4.030   17834.228
    cpu 2:  loop    2324    1.749   0.155   0.559   1299.799
            vec     4368    779.632 2.325   4.665   20379.268
    cpu 3:  loop    2325    1.629   0.157   0.561   1306.113
            vec     4650    408.782 2.394   4.348   20222.577
    
    After patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 0:  loop    2121    1.616   0.113   0.636   1349.189
            vec     4303    1.151   0.225   0.421   1811.966
    cpu 1:  loop    2130    1.638   0.178   0.644   1372.927
            vec     4627    1.379   0.235   0.428   1983.648
    cpu 2:  loop    2056    1.464   0.165   0.637   1310.141
            vec     4471    1.311   0.217   0.433   1937.927
    cpu 3:  loop    2154    1.481   0.162   0.601   1295.083
            vec     4236    1.253   0.230   0.425   1803.008
    
    This was running my migrate.c code that can be found at:
    http://lwn.net/Articles/425763/
    
    The migrate code does stress the RT tasks a bit. This shows that
    the loop did increase a little after the patch, but not by much.
    The vec code dropped dramatically. From 4.3us down to .42us.
    That's a 10x improvement!
    
    Tested-by: Mike Galbraith <mgalbraith@suse.de>
    Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
    Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com>
    Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 2722dc1..7761a26 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -47,9 +47,6 @@ static int convert_prio(int prio)
 	return cpupri;
 }
 
-#define for_each_cpupri_active(array, idx)                    \
-	for_each_set_bit(idx, array, CPUPRI_NR_PRIORITIES)
-
 /**
  * cpupri_find - find the best (lowest-pri) CPU in the system
  * @cp: The cpupri context
@@ -71,11 +68,33 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 	int                  idx      = 0;
 	int                  task_pri = convert_prio(p->prio);
 
-	for_each_cpupri_active(cp->pri_active, idx) {
+	if (task_pri >= MAX_RT_PRIO)
+		return 0;
+
+	for (idx = 0; idx < task_pri; idx++) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
 
-		if (idx >= task_pri)
-			break;
+		if (!atomic_read(&(vec)->count))
+			continue;
+		/*
+		 * When looking at the vector, we need to read the counter,
+		 * do a memory barrier, then read the mask.
+		 *
+		 * Note: This is still all racey, but we can deal with it.
+		 *  Ideally, we only want to look at masks that are set.
+		 *
+		 *  If a mask is not set, then the only thing wrong is that we
+		 *  did a little more work than necessary.
+		 *
+		 *  If we read a zero count but the mask is set, because of the
+		 *  memory barriers, that can only happen when the highest prio
+		 *  task for a run queue has left the run queue, in which case,
+		 *  it will be followed by a pull. If the task we are processing
+		 *  fails to find a proper place to go, that pull request will
+		 *  pull this task if the run queue is running at a lower
+		 *  priority.
+		 */
+		smp_rmb();
 
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
@@ -115,7 +134,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 {
 	int                 *currpri = &cp->cpu_to_pri[cpu];
 	int                  oldpri  = *currpri;
-	unsigned long        flags;
 
 	newpri = convert_prio(newpri);
 
@@ -134,26 +152,25 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
 
-		raw_spin_lock_irqsave(&vec->lock, flags);
-
 		cpumask_set_cpu(cpu, vec->mask);
-		vec->count++;
-		if (vec->count == 1)
-			set_bit(newpri, cp->pri_active);
-
-		raw_spin_unlock_irqrestore(&vec->lock, flags);
+		/*
+		 * When adding a new vector, we update the mask first,
+		 * do a write memory barrier, and then update the count, to
+		 * make sure the vector is visible when count is set.
+		 */
+		smp_wmb();
+		atomic_inc(&(vec)->count);
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
 
-		raw_spin_lock_irqsave(&vec->lock, flags);
-
-		vec->count--;
-		if (!vec->count)
-			clear_bit(oldpri, cp->pri_active);
+		/*
+		 * When removing from the vector, we decrement the counter first
+		 * do a memory barrier and then clear the mask.
+		 */
+		atomic_dec(&(vec)->count);
+		smp_wmb();
 		cpumask_clear_cpu(cpu, vec->mask);
-
-		raw_spin_unlock_irqrestore(&vec->lock, flags);
 	}
 
 	*currpri = newpri;
@@ -175,8 +192,7 @@ int cpupri_init(struct cpupri *cp)
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
 
-		raw_spin_lock_init(&vec->lock);
-		vec->count = 0;
+		atomic_set(&vec->count, 0);
 		if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
 			goto cleanup;
 	}
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 9fc7d38..6b4cd17 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -12,9 +12,8 @@
 /* values 2-101 are RT priorities 0-99 */
 
 struct cpupri_vec {
-	raw_spinlock_t lock;
-	int        count;
-	cpumask_var_t mask;
+	atomic_t	count;
+	cpumask_var_t	mask;
 };
 
 struct cpupri {



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-02 20:36 [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Steven Rostedt
@ 2011-08-03 14:18 ` Hillf Danton
  2011-08-03 14:49   ` Steven Rostedt
  2011-08-03 14:29 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Hillf Danton @ 2011-08-03 14:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R.

On Wed, Aug 3, 2011 at 4:36 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>    The migrate code does stress the RT tasks a bit. This shows that
>    the loop did increase a little after the patch, but not by much.
>    The vec code dropped dramatically. From 4.3us down to .42us.
>    That's a 10x improvement!
>
>    Tested-by: Mike Galbraith <mgalbraith@suse.de>
>    Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
>    Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com>
>    Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
>    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
Acked-by: Hillf Danton <dhillf@gmail.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-02 20:36 [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Steven Rostedt
  2011-08-03 14:18 ` Hillf Danton
@ 2011-08-03 14:29 ` Peter Zijlstra
  2011-08-04 20:32 ` [PATCH] cpupri: Fix memory barriers for vec updates to always be in order Steven Rostedt
  2011-08-05  8:20 ` [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Yong Zhang
  3 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-03 14:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Mike Galbraith,
	Luis Claudio R. Gonçalves, Matthew Hank Sabins,
	Gregory Haskins, Andrew Morton

On Tue, 2011-08-02 at 16:36 -0400, Steven Rostedt wrote:
> commit 84d342f89dd0bf7d9a01c9802021f01a5ff1c453
> Author: Steven Rostedt <srostedt@redhat.com>
> Date:   Mon Aug 1 13:20:05 2011 -0400
> 
>     sched/cpupri: Remove the vec->lock 

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-03 14:18 ` Hillf Danton
@ 2011-08-03 14:49   ` Steven Rostedt
  2011-08-05 13:16     ` Hillf Danton
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2011-08-03 14:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R.

On Wed, 2011-08-03 at 22:18 +0800, Hillf Danton wrote:
> On Wed, Aug 3, 2011 at 4:36 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >    The migrate code does stress the RT tasks a bit. This shows that
> >    the loop did increase a little after the patch, but not by much.
> >    The vec code dropped dramatically. From 4.3us down to .42us.
> >    That's a 10x improvement!
> >
> >    Tested-by: Mike Galbraith <mgalbraith@suse.de>
> >    Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
> >    Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com>
> >    Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
> >    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> >
> Acked-by: Hillf Danton <dhillf@gmail.com>

Hi Hillf,

Thanks for the ack. But I want to point out this change as something I
want you to see. Remember when I replied to you with your patches asking
about benchmarks and timings and other tests? This patch is a good
example of what I meant.

I made a change that looked obvious. But obvious is not good enough when
you are dealing with the Linux scheduler. Before posting it, I created a
timing patch to record the timings of the affected area for any work
load. I then passed this patch with the timing changes to various people
that reported issues with this part of the code. I also ran on my own
boxes.

The result was outstanding. That is, everyone that reported back to me
found improvements and no regressions. The improvements were not just in
the timing measurements that I included, but also with their own tests.

Now I'm comfortable with this change.

You sent several patches to me that modified the scheduler in non
trivial ways, with no benchmarks or tests attached. Before making any
changes to the scheduler, you need to have something that shows that
those changes improve things and do not cause regressions.

I sent these patches out over a month ago to get these results. I'm
putting this change in for v3.2, that way it can get even more testing
in linux-next to make sure we didn't miss anything.

This is what I want you to understand. That the scheduler is a core
aspect of Linux, and if we mess it up, it will affect everyone. We can't
take that lightly.

Thanks!

-- Steve




^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] cpupri: Fix memory barriers for vec updates to always be in order
  2011-08-02 20:36 [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Steven Rostedt
  2011-08-03 14:18 ` Hillf Danton
  2011-08-03 14:29 ` Peter Zijlstra
@ 2011-08-04 20:32 ` Steven Rostedt
  2011-08-05 12:27   ` [PATCH v2] " Steven Rostedt
  2011-08-05  8:20 ` [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Yong Zhang
  3 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2011-08-04 20:32 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Luis Claudio R. Gonçalves, Matthew Hank Sabins,
	Gregory Haskins, Andrew Morton

Re-examining the cpupri patch, I see there's a possible race because the
update of the two priorities vec->counts are not protected by a memory
barrier.

When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.

When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask,
and increment the vec->count, and then set the old priority mask by
decrementing the vec->count.

If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not. 

But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).

Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.

But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.

A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.

The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().

The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 7761a26..7113243 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -73,9 +73,10 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	for (idx = 0; idx < task_pri; idx++) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
+		int skip = 0;
 
 		if (!atomic_read(&(vec)->count))
-			continue;
+			skip = 1;
 		/*
 		 * When looking at the vector, we need to read the counter,
 		 * do a memory barrier, then read the mask.
@@ -96,6 +97,10 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 		 */
 		smp_rmb();
 
+		/* Need to do the rmb for every iteration */
+		if (skip)
+			continue;
+
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
 
@@ -134,6 +139,7 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 {
 	int                 *currpri = &cp->cpu_to_pri[cpu];
 	int                  oldpri  = *currpri;
+	int                  do_mb = 0;
 
 	newpri = convert_prio(newpri);
 
@@ -158,18 +164,34 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
-		smp_wmb();
+		smp_mb__before_atomic_inc();
 		atomic_inc(&(vec)->count);
+		do_mb = 1
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
 
 		/*
+		 * Because the order of modification of the vec->count
+		 * is important, we must make sure that the update
+		 * of the new prio is seen before we decrement the
+		 * old prio. This makes sure that the loop sees
+		 * one or the other when we raise the priority of
+		 * the run queue. We don't care about when we lower the
+		 * priority, as that will trigger an rt pull anyway.
+		 *
+		 * We only need to do a memory barrier if we updated
+		 * the new priority vec.
+		 */
+		if (do_mb)
+			smp_mb__after_atomic_inc();
+
+		/*
 		 * When removing from the vector, we decrement the counter first
 		 * do a memory barrier and then clear the mask.
 		 */
 		atomic_dec(&(vec)->count);
-		smp_wmb();
+		smp_mb__after_atomic_inc();
 		cpumask_clear_cpu(cpu, vec->mask);
 	}
 



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-02 20:36 [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Steven Rostedt
                   ` (2 preceding siblings ...)
  2011-08-04 20:32 ` [PATCH] cpupri: Fix memory barriers for vec updates to always be in order Steven Rostedt
@ 2011-08-05  8:20 ` Yong Zhang
  2011-08-05 12:30   ` Steven Rostedt
  3 siblings, 1 reply; 14+ messages in thread
From: Yong Zhang @ 2011-08-05  8:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R. Gon�alves,
	Matthew Hank Sabins, Gregory Haskins, Andrew Morton

On Tue, Aug 02, 2011 at 04:36:12PM -0400, Steven Rostedt wrote:
> diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
> index 9fc7d38..6b4cd17 100644
> --- a/kernel/sched_cpupri.h
> +++ b/kernel/sched_cpupri.h
> @@ -12,9 +12,8 @@
>  /* values 2-101 are RT priorities 0-99 */
>  
>  struct cpupri_vec {
> -	raw_spinlock_t lock;
> -	int        count;
> -	cpumask_var_t mask;
> +	atomic_t	count;
> +	cpumask_var_t	mask;
>  };
>  
>  struct cpupri {

So cpupri->pri_active and macro CPUPRI_NR_PRI_WORDS is not needed any more?

Thanks,
Yong

-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] cpupri: Fix memory barriers for vec updates to always be in order
  2011-08-04 20:32 ` [PATCH] cpupri: Fix memory barriers for vec updates to always be in order Steven Rostedt
@ 2011-08-05 12:27   ` Steven Rostedt
  2011-08-14 16:12     ` [tip:sched/core] sched/cpupri: " tip-bot for Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2011-08-05 12:27 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Luis Claudio R. Gonçalves, Matthew Hank Sabins,
	Gregory Haskins, Andrew Morton

[ This patch actually compiles. Thanks to Mike Galbraith for pointing
that out. I compiled and booted this patch with no issues. ]

Re-examining the cpupri patch, I see there's a possible race because the
update of the two priorities vec->counts are not protected by a memory
barrier.

When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.

When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask,
and increment the vec->count, and then set the old priority mask by
decrementing the vec->count.

If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not. 

But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).

Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.

But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.

A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.

The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().

The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched_cpupri.c
===================================================================
--- linux-rt.git.orig/kernel/sched_cpupri.c
+++ linux-rt.git/kernel/sched_cpupri.c
@@ -73,9 +73,10 @@ int cpupri_find(struct cpupri *cp, struc
 
 	for (idx = 0; idx < task_pri; idx++) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
+		int skip = 0;
 
 		if (!atomic_read(&(vec)->count))
-			continue;
+			skip = 1;
 		/*
 		 * When looking at the vector, we need to read the counter,
 		 * do a memory barrier, then read the mask.
@@ -96,6 +97,10 @@ int cpupri_find(struct cpupri *cp, struc
 		 */
 		smp_rmb();
 
+		/* Need to do the rmb for every iteration */
+		if (skip)
+			continue;
+
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
 
@@ -134,6 +139,7 @@ void cpupri_set(struct cpupri *cp, int c
 {
 	int                 *currpri = &cp->cpu_to_pri[cpu];
 	int                  oldpri  = *currpri;
+	int                  do_mb = 0;
 
 	newpri = convert_prio(newpri);
 
@@ -158,18 +164,34 @@ void cpupri_set(struct cpupri *cp, int c
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
-		smp_wmb();
+		smp_mb__before_atomic_inc();
 		atomic_inc(&(vec)->count);
+		do_mb = 1;
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
 
 		/*
+		 * Because the order of modification of the vec->count
+		 * is important, we must make sure that the update
+		 * of the new prio is seen before we decrement the
+		 * old prio. This makes sure that the loop sees
+		 * one or the other when we raise the priority of
+		 * the run queue. We don't care about when we lower the
+		 * priority, as that will trigger an rt pull anyway.
+		 *
+		 * We only need to do a memory barrier if we updated
+		 * the new priority vec.
+		 */
+		if (do_mb)
+			smp_mb__after_atomic_inc();
+
+		/*
 		 * When removing from the vector, we decrement the counter first
 		 * do a memory barrier and then clear the mask.
 		 */
 		atomic_dec(&(vec)->count);
-		smp_wmb();
+		smp_mb__after_atomic_inc();
 		cpumask_clear_cpu(cpu, vec->mask);
 	}
 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-05  8:20 ` [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Yong Zhang
@ 2011-08-05 12:30   ` Steven Rostedt
  2011-08-05 14:38     ` [PATCH] sched/cpupri: Remove cpupri->pri_active Yong Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2011-08-05 12:30 UTC (permalink / raw)
  To: Yong Zhang
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R. Gon�alves,
	Matthew Hank Sabins, Gregory Haskins, Andrew Morton

On Fri, 2011-08-05 at 16:20 +0800, Yong Zhang wrote:
> On Tue, Aug 02, 2011 at 04:36:12PM -0400, Steven Rostedt wrote:
> > diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
> > index 9fc7d38..6b4cd17 100644
> > --- a/kernel/sched_cpupri.h
> > +++ b/kernel/sched_cpupri.h
> > @@ -12,9 +12,8 @@
> >  /* values 2-101 are RT priorities 0-99 */
> >  
> >  struct cpupri_vec {
> > -	raw_spinlock_t lock;
> > -	int        count;
> > -	cpumask_var_t mask;
> > +	atomic_t	count;
> > +	cpumask_var_t	mask;
> >  };
> >  
> >  struct cpupri {
> 
> So cpupri->pri_active and macro CPUPRI_NR_PRI_WORDS is not needed any more?

It doesn't appear so. Looks like they can be removed too.

Want to send a patch?

Thanks!

-- Steve



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
  2011-08-03 14:49   ` Steven Rostedt
@ 2011-08-05 13:16     ` Hillf Danton
  0 siblings, 0 replies; 14+ messages in thread
From: Hillf Danton @ 2011-08-05 13:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R.

On Wed, Aug 3, 2011 at 10:49 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> Hi Hillf,
>
> Thanks for the ack. But I want to point out this change as something I
> want you to see. Remember when I replied to you with your patches asking
> about benchmarks and timings and other tests? This patch is a good
> example of what I meant.
>
Got and thanks.

Hillf

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] sched/cpupri: Remove cpupri->pri_active
  2011-08-05 12:30   ` Steven Rostedt
@ 2011-08-05 14:38     ` Yong Zhang
  2011-08-05 15:26       ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Yong Zhang @ 2011-08-05 14:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R. Gon�alves,
	Matthew Hank Sabins, Gregory Haskins, Andrew Morton

On Fri, Aug 05, 2011 at 08:30:47AM -0400, Steven Rostedt wrote:
> On Fri, 2011-08-05 at 16:20 +0800, Yong Zhang wrote:
> > 
> > So cpupri->pri_active and macro CPUPRI_NR_PRI_WORDS is not needed any more?
> 
> It doesn't appear so. Looks like they can be removed too.
> 
> Want to send a patch?

No problem :)

---
From: Yong Zhang <yong.zhang0@gmail.com>
Subject: [PATCH] sched/cpupri: Remove cpupri->pri_active

Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
---
 kernel/sched_cpupri.c |    3 ---
 kernel/sched_cpupri.h |    2 --
 2 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 90faffd..5839559 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -151,9 +151,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	/*
 	 * If the cpu was currently mapped to a different value, we
 	 * need to map it to the new value then remove the old value.
-	 * Note, we must add the new value first, otherwise we risk the
-	 * cpu being cleared from pri_active, and this cpu could be
-	 * missed for a push or pull.
 	 */
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 6b4cd17..f6d7561 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -4,7 +4,6 @@
 #include <linux/sched.h>
 
 #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO + 2)
-#define CPUPRI_NR_PRI_WORDS	BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)
 
 #define CPUPRI_INVALID -1
 #define CPUPRI_IDLE     0
@@ -18,7 +17,6 @@ struct cpupri_vec {
 
 struct cpupri {
 	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
-	long              pri_active[CPUPRI_NR_PRI_WORDS];
 	int               cpu_to_pri[NR_CPUS];
 };
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/cpupri: Remove cpupri->pri_active
  2011-08-05 14:38     ` [PATCH] sched/cpupri: Remove cpupri->pri_active Yong Zhang
@ 2011-08-05 15:26       ` Steven Rostedt
  2011-08-06  0:10         ` [PATCH V2] " Yong Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2011-08-05 15:26 UTC (permalink / raw)
  To: Yong Zhang
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R. Gon�alves,
	Matthew Hank Sabins, Gregory Haskins, Andrew Morton

On Fri, 2011-08-05 at 22:38 +0800, Yong Zhang wrote:

> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
> ---
>  kernel/sched_cpupri.c |    3 ---
>  kernel/sched_cpupri.h |    2 --
>  2 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
> index 90faffd..5839559 100644
> --- a/kernel/sched_cpupri.c
> +++ b/kernel/sched_cpupri.c
> @@ -151,9 +151,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
>  	/*
>  	 * If the cpu was currently mapped to a different value, we
>  	 * need to map it to the new value then remove the old value.
> -	 * Note, we must add the new value first, otherwise we risk the
> -	 * cpu being cleared from pri_active, and this cpu could be
> -	 * missed for a push or pull.

Actually, the above still holds true, just not for pri_active. Probably
should be changed to:

	* Note, we must add the new value first, otherwise we risk the
	* cpu being missed by the priority loop in cpupri_find.

or something as such.

>  	 */
>  	if (likely(newpri != CPUPRI_INVALID)) {
>  		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
> diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
> index 6b4cd17..f6d7561 100644
> --- a/kernel/sched_cpupri.h
> +++ b/kernel/sched_cpupri.h
> @@ -4,7 +4,6 @@
>  #include <linux/sched.h>
>  
>  #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO + 2)
> -#define CPUPRI_NR_PRI_WORDS	BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)
>  
>  #define CPUPRI_INVALID -1
>  #define CPUPRI_IDLE     0
> @@ -18,7 +17,6 @@ struct cpupri_vec {
>  
>  struct cpupri {
>  	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
> -	long              pri_active[CPUPRI_NR_PRI_WORDS];
>  	int               cpu_to_pri[NR_CPUS];
>  };
>  


Otherwise looks good.

-- Steve



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH V2] sched/cpupri: Remove cpupri->pri_active
  2011-08-05 15:26       ` Steven Rostedt
@ 2011-08-06  0:10         ` Yong Zhang
  2011-08-14 16:13           ` [tip:sched/core] " tip-bot for Yong Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Yong Zhang @ 2011-08-06  0:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Luis Claudio R. Gon�alves,
	Matthew Hank Sabins, Gregory Haskins, Andrew Morton

On Fri, Aug 05, 2011 at 11:26:40AM -0400, Steven Rostedt wrote:
> On Fri, 2011-08-05 at 22:38 +0800, Yong Zhang wrote:
> 
> > Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
> > ---
> >  kernel/sched_cpupri.c |    3 ---
> >  kernel/sched_cpupri.h |    2 --
> >  2 files changed, 0 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
> > index 90faffd..5839559 100644
> > --- a/kernel/sched_cpupri.c
> > +++ b/kernel/sched_cpupri.c
> > @@ -151,9 +151,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
> >  	/*
> >  	 * If the cpu was currently mapped to a different value, we
> >  	 * need to map it to the new value then remove the old value.
> > -	 * Note, we must add the new value first, otherwise we risk the
> > -	 * cpu being cleared from pri_active, and this cpu could be
> > -	 * missed for a push or pull.
> 
> Actually, the above still holds true, just not for pri_active. Probably
> should be changed to:
> 
> 	* Note, we must add the new value first, otherwise we risk the
> 	* cpu being missed by the priority loop in cpupri_find.
> 
> or something as such.

Updated.

Thanks,
Yong

---
From: Yong Zhang <yong.zhang0@gmail.com>
Subject: [PATCH V2] sched/cpupri: Remove cpupri->pri_active

Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
---
 kernel/sched_cpupri.c |    3 +--
 kernel/sched_cpupri.h |    2 --
 2 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 90faffd..a86cf9d 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -152,8 +152,7 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	 * If the cpu was currently mapped to a different value, we
 	 * need to map it to the new value then remove the old value.
 	 * Note, we must add the new value first, otherwise we risk the
-	 * cpu being cleared from pri_active, and this cpu could be
-	 * missed for a push or pull.
+	 * cpu being missed by the priority loop in cpupri_find.
 	 */
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 6b4cd17..f6d7561 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -4,7 +4,6 @@
 #include <linux/sched.h>
 
 #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO + 2)
-#define CPUPRI_NR_PRI_WORDS	BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)
 
 #define CPUPRI_INVALID -1
 #define CPUPRI_IDLE     0
@@ -18,7 +17,6 @@ struct cpupri_vec {
 
 struct cpupri {
 	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
-	long              pri_active[CPUPRI_NR_PRI_WORDS];
 	int               cpu_to_pri[NR_CPUS];
 };
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip:sched/core] sched/cpupri: Fix memory barriers for vec updates to always be in order
  2011-08-05 12:27   ` [PATCH v2] " Steven Rostedt
@ 2011-08-14 16:12     ` tip-bot for Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Steven Rostedt @ 2011-08-14 16:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, npiggin,
	rostedt, tglx, mingo

Commit-ID:  d473750b4073f16f23f46f30dc1bd3de45c35754
Gitweb:     http://git.kernel.org/tip/d473750b4073f16f23f46f30dc1bd3de45c35754
Author:     Steven Rostedt <rostedt@goodmis.org>
AuthorDate: Fri, 5 Aug 2011 08:27:49 -0400
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:01:07 +0200

sched/cpupri: Fix memory barriers for vec updates to always be in order

[ This patch actually compiles. Thanks to Mike Galbraith for pointing
that out. I compiled and booted this patch with no issues. ]

Re-examining the cpupri patch, I see there's a possible race because the
update of the two priorities vec->counts are not protected by a memory
barrier.

When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.

When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask,
and increment the vec->count, and then set the old priority mask by
decrementing the vec->count.

If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not.

But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).

Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.

But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.

A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.

The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().

The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_cpupri.c |   28 +++++++++++++++++++++++++---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 7761a26..90faffd 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -73,9 +73,10 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	for (idx = 0; idx < task_pri; idx++) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
+		int skip = 0;
 
 		if (!atomic_read(&(vec)->count))
-			continue;
+			skip = 1;
 		/*
 		 * When looking at the vector, we need to read the counter,
 		 * do a memory barrier, then read the mask.
@@ -96,6 +97,10 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 		 */
 		smp_rmb();
 
+		/* Need to do the rmb for every iteration */
+		if (skip)
+			continue;
+
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
 
@@ -134,6 +139,7 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 {
 	int                 *currpri = &cp->cpu_to_pri[cpu];
 	int                  oldpri  = *currpri;
+	int                  do_mb = 0;
 
 	newpri = convert_prio(newpri);
 
@@ -158,18 +164,34 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
-		smp_wmb();
+		smp_mb__before_atomic_inc();
 		atomic_inc(&(vec)->count);
+		do_mb = 1;
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
 
 		/*
+		 * Because the order of modification of the vec->count
+		 * is important, we must make sure that the update
+		 * of the new prio is seen before we decrement the
+		 * old prio. This makes sure that the loop sees
+		 * one or the other when we raise the priority of
+		 * the run queue. We don't care about when we lower the
+		 * priority, as that will trigger an rt pull anyway.
+		 *
+		 * We only need to do a memory barrier if we updated
+		 * the new priority vec.
+		 */
+		if (do_mb)
+			smp_mb__after_atomic_inc();
+
+		/*
 		 * When removing from the vector, we decrement the counter first
 		 * do a memory barrier and then clear the mask.
 		 */
 		atomic_dec(&(vec)->count);
-		smp_wmb();
+		smp_mb__after_atomic_inc();
 		cpumask_clear_cpu(cpu, vec->mask);
 	}
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip:sched/core] sched/cpupri: Remove cpupri->pri_active
  2011-08-06  0:10         ` [PATCH V2] " Yong Zhang
@ 2011-08-14 16:13           ` tip-bot for Yong Zhang
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Yong Zhang @ 2011-08-14 16:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, yong.zhang0, tglx, mingo

Commit-ID:  5710f15b52664ae0bfa60a66d75464769d297b2b
Gitweb:     http://git.kernel.org/tip/5710f15b52664ae0bfa60a66d75464769d297b2b
Author:     Yong Zhang <yong.zhang0@gmail.com>
AuthorDate: Sat, 6 Aug 2011 08:10:04 +0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:01:11 +0200

sched/cpupri: Remove cpupri->pri_active

Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_cpupri.c |    3 +--
 kernel/sched_cpupri.h |    2 --
 2 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 90faffd..a86cf9d 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -152,8 +152,7 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	 * If the cpu was currently mapped to a different value, we
 	 * need to map it to the new value then remove the old value.
 	 * Note, we must add the new value first, otherwise we risk the
-	 * cpu being cleared from pri_active, and this cpu could be
-	 * missed for a push or pull.
+	 * cpu being missed by the priority loop in cpupri_find.
 	 */
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 6b4cd17..f6d7561 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -4,7 +4,6 @@
 #include <linux/sched.h>
 
 #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO + 2)
-#define CPUPRI_NR_PRI_WORDS	BITS_TO_LONGS(CPUPRI_NR_PRIORITIES)
 
 #define CPUPRI_INVALID -1
 #define CPUPRI_IDLE     0
@@ -18,7 +17,6 @@ struct cpupri_vec {
 
 struct cpupri {
 	struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
-	long              pri_active[CPUPRI_NR_PRI_WORDS];
 	int               cpu_to_pri[NR_CPUS];
 };
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-08-14 16:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-02 20:36 [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Steven Rostedt
2011-08-03 14:18 ` Hillf Danton
2011-08-03 14:49   ` Steven Rostedt
2011-08-05 13:16     ` Hillf Danton
2011-08-03 14:29 ` Peter Zijlstra
2011-08-04 20:32 ` [PATCH] cpupri: Fix memory barriers for vec updates to always be in order Steven Rostedt
2011-08-05 12:27   ` [PATCH v2] " Steven Rostedt
2011-08-14 16:12     ` [tip:sched/core] sched/cpupri: " tip-bot for Steven Rostedt
2011-08-05  8:20 ` [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock Yong Zhang
2011-08-05 12:30   ` Steven Rostedt
2011-08-05 14:38     ` [PATCH] sched/cpupri: Remove cpupri->pri_active Yong Zhang
2011-08-05 15:26       ` Steven Rostedt
2011-08-06  0:10         ` [PATCH V2] " Yong Zhang
2011-08-14 16:13           ` [tip:sched/core] " tip-bot for Yong Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).