From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755289Ab1HBUg2 (ORCPT <rfc822;w@1wt.eu>);
	Tue, 2 Aug 2011 16:36:28 -0400
Received: from hrndva-omtalb.mail.rr.com ([71.74.56.123]:49410 "EHLO
	hrndva-omtalb.mail.rr.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754944Ab1HBUgW (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 2 Aug 2011 16:36:22 -0400
X-Authority-Analysis: v=1.1 cv=Pm0sEXe2MdIPK/rOEC7hwDW84D/yDsPO3JtCzsVYOFU= c=1 sm=0 a=1LrE3yD2bboA:10 a=5SG0PmZfjMsA:10 a=Q9fys5e9bTEA:10 a=OPBmh+XkhLl+Enan7BmTLg==:17 a=VwQbUJbxAAAA:8 a=20KFwNOVAAAA:8 a=07d9gI8wAAAA:8 a=VnNF1IyMAAAA:8 a=pGLkceISAAAA:8 a=meVymXHHAAAA:8 a=1Q382_MsjFcs22-h_NUA:9 a=4y7_-GOAzCXnqBPqwlkA:7 a=PUjeQqilurYA:10 a=jEp0ucaQiEUA:10 a=0kPLrQdw3YYA:10 a=MSl-tDqOz04A:10 a=jeBq3FmKZ4MA:10 a=OPBmh+XkhLl+Enan7BmTLg==:117
X-Cloudmark-Score: 0
X-Originating-IP: 67.242.120.143
Subject: [PATCH][GIT PULL] sched/cpupri: Remove the vec->lock
From: Steven Rostedt <rostedt@goodmis.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        Mike Galbraith <mgalbraith@suse.de>,
        "Luis Claudio R." =?ISO-8859-1?Q?Gon=E7alves?= 
	<lgoncalv@redhat.com>,
        Matthew Hank Sabins <msabins@linux.vnet.ibm.com>,
        Gregory Haskins <gregory.haskins@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>
Content-Type: text/plain; charset="ISO-8859-15"
Date: Tue, 02 Aug 2011 16:36:12 -0400
Message-ID: <1312317372.18583.101.camel@gandalf.stny.rr.com>
Mime-Version: 1.0
X-Mailer: Evolution 2.32.3 
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


Ingo,

I've been passing this patch (one with benchmark code enabled) around,
and this has shown nice improvement in the scheduling of RT tasks. The
original code in cpupri, uses a vec->lock where there's one vec per RT
priority, and these locks are global. With the RCU threads and IRQ
threads being RT tasks of the same priority, this lock has been showing
up at the top of the contention list. It has been causing some severe
performance hits in some cases on large boxes.

I passed this patch around to various users that have access to
different boxes ranging from 4 to 64 CPUs. In every case, this patch
showed a significant improvement, as it replaces the global vec->lock
with memory barriers. The added measurement tests is not the only
improvements, but so are jitter tests and cyclictest have improved when
this patch (without the benchmark recording) has been applied.

This patch is in my RT git repo based off of v3.0.


Please pull the latest tip/sched/cpupri tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt.git
tip/sched/cpupri

Head SHA1: 84d342f89dd0bf7d9a01c9802021f01a5ff1c453


Steven Rostedt (1):
      sched/cpupri: Remove the vec->lock

----
 kernel/sched_cpupri.c |   62 ++++++++++++++++++++++++++++++------------------
 kernel/sched_cpupri.h |    5 +--
 2 files changed, 41 insertions(+), 26 deletions(-)
---------------------------
commit 84d342f89dd0bf7d9a01c9802021f01a5ff1c453
Author: Steven Rostedt <srostedt@redhat.com>
Date:   Mon Aug 1 13:20:05 2011 -0400

    sched/cpupri: Remove the vec->lock
    
    The cpupri vec->lock has been showing up as a top contention
    lately. This is because of the RT push/pull logic takes an
    agressive approach for migrating RT tasks. The cpupri logic is
    in place to improve the performance of the push/pull when dealing
    with large number CPU machines.
    
    The problem though is a vec->lock is required, where a vec is a
    global per RT priority structure. That is, if there are lots of
    RT tasks at the same priority, every time they are added or removed
    from the RT queue, this global vec->lock is taken. Now that more
    kernel threads are becoming RT (RCU boost and threaded interrupts)
    this is becoming much more of an issue.
    
    There are two variables that are being synced by the vec->lock.
    The cpupri bitmask, and the vec->counter. The cpupri bitmask
    is one bit per priority. If a RT priority vec has a process queued,
    then the vec->count is > 0 and the cpupri bitmask is set for that
    RT priority.
    
    If the cpupri bitmask gets out of sync with the vec->counter, we could
    end up pushing a low proirity RT task to a high priority queue.
    That RT task that could have run immediately could be queued on a
    run queue with a higher priority task indefinitely.
    
    The solution is not to use the cpupri bitmask and just look at the
    vec->count directly when doing a pull. The cpupri bitmask is just
    a fast way to scan the RT priorities when a pull is made. Instead
    of using the bitmask, and just examine all RT priorities, and
    look at the vec->counts, we could eliminate the vec->lock. The
    scan of RT tasks is to find a run queue that we can push an RT task
    to, and we do not push to a high priority queue, thus the scan only
    needs to go from 1 to RT task->prio, and not all 100 RT priorities.
    
    The push algorithm, which does the scan of RT priorities (and
    scan of the bitmask) only happens when we have an overloaded RT run
    queue (more than one RT task queued). The grabbing of the vec->lock
    happens every time any RT task is queued or dequeued on the run
    queue for that priority. The slowing down of the scan by not using
    a bitmask is negligible by the speed up of removing the vec->lock
    contention, and replacing it with an atomic counter and memory barrier.
    
    To prove this, I wrote a patch that times both the loop and the code
    that grabs the vec->locks. I passed the patches to various people
    (and companies) to test and show the results. I let everyone choose
    their own load to test, giving different loads on the system,
    for various different setups.
    
    Here's some of the results: (snipping to a few CPUs to not make
    this change log huge, but the results were consistent across
    the entire system).
    
    System 1 (24 CPUs)
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    [...]
    cpu 20: loop    3057    1.766   0.061   0.642   1963.170
            vec     6782949 90.469  0.089   0.414   2811760.503
    cpu 21: loop    2617    1.723   0.062   0.641   1679.074
            vec     6782810 90.499  0.089   0.291   1978499.900
    cpu 22: loop    2212    1.863   0.063   0.699   1547.160
            vec     6767244 85.685  0.089   0.435   2949676.898
    cpu 23: loop    2320    2.013   0.062   0.594   1380.265
            vec     6781694 87.923  0.088   0.431   2928538.224
    
    After patch:
    cpu 20: loop    2078    1.579   0.061   0.533   1108.006
            vec     6164555 5.704   0.060   0.143   885185.809
    cpu 21: loop    2268    1.712   0.065   0.575   1305.248
            vec     6153376 5.558   0.060   0.187   1154960.469
    cpu 22: loop    1542    1.639   0.095   0.533   823.249
            vec     6156510 5.720   0.060   0.190   1172727.232
    cpu 23: loop    1650    1.733   0.068   0.545   900.781
            vec     6170784 5.533   0.060   0.167   1034287.953
    
    All times are in microseconds. The 'loop' is the amount of time spent
    doing the loop across the priorities (before patch uses bitmask).
    the 'vec' is the amount of time in the code that requires grabbing
    the vec->lock. The second patch just does not have the vec lock, but
    encompasses the same code.
    
    Amazingly the loop code even went down on average. The vec code went
    from .5 down to .18, that's more than half the time spent!
    
    Note, more than one test was run, but they all had the same results.
    
    System 2 (64 CPUs)
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 60: loop    0       0       0       0       0
            vec     5410840 277.954 0.084   0.782   4232895.727
    cpu 61: loop    0       0       0       0       0
            vec     4915648 188.399 0.084   0.570   2803220.301
    cpu 62: loop    0       0       0       0       0
            vec     5356076 276.417 0.085   0.786   4214544.548
    cpu 63: loop    0       0       0       0       0
            vec     4891837 170.531 0.085   0.799   3910948.833
    
    After patch:
    cpu 60: loop    0       0       0       0       0
            vec     5365118 5.080   0.021   0.063   340490.267
    cpu 61: loop    0       0       0       0       0
            vec     4898590 1.757   0.019   0.071   347903.615
    cpu 62: loop    0       0       0       0       0
            vec     5737130 3.067   0.021   0.119   687108.734
    cpu 63: loop    0       0       0       0       0
            vec     4903228 1.822   0.021   0.071   348506.477
    
    The test run during the measurement did not have any (very few,
    from other CPUs) RT tasks pushing. But this shows that it helped
    out tremendously with the contention, as the contention happens
    because the vec->lock is taken only on queuing at an RT priority,
    and different CPUs that queue tasks at the same priority will
    have contention.
    
    I tested on my own 4 CPU machine with the following results:
    
    Before patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 0:  loop    2377    1.489   0.158   0.588   1398.395
            vec     4484    770.146 2.301   4.396   19711.755
    cpu 1:  loop    2169    1.962   0.160   0.576   1250.110
            vec     4425    152.769 2.297   4.030   17834.228
    cpu 2:  loop    2324    1.749   0.155   0.559   1299.799
            vec     4368    779.632 2.325   4.665   20379.268
    cpu 3:  loop    2325    1.629   0.157   0.561   1306.113
            vec     4650    408.782 2.394   4.348   20222.577
    
    After patch:
    CPU:    Name    Count   Max     Min     Average Total
    ----    ----    -----   ---     ---     ------- -----
    cpu 0:  loop    2121    1.616   0.113   0.636   1349.189
            vec     4303    1.151   0.225   0.421   1811.966
    cpu 1:  loop    2130    1.638   0.178   0.644   1372.927
            vec     4627    1.379   0.235   0.428   1983.648
    cpu 2:  loop    2056    1.464   0.165   0.637   1310.141
            vec     4471    1.311   0.217   0.433   1937.927
    cpu 3:  loop    2154    1.481   0.162   0.601   1295.083
            vec     4236    1.253   0.230   0.425   1803.008
    
    This was running my migrate.c code that can be found at:
    http://lwn.net/Articles/425763/
    
    The migrate code does stress the RT tasks a bit. This shows that
    the loop did increase a little after the patch, but not by much.
    The vec code dropped dramatically. From 4.3us down to .42us.
    That's a 10x improvement!
    
    Tested-by: Mike Galbraith <mgalbraith@suse.de>
    Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
    Tested-by: Matthew Hank Sabins<msabins@linux.vnet.ibm.com>
    Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 2722dc1..7761a26 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -47,9 +47,6 @@ static int convert_prio(int prio)
 	return cpupri;
 }
 
-#define for_each_cpupri_active(array, idx)                    \
-	for_each_set_bit(idx, array, CPUPRI_NR_PRIORITIES)
-
 /**
  * cpupri_find - find the best (lowest-pri) CPU in the system
  * @cp: The cpupri context
@@ -71,11 +68,33 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 	int                  idx      = 0;
 	int                  task_pri = convert_prio(p->prio);
 
-	for_each_cpupri_active(cp->pri_active, idx) {
+	if (task_pri >= MAX_RT_PRIO)
+		return 0;
+
+	for (idx = 0; idx < task_pri; idx++) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
 
-		if (idx >= task_pri)
-			break;
+		if (!atomic_read(&(vec)->count))
+			continue;
+		/*
+		 * When looking at the vector, we need to read the counter,
+		 * do a memory barrier, then read the mask.
+		 *
+		 * Note: This is still all racey, but we can deal with it.
+		 *  Ideally, we only want to look at masks that are set.
+		 *
+		 *  If a mask is not set, then the only thing wrong is that we
+		 *  did a little more work than necessary.
+		 *
+		 *  If we read a zero count but the mask is set, because of the
+		 *  memory barriers, that can only happen when the highest prio
+		 *  task for a run queue has left the run queue, in which case,
+		 *  it will be followed by a pull. If the task we are processing
+		 *  fails to find a proper place to go, that pull request will
+		 *  pull this task if the run queue is running at a lower
+		 *  priority.
+		 */
+		smp_rmb();
 
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
@@ -115,7 +134,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 {
 	int                 *currpri = &cp->cpu_to_pri[cpu];
 	int                  oldpri  = *currpri;
-	unsigned long        flags;
 
 	newpri = convert_prio(newpri);
 
@@ -134,26 +152,25 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
 
-		raw_spin_lock_irqsave(&vec->lock, flags);
-
 		cpumask_set_cpu(cpu, vec->mask);
-		vec->count++;
-		if (vec->count == 1)
-			set_bit(newpri, cp->pri_active);
-
-		raw_spin_unlock_irqrestore(&vec->lock, flags);
+		/*
+		 * When adding a new vector, we update the mask first,
+		 * do a write memory barrier, and then update the count, to
+		 * make sure the vector is visible when count is set.
+		 */
+		smp_wmb();
+		atomic_inc(&(vec)->count);
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
 
-		raw_spin_lock_irqsave(&vec->lock, flags);
-
-		vec->count--;
-		if (!vec->count)
-			clear_bit(oldpri, cp->pri_active);
+		/*
+		 * When removing from the vector, we decrement the counter first
+		 * do a memory barrier and then clear the mask.
+		 */
+		atomic_dec(&(vec)->count);
+		smp_wmb();
 		cpumask_clear_cpu(cpu, vec->mask);
-
-		raw_spin_unlock_irqrestore(&vec->lock, flags);
 	}
 
 	*currpri = newpri;
@@ -175,8 +192,7 @@ int cpupri_init(struct cpupri *cp)
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
 
-		raw_spin_lock_init(&vec->lock);
-		vec->count = 0;
+		atomic_set(&vec->count, 0);
 		if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
 			goto cleanup;
 	}
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 9fc7d38..6b4cd17 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -12,9 +12,8 @@
 /* values 2-101 are RT priorities 0-99 */
 
 struct cpupri_vec {
-	raw_spinlock_t lock;
-	int        count;
-	cpumask_var_t mask;
+	atomic_t	count;
+	cpumask_var_t	mask;
 };
 
 struct cpupri {