RFC [patch] sched: strengthen LAST_BUDDY and minimize buddy induced latencies V3

Message ID 1255775063.32691.5.camel@marge.simson.net
State: New, archived
Series: RFC [patch] sched: strengthen LAST_BUDDY and minimize buddy induced latencies V3

Commit Message

Mike Galbraith Oct. 17, 2009, 10:24 a.m. UTC
sched: strengthen LAST_BUDDY and minimize buddy induced latencies.

This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
from collapsing due to wakeup preemption.  It also minimizes buddy induced
latencies.  x264 testcase spawns new worker threads at a high rate, and was
being affected badly by NEXT_BUDDY.  It turned out that CACHE_HOT_BUDDY was
thwarting idle balancing.  This patch ensures that the load can disperse,
and that buddies can't make any task excessively late.
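
Concretely, the "excessively late" cap, as implemented in the
pick_next_entity() hunk of the Patch section below: the leftmost entity is
picked regardless of any buddy once

	se->vruntime + 2 * sysctl_sched_min_granularity <= cfs_rq->min_vruntime

i.e. once it trails the queue's min_vruntime by more than two
minimum-granularity slices.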

Some numbers for v2.6.32-rc4-1600-g0786aa4:

vmark
tip           108841 messages per second 
tip+          116617 messages per second

tbench 8
tip           938.421 MB/sec 8 procs
tip+          948.408 MB/sec 8 procs

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip           9999.36   18493.54   34652.91   34253.13   32057.64   30297.43   28300.96   25450.14   20675.99
tip+         10054.16   18275.67   34799.62   33561.74   32633.54   31584.56   29861.57   26929.84   22450.29

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13577.63   26510.67   51871.05   51374.62   50190.69   45494.64   37173.83   27767.09   22795.23
tip+         13671.69   26586.23   51766.85   51464.36   50459.22   49637.46   48678.73   47127.42   44994.69

x264.sh 8
tip          366.80 fps  +NEXT_BUDDY: 274.15  -NEXT_BUDDY -START_DEBIT: 396.77  +NEXT_BUDDY -START_DEBIT: 263.45
tip+         373.23 fps  +NEXT_BUDDY: 369.73  -NEXT_BUDDY -START_DEBIT: 404.57  +NEXT_BUDDY -START_DEBIT: 401.57


Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

---
 kernel/sched.c      |    4 ++++
 kernel/sched_fair.c |   35 +++++++++++++++++++++--------------
 2 files changed, 25 insertions(+), 14 deletions(-)




Comments

Peter Zijlstra Oct. 20, 2009, 4:24 a.m. UTC | #1
On Sat, 2009-10-17 at 12:24 +0200, Mike Galbraith wrote:
> sched: strengthen LAST_BUDDY and minimize buddy induced latencies.
> 
> This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
> from collapsing due to wakeup preemption.  It also minimizes buddy induced
> latencies.  x264 testcase spawns new worker threads at a high rate, and was
> being affected badly by NEXT_BUDDY.  It turned out that CACHE_HOT_BUDDY was
> thwarting idle balancing.  This patch ensures that the load can disperse,
> and that buddies can't make any task excessively late.

> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now,
>  
>  	/*
>  	 * Buddy candidates are cache hot:
> +	 *
> +	 * Do not honor buddies if there may be nothing else to
> +	 * prevent us from becoming idle.
>  	 */
>  	if (sched_feat(CACHE_HOT_BUDDY) &&
> +			task_rq(p)->nr_running >= sched_nr_latency &&
>  			(&p->se == cfs_rq_of(&p->se)->next ||
>  			 &p->se == cfs_rq_of(&p->se)->last))
>  		return 1;

I'm not sure about this. The sched_nr_latency seems arbitrary, 1 seems
like a more natural boundary.
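
For concreteness, the quoted gate with that boundary would read something
like this (a sketch; the exact comparison is the judgment call):

	if (sched_feat(CACHE_HOT_BUDDY) &&
			task_rq(p)->nr_running > 1 &&
			(&p->se == cfs_rq_of(&p->se)->next ||
			 &p->se == cfs_rq_of(&p->se)->last))
		return 1;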

Also, one thing that arjan found was that we don't need to consider
buddies cache hot if we're migrating them within a cache domain. So we
need to add a SD_flag and sched_domain to properly represent the cache
hierarchy.
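
One shape such a check could take (a sketch only; the helper name is
hypothetical, and reusing the existing SD_SHARE_PKG_RESOURCES flag as the
shared-cache marker is an assumption, not a decision from this thread):

	/*
	 * Hypothetical helper: 1 if both CPUs sit under a common domain
	 * whose CPUs share package resources (e.g. a cache).  Migrating
	 * a buddy within such a domain needn't count as losing warmth.
	 */
	static int cpus_share_cache(int this_cpu, int that_cpu)
	{
		struct sched_domain *sd;

		for_each_domain(this_cpu, sd) {
			if ((sd->flags & SD_SHARE_PKG_RESOURCES) &&
			    cpumask_test_cpu(that_cpu, sched_domain_span(sd)))
				return 1;
		}
		return 0;
	}

task_hot() could then skip the buddy shortcut whenever the proposed
migration stays inside such a domain.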

Mike Galbraith Oct. 20, 2009, 5:03 a.m. UTC | #2
On Tue, 2009-10-20 at 06:24 +0200, Peter Zijlstra wrote:
> On Sat, 2009-10-17 at 12:24 +0200, Mike Galbraith wrote:
> > sched: strengthen LAST_BUDDY and minimize buddy induced latencies.
> > 
> > This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
> > from collapsing due to wakeup preemption.  It also minimizes buddy induced
> > latencies.  x264 testcase spawns new worker threads at a high rate, and was
> > being affected badly by NEXT_BUDDY.  It turned out that CACHE_HOT_BUDDY was
> > thwarting idle balancing.  This patch ensures that the load can disperse,
> > and that buddies can't make any task excessively late.
> 
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now,
> >  
> >  	/*
> >  	 * Buddy candidates are cache hot:
> > +	 *
> > +	 * Do not honor buddies if there may be nothing else to
> > +	 * prevent us from becoming idle.
> >  	 */
> >  	if (sched_feat(CACHE_HOT_BUDDY) &&
> > +			task_rq(p)->nr_running >= sched_nr_latency &&
> >  			(&p->se == cfs_rq_of(&p->se)->next ||
> >  			 &p->se == cfs_rq_of(&p->se)->last))
> >  		return 1;
> 
> I'm not sure about this. The sched_nr_latency seems arbitrary, 1 seems
> like a more natural boundary.

That's what I did first, which of course worked fine.

What I'm thinking of doing instead though is to specifically target the
only time I see the problem, ie fork/exec load wanting to disperse.  I
don't really want to see buddies being ripped away from their cache.
But as you note below, that can be a good thing iff it lands on a shared
cache. In my case, there's a 1 in 3 chance of safe landing.
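
A hypothetical shape for that targeting (illustrative only; keying on
sum_exec_runtime and the migration-cost threshold are assumptions, not
something posted in this thread):

	/*
	 * Keep established buddies cache hot, but let barely-run tasks
	 * (fresh fork/exec load wanting to disperse) migrate freely.
	 */
	if (sched_feat(CACHE_HOT_BUDDY) &&
			(&p->se == cfs_rq_of(&p->se)->next ||
			 &p->se == cfs_rq_of(&p->se)->last)) {
		if (p->se.sum_exec_runtime > sysctl_sched_migration_cost)
			return 1;
		/* else fall through to task_hot()'s normal cost check */
	}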

> Also, one thing that arjan found was that we don't need to consider
> buddies cache hot if we're migrating them within a cache domain. So we
> need to add a SD_flag and sched_domain to properly represent the cache
> hierarchy.

Yeah, I thought about this too.  If there's any overlap time, waking CPU
affine is a loser if there's an idle shared cache next door.

	-Mike

Mike Galbraith Oct. 20, 2009, 2:28 p.m. UTC | #3
On Tue, 2009-10-20 at 06:24 +0200, Peter Zijlstra wrote:
> On Sat, 2009-10-17 at 12:24 +0200, Mike Galbraith wrote:
> > sched: strengthen LAST_BUDDY and minimize buddy induced latencies.
> > 
> > This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
> > from collapsing due to wakeup preemption.  It also minimizes buddy induced
> > latencies.  x264 testcase spawns new worker threads at a high rate, and was
> > being affected badly by NEXT_BUDDY.  It turned out that CACHE_HOT_BUDDY was
> > thwarting idle balancing.  This patch ensures that the load can disperse,
> > and that buddies can't make any task excessively late.
> 
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now,
> >  
> >  	/*
> >  	 * Buddy candidates are cache hot:
> > +	 *
> > +	 * Do not honor buddies if there may be nothing else to
> > +	 * prevent us from becoming idle.
> >  	 */
> >  	if (sched_feat(CACHE_HOT_BUDDY) &&
> > +			task_rq(p)->nr_running >= sched_nr_latency &&
> >  			(&p->se == cfs_rq_of(&p->se)->next ||
> >  			 &p->se == cfs_rq_of(&p->se)->last))
> >  		return 1;
> 
> I'm not sure about this. The sched_nr_latency seems arbitrary, 1 seems
> like a more natural boundary.

How about the below?  I started thinking about vmark et al, and
figured I'd try taking LAST_BUDDY a bit further, ie try even harder to
give the CPU back to a preempted task so it can go on its merry way
rightward.  Vmark likes the idea, as does mysql+oltp, and of course
pgsql+oltp is happier (preempt userland spinlock holder -> welcome to pain).

That weird little dip right after the mysql+oltp peak is still present, and
I don't understand why.  I've squabbled with that bugger before.

Full retest (pulled tip v2.6.32-rc5-1497-ga525b32)

vmark
tip           108466 messages per second 
tip++         121151 messages per second
               1.116
              
mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip           9821.62   18573.65   34757.38   34313.31   32144.12   30654.29   28310.89   25027.35   19558.34
              9862.92   18561.28   34822.03   34576.43   32971.17   30845.74   28290.78   25051.09   19473.82
             10165.14   18935.68   34824.31   34490.38   32933.35   30797.89   28314.15   25100.49   19612.10
tip avg       9949.89   18690.20   34801.24   34460.04   32682.88   30765.97   28305.27   25059.64   19548.08
             
tip+         10206.95   18661.99   34808.03   33735.84   32939.46   31613.18   29994.18   27293.44   22846.26
              9884.26   18652.53   35136.57   34090.69   32953.83   31699.69   30073.19   27242.16   22772.26
              9885.20   18774.23   35166.59   34034.52   33015.85   31726.04   30144.69   27239.97   22750.68
tip+ avg      9992.13   18696.25   35037.06   33953.68   32969.71   31679.63   30070.68   27258.52   22789.73
                1.004      1.000      1.006       .985      1.008      1.029      1.062      1.087      1.165

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          13686.37   26609.25   51934.28   51347.81   49479.51   45312.65   36691.91   26851.57   24145.35
tip++        13675.11   26591.73   51882.93   51618.99   50681.77   49592.17   48893.15   47374.94   45417.42
                 .999       .999       .999      1.005      1.024      1.094      1.332      1.764      1.881

sched: strengthen LAST_BUDDY and minimize buddy induced latencies.

This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
from collapsing due to wakeup preemption.  It also switches LAST_BUDDY to
do what it does best, namely mitigate the effects of aggressive preemption,
which improves vmark throughput markedly.

Last hunk is to prevent buddies from stymieing BALANCE_NEWIDLE.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

---
 kernel/sched.c      |    2 +-
 kernel/sched_fair.c |   49 ++++++++++++++++++++++++-------------------------
 2 files changed, 25 insertions(+), 26 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -861,21 +861,17 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
-	struct sched_entity *buddy;
 
-	if (cfs_rq->next) {
-		buddy = cfs_rq->next;
-		cfs_rq->next = NULL;
-		if (wakeup_preempt_entity(buddy, se) < 1)
-			return buddy;
-	}
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
+		se = cfs_rq->next;
 
-	if (cfs_rq->last) {
-		buddy = cfs_rq->last;
-		cfs_rq->last = NULL;
-		if (wakeup_preempt_entity(buddy, se) < 1)
-			return buddy;
-	}
+	/*
+	 * Prefer last buddy, try to return the CPU to a preempted task.
+	 */
+	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
+		se = cfs_rq->last;
+
+	clear_buddies(cfs_rq, se);
 
 	return se;
 }
@@ -1591,17 +1587,6 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
-	/*
-	 * Only set the backward buddy when the current task is still on the
-	 * rq. This can happen when a wakeup gets interleaved with schedule on
-	 * the ->pre_schedule() or idle_balance() point, either of which can
-	 * drop the rq lock.
-	 *
-	 * Also, during early boot the idle thread is in the fair class, for
-	 * obvious reasons its a bad idea to schedule back to the idle thread.
-	 */
-	if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
-		set_last_buddy(se);
 	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK))
 		set_next_buddy(pse);
 
@@ -1648,8 +1633,22 @@ static void check_preempt_wakeup(struct
 
 	BUG_ON(!pse);
 
-	if (wakeup_preempt_entity(se, pse) == 1)
+	if (wakeup_preempt_entity(se, pse) == 1) {
 		resched_task(curr);
+		/*
+		 * Only set the backward buddy when the current task is still
+		 * on the rq. This can happen when a wakeup gets interleaved
+		 * with schedule on the ->pre_schedule() or idle_balance()
+		 * point, either of which can drop the rq lock.
+		 *
+		 * Also, during early boot the idle thread is in the fair class,
+		 * for obvious reasons it's a bad idea to schedule back to it.
+		 */
+		if (unlikely(!se->on_rq || curr == rq->idle))
+			return;
+		if (sched_feat(LAST_BUDDY))
+			set_last_buddy(se);
+	}
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2008,7 +2008,7 @@ task_hot(struct task_struct *p, u64 now,
 	/*
 	 * Buddy candidates are cache hot:
 	 */
-	if (sched_feat(CACHE_HOT_BUDDY) &&
+	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
 			(&p->se == cfs_rq_of(&p->se)->next ||
 			 &p->se == cfs_rq_of(&p->se)->last))
 		return 1;


Mike Galbraith Oct. 20, 2009, 7 p.m. UTC | #4
On Tue, 2009-10-20 at 16:28 +0200, Mike Galbraith wrote: 
> On Tue, 2009-10-20 at 06:24 +0200, Peter Zijlstra wrote:
> > On Sat, 2009-10-17 at 12:24 +0200, Mike Galbraith wrote:
> > > sched: strengthen LAST_BUDDY and minimize buddy induced latencies.
> > > 
> > > This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp
> > > from collapsing due to wakeup preemption.  It also minimizes buddy induced
> > > latencies.  x264 testcase spawns new worker threads at a high rate, and was
> > > being affected badly by NEXT_BUDDY.  It turned out that CACHE_HOT_BUDDY was
> > > thwarting idle balancing.  This patch ensures that the load can disperse,
> > > and that buddies can't make any task excessively late.
> > 
> > > Index: linux-2.6/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/sched.c
> > > +++ linux-2.6/kernel/sched.c
> > > @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now,
> > >  
> > >  	/*
> > >  	 * Buddy candidates are cache hot:
> > > +	 *
> > > +	 * Do not honor buddies if there may be nothing else to
> > > +	 * prevent us from becoming idle.
> > >  	 */
> > >  	if (sched_feat(CACHE_HOT_BUDDY) &&
> > > +			task_rq(p)->nr_running >= sched_nr_latency &&
> > >  			(&p->se == cfs_rq_of(&p->se)->next ||
> > >  			 &p->se == cfs_rq_of(&p->se)->last))
> > >  		return 1;
> > 
> > I'm not sure about this. The sched_nr_latency seems arbitrary, 1 seems
> > like a more natural boundary.
> 
> How about the below?  I started thinking about vmark et al, and
> figured I'd try taking LAST_BUDDY a bit further, ie try even harder to
> give the CPU back to a preempted task so it can go on its merry way
> rightward.  Vmark likes the idea, as does mysql+oltp, and of course
> pgsql+oltp is happier (preempt userland spinlock holder -> welcome to pain).
> 
> That weird little dip right after the mysql+oltp peak is still present, and
> I don't understand why.  I've squabbled with that bugger before.
> 
> Full retest (pulled tip v2.6.32-rc5-1497-ga525b32)
> 
> vmark
> tip           108466 messages per second 
> tip++         121151 messages per second

This patchlet, unlike the one I showed you and Ingo offline, also passed
interactivity testing.

But...

It also displays this interesting (to me) property, as did the other,
which is why I'm now off to try the same with virgin source.

When running vmark with amarok playing (light perturbation), this is the
throughput.

142077 messages per second
140138 messages per second
140264 messages per second

That's three three-run averages.  Now, virgin tip.

112511 messages per second
112048 messages per second
115717 messages per second

((112511+112048+115717)/3)/108466 = 1.045
((142077+140138+140264)/3)/121151 = 1.162

Both kernels achieve better throughput with perturbation.

The unperturbed numbers are stable enough to pique my curiosity.
(theory baking, not gonna air it;)

	-Mike


Patch

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2007,8 +2007,12 @@  task_hot(struct task_struct *p, u64 now,
 
 	/*
 	 * Buddy candidates are cache hot:
+	 *
+	 * Do not honor buddies if there may be nothing else to
+	 * prevent us from becoming idle.
 	 */
 	if (sched_feat(CACHE_HOT_BUDDY) &&
+			task_rq(p)->nr_running >= sched_nr_latency &&
 			(&p->se == cfs_rq_of(&p->se)->next ||
 			 &p->se == cfs_rq_of(&p->se)->last))
 		return 1;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -861,21 +861,28 @@  wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
-	struct sched_entity *buddy;
+	u64 vruntime = se->vruntime + (2*sysctl_sched_min_granularity);
 
-	if (cfs_rq->next) {
-		buddy = cfs_rq->next;
-		cfs_rq->next = NULL;
-		if (wakeup_preempt_entity(buddy, se) < 1)
-			return buddy;
+	/*
+	 * Maybe it's a buddy, maybe not.  Who cares, it's late.. go now!
+	 */
+	if (unlikely(min_vruntime(vruntime, cfs_rq->min_vruntime) == vruntime)) {
+		clear_buddies(cfs_rq, se);
+		return se;
 	}
 
-	if (cfs_rq->last) {
-		buddy = cfs_rq->last;
-		cfs_rq->last = NULL;
-		if (wakeup_preempt_entity(buddy, se) < 1)
-			return buddy;
-	}
+	if (cfs_rq->next && se != cfs_rq->next && sched_feat(NEXT_BUDDY) &&
+			wakeup_preempt_entity(cfs_rq->next, se) < 1)
+		se = cfs_rq->next;
+
+	/*
+	 * Prefer last buddy, try to return the CPU to a preempted task.
+	 */
+	if (cfs_rq->last && se != cfs_rq->last && sched_feat(LAST_BUDDY) &&
+			wakeup_preempt_entity(cfs_rq->last, se) < 1)
+		se = cfs_rq->last;
+
+	clear_buddies(cfs_rq, se);
 
 	return se;
 }
@@ -1600,9 +1607,9 @@  static void check_preempt_wakeup(struct
 	 * Also, during early boot the idle thread is in the fair class, for
 	 * obvious reasons its a bad idea to schedule back to the idle thread.
 	 */
-	if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
+	if (!(wake_flags & WF_FORK) && likely(se->on_rq && curr != rq->idle))
 		set_last_buddy(se);
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK))
+	if (!(wake_flags & WF_FORK))
 		set_next_buddy(pse);
 
 	/*