* specjbb2005 and aim7 regression with 2.6.32-rc kernels
@ 2009-11-06  7:38 Zhang, Yanmin
  2009-11-06  8:02 ` Mike Galbraith
  2009-11-06  8:04 ` Ingo Molnar
  0 siblings, 2 replies; 9+ messages in thread

From: Zhang, Yanmin @ 2009-11-06  7:38 UTC (permalink / raw)
To: LKML, Ingo Molnar, Peter Zijlstra; +Cc: Mike Galbraith

Compared with 2.6.31, specjbb2005 and aim7 show some regressions with
2.6.32-rc kernels on Core 2 machines:

1) On the 4*4-core Tigerton, specjbb2005 has about a 5% regression.
2) On the 2*4 Stoakley, aim7 has about a 5% regression.

On Nehalem, specjbb2005 shows about a 2%~8% improvement instead of a
regression.

aim7 depends heavily on scheduler parameters such as sched_latency_ns,
sched_min_granularity_ns, and sched_wakeup_granularity_ns, and the
2.6.32-rc kernels decrease these parameter values. I restored them and
retested aim7 on Stoakley: the aim7 regression drops to about 2%, and
the specjbb2005 regression also drops to 2%. But on Nehalem, the
improvement shrinks.
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-06  7:38 specjbb2005 and aim7 regression with 2.6.32-rc kernels Zhang, Yanmin
@ 2009-11-06  8:02 ` Mike Galbraith
  2009-11-06  8:04 ` Ingo Molnar
  1 sibling, 0 replies; 9+ messages in thread

From: Mike Galbraith @ 2009-11-06  8:02 UTC (permalink / raw)
To: Zhang, Yanmin; +Cc: LKML, Ingo Molnar, Peter Zijlstra

On Fri, 2009-11-06 at 15:38 +0800, Zhang, Yanmin wrote:
> Compared with 2.6.31, specjbb2005 and aim7 show some regressions with
> 2.6.32-rc kernels on Core 2 machines:
>
> 1) On the 4*4-core Tigerton, specjbb2005 has about a 5% regression.
> 2) On the 2*4 Stoakley, aim7 has about a 5% regression.
>
> On Nehalem, specjbb2005 shows about a 2%~8% improvement instead of a
> regression.
>
> aim7 depends heavily on scheduler parameters such as sched_latency_ns,
> sched_min_granularity_ns, and sched_wakeup_granularity_ns, and the
> 2.6.32-rc kernels decrease these parameter values. I restored them and
> retested aim7 on Stoakley: the aim7 regression drops to about 2%, and
> the specjbb2005 regression also drops to 2%. But on Nehalem, the
> improvement shrinks.

Yeah, that's the price of lower latency. We may want to tweak the
big-machine setup a little.

Be advised that there's a locking problem which appears to be skewing
benchmark results somewhat. I've got a tentative fix, but I don't think
it's quite enough. (I haven't looked yet at what protects cpus_allowed,
so I'm not sure yet.) Just wanted to let you know, lest your testing
time produce somewhat skewed results; you may want to hold off a little
bit. (Your testing time is much appreciated, and I don't want to waste
a single drop ;)

	-Mike
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-06  7:38 specjbb2005 and aim7 regression with 2.6.32-rc kernels Zhang, Yanmin
  2009-11-06  8:02 ` Mike Galbraith
@ 2009-11-06  8:04 ` Ingo Molnar
  2009-11-09  6:19   ` Zhang, Yanmin
  1 sibling, 1 reply; 9+ messages in thread

From: Ingo Molnar @ 2009-11-06  8:04 UTC (permalink / raw)
To: Zhang, Yanmin; +Cc: LKML, Peter Zijlstra, Mike Galbraith

* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> Compared with 2.6.31, specjbb2005 and aim7 show some regressions with
> 2.6.32-rc kernels on Core 2 machines:
>
> 1) On the 4*4-core Tigerton, specjbb2005 has about a 5% regression.
> 2) On the 2*4 Stoakley, aim7 has about a 5% regression.
>
> On Nehalem, specjbb2005 shows about a 2%~8% improvement instead of a
> regression.
>
> aim7 depends heavily on scheduler parameters such as sched_latency_ns,
> sched_min_granularity_ns, and sched_wakeup_granularity_ns, and the
> 2.6.32-rc kernels decrease these parameter values. I restored them and
> retested aim7 on Stoakley: the aim7 regression drops to about 2%, and
> the specjbb2005 regression also drops to 2%. But on Nehalem, the
> improvement shrinks.

Which precise 2.6.32-rc commit have you tested?

Since v2.6.32-rc6, Linus's tree also has this one:

  f685cea: sched: Strengthen buddies and mitigate buddy induced latencies

which should improve things a bit. For 2.6.33 we have queued up these
two in -tip:

  a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
  1b9508f: sched: Rate-limit newidle

If any of them fixes a performance regression, we could still merge
them into 2.6.32 as well.

	Ingo
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-06  8:04 ` Ingo Molnar
@ 2009-11-09  6:19 ` Zhang, Yanmin
  2009-11-09  7:09   ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread

From: Zhang, Yanmin @ 2009-11-09  6:19 UTC (permalink / raw)
To: Ingo Molnar; +Cc: LKML, Peter Zijlstra, Mike Galbraith

On Fri, 2009-11-06 at 09:04 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
>
> > Compared with 2.6.31, specjbb2005 and aim7 show some regressions
> > with 2.6.32-rc kernels on Core 2 machines:
> >
> > 1) On the 4*4-core Tigerton, specjbb2005 has about a 5% regression.
> > 2) On the 2*4 Stoakley, aim7 has about a 5% regression.
> >
> > On Nehalem, specjbb2005 shows about a 2%~8% improvement instead of
> > a regression.
> >
> > aim7 depends heavily on scheduler parameters such as
> > sched_latency_ns, sched_min_granularity_ns, and
> > sched_wakeup_granularity_ns, and the 2.6.32-rc kernels decrease
> > these parameter values. I restored them and retested aim7 on
> > Stoakley: the aim7 regression drops to about 2%, and the
> > specjbb2005 regression also drops to 2%. But on Nehalem, the
> > improvement shrinks.
>
> Which precise 2.6.32-rc commit have you tested?
>
> Since v2.6.32-rc6, Linus's tree also has this one:
>
>   f685cea: sched: Strengthen buddies and mitigate buddy induced latencies
>
> which should improve things a bit. For 2.6.33 we have queued up these
> two in -tip:
>
>   a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
>   1b9508f: sched: Rate-limit newidle
>
> If any of them fixes a performance regression, we could still merge
> them into 2.6.32 as well.

1b9508f definitely fixes the netperf UDP loopback regression.

Yanmin
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-09  6:19 ` Zhang, Yanmin
@ 2009-11-09  7:09 ` Mike Galbraith
  2009-11-09  9:15 ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread

From: Mike Galbraith @ 2009-11-09  7:09 UTC (permalink / raw)
To: Zhang, Yanmin; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Mon, 2009-11-09 at 14:19 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-11-06 at 09:04 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> >
> > > Compared with 2.6.31, specjbb2005 and aim7 show some regressions
> > > with 2.6.32-rc kernels on Core 2 machines:
> > >
> > > 1) On the 4*4-core Tigerton, specjbb2005 has about a 5% regression.
> > > 2) On the 2*4 Stoakley, aim7 has about a 5% regression.
> > >
> > > On Nehalem, specjbb2005 shows about a 2%~8% improvement instead of
> > > a regression.
> > >
> > > aim7 depends heavily on scheduler parameters such as
> > > sched_latency_ns, sched_min_granularity_ns, and
> > > sched_wakeup_granularity_ns, and the 2.6.32-rc kernels decrease
> > > these parameter values. I restored them and retested aim7 on
> > > Stoakley: the aim7 regression drops to about 2%, and the
> > > specjbb2005 regression also drops to 2%. But on Nehalem, the
> > > improvement shrinks.
> >
> > Which precise 2.6.32-rc commit have you tested?
> >
> > Since v2.6.32-rc6, Linus's tree also has this one:
> >
> >   f685cea: sched: Strengthen buddies and mitigate buddy induced latencies
> >
> > which should improve things a bit. For 2.6.33 we have queued up these
> > two in -tip:
> >
> >   a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
> >   1b9508f: sched: Rate-limit newidle
> >
> > If any of them fixes a performance regression, we could still merge
> > them into 2.6.32 as well.
>
> 1b9508f definitely fixes the netperf UDP loopback regression.

Excellent, I was pretty darn sure it would.

When I (well, more likely Peter ;) get all the necessary barriers in
place, all should be well for the 32 release, scheduler-wise.

With the minimal test diff against virgin mainline below, my enumeration
woes are gone, I'm within 2% of 31 on extreme context-switch pinned
tasks, and within .7% comparing locked vs. unlocked runqueue selection.

(Now to wander over to cpumask.[ch] before Peter wakes up, and see if I
can close the holes without killing performance ;)

diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..837ab30 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -590,6 +590,8 @@ struct rq {
 	u64 rt_avg;
 	u64 age_stamp;
+	u64 idle_stamp;
+	u64 avg_idle;
 #endif

 	/* calc_load related fields */
@@ -1009,6 +1011,24 @@ static struct rq *this_rq_lock(void)
 	return rq;
 }

+/*
+ * cpu_rq_lock - lock the runqueue of a given cpu and disable interrupts.
+ */
+static struct rq *cpu_rq_lock(int cpu, unsigned long *flags)
+	__acquires(rq->lock)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	spin_lock_irqsave(&rq->lock, *flags);
+	return rq;
+}
+
+static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
+	__releases(rq->lock)
+{
+	spin_unlock_irqrestore(&rq->lock, *flags);
+}
+
 #ifdef CONFIG_SCHED_HRTICK
 /*
  * Use HR-timers to deliver accurate preemption points.
@@ -2372,16 +2392,19 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 	if (task_contributes_to_load(p))
 		rq->nr_uninterruptible--;
 	p->state = TASK_WAKING;
+	preempt_disable();
 	task_rq_unlock(rq, &flags);

 	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
-		set_task_cpu(p, cpu);
-
-	rq = task_rq_lock(p, &flags);
-
-	if (rq != orig_rq)
+	if (cpu != orig_cpu) {
+		task_rq_unlock(rq, &flags);
+		rq = cpu_rq_lock(cpu, &flags);
 		update_rq_clock(rq);
+		set_task_cpu(p, cpu);
+	}
+	preempt_enable_no_resched();

 	WARN_ON(p->state != TASK_WAKING);
 	cpu = task_cpu(p);
@@ -2439,6 +2462,17 @@ out_running:
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
 		p->sched_class->task_wake_up(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
 #endif
 out:
 	task_rq_unlock(rq, &flags);
@@ -4125,7 +4159,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	unsigned long flags;
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

+	smp_read_barrier_depends();
 	cpumask_setall(cpus);
+	cpumask_and(cpus, cpus, cpu_online_mask);

 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -4288,7 +4324,9 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
 	int all_pinned = 0;
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

+	smp_read_barrier_depends();
 	cpumask_setall(cpus);
+	cpumask_and(cpus, cpus, cpu_online_mask);

 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -4428,6 +4466,11 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;

+	this_rq->idle_stamp = this_rq->clock;
+
+	if (this_rq->avg_idle < sysctl_sched_migration_cost)
+		return;
+
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -4442,8 +4485,10 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task)
+		if (pulled_task) {
+			this_rq->idle_stamp = 0;
 			break;
+		}
 	}
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
@@ -7054,6 +7099,7 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 	int ret = 0;

 	rq = task_rq_lock(p, &flags);
+	smp_read_barrier_depends();
 	if (!cpumask_intersects(new_mask, cpu_online_mask)) {
 		ret = -EINVAL;
 		goto out;
@@ -7065,11 +7111,13 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 		goto out;
 	}

-	if (p->sched_class->set_cpus_allowed)
+	if (p->sched_class->set_cpus_allowed) {
 		p->sched_class->set_cpus_allowed(p, new_mask);
-	else {
+		smp_wmb();
+	} else {
 		cpumask_copy(&p->cpus_allowed, new_mask);
 		p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+		smp_wmb();
 	}

 	/* Can the task run on the task's current CPU? If so, we're done */
@@ -7166,6 +7214,7 @@ static int migration_thread(void *data)
 		struct list_head *head;

 		spin_lock_irq(&rq->lock);
+		smp_read_barrier_depends();

 		if (cpu_is_offline(cpu)) {
 			spin_unlock_irq(&rq->lock);
@@ -7571,6 +7620,7 @@ static void set_rq_online(struct rq *rq)
 		const struct sched_class *class;

 		cpumask_set_cpu(rq->cpu, rq->rd->online);
+		smp_wmb();
 		rq->online = 1;

 		for_each_class(class) {
@@ -7591,6 +7641,7 @@ static void set_rq_offline(struct rq *rq)
 		}

 		cpumask_clear_cpu(rq->cpu, rq->rd->online);
+		smp_wmb();
 		rq->online = 0;
 	}
 }
@@ -9521,6 +9572,8 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq->online = 0;
 		rq->migration_thread = NULL;
+		rq->idle_stamp = 0;
+		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		INIT_LIST_HEAD(&rq->migration_queue);
 		rq_attach_root(rq, &def_root_domain);
 #endif
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index efb8440..dd304a9 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -285,12 +285,14 @@ static void print_cpu(struct seq_file *m, int cpu)

 #ifdef CONFIG_SCHEDSTATS
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
+#define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);

 	P(yld_count);

 	P(sched_switch);
 	P(sched_count);
 	P(sched_goidle);
+	P64(avg_idle);

 	P(ttwu_count);
 	P(ttwu_local);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 37087a7..f36d5d0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1360,13 +1360,14 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
-	int new_cpu = cpu;
+	int task_pinned = cpumask_weight(&p->cpus_allowed) == 1;
+	int new_cpu = task_pinned ? prev_cpu : cpu;
 	int want_affine = 0;
-	int want_sd = 1;
+	int want_sd = !task_pinned;
 	int sync = wake_flags & WF_SYNC;

 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (sched_feat(AFFINE_WAKEUPS) &&
+		if (sched_feat(AFFINE_WAKEUPS) && !task_pinned &&
 		    cpumask_test_cpu(cpu, &p->cpus_allowed))
 			want_affine = 1;
 		new_cpu = prev_cpu;
@@ -1374,11 +1375,12 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag

 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
+		smp_read_barrier_depends();
 		/*
 		 * If power savings logic is enabled for a domain, see if we
 		 * are not overloaded, if so, don't balance wider.
 		 */
-		if (tmp->flags & (SD_POWERSAVINGS_BALANCE|SD_PREFER_LOCAL)) {
+		if (want_sd && tmp->flags & (SD_POWERSAVINGS_BALANCE|SD_PREFER_LOCAL)) {
 			unsigned long power = 0;
 			unsigned long nr_running = 0;
 			unsigned long capacity;
@@ -1429,6 +1431,12 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			update_shares(tmp);
 	}

+	/*
+	 * Balance shares maybe, but don't waste time balancing.
+	 */
+	if (task_pinned)
+		goto out;
+
 	if (affine_sd && wake_affine(affine_sd, p, sync)) {
 		new_cpu = cpu;
 		goto out;
@@ -1447,6 +1455,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;

+		smp_read_barrier_depends();
 		group = find_idlest_group(sd, p, cpu, load_idx);
 		if (!group) {
 			sd = sd->child;
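The avg_idle/idle_stamp logic in the patch above is self-contained enough to model in user space. The sketch below is a hypothetical stand-alone rendering, not kernel code: the type names, the MIGRATION_COST_NS constant (mirroring the 0.5ms default of sysctl_sched_migration_cost), and the function names are assumptions for illustration. update_avg() keeps a decaying average with each new sample weighted 1/8, as the helper of the same name in kernel/sched.c does, and newidle balancing is skipped whenever the CPU's average idle period is shorter than the migration cost.

```c
#include <assert.h>

typedef unsigned long long u64;
typedef long long s64;

/* Assumed stand-in for sysctl_sched_migration_cost (default 0.5ms, in ns). */
#define MIGRATION_COST_NS 500000ULL

/* Decaying average: each new sample is weighted 1/8, as in kernel/sched.c. */
static void update_avg(u64 *avg, u64 sample)
{
	s64 diff = (s64)(sample - *avg);
	*avg += diff >> 3;
}

struct rq_idle {
	u64 idle_stamp;	/* clock at the moment the CPU went idle; 0 = busy */
	u64 avg_idle;	/* decaying average of recent idle durations */
};

/* On wakeup: fold the idle period that just ended into avg_idle,
 * clamped so one long sleep cannot inflate the average indefinitely. */
static void wakeup_update_idle(struct rq_idle *rq, u64 now)
{
	u64 delta, max = 2 * MIGRATION_COST_NS;

	if (!rq->idle_stamp)
		return;
	delta = now - rq->idle_stamp;
	if (delta > max)
		rq->avg_idle = max;
	else
		update_avg(&rq->avg_idle, delta);
	rq->idle_stamp = 0;
}

/* Rate limit: newidle balancing only pays off if this CPU tends to
 * stay idle longer than the cost of pulling a task over. */
static int newidle_balance_allowed(const struct rq_idle *rq)
{
	return rq->avg_idle >= MIGRATION_COST_NS;
}
```

On a CPU that keeps waking up after very short idle periods, avg_idle quickly decays below the migration cost and newidle balancing stops; a single long idle period restores it to the clamped maximum.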
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-09  7:09 ` Mike Galbraith
@ 2009-11-09  9:15 ` Peter Zijlstra
  2009-11-09  9:44 ` Gautham R Shenoy
  2009-11-09  9:55 ` Mike Galbraith
  0 siblings, 2 replies; 9+ messages in thread

From: Peter Zijlstra @ 2009-11-09  9:15 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Zhang, Yanmin, Ingo Molnar, LKML, Gautham R Shenoy

On Mon, 2009-11-09 at 08:09 +0100, Mike Galbraith wrote:
> +	smp_read_barrier_depends();
>  	cpumask_setall(cpus);
> +	cpumask_and(cpus, cpus, cpu_online_mask);

How about cpumask_copy(cpus, cpu_online_mask) instead?

Also, IIRC cpu_online_mask is guaranteed stable when preemption is
disabled; otherwise you need to use get/put_online_cpus(), and an
rmb_depends() won't do.
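Peter's simplification is easy to sanity-check outside the kernel: since the tmpmask is overwritten wholesale, setall-then-and is equivalent to a single copy of the online mask. A minimal user-space model, with plain 64-bit words standing in for struct cpumask (all helper names here are hypothetical, not the kernel cpumask API):

```c
#include <stdint.h>

typedef uint64_t mask_t;	/* toy cpumask: one bit per CPU, up to 64 CPUs */

static void mask_setall(mask_t *m)                  { *m = ~(mask_t)0; }
static void mask_and(mask_t *d, mask_t a, mask_t b) { *d = a & b; }
static void mask_copy(mask_t *d, mask_t s)          { *d = s; }

/* The sequence in the patch: start from all CPUs, then mask to online. */
static mask_t setall_then_and(mask_t online)
{
	mask_t cpus;

	mask_setall(&cpus);
	mask_and(&cpus, cpus, online);
	return cpus;
}

/* Peter's suggestion: one copy of the online mask yields the same result. */
static mask_t copy_online(mask_t online)
{
	mask_t cpus;

	mask_copy(&cpus, online);
	return cpus;
}
```

Either way, the result is only a snapshot of the online mask at the moment it is read; as the follow-ups in this thread point out, making that snapshot safe against concurrent hotplug is a separate question that no barrier alone settles.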
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-09  9:15 ` Peter Zijlstra
@ 2009-11-09  9:44 ` Gautham R Shenoy
  2009-11-09  9:57 ` Mike Galbraith
  1 sibling, 1 reply; 9+ messages in thread

From: Gautham R Shenoy @ 2009-11-09  9:44 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Zhang, Yanmin, Ingo Molnar, LKML

On Mon, Nov 09, 2009 at 10:15:04AM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-09 at 08:09 +0100, Mike Galbraith wrote:
> > +	smp_read_barrier_depends();
> >  	cpumask_setall(cpus);
> > +	cpumask_and(cpus, cpus, cpu_online_mask);
>
> How about cpumask_copy(cpus, cpu_online_mask) instead?
>
> Also, IIRC cpu_online_mask is guaranteed stable when preemption is
> disabled; otherwise you need to use get/put_online_cpus(), and an
> rmb_depends() won't do.

preempt_disable() guarantees that CPUs won't go offline, since we use
stop_machine() to take CPUs offline. I don't think it provides cover
against new CPUs coming online.

--
Thanks and Regards
gautham
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-09  9:44 ` Gautham R Shenoy
@ 2009-11-09  9:57 ` Mike Galbraith
  0 siblings, 0 replies; 9+ messages in thread

From: Mike Galbraith @ 2009-11-09  9:57 UTC (permalink / raw)
To: ego; +Cc: Peter Zijlstra, Zhang, Yanmin, Ingo Molnar, LKML

On Mon, 2009-11-09 at 15:14 +0530, Gautham R Shenoy wrote:
> On Mon, Nov 09, 2009 at 10:15:04AM +0100, Peter Zijlstra wrote:
> > On Mon, 2009-11-09 at 08:09 +0100, Mike Galbraith wrote:
> > > +	smp_read_barrier_depends();
> > >  	cpumask_setall(cpus);
> > > +	cpumask_and(cpus, cpus, cpu_online_mask);
> >
> > How about cpumask_copy(cpus, cpu_online_mask) instead?
> >
> > Also, IIRC cpu_online_mask is guaranteed stable when preemption is
> > disabled; otherwise you need to use get/put_online_cpus(), and an
> > rmb_depends() won't do.
>
> preempt_disable() guarantees that CPUs won't go offline, since we use
> stop_machine() to take CPUs offline. I don't think it provides cover
> against new CPUs coming online.

That's exactly the problem I'm having with newidle. Without that
barrier, even with the cpumask_and(), it still balances offline CPUs.

	-Mike
* Re: specjbb2005 and aim7 regression with 2.6.32-rc kernels
  2009-11-09  9:15 ` Peter Zijlstra
  2009-11-09  9:44 ` Gautham R Shenoy
@ 2009-11-09  9:55 ` Mike Galbraith
  1 sibling, 0 replies; 9+ messages in thread

From: Mike Galbraith @ 2009-11-09  9:55 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Zhang, Yanmin, Ingo Molnar, LKML, Gautham R Shenoy

On Mon, 2009-11-09 at 10:15 +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-09 at 08:09 +0100, Mike Galbraith wrote:
> > +	smp_read_barrier_depends();
> >  	cpumask_setall(cpus);
> > +	cpumask_and(cpus, cpus, cpu_online_mask);
>
> How about cpumask_copy(cpus, cpu_online_mask) instead?

Yeah, better.

> Also, IIRC cpu_online_mask is guaranteed stable when preemption is
> disabled; otherwise you need to use get/put_online_cpus(), and an
> rmb_depends() won't do.

OK. I do need a barrier, though. I don't see how the mask can be stable
when three other CPUs diddle it; it looks to me like it's stable only
when all diddlers serialize on the runqueue lock. (Which, if correct,
would mean 31 has bugs too, so I'm very likely dead wrong.)

/me has very little experience with SMP memory woes. Tripping over one
is one thing; fixing the bugger is an entirely different matter. (What
I'm about to compile would probably get me spanked on lkml ;)

	-Mike
end of thread, other threads: [~2009-11-09  9:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-06  7:38 specjbb2005 and aim7 regression with 2.6.32-rc kernels Zhang, Yanmin
2009-11-06  8:02 ` Mike Galbraith
2009-11-06  8:04 ` Ingo Molnar
2009-11-09  6:19   ` Zhang, Yanmin
2009-11-09  7:09     ` Mike Galbraith
2009-11-09  9:15       ` Peter Zijlstra
2009-11-09  9:44         ` Gautham R Shenoy
2009-11-09  9:57           ` Mike Galbraith
2009-11-09  9:55         ` Mike Galbraith