* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Srivatsa Vaddagiri @ 2012-03-05 15:24 UTC
To: Peter Zijlstra
Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:

> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
>
> Right, I thought I remembered something such; you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

I did some experiments with volanomark and it does turn out to be
sensitive to SD_BALANCE_WAKE, while the other wake-heavy benchmark that
I am dealing with (Trade) benefits from it. Normalized results for both
benchmarks are provided below.

Machine : 2 quad-core Intel X5570 CPUs (H/T enabled)
Kernel  : tip (HEAD at b86148a)

                Before patch    After patch

Trade thr'put   1               2.17  (~2.2x improvement)
volanomark      1               0.8   (20% degradation)

Quick description of benchmarks
===============================

Trade was run inside an 8-vcpu VM (cgroup). 4 other 4-vcpu VMs running
cpu hogs were also present, leading to this cgroup setup:

/cgroup/sys                  (1024 shares - hosts all system tasks)
/cgroup/libvirt              (20000 shares)
/cgroup/libvirt/qemu/VM1     (8192 cpu shares)
/cgroup/libvirt/qemu/VM2-5   (1024 shares)

Volanomark server/client programs were run in the root cgroup.

The patch essentially does balance-on-wake: look for any idle cpu in
the same cache domain as the task's prev_cpu (or cur_cpu if
wake_affine obliges); failing to find one, look for the least loaded
cpu. This helps minimize latencies for the Trade workload (and thus
boosts its score). For volanomark, it seems to hurt because of waking
on a colder L2 cache.
The tradeoff seems to be between latency and cache misses. Short of
adding another tunable, are there better suggestions on how we can
address this sort of tradeoff?

Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |   26 +++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 7 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2638,7 +2638,7 @@ static int select_idle_sibling(struct ta
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
+	int i, some_idle_cpu = -1;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2661,15 +2661,25 @@ static int select_idle_sibling(struct ta
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int skip = 0;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+						tsk_cpus_allowed(p)) {
+				if (!idle_cpu(i)) {
+					if (some_idle_cpu >= 0)
+						goto next;
+					skip = 1;
+				} else
+					some_idle_cpu = i;
 			}
 
+			if (skip)
+				goto next;
+
 			target = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
 			goto done;
@@ -2677,6 +2687,9 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	if (some_idle_cpu >= 0)
+		target = some_idle_cpu;
 done:
 	return target;
 }
@@ -2766,7 +2779,10 @@ select_task_rq_fair(struct task_struct *
 		prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Ingo Molnar @ 2012-03-06 9:14 UTC
To: Srivatsa Vaddagiri
Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:
>
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > Enabling SD_BALANCE_WAKE used to be decidedly too
> > > expensive to consider. Maybe that has changed, but I doubt
> > > it.
> >
> > Right, I thought I remembered something such; you could see it
> > on wakeup heavy things like pipe-bench and that java msg
> > passing thing, right?
>
> I did some experiments with volanomark and it does turn out to
> be sensitive to SD_BALANCE_WAKE, while the other wake-heavy
> benchmark that I am dealing with (Trade) benefits from it.

Does volanomark still do yield(), thereby invoking a random shuffle of
thread scheduling and pretty much voluntarily ejecting itself from most
scheduler performance considerations?

If it uses a real locking primitive such as futexes then its
performance matters more.

Thanks,

	Ingo
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Srivatsa Vaddagiri @ 2012-03-06 10:03 UTC
To: Ingo Molnar
Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> Does volanomark still do yield(),

I could count 3.5M yield() calls across 16 cpus over a period of
5.4 sec, so yes, it does seem to make very good use of yield()!

> thereby invoking a random
> shuffle of thread scheduling and pretty much voluntarily
> ejecting itself from most scheduler performance considerations?
>
> If it uses a real locking primitive such as futexes then its
> performance matters more.

Good point.. I will gather more data for additional benchmarks (tbench,
sysbench) and post shortly.

- vatsa
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Srivatsa Vaddagiri @ 2012-03-22 15:32 UTC
To: Ingo Molnar
Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> > I did some experiments with volanomark and it does turn out to
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy
> > benchmark that I am dealing with (Trade) benefits from it.
>
> Does volanomark still do yield(), thereby invoking a random
> shuffle of thread scheduling and pretty much voluntarily
> ejecting itself from most scheduler performance considerations?
>
> If it uses a real locking primitive such as futexes then its
> performance matters more.

Some more interesting results on a more recent tip kernel.

Machine  : 2 quad-core Intel X5570 CPUs w/ H/T enabled (16 cpus)
Kernel   : tip (HEAD at ee415e2)
Guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
   various cgroups) are run on the host. Cgroup setup is as below:

/sys                 (cpu.shares = 1024, hosts all system tasks)
/libvirt             (cpu.shares = 20000)
/libvirt/qemu/VM     (cpu.shares = 8192. guest VM w/ 8 vcpus)
/libvirt/qemu/hoga   (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogb   (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogc   (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogd   (cpu.shares = 1024. hosts 4 cpu hogs)

First the BM (bare metal) scenario:

                tip     tip + patch

volano          1       0.955  (4.5% degradation)
sysbench [n1]   1       0.9984 (0.16% degradation)
tbench 1 [n2]   1       0.9096 (9% degradation)

Now the more interesting VM scenario:

                tip     tip + patch

volano          1       1.29 (29% improvement)
sysbench [n3]   1       2    (100% improvement)
tbench 1 [n4]   1       1.07 (7% improvement)
tbench 8 [n5]   1       1.26 (26% improvement)
httperf [n6]    1       1.05 (5% improvement)
Trade           1       1.31 (31% improvement)

Notes:

n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run with burst-length of 100 and wsess of 100,500,0

So the patch seems to be a wholesome win when VCPU threads are waking
up (in a highly contended environment). One reason could be that any
assumption of better cache hits from running (vcpu) threads on their
prev_cpu may not be fully correct, as vcpu threads can represent many
different guest threads internally.

Anyway, there are degradations as well, considering which I see several
possibilities:

1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
   environment:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE at all
	  levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable

Any other thoughts or suggestions for more experiments?

--

Balance threads on wakeup so that they run on the least-loaded CPU in
the same cache domain as their prev_cpu (or cur_cpu if the
wake_affine() test obliges).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |    5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 		prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Mike Galbraith @ 2012-03-23 6:38 UTC
To: Srivatsa Vaddagiri
Cc: Ingo Molnar, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> * Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:
>
> > > I did some experiments with volanomark and it does turn out to
> > > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy
> > > benchmark that I am dealing with (Trade) benefits from it.
> >
> > Does volanomark still do yield(), thereby invoking a random
> > shuffle of thread scheduling and pretty much voluntarily
> > ejecting itself from most scheduler performance considerations?
> >
> > If it uses a real locking primitive such as futexes then its
> > performance matters more.
>
> Some more interesting results on a more recent tip kernel.

Yeah, interesting. I find myself ever returning to this message, as
gears grind away trying to imagine what's going on in those vcpus.

> Machine  : 2 quad-core Intel X5570 CPUs w/ H/T enabled (16 cpus)
> Kernel   : tip (HEAD at ee415e2)
> Guest VM : 2.6.18 linux kernel based enterprise guest
>
> Benchmarks are run in two scenarios:
>
> 1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
> 2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
>    various cgroups) are run on the host. Cgroup setup is as below:
>
> /sys                 (cpu.shares = 1024, hosts all system tasks)
> /libvirt             (cpu.shares = 20000)
> /libvirt/qemu/VM     (cpu.shares = 8192. guest VM w/ 8 vcpus)
> /libvirt/qemu/hoga   (cpu.shares = 1024. hosts 4 cpu hogs)
> /libvirt/qemu/hogb   (cpu.shares = 1024. hosts 4 cpu hogs)
> /libvirt/qemu/hogc   (cpu.shares = 1024. hosts 4 cpu hogs)
> /libvirt/qemu/hogd   (cpu.shares = 1024. hosts 4 cpu hogs)
>
> First the BM (bare metal) scenario:
>
>                 tip     tip + patch
>
> volano          1       0.955  (4.5% degradation)
> sysbench [n1]   1       0.9984 (0.16% degradation)
> tbench 1 [n2]   1       0.9096 (9% degradation)

Those make sense, fast path cycles added.

> Now the more interesting VM scenario:
>
>                 tip     tip + patch
>
> volano          1       1.29 (29% improvement)
> sysbench [n3]   1       2    (100% improvement)
> tbench 1 [n4]   1       1.07 (7% improvement)
> tbench 8 [n5]   1       1.26 (26% improvement)
> httperf [n6]    1       1.05 (5% improvement)
> Trade           1       1.31 (31% improvement)
>
> Notes:
>
> n1. sysbench was run with 16 threads.
> n2. tbench was run on localhost with 1 client
> n3. sysbench was run with 8 threads
> n4. tbench was run on localhost with 1 client
> n5. tbench was run over network with 8 clients
> n6. httperf was run with burst-length of 100 and wsess of 100,500,0
>
> So the patch seems to be a wholesome win when VCPU threads are waking
> up (in a highly contended environment). One reason could be that any
> assumption of better cache hits from running (vcpu) threads on their
> prev_cpu may not be fully correct, as vcpu threads can represent many
> different guest threads internally.
>
> Anyway, there are degradations as well, considering which I see several
> possibilities:
>
> 1. Do balance-on-wake for vcpu threads only.

That's what your numbers say to me with this patch. I'm not getting the
why, but your patch appears to reduce vcpu internal latencies hugely.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
> 	- Either via sched_domain flags (disable SD_WAKE_AFFINE at all
> 	  levels and enable SD_BALANCE_WAKE at SMT/MC levels)
> 	- Or via a new sched_feat(BALANCE_WAKE) tunable
>
> Any other thoughts or suggestions for more experiments?

Other than nuking select_idle_sibling() entirely instead, none here.

	-Mike
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Peter Zijlstra @ 2012-03-26 8:29 UTC
To: Srivatsa Vaddagiri
Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> Anyway, there are degradations as well, considering which I see
> several possibilities:
>
> 1. Do balance-on-wake for vcpu threads only.

Hell no ;-) we're not going to special-case some threads over others.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
> 	- Either via sched_domain flags (disable SD_WAKE_AFFINE
> 	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)

But domain flags are not exported -- except under SCHED_DEBUG and that
sysctl mess.. also SD_ flags are not stable.

> 	- Or via a new sched_feat(BALANCE_WAKE) tunable

sched_feat() is not a stable ABI and shouldn't ever be used for
anything but debugging (hence it lives in debugfs and goes away if you
disable SCHED_DEBUG).

I would very much like more information on why things are a loss. Is it
really related to which cpu you pick, or is it the cost of doing the
balance thing?
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Peter Zijlstra @ 2012-03-26 8:36 UTC
To: Srivatsa Vaddagiri
Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
>
> Now the more interesting VM scenario:

More interesting for whom? I still think virt is an utter waste of
time ;-)

>                 tip     tip + patch
>
> volano          1       1.29 (29% improvement)
> sysbench [n3]   1       2    (100% improvement)
> tbench 1 [n4]   1       1.07 (7% improvement)
> tbench 8 [n5]   1       1.26 (26% improvement)
> httperf [n6]    1       1.05 (5% improvement)
> Trade           1       1.31 (31% improvement)

That smells like there's more to the story; a 100% improvement is too
much..
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Srivatsa Vaddagiri @ 2012-03-26 17:35 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-03-26 10:36:00]:

> >                 tip     tip + patch
> >
> > volano          1       1.29 (29% improvement)
> > sysbench [n3]   1       2    (100% improvement)
> > tbench 1 [n4]   1       1.07 (7% improvement)
> > tbench 8 [n5]   1       1.26 (26% improvement)
> > httperf [n6]    1       1.05 (5% improvement)
> > Trade           1       1.31 (31% improvement)
>
> That smells like there's more to the story, a 100% improvement is too
> much..

Yeah, I have rubbed my eyes several times to make sure it's true and
ran the same benchmark (sysbench) again just now! I can recreate that
~100% improvement with the patch even now.

To quickly recap my environment: I have a 16-cpu machine w/ 5 cgroups.
One cgroup (8192 shares) hosts sysbench inside an 8-vcpu VM, while the
remaining 4 cgroups (1024 shares each) host 4 cpu hogs each, running on
bare metal.

Given this overcommitment, select_idle_sibling() should mostly be a
no-op (i.e. it won't find any idle cores and thus defaults to
prev_cpu). Also the only tasks that will (sleep and) wake up are the
VM tasks.

Since the patch potentially affects (improves) scheduling latencies, I
measured Sum(se.statistics.wait_sum) for the VM tasks over the
benchmark run (5 iterations of sysbench):

tip         : 987240 ms
tip + patch : 280275 ms

I will get more comprehensive perf data shortly and post it.

From what I can tell, the huge improvement in benchmark score is coming
from reduced latencies for the VM tasks. The hard part to figure out
(when we are inside select_task_rq_fair()) is whether any potential
improvement in latencies (because of waking up on a less loaded cpu)
will outweigh the cost of potentially more L2-cache misses, for which
IMHO we don't have enough data to make a good decision.

- vatsa
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Peter Zijlstra @ 2012-03-26 18:06 UTC
To: Srivatsa Vaddagiri
Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Mon, 2012-03-26 at 23:05 +0530, Srivatsa Vaddagiri wrote:
>
> From what I can tell, the huge improvement in benchmark score is
> coming from reduced latencies for its VM tasks.

But if the machine is pegged, latency should not impact throughput
(since there's always work to do), so are you creating extra idle time
some place?

Are you running into lock inversion in the vcpus? Or that tlb shootdown
issue we had in the gang scheduling thread? Both are typically
busy-wait time, which is of course harder to spot than actual idle
time :/

Then again, reducing latency is good, so I don't object to that per se,
but that flips the question: why does it regress those other loads?

The biggest regression came from tbench; wasn't that mostly a random
number generator anyway? How stable are those results, do you have a
variance measure on the results?
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Mike Galbraith @ 2012-03-27 13:56 UTC
To: Peter Zijlstra
Cc: Srivatsa Vaddagiri, Ingo Molnar, Suresh Siddha, linux-kernel, Paul Turner

On Mon, 2012-03-26 at 20:06 +0200, Peter Zijlstra wrote:
> The biggest regression came from tbench, wasn't that mostly a random
> number generator anyway?

That's dbench ;-) Tbench does wiggle around more than you'd like
though. I've never tracked it down, but in lots of running the thing:
2-3% you can forget chasing, and I once bisected a 5% regression to an
unrelated change, reverted and re-applied the suspect a few times in
disbelief, and it tracked. You can't trust it _too_ much.

I consider it a reliable indicator that I'd better do more testing. If
others agree, it's a real regression; if not, it's those damn Elves 'n
Gremlins ;-)

	-Mike
* sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Peter Zijlstra @ 2011-11-15 9:46 UTC
To: linux-kernel, Ingo Molnar
Cc: Paul Turner, Suresh Siddha, Mike Galbraith

Subject: sched: Avoid SMT siblings in select_idle_sibling() if possible
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 10 Nov 2011 13:01:10 +0100

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-smlz1ah4jwood10f7eml9gzk@git.kernel.org
---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct ta
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
-	int i;
+	struct sched_group *sg;
+	int i, smt = 0;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
+again:
 	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
+			continue;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+			if (!smt) {
+				smt = 1;
+				goto again;
 			}
+			break;
 		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Suresh Siddha @ 2011-11-16 1:14 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Paul Turner, Mike Galbraith

On Tue, 2011-11-15 at 01:46 -0800, Peter Zijlstra wrote:
> @@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	rcu_read_lock();
> +again:
>  	for_each_domain(target, sd) {
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
> -			break;
> +		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> +			continue;
>
> -		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> -			if (idle_cpu(i)) {
> -				target = i;
> -				break;
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> +			if (!smt) {
> +				smt = 1;
> +				goto again;
>  			}
> +			break;
>  		}

It looks like you will be checking the core domain twice (with
smt == 0 and smt == 1) if there are no idle siblings. How about this
patch, which is more self-explanatory?

---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   54 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..cb7a5ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,25 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd)
+		if (!(sd->flags & flag))
+			return sd->child;
+
+	return NULL;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2268,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2289,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Mike Galbraith @ 2011-11-16 9:24 UTC
To: Suresh Siddha
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:

> How about this patch, which is more self-explanatory?

Actually, after further testing/reading, it looks to me like both of
these patches have a problem. They'll never select SMT siblings (so a
skip-SIBLING should accomplish the same).

This patch didn't select an idle core either though, where Peter's did.

Tested by pinning a hog to cores 1-3, then starting up an unpinned
tbench pair. Peter's patch didn't do the BadThing (as in bad for
TCP_RR/tbench) in that case, but should have.

> +		sg = sd->groups;
> +		do {
> +			if (!cpumask_intersects(sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)))
> +				goto next;
>
> +			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (!idle_cpu(i))
> +					goto next;

Say target is CPU0. Groups are 0,4 1,5 2,6 3,7. 0-3 are the first CPUs
encountered in MC groups, all were busy. At SIBLING level, the only
group is 0,4. The first encountered CPU of the sole group is the busy
target, so we're done.. so we return the busy target.

> +			target = cpumask_first_and(sched_group_cpus(sg),
> +					tsk_cpus_allowed(p));

At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

	-Mike
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

From: Suresh Siddha @ 2011-11-16 18:37 UTC
To: Mike Galbraith
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:
> On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:
>
> > How about this patch, which is more self-explanatory?
>
> Actually, after further testing/reading, it looks to me like both of
> these patches have a problem. They'll never select SMT siblings (so a
> skip-SIBLING should accomplish the same).
>
> This patch didn't select an idle core either though, where Peter's did.
>
> Tested by pinning a hog to cores 1-3, then starting up an unpinned
> tbench pair. Peter's patch didn't do the BadThing (as in bad for
> TCP_RR/tbench) in that case, but should have.
>
> Say target is CPU0. Groups are 0,4 1,5 2,6 3,7. 0-3 are the first CPUs
> encountered in MC groups, all were busy. At SIBLING level, the only
> group is 0,4. The first encountered CPU of the sole group is the busy
> target, so we're done.. so we return the busy target.
>
> At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

Mike, at the sibling level, the domain span will be 0,4 which is 0x5.
But there are two individual groups. The first group just contains
cpu0 and the second group contains cpu4.

So if cpu0 is busy, we will check the next group to see if it is idle
(which is cpu4 in your example). So we will return cpu4.

It should be ok. Isn't it?

thanks,
suresh
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16 18:37 ` Suresh Siddha
@ 2011-11-17  1:59 ` Mike Galbraith
  2011-11-17 15:38 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17  1:59 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:
> > At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.
>
> Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
> there are two individual groups. First group just contains cpu0 and the
> second group contains cpu4.
>
> So if cpu0 is busy, we will check the next group to see if it is idle
> (which is cpu4 in your example). So we will return cpu-4.

Oh duh, right.

> It should be ok. Isn't it?

Nope, wasn't ok.  I'll double check today though, and bend, spindle
and/or mutilate as required.

	-Mike

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17  1:59 ` Mike Galbraith
@ 2011-11-17 15:38 ` Mike Galbraith
  2011-11-17 15:56 ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17 15:38 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
>
> > It should be ok. Isn't it?
>
> Nope, wasn't ok.

Because you can't get to again: with a single E5620 package, it having
only SIBLING and MC domains.

again:
	for_each_domain(target, sd) {
		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
			continue;

		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
			if (!smt) {
				smt = 1;
				goto again;

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:38 ` Mike Galbraith
@ 2011-11-17 15:56 ` Peter Zijlstra
  2011-11-17 16:38 ` Mike Galbraith
  2011-11-17 19:08 ` Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-17 15:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> >
> > > It should be ok. Isn't it?
> >
> > Nope, wasn't ok.
>
> Because you can't get to again: with a single E5620 package, it having
> only SIBLING and MC domains.
>
> again:
> 	for_each_domain(target, sd) {
> 		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> 			continue;
>
> 		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> 			if (!smt) {
> 				smt = 1;
> 				goto again;

D'0h, indeed..

Something like the below maybe, although I'm certain it all can be
written much nicer indeed.

---
 kernel/sched/fair.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..340e62e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
 			continue;
 
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
+		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
+			break;
+
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
-		}
 
 		sg = sd->groups;
 		do {
@@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	if (!smt) {
+		smt = 1;
+		goto again;
+	}
 done:
 	rcu_read_unlock();

^ permalink raw reply related	[flat|nested] 49+ messages in thread
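The two-pass control flow of the patch above can be modeled in a few lines of plain, compilable C. The structures below are hypothetical toys (the real code walks sched_domains and cpumasks under RCU): pass one (smt == 0) refuses SMT-level domains so an idle full core wins, and only the retry with smt set will settle for a sibling thread.

```c
#include <stddef.h>

#define SD_SHARE_CPUPOWER      0x1 /* SMT siblings share a core */
#define SD_SHARE_PKG_RESOURCES 0x2 /* cores share a cache */

struct toy_domain {
	int flags;
	int idle_cpu;               /* -1: no idle CPU in this domain */
	struct toy_domain *parent;  /* towards wider spans */
};

/* Sketch of the reworked select_idle_sibling() loop: pass one skips
 * SMT-level domains entirely, hunting for an idle full core; only if
 * that fails do we retry and accept a sibling thread. */
static int toy_select_idle(struct toy_domain *base)
{
	struct toy_domain *sd;
	int smt = 0;

again:
	for (sd = base; sd; sd = sd->parent) {
		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
			continue;          /* ignore siblings on pass one */
		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
			break;             /* pass two only looks at SMT */
		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
			break;             /* past the cache domain: stop */
		if (sd->idle_cpu >= 0)
			return sd->idle_cpu;
	}
	if (!smt) {
		smt = 1;
		goto again;                /* the fix: retry for siblings */
	}
	return -1;
}
```

On a toy SIBLING→MC hierarchy this returns an idle core when one exists, falls back to an idle sibling otherwise, and -1 when nothing is idle, which is the behavior the patch is after.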
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2011-11-17 15:56 ` Peter Zijlstra @ 2011-11-17 16:38 ` Mike Galbraith 2011-11-17 17:36 ` Suresh Siddha 2011-11-17 19:08 ` Suresh Siddha 1 sibling, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2011-11-17 16:38 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote: > On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote: > > On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote: > > > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote: > > > > > > It should be ok. Isn't it? > > > > > > Nope, wasn't ok. > > > > Because you can't get to again: with a single E5620 package, it having > > only SIBLING and MC domains. > > > > again: > > for_each_domain(target, sd) { > > if (!smt && (sd->flags & SD_SHARE_CPUPOWER)) > > continue; > > > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) { > > if (!smt) { > > smt = 1; > > goto again; > > > > D'0h, indeed.. > > Something like the below maybe, although I'm certain it all can be > written much nicer indeed. I'll give it a go. Squabbling with bouncing buddies in an isolated and otherwise idle cpuset ate my day. 
> --- > kernel/sched/fair.c | 14 ++++++++------ > 1 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index cd3b642..340e62e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target) > if (!smt && (sd->flags & SD_SHARE_CPUPOWER)) > continue; > > - if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) { > - if (!smt) { > - smt = 1; > - goto again; > - } > + if (smt && !(sd->flags & SD_SHARE_CPUPOWER)) > + break; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) > break; > - } > > sg = sd->groups; > do { > @@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target) > sg = sg->next; > } while (sg != sd->groups); > } > + if (!smt) { > + smt = 1; > + goto again; > + } > done: > rcu_read_unlock(); > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 16:38 ` Mike Galbraith
@ 2011-11-17 17:36 ` Suresh Siddha
  2011-11-18 15:14 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-17 17:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > Something like the below maybe, although I'm certain it all can be
> > written much nicer indeed.
>
> I'll give it a go.
>
> Squabbling with bouncing buddies in an isolated and otherwise idle
> cpuset ate my day.

Well looks like I managed to have the similar issue in my patch too.
Anyways here is the updated cleaned up version of the patch ;)

---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   57 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..020483a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2271,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2292,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;

^ permalink raw reply related	[flat|nested] 49+ messages in thread
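The two helpers the patch above introduces can be modeled in a few lines of standalone C. These are toy structs (the real ones hang off cpu_rq(cpu)->sd and are walked under RCU): highest_flag_domain() climbs the ->parent links keeping the last domain that still carries the flag, and for_each_lower_domain() then walks back down ->child links, so the widest cache-sharing level is scanned first.

```c
#include <stddef.h>

#define SD_SHARE_PKG_RESOURCES 0x1 /* domain spans a shared cache */

struct toy_domain {
	int flags;
	struct toy_domain *parent; /* towards larger spans */
	struct toy_domain *child;  /* towards smaller spans */
};

/* Sketch of highest_flag_domain(): climb from the smallest domain
 * upward and remember the last one still carrying the flag. */
static struct toy_domain *
toy_highest_flag_domain(struct toy_domain *base, int flag)
{
	struct toy_domain *sd, *hsd = NULL;

	for (sd = base; sd; sd = sd->parent) {
		if (!(sd->flags & flag))
			break;
		hsd = sd;
	}
	return hsd;
}

/* for_each_lower_domain() then descends via ->child; counting the
 * levels visited shows the scan order: widest domain first. */
static int count_lower_domains(struct toy_domain *sd)
{
	int n = 0;

	for (; sd; sd = sd->child)
		n++;
	return n;
}
```

On a SIBLING→MC→NODE toy hierarchy where only SIBLING and MC share package resources, the climb stops at MC, and the downward walk then visits MC and SIBLING, i.e. whole idle cores are considered before SMT groups.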
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 17:36 ` Suresh Siddha
@ 2011-11-18 15:14 ` Mike Galbraith
  2012-02-20 14:41 ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:14 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 09:36 -0800, Suresh Siddha wrote:
> On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > > Something like the below maybe, although I'm certain it all can be
> > > written much nicer indeed.
> >
> > I'll give it a go.
> >
> > Squabbling with bouncing buddies in an isolated and otherwise idle
> > cpuset ate my day.
>
> Well looks like I managed to have the similar issue in my patch too.
> Anyways here is the updated cleaned up version of the patch ;)

Works fine.  However, unpinned buddies bounce more than with virgin
mainline.  I tried doing it differently (mikie in numbers below), and
it worked for a single unbound pair, but raped multiple unbound pairs.

---
 kernel/sched_fair.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

Index: linux-3.0-tip/kernel/sched_fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched_fair.c
+++ linux-3.0-tip/kernel/sched_fair.c
@@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
 		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
 			if (idle_cpu(i)) {
 				target = i;
+				if (sd->flags & SD_SHARE_CPUPOWER)
+					continue;
 				break;
 			}
 		}
-
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
 	}
 	rcu_read_unlock();

mikie2 is your patch + twiddles I'll post as a reply to this post.

kernel v3.2-rc1-306-g7f80850
TTWU_QUEUE off (skews results), test in cpuset 1-3,5-7

Test1: one unbound TCP_RR pair, three runs
virgin   66611.73  71376.00  61297.09  avg  66428.27  1.000
suresh   68488.88  68412.48  68149.73  (bounce)        68350.36  1.028
mikie    75925.91  75851.63  74617.29  (bounce--)      75464.94  1.136
mikie2   71403.39  71396.73  72258.91  NO_SIBLING_LIMIT_SYNC  71686.34  1.079
mikie2  139210.06 140485.95 140189.95  SIBLING_LIMIT_SYNC    139961.98  2.106

Test2: one unbound TCP_RR pair plus 2 unbound hogs, three runs
virgin   87108.59  88737.30  87383.98  avg  87743.29  1.000
suresh   84281.24  84725.07  84823.57               84931.93   .967
mikie    87850.37  86081.73  85789.49               86573.86   .986
mikie2   92613.79  92022.95  92014.26  NO_SIBLING_LIMIT_SYNC  92217.00  1.050
mikie2  134682.16 133497.30 133584.48  SIBLING_LIMIT_SYNC

Test3: three unbound TCP_RR pairs, single run
virgin   55246.99  55138.67  55248.95  avg  55211.53  1.000
suresh   53141.24  53165.45  53224.71               53177.13   .963
mikie    47627.14  47361.68  47389.41               47459.41   .859
mikie2   57969.49  57704.79  58218.14  NO_SIBLING_LIMIT_SYNC  57964.14  1.049
mikie2  132205.11 133726.94 133706.09  SIBLING_LIMIT_SYNC    133212.71  2.412

Test4: three bound TCP_RR pairs, single run
virgin  130073.67 130202.02 131666.48  avg 130647.39  1.000
suresh  129805.98 128058.25 128709.77              128858.00   .986
mikie   125597.11 127260.39 127208.73              126688.74   .969
mikie2  135441.58 134961.89 137162.00              135855.15  1.039

Test5: drop shield, tbench 8
virgin  2118.26 MB/sec  1.000
suresh  2036.32 MB/sec   .961
mikie   2051.18 MB/sec   .968
mikie2  2125.21 MB/sec  1.003  (hohum, all within tbench jitter)

Problem reference: select_idle_sibling() = painful L2 misses with westmere.

Identical configs, nohz=off NO_TTWU_QUEUE, processor.max_cstate=0
intel_idle.max_cstate=0 turbo-boost off (so both are now plain 2.4GHz boxen)

single bound TCP_RR pair
            E5620      Q6600   bound
         90196.84   42517.96   3->0
         92654.92   43946.50   3->1
         91735.26   95274.10   3->2
        129394.55   95266.83   3->3
         89127.98              3->4
         91303.15              3->5
         91345.85              3->6
         74141.88              3->7   huh?.. load is synchronous!

^ permalink raw reply	[flat|nested] 49+ messages in thread
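Mike's one-line variant (the `continue` on SD_SHARE_CPUPOWER) also fits a toy model. Again these are hypothetical structures, not the kernel's: an idle sibling is remembered in target, but because of the `continue` the walk keeps climbing, so an idle CPU found at a non-SMT level replaces it.

```c
#include <stddef.h>

#define SD_SHARE_CPUPOWER 0x1 /* SMT siblings share a core */

struct toy_domain {
	int flags;
	int idle_cpu;              /* -1: no idle CPU in this span */
	struct toy_domain *parent;
};

/* Sketch of the "prefer core over sibling" variant: record an idle
 * SMT sibling as a fallback but keep walking up, so an idle CPU at a
 * non-SMT level (a whole core) overrides it. */
static int toy_prefer_core(struct toy_domain *base, int fallback)
{
	struct toy_domain *sd;
	int target = fallback;

	for (sd = base; sd; sd = sd->parent) {
		if (sd->idle_cpu >= 0) {
			target = sd->idle_cpu;
			if (sd->flags & SD_SHARE_CPUPOWER)
				continue; /* sibling: keep looking */
			break;            /* idle core: stop here */
		}
	}
	return target;
}
```

The bounce Mike reports is visible in the model too: once a wider level offers any idle CPU, the sibling fallback is abandoned, so unpinned buddies can keep migrating between cores.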
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2011-11-18 15:14 ` Mike Galbraith @ 2012-02-20 14:41 ` Peter Zijlstra 2012-02-20 15:03 ` Srivatsa Vaddagiri 2012-02-20 18:14 ` Mike Galbraith 0 siblings, 2 replies; 49+ messages in thread From: Peter Zijlstra @ 2012-02-20 14:41 UTC (permalink / raw) To: Mike Galbraith Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Srivatsa Vaddagiri On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote: > --- > kernel/sched_fair.c | 10 ++-------- > 1 file changed, 2 insertions(+), 8 deletions(-) > > Index: linux-3.0-tip/kernel/sched_fair.c > =================================================================== > --- linux-3.0-tip.orig/kernel/sched_fair.c > +++ linux-3.0-tip/kernel/sched_fair.c > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) { > if (idle_cpu(i)) { > target = i; > + if (sd->flags & SD_SHARE_CPUPOWER) > + continue; > break; > } > } > - > - /* > - * Lets stop looking for an idle sibling when we reached > - * the domain that spans the current cpu and prev_cpu. > - */ > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) && > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) > - break; > } > rcu_read_unlock(); Mike, Suresh, did we ever get this sorted? I was looking at select_idle_sibling() and it looks like we dropped this. Also, did anybody ever get an answer from a HW guy on why sharing stuff over SMT threads is so much worse than sharing it over proper cores? Its not like this workload actually does anything concurrently. I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-20 14:41 ` Peter Zijlstra @ 2012-02-20 15:03 ` Srivatsa Vaddagiri 2012-02-20 18:25 ` Mike Galbraith 2012-02-20 18:14 ` Mike Galbraith 1 sibling, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-20 15:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]: > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote: > > > --- > > kernel/sched_fair.c | 10 ++-------- > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > Index: linux-3.0-tip/kernel/sched_fair.c > > =================================================================== > > --- linux-3.0-tip.orig/kernel/sched_fair.c > > +++ linux-3.0-tip/kernel/sched_fair.c > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta > > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) { > > if (idle_cpu(i)) { > > target = i; > > + if (sd->flags & SD_SHARE_CPUPOWER) > > + continue; > > break; > > } > > } > > - > > - /* > > - * Lets stop looking for an idle sibling when we reached > > - * the domain that spans the current cpu and prev_cpu. > > - */ > > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) && > > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) > > - break; > > } > > rcu_read_unlock(); > > Mike, Suresh, did we ever get this sorted? I was looking at > select_idle_sibling() and it looks like we dropped this. > > Also, did anybody ever get an answer from a HW guy on why sharing stuff > over SMT threads is so much worse than sharing it over proper cores? Its > not like this workload actually does anything concurrently. > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE. >From a quick scan of that code, it seems to prefer selecting an idle cpu in the same cache domain (vs selecting prev_cpu in absence of a core that is fully idle). 
I can give that a try for my benchmark and see how much it helps. My suspicion is it will not fully solve the problem I have on hand. - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-20 15:03 ` Srivatsa Vaddagiri @ 2012-02-20 18:25 ` Mike Galbraith 2012-02-21 0:06 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2012-02-20 18:25 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Mon, 2012-02-20 at 20:33 +0530, Srivatsa Vaddagiri wrote: > * Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]: > > > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote: > > > > > --- > > > kernel/sched_fair.c | 10 ++-------- > > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > > > Index: linux-3.0-tip/kernel/sched_fair.c > > > =================================================================== > > > --- linux-3.0-tip.orig/kernel/sched_fair.c > > > +++ linux-3.0-tip/kernel/sched_fair.c > > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta > > > for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) { > > > if (idle_cpu(i)) { > > > target = i; > > > + if (sd->flags & SD_SHARE_CPUPOWER) > > > + continue; > > > break; > > > } > > > } > > > - > > > - /* > > > - * Lets stop looking for an idle sibling when we reached > > > - * the domain that spans the current cpu and prev_cpu. > > > - */ > > > - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) && > > > - cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) > > > - break; > > > } > > > rcu_read_unlock(); > > > > Mike, Suresh, did we ever get this sorted? I was looking at > > select_idle_sibling() and it looks like we dropped this. > > > > Also, did anybody ever get an answer from a HW guy on why sharing stuff > > over SMT threads is so much worse than sharing it over proper cores? Its > > not like this workload actually does anything concurrently. > > > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE. 
> > From a quick scan of that code, it seems to prefer selecting an idle cpu > in the same cache domain (vs selecting prev_cpu in absence of a core > that is fully idle). Yes, that was the sole purpose of select_idle_sibling() from square one. If you can mobilize a CPU without eating cache penalty, this is most excellent for load ramp-up. The gain is huge over affine wakeup if there is any overlap to regain, ie it's not a 100% synchronous load. > I can give that a try for my benchmark and see how much it helps. My > suspicion is it will not fully solve the problem I have on hand. I doubt it will either. Your problem is when it doesn't succeed, but you have an idle core available in another domain. That's a whole different ball game. Yeah, you can reap benefit by doing wakeup balancing, but you'd better look very closely at the cost. I haven't been able to do that lately, so dunno what cost is in the here and now, but it used to be _way_ too expensive to consider, just as unrestricted idle balancing is, or high frequency load balancing in general is. -Mike ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-20 18:25 ` Mike Galbraith @ 2012-02-21 0:06 ` Srivatsa Vaddagiri 2012-02-21 6:37 ` Mike Galbraith 0 siblings, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-21 0:06 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Mike Galbraith <efault@gmx.de> [2012-02-20 19:25:43]: > > I can give that a try for my benchmark and see how much it helps. My > > suspicion is it will not fully solve the problem I have on hand. > > I doubt it will either. Your problem is when it doesn't succeed, but > you have an idle core available in another domain. fwiw the patch I had sent does a wakeup balance within prev_cpu's cache_domain (and not outside). It handles the case where we don't have any idle cpu/core within prev_cpu's cache domain, in which case we look for next best thing (least loaded cpu). I did see good numbers with that (for both my benchmark and sysbench). More on this later in the day .. - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 0:06 ` Srivatsa Vaddagiri @ 2012-02-21 6:37 ` Mike Galbraith 2012-02-21 8:09 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2012-02-21 6:37 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote: > fwiw the patch I had sent does a wakeup balance within prev_cpu's > cache_domain (and not outside). Yeah, the testing I did was turn on the flag, and measure. With single domain scans, maybe select_idle_sibling() could just go away. It ain't free either. -Mike ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 6:37 ` Mike Galbraith @ 2012-02-21 8:09 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-21 8:09 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Mike Galbraith <efault@gmx.de> [2012-02-21 07:37:14]: > On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote: > > > fwiw the patch I had sent does a wakeup balance within prev_cpu's > > cache_domain (and not outside). > > Yeah, the testing I did was turn on the flag, and measure. With single > domain scans, maybe select_idle_sibling() could just go away. Yeah that's what I was thinking. Essentially find_idlest_group/cpu should accomplish that job pretty much for us ... > It ain't free either. Will find out how much difference it makes .. - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41 ` Peter Zijlstra
  2012-02-20 15:03 ` Srivatsa Vaddagiri
@ 2012-02-20 18:14 ` Mike Galbraith
  2012-02-20 18:15 ` Peter Zijlstra
  ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Srivatsa Vaddagiri

On Mon, 2012-02-20 at 15:41 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
>
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
>
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.

I thought this was pretty much sorted.  We want to prefer core over
sibling, because on at least some modern CPUs with L3, siblings suck
rocks.

> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores?

No.  My numbers on westmere indicated to me that siblings do not share
L2, making them fairly worthless.  Hard facts we never got.

> Its not like this workload actually does anything concurrently.
>
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

I really really need to find time to do systematic mainline testing.

Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
Maybe that has changed, but I doubt it.  (general aside: testing with a
bloated distro config is a big mistake)

	-Mike

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14 ` Mike Galbraith
@ 2012-02-20 18:15 ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 18:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> I thought this was pretty much sorted.  We want to prefer core over
> sibling, because on at least some modern CPUs with L3, siblings suck
> rocks.

Yeah, I since figured out how it's supposed to (and supposedly does)
work. Suresh was a bit too clever and forgot to put a comment in since
clearly it was obvious at the time ;-)

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14 ` Mike Galbraith
  2012-02-20 18:15 ` Peter Zijlstra
@ 2012-02-20 19:07 ` Peter Zijlstra
  2012-02-21  5:43 ` Mike Galbraith
  2012-02-23 10:49 ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 19:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.

Right, I thought I remembered some such, you could see it on wakeup
heavy things like pipe-bench and that java msg passing thing, right?

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 19:07 ` Peter Zijlstra
@ 2012-02-21  5:43 ` Mike Galbraith
  2012-02-21  8:32 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  5:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Srivatsa Vaddagiri

On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
>
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

Yeah, it beat up switch heavy buddies pretty bad.

	-Mike

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 5:43 ` Mike Galbraith @ 2012-02-21 8:32 ` Srivatsa Vaddagiri 2012-02-21 9:21 ` Mike Galbraith 0 siblings, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-21 8:32 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]: > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote: > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider. > > > Maybe that has changed, but I doubt it. > > > > Right, I through I remembered some such, you could see it on wakeup > > heavy things like pipe-bench and that java msg passing thing, right? > > Yeah, it beat up switch heavy buddies pretty bad. Do you have pointer to the java benchmark? Also is pipe-bench the same as the one described below? http://freecode.com/projects/pipebench - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 8:32 ` Srivatsa Vaddagiri @ 2012-02-21 9:21 ` Mike Galbraith 2012-02-21 10:37 ` Peter Zijlstra 0 siblings, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2012-02-21 9:21 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Tue, 2012-02-21 at 14:02 +0530, Srivatsa Vaddagiri wrote: > * Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]: > > > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: > > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote: > > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider. > > > > Maybe that has changed, but I doubt it. > > > > > > Right, I through I remembered some such, you could see it on wakeup > > > heavy things like pipe-bench and that java msg passing thing, right? > > > > Yeah, it beat up switch heavy buddies pretty bad. > > Do you have pointer to the java benchmark? Also is pipe-bench the same > as the one described below? I use vmark, find it _somewhat_ useful. Not a lovely benchmark, but it is highly affinity sensitive, and switches heftily. I don't put much value on it though, too extreme for me, but it is a ~decent indicator. There are no doubt _lots_ better than vmark for java stuff. I toss a variety pack at the box in obsessive-compulsive man mode when testing. Which benchmarks doesn't matter much, just need to be wide spectrum and consistent. > http://freecode.com/projects/pipebench No, I use Ingo's pipe-test, but that to measure fastpath overhead. -Mike ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 9:21 ` Mike Galbraith @ 2012-02-21 10:37 ` Peter Zijlstra 2012-02-21 14:58 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 49+ messages in thread From: Peter Zijlstra @ 2012-02-21 10:37 UTC (permalink / raw) To: Mike Galbraith Cc: Srivatsa Vaddagiri, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Tue, 2012-02-21 at 10:21 +0100, Mike Galbraith wrote: > I use vmark, find it _somewhat_ useful. Not a lovely benchmark, but it > is highly affinity sensitive, and switches heftily. I don't put much > value on it though, too extreme for me, but it is a ~decent indicator. > > There are no doubt _lots_ better than vmark for java stuff. > > I toss a variety pack at the box in obsessive-compulsive man mode when > testing. Which benchmarks doesn't matter much, just need to be wide > spectrum and consistent. http://www.volano.com/benchmarks.html > No, I use Ingo's pipe-test, but that to measure fastpath overhead. http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test.c Also I think it lives as: perf bench sched pipe ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-21 10:37 ` Peter Zijlstra @ 2012-02-21 14:58 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-21 14:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Peter Zijlstra <peterz@infradead.org> [2012-02-21 11:37:23]: > http://www.volano.com/benchmarks.html For some reason, volanomark stops working (client hangs waiting for data to arrive on socket) in 3rd iteration when run on latest tip. I don't see that with 2.6.32 based kernel though. Checking which commit may have caused this .. - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-20 18:14 ` Mike Galbraith 2012-02-20 18:15 ` Peter Zijlstra 2012-02-20 19:07 ` Peter Zijlstra @ 2012-02-23 10:49 ` Srivatsa Vaddagiri 2012-02-23 11:19 ` Ingo Molnar ` (2 more replies) 2 siblings, 3 replies; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-23 10:49 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:
> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> > I really really need to find time to do systematic mainline testing.
>
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it. (general aside: testing with a
> bloated distro config is a big mistake)

I am seeing 2.6% _improvement_ in volanomark score by enabling
SD_BALANCE_WAKE at SMT/MC domains.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
Java    : OpenJDK 1.6.0_20
Volano  : 2_9_0

Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
SMT/MC domains.

                           Average score    std. dev

SD_BALANCE_WAKE disabled      369459.750    4825.046
SD_BALANCE_WAKE enabled       379070.500    379070.5

I am going to try the pipe benchmark next. Do you have suggestions for any
other benchmarks I should try to see the effect of SD_BALANCE_WAKE turned
on in SMT/MC domains?

- vatsa

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 10:49 ` Srivatsa Vaddagiri @ 2012-02-23 11:19 ` Ingo Molnar 2012-02-23 12:18 ` Srivatsa Vaddagiri 2012-02-23 11:20 ` Srivatsa Vaddagiri 2012-02-23 11:21 ` Mike Galbraith 2 siblings, 1 reply; 49+ messages in thread From: Ingo Molnar @ 2012-02-23 11:19 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:
> * Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:
>
> > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> > > I really really need to find time to do systematic mainline testing.
> >
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it. (general aside: testing with a
> > bloated distro config is a big mistake)
>
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE
> at SMT/MC domains.
>
> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)
> Java : OpenJDK 1.6.0_20
> Volano : 2_9_0
>
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
>
>                            Average score    std. dev
>
> SD_BALANCE_WAKE disabled      369459.750    4825.046
> SD_BALANCE_WAKE enabled       379070.500    379070.5
>
> I am going to try pipe benchmark next. Do you have suggestions
> for any other benchmarks I should try to see the effect of
> SD_BALANCE_WAKE turned on in SMT/MC domains?

sysbench is one of the best ones punishing bad scheduler
balancing mistakes.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 11:19 ` Ingo Molnar @ 2012-02-23 12:18 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-23 12:18 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:19:29]:
> sysbench is one of the best ones punishing bad scheduler
> balancing mistakes.

Here are the sysbench oltp scores on the same machine, i.e.:

Machine  : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel   : tip (HEAD at 6241cc8)
sysbench : 0.4.12

sysbench was run with 16 threads as:

./sysbench --num-threads=16 --max-requests=100000 --test=oltp
--oltp-table-size=500000 --mysql-socket=/var/lib/mysql/mysql.sock
--oltp-read-only --mysql-user=root --mysql-password=blah run

sysbench was run 5 times with the fs-cache being purged before each run
(echo 3 > /proc/sys/vm/drop_caches). The average of the 5 runs along with
the % std. dev. is noted for various OLTP stats:

                               SD_BALANCE_WAKE       SD_BALANCE_WAKE
                               disabled              enabled

transactions (per sec)          4833.826 (+- 0.75%)   4837.354 (+- 1%)
read/write requests (per sec)  67673.580 (+- 0.75%)  67722.960 (+- 1%)
other ops (per sec)             9667.654 (+- 0.75%)   9674.710 (+- 1%)

A minor improvement is seen with SD_BALANCE_WAKE enabled at the SMT/MC
domains, and no degradation is observed with it enabled.

- vatsa

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 10:49 ` Srivatsa Vaddagiri 2012-02-23 11:19 ` Ingo Molnar @ 2012-02-23 11:20 ` Srivatsa Vaddagiri 2012-02-23 11:26 ` Ingo Molnar 2012-02-23 11:21 ` Mike Galbraith 2 siblings, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-23 11:20 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-02-23 16:19:59]:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it. (general aside: testing with a
> > bloated distro config is a big mistake)
>
> I am seeing 2.6% _improvement_ in volanomark score by enabling
> SD_BALANCE_WAKE at SMT/MC domains.

Ditto with the pipe benchmark.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)

"perf bench sched pipe" was run 10 times and the average ops/sec score
along with the std. dev is noted below:

             SD_BALANCE_WAKE    SD_BALANCE_WAKE
             disabled           enabled

Avg. score   108984.900         111844.300 (+2.6%)
std dev       20383.457          21668.604

Note: SD_BALANCE_WAKE for the SMT/MC domains was enabled by writing into
the appropriate flags files present in /proc:

# cd /proc/sys/kernel/sched_domain/
# for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
# for i in `find . -name flags | grep domain0`; do echo 703 > $i; done

- vatsa

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 11:20 ` Srivatsa Vaddagiri @ 2012-02-23 11:26 ` Ingo Molnar 2012-02-23 11:32 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 49+ messages in thread From: Ingo Molnar @ 2012-02-23 11:26 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > "perf bench sched pipe" was run 10 times and average ops/sec > score alongwith std. dev is noted as below. > > > SD_BALANCE_WAKE SD_BALANCE_WAKE > disabled enabled > > Avg. score 108984.900 111844.300 (+2.6%) > std dev 20383.457 21668.604 pro perf tip of the day: did you know that it's possible to run: perf stat --repeat 10 --null perf bench sched pipe and get meaningful, normalized stddev calculations for free: 5.486491487 seconds time elapsed ( +- 5.50% ) the percentage at the end shows the level of standard deviation. You can add "--sync" as well, which will cause perf stat to call sync() - this gives extra stability of individual iterations and makes sure all IO cost is accounted to the right run. Thanks, Ingo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 11:26 ` Ingo Molnar @ 2012-02-23 11:32 ` Srivatsa Vaddagiri 2012-02-23 16:17 ` Ingo Molnar 0 siblings, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-23 11:32 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]: > pro perf tip of the day: did you know that it's possible to run: > > perf stat --repeat 10 --null perf bench sched pipe Ah yes ..had forgotten that! What is the recommended # of iterations to run? > and get meaningful, normalized stddev calculations for free: > > 5.486491487 seconds time elapsed ( +- 5.50% ) > > the percentage at the end shows the level of standard deviation. > > You can add "--sync" as well, which will cause perf stat to call > sync() - this gives extra stability of individual iterations and > makes sure all IO cost is accounted to the right run. Ok ..thanks for the 'pro' tip!! Will use it in my subsequent runs! - vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 11:32 ` Srivatsa Vaddagiri @ 2012-02-23 16:17 ` Ingo Molnar 0 siblings, 0 replies; 49+ messages in thread From: Ingo Molnar @ 2012-02-23 16:17 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]: > > > pro perf tip of the day: did you know that it's possible to run: > > > > perf stat --repeat 10 --null perf bench sched pipe > > Ah yes ..had forgotten that! What is the recommended # of > iterations to run? Well, I start at 3 or 5 and sometimes increase it to get a meaningful stddev - i.e. one that is smaller than half of the measured effect. (for cases where I expect an improvement and want to measure it.) Thanks, Ingo ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 10:49 ` Srivatsa Vaddagiri 2012-02-23 11:19 ` Ingo Molnar 2012-02-23 11:20 ` Srivatsa Vaddagiri @ 2012-02-23 11:21 ` Mike Galbraith 2012-02-25 6:54 ` Srivatsa Vaddagiri 2 siblings, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2012-02-23 11:21 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Thu, 2012-02-23 at 16:19 +0530, Srivatsa Vaddagiri wrote:
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE
> at SMT/MC domains.

That's an unexpected (but welcome) result.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel : tip (HEAD at 6241cc8)
> Java : OpenJDK 1.6.0_20
> Volano : 2_9_0
>
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
>
>                            Average score    std. dev
>
> SD_BALANCE_WAKE disabled      369459.750    4825.046
> SD_BALANCE_WAKE enabled       379070.500    379070.5
>
> I am going to try pipe benchmark next. Do you have suggestions for any other
> benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in
> SMT/MC domains?

Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
heavy should tell the tale.

-Mike

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-23 11:21 ` Mike Galbraith @ 2012-02-25 6:54 ` Srivatsa Vaddagiri 2012-02-25 8:30 ` Mike Galbraith 0 siblings, 1 reply; 49+ messages in thread From: Srivatsa Vaddagiri @ 2012-02-25 6:54 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner * Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:
> Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
> heavy should tell the tale.

Here are some tbench numbers:

Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
Kernel  : tip (HEAD at ebe97fa)
dbench  : v4.0

One tbench server/client pair was run on the same host 5 times (with the
fs-cache being purged each time) and the avg of the 5 runs for the various
cases is noted below:

Case A : HT enabled (24 logical CPUs)

Thr'put : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
Thr'put : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Case B : HT disabled (12 logical CPUs)

Thr'put : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
Thr'put : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Observations:

a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess
   reflects the cost of waking to a cold L2 cache.

b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains.

IMO we need to detect tbench-type paired wakeups as the synchronous case,
in which case blindly wake up the task on cur_cpu (as the cost of an L2
cache miss could outweigh the cost of any reduced scheduling latencies).

IOW select_task_rq_fair() needs to be given a better hint as to whether the
L2 cache has been made warm by someone (an interrupt handler or a producer
task), in which case the (consumer) task needs to be woken in the same L2
cache domain (i.e. on cur_cpu itself)?
- vatsa ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-25 6:54 ` Srivatsa Vaddagiri @ 2012-02-25 8:30 ` Mike Galbraith 2012-02-27 22:11 ` Suresh Siddha 0 siblings, 1 reply; 49+ messages in thread From: Mike Galbraith @ 2012-02-25 8:30 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Sat, 2012-02-25 at 12:24 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:
>
> > Unpinned netperf TCP_RR and/or tbench pairs? Anything that's wakeup
> > heavy should tell the tale.
>
> Here are some tbench numbers:
>
> Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
> Kernel  : tip (HEAD at ebe97fa)
> dbench  : v4.0
>
> One tbench server/client pair was run on same host 5 times (with
> fs-cache being purged each time) and avg of 5 runs for various cases
> noted below:
>
> Case A : HT enabled (24 logical CPUs)
>
> Thr'put : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
> Thr'put : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
>
> Case B : HT disabled (12 logical CPUs)
>
> Thr'put : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
> Thr'put : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
>
> Observations:
>
> a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess
>    reflects the cost of waking to a cold L2 cache.
>
> b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

I haven't done a lot of testing, but yeah, the little I have doesn't
show SD_BALANCE_WAKE making much difference on single socket boxen.

> IMO we need to detect tbench type paired wakeups as synchronous case, in
> which case blindly wakeup the task to cur_cpu (as cost of L2 cache miss
> could outweigh the cost of any reduced scheduling latencies).
>
> IOW select_task_rq_fair() needs to be given a better hint as to whether the
> L2 cache has been made warm by someone (interrupt handler or a producer
> task), in which case the (consumer) task needs to be woken in the same L2
> cache domain (i.e. on cur_cpu itself)?

My less rotund config shows the L2 penalty decidedly more prominently.
We used to have avg_overlap as a synchronous wakeup hint, but it was
broken by preemption and whatnot, and got the axe to recover some cycles.
A reliable and dirt cheap replacement would be a good thing to have.

TCP_RR and tbench are far away from the overlap breakeven point on the
E5620, whereas with the Q6600's shared L2, you can start converting
overlap into throughput almost immediately.

2.4 GHz E5620
Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

2.4 GHz Q6600
Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

-Mike

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-25 8:30 ` Mike Galbraith @ 2012-02-27 22:11 ` Suresh Siddha 2012-02-28 5:05 ` Mike Galbraith 0 siblings, 1 reply; 49+ messages in thread From: Suresh Siddha @ 2012-02-27 22:11 UTC (permalink / raw) To: Mike Galbraith Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> My less rotund config shows the L2 penalty decidedly more prominently.
> We used to have avg_overlap as a synchronous wakeup hint, but it was
> broken by preemption and whatnot, got the axe to recover some cycles. A
> reliable and dirt cheap replacement would be a good thing to have.
>
> TCP_RR and tbench are far way away from the overlap breakeven point on
> E5620, whereas with Q6600s shared L2, you can start converting overlap
> into throughput almost immediately.
>
> 2.4 GHz E5620
> Throughput 248.994 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> Throughput 379.488 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
>
> 2.4 GHz Q6600
> Throughput 299.049 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> Throughput 300.018 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES

Also, it is not always just about the L2 cache being shared or not, or
warm vs. cold. It also depends on the core C-states/P-states etc. Waking
up an idle core has a cost, and that cost depends on which core C-state
it is in. And if we ping-pong between cores often, the cpufreq governor
will come along and request a lower core P-state, even though the load
was keeping one core or the other in the socket busy at any given point
in time.

thanks,
suresh

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2012-02-27 22:11 ` Suresh Siddha @ 2012-02-28 5:05 ` Mike Galbraith 0 siblings, 0 replies; 49+ messages in thread From: Mike Galbraith @ 2012-02-28 5:05 UTC (permalink / raw) To: Suresh Siddha Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner On Mon, 2012-02-27 at 14:11 -0800, Suresh Siddha wrote:
> On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> > My less rotund config shows the L2 penalty decidedly more prominently.
> > We used to have avg_overlap as a synchronous wakeup hint, but it was
> > broken by preemption and whatnot, got the axe to recover some cycles. A
> > reliable and dirt cheap replacement would be a good thing to have.
> >
> > TCP_RR and tbench are far way away from the overlap breakeven point on
> > E5620, whereas with Q6600s shared L2, you can start converting overlap
> > into throughput almost immediately.
> >
> > 2.4 GHz E5620
> > Throughput 248.994 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> > Throughput 379.488 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
> >
> > 2.4 GHz Q6600
> > Throughput 299.049 MB/sec 1 procs SD_SHARE_PKG_RESOURCES
> > Throughput 300.018 MB/sec 1 procs !SD_SHARE_PKG_RESOURCES
>
> Also it is not always about just the L2 cache being shared/not or
> warm/cold etc. It also depends on the core c-states/p-states etc. It
> will cost waking up an idle core and the cost will depend on the what
> core-c state it is in. And also if we ping-pong between cores often,
> cpufreq governor will come and request for a lower core p-state even
> though the load was keeping one core or the other in the socket always
> busy at any given point of time.

Yeah, pinning yields a couple percent on the Q6600 box, more on the E5620
despite its spiffier gearbox.. likely turbo-boost doing its thing.

-Mike

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2011-11-17 15:56 ` Peter Zijlstra 2011-11-17 16:38 ` Mike Galbraith @ 2011-11-17 19:08 ` Suresh Siddha 2011-11-18 15:12 ` Peter Zijlstra 1 sibling, 1 reply; 49+ messages in thread From: Suresh Siddha @ 2011-11-17 19:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner On Thu, 2011-11-17 at 07:56 -0800, Peter Zijlstra wrote:
> D'oh, indeed..
>
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.

Peter, I just noticed that the -tip tree has the original proposed patch
and the new sched/ directory. So I updated my cleanup patch accordingly.
Thanks.

---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: cleanup domain traversal in select_idle_sibling

Instead of going through the scheduler domain hierarchy multiple times
(for giving priority to an idle core over an idle SMT sibling in a busy
core), start with the highest scheduler domain with the
SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down till
we find an idle group.

This cleanup also addresses an issue reported by Mike where the recent
changes returned the busy thread even in the presence of an idle SMT
sibling in single socket platforms.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c  | 38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |  2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..537e16a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,6 +2642,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2651,7 +2673,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i, smt = 0;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2671,19 +2693,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-again:
-	for_each_domain(target, sd) {
-		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
-			continue;
-
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
-			break;
-		}
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2e7802..8715055 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -501,6 +501,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)	(&per_cpu(runqueues, (cpu)))
 #define this_rq()	(&__get_cpu_var(runqueues))
 #define task_rq(p)	cpu_rq(task_cpu(p))

^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2011-11-17 19:08 ` Suresh Siddha @ 2011-11-18 15:12 ` Peter Zijlstra 2011-11-18 15:26 ` Mike Galbraith 0 siblings, 1 reply; 49+ messages in thread From: Peter Zijlstra @ 2011-11-18 15:12 UTC (permalink / raw) To: Suresh Siddha; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote: > > From: Suresh Siddha <suresh.b.siddha@intel.com> > Subject: sched: cleanup domain traversal in select_idle_sibling > > Instead of going through the scheduler domain hierarchy multiple times > (for giving priority to an idle core over an idle SMT sibling in a busy > core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES > flag and traverse the domain hierarchy down till we find an idle group. > > This cleanup also addresses an issue reported by Mike where the recent > changes returned the busy thread even in the presence of an idle SMT > sibling in single socket platforms. > > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> > Cc: Mike Galbraith <efault@gmx.de> Thanks Suresh! ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible 2011-11-18 15:12 ` Peter Zijlstra @ 2011-11-18 15:26 ` Mike Galbraith 0 siblings, 0 replies; 49+ messages in thread From: Mike Galbraith @ 2011-11-18 15:26 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner On Fri, 2011-11-18 at 16:12 +0100, Peter Zijlstra wrote: > On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote: > > > > From: Suresh Siddha <suresh.b.siddha@intel.com> > > Subject: sched: cleanup domain traversal in select_idle_sibling > > > > Instead of going through the scheduler domain hierarchy multiple times > > (for giving priority to an idle core over an idle SMT sibling in a busy > > core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES > > flag and traverse the domain hierarchy down till we find an idle group. > > > > This cleanup also addresses an issue reported by Mike where the recent > > changes returned the busy thread even in the presence of an idle SMT > > sibling in single socket platforms. > > > > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> > > Cc: Mike Galbraith <efault@gmx.de> > > Thanks Suresh! Tested-by: Mike Galbraith <efault@gmx.de> ^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2012-03-27 13:56 UTC | newest] Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <1329764866.2293.376.camhel@twins> 2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri 2012-03-06 9:14 ` Ingo Molnar 2012-03-06 10:03 ` Srivatsa Vaddagiri 2012-03-22 15:32 ` Srivatsa Vaddagiri 2012-03-23 6:38 ` Mike Galbraith 2012-03-26 8:29 ` Peter Zijlstra 2012-03-26 8:36 ` Peter Zijlstra 2012-03-26 17:35 ` Srivatsa Vaddagiri 2012-03-26 18:06 ` Peter Zijlstra 2012-03-27 13:56 ` Mike Galbraith 2011-11-15 9:46 Peter Zijlstra 2011-11-16 1:14 ` Suresh Siddha 2011-11-16 9:24 ` Mike Galbraith 2011-11-16 18:37 ` Suresh Siddha 2011-11-17 1:59 ` Mike Galbraith 2011-11-17 15:38 ` Mike Galbraith 2011-11-17 15:56 ` Peter Zijlstra 2011-11-17 16:38 ` Mike Galbraith 2011-11-17 17:36 ` Suresh Siddha 2011-11-18 15:14 ` Mike Galbraith 2012-02-20 14:41 ` Peter Zijlstra 2012-02-20 15:03 ` Srivatsa Vaddagiri 2012-02-20 18:25 ` Mike Galbraith 2012-02-21 0:06 ` Srivatsa Vaddagiri 2012-02-21 6:37 ` Mike Galbraith 2012-02-21 8:09 ` Srivatsa Vaddagiri 2012-02-20 18:14 ` Mike Galbraith 2012-02-20 18:15 ` Peter Zijlstra 2012-02-20 19:07 ` Peter Zijlstra 2012-02-21 5:43 ` Mike Galbraith 2012-02-21 8:32 ` Srivatsa Vaddagiri 2012-02-21 9:21 ` Mike Galbraith 2012-02-21 10:37 ` Peter Zijlstra 2012-02-21 14:58 ` Srivatsa Vaddagiri 2012-02-23 10:49 ` Srivatsa Vaddagiri 2012-02-23 11:19 ` Ingo Molnar 2012-02-23 12:18 ` Srivatsa Vaddagiri 2012-02-23 11:20 ` Srivatsa Vaddagiri 2012-02-23 11:26 ` Ingo Molnar 2012-02-23 11:32 ` Srivatsa Vaddagiri 2012-02-23 16:17 ` Ingo Molnar 2012-02-23 11:21 ` Mike Galbraith 2012-02-25 6:54 ` Srivatsa Vaddagiri 2012-02-25 8:30 ` Mike Galbraith 2012-02-27 22:11 ` Suresh Siddha 2012-02-28 5:05 ` Mike Galbraith 2011-11-17 19:08 ` Suresh Siddha 2011-11-18 15:12 ` Peter Zijlstra 2011-11-18 15:26 ` Mike Galbraith