Message-ID: <1332484680.5721.42.camel@marge.simpson.net>
Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
From: Mike Galbraith
To: Srivatsa Vaddagiri
Cc: Ingo Molnar, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner
Date: Fri, 23 Mar 2012 07:38:00 +0100
In-Reply-To: <20120322153205.GA28570@linux.vnet.ibm.com>
References: <1329764866.2293.376.camel@twins>
	 <20120305152443.GE26559@linux.vnet.ibm.com>
	 <20120306091410.GD27238@elte.hu>
	 <20120322153205.GA28570@linux.vnet.ibm.com>

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> * Ingo Molnar [2012-03-06 10:14:11]:
>
> > > I did some experiments with volanomark and it does turn out to be
> > > sensitive to SD_BALANCE_WAKE, while the other wake-heavy benchmark
> > > that I am dealing with (Trade) benefits from it.
> >
> > Does volanomark still do yield(), thereby invoking a random shuffle
> > of thread scheduling and pretty much voluntarily ejecting itself
> > from most scheduler performance considerations?
> >
> > If it uses a real locking primitive such as futexes, then its
> > performance matters more.
>
> Some more interesting results on a more recent tip kernel.

Yeah, interesting.  I keep returning to this message, gears grinding
away as I try to imagine what's going on inside those vcpus.

> Machine  : 2 quad-core Intel X5570 CPUs w/ HT enabled (16 cpus)
> Kernel   : tip (HEAD at ee415e2)
> Guest VM : 2.6.18-kernel-based enterprise Linux guest
>
> Benchmarks are run in two scenarios:
>
> 1. BM -> Bare metal. The benchmark is run on bare metal in the root cgroup.
> 2. VM -> The benchmark is run inside a guest VM. Several cpu hogs (in
>    various cgroups) are run on the host. The cgroup setup is as below:
>
> /sys                (cpu.shares = 1024, hosts all system tasks)
> /libvirt            (cpu.shares = 20000)
> /libvirt/qemu/VM    (cpu.shares = 8192, guest VM w/ 8 vcpus)
> /libvirt/qemu/hoga  (cpu.shares = 1024, hosts 4 cpu hogs)
> /libvirt/qemu/hogb  (cpu.shares = 1024, hosts 4 cpu hogs)
> /libvirt/qemu/hogc  (cpu.shares = 1024, hosts 4 cpu hogs)
> /libvirt/qemu/hogd  (cpu.shares = 1024, hosts 4 cpu hogs)
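For concreteness, a minimal sketch of how the cpu.shares values listed
above could be applied by hand; the cgroup-v1 "cpu" controller mount
point (/sys/fs/cgroup/cpu) and the pre-existing libvirt/qemu group
directories are assumptions about the test setup, not details given in
the thread:

/*
 * Sketch: write the cpu.shares values from the setup above.  Assumes a
 * cgroup-v1 "cpu" controller mounted at /sys/fs/cgroup/cpu and that the
 * libvirt/qemu group directories already exist (both are assumptions).
 */
#include <stdio.h>

static int set_shares(const char *group, unsigned long shares)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/cpu.shares", group);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lu\n", shares);
	return fclose(f);
}

int main(void)
{
	set_shares("libvirt", 20000);		/* all guests + hogs */
	set_shares("libvirt/qemu/VM", 8192);	/* 8-vcpu guest under test */
	set_shares("libvirt/qemu/hoga", 1024);	/* 4 cpu hogs */
	set_shares("libvirt/qemu/hogb", 1024);	/* 4 cpu hogs */
	set_shares("libvirt/qemu/hogc", 1024);	/* 4 cpu hogs */
	set_shares("libvirt/qemu/hogd", 1024);	/* 4 cpu hogs */
	return 0;
}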
> First the BM (bare metal) scenario:
>
>                   tip     tip + patch
>
> volano             1      0.955   (4.5% degradation)
> sysbench [n1]      1      0.9984  (0.16% degradation)
> tbench 1 [n2]      1      0.9096  (9% degradation)

Those make sense, fast path cycles added.

> Now the more interesting VM scenario:
>
>                   tip     tip + patch
>
> volano             1      1.29    (29% improvement)
> sysbench [n3]      1      2       (100% improvement)
> tbench 1 [n4]      1      1.07    (7% improvement)
> tbench 8 [n5]      1      1.26    (26% improvement)
> httperf [n6]       1      1.05    (5% improvement)
> Trade              1      1.31    (31% improvement)
>
> Notes:
>
> n1. sysbench was run with 16 threads.
> n2. tbench was run on localhost with 1 client.
> n3. sysbench was run with 8 threads.
> n4. tbench was run on localhost with 1 client.
> n5. tbench was run over the network with 8 clients.
> n6. httperf was run with a burst length of 100 and wsess of 100,500,0.
>
> So the patch seems to be a wholesale win when VCPU threads are waking
> up (in a highly contended environment). One reason could be that the
> assumption of better cache hits from running a (vcpu) thread on its
> prev_cpu may not be fully correct, since a vcpu thread can represent
> many different guest threads internally.
>
> Anyway, there are degradations as well, and considering them I see
> several possibilities:
>
> 1. Do balance-on-wake for vcpu threads only.

That's what your numbers say to me with this patch.  I'm not getting
the why, but your patch appears to reduce vcpu internal latencies
hugely.

> 2. Document the tuning possibility to improve performance in
>    virtualized environments:
>    - either via sched_domain flags (disable SD_WAKE_AFFINE at all
>      levels and enable SD_BALANCE_WAKE at the SMT/MC levels),
>    - or via a new sched_feat(BALANCE_WAKE) tunable.
>
> Any other thoughts or suggestions for more experiments?

Other than nuking select_idle_sibling() entirely instead, none here.

	-Mike
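As a rough illustration of the sched_feat(BALANCE_WAKE) idea floated
above, the fragment below sketches how such a feature bit could be
defined and consulted in select_task_rq_fair(); the file names follow
the kernel/sched/ layout of that era, and the surrounding logic is only
approximate, not the patch actually under discussion:

/* kernel/sched/features.h: off by default, flipped via debugfs sched_features */
SCHED_FEAT(BALANCE_WAKE, false)

/* kernel/sched/fair.c, inside select_task_rq_fair() (approximate): */
	if (sd_flag & SD_BALANCE_WAKE) {
		/*
		 * With BALANCE_WAKE enabled, skip the wake-affine fast
		 * path so the wakeup falls through to the
		 * find_idlest_group()/find_idlest_cpu() walk at whatever
		 * domain levels have SD_BALANCE_WAKE set in their flags.
		 */
		if (!sched_feat(BALANCE_WAKE) &&
		    cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
			want_affine = 1;
		new_cpu = prev_cpu;
	}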