From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1752525AbdGDONp (ORCPT ); Tue, 4 Jul 2017 10:13:45 -0400
Received: from mx1.redhat.com ([209.132.183.28]:50234 "EHLO mx1.redhat.com"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
    id S1751536AbdGDONm (ORCPT ); Tue, 4 Jul 2017 10:13:42 -0400
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 4522C7C83B
Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com;
    dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com;
    spf=pass smtp.mailfrom=rkrcmar@redhat.com
DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 4522C7C83B
Date: Tue, 4 Jul 2017 16:13:23 +0200
From: Radim Krčmář
To: Yang Zhang
Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
    "H. Peter Anvin", the arch/x86 maintainers, Jonathan Corbet,
    tony.luck@intel.com, Borislav Petkov, Peter Zijlstra,
    mchehab@kernel.org, Andrew Morton, krzk@kernel.org,
    jpoimboe@redhat.com, Andy Lutomirski, Christian Borntraeger,
    Thomas Garnier, Robert Gerst, Mathias Krause,
    douly.fnst@cn.fujitsu.com, Nicolai Stange, Frederic Weisbecker,
    dvlasenk@redhat.com, Daniel Bristot de Oliveira,
    yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
    Chen Yu, aaron.lu@intel.com, Steven Rostedt, Kyle Huey, Len Brown,
    Prarit Bhargava, hidehiro.kawai.ez@hitachi.com,
    fengtiantian@huawei.com, pmladek@suse.com, jeyu@redhat.com,
    Larry.Finger@lwfinger.net, zijun_hu@htc.com, luisbg@osg.samsung.com,
    johannes.berg@intel.com, niklas.soderlund+renesas@ragnatech.se,
    zlpnobody@gmail.com, Alexey Dobriyan,
    fgao@48lvckh6395k16k5.yundunddos.com, ebiederm@xmission.com,
    Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
    Mel Gorman, "linux-kernel@vger.kernel.org",
    linux-doc@vger.kernel.org, linux-edac@vger.kernel.org, kvm
Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll
Message-ID: <20170704141322.GC30880@potion>
References: <4444ffc8-9e7b-5bd2-20da-af422fe834cc@redhat.com>
    <2245bef7-b668-9265-f3f8-3b63d71b1033@gmail.com>
    <7d085956-2573-212f-44f4-86104beba9bb@gmail.com>
    <05ec7efc-fb9c-ae24-5770-66fc472545a4@redhat.com>
    <20170627134043.GA1487@potion>
    <2771f905-d1b0-b118-9ae9-db5fb87f877c@redhat.com>
    <20170627142251.GB1487@potion>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
    (mx1.redhat.com [10.5.110.27]); Tue, 04 Jul 2017 14:13:41 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

2017-07-03 17:28+0800, Yang Zhang:
> The background is that we (Alibaba Cloud) do get more and more complaints
> from our customers in both KVM and Xen compared to bare-metal. After
> investigations, the root cause is known to us: big cost in message passing
> workload (David showed it in KVM Forum 2015)
>
> A typical message workload looks like below:
> vcpu 0                             vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC timer twice       6.  write APIC timer twice to
>    to stop sched timer              reprogram sched timer

One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?

> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same as 4.                   10. same as 3
>
> One transaction will introduce about 12 vmexits (2 hlt and 10 msr write). The
> cost of such vmexits will degrade performance severely.

Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds up to 6 MSR writes -- what are the other 4?

>                                                          Linux kernel
> already provides idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
> compromise would be to turn off NOHZ kernel, but it is not the default
> config for new distributions.
> Same for halt-poll in KVM, it only solves the
> cost from schedule in/out in host and cannot help such workload much.
>
> The purpose of this patch we want to improve current idle=poll mechanism to

Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
get rid of the timer one.

> use dynamic polling and do poll before touching sched timer. It should not be a
> virtualization specific feature but seems bare-metal has low cost to
> access the MSR. So I want to only enable it in VM. Though the idea behind the
> patch may not be so perfect to fit all conditions, it looks no worse than now.

It adds code to hot-paths (interrupt handlers) while trying to optimize
an idle-path, which is suspicious.

> How about we keep current implementation and I integrate the patch into the
> para-virtualized part as Paolo suggested? We can continue to discuss it and I
> will continue to refine it if anyone has better suggestions?

I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.

This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 and 1 ms.

Do you see disadvantages of this combined with MWAIT?

Thanks.
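
For illustration only, the keep-armed idea above can be compared with stop-on-idle in a toy model. This is a rough sketch, not kernel code: the names `Cost`, `stop_on_idle`, `keep_armed` and `TICK_US` are hypothetical, the 1 ms tick is an assumption read from the "0 and 1 ms" window, busy time between idle periods is treated as zero, and a timer interrupt is assumed to cost one exit, the same as one timer write.

```python
from dataclasses import dataclass

# Assumed tick period in microseconds (HZ=1000); the thread only mentions
# an optimized window of "0 to 1ms".
TICK_US = 1000


@dataclass
class Cost:
    timer_writes: int = 0      # guest writes that program the APIC timer
    timer_interrupts: int = 0  # timer expirations delivered while idle

    @property
    def exits(self) -> int:
        # Assumption taken from the thread: taking the timer interrupt is
        # about as expensive as programming the timer, so count both as 1.
        return self.timer_writes + self.timer_interrupts


def stop_on_idle(idle_periods_us):
    """Stop the tick on every idle entry and reprogram it on every wake-up."""
    cost = Cost()
    for _ in idle_periods_us:
        cost.timer_writes += 2  # one write to stop, one write to re-arm
    return cost


def keep_armed(idle_periods_us, remaining_us=TICK_US):
    """Leave the tick armed on idle entry; skip the re-arm if it fires while idle."""
    cost = Cost()
    for idle_us in idle_periods_us:
        if idle_us < remaining_us:
            # Woken up before the timer expired: the timer was never touched.
            remaining_us -= idle_us
        else:
            # The timer fired while idle and was not re-armed; the wake-up
            # path programs it once to restart the tick.
            cost.timer_interrupts += 1
            cost.timer_writes += 1
            remaining_us = TICK_US
    return cost


if __name__ == "__main__":
    short_idles = [50] * 10  # ten 50 us idle periods, as in a ping-pong workload
    print("stop on idle:", stop_on_idle(short_idles).exits)
    print("keep armed:  ", keep_armed(short_idles).exits)
```

Under these assumptions, ten 50 us idle periods cost 20 exits with stop-on-idle and none with keep-armed, while a single 5 ms idle period costs 2 exits either way, which is the break-even implied by the interrupt costing about as much as a write.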