From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752687AbdGDW2x (ORCPT ); Tue, 4 Jul 2017 18:28:53 -0400 Received: from mail-oi0-f65.google.com ([209.85.218.65]:34999 "EHLO mail-oi0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752559AbdGDW2u (ORCPT ); Tue, 4 Jul 2017 18:28:50 -0400 MIME-Version: 1.0 In-Reply-To: References: <1498130534-26568-1-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal> <1498130534-26568-3-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal> <4444ffc8-9e7b-5bd2-20da-af422fe834cc@redhat.com> <2245bef7-b668-9265-f3f8-3b63d71b1033@gmail.com> <7d085956-2573-212f-44f4-86104beba9bb@gmail.com> <05ec7efc-fb9c-ae24-5770-66fc472545a4@redhat.com> <20170627134043.GA1487@potion> <2771f905-d1b0-b118-9ae9-db5fb87f877c@redhat.com> <20170627142251.GB1487@potion> From: Wanpeng Li Date: Wed, 5 Jul 2017 06:28:48 +0800 Message-ID: Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll To: Yang Zhang Cc: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Paolo Bonzini , Thomas Gleixner , Ingo Molnar , "H. 
Peter Anvin" , "the arch/x86 maintainers" , Jonathan Corbet , tony.luck@intel.com, Borislav Petkov , Peter Zijlstra , mchehab@kernel.org, Andrew Morton , krzk@kernel.org, jpoimboe@redhat.com, Andy Lutomirski , Christian Borntraeger , Thomas Garnier , Robert Gerst , Mathias Krause , douly.fnst@cn.fujitsu.com, Nicolai Stange , Frederic Weisbecker , dvlasenk@redhat.com, Daniel Bristot de Oliveira , yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com, Chen Yu , aaron.lu@intel.com, Steven Rostedt , Kyle Huey , Len Brown , Prarit Bhargava , hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com, pmladek@suse.com, jeyu@redhat.com, Larry.Finger@lwfinger.net, zijun_hu@htc.com, luisbg@osg.samsung.com, johannes.berg@intel.com, niklas.soderlund+renesas@ragnatech.se, zlpnobody@gmail.com, Alexey Dobriyan , fgao@48lvckh6395k16k5.yundunddos.com, ebiederm@xmission.com, Subash Abhinov Kasiviswanathan , Arnd Bergmann , Matt Fleming , Mel Gorman , "linux-kernel@vger.kernel.org" , linux-doc@vger.kernel.org, linux-edac@vger.kernel.org, kvm
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by mail.home.local id v64MUXJE024741

2017-07-03 17:28 GMT+08:00 Yang Zhang :
> On 2017/6/27 22:22, Radim Krčmář wrote:
>>
>> 2017-06-27 15:56+0200, Paolo Bonzini:
>>>
>>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>>>
>>>>> ... which is not necessarily _wrong_. It's just a different heuristic.
>>>>
>>>> Right, it's just harder to use than the host's single_task_running() --
>>>> the VCPU calling vcpu_is_preempted() is never preempted, so we have to
>>>> look at other VCPUs that are not halted, but still preempted.
>>>>
>>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>>> yield to the host, working under the assumption that there is work for
>>>> this PCPU if other VCPUs have stuff to do. The downside is that it
>>>> misses information about the host's topology, so it would be hard to
>>>> make it work well.
>>>
>>> I would just use vcpu_is_preempted on the current CPU. From the guest's
>>> POV this option is really a "f*** everyone else" setting just like
>>> idle=poll, only a little more polite.
>>
>> vcpu_is_preempted() on the current CPU cannot return true, AFAIK.
>>
>>> If we've been preempted and we were polling, there are two cases. If an
>>> interrupt was queued while the guest was preempted, the poll will be
>>> treated as successful anyway.
>>
>> I think the poll should be treated as invalid if the window has expired
>> while the VCPU was preempted -- the guest can't tell whether the
>> interrupt arrived while still within the poll window (unless we added
>> paravirt for that), so it shouldn't waste time waiting for it.
>>
>>> If it hasn't, let others run -- but really that's not because the guest
>>> wants to be polite, it's to avoid the scheduler penalizing it
>>> excessively.
>>
>> This sounds like a VM entry just to do an immediate VM exit, so paravirt
>> seems better here as well ... (the guest telling the host about its
>> window -- which could also be used to rule it out as a target in the
>> pause-loop random kick.)
>>
>>> So until it's preempted, I think it's okay if the guest doesn't care
>>> about others. You wouldn't use this option anyway in overcommitted
>>> situations.
>>>
>>> (I'm still not very convinced about the idea.)
>>
>> Me neither. (The same mechanism is applicable to bare metal, but was
>> never used there, so I would rather bring the guest behavior closer to
>> bare metal.)
>>
>
> The background is that we (Alibaba Cloud) get more and more complaints
> from our customers on both KVM and Xen compared to bare metal. After
> investigation, the root cause is known to us: the high cost of
> message-passing workloads (David showed this at KVM Forum 2015).
>
> A typical message-passing workload looks like this:
>
> vcpu 0                             vcpu 1
> 1. send IPI                        2.  doing hlt
> 3. go into idle                    4.  receive IPI and wake up from hlt
> 5. write APIC timer twice          6.  write APIC timer twice to
>    to stop the sched timer             reprogram the sched timer

I didn't find where these two scenarios program the APIC timer twice
rather than once; could you point out the code?

Regards,
Wanpeng Li

> 7. doing hlt                       8.  handle task and send IPI to
>                                        vcpu 0
> 9. same as 4.                      10. same as 3.
>
> One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes).
> The cost of these vmexits degrades performance severely. The Linux kernel
> already provides idle=poll to mitigate this, but it only eliminates the
> IPI and hlt vmexits; it does nothing about starting/stopping the sched
> timer. A compromise would be to turn off the NOHZ kernel, but that is not
> the default config for new distributions. The same goes for halt-poll in
> KVM: it only addresses the cost of scheduling in/out on the host and
> cannot help such workloads much.
>
> The purpose of this patch is to improve the current idle=poll mechanism
> to use dynamic polling and to poll before touching the sched timer. It
> should not be a virtualization-specific feature, but bare metal seems to
> have a low cost to access the MSR, so I want to enable it only in a VM.
> Though the idea behind the patch may not fit all conditions perfectly, it
> looks no worse than what we have now. How about we keep the current
> implementation and I integrate the patch into the paravirtualized part as
> Paolo suggested? We can continue to discuss it and I will keep refining
> it if anyone has better suggestions.
>
>
>
> --
> Yang
> Alibaba Cloud Computing