linux-kernel.vger.kernel.org archive mirror
From: Yang Zhang <yang.zhang.wz@gmail.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	the arch/x86 maintainers <x86@kernel.org>,
	Jonathan Corbet <corbet@lwn.net>,
	tony.luck@intel.com, Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	mchehab@kernel.org, Andrew Morton <akpm@linux-foundation.org>,
	krzk@kernel.org, jpoimboe@redhat.com,
	Andy Lutomirski <luto@kernel.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Thomas Garnier <thgarnie@google.com>,
	Robert Gerst <rgerst@gmail.com>,
	Mathias Krause <minipli@googlemail.com>,
	douly.fnst@cn.fujitsu.com, Nicolai Stange <nicstange@gmail.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	dvlasenk@redhat.com,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
	Chen Yu <yu.c.chen@intel.com>,
	aaron.lu@intel.com, Steven Rostedt <rostedt@goodmis.org>,
	Kyle Huey <me@kylehuey.com>, Len Brown <len.brown@intel.com>,
	Prarit Bhargava <prarit@redhat.com>,
	hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com,
	pmladek@suse.com, jeyu@redhat.com, Larry.Finger@lwfinger.net,
	zijun_hu@htc.com, luisbg@osg.samsung.com,
	johannes.berg@intel.com, niklas.soderlund+renesas@ragnatech.se,
	zlpnobody@gmail.com, Alexey Dobriyan <adobriyan@gmail.com>,
	fgao@48lvckh6395k16k5.yundunddos.com, ebiederm@xmission.com,
	Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Mel Gorman <mgorman@techsingularity.net>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-doc@vger.kernel.org, linux-edac@vger.kernel.org,
	kvm <kvm@vger.kernel.org>
Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll
Date: Mon, 3 Jul 2017 17:28:42 +0800
Message-ID: <be2a5434-b990-75a9-9136-1a4519a6ca4d@gmail.com>
In-Reply-To: <20170627142251.GB1487@potion>

On 2017/6/27 22:22, Radim Krčmář wrote:
> 2017-06-27 15:56+0200, Paolo Bonzini:
>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>> Right, it's just harder to use than host's single_task_running() -- the
>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>> at other VCPUs that are not halted, but still preempted.
>>>
>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>> yield to the host.  Working under the assumption that there is work for
>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>> misses information about host's topology, so it would be hard to make it
>>> work well.
>>
>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>> this option is really a "f*** everyone else" setting just like
>> idle=poll, only a little more polite.
>
> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>
>> If we've been preempted and we were polling, there are two cases.  If an
>> interrupt was queued while the guest was preempted, the poll will be
>> treated as successful anyway.
>
> I think the poll should be treated as invalid if the window has expired
> while the VCPU was preempted -- the guest can't tell whether the
> interrupt arrived still within the poll window (unless we added paravirt
> for that), so it shouldn't be wasting time waiting for it.
>
>>                                If it hasn't, let others run---but really
>> that's not because the guest wants to be polite, it's to avoid that the
>> scheduler penalizes it excessively.
>
> This sounds like a VM entry just to do an immediate VM exit, so paravirt
> seems better here as well ... (the guest telling the host about its
> window -- which could also be used to rule it out as a target in the
> pause loop random kick.)
>
>> So until it's preempted, I think it's okay if the guest doesn't care
>> about others.  You wouldn't use this option anyway in overcommitted
>> situations.
>>
>> (I'm still not very convinced about the idea).
>
> Me neither.  (The same mechanism is applicable to bare-metal, but was
> never used there, so I would rather bring the guest behavior closer to
> bare-metal.)
>

The background is that we (Alibaba Cloud) are getting more and more
complaints from our customers on both KVM and Xen about performance
compared to bare metal. After investigation, the root cause is known to
us: the high cost of message-passing workloads (David showed this at KVM
Forum 2015).

A typical message-passing workload looks like this:

vcpu 0                          vcpu 1
1. send IPI                     2.  executing hlt
3. go idle                      4.  receive IPI and wake up from hlt
5. write APIC timer twice       6.  write APIC timer twice to
    to stop sched timer             reprogram sched timer
7. execute hlt                  8.  handle task and send IPI to
                                    vcpu 0
9. same as 4                    10. same as 3

One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes).
The cost of these vmexits degrades performance severely. The Linux
kernel already provides idle=poll to mitigate this, but it only
eliminates the IPI and hlt vmexits. It does nothing about starting and
stopping the sched timer. A compromise would be to turn off NOHZ in the
kernel, but that is not the default config for new distributions. The
same goes for halt-poll in KVM: it only addresses the cost of scheduling
in and out on the host and cannot help such workloads much.
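
As a rough check of the arithmetic, here is a toy tally of the exits in
one transaction, following the step table. It is only a sketch, not
kernel code. The event names are made up for illustration, and it
assumes each IPI send is itself an MSR write (x2APIC-style ICR), which
is how the 10 MSR writes add up. The last scenario (a poll that succeeds
before the sched timer is touched, with the IPI still being sent) is an
assumed best case and not a measured result.

```python
# Toy tally of VM exits for one ping-pong transaction (illustrative only).
# Assumption: an IPI send costs one MSR write; names below are hypothetical.
PER_VCPU_EVENTS = {
    "hlt": 1,                   # one hlt per vCPU per transaction
    "ipi_msr_write": 1,         # one IPI sent per vCPU
    "timer_stop_writes": 2,     # APIC timer written twice to stop the sched timer
    "timer_restart_writes": 2,  # and twice more to reprogram it after wakeup
}
VCPUS = 2


def exits_per_transaction(skipped=()):
    """Count exits across both vCPUs, leaving out the event kinds in `skipped`."""
    per_vcpu = sum(n for kind, n in PER_VCPU_EVENTS.items() if kind not in skipped)
    return VCPUS * per_vcpu


# Default idle path: every event in the table causes an exit.
baseline = exits_per_transaction()

# idle=poll: the IPI and hlt exits go away, the timer writes remain.
idle_poll = exits_per_transaction(skipped=("hlt", "ipi_msr_write"))

# Assumed best case for polling before the sched timer is touched:
# no hlt and no timer writes, but the IPI is still sent.
poll_hit = exits_per_transaction(
    skipped=("hlt", "timer_stop_writes", "timer_restart_writes")
)

if __name__ == "__main__":
    print(baseline, idle_poll, poll_hit)
```

With these assumptions the totals are 12 exits by default, 8 with
idle=poll, and 2 when the poll succeeds.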

The purpose of this patch is to improve the current idle=poll mechanism
by using dynamic polling and by polling before touching the sched timer.
It should not be a virtualization-specific feature, but bare metal seems
to have a low cost for accessing the MSR, so I want to enable it only in
VMs. Though the idea behind the patch may not fit all conditions
perfectly, it looks no worse than what we have now.

How about we keep the current implementation and I integrate the patch
into the paravirtualized part, as Paolo suggested? We can continue to
discuss it, and I will continue to refine it if anyone has better
suggestions.
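
To make "dynamic polling" concrete, here is a minimal sketch of a
grow/shrink poll window. It is not taken from the patch. It is loosely
modeled on the kind of heuristic used by host-side halt polling, and the
class name, field names and default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PollWindow:
    """Hypothetical adaptive poll window, in nanoseconds (illustrative only)."""
    ns: int = 0               # current poll window; 0 means "do not poll"
    start_ns: int = 10_000    # first non-zero window when growing from 0
    max_ns: int = 200_000     # upper bound on the window
    grow_factor: int = 2
    shrink_factor: int = 2

    def grow(self):
        # Multiply the window, starting from start_ns and capped at max_ns.
        self.ns = min(max(self.ns * self.grow_factor, self.start_ns), self.max_ns)

    def shrink(self):
        # Divide the window; once it falls below start_ns, stop polling.
        self.ns //= self.shrink_factor
        if self.ns < self.start_ns:
            self.ns = 0

    def update(self, idle_ns):
        """Adjust the window given how long the CPU stayed idle before work arrived."""
        if idle_ns <= self.ns:
            return          # the poll would have caught the wakeup: keep the window
        if idle_ns <= self.max_ns:
            self.grow()     # a somewhat longer poll would have caught it
        else:
            self.shrink()   # long idle period: polling was wasted cycles


if __name__ == "__main__":
    window = PollWindow()
    for idle in (5_000, 30_000, 15_000, 1_000_000, 1_000_000):
        window.update(idle)
        print(idle, "->", window.ns)
```

The point of adjusting the window this way is that short idle gaps, as
in the message-passing case, are absorbed by polling without touching
the sched timer, while a mostly idle vCPU quickly falls back to hlt.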


-- 
Yang
Alibaba Cloud Computing
