From: Sean Christopherson <seanjc@google.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>,
Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org, amakhalov@vmware.com,
ganb@vmware.com, ankitja@vmware.com, bordoloih@vmware.com,
keerthanak@vmware.com, blamoreaux@vmware.com, namit@vmware.com,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Wyes Karny <wyes.karny@amd.com>,
Lewis Caroll <lewis.carroll@amd.com>,
Tom Lendacky <thomas.lendacky@amd.com>,
Juergen Gross <jgross@suse.com>,
x86@kernel.org,
VMware PV-Drivers Reviewers <pv-drivers@vmware.com>,
virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
xen-devel@lists.xenproject.org
Subject: Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state
Date: Fri, 20 Jan 2023 18:35:48 +0000 [thread overview]
Message-ID: <Y8rfBBBicRMk+Hut@google.com> (raw)
In-Reply-To: <20230120163734.63e62444@imammedo.users.ipa.redhat.com>
On Fri, Jan 20, 2023, Igor Mammedov wrote:
> On Fri, 20 Jan 2023 05:55:11 -0800
> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>
> > Hi Igor and Thomas,
> >
> > Thank you for your review!
> >
> > On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> > >> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
> > >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> > >>> play_dead() path for any hypervisor, even if mwait support is
> > >>> available.
> > >>
> > >> if mwait is enabled, it's very likely the guest has cpuidle
> > >> enabled and is using the same mwait as well. So exiting early from
> > >> mwait_play_dead() might just punt the workflow down:
> > >> native_play_dead()
> > >> ...
> > >> mwait_play_dead();
> > >> if (cpuidle_play_dead()) <- possible mwait here
> > >> hlt_play_dead();
> > >>
> > >> and it will end up in mwait again, and only if that fails
> > >> will it go the HLT route and maybe transition to the VMM.
> > >
> > > Good point.
> > >
> > >> Instead of a workaround on the guest side,
> > >> shouldn't the hypervisor force a VMEXIT on the vCPU being unplugged
> > >> when it's actually hot-unplugging it? (ex: QEMU kicks the vCPU out
> > >> of guest context when it is removing the vCPU, among other things)
> > >
> > > For a pure guest side CPU unplug operation:
> > >
> > > guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> > >
> > > the hypervisor is not involved at all. The vCPU is not removed in that
> > > case.
> > >
> >
> > Agreed, and this is indeed the scenario I was targeting with this patch,
> > as opposed to vCPU removal from the host side. I'll add this clarification
> > to the commit message.
Forcing HLT doesn't solve anything, as it's perfectly legal to passthrough HLT. I
guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT, and
use cases that passthrough all of them.
> The commit message explicitly said:
> "which prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU."
>
> and that implies unplug on the hypervisor side as well.
> Why? Because when the hypervisor exposes mwait to the guest, it has to
> reserve/pin a pCPU for each of the present vCPUs. And you can safely run other
> VMs/workloads on that pCPU only after it's no longer possible for it to be
> reused by the VM where it was used originally.
Pinning isn't strictly required from a safety perspective. The latency of context
switching may suffer due to wake times, but preempting a vCPU that is in C1 (or
deeper) won't cause functional problems. Passing through an entire socket
(or whatever scope triggers extra fun) might be a different story, but pinning
isn't strictly required.
That said, I 100% agree that this is expected behavior and not a bug. Letting the
guest execute MWAIT or HLT means the host won't have perfect visibility into guest
activity state.
Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally not done
precisely because the guest will always appear busy without extra effort on the
host. E.g. KVM requires an explicit opt-in from userspace to expose MWAIT and/or
HLT.
If someone really wants to efficiently oversubscribe pCPUs and passthrough MWAIT,
then their best option is probably to have a paravirt interface so that the guest
can tell the host it's offlining a vCPU. Barring that, the host could inspect the
guest when preempting a vCPU to try and guesstimate how much work the vCPU is
actually doing in order to make better scheduling decisions.
> Now consider the following worst (and most likely) case without unplug
> on the hypervisor side:
>
> 1. vm1mwait: pin pCPU2 to vCPU2
> 2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
> -> HLT -> VMEXIT
> --
> 3. vm2mwait: pin pCPU2 to vCPUx and start VM
> 4. vm2mwait: guest OS onlines the vCPU and starts using it, incl.
> going into the idle=>mwait state
> --
> 5. vm1mwait: it still thinks that vCPU2 is present, so it can rightfully do:
> guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
> --
> 6.1 best case: vm1mwait's online fails after a timeout
> 6.2 worst case: vm2mwait does a VMEXIT on vCPUx around the time-frame when
> vm1mwait onlines vCPU2; the online may succeed, and then vm2mwait's
> vCPUx will be stuck (possibly indefinitely) until for some reason a
> VMEXIT happens on vm1mwait's vCPU2 _and_ the host decides to schedule
> vCPUx on pCPU2, which would make vm1mwait stuck on vCPU2.
> So either way it's expected behavior.
>
> And if there is no intention to unplug the vCPU on the hypervisor side,
> then a VMEXIT on play_dead is not really necessary (mwait is better
> than HLT), since the hypervisor can't safely reuse the pCPU elsewhere and
> the vCPU goes into deep sleep within guest context.
>
> PS:
> The only case where making HLT/VMEXIT on play_dead might work out
> would be if the new workload weren't pinned to the same pCPU and didn't
> use mwait (i.e. the host can migrate it elsewhere and schedule
> vCPU2 back on pCPU2).