From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: "Michael Kelley (LINUX)" <mikelley@microsoft.com>,
"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: KY Srinivasan <kys@microsoft.com>,
Haiyang Zhang <haiyangz@microsoft.com>,
Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"x86@kernel.org" <x86@kernel.org>
Subject: RE: [PATCH] x86/hyperv: Restore VP assist page after cpu offlining/onlining
Date: Fri, 04 Nov 2022 17:46:25 +0100 [thread overview]
Message-ID: <878rkqr7ku.fsf@ovpn-192-136.brq.redhat.com> (raw)
In-Reply-To: <BYAPR21MB1688482206965765C716DC9ED73B9@BYAPR21MB1688.namprd21.prod.outlook.com>
"Michael Kelley (LINUX)" <mikelley@microsoft.com> writes:
> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, November 3, 2022 12:06 PM
>>
>> Commit e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing
>> to VP assist page MSR") moved 'wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE)' under
>> 'if (*hvp)' condition. This works for root partition as hv_cpu_die()
>> does memunmap() and sets 'hv_vp_assist_page[cpu]' to NULL but breaks
>> non-root partitions as hv_cpu_die() doesn't free 'hv_vp_assist_page[cpu]'
>> for them.
>
> Did you consider having hv_cpu_die() free the VP assist page and
> set hv_vp_assist_page[cpu] to NULL in the non-root case?
Oh yes, I did, and I even wrote a patch (attached) but it failed in
testing :-( My testing was to run a CPU onlining/offlining loop while
trying to use KVM on the same CPU at the same time. I was able to see
issues where KVM could not reach the Enlightened VMCS because the VP
assist page was NULL.
> That would
> make the root and non-root cases more consistent, and it would make
> hv_cpu_init() and hv_cpu_die() more symmetrical. The hv_cpu_die()
> path frees up pretty much all the other per-CPU resources. I don't
> know why it keeps the VP assist page for re-use if the CPU comes back
> online later.
>
> You added the original code for allocating the vp_assist_page in
> commit a46d15cc1a, so maybe you remember if there are any
> gotchas. :-)
The root cause of the problem I observed seems to be the order of CPU
hotplug callbacks. Hyper-V uses the last CPUHP_AP_ONLINE_DYN stage while
KVM has its own dedicated CPUHP_AP_KVM_STARTING one, so we end up freeing
the VP assist page before turning KVM off on the CPU. It may be sufficient
to introduce a new stage for Hyper-V and put it before KVM's. It's in my
backlog to explore this path but it may take me some time to get back to
it :-(
>
> Michael
>
>> This causes VP assist page to remain unset after CPU
>> offline/online cycle:
>>
>> $ rdmsr -p 24 0x40000073
>> 10212f001
>> $ echo 0 > /sys/devices/system/cpu/cpu24/online
>> $ echo 1 > /sys/devices/system/cpu/cpu24/online
>> $ rdmsr -p 24 0x40000073
>> 0
>>
>> Fix the issue by always writing to HV_X64_MSR_VP_ASSIST_PAGE in
>> hv_cpu_init(). Note, checking 'if (!*hvp)' for the root partition is
>> pointless as hv_cpu_die() always sets 'hv_vp_assist_page[cpu]' to
>> NULL (and it's also NULL initially).
>>
>> Note: the fact that 'hv_vp_assist_page[cpu]' is reset to NULL may
>> present a (potential) issue for KVM. While Hyper-V uses
>> CPUHP_AP_ONLINE_DYN stage in CPU hotplug, KVM uses
>> CPUHP_AP_KVM_STARTING
>> which comes earlier in CPU teardown sequence. It is theoretically
>> possible that Enlightened VMCS is still in use. It is unclear if the
>> issue is real and if using KVM with Hyper-V root partition is even
>> possible.
>>
>> While at it, drop the unneeded smp_processor_id() call from hv_cpu_init().
>>
>> Fixes: e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing to VP assist page MSR")
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> arch/x86/hyperv/hv_init.c | 54 +++++++++++++++++++--------------------
>> 1 file changed, 26 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index f49bc3ec76e6..a269049a43ce 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -77,7 +77,7 @@ static int hyperv_init_ghcb(void)
>> static int hv_cpu_init(unsigned int cpu)
>> {
>> union hv_vp_assist_msr_contents msr = { 0 };
>> - struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
>> + struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
>> int ret;
>>
>> ret = hv_common_cpu_init(cpu);
>> @@ -87,34 +87,32 @@ static int hv_cpu_init(unsigned int cpu)
>> if (!hv_vp_assist_page)
>> return 0;
>>
>> - if (!*hvp) {
>> - if (hv_root_partition) {
>> - /*
>> - * For root partition we get the hypervisor provided VP assist
>> - * page, instead of allocating a new page.
>> - */
>> - rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - *hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> - PAGE_SIZE, MEMREMAP_WB);
>> - } else {
>> - /*
>> - * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> - * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> - * out to make sure we always write the EOI MSR in
>> - * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> - * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> - * case of CPU offlining and the VM will hang.
>> - */
>> + if (hv_root_partition) {
>> + /*
>> + * For root partition we get the hypervisor provided VP assist
>> + * page, instead of allocating a new page.
>> + */
>> + rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> + *hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> + PAGE_SIZE, MEMREMAP_WB);
>> + } else {
>> + /*
>> + * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> + * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> + * out to make sure we always write the EOI MSR in
>> + * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> + * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> + * case of CPU offlining and the VM will hang.
>> + */
>> + if (!*hvp)
>> *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> - if (*hvp)
>> - msr.pfn = vmalloc_to_pfn(*hvp);
>> - }
>> - WARN_ON(!(*hvp));
>> - if (*hvp) {
>> - msr.enable = 1;
>> - wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> - }
>> + if (*hvp)
>> + msr.pfn = vmalloc_to_pfn(*hvp);
>> +
>> + }
>> + if (!WARN_ON(!(*hvp))) {
>> + msr.enable = 1;
>> + wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> }
>>
>> return hyperv_init_ghcb();
>> --
>> 2.38.1
>
--
Vitaly
[-- Attachment #2: 0001-x86-hyperv-Free-VP-assist-page-from-hv_cpu_die.patch --]
From 9b9e30e3fc182f6a43efca0f93b5851392074a3a Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov <vkuznets@redhat.com>
Date: Thu, 3 Nov 2022 16:27:41 +0100
Subject: [PATCH] x86/hyperv: Free VP assist page from hv_cpu_die()
Content-Type: text/plain
Normally, 'hv_vp_assist_page[cpu]' points to the CPU's VP assist page
mapping. For the Hyper-V root partition, this is a memremap() of the PFN
given by the hypervisor. For a non-root partition, it's vmalloc()ed. When
the CPU goes offline, hv_cpu_die() disables the VP assist page by writing
HV_X64_MSR_VP_ASSIST_PAGE and, for the root partition, does memunmap().
For non-root partitions, the vmalloc()ed page remains allocated and
thus hv_cpu_init() has to check whether a new allocation is
needed. This is unnecessarily complicated. Let's always free the page
from hv_cpu_die() and allocate it back from hv_cpu_init(). All VP
assist page users have to be prepared for 'hv_vp_assist_page[cpu]'
becoming NULL anyway, as that's what already happens for the root
partition.
VP assist page has two users: KVM and APIC PV EOI. When a CPU goes
offline, there cannot be a running guest and thus KVM's use case
should be safe. As correctly noted in commit e320ab3cec7dd ("x86/hyper-v:
Zero out the VP ASSIST PAGE on allocation"), it is possible to see
interrupts after hv_cpu_die() and before the CPU is fully
dead. hv_apic_eoi_write() is, however, also prepared to see NULL in
'hv_vp_assist_page[smp_processor_id()]'. Moreover, checking the
page which is already unmapped from the hypervisor is incorrect in the
first place.
While at it, adjust VP assist page disabling a bit: always write to
HV_X64_MSR_VP_ASSIST_PAGE first and unmap/free the corresponding page
afterwards, to make sure the hypervisor doesn't write to the
already-freed memory in the interim.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
arch/x86/hyperv/hv_init.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index a0165df3c4d8..74be6f145fc4 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -104,8 +104,7 @@ static int hv_cpu_init(unsigned int cpu)
* in hv_cpu_die(), otherwise a CPU may not be stopped in the
* case of CPU offlining and the VM will hang.
*/
- if (!*hvp)
- *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+ *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
if (*hvp)
msr.pfn = vmalloc_to_pfn(*hvp);
@@ -233,12 +232,17 @@ static int hv_cpu_die(unsigned int cpu)
* page here and nullify it, so that in future we have
* correct page address mapped in hv_cpu_init.
*/
- memunmap(hv_vp_assist_page[cpu]);
- hv_vp_assist_page[cpu] = NULL;
rdmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
msr.enable = 0;
}
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
+
+ if (hv_root_partition)
+ memunmap(hv_vp_assist_page[cpu]);
+ else
+ vfree(hv_vp_assist_page[cpu]);
+
+ hv_vp_assist_page[cpu] = NULL;
}
if (hv_reenlightenment_cb == NULL)
--
2.38.1
2022-11-03 19:06 [PATCH] x86/hyperv: Restore VP assist page after cpu offlining/onlining Vitaly Kuznetsov
2022-11-04 16:23 ` Michael Kelley (LINUX)
2022-11-04 16:46 ` Vitaly Kuznetsov [this message]
2022-11-04 20:57 ` Michael Kelley (LINUX)
2022-11-11 23:34 ` Wei Liu