RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

From: Dexuan Cui <decui@microsoft.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Haiyang Zhang <haiyangz@microsoft.com>,
	KY Srinivasan <kys@microsoft.com>,
	Stephen Hemminger <sthemmin@microsoft.com>,
	Sasha Levin <Alexander.Levin@microsoft.com>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	Michael Kelley <mikelley@microsoft.com>,
	Long Li <longli@microsoft.com>, vkuznets <vkuznets@redhat.com>,
	Ingo Molnar <mingo@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Borislav Petkov <bp@alien8.de>, "x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>,
	"driverdev-devel@linuxdriverproject.org" 
	<driverdev-devel@linuxdriverproject.org>,
	"olaf@aepfle.de" <olaf@aepfle.de>,
	"apw@canonical.com" <apw@canonical.com>,
	"jasowang@redhat.com" <jasowang@redhat.com>
Subject: RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
Date: Thu, 18 Jul 2019 01:22:16 +0000	[thread overview]
Message-ID: <PU1P153MB01693AB444C4A432FBA2507BBFC80@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <alpine.DEB.2.21.1907180058210.1778@nanos.tec.linutronix.de>

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, July 17, 2019 4:04 PM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > When a CPU is being offlined, the CPU usually still receives a few
> > interrupts (e.g. reschedule IPIs), after hv_cpu_die() disables the
> > HV_X64_MSR_VP_ASSIST_PAGE, so hv_apic_eoi_write() may not write 
> > the EOI MSR, if the apic_assist field's bit0 happens to be 1; as a result,
> > Hyper-V may not be able to deliver all the interrupts to the CPU, and the
> > CPU may not be stopped, and the kernel will hang soon.
> >
> > The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> > 5.2.1 "GPA Overlay Pages"), so with this fix we're sure the apic_assist
> > field is still zero, after the VP ASSIST PAGE is disabled.
> >
> > Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > 
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 0e033ef11a9f..db51a301f759 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -60,8 +60,14 @@ static int hv_cpu_init(unsigned int cpu)
> >  	if (!hv_vp_assist_page)
> >  		return 0;
> >
> > +	/*
> > +	 * The ZERO flag is necessary, because in the case of CPU offlining
> > +	 * the page can still be used by hv_apic_eoi_write() for a while,
> > +	 * after the VP ASSIST PAGE is disabled in hv_cpu_die().
> > +	 */
> >  	if (!*hvp)
> > -		*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
> > +		*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
> > +				 PAGE_KERNEL);
> 
> This is the allocation when the CPU is brought online for the first
> time. So what effect has zeroing at allocation time vs. offlining and
> potentially receiving IPIs? That allocation is never freed.
> 
> Neither the comment nor the changelog make any sense to me.
> 	tglx

That allocation was introduced by the commit
a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").

I think it's ok to not free the page when a CPU is offlined: every
CPU uses only 1 page and CPU offlining is not really a very usual
operation except for the scenario of hibernation (and suspend-to-memory), 
where the CPUs are quickly onlined again, when we resume from hibernation.
IMO Vitaly intentionally decided to not free the page for simplicity of the
code.

When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
*always* uses hvp->apic_assist (which is updated by the hypervisor) to
decide if it needs to write the EOI MSR:

static void hv_apic_eoi_write(u32 reg, u32 val)
{
        struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];

        if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
                return;

        wrmsr(HV_X64_MSR_EOI, val, 0);
}

When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
2. we finish the remaining work of stopping this CPU;
3. this CPU is completed stopped.

Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
every interrupt received, otherwise the hypervisor may not deliver further
interrupts, which may be needed to stop this CPU completely.

So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
"wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
way is what I do in this patch. Alternatively, we can use the below patch:

@@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
        local_irq_restore(flags);
        free_page((unsigned long)input_pg);

-       if (hv_vp_assist_page && hv_vp_assist_page[cpu])
+       if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
+               local_irq_save(flags);
                wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
+               hvp->apic_assist &= ~1;
+               local_irq_restore(flags);
+       }

        if (hv_reenlightenment_cb == NULL)
                return 0;

This second version needs 3+ lines, so I prefer the one-line version. :-)

Thanks,
-- Dexuan