From: Andrew Cooper <andrew.cooper3@citrix.com>
To: speck@linutronix.de
Subject: [MODERATED] Re: [PATCH v2 5/8] MDSv2 7
Date: Wed, 12 Dec 2018 10:05:32 -0800	[thread overview]
Message-ID: <a4c71ed3-3216-8684-4ed4-3b622b287ab8@citrix.com> (raw)
In-Reply-To: <fef2282c-dd0f-5e30-d11e-fb0acb8938a7@citrix.com>


On 10/12/2018 16:33, speck for Andrew Cooper wrote:
> On 10/12/2018 17:53, speck for Andi Kleen wrote:
>> diff --git a/arch/x86/lib/clear_cpu.S b/arch/x86/lib/clear_cpu.S
>> new file mode 100644
>> index 000000000000..5af33baf5427
>> --- /dev/null
>> +++ b/arch/x86/lib/clear_cpu.S
>> @@ -0,0 +1,107 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/alternative-asm.h>
>> +#include <asm/cpufeatures.h>
>> +
>> +/*
>> + * Clear internal CPU buffers on kernel boundaries.
>> + *
>> + * These sequences are somewhat fragile; please don't add
>> + * or change instructions in the middle of the areas marked with
>> + * start/end.
>> + *
>> + * Interrupts and NMIs are handled by re-clearing. We clear parts
>> + * of the kernel stack, which has other advantages too.
>> + *
>> + * Save all registers to make it easier to use for callers.
>> + *
>> + * This sequence is for Nehalem-IvyBridge. For Haswell we jump
>> + * to hsw_clear_buf.
>> + *
>> + * These functions need to be called on a full stack, as they may
>> + * use up to 1.5k of stack. They should also be called with
>> + * interrupts disabled. NMIs etc. are handled by letting every
>> + * NMI do its own clear sequence.
>> + */
>> +ENTRY(ivb_clear_cpu)
>> +GLOBAL(do_clear_cpu)
>> +	/*
>> +	 * objtool complains about unreachable code here,
>> +	 * which appears to be spurious.
>> +	 */
>> +	ALTERNATIVE "", "jmp hsw_clear_cpu", X86_BUG_MDS_CLEAR_CPU_HSW
>> +	push %__ASM_REG(si)
>> +	push %__ASM_REG(di)
>> +	push %__ASM_REG(cx)
>> +	mov %_ASM_SP, %__ASM_REG(si)
>> +	sub  $2*16, %_ASM_SP
>> +	and  $-16,%_ASM_SP
>> +	movdqa %xmm0, (%_ASM_SP)
>> +	movdqa %xmm1, 1*16(%_ASM_SP)
> You don't need to preserve %xmm1 here.  It is unmodified by the
> sequence, because the orpd pulls zero out of its memory operand. 
> Similarly...
>
>> +	sub  $672, %_ASM_SP
>> +	xorpd %xmm0,%xmm0
>> +	movdqa %xmm0, (%_ASM_SP)
>> +	movdqa %xmm0, 16(%_ASM_SP)
> ... this store doesn't appear to do anything useful, as that stack slot
> isn't read again, and...
>
>> +	mov %_ASM_SP, %__ASM_REG(di)
>> +	/* Clear sequence start */
>> +	movdqu %xmm0,(%__ASM_REG(di))
>> +	lfence
>> +	orpd (%__ASM_REG(di)), %xmm0
>> +	orpd (%__ASM_REG(di)), %xmm1
>> +	mfence
>> +	movl $40, %ecx
>> +	add  $32, %__ASM_REG(di)
> ... I know this was in the recommended sequence, but bytes 16-31 aren't
> used at all, and it feels fishy.
>
> Either this wants to be $16, or the second orpd wants a displacement of
> 16 (and we do need to retain the second zeroing write) so all 32 bytes
> are used.

Based on what Ronak has said in person, two back-to-back loads are
guaranteed to be scheduled on alternate load ports.

Therefore, this can be a 16-byte change rather than 32.

However, there is a separate problem with synchronising the other
threads, because a pause waitloop will be racing with this sequence for
allocation of load ports.  There doesn't appear to be a viable option to
guarantee that these two orpd hit both load ports in the core.

Given the GPR restoration later in the return-to-guest path, which is
15-ish loads in a line, one option being discussed is to do away with
the first half of the software sequence entirely.

~Andrew


