Re: [PATCH v5 00/16] x86: make PAT and MTRR independent from each other

From: Juergen Gross <jgross@suse.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
	linux-pm@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Pavel Machek <pavel@ucw.cz>, Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v5 00/16] x86: make PAT and MTRR independent from each other
Date: Tue, 8 Nov 2022 08:30:19 +0100	[thread overview]
Message-ID: <03e5624d-9e45-c51b-9ace-616a223ff092@suse.com> (raw)
In-Reply-To: <Y2lbmYRWQvXn0nPe@zn.tnic>

[-- Attachment #1.1.1: Type: text/plain, Size: 9430 bytes --]

On 07.11.22 20:25, Borislav Petkov wrote:
> On Thu, Nov 03, 2022 at 05:15:52PM +0100, Borislav Petkov wrote:
>> Lemme try to find a smaller box which shows that too - that one is a
>> pain to bisect on.
> 
> Ok, couldn't find a smaller one (or maybe it had to be a big one to
> tickle this out).
> 
> So I think it is the parallel setup thing:
> 
> x86/mtrr: Do MTRR/PAT setup on all secondary CPUs in parallel
> 
> Note that before it, it would do the configuration sequentially on each
> CPU:
> 
> [    0.759239] MTRR: prepare_set: CPU83, MSR_MTRRdefType: 0x0, read: (0xc00:0)
> [    0.759239] MTRR: set_mtrr_state: CPU83, mtrr_deftype_lo: 0xc00, mtrr_state.def_type: 0, mtrr_state.enabled: 3
> [    0.760794] MTRR: post_set: CPU83, MSR_MTRRdefType will write: (0xc00:0)
> [    0.761151] MTRR: prepare_set: CPU70, MSR_MTRRdefType: 0x0, read: (0xc00:0)
> [    0.761151] MTRR: set_mtrr_state: CPU70, mtrr_deftype_lo: 0xc00, mtrr_state.def_type: 0, mtrr_state.enabled: 3
> [    0.761151] MTRR: post_set: CPU70, MSR_MTRRdefType will write: (0xc00:0)
> ...
> 
> and so on.
> 
> Now, it would do it all in parallel:
> 
> [    0.762006] MTRR: mtrr_disable: CPU70, MSR_MTRRdefType: 0x0, read: (0xc00:0)
> [    0.761916] MTRR: mtrr_disable: CPU18, MSR_MTRRdefType: 0x0, read: (0xc00:0)
> [    0.761808] MTRR: mtrr_disable: CPU82, MSR_MTRRdefType: 0x0, read: (0xc00:0)
> [    0.762593] MTRR: mtrr_disable: CPU6, MSR_MTRRdefType: 0x0, read: (0x0:0)
> 								      ^^^^^^
> 
> Note that last thing. That comes from (with debug output added):
> 
> void mtrr_disable(struct cache_state *state)
> {
>          unsigned int cpu = smp_processor_id();
>          u64 msrval;
> 
>          /* Save MTRR state */
>          rdmsr(MSR_MTRRdefType, state->mtrr_deftype_lo, state->mtrr_deftype_hi);
> 
>          /* Disable MTRRs, and set the default type to uncached */
>          mtrr_wrmsr(MSR_MTRRdefType, state->mtrr_deftype_lo & ~0xcff,
>                     state->mtrr_deftype_hi);
> 
>          rdmsrl(MSR_MTRRdefType, msrval);
> 
>          pr_info("%s: CPU%d, MSR_MTRRdefType: 0x%llx, read: (0x%x:%x)\n",
>                  __func__, cpu, msrval, state->mtrr_deftype_lo, state->mtrr_deftype_hi);
> }
> 
> The "read: (0x0:0)" basically says that
> 
> 	state->mtrr_deftype_lo, state->mtrr_deftype_hi
> 
> are both 0 already. BUT(!), they should NOT be. The low piece is 0xc00 on most
> cores except a handful and it means that MTRRs and Fixed Range are
> enabled. In total, they're these cores here:
> 
> [    0.762593] MTRR: mtrr_disable: CPU6, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762247] MTRR: mtrr_disable: CPU26, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762685] MTRR: mtrr_disable: CPU68, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762725] MTRR: mtrr_disable: CPU17, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762685] MTRR: mtrr_disable: CPU69, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762800] MTRR: mtrr_disable: CPU1, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762734] MTRR: mtrr_disable: CPU13, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762720] MTRR: mtrr_disable: CPU24, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762696] MTRR: mtrr_disable: CPU66, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762716] MTRR: mtrr_disable: CPU48, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762693] MTRR: mtrr_disable: CPU57, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762519] MTRR: mtrr_disable: CPU87, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762532] MTRR: mtrr_disable: CPU58, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762755] MTRR: mtrr_disable: CPU32, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762693] MTRR: mtrr_disable: CPU52, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762861] MTRR: mtrr_disable: CPU0, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762724] MTRR: mtrr_disable: CPU21, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762640] MTRR: mtrr_disable: CPU15, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762615] MTRR: mtrr_disable: CPU50, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762741] MTRR: mtrr_disable: CPU40, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762738] MTRR: mtrr_disable: CPU37, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762716] MTRR: mtrr_disable: CPU25, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762512] MTRR: mtrr_disable: CPU59, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762721] MTRR: mtrr_disable: CPU45, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762682] MTRR: mtrr_disable: CPU56, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762583] MTRR: mtrr_disable: CPU124, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762751] MTRR: mtrr_disable: CPU12, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762741] MTRR: mtrr_disable: CPU9, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762575] MTRR: mtrr_disable: CPU51, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762632] MTRR: mtrr_disable: CPU100, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762688] MTRR: mtrr_disable: CPU61, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762610] MTRR: mtrr_disable: CPU105, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762721] MTRR: mtrr_disable: CPU20, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.762583] MTRR: mtrr_disable: CPU47, MSR_MTRRdefType: 0x0, read: (0x0:0)
> 
> Now, if I add MFENCEs around those RDMSRs:
> 
> void mtrr_disable(struct cache_state *state)
> {
>          unsigned int cpu = smp_processor_id();
>          u64 msrval;
> 
>          /* Save MTRR state */
>          rdmsr(MSR_MTRRdefType, state->mtrr_deftype_lo, state->mtrr_deftype_hi);
> 
>          __mb();
> 
>          /* Disable MTRRs, and set the default type to uncached */
>          mtrr_wrmsr(MSR_MTRRdefType, state->mtrr_deftype_lo & ~0xcff,
>                     state->mtrr_deftype_hi);
> 
>          __mb();
> 
>          rdmsrl(MSR_MTRRdefType, msrval);
> 
>          pr_info("%s: CPU%d, MSR_MTRRdefType: 0x%llx, read: (0x%x:%x)\n",
>                  __func__, cpu, msrval, state->mtrr_deftype_lo, state->mtrr_deftype_hi);
> 
>          __mb();
> }
> 
> the amount of cores becomes less:

Probably not because of the fencing, but because of the different timing.

> 
> [    0.765260] MTRR: mtrr_disable: CPU6, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765462] MTRR: mtrr_disable: CPU5, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765242] MTRR: mtrr_disable: CPU22, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765522] MTRR: mtrr_disable: CPU0, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765474] MTRR: mtrr_disable: CPU1, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765207] MTRR: mtrr_disable: CPU54, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765225] MTRR: mtrr_disable: CPU8, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765282] MTRR: mtrr_disable: CPU88, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765150] MTRR: mtrr_disable: CPU119, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765370] MTRR: mtrr_disable: CPU49, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765395] MTRR: mtrr_disable: CPU16, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765348] MTRR: mtrr_disable: CPU52, MSR_MTRRdefType: 0x0, read: (0x0:0)
> [    0.765270] MTRR: mtrr_disable: CPU58, MSR_MTRRdefType: 0x0, read: (0x0:0)
> 
> which basically hints at some speculative fun where we end up reading
> the MSR *after* the write to it has already happened. After this thing:
> 
>          /* Disable MTRRs, and set the default type to uncached */
>          mtrr_wrmsr(MSR_MTRRdefType, state->mtrr_deftype_lo & ~0xcff,
>                     state->mtrr_deftype_hi);
> 
> and thus when we read it, we already read the disabled state. But this
> is only a conjecture because I still have no clear idea how TF would
> that even happen?!?

Yeah, and why doesn't it happen when we handle only one cpu at a time?

There might be some interaction between the cpus ...

> 
> Needless to say, this fixes it, ofc:
> 
> diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
> index 3805a6d32d37..4a685898caf3 100644
> --- a/arch/x86/kernel/cpu/cacheinfo.c
> +++ b/arch/x86/kernel/cpu/cacheinfo.c
> @@ -1116,12 +1116,14 @@ void cache_enable(struct cache_state *state)
>                  __write_cr4(state->cr4);
>   }
>   
> +static DEFINE_RAW_SPINLOCK(set_atomicity_lock);
> +
>   static void cache_cpu_init(void)
>   {
>          unsigned long flags;
>          struct cache_state state = { };
>   
> -       local_irq_save(flags);
> +       raw_spin_lock_irqsave(&set_atomicity_lock, flags);
>          cache_disable(&state);
>   
>          if (memory_caching_control & CACHE_MTRR)
> @@ -1131,7 +1133,7 @@ static void cache_cpu_init(void)
>                  pat_cpu_init();
>   
>          cache_enable(&state);
> -       local_irq_restore(flags);
> +       raw_spin_unlock_irqrestore(&set_atomicity_lock, flags);
>   }
>   
>   static bool cache_aps_delayed_init = true;
> 
> ---
> 
> and frankly, considering how we have bigger fish to fry, I'd say we do
> it the old way and leave that can'o'worms half-opened.

I agree to keep this patch out of the series for now.

> 
> Unless you wanna continue poking at it. I can give you access to that
> box at work...

Yes, please. I suspect there are some additional requirements for updating
MTRR in parallel, or this is "just" a cpu bug.

Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]