From: Andy Lutomirski <luto@kernel.org>
To: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>, X86 ML <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Markus Trippelsdorf <markus@trippelsdorf.de>,
	Adam Borowski <kilobyte@angband.pl>,
	Brian Gerst <brgerst@gmail.com>,
	Johannes Hirte <johannes.hirte@datenkhaos.de>
Subject: Re: [RFC PATCH] x86/mm: Flush more aggressively in lazy TLB mode
Date: Mon, 9 Oct 2017 10:50:34 -0700
Message-ID: <CALCETrXY+fKsqE9ZKSJaFuw0EUh9zmt342VdG6fxDjx96LswWw@mail.gmail.com>
In-Reply-To: <20171009173651.3gsh3k252jnase7o@pd.tnic>

On Mon, Oct 9, 2017 at 10:36 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Oct 09, 2017 at 07:02:31PM +0200, Borislav Petkov wrote:
>> From 29a6426c20f25b8df767356a04727cb113e8e784 Mon Sep 17 00:00:00 2001
>> From: Andy Lutomirski <luto@kernel.org>
>> Date: Mon, 9 Oct 2017 09:50:49 -0700
>> Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
>>
>> Since commit 94b1b03b519b, x86's lazy TLB mode has been all the way
>
> Write it out:
>
> Since commit
>
>   94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

Will do.

>
> x86's lazy...
>
>> lazy: when running a kernel thread (including the idle thread), the
>> kernel keeps using the last user mm's page tables without attempting
>> to maintain user TLB coherence at all.  From a pure semantic
>> perspective, this is fine -- kernel threads won't attempt to access
>> user pages, so having stale TLB entries doesn't matter.
>>
>> Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
>> we also allow any paging-structure caches that may exist on the CPU
>> to become incoherent.  This means that we can have a
>> paging-structure cache entry that references a freed page table, and
>> the CPU is within its rights to do a speculative page walk starting
>> at the freed page table.
>>
>> I can imagine this causing two different problems:
>>
>>  - A speculative page walk starting from a bogus page table could read
>>    IO addresses.  I haven't seen any reports of this causing problems.
>>
>>  - A speculative page walk that involves a bogus page table can install
>>    garbage in the TLB.  Such garbage would always be at a user VA, but
>>    some AMD CPUs have logic that triggers a machine check when it notices
>>    these bogus entries.  I've seen a couple reports of this.
>
> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

Makes sense.  It's even possible that the address is entirely bogus
and doesn't refer to anything at all.

>
>> Reinstate TLB coherence in lazy mode.  With this patch applied, we
>> do it in one of two ways.  If we have PCID, we simply switch back to
>> init_mm's page tables when we enter a kernel thread -- this seems to
>> be quite cheap except for the cost of serializing the CPU.  If we
>> don't have PCID, then we set a flag and switch to init_mm the first
>> time we would otherwise need to flush the TLB.
>>
>> /sys/kernel/debug/x86/tlb_use_lazy_mode can be changed to override
>> the default mode for benchmarking.
>
> So this should be tlb_full_lazy_mode, no?
>
> Because we are lazy before but not "all the way" as you say above...

The choices are "somewhat lazy" and "not lazy at all".
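
In the no-PCID case, "somewhat lazy" means the CPU keeps the old user
CR3, sets cpu_tlbstate.is_lazy, and only switches to init_mm when a
flush request actually arrives.  A minimal sketch of that path (just an
illustration of the idea; the corresponding hunk isn't quoted above):

/* Sketch: how a lazy, no-PCID CPU handles an incoming flush request. */
static void flush_tlb_func_lazy_sketch(void)
{
	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * We were lazily borrowing the old user mm.  Instead of
		 * flushing it, stop using it: switching to init_mm drops
		 * the stale user mappings and restores TLB and
		 * paging-structure-cache coherence.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}

	/* ... otherwise, do the requested flush for the loaded mm ... */
}

With PCID, switching CR3 to init_mm right away in enter_lazy_tlb() is
cheap enough that we don't bother deferring it.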

>
> And frankly, I'm not really crazy about this benchmarking aspect. Why
> would you have this in every kernel? It could be a patch on top for
> people to measure on their boxes, but it is not really worth it. The
> PCID fun only showed a small improvement in some microbenchmark and
> nothing earth-shattering, so having this everywhere to get some meh
> numbers is not really worth the trouble.
>
> Besides, this TLB code is already complex as hell.

The degree of simplification I would get by removing it is basically
nil.  The debugfs code itself goes away, and a
static_branch_unlikely() turns into a static_cpu_has(), and that's it.
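
Concretely, the only branch the knob controls is the one in
enter_lazy_tlb(); a sketch, assuming (as the changelog describes) that
without the knob the same decision would key off PCID instead:

/* Sketch: with the knob, per the tlb_use_lazy_mode comment below. */
static void enter_lazy_tlb_sketch(void)
{
	if (static_branch_unlikely(&tlb_use_lazy_mode))
		this_cpu_write(cpu_tlbstate.is_lazy, true);	/* keep the old CR3 for now */
	else
		switch_mm(NULL, &init_mm, NULL);		/* switch to init_mm right away */
}

/* Sketch: without the knob, the same choice keyed off the CPU feature. */
static void enter_lazy_tlb_no_knob_sketch(void)
{
	if (static_cpu_has(X86_FEATURE_PCID))
		switch_mm(NULL, &init_mm, NULL);		/* cheap when we have PCID */
	else
		this_cpu_write(cpu_tlbstate.is_lazy, true);
}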

The real reason I added it is that Chris Mason volunteered to
benchmark it, and I'll send it to him once it survives a bit of
review.

>
> IOW, you could simplify this by removing the static key and adding it
> back in a patch on top for people who want to play with this. For the
> majority of systems and use cases it doesn't really matter, IMO.
>
>> In theory, we could optimize this better by only flushing the TLB in
>> lazy CPUs when a page table is freed.  Doing that would require
>> auditing the mm code to make sure that all page table freeing goes
>> through tlb_remove_page() as well as reworking some data structures
>> to implement the improved flush logic.
>>
>> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
>> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
>> Reported-by: Adam Borowski <kilobyte@angband.pl>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Adam Borowski <kilobyte@angband.pl>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: Eric Biggers <ebiggers@google.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
>> Cc: Nadav Amit <nadav.amit@gmail.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Roman Kagan <rkagan@virtuozzo.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: lkml <linux-kernel@vger.kernel.org>
>> Cc: x86-ml <x86@kernel.org>
>> Link: http://lkml.kernel.org/r/8eccc9240041ea7cb94624cab8d07e2a6e911ba7.1507567665.git.luto@kernel.org
>> Signed-off-by: Borislav Petkov <bp@suse.de>
>> ---
>>  arch/x86/include/asm/mmu_context.h |   8 +-
>>  arch/x86/include/asm/tlbflush.h    |  24 ++++++
>>  arch/x86/mm/tlb.c                  | 153 +++++++++++++++++++++++++++----------
>>  3 files changed, 136 insertions(+), 49 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index c120b5db178a..3c856a15b98e 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>>       DEBUG_LOCKS_WARN_ON(preemptible());
>>  }
>>
>> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> -{
>> -     int cpu = smp_processor_id();
>> -
>> -     if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> -             cpumask_clear_cpu(cpu, mm_cpumask(mm));
>> -}
>> +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>>
>>  static inline int init_new_context(struct task_struct *tsk,
>>                                  struct mm_struct *mm)
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 4893abf7f74f..d362161d3291 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -83,6 +83,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>>  #endif
>>
>>  /*
>> + * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
>> + * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
>> + * it's false, then we immediately switch CR3 when entering a kernel thread.
>> + */
>> +DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>> +/*
>>   * 6 because 6 should be plenty and struct tlb_state will fit in
>>   * two cache lines.
>>   */
>> @@ -105,6 +112,23 @@ struct tlb_state {
>>       u16 next_asid;
>>
>>       /*
>> +      * We can be in one of several states:
>> +      *
>> +      *  - Actively using an mm.  Our CPU's bit will be set in
>> +      *    mm_cpumask(loaded_mm) and is_lazy == false;
>> +      *
>> +      *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> +      *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>
> And this is the old lazy mode, right?
>
> I think we should state what lazy we mean ...

This is non-lazy.  It's roughly what our state was in old kernels
when we went lazy and then called leave_mm().
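
For reference, the three states that comment describes line up like
this (my summary, not text from the patch):

  loaded_mm    bit in mm_cpumask(loaded_mm)   is_lazy   meaning
  a real mm    set                            false     actively using that mm
  &init_mm     clear                          false     "non-lazy"; roughly the old
                                                        post-leave_mm() state
  a real mm    set                            true      lazy: CR3 still points at the
                                                        old user mm, flushes deferred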

>
>> +      *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> +      *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> +      *    We're heuristically guessing that the CR3 load we
>> +      *    skipped more than makes up for the overhead added by
>> +      *    lazy mode.
>> +      */
>> +     bool is_lazy;
>
>
>> +
>> +     /*
>>        * Access to this CR4 shadow and to H/W CR4 is protected by
>>        * disabling interrupts when modifying either one.
>>        */
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 49d9778376d7..658bf0090565 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -30,6 +30,8 @@
>>
>>  atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>>
>> +DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>>  static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>>                           u16 *new_asid, bool *need_flush)
>>  {
>> @@ -80,7 +82,7 @@ void leave_mm(int cpu)
>>               return;
>>
>>       /* Warn if we're not lazy. */
>> -     WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
>> +     WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
>>
>>       switch_mm(NULL, &init_mm, NULL);
>>  }
>> @@ -142,45 +144,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>               __flush_tlb_all();
>>       }
>>  #endif
>> +     this_cpu_write(cpu_tlbstate.is_lazy, false);
>>
>>       if (real_prev == next) {
>>               VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
>>                         next->context.ctx_id);
>>
>> -             if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
>> -                     /*
>> -                      * There's nothing to do: we weren't lazy, and we
>> -                      * aren't changing our mm.  We don't need to flush
>> -                      * anything, nor do we need to update CR3, CR4, or
>> -                      * LDTR.
>> -                      */
>> -                     return;
>> -             }
>> -
>> -             /* Resume remote flushes and then read tlb_gen. */
>> -             cpumask_set_cpu(cpu, mm_cpumask(next));
>> -             next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>> -
>> -             if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
>> -                 next_tlb_gen) {
>> -                     /*
>> -                      * Ideally, we'd have a flush_tlb() variant that
>> -                      * takes the known CR3 value as input.  This would
>> -                      * be faster on Xen PV and on hypothetical CPUs
>> -                      * on which INVPCID is fast.
>> -                      */
>> -                     this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
>> -                                    next_tlb_gen);
>> -                     write_cr3(build_cr3(next, prev_asid));
>> -                     trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
>> -                                     TLB_FLUSH_ALL);
>> -             }
>> -
>>               /*
>> -              * We just exited lazy mode, which means that CR4 and/or LDTR
>> -              * may be stale.  (Changes to the required CR4 and LDTR states
>> -              * are not reflected in tlb_gen.)
>> +              * We don't currently support having a real mm loaded without
>> +              * our cpu set in mm_cpumask().  We have all the bookkeeping
>
> s/cpu/CPU/
>
>> +              * in place to figure out whether we would need to flush
>> +              * if our cpu were cleared in mm_cpumask(), but we don't
>
> ditto.
>
>> +              * currently use it.
>>                */
>> +             if (WARN_ON_ONCE(real_prev != &init_mm &&
>> +                              !cpumask_test_cpu(cpu, mm_cpumask(next))))
>> +                     cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>> +             return;
>>       } else {
>>               u16 new_asid;
>>               bool need_flush;
> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

Thread overview: 23+ messages
2017-10-09 16:50 [RFC PATCH] x86/mm: Flush more aggressively in lazy TLB mode Andy Lutomirski
2017-10-09 17:02 ` Borislav Petkov
2017-10-09 17:36   ` Borislav Petkov
2017-10-09 17:50     ` Andy Lutomirski [this message]
2017-10-09 18:08       ` Borislav Petkov
2017-10-09 18:31         ` Andy Lutomirski
2017-10-13  9:07   ` demfloro
2017-10-14 10:49   ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
2017-10-14 12:34     ` [PATCH] x86/mm: Rip out the TLB benchmarking knob Borislav Petkov
2017-10-14 17:01       ` Andy Lutomirski
2017-10-14 16:34     ` [tip:x86/urgent] x86/mm: Flush more aggressively in lazy TLB mode Andy Lutomirski
2017-10-14 17:00       ` Andy Lutomirski
2017-10-16  2:39   ` [lkp-robot] [x86/mm] c4c3c3c2d0: will-it-scale.per_process_ops -61.0% regression kernel test robot
2017-10-16 10:15     ` Borislav Petkov
2017-10-17  1:06       ` Andy Lutomirski
2017-10-17  4:57         ` Markus Trippelsdorf
2017-10-17  8:00           ` Borislav Petkov
2017-10-17 22:06             ` Andy Lutomirski
2017-10-18 14:26               ` Borislav Petkov
2017-10-18  1:59             ` Ye Xiaolong
2017-10-18  8:11               ` Borislav Petkov
2017-10-17  6:04         ` Ye Xiaolong
2017-10-10  8:22 ` [RFC PATCH] x86/mm: Flush more aggressively in lazy TLB mode Markus Trippelsdorf
