Re: [PATCH 3/6] x86,tlb: make lazy TLB mode lazier

From: Andy Lutomirski <luto@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
	Andrew Lutomirski <luto@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Ingo Molnar <mingo@kernel.org>, kernel-team <kernel-team@fb.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mike Galbraith <efault@gmx.de>,
	songliubraving@fb.com
Subject: Re: [PATCH 3/6] x86,tlb: make lazy TLB mode lazier
Date: Wed, 27 Jun 2018 11:10:44 -0700	[thread overview]
Message-ID: <CALCETrVgDXe6NFCmxhyiwMEQmQqyidAh+wQBWeaCKg+z6jQOPw@mail.gmail.com> (raw)
In-Reply-To: <20180626173126.12296-4-riel@surriel.com>

On Tue, Jun 26, 2018 at 10:31 AM Rik van Riel <riel@surriel.com> wrote:
>
> Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
> when all it really needs to do is reload %CR3 at the next context switch,
> assuming no page table pages got freed.
>
> Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
> which checks whether .tlb_gen changed, and the TLB invalidation code, which
> increments .tlb_gen whenever page table entries get invalidated.
>
> The atomic increment in inc_mm_tlb_gen is its own barrier; the context
> switch code adds an explicit barrier between reading tlbstate.is_lazy and
> next->context.tlb_gen.
>
> Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
> are not removed from the mm_cpumask(mm), since that would prevent the TLB
> flush IPIs at page table free time from being sent to all the CPUs
> that need them.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/include/asm/uv/uv.h  |   6 +-
>  arch/x86/mm/tlb.c             | 131 ++++++++++++++++++++++++++++--------------
>  arch/x86/platform/uv/tlb_uv.c |   2 +-
>  3 files changed, 92 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
> index a80c0673798f..d801afb5fe90 100644
> --- a/arch/x86/include/asm/uv/uv.h
> +++ b/arch/x86/include/asm/uv/uv.h
> @@ -17,7 +17,7 @@ extern int is_uv_hubless(void);
>  extern void uv_cpu_init(void);
>  extern void uv_nmi_init(void);
>  extern void uv_system_init(void);
> -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
> +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
>                                                  const struct flush_tlb_info *info);
>
>  #else  /* X86_UV */
> @@ -27,8 +27,8 @@ static inline int is_uv_system(void)  { return 0; }
>  static inline int is_uv_hubless(void)  { return 0; }
>  static inline void uv_cpu_init(void)   { }
>  static inline void uv_system_init(void)        { }
> -static inline const struct cpumask *
> -uv_flush_tlb_others(const struct cpumask *cpumask,
> +static inline struct cpumask *
> +uv_flush_tlb_others(struct cpumask *cpumask,
>                     const struct flush_tlb_info *info)
>  { return cpumask; }
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 9a893673c56b..137a2c62c75b 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/cpu.h>
>  #include <linux/debugfs.h>
> +#include <linux/gfp.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -185,8 +186,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  {
>         struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>         u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +       bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
>         unsigned cpu = smp_processor_id();
>         u64 next_tlb_gen;
> +       bool need_flush;
> +       u16 new_asid;
>
>         /*
>          * NB: The scheduler will call us with prev == next when switching
> @@ -250,10 +254,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                                  !cpumask_test_cpu(cpu, mm_cpumask(next))))
>                         cpumask_set_cpu(cpu, mm_cpumask(next));
>
> -               return;

You left this comment:

                /*
                 * We don't currently support having a real mm loaded without
                 * our cpu set in mm_cpumask().  We have all the bookkeeping
                 * in place to figure out whether we would need to flush
                 * if our cpu were cleared in mm_cpumask(), but we don't
                 * currently use it.
                 */

Presumably you should either clear the cpu from mm_cpumask when lazy
or you shoudl update the comment.

> +               /*
> +                * Switching straight from one thread in a process to another
> +                * thread in the same process requires no TLB flush at all.
> +                */
> +               if (!was_lazy)
> +                       return;

Comment doesn't match code.  Maybe add "... if we weren't lazy"?

> +
> +               /*
> +                * The code below checks whether there was a TLB flush while
> +                * this CPU was in lazy TLB mode. The barrier ensures ordering
> +                * with the TLB invalidation code advancing .tlb_gen.
> +                */
> +               smp_rmb();

I think it may need to be smp_mb().  You're trying to order
this_cpu_write() against subsequent reads.

In general, the changes to this function are very hard to review
because you're mixing semantic changes and restructuring the function.
Is there any way you could avoid that?  Or maybe just open-code a
tlb_gen check in the unlazying path?

> +       /*
> +        * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +        * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +        * at the next context switch, or at page table free time.
> +        */

Stale comment?