From: Andy Lutomirski <luto@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
x86@vger.kernel.org, Andrew Lutomirski <luto@kernel.org>,
Ingo Molnar <mingo@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Mike Galbraith <efault@gmx.de>,
songliubraving@fb.com, kernel-team <kernel-team@fb.com>
Subject: Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time
Date: Fri, 22 Jun 2018 07:58:43 -0700
Message-ID: <CALCETrX+EmeV5PxfwDwO=W4Deu9T_nPj5WbQX0mgxMV08vN=tg@mail.gmail.com>
In-Reply-To: <20180620195652.27251-3-riel@surriel.com>
On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> wrote:
>
> Andy discovered that speculative memory accesses while in lazy
> TLB mode can crash a system, when a CPU speculatively dereferences
> memory contents that used to be valid page table memory, but have
> since been reused for something else and now point into la-la land.
>
> That problem can be prevented in two ways. The first is to
> always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
> the second one is to only send the TLB shootdown at page table
> freeing time.
>
> The second should result in fewer IPIs, since operations like
> mprotect and madvise are very common with some workloads, but
> do not involve page table freeing. Also, on munmap, batching
> of page table freeing covers much larger ranges of virtual
> memory than the batching of unmapped user pages.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
> arch/x86/include/asm/tlbflush.h | 5 +++++
> arch/x86/mm/tlb.c | 24 ++++++++++++++++++++++++
> include/asm-generic/tlb.h | 10 ++++++++++
> mm/memory.c | 22 ++++++++++++++--------
> 4 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6690cd3fc8b1..3aa3204b5dc0 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> native_flush_tlb_others(mask, info)
> #endif
>
> +extern void tlb_flush_remove_tables(struct mm_struct *mm);
> +extern void tlb_flush_remove_tables_local(void *arg);
> +
> +#define HAVE_TLB_FLUSH_REMOVE_TABLES
> +
> #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a06699..61773b07ed54 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -646,6 +646,30 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> put_cpu();
> }
>
> +void tlb_flush_remove_tables_local(void *arg)
> +{
> + struct mm_struct *mm = arg;
> +
> + if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
> + this_cpu_read(cpu_tlbstate.is_lazy))
> + /*
> + * We're in lazy mode. We need to at least flush our
> + * paging-structure cache to avoid speculatively reading
> + * garbage into our TLB. Since switching to init_mm is barely
> + * slower than a minimal flush, just switch to init_mm.
> + */
> + switch_mm_irqs_off(NULL, &init_mm, NULL);
Can you add braces?
> +}
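Something like this is what I had in mind -- purely illustrative and
whitespace-damaged, not a tested diff:

```c
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
	    this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * We're in lazy mode.  We need to at least flush our
		 * paging-structure cache to avoid speculatively reading
		 * garbage into our TLB.  Since switching to init_mm is
		 * barely slower than a minimal flush, just switch.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
	}
```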
> +
> +void tlb_flush_remove_tables(struct mm_struct *mm)
> +{
> + int cpu = get_cpu();
> + /*
> + * XXX: this really only needs to be called for CPUs in lazy TLB mode.
> + */
> + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
I suspect that most of the gain will come from fixing this limitation :)
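To sketch what I mean (untested, and patch 5/7 may already be headed in
this direction): walk mm_cpumask() and IPI only the CPUs whose tlbstate
says they're lazy, falling back to the full mask if the temporary
cpumask can't be allocated:

```c
void tlb_flush_remove_tables(struct mm_struct *mm)
{
	int cpu = get_cpu();
	cpumask_var_t lazymask;
	int other_cpu;

	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
		goto out;

	if (!zalloc_cpumask_var(&lazymask, GFP_ATOMIC)) {
		/* Allocation failed: fall back to IPIing every CPU. */
		smp_call_function_many(mm_cpumask(mm),
				       tlb_flush_remove_tables_local,
				       (void *)mm, 1);
		goto out;
	}

	/* Only CPUs currently in lazy TLB mode need the flush. */
	for_each_cpu(other_cpu, mm_cpumask(mm)) {
		if (per_cpu(cpu_tlbstate.is_lazy, other_cpu))
			cpumask_set_cpu(other_cpu, lazymask);
	}

	smp_call_function_many(lazymask, tlb_flush_remove_tables_local,
			       (void *)mm, 1);
	free_cpumask_var(lazymask);
out:
	put_cpu();
}
```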