From: Andy Lutomirski <luto@amacapital.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Rik van Riel <riel@redhat.com>, Brian Gerst <brgerst@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Borislav Petkov <bp@alien8.de>,
	Andrew Lutomirski <luto@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>
Subject: Re: [tip:x86/urgent] x86/mm: Add barriers and document switch_mm() -vs-flush synchronization
Date: Mon, 11 Jan 2016 13:50:24 -0800	[thread overview]
Message-ID: <CALCETrXLUewtYNQOdiGpKLvp=RL0eMLSj+v7_J0G1a_do+6G8Q@mail.gmail.com> (raw)
In-Reply-To: <20160111182548.GF6344@twins.programming.kicks-ass.net>

On Mon, Jan 11, 2016 at 10:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>  #endif
>>               cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>> -             /* Re-load page tables */
>> +             /*
>> +              * Re-load page tables.
>> +              *
>> +              * This logic has an ordering constraint:
>> +              *
>> +              *  CPU 0: Write to a PTE for 'next'
>> +              *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
>> +              *  CPU 1: set bit 1 in next's mm_cpumask
>> +              *  CPU 1: load from the PTE that CPU 0 writes (implicit)
>> +              *
>> +              * We need to prevent an outcome in which CPU 1 observes
>> +              * the new PTE value and CPU 0 observes bit 1 clear in
>> +              * mm_cpumask.  (If that occurs, then the IPI will never
>> +              * be sent, and CPU 0's TLB will contain a stale entry.)
>> +              *
>> +              * The bad outcome can occur if either CPU's load is
>> +              * reordered before that CPU's store, so both CPUs much
>
> s/much/must/ ?

Indeed.  Is this worth a follow-up patch?

>
>> +              * execute full barriers to prevent this from happening.
>> +              *
>> +              * Thus, switch_mm needs a full barrier between the
>> +              * store to mm_cpumask and any operation that could load
>> +              * from next->pgd.  This barrier synchronizes with
>> +              * remote TLB flushers.  Fortunately, load_cr3 is
>> +              * serializing and thus acts as a full barrier.
>> +              *
>> +              */
>>               load_cr3(next->pgd);
>> +
>>               trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>>
>>               /* Stop flush ipis for the previous mm */
>> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>                        * schedule, protecting us from simultaneous changes.
>>                        */
>>                       cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>>                       /*
>>                        * We were in lazy tlb mode and leave_mm disabled
>>                        * tlb flush IPI delivery. We must reload CR3
>>                        * to make sure to use no freed page tables.
>> +                      *
>> +                      * As above, this is a barrier that forces
>> +                      * TLB repopulation to be ordered after the
>> +                      * store to mm_cpumask.
>
> somewhat confused by this comment, cpumask_set_cpu() is a LOCK BTS, that
> is already fully ordered.

There are more than enough barriers here.  v1 had cpumask_set_cpu();
smp_mb__after_atomic(), which is more portable and generates identical
code.  I don't have a real preference for which barrier we should
consider to be the important one.
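To spell out the two equivalent orderings being compared (a sketch only,
not the merged code):

	/* Variant merged in this patch: rely on the serializing CR3 write. */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	load_cr3(next->pgd);		/* serializing, acts as a full barrier */

	/* Variant from v1: make the barrier explicit and portable. */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	smp_mb__after_atomic();		/* no-op on x86, real barrier elsewhere */
	load_cr3(next->pgd);

On x86 the LOCK-prefixed RMW inside cpumask_set_cpu() is itself fully
ordered, which is why both variants generate identical code there.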

>
>>                        */
>>                       load_cr3(next->pgd);
>>                       trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 8ddb5d0..8f4cc3d 100644
>
>
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>
>> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>
>>       if (!current->mm) {
>>               leave_mm(smp_processor_id());
>> +
>> +             /* Synchronize with switch_mm. */
>> +             smp_mb();
>> +
>>               goto out;
>>       }
>
>> +             } else {
>>                       leave_mm(smp_processor_id());
>> +
>> +                     /* Synchronize with switch_mm. */
>> +                     smp_mb();
>> +             }
>>       }
>
> The alternative is making leave_mm() unconditionally imply a full
> barrier. I've not looked at other sites using it though.

For a quick fix, I preferred the more self-contained change.
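To illustrate the trade-off (a rough sketch, not a posted patch): the
merged fix keeps the barrier at the flush_tlb_mm_range() call sites,

	if (!current->mm) {
		leave_mm(smp_processor_id());
		smp_mb();	/* synchronize with switch_mm() */
		goto out;
	}

whereas the alternative would fold an unconditional smp_mb() (or a
guaranteed serializing load_cr3()) into leave_mm() itself, so every
caller is covered.  That would mean auditing the other leave_mm()
users first, which is why the self-contained version went in as the
urgent fix.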

--Andy

Thread overview: 9+ messages
     [not found] <tip-71b3c126e61177eb693423f2e18a1914205b165e@git.kernel.org>
2016-01-11 18:25 ` [tip:x86/urgent] x86/mm: Add barriers and document switch_mm() -vs-flush synchronization Peter Zijlstra
2016-01-11 21:50   ` Andy Lutomirski [this message]
2016-01-11 22:22     ` Peter Zijlstra
2016-01-12 10:21     ` Ingo Molnar
2016-01-12 20:11       ` [PATCH] x86/mm: Improve switch_mm barrier comments Andy Lutomirski
2016-01-12 20:11         ` Andy Lutomirski
2016-01-12 20:47       ` [PATCH v2] " Andy Lutomirski
2016-01-12 20:47         ` Andy Lutomirski
2016-01-14  9:06         ` [tip:x86/urgent] x86/mm: Improve switch_mm() " tip-bot for Andy Lutomirski
