From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1761286AbcAKS0D (ORCPT );
	Mon, 11 Jan 2016 13:26:03 -0500
Received: from bombadil.infradead.org ([198.137.202.9]:54481 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1759321AbcAKS0A (ORCPT );
	Mon, 11 Jan 2016 13:26:00 -0500
Date: Mon, 11 Jan 2016 19:25:48 +0100
From: Peter Zijlstra 
To: linux-kernel@vger.kernel.org, dave.hansen@linux.intel.com,
	riel@redhat.com, brgerst@gmail.com, akpm@linux-foundation.org,
	luto@amacapital.net, mingo@kernel.org, dvlasenk@redhat.com,
	hpa@zytor.com, tglx@linutronix.de, bp@alien8.de, luto@kernel.org,
	torvalds@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Subject: Re: [tip:x86/urgent] x86/mm: Add barriers and document switch_mm() -vs-flush synchronization
Message-ID: <20160111182548.GF6344@twins.programming.kicks-ass.net>
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #endif
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
>
> -	/* Re-load page tables */
> +	/*
> +	 * Re-load page tables.
> +	 *
> +	 * This logic has an ordering constraint:
> +	 *
> +	 *  CPU 0: Write to a PTE for 'next'
> +	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
> +	 *  CPU 1: set bit 1 in next's mm_cpumask
> +	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
> +	 *
> +	 * We need to prevent an outcome in which CPU 1 observes
> +	 * the new PTE value and CPU 0 observes bit 1 clear in
> +	 * mm_cpumask.  (If that occurs, then the IPI will never
> +	 * be sent, and CPU 0's TLB will contain a stale entry.)
> +	 *
> +	 * The bad outcome can occur if either CPU's load is
> +	 * reordered before that CPU's store, so both CPUs much

s/much/must/ ?

> +	 * execute full barriers to prevent this from happening.
> +	 *
> +	 * Thus, switch_mm needs a full barrier between the
> +	 * store to mm_cpumask and any operation that could load
> +	 * from next->pgd.  This barrier synchronizes with
> +	 * remote TLB flushers.  Fortunately, load_cr3 is
> +	 * serializing and thus acts as a full barrier.
> +	 *
> +	 */
>  	load_cr3(next->pgd);
> +
>  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>
>  	/* Stop flush ipis for the previous mm */
> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	 * schedule, protecting us from simultaneous changes.
>  	 */
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
> +
>  	/*
>  	 * We were in lazy tlb mode and leave_mm disabled
>  	 * tlb flush IPI delivery. We must reload CR3
>  	 * to make sure to use no freed page tables.
> +	 *
> +	 * As above, this is a barrier that forces
> +	 * TLB repopulation to be ordered after the
> +	 * store to mm_cpumask.

Somewhat confused by this comment; cpumask_set_cpu() is a LOCK BTS,
which is already fully ordered.
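FWIW, the constraint above is the classic store-buffering pattern. A
purely illustrative sketch of the two sides (not the real code; 'pte'
and 'cpumask_bit' just stand in for the PTE and the mm_cpumask bit,
and the helpers are hypothetical):

static int pte;			/* stands in for the PTE CPU 0 writes	  */
static int cpumask_bit;		/* stands in for CPU 1's mm_cpumask bit	  */

static void cpu0_remote_flush(void)	/* flush_tlb_mm_range() side */
{
	WRITE_ONCE(pte, 1);		/* store: the PTE update	      */
	smp_mb();			/* full barrier			      */
	if (READ_ONCE(cpumask_bit)) {	/* load: is CPU 1 using this mm?      */
		/* send the flush IPI */
	}
}

static void cpu1_switch_mm(void)	/* switch_mm() side */
{
	int fill;

	WRITE_ONCE(cpumask_bit, 1);	/* store: cpumask_set_cpu(), LOCK BTS */
	smp_mb();			/* full barrier: load_cr3() is
					 * serializing, and the LOCK BTS
					 * above is itself fully ordered      */
	fill = READ_ONCE(pte);		/* load: the implicit TLB fill	      */
	(void)fill;
}

Without a full barrier on *both* sides, each CPU's load may complete
before its own store is visible to the other CPU, which is exactly the
outcome the comment describes: no IPI gets sent and a stale translation
survives.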
>  	 */
>  	load_cr3(next->pgd);
>  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 8ddb5d0..8f4cc3d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  	if (!current->mm) {
>  		leave_mm(smp_processor_id());
> +
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +
>  		goto out;
>  	}
> +	} else {
>  		leave_mm(smp_processor_id());
> +
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +	}
>  }

The alternative is making leave_mm() unconditionally imply a full
barrier. I've not looked at other sites using it though.
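Something like the below is what I mean (rough sketch only, not even
compile tested; the leave_mm() body is reproduced roughly as it stands
today, with just the trailing smp_mb() added):

void leave_mm(int cpu)
{
	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);

	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
		BUG();

	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
		load_cr3(swapper_pg_dir);
		/* called on the idle path; tracing must use _rcuidle */
		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
	}

	/*
	 * Pairs with the barrier in switch_mm().  Redundant when the
	 * load_cr3() above ran (it is serializing), but this makes the
	 * ordering guarantee unconditional for all callers.
	 */
	smp_mb();
}

That would let the explicit smp_mb() calls added to flush_tlb_mm_range()
above go away again.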