From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1761286AbcAKS0D (ORCPT );
	Mon, 11 Jan 2016 13:26:03 -0500
Received: from bombadil.infradead.org ([198.137.202.9]:54481 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1759321AbcAKS0A (ORCPT );
	Mon, 11 Jan 2016 13:26:00 -0500
Date: Mon, 11 Jan 2016 19:25:48 +0100
From: Peter Zijlstra 
To: linux-kernel@vger.kernel.org, dave.hansen@linux.intel.com,
	riel@redhat.com, brgerst@gmail.com, akpm@linux-foundation.org,
	luto@amacapital.net, mingo@kernel.org, dvlasenk@redhat.com,
	hpa@zytor.com, tglx@linutronix.de, bp@alien8.de, luto@kernel.org,
	torvalds@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Subject: Re: [tip:x86/urgent] x86/mm: Add barriers and document switch_mm() -vs-flush synchronization
Message-ID: <20160111182548.GF6344@twins.programming.kicks-ass.net>
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #endif
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
>
> -	/* Re-load page tables */
> +	/*
> +	 * Re-load page tables.
> +	 *
> +	 * This logic has an ordering constraint:
> +	 *
> +	 *  CPU 0: Write to a PTE for 'next'
> +	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
> +	 *  CPU 1: set bit 1 in next's mm_cpumask
> +	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
> +	 *
> +	 * We need to prevent an outcome in which CPU 1 observes
> +	 * the new PTE value and CPU 0 observes bit 1 clear in
> +	 * mm_cpumask.  (If that occurs, then the IPI will never
> +	 * be sent, and CPU 0's TLB will contain a stale entry.)
> +	 *
> +	 * The bad outcome can occur if either CPU's load is
> +	 * reordered before that CPU's store, so both CPUs much

s/much/must/ ?

> +	 * execute full barriers to prevent this from happening.
> +	 *
> +	 * Thus, switch_mm needs a full barrier between the
> +	 * store to mm_cpumask and any operation that could load
> +	 * from next->pgd.  This barrier synchronizes with
> +	 * remote TLB flushers.  Fortunately, load_cr3 is
> +	 * serializing and thus acts as a full barrier.
> +	 *
> +	 */
>  	load_cr3(next->pgd);
> +
>  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>
>  	/* Stop flush ipis for the previous mm */
> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	 * schedule, protecting us from simultaneous changes.
>  	 */
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
> +
>  	/*
>  	 * We were in lazy tlb mode and leave_mm disabled
>  	 * tlb flush IPI delivery. We must reload CR3
>  	 * to make sure to use no freed page tables.
> +	 *
> +	 * As above, this is a barrier that forces
> +	 * TLB repopulation to be ordered after the
> +	 * store to mm_cpumask.

Somewhat confused by this comment; cpumask_set_cpu() is a LOCK BTS,
which is already fully ordered.
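FWIW, the constraint above is the classic store-buffering pattern. A
purely illustrative sketch of the two sides (not the real code; 'pte'
and 'cpumask_bit' just stand in for the PTE and the mm_cpumask bit,
and the helpers are hypothetical):

static int pte;			/* stands in for the PTE CPU 0 writes	  */
static int cpumask_bit;		/* stands in for CPU 1's mm_cpumask bit	  */

static void cpu0_remote_flush(void)	/* flush_tlb_mm_range() side */
{
	WRITE_ONCE(pte, 1);		/* store: the PTE update	      */
	smp_mb();			/* full barrier			      */
	if (READ_ONCE(cpumask_bit)) {	/* load: is CPU 1 using this mm?      */
		/* send the flush IPI */
	}
}

static void cpu1_switch_mm(void)	/* switch_mm() side */
{
	int fill;

	WRITE_ONCE(cpumask_bit, 1);	/* store: cpumask_set_cpu(), LOCK BTS */
	smp_mb();			/* full barrier: load_cr3() is
					 * serializing, and the LOCK BTS
					 * above is itself fully ordered      */
	fill = READ_ONCE(pte);		/* load: the implicit TLB fill	      */
	(void)fill;
}

Without a full barrier on *both* sides, each CPU's load may complete
before its own store is visible to the other CPU, which is exactly the
outcome the comment describes: no IPI gets sent and a stale translation
survives.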
>  	 */
>  	load_cr3(next->pgd);
>  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 8ddb5d0..8f4cc3d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  	if (!current->mm) {
>  		leave_mm(smp_processor_id());
> +
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +
>  		goto out;
>  	}
> +	} else {
>  		leave_mm(smp_processor_id());
> +
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +	}
>  }

The alternative is making leave_mm() unconditionally imply a full
barrier. I've not looked at other sites using it though.
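Something like the below is what I mean (rough sketch only, not even
compile tested; the leave_mm() body is reproduced roughly as it stands
today, with just the trailing smp_mb() added):

void leave_mm(int cpu)
{
	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);

	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
		BUG();

	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
		load_cr3(swapper_pg_dir);
		/* called on the idle path; tracing must use _rcuidle */
		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
	}

	/*
	 * Pairs with the barrier in switch_mm().  Redundant when the
	 * load_cr3() above ran (it is serializing), but this makes the
	 * ordering guarantee unconditional for all callers.
	 */
	smp_mb();
}

That would let the explicit smp_mb() calls added to flush_tlb_mm_range()
above go away again.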