Linux-Next Archive on lore.kernel.org
 help / color / Atom feed
From: Nadav Amit <namit@vmware.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>, Ingo Molnar <mingo@kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linux-Next Mailing List <linux-next@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linus <torvalds@linux-foundation.org>
Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree
Date: Tue, 15 Aug 2017 07:51:57 +0000
Message-ID: <1138ED5D-AA95-48D0-86D4-75F20DFE0E0B@vmware.com> (raw)
In-Reply-To: <20170814193828.GN6524@worktop.programming.kicks-ass.net>

Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
>>>> So I'm not entirely clear about this yet.
>>>> 
>>>> How about:
>>>> 
>>>> 
>>>> 	CPU0				CPU1
>>>> 
>>>> 					tlb_gather_mmu()
>>>> 
>>>> 					lock PTLn
>>>> 					no mod
>>>> 					unlock PTLn
>>>> 
>>>> 	tlb_gather_mmu()
>>>> 
>>>> 					lock PTLm
>>>> 					mod
>>>> 					include in tlb range
>>>> 					unlock PTLm
>>>> 
>>>> 	lock PTLn
>>>> 	mod
>>>> 	unlock PTLn
>>>> 
>>>> 					tlb_finish_mmu()
>>>> 					  force = mm_tlb_flush_nested(tlb->mm);
>>>> 					  arch_tlb_finish_mmu(force);
>>>> 
>>>> 
>>>> 	... more ...
>>>> 
>>>> 	tlb_finish_mmu()
>>>> 
>>>> 
>>>> 
>>>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>>>> true, right?
>>> 
>>> No, because CPU 1 mofified pte and added it into tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
> 
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
> 
> OK, that simplifies things.
> 
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
> 
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
> 
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
> 
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
> 
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
> 
> Consider:
> 
> 
> 	CPU0				CPU1
> 
> 					tlb_gather_mmu()
> 
> 	tlb_gather_mmu()
> 	  inc	--------.
> 			| (inc is constrained by RELEASE)
> 	lock PTLn	|
> 	mod		^
> 	unlock PTLn ----------------->	lock PTLn
> 				v	no mod
> 				|	unlock PTLn
> 				|
> 				|	lock PTLm
> 				|	mod
> 				|	include in tlb range
> 				|	unlock PTLm
> 				|
> 	(read is constrained	|
> 	          by ACQUIRE)	|
> 				|	tlb_finish_mmu()
> 				`----	  force = mm_tlb_flush_nested(tlb->mm);
> 					  arch_tlb_finish_mmu(force);
> 
> 
> 	... more ...
> 
> 	tlb_finish_mmu()
> 
> 
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
> 
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
> 
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
> 
> Something like the below is then sufficient.
> 
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 11 Aug 2017 16:04:50 +0200
> 
> Better document the ordering around tlb_flush_pending.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/mm_types.h |   78 +++++++++++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 33 deletions(-)
> 
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
> 				unsigned long start, unsigned long end);
> 
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending updates,
> - * which happen while the lock is not taken, and the PTE updates, which happen
> - * while the lock is taken, are serialized.
> - */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> -{
> -	/*
> -	 * Must be called with PTL held; such that our PTL acquire will have
> -	 * observed the store from set_tlb_flush_pending().
> -	 */
> -	return atomic_read(&mm->tlb_flush_pending) > 0;
> -}
> -
> -/*
> - * Returns true if there are two above TLB batching threads in parallel.
> - */
> -static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> -{
> -	return atomic_read(&mm->tlb_flush_pending) > 1;
> -}
> -
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> 	atomic_set(&mm->tlb_flush_pending, 0);
> @@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
> static inline void inc_tlb_flush_pending(struct mm_struct *mm)
> {
> 	atomic_inc(&mm->tlb_flush_pending);
> -
> 	/*
> 	 * The only time this value is relevant is when there are indeed pages
> 	 * to flush. And we'll only flush pages after changing them, which
> @@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
> 	 *	flush_tlb_range();
> 	 *	atomic_dec(&mm->tlb_flush_pending);
> 	 *
> -	 * So the =true store is constrained by the PTL unlock, and the =false
> -	 * store is constrained by the TLB invalidate.
> +	 * Where the increment if constrained by the PTL unlock, it thus
> +	 * ensures that the increment is visible if the PTE modification is
> +	 * visible. After all, if there is no PTE modification, nobody cares
> +	 * about TLB flushes either.
> +	 *
> +	 * This very much relies on users (mm_tlb_flush_pending() and
> +	 * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
> +	 * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
> +	 * locks (PPC) the unlock of one doesn't order against the lock of
> +	 * another PTL.
> +	 *
> +	 * The decrement is ordered by the flush_tlb_range(), such that
> +	 * mm_tlb_flush_pending() will not return false unless all flushes have
> +	 * completed.
> 	 */
> }
> 
> -/* Clearing is done after a TLB flush, which also provides a barrier. */
> static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> {
> 	/*
> -	 * Guarantee that the tlb_flush_pending does not not leak into the
> -	 * critical section, since we must order the PTE change and changes to
> -	 * the pending TLB flush indication. We could have relied on TLB flush
> -	 * as a memory barrier, but this behavior is not clearly documented.
> +	 * See inc_tlb_flush_pending().
> +	 *
> +	 * This cannot be smp_mb__before_atomic() because smp_mb() simply does
> +	 * not order against TLB invalidate completion, which is what we need.
> +	 *
> +	 * Therefore we must rely on tlb_flush_*() to guarantee order.
> 	 */
> -	smp_mb__before_atomic();
> 	atomic_dec(&mm->tlb_flush_pending);
> }
> 
> +static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	/*
> +	 * Must be called after having acquired the PTL; orders against that
> +	 * PTLs release and therefore ensures that if we observe the modified
> +	 * PTE we must also observe the increment from inc_tlb_flush_pending().
> +	 *
> +	 * That is, it only guarantees to return true if there is a flush
> +	 * pending for _this_ PTL.
> +	 */
> +	return atomic_read(&mm->tlb_flush_pending);
> +}
> +
> +static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> +{
> +	/*
> +	 * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
> +	 * for which there is a TLB flush pending in order to guarantee
> +	 * we've seen both that PTE modification and the increment.
> +	 *
> +	 * (no requirement on actually still holding the PTL, that is irrelevant)
> +	 */
> +	return atomic_read(&mm->tlb_flush_pending) > 1;
> +}
> +
> struct vm_fault;
> 
> struct vm_special_mapping {

Thanks for the detailed explanation. I will pay more attention next time.


  reply index

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-11  7:53 Stephen Rothwell
2017-08-11  9:34 ` Peter Zijlstra
2017-08-11 10:48   ` Peter Zijlstra
2017-08-11 11:45   ` Stephen Rothwell
2017-08-11 11:56     ` Ingo Molnar
2017-08-11 12:17       ` Peter Zijlstra
2017-08-11 12:44         ` Ingo Molnar
2017-08-11 13:49           ` Stephen Rothwell
2017-08-11 14:04       ` Peter Zijlstra
2017-08-13  6:06         ` Nadav Amit
2017-08-13 12:50           ` Peter Zijlstra
2017-08-14  3:16             ` Minchan Kim
2017-08-14  5:07               ` Nadav Amit
2017-08-14  5:23                 ` Minchan Kim
2017-08-14  8:38                 ` Minchan Kim
2017-08-14 19:57                   ` Peter Zijlstra
2017-08-16  4:14                     ` Minchan Kim
2017-08-14 19:38                 ` Peter Zijlstra
2017-08-15  7:51                   ` Nadav Amit [this message]
2017-08-14  3:09         ` Minchan Kim
2017-08-14 18:54           ` Peter Zijlstra
  -- strict thread matches above, loose matches on Subject: below --
2019-10-31  5:43 Stephen Rothwell
2019-06-24 10:24 Stephen Rothwell
2019-05-01 11:10 Stephen Rothwell
2019-01-31  4:31 Stephen Rothwell
2018-08-20  4:32 Stephen Rothwell
2018-08-20 19:52 ` Andrew Morton
2018-03-23  5:59 Stephen Rothwell
2017-12-18  5:04 Stephen Rothwell
2017-11-10  4:33 Stephen Rothwell
2017-11-02  7:19 Stephen Rothwell
2017-08-22  6:57 Stephen Rothwell
2017-08-23  6:39 ` Vlastimil Babka
2017-04-12  6:46 Stephen Rothwell
2017-04-12 20:53 ` Vlastimil Babka
2017-04-20  2:17   ` NeilBrown
2017-03-24  5:25 Stephen Rothwell
2017-02-17  4:40 Stephen Rothwell
2016-11-14  6:08 Stephen Rothwell
2016-07-29  4:14 Stephen Rothwell
2016-06-15  5:23 Stephen Rothwell
2016-06-18 19:39 ` Manfred Spraul
2016-04-29  6:12 Stephen Rothwell
2016-04-29  6:26 ` Ingo Molnar
2016-03-02  5:40 Stephen Rothwell
2016-02-26  5:07 Stephen Rothwell
2016-02-26 21:35 ` Andrew Morton
2016-02-19  4:09 Stephen Rothwell
2016-02-19 15:26 ` Ard Biesheuvel
2015-12-07  8:06 Stephen Rothwell
2015-10-02  4:21 Stephen Rothwell
2015-07-28  6:00 Stephen Rothwell
2015-07-29 17:12 ` Andrea Arcangeli
2015-07-29 17:47   ` Andy Lutomirski
2015-07-29 18:46     ` Thomas Gleixner
2015-07-30 15:38       ` Andrea Arcangeli
2015-07-29 23:06   ` Stephen Rothwell
2015-07-29 23:07     ` Thomas Gleixner
2015-09-07 23:35   ` Stephen Rothwell
2015-09-08 18:11     ` Linus Torvalds
2015-09-08 22:56       ` Stephen Rothwell
2015-09-08 23:03         ` Linus Torvalds
2015-09-08 23:21           ` Andrew Morton
2015-09-16  6:58             ` Geert Uytterhoeven
2015-06-04 12:07 Stephen Rothwell
2015-04-08  8:28 Stephen Rothwell
2015-04-08  8:25 Stephen Rothwell
2014-03-17  9:31 Stephen Rothwell
2014-03-17  9:36 ` Peter Zijlstra
2014-03-19 23:27   ` Andrew Morton
2014-01-14  4:53 Stephen Rothwell
2014-01-14  5:04 ` Davidlohr Bueso
2014-01-14 12:51 ` Peter Zijlstra
2014-01-14 13:17   ` Geert Uytterhoeven
2014-01-14 13:33     ` Peter Zijlstra
2014-01-14 16:19     ` H. Peter Anvin
2014-01-14 15:15   ` H. Peter Anvin
2014-01-14 15:20     ` Geert Uytterhoeven
2014-01-14 15:41       ` Peter Zijlstra
2014-01-14 15:48         ` H. Peter Anvin
2014-01-07  6:00 Stephen Rothwell
2014-01-07  6:34 ` Tang Chen
2013-11-08  7:48 Stephen Rothwell
2013-11-08 18:58 ` Josh Triplett
2013-11-08 23:20   ` Stephen Rothwell
2013-11-09  0:19     ` Josh Triplett
2013-10-30  6:40 Stephen Rothwell

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1138ED5D-AA95-48D0-86D4-75F20DFE0E0B@vmware.com \
    --to=namit@vmware.com \
    --cc=akpm@linux-foundation.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-next@vger.kernel.org \
    --cc=minchan@kernel.org \
    --cc=mingo@elte.hu \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=sfr@canb.auug.org.au \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Next Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-next/0 linux-next/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-next linux-next/ https://lore.kernel.org/linux-next \
		linux-next@vger.kernel.org
	public-inbox-index linux-next

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-next


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git