Linux-Next Archive on lore.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>, Ingo Molnar <mingo@kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linux-Next Mailing List <linux-next@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linus <torvalds@linux-foundation.org>
Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree
Date: Mon, 14 Aug 2017 21:38:28 +0200
Message-ID: <20170814193828.GN6524@worktop.programming.kicks-ass.net>
In-Reply-To: <0F858068-D41D-46E3-B4A8-8A95B4EDB94F@vmware.com>

On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
> >> So I'm not entirely clear about this yet.
> >> 
> >> How about:
> >> 
> >> 
> >> 	CPU0				CPU1
> >> 
> >> 					tlb_gather_mmu()
> >> 
> >> 					lock PTLn
> >> 					no mod
> >> 					unlock PTLn
> >> 
> >> 	tlb_gather_mmu()
> >> 
> >> 					lock PTLm
> >> 					mod
> >> 					include in tlb range
> >> 					unlock PTLm
> >> 
> >> 	lock PTLn
> >> 	mod
> >> 	unlock PTLn
> >> 
> >> 					tlb_finish_mmu()
> >> 					  force = mm_tlb_flush_nested(tlb->mm);
> >> 					  arch_tlb_finish_mmu(force);
> >> 
> >> 
> >> 	... more ...
> >> 
> >> 	tlb_finish_mmu()
> >> 
> >> 
> >> 
> >> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> >> true, right?
> > 
> > No, because CPU 1 modified the pte and added it into the tlb range,
> > so regardless of nesting, it will flush the TLB, so there is no stale
> > TLB problem.

> To clarify: the main problem that these patches address is when the first
> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
> PTE is already what I wanted - no flush is needed”.

OK, that simplifies things.
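
To make that concrete -- a minimal sketch, not code from either series, with
the flush placement simplified and all helper usage assumed -- the problematic
"skip the flush" logic and its fix look roughly like this:

	spin_lock(ptl);
	if (pte_same(*ptep, newpte)) {
		/*
		 * The PTE already has the value we want; naively we would
		 * skip the TLB flush here.  But the thread that wrote this
		 * value may not have flushed yet, so without the check
		 * below we could keep using a stale TLB entry.
		 */
		if (mm_tlb_flush_pending(mm))
			flush_tlb_range(vma, addr, addr + PAGE_SIZE);
	} else {
		set_pte_at(mm, addr, ptep, newpte);
		flush_tlb_range(vma, addr, addr + PAGE_SIZE);
	}
	spin_unlock(ptl);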

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case.

Depends on what you call 'interesting' :-) They are 'interesting' to
make work from a memory ordering POV. And since I didn't realize they were
excluded from the set, I worried.

In fact, if they were to be included, I couldn't make it work at all. So
I'm really glad to hear we can disregard them.

> However, based on my understanding of the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC, using smp_mb__after_unlock_lock() in this
> case, before the read, would solve the problem with the least impact on
> systems with strong memory ordering.

No, all is well. If, as you say, we're naturally constrained to the case
where we only care about prior modification, we can rely on the RCpc PTL
locks.

Consider:


	CPU0				CPU1

					tlb_gather_mmu()

	tlb_gather_mmu()
	  inc	--------.
			| (inc is constrained by RELEASE)
	lock PTLn	|
	mod		^
	unlock PTLn ----------------->	lock PTLn
				v	no mod
				|	unlock PTLn
				|
				|	lock PTLm
				|	mod
				|	include in tlb range
				|	unlock PTLm
				|
	(read is constrained	|
	          by ACQUIRE)	|
				|	tlb_finish_mmu()
				`----	  force = mm_tlb_flush_nested(tlb->mm);
					  arch_tlb_finish_mmu(force);


	... more ...

	tlb_finish_mmu()


Then CPU1's acquire of PTLn orders against CPU0's release of that same
PTLn, which guarantees we observe both CPU0's (prior) PTE modification and
the mm->tlb_flush_pending increment from tlb_gather_mmu().

So all we need for mm_tlb_flush_nested() to work is having acquired the
right PTL at least once before calling it.
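
In sketch form (names follow the batching patches under discussion; the
surrounding context is assumed):

	spin_lock(ptl);		/* ACQUIRE: pairs with CPU0's unlock of this PTL */
	/* ... inspect/modify PTEs, perhaps deciding no flush is needed ... */
	spin_unlock(ptl);

	/*
	 * The PTL need not still be held here: having acquired it once is
	 * enough.  If we observed CPU0's PTE modification, we also observe
	 * its tlb_flush_pending increment.
	 */
	force = mm_tlb_flush_nested(tlb->mm);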

At the same time, the decrements need to happen after the TLB invalidate
is complete; this ensures that _IF_ we observe the decrement, we must've
also observed the corresponding invalidate.
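
On the flushing side the required order is then (the same sequence as in
the comment in inc_tlb_flush_pending(); details assumed):

	inc_tlb_flush_pending(mm);		/* A */
	spin_lock(ptl);
	ptep_get_and_clear(mm, addr, ptep);	/* B */
	spin_unlock(ptl);			/* RELEASE: anyone who observes B
						 * under this PTL also observes A */
	flush_tlb_range(vma, start, end);	/* the actual invalidate */
	dec_tlb_flush_pending(mm);		/* C: only after the invalidate,
						 * so observing C implies the
						 * flush completed */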

Something like the below is then sufficient.

---
Subject: mm: Clarify tlb_flush_pending barriers
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 11 Aug 2017 16:04:50 +0200

Better document the ordering around tlb_flush_pending.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |   78 +++++++++++++++++++++++++++--------------------
 1 file changed, 45 insertions(+), 33 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
 extern void tlb_finish_mmu(struct mmu_gather *tlb,
 				unsigned long start, unsigned long end);
 
-/*
- * Memory barriers to keep this state in sync are graciously provided by
- * the page table locks, outside of which no page table modifications happen.
- * The barriers are used to ensure the order between tlb_flush_pending updates,
- * which happen while the lock is not taken, and the PTE updates, which happen
- * while the lock is taken, are serialized.
- */
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
-{
-	/*
-	 * Must be called with PTL held; such that our PTL acquire will have
-	 * observed the store from set_tlb_flush_pending().
-	 */
-	return atomic_read(&mm->tlb_flush_pending) > 0;
-}
-
-/*
- * Returns true if there are two above TLB batching threads in parallel.
- */
-static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
-{
-	return atomic_read(&mm->tlb_flush_pending) > 1;
-}
-
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 	atomic_set(&mm->tlb_flush_pending, 0);
@@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
 static inline void inc_tlb_flush_pending(struct mm_struct *mm)
 {
 	atomic_inc(&mm->tlb_flush_pending);
-
 	/*
 	 * The only time this value is relevant is when there are indeed pages
 	 * to flush. And we'll only flush pages after changing them, which
@@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
 	 *	flush_tlb_range();
 	 *	atomic_dec(&mm->tlb_flush_pending);
 	 *
-	 * So the =true store is constrained by the PTL unlock, and the =false
-	 * store is constrained by the TLB invalidate.
+	 * Where the increment is constrained by the PTL unlock, it thus
+	 * ensures that the increment is visible if the PTE modification is
+	 * visible. After all, if there is no PTE modification, nobody cares
+	 * about TLB flushes either.
+	 *
+	 * This very much relies on users (mm_tlb_flush_pending() and
+	 * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
+	 * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
+	 * locks (PPC) the unlock of one doesn't order against the lock of
+	 * another PTL.
+	 *
+	 * The decrement is ordered by the flush_tlb_range(), such that
+	 * mm_tlb_flush_pending() will not return false unless all flushes have
+	 * completed.
 	 */
 }
 
-/* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 	/*
-	 * Guarantee that the tlb_flush_pending does not not leak into the
-	 * critical section, since we must order the PTE change and changes to
-	 * the pending TLB flush indication. We could have relied on TLB flush
-	 * as a memory barrier, but this behavior is not clearly documented.
+	 * See inc_tlb_flush_pending().
+	 *
+	 * This cannot be smp_mb__before_atomic() because smp_mb() simply does
+	 * not order against TLB invalidate completion, which is what we need.
+	 *
+	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	smp_mb__before_atomic();
 	atomic_dec(&mm->tlb_flush_pending);
 }
 
+static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
+{
+	/*
+	 * Must be called after having acquired the PTL; orders against that
+	 * PTL's release and therefore ensures that if we observe the modified
+	 * PTE we must also observe the increment from inc_tlb_flush_pending().
+	 *
+	 * That is, it only guarantees to return true if there is a flush
+	 * pending for _this_ PTL.
+	 */
+	return atomic_read(&mm->tlb_flush_pending);
+}
+
+static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
+{
+	/*
+	 * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
+	 * for which there is a TLB flush pending in order to guarantee
+	 * we've seen both that PTE modification and the increment.
+	 *
+	 * (no requirement on actually still holding the PTL, that is irrelevant)
+	 */
+	return atomic_read(&mm->tlb_flush_pending) > 1;
+}
+
 struct vm_fault;
 
 struct vm_special_mapping {
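
For reference, the consumer shown in the diagrams above (roughly Minchan's
batching change; the arch_tlb_finish_mmu() signature is assumed) would then
be:

	void tlb_finish_mmu(struct mmu_gather *tlb,
			    unsigned long start, unsigned long end)
	{
		/*
		 * If another mmu_gather ran against this mm concurrently, we
		 * may have observed PTE modifications it made without having
		 * them covered by our own flush range; force a flush of the
		 * whole range rather than risk leaving a stale TLB entry.
		 */
		bool force = mm_tlb_flush_nested(tlb->mm);

		arch_tlb_finish_mmu(tlb, start, end, force);
		dec_tlb_flush_pending(tlb->mm);
	}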

Thread overview:
2017-08-11  7:53 Stephen Rothwell
2017-08-11  9:34 ` Peter Zijlstra
2017-08-11 10:48   ` Peter Zijlstra
2017-08-11 11:45   ` Stephen Rothwell
2017-08-11 11:56     ` Ingo Molnar
2017-08-11 12:17       ` Peter Zijlstra
2017-08-11 12:44         ` Ingo Molnar
2017-08-11 13:49           ` Stephen Rothwell
2017-08-11 14:04       ` Peter Zijlstra
2017-08-13  6:06         ` Nadav Amit
2017-08-13 12:50           ` Peter Zijlstra
2017-08-14  3:16             ` Minchan Kim
2017-08-14  5:07               ` Nadav Amit
2017-08-14  5:23                 ` Minchan Kim
2017-08-14  8:38                 ` Minchan Kim
2017-08-14 19:57                   ` Peter Zijlstra
2017-08-16  4:14                     ` Minchan Kim
2017-08-14 19:38                 ` Peter Zijlstra [this message]
2017-08-15  7:51                   ` Nadav Amit
2017-08-14  3:09         ` Minchan Kim
2017-08-14 18:54           ` Peter Zijlstra