linux-kernel.vger.kernel.org archive mirror
* [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock
@ 2017-08-02 11:38 Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 11:38 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

This series removes two smp_mb__before_spinlock() (ab)users and converts the
scheduler to use smp_mb__after_spinlock(), which provides stronger guarantees
with the same number of barriers.
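
Concretely, the scheduler conversion in patch 3 just moves the barrier to the
other side of the lock acquisition; schematically (mirroring the
try_to_wake_up() hunk in patch 3):

	/* old */
	smp_mb__before_spinlock();
	raw_spin_lock_irqsave(&p->pi_lock, flags);

	/* new */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();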

* [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 11:38 [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock Peter Zijlstra
@ 2017-08-02 11:38 ` Peter Zijlstra
  2017-08-02 13:00   ` Mel Gorman
                     ` (3 more replies)
  2017-08-02 11:38 ` [PATCH -v2 2/4] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 11:38 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Mel Gorman,
	Rik van Riel

[-- Attachment #1: peterz-mm_tlb_flush_pending.patch --]
[-- Type: text/plain, Size: 5387 bytes --]

Commit:

  af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")

added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
can solve the same problem without this barrier.

If instead we mandate that mm_tlb_flush_pending() is used while
holding the PTL we're guaranteed to observe prior
set_tlb_flush_pending() instances.

For this to work we need to rework migrate_misplaced_transhuge_page()
a little and move the test up into do_huge_pmd_numa_page().

NOTE: this relies on flush_tlb_range() to guarantee:

   (1) it ensures that prior page table updates are visible to the
       page table walker and
   (2) it ensures that subsequent memory accesses are only made
       visible after the invalidation has completed

This is required for architectures that implement TRANSPARENT_HUGEPAGE
(arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
mm_tlb_flush_pending() in their page-table operations (arm, arm64,
x86).

This appears true for:

 - arm (DSB ISB before and after),
 - arm64 (DSB ISHST before, and DSB ISH after),
 - powerpc (PTESYNC before and after),
 - s390 and x86 TLB invalidate are serializing instructions

But I failed to understand the situation for:

 - arc, mips, sparc

Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
inside the PTL. It still needs to guarantee the PTL unlock happens
_after_ the invalidate completes.
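
The core of that argument -- a store done before taking the PTL is visible to
whoever acquires the PTL after the matching unlock -- can be modelled in plain
userspace C11, with a pthread mutex standing in for the PTL (purely
illustrative; not kernel code):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
	static atomic_bool flush_pending;	/* models mm->tlb_flush_pending */
	static int pte;				/* models a page-table entry */

	/* models set_tlb_flush_pending() + set_pte_at() + flush */
	static void *changer(void *arg)
	{
		atomic_store_explicit(&flush_pending, true, memory_order_relaxed);

		pthread_mutex_lock(&ptl);
		pte = 1;			/* set_pte_at() */
		pthread_mutex_unlock(&ptl);	/* =true cannot pass this release */

		/* flush_tlb_range() would run here, then: */
		atomic_store_explicit(&flush_pending, false, memory_order_relaxed);
		return arg;
	}

	/* models mm_tlb_flush_pending() called with the PTL held */
	static void *observer(void *arg)
	{
		pthread_mutex_lock(&ptl);
		/*
		 * If this critical section runs after the changer's, the
		 * mutex acquisition also propagates the pre-lock
		 * flush_pending = true store, so a modified pte is never
		 * seen together with the initial (pre-change) false value.
		 */
		printf("pte=%d flush_pending=%d\n", pte,
		       atomic_load_explicit(&flush_pending, memory_order_relaxed));
		pthread_mutex_unlock(&ptl);
		return arg;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, changer, NULL);
		pthread_create(&b, NULL, observer, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}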

Vineet, Ralf and Dave could you guys please have a look?

Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |   33 +++++++++++++++++++++++++++------
 mm/huge_memory.c         |   20 ++++++++++++++++++++
 mm/migrate.c             |    6 ------
 3 files changed, 47 insertions(+), 12 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -527,23 +527,44 @@ static inline cpumask_t *mm_cpumask(stru
  */
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
-	barrier();
+	/*
+	 * Must be called with PTL held; such that our PTL acquire will have
+	 * observed the store from set_tlb_flush_pending().
+	 */
 	return mm->tlb_flush_pending;
 }
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
 	mm->tlb_flush_pending = true;
-
 	/*
-	 * Guarantee that the tlb_flush_pending store does not leak into the
-	 * critical section updating the page tables
+	 * The only time this value is relevant is when there are indeed pages
+	 * to flush. And we'll only flush pages after changing them, which
+	 * requires the PTL.
+	 *
+	 * So the ordering here is:
+	 *
+	 *	mm->tlb_flush_pending = true;
+	 *	spin_lock(&ptl);
+	 *	...
+	 *	set_pte_at();
+	 *	spin_unlock(&ptl);
+	 *
+	 *				spin_lock(&ptl)
+	 *				mm_tlb_flush_pending();
+	 *				....
+	 *				spin_unlock(&ptl);
+	 *
+	 *	flush_tlb_range();
+	 *	mm->tlb_flush_pending = false;
+	 *
+	 * So the =true store is constrained by the PTL unlock, and the =false
+	 * store is constrained by the TLB invalidate.
 	 */
-	smp_mb__before_spinlock();
 }
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void clear_tlb_flush_pending(struct mm_struct *mm)
 {
-	barrier();
+	/* see set_tlb_flush_pending */
 	mm->tlb_flush_pending = false;
 }
 #else
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,6 +1410,7 @@ int do_huge_pmd_numa_page(struct vm_faul
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid, last_cpupid = -1;
+	bool need_flush = false;
 	bool page_locked;
 	bool migrated = false;
 	bool was_writable;
@@ -1496,10 +1497,29 @@ int do_huge_pmd_numa_page(struct vm_faul
 	}
 
 	/*
+	 * Since we took the NUMA fault, we must have observed the !accessible
+	 * bit. Make sure all other CPUs agree with that, to avoid them
+	 * modifying the page we're about to migrate.
+	 *
+	 * Must be done under PTL such that we'll observe the relevant
+	 * set_tlb_flush_pending().
+	 */
+	if (mm_tlb_flush_pending(vma->vm_mm))
+		need_flush = true;
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and access rights restored.
 	 */
 	spin_unlock(vmf->ptl);
+
+	/*
+	 * We are not sure a pending tlb flush here is for a huge page
+	 * mapping or not. Hence use the tlb range variant
+	 */
+	if (need_flush)
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
 	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
 				vmf->pmd, pmd, vmf->address, page, target_nid);
 	if (migrated) {
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1937,12 +1937,6 @@ int migrate_misplaced_transhuge_page(str
 		put_page(new_page);
 		goto out_fail;
 	}
-	/*
-	 * We are not sure a pending tlb flush here is for a huge page
-	 * mapping or not. Hence use the tlb range variant
-	 */
-	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);

* [PATCH -v2 2/4] overlayfs: Remove smp_mb__before_spinlock() usage
  2017-08-02 11:38 [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
@ 2017-08-02 11:38 ` Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 4/4] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 11:38 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz, Al Viro

[-- Attachment #1: peterz-overlayfs.patch --]
[-- Type: text/plain, Size: 1024 bytes --]

While we could replace the smp_mb__before_spinlock() with the new
smp_mb__after_spinlock(), the normal pattern is to use
smp_store_release() to publish an object that is used for
lockless_dereference() -- which mirrors the regular rcu_assign_pointer()
/ rcu_dereference() pattern.
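
Schematically, the publish / consume pairing referred to here looks like this
(a sketch with made-up names -- struct obj, gbl_obj, create_obj(), use() --
not the actual overlayfs code):

	/* publish side */
	struct obj *p = create_obj();		/* fully initialize the object */
	smp_store_release(&gbl_obj, p);		/* then publish the pointer */

	/* lockless consume side */
	struct obj *q = lockless_dereference(gbl_obj);
	if (q)
		use(q);				/* q's initialization is visible */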

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/overlayfs/readdir.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -446,14 +446,14 @@ static int ovl_dir_fsync(struct file *fi
 
 			ovl_path_upper(dentry, &upperpath);
 			realfile = ovl_path_open(&upperpath, O_RDONLY);
-			smp_mb__before_spinlock();
+
 			inode_lock(inode);
 			if (!od->upperfile) {
 				if (IS_ERR(realfile)) {
 					inode_unlock(inode);
 					return PTR_ERR(realfile);
 				}
-				od->upperfile = realfile;
+				smp_store_release(&od->upperfile, realfile);
 			} else {
 				/* somebody has beaten us to it */
 				if (!IS_ERR(realfile))

* [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock().
  2017-08-02 11:38 [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
  2017-08-02 11:38 ` [PATCH -v2 2/4] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
@ 2017-08-02 11:38 ` Peter Zijlstra
  2017-08-03 15:28   ` Will Deacon
  2017-08-02 11:38 ` [PATCH -v2 4/4] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
  3 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 11:38 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

[-- Attachment #1: peter_zijlstra-question_on_smp_mb__before_spinlock.patch --]
[-- Type: text/plain, Size: 4859 bytes --]

Since its inception, our understanding of ACQUIRE, esp. as applied to
spinlocks, has changed somewhat. Also, I wonder if, with a simple
change, we cannot make it provide more.

The problem with the comment is that the STORE done by spin_lock isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default WMB
insufficient (pointed out by Alan).

Now, this is only really a problem on PowerPC and ARM64, both of
which already defined smp_mb__before_spinlock() as a smp_mb().

At the same time, we can get a much stronger construct if we place
that same barrier _inside_ the spin_lock(). In that case we upgrade
the RCpc spinlock to an RCsc.  That would make all schedule() calls
fully transitive against one another.
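
Schematically, the "STORE before the lock vs. LOAD after it" case this is
about is the wakeup pattern (a sketch of the pattern only, with a hypothetical
'condition' variable; see the try_to_wake_up() hunk below for the real site):

	/* sleeper */
	set_current_state(TASK_UNINTERRUPTIBLE);	/* STORE ->state + smp_mb() */
	if (!condition)
		schedule();

	/* waker */
	condition = true;		/* STORE before taking the lock */
	raw_spin_lock(&p->pi_lock);
	smp_mb__after_spinlock();	/* orders that STORE against ... */
	if (p->state & TASK_NORMAL)	/* ... this LOAD of ->state */
		;			/* proceed to wake the task */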

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/arm64/include/asm/spinlock.h   |    2 ++
 arch/powerpc/include/asm/spinlock.h |    3 +++
 include/linux/atomic.h              |    3 +++
 include/linux/spinlock.h            |   36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 |    4 ++--
 5 files changed, 46 insertions(+), 2 deletions(-)

--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -367,5 +367,7 @@ static inline int arch_read_trylock(arch
  * smp_mb__before_spinlock() can restore the required ordering.
  */
 #define smp_mb__before_spinlock()	smp_mb()
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()	smp_mb()
 
 #endif /* __ASM_SPINLOCK_H */
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -342,5 +342,8 @@ static inline void arch_write_unlock(arc
 #define arch_read_relax(lock)	__rw_yield(lock)
 #define arch_write_relax(lock)	__rw_yield(lock)
 
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()   smp_mb()
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -38,6 +38,9 @@
  * Besides, if an arch has a special barrier for acquire/release, it could
  * implement its own __atomic_op_* and use the same framework for building
  * variants
+ *
+ * If an architecture overrides __atomic_op_acquire() it will probably want
+ * to define smp_mb__after_spinlock().
  */
 #ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)				\
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,6 +130,42 @@ do {								\
 #define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
+/*
+ * This barrier must provide two things:
+ *
+ *   - it must guarantee a STORE before the spin_lock() is ordered against a
+ *     LOAD after it, see the comments at its two usage sites.
+ *
+ *   - it must ensure the critical section is RCsc.
+ *
+ * The latter is important for cases where we observe values written by other
+ * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ *
+ * CPU0			CPU1			CPU2
+ *
+ *			for (;;) {
+ *			  if (READ_ONCE(X))
+ *			    break;
+ *			}
+ * X=1
+ *			<sched-out>
+ *						<sched-in>
+ *						r = X;
+ *
+ * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
+ * we get migrated and CPU2 sees X==0.
+ *
+ * Since most load-store architectures implement ACQUIRE with an smp_mb() after
+ * the LL/SC loop, they need no further barriers. Similarly all our TSO
+ * architectures imply an smp_mb() for each atomic instruction and equally don't
+ * need more.
+ *
+ * Architectures that can implement ACQUIRE better need to take care.
+ */
+#ifndef smp_mb__after_spinlock
+#define smp_mb__after_spinlock()	do { } while (0)
+#endif
+
 /**
  * raw_spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1969,8 +1969,8 @@ try_to_wake_up(struct task_struct *p, un
 	 * reordered with p->state check below. This pairs with mb() in
 	 * set_current_state() the waiting thread does.
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
 	if (!(p->state & state))
 		goto out;
 
@@ -3283,8 +3283,8 @@ static void __sched notrace __schedule(b
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
-	smp_mb__before_spinlock();
 	rq_lock(rq, &rf);
+	smp_mb__after_spinlock();
 
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;

* [PATCH -v2 4/4] locking: Remove smp_mb__before_spinlock()
  2017-08-02 11:38 [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-08-02 11:38 ` [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
@ 2017-08-02 11:38 ` Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 11:38 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

[-- Attachment #1: peterz-remove-smp_mb__before_spinlock.patch --]
[-- Type: text/plain, Size: 5444 bytes --]

Now that there are no users of smp_mb__before_spinlock() left, remove
it entirely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/memory-barriers.txt                    |    5 ---
 Documentation/translations/ko_KR/memory-barriers.txt |    5 ---
 arch/arm64/include/asm/spinlock.h                    |    9 ------
 arch/powerpc/include/asm/barrier.h                   |    7 -----
 fs/userfaultfd.c                                     |   25 ++++++++-----------
 include/linux/spinlock.h                             |   13 ---------
 6 files changed, 13 insertions(+), 51 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1981,10 +1981,7 @@ In all cases there are variants on "ACQU
      ACQUIRE operation has completed.
 
      Memory operations issued before the ACQUIRE may be completed after
-     the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
-     combined with a following ACQUIRE, orders prior stores against
-     subsequent loads and stores.  Note that this is weaker than smp_mb()!
-     The smp_mb__before_spinlock() primitive is free on many architectures.
+     the ACQUIRE operation has completed.
 
  (2) RELEASE operation implication:
 
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -1956,10 +1956,7 @@ MMIO 쓰기 배리어
      뒤에 완료됩니다.
 
      ACQUIRE 앞에서 요청된 메모리 오퍼레이션은 ACQUIRE 오퍼레이션이 완료된 후에
-     완료될 수 있습니다.  smp_mb__before_spinlock() 뒤에 ACQUIRE 가 실행되는
-     코드 블록은 블록 앞의 스토어를 블록 뒤의 로드와 스토어에 대해 순서
-     맞춥니다.  이건 smp_mb() 보다 완화된 것임을 기억하세요!  많은 아키텍쳐에서
-     smp_mb__before_spinlock() 은 사실 아무일도 하지 않습니다.
+     완료될 수 있습니다.
 
  (2) RELEASE 오퍼레이션의 영향:
 
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -358,15 +358,6 @@ static inline int arch_read_trylock(arch
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/*
- * Accesses appearing in program order before a spin_lock() operation
- * can be reordered with accesses inside the critical section, by virtue
- * of arch_spin_lock being constructed using acquire semantics.
- *
- * In cases where this is problematic (e.g. try_to_wake_up), an
- * smp_mb__before_spinlock() can restore the required ordering.
- */
-#define smp_mb__before_spinlock()	smp_mb()
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
 
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -74,13 +74,6 @@ do {									\
 	___p1;								\
 })
 
-/*
- * This must resolve to hwsync on SMP for the context switch path.
- * See _switch, and core scheduler context switch memory ordering
- * comments.
- */
-#define smp_mb__before_spinlock()   smp_mb()
-
 #include <asm-generic/barrier.h>
 
 #endif /* _ASM_POWERPC_BARRIER_H */
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -109,27 +109,24 @@ static int userfaultfd_wake_function(wai
 		goto out;
 	WRITE_ONCE(uwq->waken, true);
 	/*
-	 * The implicit smp_mb__before_spinlock in try_to_wake_up()
-	 * renders uwq->waken visible to other CPUs before the task is
-	 * waken.
+	 * The Program-Order guarantees provided by the scheduler
+	 * ensure uwq->waken is visible before the task is woken.
 	 */
 	ret = wake_up_state(wq->private, mode);
-	if (ret)
+	if (ret) {
 		/*
 		 * Wake only once, autoremove behavior.
 		 *
-		 * After the effect of list_del_init is visible to the
-		 * other CPUs, the waitqueue may disappear from under
-		 * us, see the !list_empty_careful() in
-		 * handle_userfault(). try_to_wake_up() has an
-		 * implicit smp_mb__before_spinlock, and the
-		 * wq->private is read before calling the extern
-		 * function "wake_up_state" (which in turns calls
-		 * try_to_wake_up). While the spin_lock;spin_unlock;
-		 * wouldn't be enough, the smp_mb__before_spinlock is
-		 * enough to avoid an explicit smp_mb() here.
+		 * After the effect of list_del_init is visible to the other
+		 * CPUs, the waitqueue may disappear from under us, see the
+		 * !list_empty_careful() in handle_userfault().
+		 *
+		 * try_to_wake_up() has an implicit smp_mb(), and the
+		 * wq->private is read before calling the extern function
+		 * "wake_up_state" (which in turns calls try_to_wake_up).
 		 */
 		list_del_init(&wq->entry);
+	}
 out:
 	return ret;
 }
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -118,19 +118,6 @@ do {								\
 #endif
 
 /*
- * Despite its name it doesn't necessarily has to be a full barrier.
- * It should only guarantee that a STORE before the critical section
- * can not be reordered with LOADs and STOREs inside this section.
- * spin_lock() is the one-way barrier, this LOAD can not escape out
- * of the region. So the default implementation simply ensures that
- * a STORE can not move into the critical section, smp_wmb() should
- * serialize it with another STORE done by spin_lock().
- */
-#ifndef smp_mb__before_spinlock
-#define smp_mb__before_spinlock()	smp_wmb()
-#endif
-
-/*
  * This barrier must provide two things:
  *
  *   - it must guarantee a STORE before the spin_lock() is ordered against a

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
@ 2017-08-02 13:00   ` Mel Gorman
  2017-08-02 13:05     ` Peter Zijlstra
  2017-08-02 13:00   ` ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()) Vineet Gupta
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2017-08-02 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Rik van Riel

On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
> For this to work we need to rework migrate_misplaced_transhuge_page()
> a little and move the test up into do_huge_pmd_numa_page().
> 

Note that Nadav has a patch in parallel on its way towards Andrew's
tree in which I suggested to always check if a TLB flush is pending
under the PTL. A conflict will happen but will be trivial to resolve.

-- 
Mel Gorman
SUSE Labs

* ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending())
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
  2017-08-02 13:00   ` Mel Gorman
@ 2017-08-02 13:00   ` Vineet Gupta
  2017-08-02 13:17     ` Peter Zijlstra
  2017-08-03 15:27   ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Will Deacon
  2017-08-11  9:45   ` Peter Zijlstra
  3 siblings, 1 reply; 16+ messages in thread
From: Vineet Gupta @ 2017-08-02 13:00 UTC (permalink / raw)
  To: Peter Zijlstra, torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, arcml, David S. Miller, Mel Gorman, Rik van Riel

On 08/02/2017 05:19 PM, Peter Zijlstra wrote:
> Commit:
>
>    af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
>
> added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> can solve the same problem without this barrier.
>
> If instead we mandate that mm_tlb_flush_pending() is used while
> holding the PTL we're guaranteed to observe prior
> set_tlb_flush_pending() instances.
>
> For this to work we need to rework migrate_misplaced_transhuge_page()
> a little and move the test up into do_huge_pmd_numa_page().
>
> NOTE: this relies on flush_tlb_range() to guarantee:
>
>     (1) it ensures that prior page table updates are visible to the
>         page table walker and

ARC doesn't have a hw page walker, so this is not relevant.

>     (2) it ensures that subsequent memory accesses are only made
>         visible after the invalidation has completed

flush_tlb_range() does a bunch of aux register accesses; I need to check with
hw folks whether those can be assumed to be serializing w.r.t. memory ordering.
But if not, then we need to add an explicit smp barrier (which will not be
paired?) and would penalize the other callers of flush_tlb_range(). Will a new
API for this be overkill? Is a memory barrier needed here anyway - like on ARM?


>
> This is required for architectures that implement TRANSPARENT_HUGEPAGE
> (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
> mm_tlb_flush_pending() in their page-table operations (arm, arm64,
> x86).
>
> This appears true for:
>
>   - arm (DSB ISB before and after),
>   - arm64 (DSB ISHST before, and DSB ISH after),
>   - powerpc (PTESYNC before and after),
>   - s390 and x86 TLB invalidate are serializing instructions
>
> But I failed to understand the situation for:
>
>   - arc, mips, sparc
>
> Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
> and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
> inside the PTL. It still needs to guarantee the PTL unlock happens
> _after_ the invalidate completes.
>
> Vineet, Ralf and Dave could you guys please have a look?
>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: Vineet Gupta <vgupta@synopsys.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   include/linux/mm_types.h |   33 +++++++++++++++++++++++++++------
>   mm/huge_memory.c         |   20 ++++++++++++++++++++
>   mm/migrate.c             |    6 ------
>   3 files changed, 47 insertions(+), 12 deletions(-)
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -527,23 +527,44 @@ static inline cpumask_t *mm_cpumask(stru
>    */
>   static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
>   {
> -	barrier();
> +	/*
> +	 * Must be called with PTL held; such that our PTL acquire will have
> +	 * observed the store from set_tlb_flush_pending().
> +	 */
>   	return mm->tlb_flush_pending;
>   }
>   static inline void set_tlb_flush_pending(struct mm_struct *mm)
>   {
>   	mm->tlb_flush_pending = true;
> -
>   	/*
> -	 * Guarantee that the tlb_flush_pending store does not leak into the
> -	 * critical section updating the page tables
> +	 * The only time this value is relevant is when there are indeed pages
> +	 * to flush. And we'll only flush pages after changing them, which
> +	 * requires the PTL.
> +	 *
> +	 * So the ordering here is:
> +	 *
> +	 *	mm->tlb_flush_pending = true;
> +	 *	spin_lock(&ptl);
> +	 *	...
> +	 *	set_pte_at();
> +	 *	spin_unlock(&ptl);
> +	 *
> +	 *				spin_lock(&ptl)
> +	 *				mm_tlb_flush_pending();
> +	 *				....
> +	 *				spin_unlock(&ptl);
> +	 *
> +	 *	flush_tlb_range();
> +	 *	mm->tlb_flush_pending = false;
> +	 *
> +	 * So the =true store is constrained by the PTL unlock, and the =false
> +	 * store is constrained by the TLB invalidate.
>   	 */
> -	smp_mb__before_spinlock();
>   }
>   /* Clearing is done after a TLB flush, which also provides a barrier. */
>   static inline void clear_tlb_flush_pending(struct mm_struct *mm)
>   {
> -	barrier();
> +	/* see set_tlb_flush_pending */
>   	mm->tlb_flush_pending = false;
>   }
>   #else
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1410,6 +1410,7 @@ int do_huge_pmd_numa_page(struct vm_faul
>   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>   	int page_nid = -1, this_nid = numa_node_id();
>   	int target_nid, last_cpupid = -1;
> +	bool need_flush = false;
>   	bool page_locked;
>   	bool migrated = false;
>   	bool was_writable;
> @@ -1496,10 +1497,29 @@ int do_huge_pmd_numa_page(struct vm_faul
>   	}
>   
>   	/*
> +	 * Since we took the NUMA fault, we must have observed the !accessible
> +	 * bit. Make sure all other CPUs agree with that, to avoid them
> +	 * modifying the page we're about to migrate.
> +	 *
> +	 * Must be done under PTL such that we'll observe the relevant
> +	 * set_tlb_flush_pending().
> +	 */
> +	if (mm_tlb_flush_pending(vma->vm_mm))
> +		need_flush = true;
> +
> +	/*
>   	 * Migrate the THP to the requested node, returns with page unlocked
>   	 * and access rights restored.
>   	 */
>   	spin_unlock(vmf->ptl);
> +
> +	/*
> +	 * We are not sure a pending tlb flush here is for a huge page
> +	 * mapping or not. Hence use the tlb range variant
> +	 */
> +	if (need_flush)
> +		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
> +
>   	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
>   				vmf->pmd, pmd, vmf->address, page, target_nid);
>   	if (migrated) {
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1937,12 +1937,6 @@ int migrate_misplaced_transhuge_page(str
>   		put_page(new_page);
>   		goto out_fail;
>   	}
> -	/*
> -	 * We are not sure a pending tlb flush here is for a huge page
> -	 * mapping or not. Hence use the tlb range variant
> -	 */
> -	if (mm_tlb_flush_pending(mm))
> -		flush_tlb_range(vma, mmun_start, mmun_end);
>   
>   	/* Prepare a page as a migration target */
>   	__SetPageLocked(new_page);
>
>
>

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 13:00   ` Mel Gorman
@ 2017-08-02 13:05     ` Peter Zijlstra
  2017-08-02 13:52       ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Rik van Riel

On Wed, Aug 02, 2017 at 02:00:40PM +0100, Mel Gorman wrote:
> On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
> > For this to work we need to rework migrate_misplaced_transhuge_page()
> > a little and move the test up into do_huge_pmd_numa_page().
> > 
> 
> Note that Nadav has a patch in parallel on its way towards Andrew's
> tree in which I suggested to always check if a TLB flush is pending
> under the PTL. A conflict will happen but will be trivial to resolve.

Got a link?

* Re: ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending())
  2017-08-02 13:00   ` ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()) Vineet Gupta
@ 2017-08-02 13:17     ` Peter Zijlstra
  2017-08-11 14:15       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 13:17 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, arcml, David S. Miller, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 06:30:43PM +0530, Vineet Gupta wrote:
> On 08/02/2017 05:19 PM, Peter Zijlstra wrote:
> > Commit:
> > 
> >    af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> > 
> > added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> > can solve the same problem without this barrier.
> > 
> > If instead we mandate that mm_tlb_flush_pending() is used while
> > holding the PTL we're guaranteed to observe prior
> > set_tlb_flush_pending() instances.
> > 
> > For this to work we need to rework migrate_misplaced_transhuge_page()
> > a little and move the test up into do_huge_pmd_numa_page().
> > 
> > NOTE: this relies on flush_tlb_range() to guarantee:
> > 
> >     (1) it ensures that prior page table updates are visible to the
> >         page table walker and
> 
> ARC doesn't have a hw page walker, so this is not relevant.

Well, you still want your software walker to observe the new PTE entries
before you start shooting down the old ones. So I would expect at least
an smp_wmb() before the TLB invalidate to order against another CPU
doing a software TLB fill, such that the other CPU will indeed observe
the new PTE after it has observed the TLB missing.

> >     (2) it ensures that subsequent memory accesses are only made
> >         visible after the invalidation has completed
> 
> flush_tlb_range() does a bunch of aux register accesses; I need to check
> with hw folks whether those can be assumed to be serializing w.r.t. memory
> ordering. But if not, then we need to add an explicit smp barrier (which
> will not be paired?)

It would pair with the ACQUIRE from the PTL in the below example.

> and would penalize the other callers of flush_tlb_range().
> Will a new API for this be overkill? Is a memory barrier needed here
> anyway - like on ARM?

It is needed at the very least if you do transparent huge pages as per
the existing logic (this requirement isn't new per this patch, I was
just the silly person wondering if flush_tlb_range() does indeed provide
the ordering assumed).

But yes, lots of architectures do provide this ordering already and
some, like ARM and PPC, do so with quite expensive barriers.

To me it's also a natural / expected ordering, but that could just be
me :-)

> >   	/*
> > +	 * The only time this value is relevant is when there are indeed pages
> > +	 * to flush. And we'll only flush pages after changing them, which
> > +	 * requires the PTL.
> > +	 *
> > +	 * So the ordering here is:
> > +	 *
> > +	 *	mm->tlb_flush_pending = true;
> > +	 *	spin_lock(&ptl);
> > +	 *	...
> > +	 *	set_pte_at();
> > +	 *	spin_unlock(&ptl);
> > +	 *
> > +	 *				spin_lock(&ptl)
> > +	 *				mm_tlb_flush_pending();
> > +	 *				....
> > +	 *				spin_unlock(&ptl);
> > +	 *
> > +	 *	flush_tlb_range();
> > +	 *	mm->tlb_flush_pending = false;
> > +	 *
> > +	 * So the =true store is constrained by the PTL unlock, and the =false
> > +	 * store is constrained by the TLB invalidate.
> >   	 */
> >   }
> >   /* Clearing is done after a TLB flush, which also provides a barrier. */

See, not a new requirement. I only mucked with the ordering for
setting it; clearing already relied on the flush_tlb_range().

> >   static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> >   {
> > +	/* see set_tlb_flush_pending */
> >   	mm->tlb_flush_pending = false;
> >   }
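
Putting the two requirements together, what is being asked of an
architecture's flush_tlb_range() has roughly the following shape (purely
illustrative pseudo-implementation, not actual ARC code;
arch_invalidate_tlb_entries() is a made-up helper, and whether the aux
register accesses already imply these barriers is exactly the open question):

	void flush_tlb_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
	{
		/*
		 * (1) make prior PTE updates visible to other CPUs'
		 * (possibly software) page-table walkers before their
		 * stale TLB entries are shot down.
		 */
		smp_wmb();

		arch_invalidate_tlb_entries(vma, start, end);

		/*
		 * (2) order completion of the invalidate before subsequent
		 * memory accesses, e.g. the PTL unlock; this is what pairs
		 * with the PTL ACQUIRE on the mm_tlb_flush_pending() side.
		 */
		smp_mb();
	}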

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 13:05     ` Peter Zijlstra
@ 2017-08-02 13:52       ` Mel Gorman
  2017-08-02 14:16         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2017-08-02 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Rik van Riel

On Wed, Aug 02, 2017 at 03:05:50PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 02:00:40PM +0100, Mel Gorman wrote:
> > On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
> > > For this to work we need to rework migrate_misplaced_transhuge_page()
> > > a little and move the test up into do_huge_pmd_numa_page().
> > > 
> > 
> > > Note that Nadav has a patch in parallel on its way towards Andrew's
> > > tree in which I suggested to always check if a TLB flush is pending
> > > under the PTL. A conflict will happen but will be trivial to resolve.
> 
> Got a link?

Subject: Revert "mm: numa: defer TLB flush for THP migration as long as possible"

http://lkml.kernel.org/r/20170731164325.235019-4-namit@vmware.com


-- 
Mel Gorman
SUSE Labs

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 13:52       ` Mel Gorman
@ 2017-08-02 14:16         ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-02 14:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Rik van Riel

On Wed, Aug 02, 2017 at 02:52:03PM +0100, Mel Gorman wrote:
> On Wed, Aug 02, 2017 at 03:05:50PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 02:00:40PM +0100, Mel Gorman wrote:
> > > On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
> > > > For this to work we need to rework migrate_misplaced_transhuge_page()
> > > > a little and move the test up into do_huge_pmd_numa_page().
> > > > 
> > > 
> > > Note that Nadav has a patch in parallel on its way towards Andrew's
> > > tree in which I suggested to always check if a TLB flush is pending
> > > under the PTL. A conflict will happen but will be trivial to resolve.
> > 
> > Got a link?
> 
> Subject: Revert "mm: numa: defer TLB flush for THP migration as long as possible"
> 
> http://lkml.kernel.org/r/20170731164325.235019-4-namit@vmware.com
> 

Thanks, yes that should be easy to resolve.

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
  2017-08-02 13:00   ` Mel Gorman
  2017-08-02 13:00   ` ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()) Vineet Gupta
@ 2017-08-03 15:27   ` Will Deacon
  2017-08-11  9:45   ` Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Will Deacon @ 2017-08-03 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo,
	stern, Russell King, Heiko Carstens, Ralf Baechle, Vineet Gupta,
	David S. Miller, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
> Commit:
> 
>   af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> 
> added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> can solve the same problem without this barrier.
> 
> If instead we mandate that mm_tlb_flush_pending() is used while
> holding the PTL we're guaranteed to observe prior
> set_tlb_flush_pending() instances.
> 
> For this to work we need to rework migrate_misplaced_transhuge_page()
> a little and move the test up into do_huge_pmd_numa_page().
> 
> NOTE: this relies on flush_tlb_range() to guarantee:
> 
>    (1) it ensures that prior page table updates are visible to the
>        page table walker and
>    (2) it ensures that subsequent memory accesses are only made
>        visible after the invalidation has completed

Works for me:

Acked-by: Will Deacon <will.deacon@arm.com>

Will

* Re: [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock().
  2017-08-02 11:38 ` [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
@ 2017-08-03 15:28   ` Will Deacon
  2017-08-03 15:43     ` Nicholas Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2017-08-03 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo, stern

On Wed, Aug 02, 2017 at 01:38:40PM +0200, Peter Zijlstra wrote:
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, both of
> which already defined smp_mb__before_spinlock() as a smp_mb().
> 
> At the same time, we can get a much stronger construct if we place
> that same barrier _inside_ the spin_lock(). In that case we upgrade
> the RCpc spinlock to an RCsc.  That would make all schedule() calls
> fully transitive against one another.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/arm64/include/asm/spinlock.h   |    2 ++
>  arch/powerpc/include/asm/spinlock.h |    3 +++
>  include/linux/atomic.h              |    3 +++
>  include/linux/spinlock.h            |   36 ++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c                 |    4 ++--
>  5 files changed, 46 insertions(+), 2 deletions(-)
> 
> --- a/arch/arm64/include/asm/spinlock.h
> +++ b/arch/arm64/include/asm/spinlock.h
> @@ -367,5 +367,7 @@ static inline int arch_read_trylock(arch
>   * smp_mb__before_spinlock() can restore the required ordering.
>   */
>  #define smp_mb__before_spinlock()	smp_mb()
> +/* See include/linux/spinlock.h */
> +#define smp_mb__after_spinlock()	smp_mb()
>  
>  #endif /* __ASM_SPINLOCK_H */

Acked-by: Will Deacon <will.deacon@arm.com>

Will

* Re: [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock().
  2017-08-03 15:28   ` Will Deacon
@ 2017-08-03 15:43     ` Nicholas Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicholas Piggin @ 2017-08-03 15:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, torvalds, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern

On Thu, 3 Aug 2017 16:28:20 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Wed, Aug 02, 2017 at 01:38:40PM +0200, Peter Zijlstra wrote:
> > Since its inception, our understanding of ACQUIRE, esp. as applied to
> > spinlocks, has changed somewhat. Also, I wonder if, with a simple
> > change, we cannot make it provide more.
> > 
> > The problem with the comment is that the STORE done by spin_lock isn't
> > itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> > it and cross with any prior STORE, rendering the default WMB
> > insufficient (pointed out by Alan).
> > 
> > Now, this is only really a problem on PowerPC and ARM64, both of
> > which already defined smp_mb__before_spinlock() as a smp_mb().
> > 
> > At the same time, we can get a much stronger construct if we place
> > that same barrier _inside_ the spin_lock(). In that case we upgrade
> > the RCpc spinlock to an RCsc.  That would make all schedule() calls
> > fully transitive against one another.
> > 
> > Cc: Alan Stern <stern@rowland.harvard.edu>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/arm64/include/asm/spinlock.h   |    2 ++
> >  arch/powerpc/include/asm/spinlock.h |    3 +++
> >  include/linux/atomic.h              |    3 +++
> >  include/linux/spinlock.h            |   36 ++++++++++++++++++++++++++++++++++++
> >  kernel/sched/core.c                 |    4 ++--
> >  5 files changed, 46 insertions(+), 2 deletions(-)
> > 
> > --- a/arch/arm64/include/asm/spinlock.h
> > +++ b/arch/arm64/include/asm/spinlock.h
> > @@ -367,5 +367,7 @@ static inline int arch_read_trylock(arch
> >   * smp_mb__before_spinlock() can restore the required ordering.
> >   */
> >  #define smp_mb__before_spinlock()	smp_mb()
> > +/* See include/linux/spinlock.h */
> > +#define smp_mb__after_spinlock()	smp_mb()
> >  
> >  #endif /* __ASM_SPINLOCK_H */  
> 
> Acked-by: Will Deacon <will.deacon@arm.com>

Yeah this looks good to me. I don't think there would ever be a reason
to use smp_mb__before_spinlock() rather than smp_mb__after_spinlock().

* Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
                     ` (2 preceding siblings ...)
  2017-08-03 15:27   ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Will Deacon
@ 2017-08-11  9:45   ` Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-11  9:45 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, Vineet Gupta, David S. Miller, Mel Gorman,
	Rik van Riel

On Wed, Aug 02, 2017 at 01:38:38PM +0200, Peter Zijlstra wrote:
>  	/*
> +	 * The only time this value is relevant is when there are indeed pages
> +	 * to flush. And we'll only flush pages after changing them, which
> +	 * requires the PTL.
> +	 *
> +	 * So the ordering here is:
> +	 *
> +	 *	mm->tlb_flush_pending = true;
> +	 *	spin_lock(&ptl);
> +	 *	...
> +	 *	set_pte_at();
> +	 *	spin_unlock(&ptl);
> +	 *
> +	 *				spin_lock(&ptl)
> +	 *				mm_tlb_flush_pending();
> +	 *				....

Crud, so while I was rebasing Nadav's patches I realized that this does
not in fact work for PPC and split PTL, because the PPC lwsync relies on
the address dependency to actually produce the ordering.

Also, since Nadav switched to atomic_inc/atomic_dec, I'll send a patch
to add smp_mb__after_atomic(), and

> +	 *				spin_unlock(&ptl);
> +	 *
> +	 *	flush_tlb_range();
> +	 *	mm->tlb_flush_pending = false;
> +	 *
> +	 * So the =true store is constrained by the PTL unlock, and the =false
> +	 * store is constrained by the TLB invalidate.
>  	 */
>  }
>  /* Clearing is done after a TLB flush, which also provides a barrier. */
>  static inline void clear_tlb_flush_pending(struct mm_struct *mm)
>  {
> +	/* see set_tlb_flush_pending */

smp_mb__before_atomic() here. That also avoids the whole reliance on the
tlb_flush nonsense.

It will overstuff on barriers on some platforms though :/

>  	mm->tlb_flush_pending = false;
>  }
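
Taken together, the follow-up described here would turn the helpers into
something like the following (a sketch based on this mail, assuming
mm->tlb_flush_pending has become an atomic_t as per Nadav's series; the
actual patch may differ):

	static inline void set_tlb_flush_pending(struct mm_struct *mm)
	{
		atomic_inc(&mm->tlb_flush_pending);
		/*
		 * Order the increment before the subsequent page-table
		 * updates, without depending on the PTL acquire (which,
		 * with split PTLs on PPC, is not sufficient).
		 */
		smp_mb__after_atomic();
	}

	static inline void clear_tlb_flush_pending(struct mm_struct *mm)
	{
		/*
		 * Order the decrement after the TLB invalidate, without
		 * relying on flush_tlb_range() to provide that barrier.
		 */
		smp_mb__before_atomic();
		atomic_dec(&mm->tlb_flush_pending);
	}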

* Re: ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending())
  2017-08-02 13:17     ` Peter Zijlstra
@ 2017-08-11 14:15       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-11 14:15 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin,
	linux-kernel, mingo, stern, Russell King, Heiko Carstens,
	Ralf Baechle, arcml, David S. Miller, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 03:17:10PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 06:30:43PM +0530, Vineet Gupta wrote:
> > flush_tlb_range() does a bunch of aux register accesses; I need to check
> > with hw folks whether those can be assumed to be serializing w.r.t. memory
> > ordering. But if not, then we need to add an explicit smp barrier (which
> > will not be paired?)
> 
> It would pair with the ACQUIRE from the PTL in the below example.
> 
> > and would penalize the other callers of flush_tlb_range().
> > Will a new API for this be overkill? Is a memory barrier needed here
> > anyway - like on ARM?
> 
> It is needed at the very least if you do transparent huge pages as per
> the existing logic (this requirement isn't new per this patch, I was
> just the silly person wondering if flush_tlb_range() does indeed provide
> the ordering assumed).

Any word on this? It just got way worse, and anything SMP needs to
provide this.

See commit:

  0a2dd266dd6b ("mm: make tlb_flush_pending global")

And these semantics are now required for the correct operation of KSM
and MADV_{FREE,DONT_NEED}.

Thread overview: 16+ messages
2017-08-02 11:38 [PATCH -v2 0/4] Getting rid of smp_mb__before_spinlock Peter Zijlstra
2017-08-02 11:38 ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
2017-08-02 13:00   ` Mel Gorman
2017-08-02 13:05     ` Peter Zijlstra
2017-08-02 13:52       ` Mel Gorman
2017-08-02 14:16         ` Peter Zijlstra
2017-08-02 13:00   ` ARC stuff (was Re: [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending()) Vineet Gupta
2017-08-02 13:17     ` Peter Zijlstra
2017-08-11 14:15       ` Peter Zijlstra
2017-08-03 15:27   ` [PATCH -v2 1/4] mm: Rework {set,clear,mm}_tlb_flush_pending() Will Deacon
2017-08-11  9:45   ` Peter Zijlstra
2017-08-02 11:38 ` [PATCH -v2 2/4] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
2017-08-02 11:38 ` [PATCH -v2 3/4] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
2017-08-03 15:28   ` Will Deacon
2017-08-03 15:43     ` Nicholas Piggin
2017-08-02 11:38 ` [PATCH -v2 4/4] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
