* [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock
@ 2017-06-07 16:15 Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

There was a thread on this somewhere about a year ago; I finally
remembered to finish it :-)

This series removes two smp_mb__before_spinlock() (ab)users and converts the
scheduler to use smp_mb__after_spinlock(), which provides stronger guarantees
with the same number of barriers.
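
Roughly, the conversion looks like this (an illustrative sketch only;
the actual call sites are try_to_wake_up() and __schedule(), see patch 2):

	/* old */
	smp_mb__before_spinlock();
	raw_spin_lock_irqsave(&p->pi_lock, flags);

	/* new */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();

Same number of barriers, but because the full barrier now sits inside
the critical section, the lock section becomes RCsc (see patch 2).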

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
@ 2017-06-07 16:15 ` Peter Zijlstra
  2017-06-09 14:45   ` Will Deacon
  2017-06-07 16:15 ` [RFC][PATCH 2/5] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz, Mel Gorman, Rik van Riel

[-- Attachment #1: peterz-mm_tlb_flush_pending.patch --]
[-- Type: text/plain, Size: 3126 bytes --]

Commit:

  af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")

added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
can solve the same problem without this barrier.

If instead we mandate that mm_tlb_flush_pending() is used while
holding the PTL we're guaranteed to observe prior
set_tlb_flush_pending() instances.

For this to work we need to rework migrate_misplaced_transhuge_page()
a little and move the test up into do_huge_pmd_numa_page().
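
As a sketch of the intended ordering (variable names are placeholders;
the reader side follows the do_huge_pmd_numa_page() hunk below, and the
whole thing assumes, as this patch mandates, that the flag is only
tested with the PTL held):

	/* side that makes PTEs inaccessible */
	set_tlb_flush_pending(mm);
	spin_lock(ptl);
	set_pte_at(mm, addr, ptep, entry);
	spin_unlock(ptl);
	flush_tlb_range(vma, start, end);
	clear_tlb_flush_pending(mm);

	/* side that inspects the flag, e.g. do_huge_pmd_numa_page() */
	spin_lock(ptl);
	if (mm_tlb_flush_pending(mm))	/* PTL acquire observed the store */
		need_flush = true;
	spin_unlock(ptl);
	if (need_flush)
		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);

The PTL release/acquire pair is what transports the tlb_flush_pending
store to the reader, making the smp_mb__before_spinlock() redundant.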

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -527,18 +527,16 @@ static inline cpumask_t *mm_cpumask(stru
  */
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
-	barrier();
+	/*
+	 * Must be called with PTL held; such that our PTL acquire will have
+	 * observed the store from set_tlb_flush_pending().
+	 */
 	return mm->tlb_flush_pending;
 }
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
 	mm->tlb_flush_pending = true;
-
-	/*
-	 * Guarantee that the tlb_flush_pending store does not leak into the
-	 * critical section updating the page tables
-	 */
-	smp_mb__before_spinlock();
+	barrier();
 }
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void clear_tlb_flush_pending(struct mm_struct *mm)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,6 +1410,7 @@ int do_huge_pmd_numa_page(struct vm_faul
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid, last_cpupid = -1;
+	bool need_flush = false;
 	bool page_locked;
 	bool migrated = false;
 	bool was_writable;
@@ -1490,10 +1491,29 @@ int do_huge_pmd_numa_page(struct vm_faul
 	}
 
 	/*
+	 * Since we took the NUMA fault, we must have observed the !accessible
+	 * bit. Make sure all other CPUs agree with that, to avoid them
+	 * modifying the page we're about to migrate.
+	 *
+	 * Must be done under PTL such that we'll observe the relevant
+	 * set_tlb_flush_pending().
+	 */
+	if (mm_tlb_flush_pending(mm))
+		need_flush = true;
+
+	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and access rights restored.
 	 */
 	spin_unlock(vmf->ptl);
+
+	/*
+	 * We are not sure a pending tlb flush here is for a huge page
+	 * mapping or not. Hence use the tlb range variant
+	 */
+	if (need_flush)
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
 	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
 				vmf->pmd, pmd, vmf->address, page, target_nid);
 	if (migrated) {
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1935,12 +1935,6 @@ int migrate_misplaced_transhuge_page(str
 		put_page(new_page);
 		goto out_fail;
 	}
-	/*
-	 * We are not sure a pending tlb flush here is for a huge page
-	 * mapping or not. Hence use the tlb range variant
-	 */
-	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][PATCH 2/5] locking: Introduce smp_mb__after_spinlock().
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
@ 2017-06-07 16:15 ` Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 3/5] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

[-- Attachment #1: peter_zijlstra-question_on_smp_mb__before_spinlock.patch --]
[-- Type: text/plain, Size: 4368 bytes --]

Since its inception, our understanding of ACQUIRE, especially as
applied to spinlocks, has changed somewhat. Also, I wonder whether,
with a simple change, we cannot make it provide more.

The problem with the existing smp_mb__before_spinlock() comment is that
the STORE done by spin_lock isn't itself ordered by the ACQUIRE, and
therefore a later LOAD can pass over it and cross with any prior STORE,
rendering the default WMB insufficient (pointed out by Alan).

Now, this is only really a problem on PowerPC and ARM64, both of
which already define smp_mb__before_spinlock() as smp_mb().

At the same time, we can get a much stronger construct if we place
that same barrier _inside_ the spin_lock(). In that case we upgrade
the RCpc spinlock to an RCsc one. That would make all schedule()
calls fully transitive against one another.
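
Concretely, for the try_to_wake_up() case (a sketch; CONDITION is a
stand-in for whatever state the waiter sleeps on):

	/* waiter */
	set_current_state(TASK_UNINTERRUPTIBLE);	/* implies smp_mb() */
	if (!CONDITION)
		schedule();

	/* waker */
	CONDITION = true;
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();
	if (!(p->state & state))
		goto out;

Without the barrier inside the lock, the p->state load on the waker
side can be satisfied before the CONDITION store becomes visible, and
the wakeup can be lost.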

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/arm64/include/asm/spinlock.h   |    2 ++
 arch/powerpc/include/asm/spinlock.h |    3 +++
 include/linux/spinlock.h            |   36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 |    4 ++--
 4 files changed, 43 insertions(+), 2 deletions(-)

--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -367,5 +367,7 @@ static inline int arch_read_trylock(arch
  * smp_mb__before_spinlock() can restore the required ordering.
  */
 #define smp_mb__before_spinlock()	smp_mb()
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()	smp_mb()
 
 #endif /* __ASM_SPINLOCK_H */
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -342,5 +342,8 @@ static inline void arch_write_unlock(arc
 #define arch_read_relax(lock)	__rw_yield(lock)
 #define arch_write_relax(lock)	__rw_yield(lock)
 
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()   smp_mb()
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,6 +130,42 @@ do {								\
 #define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
+/*
+ * This barrier must provide two things:
+ *
+ *   - it must guarantee a STORE before the spin_lock() is ordered against a
+ *     LOAD after it, see the comments at its two usage sites.
+ *
+ *   - it must ensure the critical section is RCsc.
+ *
+ * The latter is important for cases where we observe values written by other
+ * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ *
+ * CPU0			CPU1			CPU2
+ *
+ *			for (;;) {
+ *			  if (READ_ONCE(X))
+ *			    break;
+ *			}
+ * X=1
+ *			<sched-out>
+ *						<sched-in>
+ *						r = X;
+ *
+ * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
+ * we get migrated and CPU2 sees X==0.
+ *
+ * Since most load-store architectures implement ACQUIRE with an smp_mb() after
+ * the LL/SC loop, they need no further barriers. Similarly all our TSO
+ * architectures imply an smp_mb() for each atomic instruction and equally don't
+ * need more.
+ *
+ * Architectures that can implement ACQUIRE better need to take care.
+ */
+#ifndef smp_mb__after_spinlock
+#define smp_mb__after_spinlock()	do { } while (0)
+#endif
+
 /**
  * raw_spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1969,8 +1969,8 @@ try_to_wake_up(struct task_struct *p, un
 	 * reordered with p->state check below. This pairs with mb() in
 	 * set_current_state() the waiting thread does.
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
 	if (!(p->state & state))
 		goto out;
 
@@ -3383,8 +3383,8 @@ static void __sched notrace __schedule(b
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
-	smp_mb__before_spinlock();
 	rq_lock(rq, &rf);
+	smp_mb__after_spinlock();
 
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][PATCH 3/5] overlayfs: Remove smp_mb__before_spinlock() usage
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 2/5] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
@ 2017-06-07 16:15 ` Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 4/5] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz, Al Viro

[-- Attachment #1: peterz-overlayfs.patch --]
[-- Type: text/plain, Size: 1024 bytes --]

While we could replace the smp_mb__before_spinlock() with the new
smp_mb__after_spinlock(), the normal pattern is to use
smp_store_release() to publish an object that is used for
lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
/ rcu_dereference() patterns.
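
For reference, a sketch of the resulting pairing (the reader side shown
here is illustrative; the point is that the release store pairs with
the lockless_dereference() of od->upperfile):

	/* publish, under inode_lock(inode) */
	smp_store_release(&od->upperfile, realfile);

	/* lockless reader */
	realfile = lockless_dereference(od->upperfile);
	if (realfile)
		vfs_fsync(realfile, datasync);

The release store guarantees the file's initialisation is visible
before the pointer is.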

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/overlayfs/readdir.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -446,14 +446,14 @@ static int ovl_dir_fsync(struct file *fi
 
 			ovl_path_upper(dentry, &upperpath);
 			realfile = ovl_path_open(&upperpath, O_RDONLY);
-			smp_mb__before_spinlock();
+
 			inode_lock(inode);
 			if (!od->upperfile) {
 				if (IS_ERR(realfile)) {
 					inode_unlock(inode);
 					return PTR_ERR(realfile);
 				}
-				od->upperfile = realfile;
+				smp_store_release(&od->upperfile, realfile);
 			} else {
 				/* somebody has beaten us to it */
 				if (!IS_ERR(realfile))

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][PATCH 4/5] locking: Remove smp_mb__before_spinlock()
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-06-07 16:15 ` [RFC][PATCH 3/5] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
@ 2017-06-07 16:15 ` Peter Zijlstra
  2017-06-07 16:15 ` [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch Peter Zijlstra
  2017-06-09 14:49 ` [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Will Deacon
  5 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

[-- Attachment #1: peterz-remove-smp_mb__before_spinlock.patch --]
[-- Type: text/plain, Size: 5281 bytes --]

Now that there are no users of smp_mb__before_spinlock() left, remove
it entirely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/memory-barriers.txt                    |    5 ---
 Documentation/translations/ko_KR/memory-barriers.txt |    5 ---
 arch/arm64/include/asm/spinlock.h                    |    9 ------
 arch/powerpc/include/asm/barrier.h                   |    2 -
 fs/userfaultfd.c                                     |   25 ++++++++-----------
 include/linux/spinlock.h                             |   13 ---------
 6 files changed, 13 insertions(+), 46 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1982,10 +1982,7 @@ In all cases there are variants on "ACQU
      ACQUIRE operation has completed.
 
      Memory operations issued before the ACQUIRE may be completed after
-     the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
-     combined with a following ACQUIRE, orders prior stores against
-     subsequent loads and stores.  Note that this is weaker than smp_mb()!
-     The smp_mb__before_spinlock() primitive is free on many architectures.
+     the ACQUIRE operation has completed.
 
  (2) RELEASE operation implication:
 
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -1956,10 +1956,7 @@ MMIO 쓰기 배리어
      뒤에 완료됩니다.
 
      ACQUIRE 앞에서 요청된 메모리 오퍼레이션은 ACQUIRE 오퍼레이션이 완료된 후에
-     완료될 수 있습니다.  smp_mb__before_spinlock() 뒤에 ACQUIRE 가 실행되는
-     코드 블록은 블록 앞의 스토어를 블록 뒤의 로드와 스토어에 대해 순서
-     맞춥니다.  이건 smp_mb() 보다 완화된 것임을 기억하세요!  많은 아키텍쳐에서
-     smp_mb__before_spinlock() 은 사실 아무일도 하지 않습니다.
+     완료될 수 있습니다.
 
  (2) RELEASE 오퍼레이션의 영향:
 
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -358,15 +358,6 @@ static inline int arch_read_trylock(arch
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/*
- * Accesses appearing in program order before a spin_lock() operation
- * can be reordered with accesses inside the critical section, by virtue
- * of arch_spin_lock being constructed using acquire semantics.
- *
- * In cases where this is problematic (e.g. try_to_wake_up), an
- * smp_mb__before_spinlock() can restore the required ordering.
- */
-#define smp_mb__before_spinlock()	smp_mb()
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
 
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -74,8 +74,6 @@ do {									\
 	___p1;								\
 })
 
-#define smp_mb__before_spinlock()   smp_mb()
-
 #include <asm-generic/barrier.h>
 
 #endif /* _ASM_POWERPC_BARRIER_H */
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -109,27 +109,24 @@ static int userfaultfd_wake_function(wai
 		goto out;
 	WRITE_ONCE(uwq->waken, true);
 	/*
-	 * The implicit smp_mb__before_spinlock in try_to_wake_up()
-	 * renders uwq->waken visible to other CPUs before the task is
-	 * waken.
+	 * The Program-Order guarantees provided by the scheduler
+	 * ensure uwq->waken is visible before the task is woken.
 	 */
 	ret = wake_up_state(wq->private, mode);
-	if (ret)
+	if (ret) {
 		/*
 		 * Wake only once, autoremove behavior.
 		 *
-		 * After the effect of list_del_init is visible to the
-		 * other CPUs, the waitqueue may disappear from under
-		 * us, see the !list_empty_careful() in
-		 * handle_userfault(). try_to_wake_up() has an
-		 * implicit smp_mb__before_spinlock, and the
-		 * wq->private is read before calling the extern
-		 * function "wake_up_state" (which in turns calls
-		 * try_to_wake_up). While the spin_lock;spin_unlock;
-		 * wouldn't be enough, the smp_mb__before_spinlock is
-		 * enough to avoid an explicit smp_mb() here.
+		 * After the effect of list_del_init is visible to the other
+		 * CPUs, the waitqueue may disappear from under us, see the
+		 * !list_empty_careful() in handle_userfault().
+		 *
+		 * try_to_wake_up() has an implicit smp_mb(), and the
+		 * wq->private is read before calling the extern function
+		 * "wake_up_state" (which in turns calls try_to_wake_up).
 		 */
 		list_del_init(&wq->entry);
+	}
 out:
 	return ret;
 }
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -118,19 +118,6 @@ do {								\
 #endif
 
 /*
- * Despite its name it doesn't necessarily has to be a full barrier.
- * It should only guarantee that a STORE before the critical section
- * can not be reordered with LOADs and STOREs inside this section.
- * spin_lock() is the one-way barrier, this LOAD can not escape out
- * of the region. So the default implementation simply ensures that
- * a STORE can not move into the critical section, smp_wmb() should
- * serialize it with another STORE done by spin_lock().
- */
-#ifndef smp_mb__before_spinlock
-#define smp_mb__before_spinlock()	smp_wmb()
-#endif
-
-/*
  * This barrier must provide two things:
  *
  *   - it must guarantee a STORE before the spin_lock() is ordered against a

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-06-07 16:15 ` [RFC][PATCH 4/5] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
@ 2017-06-07 16:15 ` Peter Zijlstra
  2017-06-08  0:32   ` Nicholas Piggin
  2017-06-09 14:49 ` [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Will Deacon
  5 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-07 16:15 UTC (permalink / raw)
  To: torvalds, will.deacon, oleg, paulmck, benh, mpe, npiggin
  Cc: linux-kernel, mingo, stern, peterz

[-- Attachment #1: peterz-powerpc-switch-to.patch --]
[-- Type: text/plain, Size: 944 bytes --]

Now that the scheduler's rq->lock is RCsc and thus provides full
transitivity between scheduling actions. And since we cannot migrate
current, a task needs a switch-out and a switch-in in order to
migrate, in which case the RCsc provides all the ordering we need.
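
A very rough sketch of the argument: to migrate, a task must be
switched out on its old CPU and switched in on the new one, and both
paths now go through:

	rq_lock(rq, &rf);		/* takes rq->lock */
	smp_mb__after_spinlock();
	...
	context_switch(...);

so every schedule() executes a full barrier inside the rq->lock
section, and the RCsc transitivity of those sections (see the "Notes
on Program-Order guarantees on SMP systems." comment) carries the old
CPU's stores to wherever the task runs next, which is what the SYNC
in _switch used to provide.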

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/kernel/entry_64.S |    8 --------
 1 file changed, 8 deletions(-)

--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -488,14 +488,6 @@ _GLOBAL(_switch)
 	std	r23,_CCR(r1)
 	std	r1,KSP(r3)	/* Set old stack pointer */
 
-#ifdef CONFIG_SMP
-	/* We need a sync somewhere here to make sure that if the
-	 * previous task gets rescheduled on another CPU, it sees all
-	 * stores it has performed on this one.
-	 */
-	sync
-#endif /* CONFIG_SMP */
-
 	/*
 	 * If we optimise away the clear of the reservation in system
 	 * calls because we know the CPU tracks the address of the

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-07 16:15 ` [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch Peter Zijlstra
@ 2017-06-08  0:32   ` Nicholas Piggin
  2017-06-08  6:54     ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nicholas Piggin @ 2017-06-08  0:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern

On Wed, 07 Jun 2017 18:15:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Now that the scheduler's rq->lock is RCsc and thus provides full
> transitivity between scheduling actions. And since we cannot migrate
> current, a task needs a switch-out and a switch-in in order to
> migrate, in which case the RCsc provides all the ordering we need.

Hi Peter,

I'm actually just working on removing this right now too, so
good timing.

I think we can't "just" remove it, because it is required to order
MMIO on powerpc as well.

But what I have done is to comment that some other primitives are
already providing the hwsync for other, so we don't have to add
another one in _switch.

Thanks,
Nick

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/powerpc/kernel/entry_64.S |    8 --------
>  1 file changed, 8 deletions(-)
> 
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -488,14 +488,6 @@ _GLOBAL(_switch)
>  	std	r23,_CCR(r1)
>  	std	r1,KSP(r3)	/* Set old stack pointer */
>  
> -#ifdef CONFIG_SMP
> -	/* We need a sync somewhere here to make sure that if the
> -	 * previous task gets rescheduled on another CPU, it sees all
> -	 * stores it has performed on this one.
> -	 */
> -	sync
> -#endif /* CONFIG_SMP */
> -
>  	/*
>  	 * If we optimise away the clear of the reservation in system
>  	 * calls because we know the CPU tracks the address of the
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  0:32   ` Nicholas Piggin
@ 2017-06-08  6:54     ` Peter Zijlstra
  2017-06-08  7:29       ` Nicholas Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-08  6:54 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern

On Thu, Jun 08, 2017 at 10:32:44AM +1000, Nicholas Piggin wrote:
> On Wed, 07 Jun 2017 18:15:06 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Now that the scheduler's rq->lock is RCsc and thus provides full
> > transitivity between scheduling actions. And since we cannot migrate
> > current, a task needs a switch-out and a switch-in in order to
> > migrate, in which case the RCsc provides all the ordering we need.
> 
> Hi Peter,
> 
> I'm actually just working on removing this right now too, so
> good timing.
> 
> I think we can't "just" remove it, because it is required to order
> MMIO on powerpc as well.

How is MMIO special? That is, there is only MMIO before we call into
schedule() right? So the rq->lock should be sufficient to order that
too.

> 
> But what I have done is to comment that some other primitives are
> already providing the hwsync for other, so we don't have to add
> another one in _switch.

Right, so this patch relies on the smp_mb__before_spinlock ->
smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
should thus provide the required SYNC for migrations.

That said, I think you can already use the smp_mb__before_spinlock() as
that is done with IRQs disabled, but its a more difficult argument. The
rq->lock RCsc property should be more obvious.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  6:54     ` Peter Zijlstra
@ 2017-06-08  7:29       ` Nicholas Piggin
  2017-06-08  7:57         ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nicholas Piggin @ 2017-06-08  7:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern, linuxppc-dev

On Thu, 8 Jun 2017 08:54:00 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 08, 2017 at 10:32:44AM +1000, Nicholas Piggin wrote:
> > On Wed, 07 Jun 2017 18:15:06 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > Now that the scheduler's rq->lock is RCsc and thus provides full
> > > transitivity between scheduling actions. And since we cannot migrate
> > > current, a task needs a switch-out and a switch-in in order to
> > > migrate, in which case the RCsc provides all the ordering we need.  
> > 
> > Hi Peter,
> > 
> > I'm actually just working on removing this right now too, so
> > good timing.
> > 
> > I think we can't "just" remove it, because it is required to order
> > MMIO on powerpc as well.  
> 
> How is MMIO special? That is, there is only MMIO before we call into
> schedule() right? So the rq->lock should be sufficient to order that
> too.

MMIO uses different barriers. spinlock and smp_ type barriers do
not order it.

> > 
> > But what I have done is to comment that some other primitives are
> > already providing the hwsync for other, so we don't have to add
> > another one in _switch.  
> 
> Right, so this patch relies on the smp_mb__before_spinlock ->
> smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
> should thus provide the required SYNC for migrations.

AFAIKS either one will do, so long as there is a hwsync there. The
point is just that I have added some commentary in the generic and
powerpc parts to make it clear we're relying on that behavior of
the primitive. smp_mb* is not guaranteed to order MMIO, it's just
that it does on powerpc.

> That said, I think you can already use the smp_mb__before_spinlock() as
> that is done with IRQs disabled, but its a more difficult argument. The
> rq->lock RCsc property should be more obvious.

This is what I got.

https://patchwork.ozlabs.org/patch/770154/

But I'm not sure I followed; I'm not sure why it's a more difficult
argument: any time a process moves it must first execute a hwsync on
the current CPU after it has performed all its accesses there, and
then it must execute a hwsync on the new CPU before it performs any
new accesses.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  7:29       ` Nicholas Piggin
@ 2017-06-08  7:57         ` Peter Zijlstra
  2017-06-08  8:21           ` Nicholas Piggin
  2017-06-08  9:54           ` Michael Ellerman
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-08  7:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern, linuxppc-dev

On Thu, Jun 08, 2017 at 05:29:38PM +1000, Nicholas Piggin wrote:
> On Thu, 8 Jun 2017 08:54:00 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Jun 08, 2017 at 10:32:44AM +1000, Nicholas Piggin wrote:
> > > On Wed, 07 Jun 2017 18:15:06 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >   
> > > > Now that the scheduler's rq->lock is RCsc and thus provides full
> > > > transitivity between scheduling actions. And since we cannot migrate
> > > > current, a task needs a switch-out and a switch-in in order to
> > > > migrate, in which case the RCsc provides all the ordering we need.  
> > > 
> > > Hi Peter,
> > > 
> > > I'm actually just working on removing this right now too, so
> > > good timing.
> > > 
> > > I think we can't "just" remove it, because it is required to order
> > > MMIO on powerpc as well.  
> > 
> > How is MMIO special? That is, there is only MMIO before we call into
> > schedule() right? So the rq->lock should be sufficient to order that
> > too.
> 
> MMIO uses different barriers. spinlock and smp_ type barriers do
> not order it.

Right, but you only have SYNC, which is what makes it possible at all.

Some of the other architectures are not so lucky and need a different
barrier, ARM for instance needs DSB(ISH) vs the DMB(ISH) provided by
smp_mb(). IA64, MIPS and a few others are in the same boat as ARM.

> > > But what I have done is to comment that some other primitives are
> > > already providing the hwsync for other, so we don't have to add
> > > another one in _switch.  
> > 
> > Right, so this patch relies on the smp_mb__before_spinlock ->
> > smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
> > should thus provide the required SYNC for migrations.
> 
> AFAIKS either one will do, so long as there is a hwsync there. The
> point is just that I have added some commentary in the generic and
> powerpc parts to make it clear we're relying on that behavior of
> the primitive. smp_mb* is not guaranteed to order MMIO, it's just
> that it does on powerpc.

I'm not particularly happy with the generic comment; I don't feel we
should care that PPC is special here.

> > That said, I think you can already use the smp_mb__before_spinlock() as
> > that is done with IRQs disabled, but its a more difficult argument. The
> > rq->lock RCsc property should be more obvious.
> 
> This is what I got.
> 
> https://patchwork.ozlabs.org/patch/770154/

Your comment isn't fully correct, smp_cond_load_acquire() isn't
necessarily done by CPUy. It might be easiest to simply refer to the
"Notes on Program-Order guarantees on SMP systems." comment.

> But I'm not sure I followed; I'm not sure why it's a more difficult
> argument: any time a process moves it must first execute a hwsync on
> the current CPU after it has performed all its accesses there, and
> then it must execute a hwsync on the new CPU before it performs any
> new accesses.

Yeah, it's not a terribly difficult argument either way, but I feel
the RCsc rq->lock one is slightly easier.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  7:57         ` Peter Zijlstra
@ 2017-06-08  8:21           ` Nicholas Piggin
  2017-06-08  9:54           ` Michael Ellerman
  1 sibling, 0 replies; 41+ messages in thread
From: Nicholas Piggin @ 2017-06-08  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, will.deacon, oleg, paulmck, benh, mpe, linux-kernel,
	mingo, stern, linuxppc-dev

On Thu, 8 Jun 2017 09:57:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 08, 2017 at 05:29:38PM +1000, Nicholas Piggin wrote:
> > On Thu, 8 Jun 2017 08:54:00 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > On Thu, Jun 08, 2017 at 10:32:44AM +1000, Nicholas Piggin wrote:  
> > > > On Wed, 07 Jun 2017 18:15:06 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > >     
> > > > > Now that the scheduler's rq->lock is RCsc and thus provides full
> > > > > transitivity between scheduling actions. And since we cannot migrate
> > > > > current, a task needs a switch-out and a switch-in in order to
> > > > > migrate, in which case the RCsc provides all the ordering we need.    
> > > > 
> > > > Hi Peter,
> > > > 
> > > > I'm actually just working on removing this right now too, so
> > > > good timing.
> > > > 
> > > > I think we can't "just" remove it, because it is required to order
> > > > MMIO on powerpc as well.    
> > > 
> > > How is MMIO special? That is, there is only MMIO before we call into
> > > schedule() right? So the rq->lock should be sufficient to order that
> > > too.  
> > 
> > MMIO uses different barriers. spinlock and smp_ type barriers do
> > not order it.  
> 
> Right, but you only have SYNC, which is what makes it possible at all.

Yeah, but a future CPU in theory could implement some other barrier
which provides hwsync ordering for cacheable memory but not uncacheable.
smp_mb* barriers would be able to use that new type of barrier, except
here.

> Some of the other architectures are not so lucky and need a different
> barrier, ARM for instance needs DSB(ISH) vs the DMB(ISH) provided by
> smp_mb(). IA64, MIPS and a few others are in the same boat as ARM.
> 
> > > > But what I have done is to comment that some other primitives are
> > > > already providing the hwsync for other, so we don't have to add
> > > > another one in _switch.    
> > > 
> > > Right, so this patch relies on the smp_mb__before_spinlock ->
> > > smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
> > > should thus provide the required SYNC for migrations.  
> > 
> > AFAIKS either one will do, so long as there is a hwsync there. The
> > point is just that I have added some commentary in the generic and
> > powerpc parts to make it clear we're relying on that behavior of
> > the primitive. smp_mb* is not guaranteed to order MMIO, it's just
> > that it does on powerpc.  
> 
> I'm not particularly happy with the generic comment; I don't feel we
> should care that PPC is special here.

I think we do though, because its smp_mb happens to also order mmio.

Your patch, I think, failed to capture that, unless I missed
something. It's not because the rq->lock is RCsc that we can remove
the hwsync; it's because smp_mb__before/after_spinlock() has a hwsync
in it.

As a counter-example: I think you can implement RCsc spinlocks on
powerpc using ll/sc+isync for the acquire, but that would be
insufficient because there would be no hwsync for MMIO.

> > > That said, I think you can already use the smp_mb__before_spinlock() as
> > > that is done with IRQs disabled, but its a more difficult argument. The
> > > rq->lock RCsc property should be more obvious.  
> > 
> > This is what I got.
> > 
> > https://patchwork.ozlabs.org/patch/770154/  
> 
> Your comment isn't fully correct, smp_cond_load_acquire() isn't
> necessarily done by CPUy. It might be easiest to simply refer to the
> "Notes on Program-Order guarantees on SMP systems." comment.

True, thanks.

> > But I'm not sure I followed; I'm not sure why it's a more difficult
> > argument: any time a process moves it must first execute a hwsync on
> > the current CPU after it has performed all its accesses there, and
> > then it must execute a hwsync on the new CPU before it performs any
> > new accesses.
> 
> Yeah, it's not a terribly difficult argument either way, but I feel
> the RCsc rq->lock one is slightly easier.

It is neater to have the barrier inside the lock, I think.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  7:57         ` Peter Zijlstra
  2017-06-08  8:21           ` Nicholas Piggin
@ 2017-06-08  9:54           ` Michael Ellerman
  2017-06-08 10:00             ` Nicholas Piggin
  1 sibling, 1 reply; 41+ messages in thread
From: Michael Ellerman @ 2017-06-08  9:54 UTC (permalink / raw)
  To: Peter Zijlstra, Nicholas Piggin
  Cc: torvalds, will.deacon, oleg, paulmck, benh, linux-kernel, mingo,
	stern, linuxppc-dev

Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Jun 08, 2017 at 05:29:38PM +1000, Nicholas Piggin wrote:
>> On Thu, 8 Jun 2017 08:54:00 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>> > 
>> > Right, so this patch relies on the smp_mb__before_spinlock ->
>> > smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
>> > should thus provide the required SYNC for migrations.
>> 
>> AFAIKS either one will do, so long as there is a hwsync there. The
>> point is just that I have added some commentary in the generic and
>> powerpc parts to make it clear we're relying on that behavior of
>> the primitive. smp_mb* is not guaranteed to order MMIO, it's just
>> that it does on powerpc.
>
> I'm not particularly happy with the generic comment; I don't feel we
> should care that PPC is special here.

I think it'd be nice if there was *some* comment on the two uses of
smp_mb__after_spinlock(), it's fairly subtle, but I don't think it needs
to mention PPC specifically.


If we have:

arch/powerpc/include/asm/barrier.h:
+/*
+ * This must resolve to hwsync on SMP for the context switch path. See
+ * _switch.
+ */
 #define smp_mb__after_spinlock()   smp_mb()


And then something in _switch() that says "we rely on the
smp_mb__after_spinlock() in the scheduler core being a hwsync", that
should probably be sufficient.

cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08  9:54           ` Michael Ellerman
@ 2017-06-08 10:00             ` Nicholas Piggin
  2017-06-08 12:45               ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nicholas Piggin @ 2017-06-08 10:00 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, torvalds, will.deacon, oleg, paulmck, benh,
	linux-kernel, mingo, stern, linuxppc-dev

On Thu, 08 Jun 2017 19:54:30 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Peter Zijlstra <peterz@infradead.org> writes:
> > On Thu, Jun 08, 2017 at 05:29:38PM +1000, Nicholas Piggin wrote:  
> >> On Thu, 8 Jun 2017 08:54:00 +0200
> >> Peter Zijlstra <peterz@infradead.org> wrote:  
> >> > 
> >> > Right, so this patch relies on the smp_mb__before_spinlock ->
> >> > smp_mb__after_spinlock conversion that makes the rq->lock RCsc and
> >> > should thus provide the required SYNC for migrations.  
> >> 
> >> AFAIKS either one will do, so long as there is a hwsync there. The
> >> point is just that I have added some commentary in the generic and
> >> powerpc parts to make it clear we're relying on that behavior of
> >> the primitive. smp_mb* is not guaranteed to order MMIO, it's just
> >> that it does on powerpc.  
> >
> > I'm not particularly happy with the generic comment; I don't feel we
> > should care that PPC is special here.  
> 
> I think it'd be nice if there was *some* comment on the two uses of
> smp_mb__after_spinlock(), it's fairly subtle, but I don't think it needs
> to mention PPC specifically.
> 
> 
> If we have:
> 
> arch/powerpc/include/asm/barrier.h:
> +/*
> + * This must resolve to hwsync on SMP for the context switch path. See
> + * _switch.
> + */
>  #define smp_mb__after_spinlock()   smp_mb()
> 
> 
> And then something in _switch() that says "we rely on the
> smp_mb__after_spinlock() in the scheduler core being a hwsync", that
> should probably be sufficient.

I have those, I just also would like one in the core scheduler's use
of smp_mb__after_spinlock(), because it would be easy for core scheduler
change to miss that quirk. Sure we can say that Peter and scheduler
maintainers know about powerpc oddities, but then why shouldn't it also
go into a comment there?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08 10:00             ` Nicholas Piggin
@ 2017-06-08 12:45               ` Peter Zijlstra
  2017-06-08 13:18                 ` Nicholas Piggin
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-08 12:45 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Michael Ellerman, torvalds, will.deacon, oleg, paulmck, benh,
	linux-kernel, mingo, stern, linuxppc-dev

On Thu, Jun 08, 2017 at 08:00:15PM +1000, Nicholas Piggin wrote:

> I have those, I just also would like one in the core scheduler's use
> of smp_mb__after_spinlock(), because it would be easy for core scheduler
> change to miss that quirk. Sure we can say that Peter and scheduler
> maintainers know about powerpc oddities, but then why shouldn't it also
> go into a comment there?

So the core scheduler guarantees smp_mb() or equivalent full transitive
ordering happens at schedule() time.

It has for a fairly long time and I don't think we'll ever get rid of
that, its fairly fundamental.

PPC is special in that smp_mb() ends up being the strongest ordering
primitive there is. But note that PPC is not unique; afaict Alpha is
in the same boat. They rely on the MB from the scheduler core.

IA64 OTOH, while it has smp_mb() == mb(), still needs SYNC.I in
__switch_to() to serialize against (instruction) cache flushes.

So while I'm all for adding comments explaining what the core provides,
I don't see immediate reasons to call out PPC.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08 12:45               ` Peter Zijlstra
@ 2017-06-08 13:18                 ` Nicholas Piggin
  2017-06-08 13:47                   ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Nicholas Piggin @ 2017-06-08 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Ellerman, torvalds, will.deacon, oleg, paulmck, benh,
	linux-kernel, mingo, stern, linuxppc-dev

On Thu, 8 Jun 2017 14:45:40 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 08, 2017 at 08:00:15PM +1000, Nicholas Piggin wrote:
> 
> > I have those, I just also would like one in the core scheduler's use
> > of smp_mb__after_spinlock(), because it would be easy for core scheduler
> > change to miss that quirk. Sure we can say that Peter and scheduler
> > maintainers know about powerpc oddities, but then why shouldn't it also
> > go into a comment there?  
> 
> So the core scheduler guarantees smp_mb() or equivalent full transitive
> ordering happens at schedule() time.
> 
> It has for a fairly long time and I don't think we'll ever get rid of
> that, its fairly fundamental.
> 
> PPC is special in that smp_mb() ends up being the strongest ordering
> primitive there is. But note that PPC is not unique, afaict Alpha is in
> the same boat. They rely on the MB from the scheduler core.
> 
> IA64 OTOH, while they have smp_mb() == mb() still needs SYNC.I in
> __switch_to() to serialize against (instruction) cache flushes.
> 
> So while I'm all for adding comments explaining what the core provides,
> I don't see immediate reasons to call out PPC.

I guess I see your point... okay, will constrain the comment to powerpc
context switch and primitives code. Any fundamental change to such
scheduler barriers I guess would require at least a glance over arch
switch code :)

My plan is to send the powerpc sync removal patch for hopefully 4.13
merge. I'm pretty sure it will be equally happy with your patches.
Unless you can see any problems with it? More eyes would be welcome.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch
  2017-06-08 13:18                 ` Nicholas Piggin
@ 2017-06-08 13:47                   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-08 13:47 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Michael Ellerman, torvalds, will.deacon, oleg, paulmck, benh,
	linux-kernel, mingo, stern, linuxppc-dev

On Thu, Jun 08, 2017 at 11:18:13PM +1000, Nicholas Piggin wrote:
> I guess I see your point... okay, will constrain the comment to powerpc
> context switch and primitives code. Any fundamental change to such
> scheduler barriers I guess would require at least a glance over arch
> switch code :)

Just so.

> My plan is to send the powerpc sync removal patch for hopefully 4.13
> merge. I'm pretty sure it will be equally happy with your patches.
> Unless you can see any problems with it? More eyes would be welcome.

Feel free to Cc me. I don't see a problem with removing that SYNC now,
I'll rebase my patches on top.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-06-07 16:15 ` [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
@ 2017-06-09 14:45   ` Will Deacon
  2017-06-09 18:42     ` Peter Zijlstra
  2017-07-28 17:45     ` Peter Zijlstra
  0 siblings, 2 replies; 41+ messages in thread
From: Will Deacon @ 2017-06-09 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Wed, Jun 07, 2017 at 06:15:02PM +0200, Peter Zijlstra wrote:
> Commit:
> 
>   af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> 
> added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> can solve the same problem without this barrier.
> 
> If instead we mandate that mm_tlb_flush_pending() is used while
> holding the PTL we're guaranteed to observe prior
> set_tlb_flush_pending() instances.
> 
> For this to work we need to rework migrate_misplaced_transhuge_page()
> a little and move the test up into do_huge_pmd_numa_page().
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -527,18 +527,16 @@ static inline cpumask_t *mm_cpumask(stru
>   */
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
>  {
> -	barrier();
> +	/*
> +	 * Must be called with PTL held; such that our PTL acquire will have
> +	 * observed the store from set_tlb_flush_pending().
> +	 */
>  	return mm->tlb_flush_pending;
>  }
>  static inline void set_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	mm->tlb_flush_pending = true;
> -
> -	/*
> -	 * Guarantee that the tlb_flush_pending store does not leak into the
> -	 * critical section updating the page tables
> -	 */
> -	smp_mb__before_spinlock();
> +	barrier();

Why do you need the barrier() here? Isn't the ptl unlock sufficient?

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock
  2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-06-07 16:15 ` [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch Peter Zijlstra
@ 2017-06-09 14:49 ` Will Deacon
  5 siblings, 0 replies; 41+ messages in thread
From: Will Deacon @ 2017-06-09 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo, stern

On Wed, Jun 07, 2017 at 06:15:01PM +0200, Peter Zijlstra wrote:
> There was a thread on this somewhere about a year ago, I finally remembered to
> finish this :-)
> 
> This series removes two smp_mb__before_spinlock() (ab)users and converts the
> scheduler to use smp_mb__after_spinlock(), which provides more guarantees with
> the same amount of barriers.

For the arm64 bits:

Acked-by: Will Deacon <will.deacon@arm.com>

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-06-09 14:45   ` Will Deacon
@ 2017-06-09 18:42     ` Peter Zijlstra
  2017-07-28 17:45     ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-06-09 18:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Fri, Jun 09, 2017 at 03:45:54PM +0100, Will Deacon wrote:
> On Wed, Jun 07, 2017 at 06:15:02PM +0200, Peter Zijlstra wrote:
> > Commit:
> > 
> >   af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> > 
> > added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> > can solve the same problem without this barrier.
> > 
> > If instead we mandate that mm_tlb_flush_pending() is used while
> > holding the PTL we're guaranteed to observe prior
> > set_tlb_flush_pending() instances.
> > 
> > For this to work we need to rework migrate_misplaced_transhuge_page()
> > a little and move the test up into do_huge_pmd_numa_page().
> > 
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -527,18 +527,16 @@ static inline cpumask_t *mm_cpumask(stru
> >   */
> >  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> >  {
> > -	barrier();
> > +	/*
> > +	 * Must be called with PTL held; such that our PTL acquire will have
> > +	 * observed the store from set_tlb_flush_pending().
> > +	 */
> >  	return mm->tlb_flush_pending;
> >  }
> >  static inline void set_tlb_flush_pending(struct mm_struct *mm)
> >  {
> >  	mm->tlb_flush_pending = true;
> > -
> > -	/*
> > -	 * Guarantee that the tlb_flush_pending store does not leak into the
> > -	 * critical section updating the page tables
> > -	 */
> > -	smp_mb__before_spinlock();
> > +	barrier();
> 
> Why do you need the barrier() here? Isn't the ptl unlock sufficient?

General paranoia, I think. I'll have another look.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-06-09 14:45   ` Will Deacon
  2017-06-09 18:42     ` Peter Zijlstra
@ 2017-07-28 17:45     ` Peter Zijlstra
  2017-08-01 10:31       ` Will Deacon
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-07-28 17:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Fri, Jun 09, 2017 at 03:45:54PM +0100, Will Deacon wrote:
> On Wed, Jun 07, 2017 at 06:15:02PM +0200, Peter Zijlstra wrote:
> > Commit:
> > 
> >   af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> > 
> > added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> > can solve the same problem without this barrier.
> > 
> > If instead we mandate that mm_tlb_flush_pending() is used while
> > holding the PTL we're guaranteed to observe prior
> > set_tlb_flush_pending() instances.
> > 
> > For this to work we need to rework migrate_misplaced_transhuge_page()
> > a little and move the test up into do_huge_pmd_numa_page().
> > 
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -527,18 +527,16 @@ static inline cpumask_t *mm_cpumask(stru
> >   */
> >  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> >  {
> > -	barrier();
> > +	/*
> > +	 * Must be called with PTL held; such that our PTL acquire will have
> > +	 * observed the store from set_tlb_flush_pending().
> > +	 */
> >  	return mm->tlb_flush_pending;
> >  }
> >  static inline void set_tlb_flush_pending(struct mm_struct *mm)
> >  {
> >  	mm->tlb_flush_pending = true;
> > -
> > -	/*
> > -	 * Guarantee that the tlb_flush_pending store does not leak into the
> > -	 * critical section updating the page tables
> > -	 */
> > -	smp_mb__before_spinlock();
> > +	barrier();
> 
> Why do you need the barrier() here? Isn't the ptl unlock sufficient?

So I was going through these here patches again, and wrote the
following comment:

static inline void set_tlb_flush_pending(struct mm_struct *mm)
{
	mm->tlb_flush_pending = true;
	/*
	 * The only time this value is relevant is when there are indeed pages
	 * to flush. And we'll only flush pages after changing them, which
	 * requires the PTL.
	 *
	 * So the ordering here is:
	 *
	 * 	mm->tlb_flush_pending = true;
	 * 	spin_lock(&ptl);
	 *	...
	 * 	set_pte_at();
	 * 	spin_unlock(&ptl);
	 *
	 *
	 * 				spin_lock(&ptl)
	 * 				mm_tlb_flush_pending();
	 * 				....
	 * 				spin_unlock(&ptl);
	 *
	 * 	flush_tlb_range();
	 * 	mm->tlb_flush_pending = false;
	 */
}

And while the ptl locks are indeed sufficient to constrain the true
assignment, what constrains the false assignment? As in the above there
is nothing stopping the false from ending up visible at
mm_tlb_flush_pending().

Or does flush_tlb_range() have implicit ordering? It does on x86, but is
this generally so?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-07-28 17:45     ` Peter Zijlstra
@ 2017-08-01 10:31       ` Will Deacon
  2017-08-01 12:02         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2017-08-01 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, paulmck, benh, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Fri, Jul 28, 2017 at 07:45:33PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 09, 2017 at 03:45:54PM +0100, Will Deacon wrote:
> > On Wed, Jun 07, 2017 at 06:15:02PM +0200, Peter Zijlstra wrote:
> > > Commit:
> > > 
> > >   af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
> > > 
> > > added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
> > > can solve the same problem without this barrier.
> > > 
> > > If instead we mandate that mm_tlb_flush_pending() is used while
> > > holding the PTL we're guaranteed to observe prior
> > > set_tlb_flush_pending() instances.
> > > 
> > > For this to work we need to rework migrate_misplaced_transhuge_page()
> > > a little and move the test up into do_huge_pmd_numa_page().
> > > 
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -527,18 +527,16 @@ static inline cpumask_t *mm_cpumask(stru
> > >   */
> > >  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> > >  {
> > > -	barrier();
> > > +	/*
> > > +	 * Must be called with PTL held; such that our PTL acquire will have
> > > +	 * observed the store from set_tlb_flush_pending().
> > > +	 */
> > >  	return mm->tlb_flush_pending;
> > >  }
> > >  static inline void set_tlb_flush_pending(struct mm_struct *mm)
> > >  {
> > >  	mm->tlb_flush_pending = true;
> > > -
> > > -	/*
> > > -	 * Guarantee that the tlb_flush_pending store does not leak into the
> > > -	 * critical section updating the page tables
> > > -	 */
> > > -	smp_mb__before_spinlock();
> > > +	barrier();
> > 
> > Why do you need the barrier() here? Isn't the ptl unlock sufficient?
> 
> So I was going through these here patches again, and wrote the
> following comment:
> 
> static inline void set_tlb_flush_pending(struct mm_struct *mm)
> {
> 	mm->tlb_flush_pending = true;
> 	/*
> 	 * The only time this value is relevant is when there are indeed pages
> 	 * to flush. And we'll only flush pages after changing them, which
> 	 * requires the PTL.
> 	 *
> 	 * So the ordering here is:
> 	 *
> 	 * 	mm->tlb_flush_pending = true;
> 	 * 	spin_lock(&ptl);
> 	 *	...
> 	 * 	set_pte_at();
> 	 * 	spin_unlock(&ptl);
> 	 *
> 	 *
> 	 * 				spin_lock(&ptl)
> 	 * 				mm_tlb_flush_pending();
> 	 * 				....
> 	 * 				spin_unlock(&ptl);
> 	 *
> 	 * 	flush_tlb_range();
> 	 * 	mm->tlb_flush_pending = false;
> 	 */
> }
> 
> And while the ptl locks are indeed sufficient to constrain the true
> assignment, what constrains the false assignment? As in the above there
> is nothing stopping the false from ending up visible at
> mm_tlb_flush_pending().
> 
> Or does flush_tlb_range() have implicit ordering? It does on x86, but is
> this generally so?

Looks like that's what's currently relied upon:

  /* Clearing is done after a TLB flush, which also provides a barrier. */

It also provides barrier semantics on arm/arm64. In reality, I suspect
all archs have to provide some order between set_pte_at and flush_tlb_range
which is sufficient to hold up clearing the flag. :/

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 10:31       ` Will Deacon
@ 2017-08-01 12:02         ` Benjamin Herrenschmidt
  2017-08-01 12:14           ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-01 12:02 UTC (permalink / raw)
  To: Will Deacon, Peter Zijlstra
  Cc: torvalds, oleg, paulmck, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Tue, 2017-08-01 at 11:31 +0100, Will Deacon wrote:
> Looks like that's what's currently relied upon:
> 
>   /* Clearing is done after a TLB flush, which also provides a barrier. */
> 
> It also provides barrier semantics on arm/arm64. In reality, I suspect
> all archs have to provide some order between set_pte_at and flush_tlb_range
> which is sufficient to hold up clearing the flag. :/

Hrm... not explicitely.

Most archs (powerpc among them) have set_pte_at be just a dumb store,
so the only barrier it has is the surrounding PTL.

Now flush_tlb_range() I assume has some internal strong barriers but
none of that is well defined or documented at all, so I suspect all
bets are off.

Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 12:02         ` Benjamin Herrenschmidt
@ 2017-08-01 12:14           ` Peter Zijlstra
  2017-08-01 16:39             ` Peter Zijlstra
  2017-08-01 22:42             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-01 12:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Tue, Aug 01, 2017 at 10:02:45PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-01 at 11:31 +0100, Will Deacon wrote:
> > Looks like that's what's currently relied upon:
> > 
> >   /* Clearing is done after a TLB flush, which also provides a barrier. */
> > 
> > It also provides barrier semantics on arm/arm64. In reality, I suspect
> > all archs have to provide some order between set_pte_at and flush_tlb_range
> > which is sufficient to hold up clearing the flag. :/
> 
> Hrm... not explicitly.
> 
> Most archs (powerpc among them) have set_pte_at be just a dumb store,
> so the only barrier it has is the surrounding PTL.
> 
> Now flush_tlb_range() I assume has some internal strong barriers but
> none of that is well defined or documented at all, so I suspect all
> bets are off.

Right.. but seeing how we're in fact relying on things here it might be
time to go figure this out and document bits.

*sigh*, I suppose it's going to be me doing this.. :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 12:14           ` Peter Zijlstra
@ 2017-08-01 16:39             ` Peter Zijlstra
  2017-08-01 16:44               ` Will Deacon
  2017-08-01 22:42             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-01 16:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Tue, Aug 01, 2017 at 02:14:19PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 10:02:45PM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-01 at 11:31 +0100, Will Deacon wrote:
> > > Looks like that's what's currently relied upon:
> > > 
> > >   /* Clearing is done after a TLB flush, which also provides a barrier. */
> > > 
> > > It also provides barrier semantics on arm/arm64. In reality, I suspect
> > > all archs have to provide some order between set_pte_at and flush_tlb_range
> > > which is sufficient to hold up clearing the flag. :/
> > 
> > Hrm... not explicitly.
> > 
> > Most archs (powerpc among them) have set_pte_at be just a dumb store,
> > so the only barrier it has is the surrounding PTL.
> > 
> > Now flush_tlb_range() I assume has some internal strong barriers but
> > none of that is well defined or documented at all, so I suspect all
> > bets are off.
> 
> Right.. but seeing how we're in fact relying on things here it might be
> time to go figure this out and document bits.
> 
> *sigh*, I suppose it's going to be me doing this.. :-)


So on the related question: does on_each_cpu() provide a full smp_mb()?
I think we can answer: yes.

on_each_cpu() does IPIs to all _other_ CPUs, and those IPIs use
llist_add(), which is a cmpxchg() and thus implies smp_mb().

After that it runs the local function.

So we can see on_each_cpu() as doing an smp_mb() before running @func.
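
Roughly the call chain I'm relying on (simplified and from memory, so
treat the exact path as illustrative):

	on_each_cpu(func, info, wait)
	  smp_call_function(func, info, wait)
	    smp_call_function_many(mask, func, info, wait)
	      llist_add(csd, call_single_queue)	/* cmpxchg() -> smp_mb() */
	      arch_send_call_function_ipi_mask(mask)
	  local_irq_save()
	  func(info)				/* local call, after the mb */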


xtensa - it uses on_each_cpu() for TLB invalidates.

x86 - we use either on_each_cpu() (flush_tlb_all(),
flush_tlb_kernel_range()) or we use flush_tlb_mm_range() which does an
atomic_inc_return() at the very start. Not to mention that actually
flushing TLBs itself is a barrier. Arguably flush_tlb_mm_range() should
first do _others_ and then self, because others will use
smp_call_function_many() and see above.

(TODO look into paravirt)

Tile - does mb() in flush_remote()

sparc32-smp !?

sparc64 -- nope, no-op functions, TLB flushes are contained inside the PTL.

sh - yes, per smp_call_function

s390 - has atomics when it flushes. ptep_modify_prot_start() can set
mm->flush_mm = 1, at which point flush_tlb_range() will actually do
something; in that case there will be an smp_mb() as per the atomics.
Otherwise the TLB invalidate is contained inside the PTL.

powerpc - radix - PTESYNC
	  hash - flush inside PTL


parisc - has all PTE and TLB operations serialized using a global lock

mn10300 - *ugh* but yes, smp_call_function() for remote CPUs

mips - smp_call_function for remote CPUs

metag - mmio write

m32r - doesn't seem to have smp_mb()

ia64 - smp_call_function_*()

hexagon - HVM trap, no smp_mb()

blackfin - nommu

arm - dsb ish

arm64  - dsb ish

arc - no barrier

alpha - no barrier


Now, for the architectures that do not have a barrier, like alpha, arc
and metag, the PTL spin_unlock has an smp_mb(); however, I don't think
that is enough, because the flush_tlb_range() might still be pending.
That said, these architectures probably don't have transparent huge
pages, so it doesn't matter.


Still this is all rather unsatisfactory. Either we should define
flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
or simply make clear_tlb_flush_pending() an smp_store_release().
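
For the latter I'm thinking of something along these lines (sketch
only; whether a release is also strong enough against the invalidate
itself is a separate question):

/* Clearing is done after a TLB flush; do not let it creep ahead. */
static inline void clear_tlb_flush_pending(struct mm_struct *mm)
{
	/*
	 * Pairs with the PTL acquire in mm_tlb_flush_pending(); all
	 * memory operations issued before this point stay before the
	 * flag clear.
	 */
	smp_store_release(&mm->tlb_flush_pending, false);
}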

I prefer the latter option.

Opinions?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 16:39             ` Peter Zijlstra
@ 2017-08-01 16:44               ` Will Deacon
  2017-08-01 16:48                 ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2017-08-01 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Tue, Aug 01, 2017 at 06:39:03PM +0200, Peter Zijlstra wrote:
> Still this is all rather unsatisfactory. Either we should define
> flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
> or simply make clear_tlb_flush_pending() an smp_store_release().
> 
> I prefer the latter option.
> 
> Opinions?

I prefer the latter option too, since I'd like to relax the arm64 TLB
flushing to have weaker barriers for the local case. Granted, that doesn't
break the NUMA migration code, but it would make the barrier semantics of
the TLB invalidation routines even more subtle if we were to define them
generally.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 16:44               ` Will Deacon
@ 2017-08-01 16:48                 ` Peter Zijlstra
  2017-08-01 22:59                   ` Peter Zijlstra
  2017-08-02  0:17                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-01 16:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Tue, Aug 01, 2017 at 05:44:14PM +0100, Will Deacon wrote:
> On Tue, Aug 01, 2017 at 06:39:03PM +0200, Peter Zijlstra wrote:
> > Still this is all rather unsatisfactory. Either we should define
> > flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
> > or simply make clear_tlb_flush_pending() an smp_store_release().
> > 
> > I prefer the latter option.
> > 
> > Opinions?
> 
> I prefer the latter option too, since I'd like to relax the arm64 TLB
> flushing to have weaker barriers for the local case. Granted, that doesn't
> break the NUMA migration code, but it would make the barrier semantics of
> the TLB invalidation routines even more subtle if we were to define them
> generally.

Another 'fun' question: is smp_mb() strong enough to order against the
TLB invalidate? Because we really want to clear this flag _after_.

PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
work? Ben?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 12:14           ` Peter Zijlstra
  2017-08-01 16:39             ` Peter Zijlstra
@ 2017-08-01 22:42             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-01 22:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Tue, 2017-08-01 at 14:14 +0200, Peter Zijlstra wrote:
> Right.. but seeing how we're in fact relying on things here it might be
> time to go figure this out and document bits.
> 
> *sigh*, I suppose it's going to be me doing this.. :-)

Thanks mate ! :-)

Cheers
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 16:48                 ` Peter Zijlstra
@ 2017-08-01 22:59                   ` Peter Zijlstra
  2017-08-02  1:23                     ` Benjamin Herrenschmidt
  2017-08-02  0:17                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-01 22:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Tue, Aug 01, 2017 at 06:48:20PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 05:44:14PM +0100, Will Deacon wrote:
> > On Tue, Aug 01, 2017 at 06:39:03PM +0200, Peter Zijlstra wrote:
> > > Still this is all rather unsatisfactory. Either we should define
> > > flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
> > > or simply make clear_tlb_flush_pending() an smp_store_release().
> > > 
> > > I prefer the latter option.
> > > 
> > > Opinions?
> > 
> > I prefer the latter option too, since I'd like to relax the arm64 TLB
> > flushing to have weaker barriers for the local case. Granted, that doesn't
> > break the NUMA migration code, but it would make the barrier semantics of
> > the TLB invalidation routines even more subtle if we were to define them
> > generally.
> 
> Another 'fun' question, is smp_mb() strong enough to order against the
> TLB invalidate? Because we really want to clear this flag _after_.
> 
> PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> work? Ben?

From what I gather it is not. You have TLBSYNC for it. So the good news
is that PPC-radix does all that and is fully serialized on the tlb
flush. Not sure for the PPC-hash case.

At the same time, smp_mb() is not sufficient on ARM either; they need a
DSB barrier on both ends.

So are we going to mandate that TLB flush implementations are
completely ordered?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 16:48                 ` Peter Zijlstra
  2017-08-01 22:59                   ` Peter Zijlstra
@ 2017-08-02  0:17                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-02  0:17 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon
  Cc: torvalds, oleg, paulmck, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Tue, 2017-08-01 at 18:48 +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 05:44:14PM +0100, Will Deacon wrote:
> > On Tue, Aug 01, 2017 at 06:39:03PM +0200, Peter Zijlstra wrote:
> > > Still this is all rather unsatisfactory. Either we should define
> > > flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
> > > or simply make clear_tlb_flush_pending() an smp_store_release().
> > > 
> > > I prefer the latter option.
> > > 
> > > Opinions?
> > 
> > I prefer the latter option too, since I'd like to relax the arm64 TLB
> > flushing to have weaker barriers for the local case. Granted, that doesn't
> > break the NUMA migration code, but it would make the barrier semantics of
> > the TLB invalidation routines even more subtle if we were to define them
> > generally.
> 
> Another 'fun' question, is smp_mb() strong enough to order against the
> TLB invalidate? Because we really want to clear this flag _after_.
> 
> PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> work? Ben?

I have no idea. But then our tlbie has a ptesync after too, no? And
afaik a ptesync is a superset of sync.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-01 22:59                   ` Peter Zijlstra
@ 2017-08-02  1:23                     ` Benjamin Herrenschmidt
  2017-08-02  8:11                       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-02  1:23 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon
  Cc: torvalds, oleg, paulmck, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Wed, 2017-08-02 at 00:59 +0200, Peter Zijlstra wrote:
> > PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> > work? Ben?
> > From what I gather it is not. You have TLBSYNC for it. So the good news

tlbsync is pretty much a nop these days. ptesync is a strict superset
of sync and we have it after every tlbie.

> is that PPC-radix does all that and is fully serialized on the tlb
> flush. Not sure for the PPC-hash case.
> 
> At the same time, smp_mb() is not sufficient on ARM either, they need a
> DSB barrier on both ends.
> 
> So are we going to mandate tlb flush implementations are completely
> ordered ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  1:23                     ` Benjamin Herrenschmidt
@ 2017-08-02  8:11                       ` Peter Zijlstra
  2017-08-02  8:15                         ` Will Deacon
  2017-08-02 13:57                         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-02  8:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 11:23:12AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2017-08-02 at 00:59 +0200, Peter Zijlstra wrote:
> > > PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> > > work? Ben?
> > > From what I gather it is not. You have TLBSYNC for it. So the good news
> 
> tlbsync is pretty much a nop these days. ptesync is a strict superset
> of sync and we have it after every tlbie.

In the radix code, yes. I got lost going through the hash code, and I
didn't look at the 32bit code at all.

So the radix code does:

 PTESYNC
 TLBIE
 EIEIO; TLBSYNC; PTESYNC

which should be completely ordered against anything prior and anything
following, and is I think the behaviour we want from TLB flushes in
general, but is very much not provided by a number of architectures
afaict.

Ah, found the hash-64 code, yes that's good too. The hash32 code lives
in asm and confuses me, it has a bunch of SYNC, SYNC_601 and isync in.
The nohash variant seems to do an isync after tlbwe, but again no clue.


Now, do I go and attempt fixing all that needs fixing?


x86 is good, our CR3 writes or INVLPG stuff is fully serializing.

arm is good, it does DSB ISH before and after

arm64 looks good too, although it plays silly games with the first
barrier, but I trust that to be sufficient.

But I'll have to go dig up arch manuals for the rest, if they include
the relevant information at all of course :/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:11                       ` Peter Zijlstra
@ 2017-08-02  8:15                         ` Will Deacon
  2017-08-02  8:43                           ` Will Deacon
  2017-08-02  8:45                           ` Peter Zijlstra
  2017-08-02 13:57                         ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 41+ messages in thread
From: Will Deacon @ 2017-08-02  8:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 10:11:06AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:23:12AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-08-02 at 00:59 +0200, Peter Zijlstra wrote:
> > > > PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> > > > work? Ben?
> > > > From what I gather it is not. You have TLBSYNC for it. So the good news
> > 
> > tlbsync is pretty much a nop these days. ptesync is a strict superset
> > of sync and we have it after every tlbie.
> 
> In the radix code, yes. I got lost going through the hash code, and I
> didn't look at the 32bit code at all.
> 
> So the radix code does:
> 
>  PTESYNC
>  TLBIE
>  EIEIO; TLBSYNC; PTESYNC
> 
> which should be completely ordered against anything prior and anything
> following, and is I think the behaviour we want from TLB flushes in
> general, but is very much not provided by a number of architectures
> afaict.
> 
> Ah, found the hash-64 code, yes that's good too. The hash32 code lives
> in asm and confuses me, it has a bunch of SYNC, SYNC_601 and isync in.
> The nohash variant seems to do a isync after tlbwe, but again no clue.
> 
> 
> Now, do I go and attempt fixing all that needs fixing?
> 
> 
> x86 is good, our CR3 writes or INVLPG stuff is fully serializing.
> 
> arm is good, it does DSB ISH before and after
> 
> arm64 looks good too, although it plays silly games with the first
> barrier, but I trust that to be sufficient.

The first barrier only orders prior stores for us, because page table
updates are made using stores. A prior load could be reordered past the
invalidation, but can't make it past the second barrier.
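
That is, the sequence is roughly (from memory; the exact TLBI variant
is illustrative):

	DSB	ISHST		// prior page-table stores visible to the walker
	TLBI	VAE1IS, ...	// invalidate the stale entries
	DSB	ISH		// wait for the invalidation to complete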

I really think we should avoid defining TLB invalidation in terms of
smp_mb() because it's a lot more subtle than that.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:15                         ` Will Deacon
@ 2017-08-02  8:43                           ` Will Deacon
  2017-08-02  8:51                             ` Peter Zijlstra
  2017-08-02  8:45                           ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Will Deacon @ 2017-08-02  8:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:
> On Wed, Aug 02, 2017 at 10:11:06AM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 11:23:12AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2017-08-02 at 00:59 +0200, Peter Zijlstra wrote:
> > > > > PowerPC for example uses PTESYNC before the TLBIE, so does a SYNC after
> > > > > work? Ben?
> > > > > From what I gather it is not. You have TLBSYNC for it. So the good news
> > > 
> > > tlbsync is pretty much a nop these days. ptesync is a strict superset
> > > of sync and we have it after every tlbie.
> > 
> > In the radix code, yes. I got lost going through the hash code, and I
> > didn't look at the 32bit code at all.
> > 
> > So the radix code does:
> > 
> >  PTESYNC
> >  TLBIE
> >  EIEIO; TLBSYNC; PTESYNC
> > 
> > which should be completely ordered against anything prior and anything
> > following, and is I think the behaviour we want from TLB flushes in
> > general, but is very much not provided by a number of architectures
> > afaict.
> > 
> > Ah, found the hash-64 code, yes that's good too. The hash32 code lives
> > in asm and confuses me, it has a bunch of SYNC, SYNC_601 and isync in.
> > The nohash variant seems to do a isync after tlbwe, but again no clue.
> > 
> > 
> > Now, do I go and attempt fixing all that needs fixing?
> > 
> > 
> > x86 is good, our CR3 writes or INVLPG stuff is fully serializing.
> > 
> > arm is good, it does DSB ISH before and after
> > 
> > arm64 looks good too, although it plays silly games with the first
> > barrier, but I trust that to be sufficient.
> 
> The first barrier only orders prior stores for us, because page table
> updates are made using stores. A prior load could be reordered past the
> invalidation, but can't make it past the second barrier.
> 
> I really think we should avoid defining TLB invalidation in terms of
> smp_mb() because it's a lot more subtle than that.

Another worry I have here is with architectures that can optimise the
"only need to flush the local TLB" case. For example, this version of 'R':


P0:
WRITE_ONCE(x, 1);
smp_mb();
WRITE_ONCE(y, 1);

P1:
WRITE_ONCE(y, 2);
flush_tlb_range(...);  // Only needs to flush the local TLB
r0 = READ_ONCE(x);


It doesn't seem unreasonable to me for y==2 && r0==0 if the
flush_tlb_range(...) ends up only doing local invalidation. As a concrete
example, imagine a CPU with a page table walker that can snoop the local
store-buffer. Then, the local flush_tlb_range in P1 only needs to progress
the write to y as far as the store-buffer before it can invalidate the local
TLB. Once the TLB is invalidated, it can read x knowing that the translation
is up-to-date wrt the page table, but that read doesn't need to wait for
write to y to become visible to other CPUs.

So flush_tlb_range is actually weaker than smp_mb in some respects, yet the
flush_tlb_pending stuff will still work correctly.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:15                         ` Will Deacon
  2017-08-02  8:43                           ` Will Deacon
@ 2017-08-02  8:45                           ` Peter Zijlstra
  2017-08-02  9:02                             ` Will Deacon
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-02  8:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:
> On Wed, Aug 02, 2017 at 10:11:06AM +0200, Peter Zijlstra wrote:

> > arm64 looks good too, although it plays silly games with the first
> > barrier, but I trust that to be sufficient.
> 
> The first barrier only orders prior stores for us, because page table
> updates are made using stores. A prior load could be reordered past the
> invalidation, but can't make it past the second barrier.

So then you rely on the program not having any loads pending to the
address you're about to invalidate, right? Otherwise we can do the TLBI
and then the load to insta-repopulate the TLB entry you just wanted
dead.

That later DSB ISH is too late for that.

Isn't that somewhat fragile?

> I really think we should avoid defining TLB invalidation in terms of
> smp_mb() because it's a lot more subtle than that.

I'm tempted to say stronger: smp_mb() only provides order; we want full
serialization. Everything before stays before and _completes_ before.
Everything after happens after (if the primitives actually do something
at all of course, sparc64 for instance has no-op flush_tlb*).

While such semantics might be slightly too strong for what we currently
need, it is what powerpc, x86 and arm currently implement and is fairly
easy to reason about. If we weaken it, stuff gets confusing again.
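
If we were to write that down, it would be something along these lines
(strawman wording only, nothing like this is merged anywhere):

	/*
	 * flush_tlb_range(vma, start, end) is fully serializing:
	 *
	 *  - all memory accesses issued before it complete before the
	 *    invalidate starts;
	 *
	 *  - the invalidate itself completes before any memory access
	 *    issued after it.
	 */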

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:43                           ` Will Deacon
@ 2017-08-02  8:51                             ` Peter Zijlstra
  2017-08-02  9:02                               ` Will Deacon
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-02  8:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 09:43:50AM +0100, Will Deacon wrote:
> On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:

> > I really think we should avoid defining TLB invalidation in terms of
> > smp_mb() because it's a lot more subtle than that.
> 
> Another worry I have here is with architectures that can optimise the
> "only need to flush the local TLB" case. For example, this version of 'R':
> 
> 
> P0:
> WRITE_ONCE(x, 1);
> smp_mb();
> WRITE_ONCE(y, 1);
> 
> P1:
> WRITE_ONCE(y, 2);
> flush_tlb_range(...);  // Only needs to flush the local TLB
> r0 = READ_ONCE(x);
> 
> 
> It doesn't seem unreasonable to me for y==2 && r0==0 if the
> flush_tlb_range(...) ends up only doing local invalidation. As a concrete
> example, imagine a CPU with a page table walker that can snoop the local
> store-buffer. Then, the local flush_tlb_range in P1 only needs to progress
> the write to y as far as the store-buffer before it can invalidate the local
> TLB. Once the TLB is invalidated, it can read x knowing that the translation
> is up-to-date wrt the page table, but that read doesn't need to wait for
> write to y to become visible to other CPUs.
> 
> So flush_tlb_range is actually weaker than smp_mb in some respects, yet the
> flush_tlb_pending stuff will still work correctly.

So while I think you're right, and we could live with this (after all,
if we know the mm is CPU-local, there shouldn't be any SMP concerns wrt
its page tables), do you really want to make this more complicated?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:51                             ` Peter Zijlstra
@ 2017-08-02  9:02                               ` Will Deacon
  2017-08-02 22:54                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2017-08-02  9:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 10:51:11AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 09:43:50AM +0100, Will Deacon wrote:
> > On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:
> 
> > > I really think we should avoid defining TLB invalidation in terms of
> > > smp_mb() because it's a lot more subtle than that.
> > 
> > Another worry I have here is with architectures that can optimise the
> > "only need to flush the local TLB" case. For example, this version of 'R':
> > 
> > 
> > P0:
> > WRITE_ONCE(x, 1);
> > smp_mb();
> > WRITE_ONCE(y, 1);
> > 
> > P1:
> > WRITE_ONCE(y, 2);
> > flush_tlb_range(...);  // Only needs to flush the local TLB
> > r0 = READ_ONCE(x);
> > 
> > 
> > It doesn't seem unreasonable to me for y==2 && r0==0 if the
> > flush_tlb_range(...) ends up only doing local invalidation. As a concrete
> > example, imagine a CPU with a page table walker that can snoop the local
> > store-buffer. Then, the local flush_tlb_range in P1 only needs to progress
> > the write to y as far as the store-buffer before it can invalidate the local
> > TLB. Once the TLB is invalidated, it can read x knowing that the translation
> > is up-to-date wrt the page table, but that read doesn't need to wait for
> > write to y to become visible to other CPUs.
> > 
> > So flush_tlb_range is actually weaker than smp_mb in some respects, yet the
> > flush_tlb_pending stuff will still work correctly.
> 
> So while I think you're right, and we could live with this, after all,
> if we know the mm is CPU local, there shouldn't be any SMP concerns wrt
> its page tables. Do you really want to make this more complicated?

It gives us a nice performance lift on arm64 and I have a patch...[1]

Will

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=aarch64/devel&id=1c7cf53658f0fa16338d1f8406285ae28fd5f616

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:45                           ` Peter Zijlstra
@ 2017-08-02  9:02                             ` Will Deacon
  2017-08-02  9:18                               ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2017-08-02  9:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 10:45:51AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:
> > On Wed, Aug 02, 2017 at 10:11:06AM +0200, Peter Zijlstra wrote:
> 
> > > arm64 looks good too, although it plays silly games with the first
> > > barrier, but I trust that to be sufficient.
> > 
> > The first barrier only orders prior stores for us, because page table
> > updates are made using stores. A prior load could be reordered past the
> > invalidation, but can't make it past the second barrier.
> 
> So then you rely on the program not having any loads pending to the
> address you're about to invalidate, right? Otherwise we can do the TLBI
> and then the load to insta-repopulate the TLB entry you just wanted
> dead.
> 
> That later DSB ISH is too late for that.
> 
> Isn't that somewhat fragile?

We only initiate the TLB invalidation after the page table update is
observable to the page table walker, so any repopulation will cause a fill
using the new page table entry.

> > I really think we should avoid defining TLB invalidation in terms of
> > smp_mb() because it's a lot more subtle than that.
> 
> I'm tempted to say stronger, smp_mb() only provides order, we want full
> serialization. Everything before stays before and _completes_ before.
> Everything after happens after (if the primitives actually do something
> at all of course, sparc64 for instance has no-op flush_tlb*).
> 
> While such semantics might be slightly too strong for what we currently
> need, it is what powerpc, x86 and arm currently implement and are fairly
> easy to reason about. If we weaken it, stuff gets confusing again.

My problem with this is that we're strengthening the semantics for no actual
use-case, but at the same time this will have a real performance impact.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  9:02                             ` Will Deacon
@ 2017-08-02  9:18                               ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-02  9:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, torvalds, oleg, paulmck, mpe, npiggin,
	linux-kernel, mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 10:02:28AM +0100, Will Deacon wrote:
> On Wed, Aug 02, 2017 at 10:45:51AM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 09:15:23AM +0100, Will Deacon wrote:
> > > On Wed, Aug 02, 2017 at 10:11:06AM +0200, Peter Zijlstra wrote:
> > 
> > > > arm64 looks good too, although it plays silly games with the first
> > > > barrier, but I trust that to be sufficient.
> > > 
> > > The first barrier only orders prior stores for us, because page table
> > > updates are made using stores. A prior load could be reordered past the
> > > invalidation, but can't make it past the second barrier.
> > 
> > So then you rely on the program not having any loads pending to the
> > address you're about to invalidate, right? Otherwise we can do the TLBI
> > and then the load to insta-repopulate the TLB entry you just wanted
> > dead.
> > 
> > That later DSB ISH is too late for that.
> > 
> > Isn't that somewhat fragile?
> 
> We only initiate the TLB invalidation after the page table update is
> observable to the page table walker, so any repopulation will cause a fill
> using the new page table entry.

Ah, indeed. Might be worth a comment tho.
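
Something like this perhaps (strawman, wording entirely up to you):

	/*
	 * The DSB ISHST makes the page-table update visible to the
	 * walker before the TLBI is issued, so a speculative re-fill
	 * after the invalidate can only pick up the new entry.
	 */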

> > > I really think we should avoid defining TLB invalidation in terms of
> > > smp_mb() because it's a lot more subtle than that.
> > 
> > I'm tempted to say stronger, smp_mb() only provides order, we want full
> > serialization. Everything before stays before and _completes_ before.
> > Everything after happens after (if the primitives actually do something
> > at all of course, sparc64 for instance has no-op flush_tlb*).
> > 
> > While such semantics might be slightly too strong for what we currently
> > need, it is what powerpc, x86 and arm currently implement and are fairly
> > easy to reason about. If we weaken it, stuff gets confusing again.
> 
> My problem with this is that we're strengthening the semantics for no actual
> use-case, but at the same time this will have a real performance impact.

Well, you could put in a dmb(ish) in the local case; that's loads
cheaper than the dsb(ish) you need for the !local case. But OK..

Back to staring at dodgy arch code..

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  8:11                       ` Peter Zijlstra
  2017-08-02  8:15                         ` Will Deacon
@ 2017-08-02 13:57                         ` Benjamin Herrenschmidt
  2017-08-02 15:46                           ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-02 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Wed, 2017-08-02 at 10:11 +0200, Peter Zijlstra wrote:

> which should be completely ordered against anything prior and anything
> following, and is I think the behaviour we want from TLB flushes in
> general, but is very much not provided by a number of architectures
> afaict.
> 
> Ah, found the hash-64 code, yes that's good too. The hash32 code lives
> in asm and confuses me, it has a bunch of SYNC, SYNC_601 and isync in.
> The nohash variant seems to do a isync after tlbwe, but again no clue.

Doing some archeology? :-)

In the hash32 days ptesync didn't exist; sync had all the needed
semantics. tlbwe isn't a proper invalidate per se, but isync will flush
the shadow TLBs. I wouldn't bother too much about these, though; if
needed I can go fix them.

> Now, do I go and attempt fixing all that needs fixing?
> 
> 
> x86 is good, our CR3 writes or INVLPG stuff is fully serializing.
> 
> arm is good, it does DSB ISH before and after
> 
> arm64 looks good too, although it plays silly games with the first
> barrier, but I trust that to be sufficient.
> 
> But I'll have to go dig up arch manuals for the rest, if they include
> the relevant information at all of course :/

Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02 13:57                         ` Benjamin Herrenschmidt
@ 2017-08-02 15:46                           ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2017-08-02 15:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, torvalds, oleg, paulmck, mpe, npiggin, linux-kernel,
	mingo, stern, Mel Gorman, Rik van Riel

On Wed, Aug 02, 2017 at 11:57:04PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2017-08-02 at 10:11 +0200, Peter Zijlstra wrote:
> 
> > which should be completely ordered against anything prior and anything
> > following, and is I think the behaviour we want from TLB flushes in
> > general, but is very much not provided by a number of architectures
> > afaict.
> > 
> > Ah, found the hash-64 code, yes that's good too. The hash32 code lives
> > in asm and confuses me, it has a bunch of SYNC, SYNC_601 and isync in.
> > The nohash variant seems to do a isync after tlbwe, but again no clue.
> 
> Doing some archeology ? :-)

I thought ppc32 was still a popular platform for embedded, and since I
don't actually know what kind of MMU those sport (if one at all, of
course), I just looked at all of them.

Also, I'd been looking at all arch tlb invalidate code in any case :-)
(and yes my head hurts because of it)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()
  2017-08-02  9:02                               ` Will Deacon
@ 2017-08-02 22:54                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2017-08-02 22:54 UTC (permalink / raw)
  To: Will Deacon, Peter Zijlstra
  Cc: torvalds, oleg, paulmck, mpe, npiggin, linux-kernel, mingo,
	stern, Mel Gorman, Rik van Riel

On Wed, 2017-08-02 at 10:02 +0100, Will Deacon wrote:
> > > So flush_tlb_range is actually weaker than smp_mb in some respects, yet the
> > > flush_tlb_pending stuff will still work correctly.
> > 
> > So while I think you're right, and we could live with this, after all,
> > if we know the mm is CPU local, there shouldn't be any SMP concerns wrt
> > its page tables. Do you really want to make this more complicated?
> 
> It gives us a nice performance lift on arm64 and I have a patch...[1]

We do that on powerpc too, though there are ongoing questions as to
whether an smp_mb() after setting the mask bit in switch_mm is
sufficient vs. prefetch bringing entries into the TLB after the context
is switched. But that's a powerpc-specific issue. Nick Piggin is working
on sorting that out.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2017-08-03  2:34 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-07 16:15 [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Peter Zijlstra
2017-06-07 16:15 ` [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending() Peter Zijlstra
2017-06-09 14:45   ` Will Deacon
2017-06-09 18:42     ` Peter Zijlstra
2017-07-28 17:45     ` Peter Zijlstra
2017-08-01 10:31       ` Will Deacon
2017-08-01 12:02         ` Benjamin Herrenschmidt
2017-08-01 12:14           ` Peter Zijlstra
2017-08-01 16:39             ` Peter Zijlstra
2017-08-01 16:44               ` Will Deacon
2017-08-01 16:48                 ` Peter Zijlstra
2017-08-01 22:59                   ` Peter Zijlstra
2017-08-02  1:23                     ` Benjamin Herrenschmidt
2017-08-02  8:11                       ` Peter Zijlstra
2017-08-02  8:15                         ` Will Deacon
2017-08-02  8:43                           ` Will Deacon
2017-08-02  8:51                             ` Peter Zijlstra
2017-08-02  9:02                               ` Will Deacon
2017-08-02 22:54                                 ` Benjamin Herrenschmidt
2017-08-02  8:45                           ` Peter Zijlstra
2017-08-02  9:02                             ` Will Deacon
2017-08-02  9:18                               ` Peter Zijlstra
2017-08-02 13:57                         ` Benjamin Herrenschmidt
2017-08-02 15:46                           ` Peter Zijlstra
2017-08-02  0:17                   ` Benjamin Herrenschmidt
2017-08-01 22:42             ` Benjamin Herrenschmidt
2017-06-07 16:15 ` [RFC][PATCH 2/5] locking: Introduce smp_mb__after_spinlock() Peter Zijlstra
2017-06-07 16:15 ` [RFC][PATCH 3/5] overlayfs: Remove smp_mb__before_spinlock() usage Peter Zijlstra
2017-06-07 16:15 ` [RFC][PATCH 4/5] locking: Remove smp_mb__before_spinlock() Peter Zijlstra
2017-06-07 16:15 ` [RFC][PATCH 5/5] powerpc: Remove SYNC from _switch Peter Zijlstra
2017-06-08  0:32   ` Nicholas Piggin
2017-06-08  6:54     ` Peter Zijlstra
2017-06-08  7:29       ` Nicholas Piggin
2017-06-08  7:57         ` Peter Zijlstra
2017-06-08  8:21           ` Nicholas Piggin
2017-06-08  9:54           ` Michael Ellerman
2017-06-08 10:00             ` Nicholas Piggin
2017-06-08 12:45               ` Peter Zijlstra
2017-06-08 13:18                 ` Nicholas Piggin
2017-06-08 13:47                   ` Peter Zijlstra
2017-06-09 14:49 ` [RFC][PATCH 0/5] Getting rid of smp_mb__before_spinlock Will Deacon
