* Question on smp_mb__before_spinlock
@ 2016-09-05  9:37 Peter Zijlstra
  2016-09-05  9:56 ` kbuild test robot
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-05  9:37 UTC (permalink / raw)
  To: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman
  Cc: linux-kernel, Nicholas Piggin, Ingo Molnar, Alan Stern

Hi all,

So recently I've had two separate issues that touched upon
smp_mb__before_spinlock().


Since its inception, our understanding of ACQUIRE, esp. as applied to
spinlocks, has changed somewhat. Also, I wonder if, with a simple
change, we cannot make it provide more.

The problem with the comment is that the STORE done by spin_lock isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default WMB
insufficient (pointed out by Alan).
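
To make that concrete, the pairing in question looks roughly like this
(a simplified sketch of the try_to_wake_up() / set_current_state()
usage; CONDITION stands in for whatever the waiter actually tests):

	/* waiter */
	set_current_state(TASK_UNINTERRUPTIBLE); /* STORE p->state + smp_mb() */
	if (!CONDITION)
		schedule();

	/* waker */
	CONDITION = 1;                           /* STORE */
	smp_mb__before_spinlock();               /* smp_wmb() by default */
	raw_spin_lock(&p->pi_lock);              /* ACQUIRE */
	if (!(p->state & state))                 /* LOAD p->state */
		goto out;

The ACQUIRE only keeps the p->state LOAD inside the critical section;
it does not order that LOAD against the STORE which takes the lock, so
the LOAD can still cross the earlier CONDITION STORE, and an smp_wmb(),
which only orders STOREs against STOREs, cannot prevent that.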

Now, this is only really a problem on PowerPC and ARM64, the former of
which already defined smp_mb__before_spinlock() as a smp_mb(), the
latter does not, Will?

The second issue I wondered about is spinlock transitivity. All except
powerpc have RCsc locks, and since Power already does a full mb, would
it not make sense to put it _after_ the spin_lock(), which would provide
the same guarantee, but also upgrades the section to RCsc.

That would make all schedule() calls fully transitive against one
another.


That is, would something like the below make sense?

(does not deal with mm_types.h and overlayfs using
smp_mb__before_spinlock).

---
 arch/arm64/include/asm/barrier.h   |  2 ++
 arch/powerpc/include/asm/barrier.h |  2 +-
 include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
 kernel/sched/core.c                |  5 +++--
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 4eea7f618dce..d5cc8b58f942 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -104,6 +104,8 @@ do {									\
 	VAL;								\
 })
 
+#define smp_mb__after_spinlock()	smp_mb()
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index c0deafc212b8..23d64d7196b7 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -74,7 +74,7 @@ do {									\
 	___p1;								\
 })
 
-#define smp_mb__before_spinlock()   smp_mb()
+#define smp_mb__after_spinlock()   smp_mb()
 
 #include <asm-generic/barrier.h>
 
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0cebd204..284616dad607 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -118,16 +118,39 @@ do {								\
 #endif
 
 /*
- * Despite its name it doesn't necessarily has to be a full barrier.
- * It should only guarantee that a STORE before the critical section
- * can not be reordered with LOADs and STOREs inside this section.
- * spin_lock() is the one-way barrier, this LOAD can not escape out
- * of the region. So the default implementation simply ensures that
- * a STORE can not move into the critical section, smp_wmb() should
- * serialize it with another STORE done by spin_lock().
+ * This barrier must provide two things:
+ *
+ *   - it must guarantee a STORE before the spin_lock() is ordered against a
+ *     LOAD after it, see the comments at its two usage sites.
+ *
+ *   - it must ensure the critical section is RCsc.
+ *
+ * The latter is important for cases where we observe values written by other
+ * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ *
+ * CPU0			CPU1			CPU2
+ * 
+ * 			for (;;) {
+ * 			  if (READ_ONCE(X))
+ * 			  	break;
+ * 			}
+ * X=1
+ * 			<sched-out>
+ * 						<sched-in>
+ * 						r = X;
+ *
+ * without transitivity it could be that CPU1 observes X!=0 and breaks the loop,
+ * we get migrated and CPU2 sees X==0.
+ *
+ * Since most load-store architectures implement ACQUIRE with an smp_mb() after
+ * the LL/SC loop, they need no further barriers. Similarly all our TSO
+ * architectures imply an smp_mb() for each atomic instruction and equally don't
+ * need more.
+ *
+ * Architectures that can implement ACQUIRE better need to take care.
  */
-#ifndef smp_mb__before_spinlock
-#define smp_mb__before_spinlock()	smp_wmb()
+#ifndef smp_mb__after_spinlock
+#define smp_mb__after_spinlock()	do { } while (0)
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 556cb07ab1cf..b151a33d393b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * reordered with p->state check below. This pairs with mb() in
 	 * set_current_state() the waiting thread does.
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
 	if (!(p->state & state))
 		goto out;
 
@@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock(&rq->lock);
+	smp_mb__after_spinlock();
+
 	cookie = lockdep_pin_lock(&rq->lock);
 
 	rq->clock_skip_update <<= 1; /* promote REQ to ACT */


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:37 Question on smp_mb__before_spinlock Peter Zijlstra
@ 2016-09-05  9:56 ` kbuild test robot
  2016-09-05 10:19   ` Peter Zijlstra
  2016-09-05 10:10 ` Will Deacon
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: kbuild test robot @ 2016-09-05  9:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kbuild-all, Linus Torvalds, Will Deacon, Oleg Nesterov,
	Paul McKenney, Benjamin Herrenschmidt, Michael Ellerman,
	linux-kernel, Nicholas Piggin, Ingo Molnar, Alan Stern

[-- Attachment #1: Type: text/plain, Size: 2848 bytes --]

Hi Peter,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160825]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Peter-Zijlstra/Question-on-smp_mb__before_spinlock/20160905-174026
config: x86_64-randconfig-x013-201636 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/linux/sched.h:27:0,
                    from include/linux/kasan.h:4,
                    from include/linux/slab.h:118,
                    from include/linux/crypto.h:24,
                    from arch/x86/kernel/asm-offsets.c:8:
   include/linux/mm_types.h: In function 'set_tlb_flush_pending':
>> include/linux/mm_types.h:557:2: error: implicit declaration of function 'smp_mb__before_spinlock' [-Werror=implicit-function-declaration]
     smp_mb__before_spinlock();
     ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/smp_mb__before_spinlock +557 include/linux/mm_types.h

20841405 Rik van Riel 2013-12-18  551  	mm->tlb_flush_pending = true;
af2c1401 Mel Gorman   2013-12-18  552  
af2c1401 Mel Gorman   2013-12-18  553  	/*
af2c1401 Mel Gorman   2013-12-18  554  	 * Guarantee that the tlb_flush_pending store does not leak into the
af2c1401 Mel Gorman   2013-12-18  555  	 * critical section updating the page tables
af2c1401 Mel Gorman   2013-12-18  556  	 */
af2c1401 Mel Gorman   2013-12-18 @557  	smp_mb__before_spinlock();
20841405 Rik van Riel 2013-12-18  558  }
20841405 Rik van Riel 2013-12-18  559  /* Clearing is done after a TLB flush, which also provides a barrier. */
20841405 Rik van Riel 2013-12-18  560  static inline void clear_tlb_flush_pending(struct mm_struct *mm)

:::::: The code at line 557 was first introduced by commit
:::::: af2c1401e6f9177483be4fad876d0073669df9df mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates

:::::: TO: Mel Gorman <mgorman@suse.de>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 25403 bytes --]


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:37 Question on smp_mb__before_spinlock Peter Zijlstra
  2016-09-05  9:56 ` kbuild test robot
@ 2016-09-05 10:10 ` Will Deacon
  2016-09-06 11:17   ` Peter Zijlstra
  2016-09-05 10:37 ` Paul E. McKenney
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2016-09-05 10:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
> 
> 
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?

I just replied to that thread and, assuming I've grokked the sched/core.c
usage correctly, then it does look like we need to make that an smp_mb()
with the current code.

> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
> 
> That would make all schedule() calls fully transitive against one
> another.

It would also match the way in which the arm64 atomic_*_return ops
are implemented, since full barrier semantics are required there.

> That is, would something like the below make sense?

Works for me, but I'll do a fix to smp_mb__before_spinlock anyway for
the stable tree.

The only slight annoyance is that, on arm64 anyway, a store-release
appearing in program order before the LOCK operation will be observed
in order, so if the write of CONDITION=1 in the try_to_wake_up case
used smp_store_release, we wouldn't need this barrier at all.
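
That is, something like this (sketch only, with CONDITION again
standing in for the actual flag the waiter tests):

	smp_store_release(&CONDITION, 1);          /* instead of a plain STORE */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (!(p->state & state))
		goto out;

The store-release would then be observed in order with the subsequent
lock acquisition, so the p->state load could no longer cross the
CONDITION store.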

Will


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:56 ` kbuild test robot
@ 2016-09-05 10:19   ` Peter Zijlstra
  2016-09-05 11:26     ` Fengguang Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-05 10:19 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Linus Torvalds, Will Deacon, Oleg Nesterov,
	Paul McKenney, Benjamin Herrenschmidt, Michael Ellerman,
	linux-kernel, Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 05:56:46PM +0800, kbuild test robot wrote:
> Hi Peter,
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.8-rc5 next-20160825]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> [Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
> [Check https://git-scm.com/docs/git-format-patch for more information]

What happened to not applying patches that lack a SoB and whose subject
doesn't even include [PATCH] either.


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:37 Question on smp_mb__before_spinlock Peter Zijlstra
  2016-09-05  9:56 ` kbuild test robot
  2016-09-05 10:10 ` Will Deacon
@ 2016-09-05 10:37 ` Paul E. McKenney
  2016-09-05 11:34   ` Peter Zijlstra
  2016-09-05 10:51 ` kbuild test robot
  2016-09-07 12:17 ` Nicholas Piggin
  4 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2016-09-05 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> Hi all,
> 
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
> 
> 
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
> 
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
> 
> That would make all schedule() calls fully transitive against one
> another.
> 
> 
> That is, would something like the below make sense?

Looks to me like you have reinvented smp_mb__after_unlock_lock()...

							Thanx, Paul

> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spinlock).
> 
> ---
>  arch/arm64/include/asm/barrier.h   |  2 ++
>  arch/powerpc/include/asm/barrier.h |  2 +-
>  include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
>  kernel/sched/core.c                |  5 +++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do {									\
>  	VAL;								\
>  })
> 
> +#define smp_mb__after_spinlock()	smp_mb()
> +
>  #include <asm-generic/barrier.h>
> 
>  #endif	/* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do {									\
>  	___p1;								\
>  })
> 
> -#define smp_mb__before_spinlock()   smp_mb()
> +#define smp_mb__after_spinlock()   smp_mb()
> 
>  #include <asm-generic/barrier.h>
> 
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do {								\
>  #endif
> 
>  /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> + *     LOAD after it, see the comments at its two usage sites.
> + *
> + *   - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0			CPU1			CPU2
> + * 
> + * 			for (;;) {
> + * 			  if (READ_ONCE(X))
> + * 			  	break;
> + * 			}
> + * X=1
> + * 			<sched-out>
> + * 						<sched-in>
> + * 						r = X;
> + *
> + * without transitivity it could be that CPU1 observes X!=0 and breaks the loop,
> + * we get migrated and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> + * architectures imply an smp_mb() for each atomic instruction and equally don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
>   */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock()	smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock()	do { } while (0)
>  #endif
> 
>  /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 556cb07ab1cf..b151a33d393b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 * reordered with p->state check below. This pairs with mb() in
>  	 * set_current_state() the waiting thread does.
>  	 */
> -	smp_mb__before_spinlock();
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	smp_mb__after_spinlock();
>  	if (!(p->state & state))
>  		goto out;
> 
> @@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
>  	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
>  	 * done by the caller to avoid the race with signal_wake_up().
>  	 */
> -	smp_mb__before_spinlock();
>  	raw_spin_lock(&rq->lock);
> +	smp_mb__after_spinlock();
> +
>  	cookie = lockdep_pin_lock(&rq->lock);
> 
>  	rq->clock_skip_update <<= 1; /* promote REQ to ACT */
> 


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:37 Question on smp_mb__before_spinlock Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-09-05 10:37 ` Paul E. McKenney
@ 2016-09-05 10:51 ` kbuild test robot
  2016-09-07 12:17 ` Nicholas Piggin
  4 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2016-09-05 10:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kbuild-all, Linus Torvalds, Will Deacon, Oleg Nesterov,
	Paul McKenney, Benjamin Herrenschmidt, Michael Ellerman,
	linux-kernel, Nicholas Piggin, Ingo Molnar, Alan Stern

[-- Attachment #1: Type: text/plain, Size: 2270 bytes --]

Hi Peter,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc5 next-20160825]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Peter-Zijlstra/Question-on-smp_mb__before_spinlock/20160905-174026
config: i386-randconfig-s1-201636 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/overlayfs/readdir.c: In function 'ovl_dir_fsync':
>> fs/overlayfs/readdir.c:449:4: error: implicit declaration of function 'smp_mb__before_spinlock' [-Werror=implicit-function-declaration]
       smp_mb__before_spinlock();
       ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/smp_mb__before_spinlock +449 fs/overlayfs/readdir.c

d45f00ae Al Viro        2014-10-28  443  		realfile = lockless_dereference(od->upperfile);
e9be9d5e Miklos Szeredi 2014-10-24  444  		if (!realfile) {
e9be9d5e Miklos Szeredi 2014-10-24  445  			struct path upperpath;
e9be9d5e Miklos Szeredi 2014-10-24  446  
e9be9d5e Miklos Szeredi 2014-10-24  447  			ovl_path_upper(dentry, &upperpath);
e9be9d5e Miklos Szeredi 2014-10-24  448  			realfile = ovl_path_open(&upperpath, O_RDONLY);
d45f00ae Al Viro        2014-10-28 @449  			smp_mb__before_spinlock();
5955102c Al Viro        2016-01-22  450  			inode_lock(inode);
3d268c9b Al Viro        2014-10-23  451  			if (!od->upperfile) {
e9be9d5e Miklos Szeredi 2014-10-24  452  				if (IS_ERR(realfile)) {

:::::: The code at line 449 was first introduced by commit
:::::: d45f00ae43e63eff1b3d79df20610ae1ef645ebd overlayfs: barriers for opening upper-layer directory

:::::: TO: Al Viro <viro@zeniv.linux.org.uk>
:::::: CC: Al Viro <viro@zeniv.linux.org.uk>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 27871 bytes --]


* Re: Question on smp_mb__before_spinlock
  2016-09-05 10:19   ` Peter Zijlstra
@ 2016-09-05 11:26     ` Fengguang Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Fengguang Wu @ 2016-09-05 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kbuild-all, Linus Torvalds, Will Deacon, Oleg Nesterov,
	Paul McKenney, Benjamin Herrenschmidt, Michael Ellerman,
	linux-kernel, Nicholas Piggin, Ingo Molnar, Alan Stern

Hi Peter,

On Mon, Sep 05, 2016 at 12:19:37PM +0200, Peter Zijlstra wrote:
>On Mon, Sep 05, 2016 at 05:56:46PM +0800, kbuild test robot wrote:
>> Hi Peter,
>>
>> [auto build test ERROR on linus/master]
>> [also build test ERROR on v4.8-rc5 next-20160825]
>> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>> [Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
>> [Check https://git-scm.com/docs/git-format-patch for more information]
>
>What happened to not applying patches that lack a SoB and whose subject
>doesn't even include [PATCH] either.

Sorry, the current logic assumes "Re: " in the subject. Just fixed it to
ignore all patches w/o SOB.

Regards,
Fengguang


* Re: Question on smp_mb__before_spinlock
  2016-09-05 10:37 ` Paul E. McKenney
@ 2016-09-05 11:34   ` Peter Zijlstra
  2016-09-05 13:57     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-05 11:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 03:37:14AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> > Hi all,
> > 
> > So recently I've had two separate issues that touched upon
> > smp_mb__before_spinlock().
> > 
> > 
> > Since its inception, our understanding of ACQUIRE, esp. as applied to
> > spinlocks, has changed somewhat. Also, I wonder if, with a simple
> > change, we cannot make it provide more.
> > 
> > The problem with the comment is that the STORE done by spin_lock isn't
> > itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> > it and cross with any prior STORE, rendering the default WMB
> > insufficient (pointed out by Alan).
> > 
> > Now, this is only really a problem on PowerPC and ARM64, the former of
> > which already defined smp_mb__before_spinlock() as a smp_mb(), the
> > latter does not, Will?
> > 
> > The second issue I wondered about is spinlock transitivity. All except
> > powerpc have RCsc locks, and since Power already does a full mb, would
> > it not make sense to put it _after_ the spin_lock(), which would provide
> > the same guarantee, but also upgrades the section to RCsc.
> > 
> > That would make all schedule() calls fully transitive against one
> > another.
> > 
> > 
> > That is, would something like the below make sense?
> 
> Looks to me like you have reinvented smp_mb__after_unlock_lock()...

Will said the same, but that one doesn't in fact do the first bit, as
ARM64 also needs a full barrier for that, while it doesn't need that to
upgrade to RCsc.


* Re: Question on smp_mb__before_spinlock
  2016-09-05 11:34   ` Peter Zijlstra
@ 2016-09-05 13:57     ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2016-09-05 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 01:34:35PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 05, 2016 at 03:37:14AM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 05, 2016 at 11:37:53AM +0200, Peter Zijlstra wrote:
> > > Hi all,
> > > 
> > > So recently I've had two separate issues that touched upon
> > > smp_mb__before_spinlock().
> > > 
> > > 
> > > Since its inception, our understanding of ACQUIRE, esp. as applied to
> > > spinlocks, has changed somewhat. Also, I wonder if, with a simple
> > > change, we cannot make it provide more.
> > > 
> > > The problem with the comment is that the STORE done by spin_lock isn't
> > > itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> > > it and cross with any prior STORE, rendering the default WMB
> > > insufficient (pointed out by Alan).
> > > 
> > > Now, this is only really a problem on PowerPC and ARM64, the former of
> > > which already defined smp_mb__before_spinlock() as a smp_mb(), the
> > > latter does not, Will?
> > > 
> > > The second issue I wondered about is spinlock transitivity. All except
> > > powerpc have RCsc locks, and since Power already does a full mb, would
> > > it not make sense to put it _after_ the spin_lock(), which would provide
> > > the same guarantee, but also upgrades the section to RCsc.
> > > 
> > > That would make all schedule() calls fully transitive against one
> > > another.
> > > 
> > > 
> > > That is, would something like the below make sense?
> > 
> > Looks to me like you have reinvented smp_mb__after_unlock_lock()...
> 
> Will said the same, but that one doesn't in fact do the first bit, as
> ARM64 also needs a full barrier for that, while it doesn't need that to
> upgrade to RCsc.

Fair enough!

							Thanx, Paul


* Re: Question on smp_mb__before_spinlock
  2016-09-05 10:10 ` Will Deacon
@ 2016-09-06 11:17   ` Peter Zijlstra
  2016-09-06 17:42     ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-06 11:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Mon, Sep 05, 2016 at 11:10:22AM +0100, Will Deacon wrote:

> > The second issue I wondered about is spinlock transitivity. All except
> > powerpc have RCsc locks, and since Power already does a full mb, would
> > it not make sense to put it _after_ the spin_lock(), which would provide
> > the same guarantee, but also upgrades the section to RCsc.
> > 
> > That would make all schedule() calls fully transitive against one
> > another.
> 
> It would also match the way in which the arm64 atomic_*_return ops
> are implemented, since full barrier semantics are required there.

Hmm, are you sure; the way I read arch/arm64/include/asm/atomic_ll_sc.h
is that you do ll/sc-rel + mb.

> > That is, would something like the below make sense?
> 
> Works for me, but I'll do a fix to smp_mb__before_spinlock anyway for
> the stable tree.

Indeed, thanks!

> 
> The only slight annoyance is that, on arm64 anyway, a store-release
> appearing in program order before the LOCK operation will be observed
> in order, so if the write of CONDITION=1 in the try_to_wake_up case
> used smp_store_release, we wouldn't need this barrier at all.

Right, but this is because your load-acquire and store-release are much
stronger than Linux's. Not only are they RCsc, they are also globally
ordered irrespective of the variable (iirc).

This wouldn't work for PPC (even if we could find all such prior
stores).

OK, I suppose I'll go stare at what we can do about the mm_types.h use and
spin a patch with a Changelog.


* Re: Question on smp_mb__before_spinlock
  2016-09-06 11:17   ` Peter Zijlstra
@ 2016-09-06 17:42     ` Will Deacon
  0 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2016-09-06 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Alan Stern

On Tue, Sep 06, 2016 at 01:17:53PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 05, 2016 at 11:10:22AM +0100, Will Deacon wrote:
> 
> > > The second issue I wondered about is spinlock transitivity. All except
> > > powerpc have RCsc locks, and since Power already does a full mb, would
> > > it not make sense to put it _after_ the spin_lock(), which would provide
> > > the same guarantee, but also upgrades the section to RCsc.
> > > 
> > > That would make all schedule() calls fully transitive against one
> > > another.
> > 
> > It would also match the way in which the arm64 atomic_*_return ops
> > are implemented, since full barrier semantics are required there.
> 
> Hmm, are you sure; the way I read arch/arm64/include/asm/atomic_ll_sc.h
> is that you do ll/sc-rel + mb.

Yes, all I meant was that we put the full barrier at the end, but the
two things are indeed different sequences.

Will


* Re: Question on smp_mb__before_spinlock
  2016-09-05  9:37 Question on smp_mb__before_spinlock Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-09-05 10:51 ` kbuild test robot
@ 2016-09-07 12:17 ` Nicholas Piggin
  2016-09-07 13:23   ` Peter Zijlstra
  4 siblings, 1 reply; 18+ messages in thread
From: Nicholas Piggin @ 2016-09-07 12:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Mon, 5 Sep 2016 11:37:53 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Hi all,
> 
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
> 
> 
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
> 
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
> 
> That would make all schedule() calls fully transitive against one
> another.
> 
> 
> That is, would something like the below make sense?
> 
> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spinlock).
> 
> ---
>  arch/arm64/include/asm/barrier.h   |  2 ++
>  arch/powerpc/include/asm/barrier.h |  2 +-
>  include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
>  kernel/sched/core.c                |  5 +++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do {									\
>  	VAL;								\
>  })
>  
> +#define smp_mb__after_spinlock()	smp_mb()
> +
>  #include <asm-generic/barrier.h>
>  
>  #endif	/* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do {									\
>  	___p1;								\
>  })
>  
> -#define smp_mb__before_spinlock()   smp_mb()
> +#define smp_mb__after_spinlock()   smp_mb()
>  
>  #include <asm-generic/barrier.h>
>  
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do {								\
>  #endif
>  
>  /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> + *     LOAD after it, see the comments at its two usage sites.
> + *
> + *   - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0			CPU1			CPU2
> + * 
> + * 			for (;;) {
> + * 			  if (READ_ONCE(X))
> + * 			  	break;
> + * 			}
> + * X=1
> + * 			<sched-out>
> + * 						<sched-in>
> + * 						r = X;
> + *
> > + * without transitivity it could be that CPU1 observes X!=0 and breaks the loop,
> + * we get migrated and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> > + * architectures imply an smp_mb() for each atomic instruction and equally don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
>   */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock()	smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock()	do { } while (0)
>  #endif

It seems okay, but why not make it a special sched-only function name
to prevent it being used in generic code?

I would not mind seeing responsibility for the switch barrier moved to
generic context switch code like this (alternative for powerpc reducing
number of hwsync instructions was to add documentation and warnings about
the barriers in arch dependent and independent code). And pairing it with
a spinlock is reasonable.

It may not strictly be an "smp_" style of barrier if MMIO accesses are to
be ordered here too, even though the critical section may only be providing
acquire/release for cacheable memory, so maybe it's slightly more
complicated than just cacheable RCsc?

This would end up flushing the store queue while holding the spinlock on
POWER, as opposed to before acquiring the lock. I doubt that's ever going
to be noticeable, but if you already add a special new primitive here, then
an arch wrapper for raw_spin_lock() can let archs do their own thing here.

Thanks,
Nick


* Re: Question on smp_mb__before_spinlock
  2016-09-07 12:17 ` Nicholas Piggin
@ 2016-09-07 13:23   ` Peter Zijlstra
  2016-09-07 13:51     ` Will Deacon
  2016-09-12  2:27     ` Nicholas Piggin
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-07 13:23 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Wed, Sep 07, 2016 at 10:17:26PM +1000, Nicholas Piggin wrote:
> >  /*
> > + * This barrier must provide two things:
> > + *
> > + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> > + *     LOAD after it, see the comments at its two usage sites.
> > + *
> > + *   - it must ensure the critical section is RCsc.
> > + *
> > + * The latter is important for cases where we observe values written by other
> > + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> > + *
> > + * CPU0			CPU1			CPU2
> > + * 
> > + * 			for (;;) {
> > + * 			  if (READ_ONCE(X))
> > + * 			  	break;
> > + * 			}
> > + * X=1
> > + * 			<sched-out>
> > + * 						<sched-in>
> > + * 						r = X;
> > + *
> > + * without transitivity it could be that CPU1 observes X!=0 and breaks the loop,
> > + * we get migrated and CPU2 sees X==0.
> > + *
> > + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> > + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> > + * architectures imply an smp_mb() for each atomic instruction and equally don't
> > + * need more.
> > + *
> > + * Architectures that can implement ACQUIRE better need to take care.
> >   */
> > +#ifndef smp_mb__after_spinlock
> > +#define smp_mb__after_spinlock()	do { } while (0)
> >  #endif
> 
> It seems okay, but why not make it a special sched-only function name
> to prevent it being used in generic code?
> 
> I would not mind seeing responsibility for the switch barrier moved to
> generic context switch code like this (alternative for powerpc reducing
> number of hwsync instructions was to add documentation and warnings about
> the barriers in arch dependent and independent code). And pairing it with
> a spinlock is reasonable.
> 
> It may not strictly be an "smp_" style of barrier if MMIO accesses are to
> be ordered here too, even though the critical section may only be providing
> acquire/release for cacheable memory, so maybe it's slightly more
> complicated than just cacheable RCsc?

Interesting idea..

So I'm not a fan of that raw_spin_lock wrapper, since that would end up
with a lot more boiler-plate code than just the one extra barrier.

But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
good idea, since those are typically fairly heavy barriers, and its
quite common to call schedule() without ending up in switch_to().

For PowerPC it works out, since there's only SYNC, no other option
afaik.

But ARM/ARM64 will have to do DSB(ISH) instead of DMB(ISH). IA64 would
need to issue "sync.i" and mips-octeon "synciobdma".

Will, any idea of the extra cost involved in DSB vs DMB?


* Re: Question on smp_mb__before_spinlock
  2016-09-07 13:23   ` Peter Zijlstra
@ 2016-09-07 13:51     ` Will Deacon
  2016-09-12  2:35       ` Nicholas Piggin
  2016-09-12  2:27     ` Nicholas Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: Will Deacon @ 2016-09-07 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Wed, Sep 07, 2016 at 03:23:54PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 07, 2016 at 10:17:26PM +1000, Nicholas Piggin wrote:
> > It seems okay, but why not make it a special sched-only function name
> > to prevent it being used in generic code?
> > 
> > I would not mind seeing responsibility for the switch barrier moved to
> > generic context switch code like this (alternative for powerpc reducing
> > number of hwsync instructions was to add documentation and warnings about
> > the barriers in arch dependent and independent code). And pairing it with
> > a spinlock is reasonable.
> > 
> > It may not strictly be an "smp_" style of barrier if MMIO accesses are to
> > be ordered here too, even though the critical section may only be providing
> > acquire/release for cacheable memory, so maybe it's slightly more
> > complicated than just cacheable RCsc?
> 
> Interesting idea..
> 
> So I'm not a fan of that raw_spin_lock wrapper, since that would end up
> with a lot more boiler-plate code than just the one extra barrier.
> 
> But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
> good idea, since those are typically fairly heavy barriers, and its
> quite common to call schedule() without ending up in switch_to().
> 
> For PowerPC it works out, since there's only SYNC, no other option
> afaik.
> 
> But ARM/ARM64 will have to do DSB(ISH) instead of DMB(ISH). IA64 would
> need to issue "sync.i" and mips-octeon "synciobdma".
> 
> Will, any idea of the extra cost involved in DSB vs DMB?

DSB is *much* more expensive, since it completes out-of-band communication
such as MMIO accesses and TLB invalidation, as well as plain old memory
accesses.

The only reason we have DSB in our __switch_to code is to complete cache
maintenance in case the task is going to migrate to another CPU; there's
just no way to know that at the point we need to do the barrier :(

Will


* Re: Question on smp_mb__before_spinlock
  2016-09-07 13:23   ` Peter Zijlstra
  2016-09-07 13:51     ` Will Deacon
@ 2016-09-12  2:27     ` Nicholas Piggin
  2016-09-12 12:54       ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Nicholas Piggin @ 2016-09-12  2:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Wed, 7 Sep 2016 15:23:54 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Sep 07, 2016 at 10:17:26PM +1000, Nicholas Piggin wrote:
> > >  /*
> > > + * This barrier must provide two things:
> > > + *
> > > + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> > > + *     LOAD after it, see the comments at its two usage sites.
> > > + *
> > > + *   - it must ensure the critical section is RCsc.
> > > + *
> > > + * The latter is important for cases where we observe values written by other
> > > + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> > > + *
> > > + * CPU0			CPU1			CPU2
> > > + * 
> > > + * 			for (;;) {
> > > + * 			  if (READ_ONCE(X))
> > > + * 			  	break;
> > > + * 			}
> > > + * X=1
> > > + * 			<sched-out>
> > > + * 						<sched-in>
> > > + * 						r = X;
> > > + *
> > > + * without transitivity it could be that CPU1 observes X!=0 and breaks the loop,
> > > + * we get migrated and CPU2 sees X==0.
> > > + *
> > > + * Since most load-store architectures implement ACQUIRE with an smp_mb() after
> > > + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> > > + * architectures imply an smp_mb() for each atomic instruction and equally don't
> > > + * need more.
> > > + *
> > > + * Architectures that can implement ACQUIRE better need to take care.
> > >   */
> > > +#ifndef smp_mb__after_spinlock
> > > +#define smp_mb__after_spinlock()	do { } while (0)
> > >  #endif  
> > 
> > It seems okay, but why not make it a special sched-only function name
> > to prevent it being used in generic code?
> > 
> > I would not mind seeing responsibility for the switch barrier moved to
> > generic context switch code like this (alternative for powerpc reducing
> > number of hwsync instructions was to add documentation and warnings about
> > the barriers in arch dependent and independent code). And pairing it with
> > a spinlock is reasonable.
> > 
> > It may not strictly be an "smp_" style of barrier if MMIO accesses are to
> > be ordered here too, even though the critical section may only be providing
> > acquire/release for cacheable memory, so maybe it's slightly more
> > complicated than just cacheable RCsc?  
> 
> Interesting idea..
> 
> So I'm not a fan of that raw_spin_lock wrapper, since that would end up
> with a lot more boiler-plate code than just the one extra barrier.

#ifndef sched_ctxsw_raw_spin_lock
#define sched_ctxsw_raw_spin_lock(lock) raw_spin_lock(lock)
#endif

#define sched_ctxsw_raw_spin_lock(lock) do { smp_mb() ; raw_spin_lock(lock); } while (0)

?


> But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
> good idea, since those are typically fairly heavy barriers, and its
> quite common to call schedule() without ending up in switch_to().

That's true I guess, but if we already have the arch specific smp_mb__
specifically for this context switch code, and you are asking for them to
implement *cacheable* memory barrier vs migration, then I see no reason
not to allow them to implement uncacheable as well.

You make a good point about schedule() without switch_to(), but
architectures will still have no less flexibility than they do now.

Thanks,
Nick


* Re: Question on smp_mb__before_spinlock
  2016-09-07 13:51     ` Will Deacon
@ 2016-09-12  2:35       ` Nicholas Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nicholas Piggin @ 2016-09-12  2:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Wed, 7 Sep 2016 14:51:47 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Wed, Sep 07, 2016 at 03:23:54PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 07, 2016 at 10:17:26PM +1000, Nicholas Piggin wrote:  
> > > It seems okay, but why not make it a special sched-only function name
> > > to prevent it being used in generic code?
> > > 
> > > I would not mind seeing responsibility for the switch barrier moved to
> > > generic context switch code like this (alternative for powerpc reducing
> > > number of hwsync instructions was to add documentation and warnings about
> > > the barriers in arch dependent and independent code). And pairing it with
> > > a spinlock is reasonable.
> > > 
> > > It may not strictly be an "smp_" style of barrier if MMIO accesses are to
> > > be ordered here too, even though the critical section may only be providing
> > > acquire/release for cacheable memory, so maybe it's slightly more
> > > complicated than just cacheable RCsc?  
> > 
> > Interesting idea..
> > 
> > So I'm not a fan of that raw_spin_lock wrapper, since that would end up
> > with a lot more boiler-plate code than just the one extra barrier.
> > 
> > But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
> > good idea, since those are typically fairly heavy barriers, and its
> > quite common to call schedule() without ending up in switch_to().
> > 
> > For PowerPC it works out, since there's only SYNC, no other option
> > afaik.
> > 
> > But ARM/ARM64 will have to do DSB(ISH) instead of DMB(ISH). IA64 would
> > need to issue "sync.i" and mips-octeon "synciobdma".
> > 
> > Will, any idea of the extra cost involved in DSB vs DMB?  
> 
> DSB is *much* more expensive, since it completes out-of-band communication
> such as MMIO accesses and TLB invalidation, as well as plain old memory
> accesses.
> 
> The only reason we have DSB in our __switch_to code is to complete cache
> maintenance in case the task is going to migrate to another CPU; there's
> just no way to know that at the point we need to do the barrier :(

Unfortunately it's not trivial to move such barriers to migrate-time,
because the source CPU may not be involved after the task is switched
out.

This won't prevent ARM32/64 from continuing to do what it does today,
if we note that the arch must provide such barriers *either* in the
context switch lock / barrier, or in its own switch code.

Thanks,
Nick


* Re: Question on smp_mb__before_spinlock
  2016-09-12  2:27     ` Nicholas Piggin
@ 2016-09-12 12:54       ` Peter Zijlstra
  2016-09-13  2:05         ` Nicholas Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-09-12 12:54 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Mon, Sep 12, 2016 at 12:27:08PM +1000, Nicholas Piggin wrote:
> On Wed, 7 Sep 2016 15:23:54 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:

> > Interesting idea..
> > 
> > So I'm not a fan of that raw_spin_lock wrapper, since that would end up
> > with a lot more boiler-plate code than just the one extra barrier.
> 
> #ifndef sched_ctxsw_raw_spin_lock
> #define sched_ctxsw_raw_spin_lock(lock) raw_spin_lock(lock)
> #endif
> 
> #define sched_ctxsw_raw_spin_lock(lock) do { smp_mb() ; raw_spin_lock(lock); } while (0)

I was thinking you wanted to avoid the lwsync in arch_spin_lock()
entirely, at which point you'll grow more layers. Because then you get
an arch_spin_lock_mb() or something and then you'll have to do the
raw_spin_lock wrappery for that.

Or am I missing the point of having the raw_spin_lock wrapper, as
opposed to the extra barrier after it?

Afaict the benefit of having that wrapper is so you can avoid issuing
multiple barriers.

> > But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
> > good idea, since those are typically fairly heavy barriers, and its
> > quite common to call schedule() without ending up in switch_to().
> 
> That's true I guess, but if we already have the arch specific smp_mb__
> specifically for this context switch code, and you are asking for them to
> implement *cacheable* memory barrier vs migration, then I see no reason
> not to allow them to implement uncacheable as well.
> 
> You make a good point about schedule() without switch_to(), but
> architectures will still have no less flexibility than they do now.

Ah, so you're saying make it optional where they put it? I was initially
thinking you wanted to add it to the list of requirements. Sure,
optional works.


* Re: Question on smp_mb__before_spinlock
  2016-09-12 12:54       ` Peter Zijlstra
@ 2016-09-13  2:05         ` Nicholas Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nicholas Piggin @ 2016-09-13  2:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, Oleg Nesterov, Paul McKenney,
	Benjamin Herrenschmidt, Michael Ellerman, linux-kernel,
	Ingo Molnar, Alan Stern

On Mon, 12 Sep 2016 14:54:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 12, 2016 at 12:27:08PM +1000, Nicholas Piggin wrote:
> > On Wed, 7 Sep 2016 15:23:54 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:  
> 
> > > Interesting idea..
> > > 
> > > So I'm not a fan of that raw_spin_lock wrapper, since that would end up
> > > with a lot more boiler-plate code than just the one extra barrier.  
> > 
> > #ifndef sched_ctxsw_raw_spin_lock
> > #define sched_ctxsw_raw_spin_lock(lock) raw_spin_lock(lock)
> > #endif
> > 
> > #define sched_ctxsw_raw_spin_lock(lock) do { smp_mb() ; raw_spin_lock(lock); } while (0)  
> 
> I was thinking you wanted to avoid the lwsync in arch_spin_lock()
> entirely, at which point you'll grow more layers. Because then you get
> an arch_spin_lock_mb() or something and then you'll have to do the
> raw_spin_lock wrappery for that.
> 
> Or am I missing the point of having the raw_spin_lock wrapper, as
> opposed to the extra barrier after it?
> 
> Afaict the benefit of having that wrapper is so you can avoid issuing
> multiple barriers.

Oh, you could do that too, yes. But it's all going to be in
arch/powerpc, so I don't know if the layers would be much of a problem.

I was thinking to avoid the hwsync inside the critical section.


> > > But moving MMIO/DMA/TLB etc.. barriers into this spinlock might not be a
> > > good idea, since those are typically fairly heavy barriers, and its
> > > quite common to call schedule() without ending up in switch_to().  
> > 
> > That's true I guess, but if we already have the arch specific smp_mb__
> > specifically for this context switch code, and you are asking for them to
> > implement *cacheable* memory barrier vs migration, then I see no reason
> > not to allow them to implement uncacheable as well.
> > 
> > You make a good point about schedule() without switch_to(), but
> > architectures will still have no less flexibility than they do now.  
> 
> Ah, so you're saying make it optional where they put it? I was initially
> thinking you wanted to add it to the list of requirements. Sure,
> optional works.

Yes, i.e. this primitive must minimally provide X, and optionally Y. If
Y is not provided, then switch_to() or another arch hook must provide it.
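
Something along these lines, perhaps (just a sketch; arch_ctxsw_io_mb()
is a made-up name for the optional Y part):

	/* X: STORE->LOAD + RCsc; no-op only where the lock already implies smp_mb() */
	#ifndef smp_mb__after_spinlock
	#define smp_mb__after_spinlock()	do { } while (0)
	#endif

	/*
	 * Y: optional; an arch that doesn't provide it here must order
	 * MMIO/DMA etc. in switch_to() or another arch hook instead.
	 */
	#ifndef arch_ctxsw_io_mb
	#define arch_ctxsw_io_mb()		do { } while (0)
	#endif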

Thanks,
Nick
