All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH -tip 0/4] A few updates around smp_store_mb()
@ 2015-10-27 19:53 Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
                   ` (4 more replies)
  0 siblings, 5 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel

Hi,

Other than Peter's recent rename of set_mb() to smp_store_mb(),
reflecting the SMP ordering nature of the call, there really
haven't been any updates that follow up on this. Here are some
pretty straightforward ones, but ultimately I wonder whether we
cannot simply make the call the same for all archs; that is where
we seem to be headed _anyway_. More on this in patch 3, which
updates x86.

Patch 1 is just one that snuck in while looking at the arch code.

Only the x86 bits have been tested.

Thanks!

Davidlohr Bueso (4):
  arch,cmpxchg: Remove tas() definitions
  arch,barrier: Use smp barriers in smp_store_release()
  x86,asm: Re-work smp_store_mb()
  doc,smp: Remove ambiguous statement in smp_store_mb()

 Documentation/memory-barriers.txt   | 4 ++--
 arch/blackfin/include/asm/cmpxchg.h | 1 -
 arch/c6x/include/asm/cmpxchg.h      | 2 --
 arch/frv/include/asm/cmpxchg.h      | 2 --
 arch/ia64/include/asm/barrier.h     | 2 +-
 arch/powerpc/include/asm/barrier.h  | 2 +-
 arch/s390/include/asm/barrier.h     | 2 +-
 arch/tile/include/asm/cmpxchg.h     | 2 --
 arch/x86/include/asm/barrier.h      | 8 ++++++--
 include/asm-generic/barrier.h       | 2 +-
 10 files changed, 12 insertions(+), 15 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 1/4] arch,cmpxchg: Remove tas() definitions
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
@ 2015-10-27 19:53 ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave,
	linux-kernel, Steven Miao, Aurelien Jacquiot, David Howells,
	Chris Metcalf, Davidlohr Bueso

It seems that 5dc12ddee93 ("Remove tas()") missed some files. Correct
this and fully drop this macro; callers should be using cmpxchg()-like
calls instead.
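
For reference, anything that still wants tas()-style semantics just uses
the xchg() family directly; a minimal sketch (try_claim() is a made-up
example, not taken from the tree):

	/* tas(ptr) used to expand to xchg((ptr), 1) */
	static inline int try_claim(int *flag)
	{
		/* non-zero if we were the one that set the flag */
		return xchg(flag, 1) == 0;
	}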

Cc: Steven Miao <realmz6@gmail.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 arch/blackfin/include/asm/cmpxchg.h | 1 -
 arch/c6x/include/asm/cmpxchg.h      | 2 --
 arch/frv/include/asm/cmpxchg.h      | 2 --
 arch/tile/include/asm/cmpxchg.h     | 2 --
 4 files changed, 7 deletions(-)

diff --git a/arch/blackfin/include/asm/cmpxchg.h b/arch/blackfin/include/asm/cmpxchg.h
index c05868c..2539288 100644
--- a/arch/blackfin/include/asm/cmpxchg.h
+++ b/arch/blackfin/include/asm/cmpxchg.h
@@ -128,6 +128,5 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
 #endif /* !CONFIG_SMP */
 
 #define xchg(ptr, x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))
-#define tas(ptr) ((void)xchg((ptr), 1))
 
 #endif /* __ARCH_BLACKFIN_CMPXCHG__ */
diff --git a/arch/c6x/include/asm/cmpxchg.h b/arch/c6x/include/asm/cmpxchg.h
index b27c8ce..93d0a5a 100644
--- a/arch/c6x/include/asm/cmpxchg.h
+++ b/arch/c6x/include/asm/cmpxchg.h
@@ -47,8 +47,6 @@ static inline unsigned int __xchg(unsigned int x, volatile void *ptr, int size)
 #define xchg(ptr, x) \
 	((__typeof__(*(ptr)))__xchg((unsigned int)(x), (void *) (ptr), \
 				    sizeof(*(ptr))))
-#define tas(ptr)    xchg((ptr), 1)
-
 
 #include <asm-generic/cmpxchg-local.h>
 
diff --git a/arch/frv/include/asm/cmpxchg.h b/arch/frv/include/asm/cmpxchg.h
index 5b04dd0..a899765 100644
--- a/arch/frv/include/asm/cmpxchg.h
+++ b/arch/frv/include/asm/cmpxchg.h
@@ -69,8 +69,6 @@ extern uint32_t __xchg_32(uint32_t i, volatile void *v);
 
 #endif
 
-#define tas(ptr) (xchg((ptr), 1))
-
 /*****************************************************************************/
 /*
  * compare and conditionally exchange value with memory
diff --git a/arch/tile/include/asm/cmpxchg.h b/arch/tile/include/asm/cmpxchg.h
index 0ccda3c..25d5899 100644
--- a/arch/tile/include/asm/cmpxchg.h
+++ b/arch/tile/include/asm/cmpxchg.h
@@ -127,8 +127,6 @@ long long _atomic64_cmpxchg(long long *v, long long o, long long n);
 
 #endif
 
-#define tas(ptr) xchg((ptr), 1)
-
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_TILE_CMPXCHG_H */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release()
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
@ 2015-10-27 19:53 ` Davidlohr Bueso
  2015-10-27 20:03   ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave,
	linux-kernel, Tony Luck, Benjamin Herrenschmidt, Heiko Carstens,
	Davidlohr Bueso

With commit b92b8b35a2e ("locking/arch: Rename set_mb() to smp_store_mb()")
it was made clear that the context of this call (and thus of set_mb())
is strictly for CPU ordering, as opposed to IO. As such, all archs
should use the smp variant of mb(), respecting the semantics and
saving a mandatory barrier on UP.
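
For the UP case the saving comes from the generic definitions; a
simplified sketch of the convention (not a verbatim quote of any
particular header):

	#ifdef CONFIG_SMP
	#define smp_mb()	mb()		/* real fence instruction */
	#else
	#define smp_mb()	barrier()	/* compiler barrier only */
	#endif

	/*
	 * So on UP, smp_store_mb(v, 1) now reduces to
	 * WRITE_ONCE(v, 1); barrier(); instead of still emitting a full mb().
	 */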

Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

Completely untested.

 arch/ia64/include/asm/barrier.h    | 2 +-
 arch/powerpc/include/asm/barrier.h | 2 +-
 arch/s390/include/asm/barrier.h    | 2 +-
 include/asm-generic/barrier.h      | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index df896a1..209c4b8 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -77,7 +77,7 @@ do {									\
 	___p1;								\
 })
 
-#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 /*
  * The group barrier in front of the rsm & ssm are necessary to ensure
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 0eca6ef..a7af5fb 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -34,7 +34,7 @@
 #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
 #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
 
-#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 #ifdef __SUBARCH_HAS_LWSYNC
 #    define SMPWMB      LWSYNC
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index d48fe01..d360737 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -36,7 +36,7 @@
 #define smp_mb__before_atomic()		smp_mb()
 #define smp_mb__after_atomic()		smp_mb()
 
-#define smp_store_mb(var, value)		do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 #define smp_store_release(p, v)						\
 do {									\
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index b42afad..0f45f93 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -93,7 +93,7 @@
 #endif	/* CONFIG_SMP */
 
 #ifndef smp_store_mb
-#define smp_store_mb(var, value)  do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value)  do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 #endif
 
 #ifndef smp_mb__before_atomic
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
  2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
@ 2015-10-27 19:53 ` Davidlohr Bueso
  2015-10-27 21:33   ` Linus Torvalds
  2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
  2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells
  4 siblings, 1 reply; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave,
	linux-kernel, x86, Davidlohr Bueso

With the exception of the recent rename of set_mb() to smp_store_mb(),
which explicitly enforces SMP ordering, the code is quite stale -
going back to 2002, afaict. Specifically, replace the implicit
barrier of xchg() with a more standard, explicit smp_mb() call. Thus:

(i) We need not re-define it for SMP and UP systems; the latter
already converts the smp_mb() into a compiler barrier.

(ii) Like most other archs, avoid the ugly/hacky (void)xchg
pattern and simply add the smp_mb() call explicitly after the
assignment.

Note that this might affect callers that could/would rely on the
atomicity semantics, but there are no guarantees of that for
smp_store_mb() mentioned anywhere, plus most archs use this anyway.
Thus we continue to be consistent with the memory-barriers.txt file,
and more importantly, maintain the semantics of the smp_ nature.

Cc: x86@kernel.org
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 arch/x86/include/asm/barrier.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0681d25..09f817a 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -35,14 +35,18 @@
 #define smp_mb()	mb()
 #define smp_rmb()	dma_rmb()
 #define smp_wmb()	barrier()
-#define smp_store_mb(var, value) do { (void)xchg(&var, value); } while (0)
 #else /* !SMP */
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
-#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); barrier(); } while (0)
 #endif /* SMP */
 
+#define smp_store_mb(var, val)						\
+do {									\
+	WRITE_ONCE(var, val);						\
+	smp_mb();							\
+} while (0)
+
 #define read_barrier_depends()		do { } while (0)
 #define smp_read_barrier_depends()	do { } while (0)
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb()
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
@ 2015-10-27 19:53 ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso
  2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells
  4 siblings, 1 reply; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 19:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, dave, linux-kernel

It serves no purpose but to confuse readers, and is most
likely a leftover from constant memory-barriers.txt
updates. I.e.:

http://lists.openwall.net/linux-kernel/2006/07/15/27

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/memory-barriers.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index b5fe765..ee6fbb0 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1683,8 +1683,8 @@ There are some more advanced barrier functions:
  (*) smp_store_mb(var, value)
 
      This assigns the value to the variable and then inserts a full memory
-     barrier after it, depending on the function.  It isn't guaranteed to
-     insert anything more than a compiler barrier in a UP compilation.
+     barrier after it.  It isn't guaranteed to insert anything more than a
+     compiler barrier in a UP compilation.
 
 
  (*) smp_mb__before_atomic();
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release()
  2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
@ 2015-10-27 20:03   ` Davidlohr Bueso
  2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
  1 sibling, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 20:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Paul E. McKenney, linux-kernel,
	Tony Luck, Benjamin Herrenschmidt, Heiko Carstens,
	Davidlohr Bueso

So the subject should actually say 'smp_store_mb()'... barriers here, barriers
there, barriers everywhere.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
@ 2015-10-27 21:33   ` Linus Torvalds
  2015-10-27 22:01     ` Davidlohr Bueso
  2015-10-27 22:37     ` Peter Zijlstra
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-10-27 21:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> Note that this might affect callers that could/would rely on the
> atomicity semantics, but there are no guarantees of that for
> smp_store_mb() mentioned anywhere, plus most archs use this anyway.
> Thus we continue to be consistent with the memory-barriers.txt file,
> and more importantly, maintain the semantics of the smp_ nature.

So I dislike this patch, mostly because it now makes it obvious that
smp_store_mb() seems to be totally pointless. Every single
implementation is now apparently WRITE_ONCE+smp_mb(), and there are
what, five users of it, so why not then open-code it?
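
For concreteness, open-coding a call site would amount to roughly the
following (illustrative sketch only; the state write is just an example
of the pattern):

	/* smp_store_mb(current->state, TASK_UNINTERRUPTIBLE);  becomes: */
	WRITE_ONCE(current->state, TASK_UNINTERRUPTIBLE);
	smp_mb();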

But more importantly, is the "WRITE_ONCE()" even necessary? If there
are no atomicity guarantees, then why bother with WRTE_ONCE() either?

So with this patch, the whole thing becomes pointless, I feel. (Ok, so
it may have been pointless before too, but at least before this patch
it generated special code, now it doesn't). So why carry it along at
all?

               Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 21:33   ` Linus Torvalds
@ 2015-10-27 22:01     ` Davidlohr Bueso
  2015-10-27 22:37     ` Peter Zijlstra
  1 sibling, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-27 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Wed, 28 Oct 2015, Linus Torvalds wrote:

>On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> Note that this might affect callers that could/would rely on the
>> atomicity semantics, but there are no guarantees of that for
>> smp_store_mb() mentioned anywhere, plus most archs use this anyway.
>> Thus we continue to be consistent with the memory-barriers.txt file,
>> and more importantly, maintain the semantics of the smp_ nature.
>
>So I dislike this patch, mostly because it now makes it obvious that
>smp_store_mb() seems to be totally pointless. Every single
>implementation is now apparently WRITE_ONCE+smp_mb(), and there are
>what, five users of it, so why not then open-code it?

So after having gone through pretty much all of the smp_store_mb() code,
this is a feeling I also share. However, I justified its existence (as
opposed to dropping the call, updating all the callers, documenting the
barriers, etc.) by the fact that it at least encapsulates the store+mb
logic, which apparently is a pattern that is somewhat needed(?). Also,
the call obviously does exactly what its name implies.

But I have no strong preference either way. Now, if we should keep
smp_store_mb(), it should probably be made generic, instead of having
each arch define it.

>
>But more importantly, is the "WRITE_ONCE()" even necessary? If there
>are no atomicity guarantees, then why bother with WRTE_ONCE() either?

Agreed. Hmm, this was introduced by ab3f02fc237 (locking/arch: Add WRITE_ONCE()
to set_mb()), back when atomicity aspects were not clear yet.
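
For word-sized scalars WRITE_ONCE() boils down to a volatile store; a
simplified sketch (WRITE_ONCE_SKETCH is just an illustration name, not
the real multi-size implementation):

	#define WRITE_ONCE_SKETCH(x, val) \
		(*(volatile typeof(x) *)&(x) = (val))

	/*
	 * This only keeps the compiler from tearing, fusing or otherwise
	 * mangling the store; it never promised RMW-style atomicity the way
	 * the old xchg-based set_mb() did.
	 */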

>So with this patch, the whole thing becomes pointless, I feel. (Ok, so
>it may have been pointless before too, but at least before this patch
>it generated special code, now it doesn't). So why carry it along at
>all?

Ok, unless others are strongly against it, I'll send a series to drop the
call altogether.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 21:33   ` Linus Torvalds
  2015-10-27 22:01     ` Davidlohr Bueso
@ 2015-10-27 22:37     ` Peter Zijlstra
  2015-10-28 19:49       ` Davidlohr Bueso
  2015-11-02 20:15       ` Davidlohr Bueso
  1 sibling, 2 replies; 56+ messages in thread
From: Peter Zijlstra @ 2015-10-27 22:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote:
> On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > Note that this might affect callers that could/would rely on the
> > atomicity semantics, but there are no guarantees of that for
> > smp_store_mb() mentioned anywhere, plus most archs use this anyway.
> > Thus we continue to be consistent with the memory-barriers.txt file,
> > and more importantly, maintain the semantics of the smp_ nature.
> 

> So with this patch, the whole thing becomes pointless, I feel. (Ok, so
> it may have been pointless before too, but at least before this patch
> it generated special code, now it doesn't). So why carry it along at
> all?

So I suppose this boils down to whether XCHG ends up being cheaper than
MOV+FENCE.

PeterA, any idea?
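
Roughly, per store, the two variants compare as follows (sketch of the
generated code; exact registers, and whether the fence is MFENCE or a
LOCKed RMW, will vary):

	/*
	 * old x86 smp_store_mb():  (void)xchg(&var, value)
	 *	-> xchg %reg, var	; implicitly LOCK-prefixed, full barrier
	 *
	 * proposed:  WRITE_ONCE(var, value); smp_mb();
	 *	-> mov  %reg, var
	 *	   mfence		; smp_mb() == mb() on x86 SMP
	 */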

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/4] arch,cmpxchg: Remove tas() definitions
  2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
@ 2015-10-27 23:27 ` David Howells
  4 siblings, 0 replies; 56+ messages in thread
From: David Howells @ 2015-10-27 23:27 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: dhowells, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Linus Torvalds, Paul E. McKenney, linux-kernel, Steven Miao,
	Aurelien Jacquiot, Chris Metcalf, Davidlohr Bueso

Davidlohr Bueso <dave@stgolabs.net> wrote:

> It seems that 5dc12ddee93 (Remove tas()) missed some files. Correct
> this and fully drop this macro, for which we should be using cmpxchg
> like calls.
> 
> Cc: Steven Miao <realmz6@gmail.com>
> Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

Acked-by: David Howells <dhowells@redhat.com> [frv]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 22:37     ` Peter Zijlstra
@ 2015-10-28 19:49       ` Davidlohr Bueso
  2015-11-02 20:15       ` Davidlohr Bueso
  1 sibling, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-10-28 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso, H. Peter Anvin

On Tue, 27 Oct 2015, Peter Zijlstra wrote:

>On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote:
>> On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>> >
>> > Note that this might affect callers that could/would rely on the
>> > atomicity semantics, but there are no guarantees of that for
>> > smp_store_mb() mentioned anywhere, plus most archs use this anyway.
>> > Thus we continue to be consistent with the memory-barriers.txt file,
>> > and more importantly, maintain the semantics of the smp_ nature.
>>
>
>> So with this patch, the whole thing becomes pointless, I feel. (Ok, so
>> it may have been pointless before too, but at least before this patch
>> it generated special code, now it doesn't). So why carry it along at
>> all?
>
>So I suppose this boils down to whether XCHG ends up being cheaper than
>MOV+FENCE.

If so, could this be the reasoning behind the mix and match of xchg and
MOV+FENCE across different archs? That goes back to the days when set_mb()
was introduced. I wonder if it even still matters... I at least haven't seen
much difference in general workloads (I guess any difference would be
negligible for practical purposes). But I could obviously be missing something.

>PeterA, any idea?

I suppose you're referring to hpa, Cc'ing him.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-10-27 22:37     ` Peter Zijlstra
  2015-10-28 19:49       ` Davidlohr Bueso
@ 2015-11-02 20:15       ` Davidlohr Bueso
  2015-11-03  0:06         ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Davidlohr Bueso @ 2015-11-02 20:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Tue, 27 Oct 2015, Peter Zijlstra wrote:

>On Wed, Oct 28, 2015 at 06:33:56AM +0900, Linus Torvalds wrote:
>> On Wed, Oct 28, 2015 at 4:53 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>> >
>> > Note that this might affect callers that could/would rely on the
>> > atomicity semantics, but there are no guarantees of that for
>> > smp_store_mb() mentioned anywhere, plus most archs use this anyway.
>> > Thus we continue to be consistent with the memory-barriers.txt file,
>> > and more importantly, maintain the semantics of the smp_ nature.
>>
>
>> So with this patch, the whole thing becomes pointless, I feel. (Ok, so
>> it may have been pointless before too, but at least before this patch
>> it generated special code, now it doesn't). So why carry it along at
>> all?
>
>So I suppose this boils down to whether XCHG ends up being cheaper than
>MOV+FENCE.

So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
constantly cheaper (by at least half the latency) than MFENCE. While there
was a decent amount of variation, this difference remained rather constant.

Then again, I'm not sure this matters. Thoughts?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-11-02 20:15       ` Davidlohr Bueso
@ 2015-11-03  0:06         ` Linus Torvalds
  2015-11-03  1:36           ` Davidlohr Bueso
                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Linus Torvalds @ 2015-11-03  0:06 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> constantly cheaper (by at least half the latency) than MFENCE. While there
> was a decent amount of variation, this difference remained rather constant.

Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
use on old cpu's without one (ie 32-bit).

I'm not actually convinced that mfence is necessarily a good idea. I
could easily see it being microcode, for example.

At least on my Haswell, the "lock addq" is pretty much exactly half
the cost of "mfence".

                     Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-11-03  0:06         ` Linus Torvalds
@ 2015-11-03  1:36           ` Davidlohr Bueso
  2016-01-12 13:57           ` Michael S. Tsirkin
  2 siblings, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-11-03  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Davidlohr Bueso

On Mon, 02 Nov 2015, Linus Torvalds wrote:

>On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
>> constantly cheaper (by at least half the latency) than MFENCE. While there
>> was a decent amount of variation, this difference remained rather constant.
>
>Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
>use on old cpu's without one (ie 32-bit).

I'm getting results very close to xchg.

>I'm not actually convinced that mfence is necessarily a good idea. I
>could easily see it being microcode, for example.

Interesting.

>
>At least on my Haswell, the "lock addq" is pretty much exactly half
>the cost of "mfence".

Ok, this coincides with my results on IvB.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [tip:locking/core] locking/cmpxchg, arch: Remove tas() definitions
  2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
@ 2015-12-04 12:01   ` tip-bot for Davidlohr Bueso
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-arch, cmetcalf, realmz6, a-jacquiot, linux-kernel, dbueso,
	torvalds, tglx, dave, hpa, peterz, paulmck, akpm, mingo

Commit-ID:  fbd35c0d2fb41b75863a0e45fe939c8440375b0a
Gitweb:     http://git.kernel.org/tip/fbd35c0d2fb41b75863a0e45fe939c8440375b0a
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 27 Oct 2015 12:53:48 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 11:39:51 +0100

locking/cmpxchg, arch: Remove tas() definitions

It seems that commit 5dc12ddee93 ("Remove tas()") missed some files.
Correct this and fully drop this macro, for which we should be using
cmpxchg() like calls.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1445975631-17047-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/blackfin/include/asm/cmpxchg.h | 1 -
 arch/c6x/include/asm/cmpxchg.h      | 2 --
 arch/frv/include/asm/cmpxchg.h      | 2 --
 arch/tile/include/asm/cmpxchg.h     | 2 --
 4 files changed, 7 deletions(-)

diff --git a/arch/blackfin/include/asm/cmpxchg.h b/arch/blackfin/include/asm/cmpxchg.h
index c05868c..2539288 100644
--- a/arch/blackfin/include/asm/cmpxchg.h
+++ b/arch/blackfin/include/asm/cmpxchg.h
@@ -128,6 +128,5 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
 #endif /* !CONFIG_SMP */
 
 #define xchg(ptr, x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))
-#define tas(ptr) ((void)xchg((ptr), 1))
 
 #endif /* __ARCH_BLACKFIN_CMPXCHG__ */
diff --git a/arch/c6x/include/asm/cmpxchg.h b/arch/c6x/include/asm/cmpxchg.h
index b27c8ce..93d0a5a 100644
--- a/arch/c6x/include/asm/cmpxchg.h
+++ b/arch/c6x/include/asm/cmpxchg.h
@@ -47,8 +47,6 @@ static inline unsigned int __xchg(unsigned int x, volatile void *ptr, int size)
 #define xchg(ptr, x) \
 	((__typeof__(*(ptr)))__xchg((unsigned int)(x), (void *) (ptr), \
 				    sizeof(*(ptr))))
-#define tas(ptr)    xchg((ptr), 1)
-
 
 #include <asm-generic/cmpxchg-local.h>
 
diff --git a/arch/frv/include/asm/cmpxchg.h b/arch/frv/include/asm/cmpxchg.h
index 5b04dd0..a899765 100644
--- a/arch/frv/include/asm/cmpxchg.h
+++ b/arch/frv/include/asm/cmpxchg.h
@@ -69,8 +69,6 @@ extern uint32_t __xchg_32(uint32_t i, volatile void *v);
 
 #endif
 
-#define tas(ptr) (xchg((ptr), 1))
-
 /*****************************************************************************/
 /*
  * compare and conditionally exchange value with memory
diff --git a/arch/tile/include/asm/cmpxchg.h b/arch/tile/include/asm/cmpxchg.h
index 0ccda3c..25d5899 100644
--- a/arch/tile/include/asm/cmpxchg.h
+++ b/arch/tile/include/asm/cmpxchg.h
@@ -127,8 +127,6 @@ long long _atomic64_cmpxchg(long long *v, long long o, long long n);
 
 #endif
 
-#define tas(ptr) xchg((ptr), 1)
-
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_TILE_CMPXCHG_H */

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip:locking/core] lcoking/barriers, arch: Use smp barriers in smp_store_release()
  2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
  2015-10-27 20:03   ` Davidlohr Bueso
@ 2015-12-04 12:01   ` tip-bot for Davidlohr Bueso
  1 sibling, 0 replies; 56+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, dave, hpa, tony.luck, linux-arch,
	heiko.carstens, akpm, peterz, mingo, paulmck, tglx, dbueso, benh

Commit-ID:  d5a73cadf3fdec95e9518ee5bb91bd0747c42b30
Gitweb:     http://git.kernel.org/tip/d5a73cadf3fdec95e9518ee5bb91bd0747c42b30
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 27 Oct 2015 12:53:49 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 11:39:51 +0100

lcoking/barriers, arch: Use smp barriers in smp_store_release()

With commit b92b8b35a2e ("locking/arch: Rename set_mb() to smp_store_mb()")
it was made clear that the context of this call (and thus set_mb)
is strictly for CPU ordering, as opposed to IO. As such all archs
should use the smp variant of mb(), respecting the semantics and
saving a mandatory barrier on UP.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1445975631-17047-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/include/asm/barrier.h    | 2 +-
 arch/powerpc/include/asm/barrier.h | 2 +-
 arch/s390/include/asm/barrier.h    | 2 +-
 include/asm-generic/barrier.h      | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index df896a1..209c4b8 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -77,7 +77,7 @@ do {									\
 	___p1;								\
 })
 
-#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 /*
  * The group barrier in front of the rsm & ssm are necessary to ensure
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 0eca6ef..a7af5fb 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -34,7 +34,7 @@
 #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
 #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
 
-#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 #ifdef __SUBARCH_HAS_LWSYNC
 #    define SMPWMB      LWSYNC
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index d68e11e..7ffd0b1 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -36,7 +36,7 @@
 #define smp_mb__before_atomic()		smp_mb()
 #define smp_mb__after_atomic()		smp_mb()
 
-#define smp_store_mb(var, value)		do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value)	do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 
 #define smp_store_release(p, v)						\
 do {									\
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index b42afad..0f45f93 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -93,7 +93,7 @@
 #endif	/* CONFIG_SMP */
 
 #ifndef smp_store_mb
-#define smp_store_mb(var, value)  do { WRITE_ONCE(var, value); mb(); } while (0)
+#define smp_store_mb(var, value)  do { WRITE_ONCE(var, value); smp_mb(); } while (0)
 #endif
 
 #ifndef smp_mb__before_atomic

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation
  2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
@ 2015-12-04 12:01   ` tip-bot for Davidlohr Bueso
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-12-04 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: corbet, tglx, linux-kernel, hpa, linux-arch, paulmck, torvalds,
	mingo, dave, peterz, akpm

Commit-ID:  2d142e599bf73ab70a3457e6947f86935245415e
Gitweb:     http://git.kernel.org/tip/2d142e599bf73ab70a3457e6947f86935245415e
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 27 Oct 2015 12:53:51 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 11:39:51 +0100

locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation

It serves no purpose but to confuse readers, and is most
likely a left over from constant memory-barriers.txt
updates. I.e.:

  http://lists.openwall.net/linux-kernel/2006/07/15/27

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445975631-17047-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/memory-barriers.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index aef9487..c85054d 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1673,8 +1673,8 @@ There are some more advanced barrier functions:
  (*) smp_store_mb(var, value)
 
      This assigns the value to the variable and then inserts a full memory
-     barrier after it, depending on the function.  It isn't guaranteed to
-     insert anything more than a compiler barrier in a UP compilation.
+     barrier after it.  It isn't guaranteed to insert anything more than a
+     compiler barrier in a UP compilation.
 
 
  (*) smp_mb__before_atomic();

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-11-03  0:06         ` Linus Torvalds
  2015-11-03  1:36           ` Davidlohr Bueso
@ 2016-01-12 13:57           ` Michael S. Tsirkin
  2016-01-12 17:20               ` Linus Torvalds
  2 siblings, 1 reply; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul E. McKenney, Linux Kernel Mailing List,
	the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
	virtualization

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While there
> > was a decent amount of variation, this difference remained rather constant.
> 
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
> use on old cpu's without one (ie 32-bit).
> 
> I'm not actually convinced that mfence is necessarily a good idea. I
> could easily see it being microcode, for example.
> 
> At least on my Haswell, the "lock addq" is pretty much exactly half
> the cost of "mfence".
> 
>                      Linus

mfence was high on some traces I was seeing, so I got curious, too:

---->
main.c
---->


extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	for (i = 0; i < 10000000; ++i) {
		x = i - j;
		barrier();
		j = x;
	}

	return 0;
}
---->
Makefile:
---->

ALL = xchg xchgrz lock mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done

.PHONY: all clean run

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

----->

Is this a good way to test it?

E.g. on my laptop I get:

perf stat -r 10 --log-fd 1 -- ./xchg

 Performance counter stats for './xchg' (10 runs):

         53.236967 task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
                10 context-switches          #    0.180 K/sec                    ( +-  1.70% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.691 K/sec                    ( +-  1.13% )
       190,997,612 cycles                    #    3.588 GHz                      ( +-  0.04% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,654,850 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,122,372 branches                  #  190.138 M/sec                    ( +-  0.01% )
             4,514 branch-misses             #    0.04% of all branches          ( +-  3.37% )

       0.053642809 seconds time elapsed                                          ( +-  0.12% )

perf stat -r 10 --log-fd 1 -- ./xchgrz

 Performance counter stats for './xchgrz' (10 runs):

         53.189533 task-clock                #    0.997 CPUs utilized            ( +-  0.22% )
                 0 context-switches          #    0.000 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.694 K/sec                    ( +-  0.75% )
       190,785,621 cycles                    #    3.587 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,602,086 instructions              #    0.42  insns per cycle          ( +-  0.00% )
        10,112,154 branches                  #  190.115 M/sec                    ( +-  0.01% )
             3,743 branch-misses             #    0.04% of all branches          ( +-  4.02% )

       0.053343693 seconds time elapsed                                          ( +-  0.23% )

perf stat -r 10 --log-fd 1 -- ./lock

 Performance counter stats for './lock' (10 runs):

         53.096434 task-clock                #    0.997 CPUs utilized            ( +-  0.16% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.693 K/sec                    ( +-  0.98% )
       190,796,621 cycles                    #    3.593 GHz                      ( +-  0.02% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,601,376 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,112,074 branches                  #  190.447 M/sec                    ( +-  0.01% )
             3,475 branch-misses             #    0.03% of all branches          ( +-  1.33% )

       0.053252678 seconds time elapsed                                          ( +-  0.16% )

perf stat -r 10 --log-fd 1 -- ./mfence

 Performance counter stats for './mfence' (10 runs):

        126.376473 task-clock                #    0.999 CPUs utilized            ( +-  0.21% )
                 0 context-switches          #    0.002 K/sec                    ( +- 66.67% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.289 K/sec                    ( +-  0.84% )
       456,147,770 cycles                    #    3.609 GHz                      ( +-  0.01% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,892,416 instructions              #    0.18  insns per cycle          ( +-  0.00% )
        10,163,220 branches                  #   80.420 M/sec                    ( +-  0.01% )
             4,653 branch-misses             #    0.05% of all branches          ( +-  1.27% )

       0.126539273 seconds time elapsed                                          ( +-  0.21% )

perf stat -r 10 --log-fd 1 -- ./lfence

 Performance counter stats for './lfence' (10 runs):

         47.617861 task-clock                #    0.997 CPUs utilized            ( +-  0.06% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.764 K/sec                    ( +-  0.45% )
       170,767,856 cycles                    #    3.586 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,581,607 instructions              #    0.47  insns per cycle          ( +-  0.00% )
        10,108,508 branches                  #  212.284 M/sec                    ( +-  0.00% )
             3,320 branch-misses             #    0.03% of all branches          ( +-  1.12% )

       0.047768505 seconds time elapsed                                          ( +-  0.07% )

perf stat -r 10 --log-fd 1 -- ./sfence

 Performance counter stats for './sfence' (10 runs):

         20.156676 task-clock                #    0.988 CPUs utilized            ( +-  0.45% )
                 3 context-switches          #    0.159 K/sec                    ( +- 12.15% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.002 M/sec                    ( +-  0.87% )
        72,212,225 cycles                    #    3.583 GHz                      ( +-  0.33% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,479,149 instructions              #    1.11  insns per cycle          ( +-  0.00% )
        10,090,785 branches                  #  500.618 M/sec                    ( +-  0.01% )
             3,626 branch-misses             #    0.04% of all branches          ( +-  3.59% )

       0.020411208 seconds time elapsed                                          ( +-  0.52% )


So mfence is more expensive than locked instructions/xchg, but sfence/lfence
are slightly faster, and xchg and locked instructions are very close if
not the same.

I poked at some 10 intel and AMD machines and the numbers are different
but the results seem more or less consistent with this.

From a size point of view xchg is longer, and xchgrz pokes at the red
zone, which seems unnecessarily hacky, so good old lock+addl is probably
the best.

There isn't any extra magic behind mfence, is there?
E.g. I think lock orders accesses to WC memory as well,
so apparently mb() can be redefined unconditionally, without
looking at XMM2:

--->
x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

I'll play with this some more before posting this as a
non-stand alone patch. Is there a macro-benchmark where mb
is prominent?

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..f0d36e2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,15 +15,15 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
+#define mb()	asm volatile("lock; addl $0,0(%%rsp)":::"memory")
 #define rmb()	asm volatile("lfence":::"memory")
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2015-11-03  0:06         ` Linus Torvalds
  2015-11-03  1:36           ` Davidlohr Bueso
@ 2016-01-12 13:57           ` Michael S. Tsirkin
  2016-01-12 13:57           ` Michael S. Tsirkin
  2 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, H. Peter Anvin, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar

On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While there
> > was a decent amount of variation, this difference remained rather constant.
> 
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
> use on old cpu's without one (ie 32-bit).
> 
> I'm not actually convinced that mfence is necessarily a good idea. I
> could easily see it being microcode, for example.
> 
> At least on my Haswell, the "lock addq" is pretty much exactly half
> the cost of "mfence".
> 
>                      Linus

mfence was high on some traces I was seeing, so I got curious, too:

---->
main.c
---->


extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	for (i = 0; i < 10000000; ++i) {
		x = i - j;
		barrier();
		j = x;
	}

	return 0;
}
---->
Makefile:
---->

ALL = xchg xchgrz lock mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done

.PHONY: all clean run

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

----->

Is this a good way to test it?

E.g. on my laptop I get:

perf stat -r 10 --log-fd 1 -- ./xchg

 Performance counter stats for './xchg' (10 runs):

         53.236967 task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
                10 context-switches          #    0.180 K/sec                    ( +-  1.70% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.691 K/sec                    ( +-  1.13% )
       190,997,612 cycles                    #    3.588 GHz                      ( +-  0.04% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,654,850 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,122,372 branches                  #  190.138 M/sec                    ( +-  0.01% )
             4,514 branch-misses             #    0.04% of all branches          ( +-  3.37% )

       0.053642809 seconds time elapsed                                          ( +-  0.12% )

perf stat -r 10 --log-fd 1 -- ./xchgrz

 Performance counter stats for './xchgrz' (10 runs):

         53.189533 task-clock                #    0.997 CPUs utilized            ( +-  0.22% )
                 0 context-switches          #    0.000 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.694 K/sec                    ( +-  0.75% )
       190,785,621 cycles                    #    3.587 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,602,086 instructions              #    0.42  insns per cycle          ( +-  0.00% )
        10,112,154 branches                  #  190.115 M/sec                    ( +-  0.01% )
             3,743 branch-misses             #    0.04% of all branches          ( +-  4.02% )

       0.053343693 seconds time elapsed                                          ( +-  0.23% )

perf stat -r 10 --log-fd 1 -- ./lock

 Performance counter stats for './lock' (10 runs):

         53.096434 task-clock                #    0.997 CPUs utilized            ( +-  0.16% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                37 page-faults               #    0.693 K/sec                    ( +-  0.98% )
       190,796,621 cycles                    #    3.593 GHz                      ( +-  0.02% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,601,376 instructions              #    0.42  insns per cycle          ( +-  0.01% )
        10,112,074 branches                  #  190.447 M/sec                    ( +-  0.01% )
             3,475 branch-misses             #    0.03% of all branches          ( +-  1.33% )

       0.053252678 seconds time elapsed                                          ( +-  0.16% )

perf stat -r 10 --log-fd 1 -- ./mfence

 Performance counter stats for './mfence' (10 runs):

        126.376473 task-clock                #    0.999 CPUs utilized            ( +-  0.21% )
                 0 context-switches          #    0.002 K/sec                    ( +- 66.67% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.289 K/sec                    ( +-  0.84% )
       456,147,770 cycles                    #    3.609 GHz                      ( +-  0.01% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,892,416 instructions              #    0.18  insns per cycle          ( +-  0.00% )
        10,163,220 branches                  #   80.420 M/sec                    ( +-  0.01% )
             4,653 branch-misses             #    0.05% of all branches          ( +-  1.27% )

       0.126539273 seconds time elapsed                                          ( +-  0.21% )

perf stat -r 10 --log-fd 1 -- ./lfence

 Performance counter stats for './lfence' (10 runs):

         47.617861 task-clock                #    0.997 CPUs utilized            ( +-  0.06% )
                 0 context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.764 K/sec                    ( +-  0.45% )
       170,767,856 cycles                    #    3.586 GHz                      ( +-  0.03% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,581,607 instructions              #    0.47  insns per cycle          ( +-  0.00% )
        10,108,508 branches                  #  212.284 M/sec                    ( +-  0.00% )
             3,320 branch-misses             #    0.03% of all branches          ( +-  1.12% )

       0.047768505 seconds time elapsed                                          ( +-  0.07% )

perf stat -r 10 --log-fd 1 -- ./sfence

 Performance counter stats for './sfence' (10 runs):

         20.156676 task-clock                #    0.988 CPUs utilized            ( +-  0.45% )
                 3 context-switches          #    0.159 K/sec                    ( +- 12.15% )
                 0 CPU-migrations            #    0.000 K/sec                  
                36 page-faults               #    0.002 M/sec                    ( +-  0.87% )
        72,212,225 cycles                    #    3.583 GHz                      ( +-  0.33% )
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        80,479,149 instructions              #    1.11  insns per cycle          ( +-  0.00% )
        10,090,785 branches                  #  500.618 M/sec                    ( +-  0.01% )
             3,626 branch-misses             #    0.04% of all branches          ( +-  3.59% )

       0.020411208 seconds time elapsed                                          ( +-  0.52% )


So mfence is more expensive than locked instructions/xchg, but sfence/lfence
are slightly faster, and xchg and locked instructions are very close if
not the same.

I poked at some 10 Intel and AMD machines and the numbers are different
but the results seem more or less consistent with this.

From a code-size point of view xchg is longer, and xchgrz pokes at the
red zone, which seems unnecessarily hacky, so good old lock+addl is
probably the best.

There isn't any extra magic behind mfence, is there?
E.g. I think lock orders accesses to WC memory as well,
so apparently mb() can be redefined unconditionally, without
looking at XMM2:

--->
x86: drop mfence in favor of lock+addl

mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

I'll play with this some more before posting this as a
non-standalone patch. Is there a macro-benchmark where mb()
is prominent?

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..f0d36e2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,15 +15,15 @@
  * Some non-Intel clones support out of order store. wmb() ceases to be a
  * nop for these.
  */
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory")
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
+#define mb()	asm volatile("lock; addl $0,0(%%rsp)":::"memory")
 #define rmb()	asm volatile("lfence":::"memory")
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 13:57           ` Michael S. Tsirkin
@ 2016-01-12 17:20               ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 17:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul E. McKenney, Linux Kernel Mailing List,
	the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
	virtualization

On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> #ifdef xchgrz
> /* same as xchg but poking at gcc red zone */
> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> #endif

That's not safe in general. gcc might be using its redzone, so doing
xchg into it is unsafe.

But..

> Is this a good way to test it?

.. it's fine for some basic testing. It doesn't show any subtle
interactions (ie some operations may have different dynamic behavior
when the write buffers are busy etc), but as a baseline for "how fast
can things go" the stupid raw loop is fine. And while the xchg into
the redzone wouldn't be acceptable as a real implementation, for
timing testing it's likely fine (ie you aren't hitting the problem it
can cause).
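
(As an illustration of that hazard, hedged and hypothetical, not code from
the kernel or from this thread: with the default userspace ABI, i.e. the
red zone enabled, gcc may keep a leaf function's locals below %rsp, which
is exactly the memory an xchg into -4(%rsp) would scribble over.)

	/* Hypothetical leaf function: with the red zone enabled, gcc may
	 * keep 'tmp' at a negative offset from %rsp instead of adjusting
	 * the stack pointer. */
	static int scale3(int v)
	{
		volatile int tmp = v * 3;	/* may live in the red zone */
		/* the xchg-based "barrier" may overwrite tmp right here */
		asm volatile("xchgl %%eax, -4(%%rsp)" ::: "eax", "memory");
		return tmp;
	}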

> So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> are slightly faster, and xchg and locked instructions are very close if
> not the same.

Note that we never actually *use* lfence/sfence. They are pointless
instructions when looking at CPU memory ordering, because for pure CPU
memory ordering stores and loads are already ordered.

The only reason to use lfence/sfence is after you've used nontemporal
stores for IO. That's very very rare in the kernel. So I wouldn't
worry about those.
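
(For context, the pattern being referred to looks roughly like this
userspace sketch; fill_nt() and the 'ready' flag are made up for
illustration, the point is only the movnti + sfence pairing:)

	#include <emmintrin.h>	/* _mm_stream_si32(), _mm_sfence() */

	/* Stream data past the cache with non-temporal stores, then fence
	 * before publishing: without the sfence, another CPU could observe
	 * 'ready' before the streamed data is globally visible. */
	static void fill_nt(int *dst, int val, int n, volatile int *ready)
	{
		int i;

		for (i = 0; i < n; i++)
			_mm_stream_si32(&dst[i], val);	/* movnti */
		_mm_sfence();
		*ready = 1;
	}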

But yes, it does sound like mfence is just a bad idea too.

> There isn't any extra magic behind mfence, is there?

No.

I think the only issue is that there has never been any real reason
for CPU designers to try to make mfence go particularly fast. Nobody
uses it, again with the exception of some odd loops that use
nontemporal stores, and for those the cost tends to always be about
the nontemporal accesses themselves (often to things like GPU memory
over PCIe), and the mfence cost of a few extra cycles is negligible.

The reason "lock ; add $0" has generally been the fastest we've found
is simply that locked ops have been important for CPU designers.

So I think the patch is fine, and we should likely drop the use of mfence..

                      Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 17:20               ` Linus Torvalds
@ 2016-01-12 17:45                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul E. McKenney, Linux Kernel Mailing List,
	the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
	virtualization

On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > #ifdef xchgrz
> > /* same as xchg but poking at gcc red zone */
> > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> > #endif
> 
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
> 
> But..
> 
> > Is this a good way to test it?
> 
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzone wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).
> 
> > So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> > are slightly faster, and xchg and locked instructions are very close if
> > not the same.
> 
> Note that we never actually *use* lfence/sfence. They are pointless
> instructions when looking at CPU memory ordering, because for pure CPU
> memory ordering stores and loads are already ordered.
> 
> The only reason to use lfence/sfence is after you've used nontemporal
> stores for IO.


By the way, the comment in barrier.h says:

/*
 * Some non-Intel clones support out of order store. wmb() ceases to be
 * a nop for these.
 */

and while the 1st sentence may well be true, if you have
an SMP system with out of order stores, making wmb
not a nop will not help.

Additionally as you point out, wmb is not a nop even
for regular intel CPUs because of these weird use-cases.

Drop this comment?

> That's very very rare in the kernel. So I wouldn't
> worry about those.

Right - I'll leave these alone, whoever wants to optimize this path will
have to do the necessary research.

> But yes, it does sound like mfence is just a bad idea too.
> 
> > There isn't any extra magic behind mfence, is there?
> 
> No.
> 
> I think the only issue is that there has never been any real reason
> for CPU designers to try to make mfence go particularly fast. Nobody
> uses it, again with the exception of some odd loops that use
> nontemporal stores, and for those the cost tends to always be about
> the nontemporal accesses themselves (often to things like GPU memory
> over PCIe), and the mfence cost of a few extra cycles is negligible.
> 
> The reason "lock ; add $0" has generally been the fastest we've found
> is simply that locked ops have been important for CPU designers.
> 
> So I think the patch is fine, and we should likely drop the use of mfence..
> 
>                       Linus

OK so should I repost after a bit more testing?  I don't believe this
will affect the kernel build benchmark, but I'll try :)


-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 17:45                 ` Michael S. Tsirkin
@ 2016-01-12 18:04                   ` Linus Torvalds
  -1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 18:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul E. McKenney, Linux Kernel Mailing List,
	the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
	virtualization

On Tue, Jan 12, 2016 at 9:45 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> By the way, the comment in barrier.h says:
>
> /*
>  * Some non-Intel clones support out of order store. wmb() ceases to be
>  * a nop for these.
>  */
>
> and while the 1st sentence may well be true, if you have
> an SMP system with out of order stores, making wmb
> not a nop will not help.
>
> Additionally as you point out, wmb is not a nop even
> for regular intel CPUs because of these weird use-cases.
>
> Drop this comment?

We should drop it, yes. We dropped support for CONFIG_X86_OOSTORE
almost two years ago. See commit 09df7c4c8097 ("x86: Remove
CONFIG_X86_OOSTORE") and it was questionable for a long time even
before that (perhaps ever).

So the comment is stale.

We *do* still use the non-nop rmb/wmb for IO barriers, but even that
is generally questionable. See our "copy_user_64.S" for an actual use
of "movnt" followed by sfence. There's a couple of other cases too. So
that's all correct, but the point is that when we use "movnt" we don't
actually use "wmb()", we are doing assembly, and the assembly should
just use sfence directly.

So it's actually very questionable to ever make even the IO
wmb()/rmb() functions use lfence/sfence. They should never really need
it.

But at the same time, I _really_ don't think we care enough. I'd
rather leave those non-smp barrier cases alone as historical unless
somebody can point to a case where they care about the performance.

We also do have the whole PPRO_FENCE thing, which we can hopefully get
rid of at some point too.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 17:20               ` Linus Torvalds
  (?)
  (?)
@ 2016-01-12 20:30               ` Andy Lutomirski
  2016-01-12 20:54                   ` Linus Torvalds
  -1 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-01-12 20:30 UTC (permalink / raw)
  To: Linus Torvalds, Michael S. Tsirkin
  Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, H. Peter Anvin, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar

On 01/12/2016 09:20 AM, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> #ifdef xchgrz
>> /* same as xchg but poking at gcc red zone */
>> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
>> #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
>> Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzone wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).

I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 
64) was better because it avoided stomping on very-likely-to-be-hot 
write buffers.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 20:30               ` Andy Lutomirski
@ 2016-01-12 20:54                   ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 20:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michael S. Tsirkin, Davidlohr Bueso, Davidlohr Bueso,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> was better because it avoided stomping on very-likely-to-be-hot write
> buffers.

I suspect it could go either way. You want a small constant (for the
instruction size), but any small constant is likely to be within the
current stack frame anyway. I don't think 0(%rsp) is particularly
likely to have a spill on it right then and there, but who knows..

And 64(%rsp) is  possibly going to be cold in the L1 cache, especially
if it's just after a deep function call. Which it might be. So it
might work the other way.

So my guess would be that you wouldn't be able to measure the
difference. It might be there, but probably too small to really see in
any noise.

But numbers talk, bullshit walks. It would be interesting to be proven wrong.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 20:54                   ` Linus Torvalds
@ 2016-01-12 20:59                     ` Andy Lutomirski
  -1 siblings, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2016-01-12 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
>> was better because it avoided stomping on very-likely-to-be-hot write
>> buffers.
>
> I suspect it could go either way. You want a small constant (for the
> > instruction size), but any small constant is likely to be within the
> current stack frame anyway. I don't think 0(%rsp) is particularly
> likely to have a spill on it right then and there, but who knows..
>
> And 64(%rsp) is  possibly going to be cold in the L1 cache, especially
> if it's just after a deep function call. Which it might be. So it
> might work the other way.
>
> So my guess would be that you wouldn't be able to measure the
> difference. It might be there, but probably too small to really see in
> any noise.
>
> But numbers talk, bullshit walks. It would be interesting to be proven wrong.

Here's an article with numbers:

http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

I think they're suggesting using a negative offset, which is safe as
long as it doesn't page fault, even though we have the redzone
disabled.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 20:59                     ` Andy Lutomirski
@ 2016-01-12 21:37                       ` Linus Torvalds
  -1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 21:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

Well, that's with the busy loop and one set of code generation. It
doesn't show the "oops, deeper stack isn't even in the cache any more
due to call chains" issue.

But yes:

> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.

I think a negative offset might work very well. Partly exactly
*because* we have the redzone disabled: we know that inside the
kernel, we'll never have any live stack frame accesses under the stack
pointer, so "-4(%rsp)" sounds good to me. There should never be any
pending writes in the write buffer, because even if it *was* live, it
would have been read off first.

Yeah, it potentially does extend the stack cache footprint by another
4 bytes, but that sounds very benign.

So perhaps it might be worth trying to switch the "mfence" to "lock ;
addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
for x86-32.
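
(Concretely, the 64-bit definition being suggested would look something
like the sketch below; illustrative only, not a tested patch:)

	#define mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory")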

I'd still want to see somebody try to benchmark it. I doubt it's
noticeable, but making changes because you think it might save a few
cycles without then even measuring it is just wrong.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 21:37                       ` Linus Torvalds
  (?)
@ 2016-01-12 22:14                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> 
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.
> 
> But yes:
> 
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
> 
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
> 
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
> 
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
> 
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
> 
>                  Linus

Oops, I posted v2 with just offset 0 before reading
the rest of this thread.
I did try with offset 0 and didn't measure any
change on any perf bench test, or on kernel build.
I wonder which benchmark stresses smp_mb the most.
I'll look into using a negative offset.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 20:59                     ` Andy Lutomirski
@ 2016-01-12 22:21                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 22:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> >> was better because it avoided stomping on very-likely-to-be-hot write
> >> buffers.
> >
> > I suspect it could go either way. You want a small constant (for the
> > instruction size), but any small constant is likely to be within the
> > current stack frame anyway. I don't think 0(%rsp) is particularly
> > likely to have a spill on it right then and there, but who knows..
> >
> > And 64(%rsp) is  possibly going to be cold in the L1 cache, especially
> > if it's just after a deep function call. Which it might be. So it
> > might work the other way.
> >
> > So my guess would be that you wouldn't be able to measure the
> > difference. It might be there, but probably too small to really see in
> > any noise.
> >
> > But numbers talk, bullshit walks. It would be interesting to be proven wrong.
> 
> Here's an article with numbers:
> 
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> 
> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.
> 
> --Andy

OK so I'll have to tweak the test to put something
on stack to measure the difference: my test tweaks a
global variable instead.
I'll try that by tomorrow.

I couldn't measure any difference between mfence and lock+addl
except in a micro-benchmark, but hey since we are tweaking this,
let's do the optimal thing.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 22:21                       ` Michael S. Tsirkin
@ 2016-01-12 22:55                         ` H. Peter Anvin
  -1 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2016-01-12 22:55 UTC (permalink / raw)
  To: Michael S. Tsirkin, Andy Lutomirski
  Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar

On 01/12/16 14:21, Michael S. Tsirkin wrote:
> 
> OK so I'll have to tweak the test to put something
> on stack to measure the difference: my test tweaks a
> global variable instead.
> I'll try that by tomorrow.
> 
> I couldn't measure any difference between mfence and lock+addl
> except in a micro-benchmark, but hey since we are tweaking this,
> let's do the optimal thing.
> 

Be careful with this: if it only shows up in a microbenchmark, we may
introduce a hard-to-debug regression for no real benefit.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 22:55                         ` H. Peter Anvin
  (?)
  (?)
@ 2016-01-12 23:24                         ` Linus Torvalds
  2016-01-13 16:17                             ` Borislav Petkov
  -1 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 23:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michael S. Tsirkin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 2:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Be careful with this: if it only shows up in a microbenchmark, we may
> introduce a hard-to-debug regression for no real benefit.

So I can pretty much guarantee that it shouldn't regress from a
correctness angle, since we rely *heavily* on locked instructions
being barriers, in locking and in various other situations.

Indeed, much more so than we ever rely on "smp_mb()". The places that
rely on smp_mb() are pretty few in the end.

So I think the only issue is whether sometimes "mfence" might be
faster. So far, I've never actually heard of that being the case. The
fence instructions have always sucked when I've seen them.

But talking to the hw people about this is certainly a good idea regardless.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 23:24                         ` Linus Torvalds
@ 2016-01-13 16:17                             ` Borislav Petkov
  0 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Michael S. Tsirkin, Andy Lutomirski,
	Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote:
> But talking to the hw people about this is certainly a good idea regardless.

I'm not seeing it in this thread but I might've missed it too. Anyway,
I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't
touch them.

Do we care?
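
(For C callers the clobber can at least be spelled out, so the compiler
never assumes rFLAGS survive the barrier; a hedged sketch on top of the
lock+addl form discussed above:)

	#define mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")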

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-12 21:37                       ` Linus Torvalds
@ 2016-01-13 16:20                         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso,
	Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, H. Peter Anvin,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> 
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.

It's an interesting read, thanks!

So the stack pointer is read on return from a function, I think. I added a
function and, sure enough, it slows the add 0(sp) variant down. It's still
faster than mfence for me, though! Testing code + results below. Reaching
below the stack, or allocating an extra 4 bytes above the stack pointer,
gives us back the performance.


> But yes:
> 
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
> 
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
> 
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
> 
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
> 
>                  Linus

I'll try this in the kernel now, will report, though I'm
not optimistic a high level benchmark can show this
kind of thing.


---------------
main.c:
---------------

extern volatile int x;
volatile int x;

#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif


#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif

void __attribute__ ((noinline)) test(int i, int *j)
{
	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	x = i - *j;
	barrier();
	*j = x;
}

int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	for (i = 0; i < 10000000; ++i)
		test(i, &j);

	return 0;
}
---------------

ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence

CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat

all: ${ALL}
clean:
	rm -f ${ALL}
run: all
	for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done
perf time: run
time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed

.PHONY: all clean run perf time

xchgrz: CFLAGS += -mno-red-zone

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c

--------------------------------------------
perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
       0.080420565 seconds time elapsed                                          ( +-  2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
       0.087798571 seconds time elapsed                                          ( +-  2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
       0.083023724 seconds time elapsed                                          ( +-  2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
       0.102880750 seconds time elapsed                                          ( +-  0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
       0.084917420 seconds time elapsed                                          ( +-  3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
       0.156014715 seconds time elapsed                                          ( +-  0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
       0.077731443 seconds time elapsed                                          ( +-  0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
       0.036655741 seconds time elapsed                                          ( +-  0.21% )

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:17                             ` Borislav Petkov
@ 2016-01-13 16:25                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 05:17:04PM +0100, Borislav Petkov wrote:
> On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote:
> > But talking to the hw people about this is certainly a good idea regardless.
> 
> I'm not seeing it in this thread but I might've missed it too. Anyway,
> I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't
> touch them.
> 
> Do we care?

Which flag do you refer to, exactly?

> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:25                               ` Michael S. Tsirkin
@ 2016-01-13 16:33                                 ` Borislav Petkov
  -1 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote:
> Which flag do you refer to, exactly?

All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:33                                 ` Borislav Petkov
@ 2016-01-13 16:42                                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 05:33:31PM +0100, Borislav Petkov wrote:
> On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote:
> > Which flag do you refer to, exactly?
> 
> All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF

Oh, I think this means we need a "cc" clobber.

This also seems to be a pre-existing bug on !XMM2 CPUs, where mb() already
falls back to the LOCK; ADDL form.

cmpxchg.h gets it right.

I'll send a patch.
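
Roughly the shape of the change - a sketch only, not the actual patch, with
the 32-bit form written out without the alternative() machinery for brevity:

	/* Sketch: declare that the locked ADD clobbers the arithmetic flags. */
	#define mb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory", "cc")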

> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:42                                   ` Michael S. Tsirkin
@ 2016-01-13 16:53                                     ` Borislav Petkov
  -1 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote:
> Oh, I think this means we need a "cc" clobber.

Btw, does your microbenchmark do it too?

Because, the "cc" clobber should cause additional handling of flags,
depending on the context. It won't matter if the context doesn't need
rFLAGS handling in the benchmark but if we start using LOCK; ADD in the
kernel, I can imagine some places where mb() is used and rFLAGS are
live, causing gcc to either reorder code or stash them away...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:53                                     ` Borislav Petkov
  (?)
  (?)
@ 2016-01-13 17:00                                     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 17:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
	Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 05:53:20PM +0100, Borislav Petkov wrote:
> On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote:
> > Oh, I think this means we need a "cc" clobber.
> 
> Btw, does your microbenchmark do it too?

Yes - I fixed it now, but it did not affect the result.
We'd need some code where gcc carries flags around though.

> Because, the "cc" clobber should cause additional handling of flags,
> depending on the context. It won't matter if the context doesn't need
> rFLAGS handling in the benchmark but if we start using LOCK; ADD in the
> kernel, I can imagine some places where mb() is used and rFLAGS are
> live, causing gcc to either reorder code or stash them away...

It will reorder code, but not necessarily for the worse :)
The best I can do is add the "cc" clobber to the kernel and see whether
the binary size grows.
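
A comparison like that would typically go along the lines of the following -
just an illustration of the approach, not part of any patch here:

	# build vmlinux once without and once with the "cc" clobber, then e.g.:
	./scripts/bloat-o-meter vmlinux.before vmlinux.after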

> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
  2016-01-13 16:42                                   ` Michael S. Tsirkin
@ 2016-01-13 18:38                                     ` Linus Torvalds
  -1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-13 18:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Borislav Petkov, H. Peter Anvin, Andy Lutomirski,
	Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, virtualization, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar

On Wed, Jan 13, 2016 at 8:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Oh, I think this means we need a "cc" clobber.

It's probably a good idea to add one.

Historically, gcc doesn't need one on x86, and always considers flags
clobbered. We are probably missing the cc clobber in a *lot* of places
for this reason.

But even if not necessary, it's probably a good thing to add for
documentation, and in case gcc semantics ever change.

                  Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2016-01-13 18:38 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-27 19:53 [PATCH -tip 0/4] A few updates around smp_store_mb() Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] locking/cmpxchg, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 2/4] arch,barrier: Use smp barriers in smp_store_release() Davidlohr Bueso
2015-10-27 20:03   ` Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] lcoking/barriers, arch: " tip-bot for Davidlohr Bueso
2015-10-27 19:53 ` [PATCH 3/4] x86,asm: Re-work smp_store_mb() Davidlohr Bueso
2015-10-27 21:33   ` Linus Torvalds
2015-10-27 22:01     ` Davidlohr Bueso
2015-10-27 22:37     ` Peter Zijlstra
2015-10-28 19:49       ` Davidlohr Bueso
2015-11-02 20:15       ` Davidlohr Bueso
2015-11-03  0:06         ` Linus Torvalds
2015-11-03  1:36           ` Davidlohr Bueso
2016-01-12 13:57           ` Michael S. Tsirkin
2016-01-12 13:57           ` Michael S. Tsirkin
2016-01-12 17:20             ` Linus Torvalds
2016-01-12 17:20               ` Linus Torvalds
2016-01-12 17:45               ` Michael S. Tsirkin
2016-01-12 17:45                 ` Michael S. Tsirkin
2016-01-12 18:04                 ` Linus Torvalds
2016-01-12 18:04                   ` Linus Torvalds
2016-01-12 20:30               ` Andy Lutomirski
2016-01-12 20:54                 ` Linus Torvalds
2016-01-12 20:54                   ` Linus Torvalds
2016-01-12 20:59                   ` Andy Lutomirski
2016-01-12 20:59                     ` Andy Lutomirski
2016-01-12 21:37                     ` Linus Torvalds
2016-01-12 21:37                       ` Linus Torvalds
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-12 22:14                       ` Michael S. Tsirkin
2016-01-13 16:20                       ` Michael S. Tsirkin
2016-01-13 16:20                         ` Michael S. Tsirkin
2016-01-12 22:21                     ` Michael S. Tsirkin
2016-01-12 22:21                       ` Michael S. Tsirkin
2016-01-12 22:55                       ` H. Peter Anvin
2016-01-12 22:55                         ` H. Peter Anvin
2016-01-12 23:24                         ` Linus Torvalds
2016-01-12 23:24                         ` Linus Torvalds
2016-01-13 16:17                           ` Borislav Petkov
2016-01-13 16:17                             ` Borislav Petkov
2016-01-13 16:25                             ` Michael S. Tsirkin
2016-01-13 16:25                               ` Michael S. Tsirkin
2016-01-13 16:33                               ` Borislav Petkov
2016-01-13 16:33                                 ` Borislav Petkov
2016-01-13 16:42                                 ` Michael S. Tsirkin
2016-01-13 16:42                                   ` Michael S. Tsirkin
2016-01-13 16:53                                   ` Borislav Petkov
2016-01-13 16:53                                     ` Borislav Petkov
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 17:00                                     ` Michael S. Tsirkin
2016-01-13 18:38                                   ` Linus Torvalds
2016-01-13 18:38                                     ` Linus Torvalds
2015-10-27 19:53 ` [PATCH 4/4] doc,smp: Remove ambiguous statement in smp_store_mb() Davidlohr Bueso
2015-12-04 12:01   ` [tip:locking/core] locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation tip-bot for Davidlohr Bueso
2015-10-27 23:27 ` [PATCH 1/4] arch,cmpxchg: Remove tas() definitions David Howells
