* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2015-11-03 0:06 ` Linus Torvalds
@ 2015-11-03 1:36 ` Davidlohr Bueso
2016-01-12 13:57 ` Michael S. Tsirkin
2016-01-12 13:57 ` Michael S. Tsirkin
2 siblings, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2015-11-03 1:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul E. McKenney,
Linux Kernel Mailing List, the arch/x86 maintainers,
Davidlohr Bueso
On Mon, 02 Nov 2015, Linus Torvalds wrote:
>On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
>> constantly cheaper (by at least half the latency) than MFENCE. While there
>> was a decent amount of variation, this difference remained rather constant.
>
>Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
>use on old cpu's without one (ie 32-bit).
I'm getting results very close to xchg.
>I'm not actually convinced that mfence is necessarily a good idea. I
>could easily see it being microcode, for example.
Interesting.
>
>At least on my Haswell, the "lock addq" is pretty much exactly half
>the cost of "mfence".
Ok, this coincides with my results on IvB.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2015-11-03 0:06 ` Linus Torvalds
2015-11-03 1:36 ` Davidlohr Bueso
@ 2016-01-12 13:57 ` Michael S. Tsirkin
2016-01-12 13:57 ` Michael S. Tsirkin
2 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 13:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, H. Peter Anvin, Thomas Gleixner,
Paul E. McKenney, Ingo Molnar
On Mon, Nov 02, 2015 at 04:06:46PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 12:15 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > So I ran some experiments on an IvyBridge (2.8GHz) and the cost of XCHG is
> > constantly cheaper (by at least half the latency) than MFENCE. While there
> > was a decent amount of variation, this difference remained rather constant.
>
> Mind testing "lock addq $0,0(%rsp)" instead of mfence? That's what we
> use on old cpu's without one (ie 32-bit).
>
> I'm not actually convinced that mfence is necessarily a good idea. I
> could easily see it being microcode, for example.
>
> At least on my Haswell, the "lock addq" is pretty much exactly half
> the cost of "mfence".
>
> Linus
mfence was high on some traces I was seeing, so I got curious, too:
---->
main.c
---->
extern volatile int x;
volatile int x;
#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif
int main(int argc, char **argv)
{
	int i;
	int j = 1234;

	/*
	 * Test barrier in a loop. We also poke at a volatile variable in an
	 * attempt to make it a bit more realistic - this way there's something
	 * in the store-buffer.
	 */
	for (i = 0; i < 10000000; ++i) {
		x = i - j;
		barrier();
		j = x;
	}
	return 0;
}
---->
Makefile:
---->
ALL = xchg xchgrz lock mfence lfence sfence
CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --

all: ${ALL}

clean:
	rm -f ${ALL}

run: all
	for file in ${ALL}; do echo ${PERF} ./$$file ; ${PERF} ./$$file; done

.PHONY: all clean run

${ALL}: main.c
	${CC} ${CFLAGS} -D$@ -o $@ main.c
---->
Is this a good way to test it?
E.g. on my laptop I get:
perf stat -r 10 --log-fd 1 -- ./xchg
Performance counter stats for './xchg' (10 runs):
53.236967 task-clock # 0.992 CPUs utilized ( +- 0.09% )
10 context-switches # 0.180 K/sec ( +- 1.70% )
0 CPU-migrations # 0.000 K/sec
37 page-faults # 0.691 K/sec ( +- 1.13% )
190,997,612 cycles # 3.588 GHz ( +- 0.04% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,654,850 instructions # 0.42 insns per cycle ( +- 0.01% )
10,122,372 branches # 190.138 M/sec ( +- 0.01% )
4,514 branch-misses # 0.04% of all branches ( +- 3.37% )
0.053642809 seconds time elapsed ( +- 0.12% )
perf stat -r 10 --log-fd 1 -- ./xchgrz
Performance counter stats for './xchgrz' (10 runs):
53.189533 task-clock # 0.997 CPUs utilized ( +- 0.22% )
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
37 page-faults # 0.694 K/sec ( +- 0.75% )
190,785,621 cycles # 3.587 GHz ( +- 0.03% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,602,086 instructions # 0.42 insns per cycle ( +- 0.00% )
10,112,154 branches # 190.115 M/sec ( +- 0.01% )
3,743 branch-misses # 0.04% of all branches ( +- 4.02% )
0.053343693 seconds time elapsed ( +- 0.23% )
perf stat -r 10 --log-fd 1 -- ./lock
Performance counter stats for './lock' (10 runs):
53.096434 task-clock # 0.997 CPUs utilized ( +- 0.16% )
0 context-switches # 0.002 K/sec ( +-100.00% )
0 CPU-migrations # 0.000 K/sec
37 page-faults # 0.693 K/sec ( +- 0.98% )
190,796,621 cycles # 3.593 GHz ( +- 0.02% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,601,376 instructions # 0.42 insns per cycle ( +- 0.01% )
10,112,074 branches # 190.447 M/sec ( +- 0.01% )
3,475 branch-misses # 0.03% of all branches ( +- 1.33% )
0.053252678 seconds time elapsed ( +- 0.16% )
perf stat -r 10 --log-fd 1 -- ./mfence
Performance counter stats for './mfence' (10 runs):
126.376473 task-clock # 0.999 CPUs utilized ( +- 0.21% )
0 context-switches # 0.002 K/sec ( +- 66.67% )
0 CPU-migrations # 0.000 K/sec
36 page-faults # 0.289 K/sec ( +- 0.84% )
456,147,770 cycles # 3.609 GHz ( +- 0.01% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,892,416 instructions # 0.18 insns per cycle ( +- 0.00% )
10,163,220 branches # 80.420 M/sec ( +- 0.01% )
4,653 branch-misses # 0.05% of all branches ( +- 1.27% )
0.126539273 seconds time elapsed ( +- 0.21% )
perf stat -r 10 --log-fd 1 -- ./lfence
Performance counter stats for './lfence' (10 runs):
47.617861 task-clock # 0.997 CPUs utilized ( +- 0.06% )
0 context-switches # 0.002 K/sec ( +-100.00% )
0 CPU-migrations # 0.000 K/sec
36 page-faults # 0.764 K/sec ( +- 0.45% )
170,767,856 cycles # 3.586 GHz ( +- 0.03% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,581,607 instructions # 0.47 insns per cycle ( +- 0.00% )
10,108,508 branches # 212.284 M/sec ( +- 0.00% )
3,320 branch-misses # 0.03% of all branches ( +- 1.12% )
0.047768505 seconds time elapsed ( +- 0.07% )
perf stat -r 10 --log-fd 1 -- ./sfence
Performance counter stats for './sfence' (10 runs):
20.156676 task-clock # 0.988 CPUs utilized ( +- 0.45% )
3 context-switches # 0.159 K/sec ( +- 12.15% )
0 CPU-migrations # 0.000 K/sec
36 page-faults # 0.002 M/sec ( +- 0.87% )
72,212,225 cycles # 3.583 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
80,479,149 instructions # 1.11 insns per cycle ( +- 0.00% )
10,090,785 branches # 500.618 M/sec ( +- 0.01% )
3,626 branch-misses # 0.04% of all branches ( +- 3.59% )
0.020411208 seconds time elapsed ( +- 0.52% )
So mfence is more expensive than locked instructions/xchg; sfence/lfence
are faster still, and xchg and locked instructions are very close if
not the same.
I poked at some 10 Intel and AMD machines; the numbers differ, but the
results seem more or less consistent with this.
From a code-size point of view xchg is longer, and xchgrz pokes at the
red zone, which seems unnecessarily hacky, so good old lock+addl is
probably the best.
There isn't any extra magic behind mfence, is there?
E.g. I think lock orders accesses to WC memory as well,
so apparently mb() can be redefined unconditionally, without
looking at XMM2:
--->
x86: drop mfence in favor of lock+addl
mfence appears to be way slower than a locked instruction - let's use
lock+add unconditionally, same as we always did on old 32-bit.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
I'll play with this some more before posting this as a
non-stand alone patch. Is there a macro-benchmark where mb
is prominent?
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index a584e1c..f0d36e2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -15,15 +15,15 @@
* Some non-Intel clones support out of order store. wmb() ceases to be a
* nop for these.
*/
-#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
+#define mb() asm volatile("lock; addl $0,0(%%esp)":::"memory")
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
#else
+#define mb() asm volatile("lock; addl $0,0(%%rsp)":::"memory")
#define rmb() asm volatile("lfence":::"memory")
#define wmb() asm volatile("sfence" ::: "memory")
#endif
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb() rmb()
#else
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 13:57 ` Michael S. Tsirkin
@ 2016-01-12 17:20 ` Linus Torvalds
0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 17:20 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
Paul E. McKenney, Linux Kernel Mailing List,
the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
virtualization
On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> #ifdef xchgrz
> /* same as xchg but poking at gcc red zone */
> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> #endif
That's not safe in general. gcc might be using its redzone, so doing
xchg into it is unsafe.
But..
> Is this a good way to test it?
.. it's fine for some basic testing. It doesn't show any subtle
interactions (ie some operations may have different dynamic behavior
when the write buffers are busy etc), but as a baseline for "how fast
can things go" the stupid raw loop is fine. And while the xchg into
the redzone wouldn't be acceptable as a real implementation, for
timing testing it's likely fine (ie you aren't hitting the problem it
can cause).
> So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> are slightly faster, and xchg and locked instructions are very close if
> not the same.
Note that we never actually *use* lfence/sfence. They are pointless
instructions when looking at CPU memory ordering, because for pure CPU
memory ordering stores and loads are already ordered.
The only reason to use lfence/sfence is after you've used nontemporal
stores for IO. That's very very rare in the kernel. So I wouldn't
worry about those.
But yes, it does sound like mfence is just a bad idea too.
> There isn't any extra magic behind mfence, is there?
No.
I think the only issue is that there has never been any real reason
for CPU designers to try to make mfence go particularly fast. Nobody
uses it, again with the exception of some odd loops that use
nontemporal stores, and for those the cost tends to always be about
the nontemporal accesses themselves (often to things like GPU memory
over PCIe), and the mfence cost of a few extra cycles is negligible.
The reason "lock ; add $0" has generally been the fastest we've found
is simply that locked ops have been important for CPU designers.
So I think the patch is fine, and we should likely drop the use of mfence..
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 17:20 ` Linus Torvalds
@ 2016-01-12 17:45 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 17:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
Paul E. McKenney, Linux Kernel Mailing List,
the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
virtualization
On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > #ifdef xchgrz
> > /* same as xchg but poking at gcc red zone */
> > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> > #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
> > Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzoen wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).
>
> > So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> > are slightly faster, and xchg and locked instructions are very close if
> > not the same.
>
> Note that we never actually *use* lfence/sfence. They are pointless
> instructions when looking at CPU memory ordering, because for pure CPU
> memory ordering stores and loads are already ordered.
>
> The only reason to use lfence/sfence is after you've used nontemporal
> stores for IO.
By the way, the comment in barrier.h says:
/*
* Some non-Intel clones support out of order store. wmb() ceases to be
* a nop for these.
*/
and while the 1st sentence may well be true, if you have
an SMP system with out of order stores, making wmb
not a nop will not help.
Additionally as you point out, wmb is not a nop even
for regular intel CPUs because of these weird use-cases.
Drop this comment?
> That's very very rare in the kernel. So I wouldn't
> worry about those.
Right - I'll leave these alone, whoever wants to optimize this path will
have to do the necessary research.
> But yes, it does sound like mfence is just a bad idea too.
>
> > There isn't any extra magic behind mfence, is there?
>
> No.
>
> I think the only issue is that there has never been any real reason
> for CPU designers to try to make mfence go particularly fast. Nobody
> uses it, again with the exception of some odd loops that use
> nontemporal stores, and for those the cost tends to always be about
> the nontemporal accesses themselves (often to things like GPU memory
> over PCIe), and the mfence cost of a few extra cycles is negligible.
>
> The reason "lock ; add $0" has generally been the fastest we've found
> is simply that locked ops have been important for CPU designers.
>
> So I think the patch is fine, and we should likely drop the use of mfence..
>
> Linus
OK so should I repost after a bit more testing? I don't believe this
will affect the kernel build benchmark, but I'll try :)
--
MST
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 17:45 ` Michael S. Tsirkin
@ 2016-01-12 18:04 ` Linus Torvalds
-1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 18:04 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
Paul E. McKenney, Linux Kernel Mailing List,
the arch/x86 maintainers, Davidlohr Bueso, H. Peter Anvin,
virtualization
On Tue, Jan 12, 2016 at 9:45 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> By the way, the comment in barrier.h says:
>
> /*
> * Some non-Intel clones support out of order store. wmb() ceases to be
> * a nop for these.
> */
>
> and while the 1st sentence may well be true, if you have
> an SMP system with out of order stores, making wmb
> not a nop will not help.
>
> Additionally as you point out, wmb is not a nop even
> for regular intel CPUs because of these weird use-cases.
>
> Drop this comment?
We should drop it, yes. We dropped support for CONFIG_X86_OOSTORE
almost two years ago. See commit 09df7c4c8097 ("x86: Remove
CONFIG_X86_OOSTORE") and it was questionable for a long time even
before that (perhaps ever).
So the comment is stale.
We *do* still use the non-nop rmb/wmb for IO barriers, but even that
is generally questionable. See our "copy_user_64.S" for an actual use
of "movnt" followed by sfence. There's a couple of other cases too. So
that's all correct, but the point is that when we use "movnt" we don't
actually use "wmb()", we are doing assembly, and the assembly should
just use sfence directly.
So it's actually very questionable to ever make even the IO
wmb()/rmb() functions use lfence/sfence. They should never really need
it.
But at the same time, I _really_ don't think we care enough. I'd
rather leave those non-smp barrier cases alone as historical unless
somebody can point to a case where they care about the performance.
We also do have the whole PPRO_FENCE thing, which we can hopefully get
rid of at some point too.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 17:20 ` Linus Torvalds
(?)
(?)
@ 2016-01-12 20:30 ` Andy Lutomirski
2016-01-12 20:54 ` Linus Torvalds
-1 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2016-01-12 20:30 UTC (permalink / raw)
To: Linus Torvalds, Michael S. Tsirkin
Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, H. Peter Anvin, Thomas Gleixner,
Paul E. McKenney, Ingo Molnar
On 01/12/2016 09:20 AM, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> #ifdef xchgrz
>> /* same as xchg but poking at gcc red zone */
>> #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
>> #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
>> Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
the redzone wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).
I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even
64) was better because it avoided stomping on very-likely-to-be-hot
write buffers.
--Andy
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 20:30 ` Andy Lutomirski
@ 2016-01-12 20:54 ` Linus Torvalds
0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 20:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Michael S. Tsirkin, Davidlohr Bueso, Davidlohr Bueso,
Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> was better because it avoided stomping on very-likely-to-be-hot write
> buffers.
I suspect it could go either way. You want a small constant (for the
instruction size), but any small constant is likely to be within the
current stack frame anyway. I don't think 0(%rsp) is particularly
likely to have a spill on it right then and there, but who knows..
And 64(%rsp) is possibly going to be cold in the L1 cache, especially
if it's just after a deep function call. Which it might be. So it
might work the other way.
So my guess would be that you wouldn't be able to measure the
difference. It might be there, but probably too small to really see in
any noise.
But numbers talk, bullshit walks. It would be interesting to be proven wrong.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 20:54 ` Linus Torvalds
@ 2016-01-12 20:59 ` Andy Lutomirski
-1 siblings, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2016-01-12 20:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
>> was better because it avoided stomping on very-likely-to-be-hot write
>> buffers.
>
> I suspect it could go either way. You want a small constant (for the
> instruction size), but any small constant is likely to be within the
> current stack frame anyway. I don't think 0(%rsp) is particularly
> likely to have a spill on it right then and there, but who knows..
>
> And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> if it's just after a deep function call. Which it might be. So it
> might work the other way.
>
> So my guess would be that you wouldn't be able to measure the
> difference. It might be there, but probably too small to really see in
> any noise.
>
> But numbers talk, bullshit walks. It would be interesting to be proven wrong.
Here's an article with numbers:
http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
I think they're suggesting using a negative offset, which is safe as
long as it doesn't page fault, even though we have the redzone
disabled.
--Andy
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 20:59 ` Andy Lutomirski
@ 2016-01-12 21:37 ` Linus Torvalds
-1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 21:37 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andy Lutomirski, Michael S. Tsirkin, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
Well, that's with the busy loop and one set of code generation. It
doesn't show the "oops, deeper stack isn't even in the cache any more
due to call chains" issue.
But yes:
> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.
I think a negative offset might work very well. Partly exactly
*because* we have the redzone disabled: we know that inside the
kernel, we'll never have any live stack frame accesses under the stack
pointer, so "-4(%rsp)" sounds good to me. There should never be any
pending writes in the write buffer, because even if it *was* live, it
would have been read off first.
Yeah, it potentially does extend the stack cache footprint by another
4 bytes, but that sounds very benign.
So perhaps it might be worth trying to switch the "mfence" to "lock ;
addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
for x86-32.
I'd still want to see somebody try to benchmark it. I doubt it's
noticeable, but making changes because you think it might save a few
cycles without then even measuring it is just wrong.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 21:37 ` Linus Torvalds
(?)
@ 2016-01-12 22:14 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 22:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.
>
> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus
Oops, I posted v2 with just offset 0 before reading
the rest of this thread.
I did try with offset 0 and didn't measure any
change on any perf bench test, or on kernel build.
I wonder which benchmark stresses smp_mb the most.
I'll look into using a negative offset.
--
MST
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 21:37 ` Linus Torvalds
@ 2016-01-13 16:20 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Andy Lutomirski, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 01:37:38PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 12:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Here's an article with numbers:
> >
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> Well, that's with the busy loop and one set of code generation. It
> doesn't show the "oops, deeper stack isn't even in the cache any more
> due to call chains" issue.
It's an interesting read, thanks!
So %rsp is read on return from a function, I think. I added a function call
and, sure enough, it slows the add 0(%rsp) variant down. It's still faster
than mfence for me, though! Testing code + results below. Reaching below the
stack, or allocating an extra 4 bytes above the stack pointer, gives us back
the performance.
> But yes:
>
> > I think they're suggesting using a negative offset, which is safe as
> > long as it doesn't page fault, even though we have the redzone
> > disabled.
>
> I think a negative offset might work very well. Partly exactly
> *because* we have the redzone disabled: we know that inside the
> kernel, we'll never have any live stack frame accesses under the stack
> pointer, so "-4(%rsp)" sounds good to me. There should never be any
> pending writes in the write buffer, because even if it *was* live, it
> would have been read off first.
>
> Yeah, it potentially does extend the stack cache footprint by another
> 4 bytes, but that sounds very benign.
>
> So perhaps it might be worth trying to switch the "mfence" to "lock ;
> addl $0,-4(%rsp)" in the kernel for x86-64, and remove the alternate
> for x86-32.
>
>
> I'd still want to see somebody try to benchmark it. I doubt it's
> noticeable, but making changes because you think it might save a few
> cycles without then even measuring it is just wrong.
>
> Linus
I'll try this in the kernel now, will report, though I'm
not optimistic a high level benchmark can show this
kind of thing.
---------------
main.c:
---------------
extern volatile int x;
volatile int x;
#ifdef __x86_64__
#define SP "rsp"
#else
#define SP "esp"
#endif
#ifdef lock
#define barrier() do { int p; asm volatile ("lock; addl $0,%0" ::"m"(p): "memory"); } while (0)
#endif
#ifdef locksp
#define barrier() asm("lock; addl $0,0(%%" SP ")" ::: "memory")
#endif
#ifdef lockrz
#define barrier() asm("lock; addl $0,-4(%%" SP ")" ::: "memory")
#endif
#ifdef xchg
#define barrier() do { int p; int ret; asm volatile ("xchgl %0, %1;": "=r"(ret) : "m"(p): "memory", "cc"); } while (0)
#endif
#ifdef xchgrz
/* same as xchg but poking at gcc red zone */
#define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
#endif
#ifdef mfence
#define barrier() asm("mfence" ::: "memory")
#endif
#ifdef lfence
#define barrier() asm("lfence" ::: "memory")
#endif
#ifdef sfence
#define barrier() asm("sfence" ::: "memory")
#endif
void __attribute__ ((noinline)) test(int i, int *j)
{
/*
* Test barrier in a loop. We also poke at a volatile variable in an
* attempt to make it a bit more realistic - this way there's something
* in the store-buffer.
*/
x = i - *j;
barrier();
*j = x;
}
int main(int argc, char **argv)
{
int i;
int j = 1234;
for (i = 0; i < 10000000; ++i)
test(i, &j);
return 0;
}
---------------
ALL = xchg xchgrz lock locksp lockrz mfence lfence sfence
CC = gcc
CFLAGS += -Wall -O2 -ggdb
PERF = perf stat -r 10 --log-fd 1 --
TIME = /usr/bin/time -f %e
FILTER = cat
all: ${ALL}
clean:
rm -f ${ALL}
run: all
for file in ${ALL}; do echo ${RUN} ./$$file "|" ${FILTER}; ${RUN} ./$$file | ${FILTER}; done
perf time: run
time: RUN=${TIME}
perf: RUN=${PERF}
perf: FILTER=grep elapsed
.PHONY: all clean run perf time
xchgrz: CFLAGS += -mno-red-zone
${ALL}: main.c
${CC} ${CFLAGS} -D$@ -o $@ main.c
--------------------------------------------
perf stat -r 10 --log-fd 1 -- ./xchg | grep elapsed
0.080420565 seconds time elapsed ( +- 2.31% )
perf stat -r 10 --log-fd 1 -- ./xchgrz | grep elapsed
0.087798571 seconds time elapsed ( +- 2.58% )
perf stat -r 10 --log-fd 1 -- ./lock | grep elapsed
0.083023724 seconds time elapsed ( +- 2.44% )
perf stat -r 10 --log-fd 1 -- ./locksp | grep elapsed
0.102880750 seconds time elapsed ( +- 0.13% )
perf stat -r 10 --log-fd 1 -- ./lockrz | grep elapsed
0.084917420 seconds time elapsed ( +- 3.28% )
perf stat -r 10 --log-fd 1 -- ./mfence | grep elapsed
0.156014715 seconds time elapsed ( +- 0.16% )
perf stat -r 10 --log-fd 1 -- ./lfence | grep elapsed
0.077731443 seconds time elapsed ( +- 0.12% )
perf stat -r 10 --log-fd 1 -- ./sfence | grep elapsed
0.036655741 seconds time elapsed ( +- 0.21% )
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 20:59 ` Andy Lutomirski
@ 2016-01-12 22:21 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-12 22:21 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, H. Peter Anvin,
Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 12:59:58PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 12, 2016 at 12:54 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, Jan 12, 2016 at 12:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> I recall reading somewhere that lock addl $0, 32(%rsp) or so (maybe even 64)
> >> was better because it avoided stomping on very-likely-to-be-hot write
> >> buffers.
> >
> > I suspect it could go either way. You want a small constant (for the
> > instruction size), but any small constant is likely to be within the
> > current stack frame anyway. I don't think 0(%rsp) is particularly
> > likely to have a spill on it right then and there, but who knows..
> >
> > And 64(%rsp) is possibly going to be cold in the L1 cache, especially
> > if it's just after a deep function call. Which it might be. So it
> > might work the other way.
> >
> > So my guess would be that you wouldn't be able to measure the
> > difference. It might be there, but probably too small to really see in
> > any noise.
> >
> > But numbers talk, bullshit walks. It would be interesting to be proven wrong.
>
> Here's an article with numbers:
>
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> I think they're suggesting using a negative offset, which is safe as
> long as it doesn't page fault, even though we have the redzone
> disabled.
>
> --Andy
OK so I'll have to tweak the test to put something
on the stack to measure the difference: my test tweaks a
global variable instead.
I'll try that by tomorrow.
I couldn't measure any difference between mfence and lock+addl
except in a micro-benchmark, but hey since we are tweaking this,
let's do the optimal thing.
--
MST
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 22:21 ` Michael S. Tsirkin
@ 2016-01-12 22:55 ` H. Peter Anvin
-1 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2016-01-12 22:55 UTC (permalink / raw)
To: Michael S. Tsirkin, Andy Lutomirski
Cc: Linus Torvalds, Andy Lutomirski, Davidlohr Bueso,
Davidlohr Bueso, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, Thomas Gleixner,
Paul E. McKenney, Ingo Molnar
On 01/12/16 14:21, Michael S. Tsirkin wrote:
>
> OK, so I'll have to tweak the test to put something
> on the stack to measure the difference: my test tweaks a
> global variable instead.
> I'll try that by tomorrow.
>
> I couldn't measure any difference between mfence and lock+addl
> except in a micro-benchmark, but hey since we are tweaking this,
> let's do the optimal thing.
>
Be careful with this: if it only shows up in a microbenchmark, we may
introduce a hard-to-debug regression for no real benefit.
-hpa
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 22:55 ` H. Peter Anvin
@ 2016-01-12 23:24 ` Linus Torvalds
-1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-12 23:24 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
Linux Kernel Mailing List, Michael S. Tsirkin,
the arch/x86 maintainers, Andy Lutomirski, Andy Lutomirski,
Paul E. McKenney, Thomas Gleixner, virtualization, Ingo Molnar
On Tue, Jan 12, 2016 at 2:55 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Be careful with this: if it only shows up in a microbenchmark, we may
> introduce a hard-to-debug regression for no real benefit.
So I can pretty much guarantee that it shouldn't regress from a
correctness angle, since we rely *heavily* on locked instructions
being barriers, in locking and in various other situations.
Indeed, much more so than we ever rely on "smp_mb()". The places that
rely on smp_mb() are pretty few in the end.
So I think the only issue is whether sometimes "mfence" might be
faster. So far, I've never actually heard of that being the case. The
fence instructions have always sucked when I've seen them.
But talking to the hw people about this is certainly a good idea regardless.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-12 23:24 ` Linus Torvalds
@ 2016-01-13 16:17 ` Borislav Petkov
0 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, Michael S. Tsirkin, Andy Lutomirski,
Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso,
Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, Thomas Gleixner,
Paul E. McKenney, Ingo Molnar
On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote:
> But talking to the hw people about this is certainly a good idea regardless.
I'm not seeing it in this thread but I might've missed it too. Anyway,
I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't
touch them.
Do we care?
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:17 ` Borislav Petkov
@ 2016-01-13 16:25 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:25 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Wed, Jan 13, 2016 at 05:17:04PM +0100, Borislav Petkov wrote:
> On Tue, Jan 12, 2016 at 03:24:05PM -0800, Linus Torvalds wrote:
> > But talking to the hw people about this is certainly a good idea regardless.
>
> I'm not seeing it in this thread but I might've missed it too. Anyway,
> I'm being reminded that the ADD will change rFLAGS while MFENCE doesn't
> touch them.
>
> Do we care?
Which flag do you refer to, exactly?
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:25 ` Michael S. Tsirkin
@ 2016-01-13 16:33 ` Borislav Petkov
-1 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote:
> Which flag do you refer to, exactly?
All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF
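A tiny demo of that point (hypothetical, for illustration only; it relies on GCC's flag-output constraints, available since GCC 6):

```c
/* Hypothetical demo: LOCK; ADDL recomputes the arithmetic flags.
 * "=@ccz" asks the compiler for the zero flag as seen after the asm.
 * Adding 0 to a zero-valued operand produces a zero result, so ZF
 * comes back set -- something MFENCE would never do, since it leaves
 * rFLAGS untouched. */
static int zf_after_lock_add(void)
{
	unsigned int v = 0;
	int zf;

	__asm__ __volatile__("lock; addl $0, %[mem]"
			     : [mem] "+m" (v), "=@ccz" (zf)
			     :
			     : "memory");
	return zf;
}
```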
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:33 ` Borislav Petkov
@ 2016-01-13 16:42 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 16:42 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Wed, Jan 13, 2016 at 05:33:31PM +0100, Borislav Petkov wrote:
> On Wed, Jan 13, 2016 at 06:25:21PM +0200, Michael S. Tsirkin wrote:
> > Which flag do you refer to, exactly?
>
> All the flags in rFLAGS which ADD modifies: OF,SF,ZF,AF,PF,CF
Oh, I think this means we need a "cc" clobber.
This also seems to be a bug on !XMM CPUs.
cmpxchg.h gets it right.
I'll send a patch.
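As a sketch (not the actual patch), the barrier with the extra clobber might look like this; the macro name is illustrative, not the kernel's:

```c
/* Hypothetical sketch of the fix: spell out both clobbers, so the
 * compiler knows LOCK; ADDL destroys the arithmetic flags in addition
 * to acting as a memory barrier. */
#define my_mb() \
	__asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc")

/* Trivial smoke test: the barrier must not corrupt surrounding state. */
static int my_mb_smoke(void)
{
	int x = 1;
	my_mb();
	return x;
}
```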
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:42 ` Michael S. Tsirkin
@ 2016-01-13 16:53 ` Borislav Petkov
-1 siblings, 0 replies; 56+ messages in thread
From: Borislav Petkov @ 2016-01-13 16:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Linus Torvalds, H. Peter Anvin, Andy Lutomirski, Andy Lutomirski,
Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, Thomas Gleixner, Paul E. McKenney, Ingo Molnar
On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote:
> Oh, I think this means we need a "cc" clobber.
Btw, does your microbenchmark do it too?
Because, the "cc" clobber should cause additional handling of flags,
depending on the context. It won't matter if the context doesn't need
rFLAGS handling in the benchmark but if we start using LOCK; ADD in the
kernel, I can imagine some places where mb() is used and rFLAGS are
live, causing gcc to either reorder code or stash them away...
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:53 ` Borislav Petkov
@ 2016-01-13 17:00 ` Michael S. Tsirkin
-1 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2016-01-13 17:00 UTC (permalink / raw)
To: Borislav Petkov
Cc: Davidlohr Bueso, Davidlohr Bueso, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
Andy Lutomirski, Andy Lutomirski, H. Peter Anvin,
Paul E. McKenney, Thomas Gleixner, virtualization,
Linus Torvalds, Ingo Molnar
On Wed, Jan 13, 2016 at 05:53:20PM +0100, Borislav Petkov wrote:
> On Wed, Jan 13, 2016 at 06:42:48PM +0200, Michael S. Tsirkin wrote:
> > Oh, I think this means we need a "cc" clobber.
>
> Btw, does your microbenchmark do it too?
Yes - I fixed it now, but it did not affect the result.
We'd need some code where gcc carries flags around though.
> Because, the "cc" clobber should cause additional handling of flags,
> depending on the context. It won't matter if the context doesn't need
> rFLAGS handling in the benchmark but if we start using LOCK; ADD in the
> kernel, I can imagine some places where mb() is used and rFLAGS are
> live, causing gcc to either reorder code or stash them away...
It will reorder code but not necessarily for the worse :)
Best I can do: I'll add the "cc" clobber to the kernel and see whether
the binary size grows.
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/4] x86,asm: Re-work smp_store_mb()
2016-01-13 16:42 ` Michael S. Tsirkin
@ 2016-01-13 18:38 ` Linus Torvalds
-1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2016-01-13 18:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Borislav Petkov, H. Peter Anvin, Andy Lutomirski,
Andy Lutomirski, Davidlohr Bueso, Davidlohr Bueso,
Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, virtualization, Thomas Gleixner,
Paul E. McKenney, Ingo Molnar
On Wed, Jan 13, 2016 at 8:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Oh, I think this means we need a "cc" clobber.
It's probably a good idea to add one.
Historically, gcc doesn't need one on x86, and always considers flags
clobbered. We are probably missing the cc clobber in a *lot* of places
for this reason.
But even if not necessary, it's probably a good thing to add for
documentation, and in case gcc semantics ever change.
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread