* [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
@ 2015-08-03 11:29 Ian Campbell
2015-08-03 11:51 ` Stefano Stabellini
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Ian Campbell @ 2015-08-03 11:29 UTC (permalink / raw)
To: stefano.stabellini, julien.grall, xen-devel, jbeulich, andrew.cooper3
Cc: wei.liu2, David Vrabel, Ian Campbell
From: David Vrabel <david.vrabel@citrix.com>
Instead of cpu_relax() while spinning and observing the ticket head,
introduce arch_lock_relax(), which executes a WFE instruction. After
the ticket head is changed, call arch_lock_signal() to execute an SEV
instruction (with the required DSB first) to wake any spinners.
This should improve power consumption when locks are contended and
spinning.
For consistency also move arch_lock_(acquire|release)_barrier to
asm/spinlock.h.
Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
only on amd64.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2 (ijc):
Add dsb(ishst) to spin_signal.
s/spin_(relax|signal)/arch_lock_\1/
Move arch_lock_(acquire|release)_barrier to asm/spinlock.h
(dropped Andy's Reviewed-by due to this change)
In principle the SEV could be made unnecessary on arm64, but this
requires a new hook before the wait loop as well as changing
observe_head and _spin_unlock to use the Acquire/Release instructions
instead of the non-atomic loads and stores used today, which is a lot
more refactoring of the generic code than I think we can be bothered
with at this stage.
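For reference, the SEV-free arm64 scheme alluded to here relies on WFE being
woken when a store clears the exclusive monitor armed by a load-acquire; a
hypothetical sketch (not part of this patch, register usage illustrative):

```asm
/* hypothetical arm64 wait loop; not part of this patch */
spin:
    ldaxrh  w1, [x0]      /* load-acquire head; arms exclusive monitor */
    cmp     w1, w2        /* has our ticket come up? */
    b.eq    got_lock
    wfe                   /* a store to [x0] clears the monitor, waking us */
    b       spin

/* unlock side: store-release of head+1; no explicit SEV required */
unlock:
    stlrh   w1, [x0]
```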
4.6: I'm in two minds about this: the lack of WFE in the ticket
spinlocks is not a regression (the old locks lacked them as well,
oops!). On the other hand, spinning like this isn't good. I think
overall I'm inclined to say this should wait for 4.7 but be a
candidate for backport to 4.6.1.
---
xen/common/spinlock.c | 5 +++--
xen/include/asm-arm/spinlock.h | 9 ++++++++-
xen/include/asm-arm/system.h | 3 ---
xen/include/asm-x86/spinlock.h | 14 ++++++++++++++
xen/include/asm-x86/system.h | 11 -----------
5 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 29149d1..7f89694 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -141,7 +141,7 @@ void _spin_lock(spinlock_t *lock)
while ( tickets.tail != observe_head(&lock->tickets) )
{
LOCK_PROFILE_BLOCK;
- cpu_relax();
+ arch_lock_relax();
}
LOCK_PROFILE_GOT;
preempt_disable();
@@ -170,6 +170,7 @@ void _spin_unlock(spinlock_t *lock)
preempt_enable();
LOCK_PROFILE_REL;
add_sized(&lock->tickets.head, 1);
+ arch_lock_signal();
}
void _spin_unlock_irq(spinlock_t *lock)
@@ -228,7 +229,7 @@ void _spin_barrier(spinlock_t *lock)
if ( sample.head != sample.tail )
{
while ( observe_head(&lock->tickets) == sample.head )
- cpu_relax();
+ arch_lock_relax();
#ifdef LOCK_PROFILE
if ( lock->profile )
{
diff --git a/xen/include/asm-arm/spinlock.h b/xen/include/asm-arm/spinlock.h
index 81955d1..8cdf9e1 100644
--- a/xen/include/asm-arm/spinlock.h
+++ b/xen/include/asm-arm/spinlock.h
@@ -1,6 +1,13 @@
#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H
-/* Nothing ARM specific. */
+#define arch_lock_acquire_barrier() smp_mb()
+#define arch_lock_release_barrier() smp_mb()
+
+#define arch_lock_relax() wfe()
+#define arch_lock_signal() do { \
+ dsb(ishst); \
+ sev(); \
+} while(0)
#endif /* __ASM_SPINLOCK_H */
diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
index f0e222f..2eb96e8 100644
--- a/xen/include/asm-arm/system.h
+++ b/xen/include/asm-arm/system.h
@@ -53,9 +53,6 @@
#define arch_fetch_and_add(x, v) __sync_fetch_and_add(x, v)
-#define arch_lock_acquire_barrier() smp_mb()
-#define arch_lock_release_barrier() smp_mb()
-
extern struct vcpu *__context_switch(struct vcpu *prev, struct vcpu *next);
#endif
diff --git a/xen/include/asm-x86/spinlock.h b/xen/include/asm-x86/spinlock.h
index 7d69e75..70a85af 100644
--- a/xen/include/asm-x86/spinlock.h
+++ b/xen/include/asm-x86/spinlock.h
@@ -4,4 +4,18 @@
#define _raw_read_unlock(l) \
asm volatile ( "lock; dec%z0 %0" : "+m" ((l)->lock) :: "memory" )
+/*
+ * On x86 the only reordering is of reads with older writes. In the
+ * lock case, the read in observe_head() can only be reordered with
+ * writes that precede it, and moving a write _into_ a locked section
+ * is OK. In the release case, the write in add_sized() can only be
+ * reordered with reads that follow it, and hoisting a read _into_ a
+ * locked region is OK.
+ */
+#define arch_lock_acquire_barrier() barrier()
+#define arch_lock_release_barrier() barrier()
+
+#define arch_lock_relax() cpu_relax()
+#define arch_lock_signal()
+
#endif /* __ASM_SPINLOCK_H */
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 25a6a2a..9fb70f5 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -185,17 +185,6 @@ static always_inline unsigned long __xadd(
#define set_mb(var, value) do { xchg(&var, value); } while (0)
#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-/*
- * On x86 the only reordering is of reads with older writes. In the
- * lock case, the read in observe_head() can only be reordered with
- * writes that precede it, and moving a write _into_ a locked section
- * is OK. In the release case, the write in add_sized() can only be
- * reordered with reads that follow it, and hoisting a read _into_ a
- * locked region is OK.
- */
-#define arch_lock_acquire_barrier() barrier()
-#define arch_lock_release_barrier() barrier()
-
#define local_irq_disable() asm volatile ( "cli" : : : "memory" )
#define local_irq_enable() asm volatile ( "sti" : : : "memory" )
--
2.1.4
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-08-03 11:29 [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV Ian Campbell
@ 2015-08-03 11:51 ` Stefano Stabellini
2015-08-03 11:58 ` Ian Campbell
2015-08-03 12:00 ` Andrew Cooper
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2015-08-03 11:51 UTC (permalink / raw)
To: Ian Campbell
Cc: wei.liu2, stefano.stabellini, andrew.cooper3, xen-devel,
julien.grall, David Vrabel, jbeulich
On Mon, 3 Aug 2015, Ian Campbell wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Instead of cpu_relax() while spinning and observing the ticket head,
> introduce arch_lock_relax() which executes a WFE instruction. After
> the ticket head is changed call arch_lock_signal() to execute an SEV
> instruction (with the required DSB first) to wake any spinners.
>
> This should improve power consumption when locks are contended and
> spinning.
>
> For consistency also move arch_lock_(acquire|release)_barrier to
> asm/spinlock.h.
>
> Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> only on amd64.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
I don't know why you are moving arch_lock_(acquire|release)_barrier to
spinlock.h as part of this patch; I would have thought that sticking to
a single goal would have been better for this kind of change.
The code should work, though.
> v2 (ijc):
> Add dsb(ishst) to spin_signal.
> s/spin_(relax|signal)/arch_lock_\1/
> Move arch_lock_(acquire|release)_barrier to asm/spinlock.h
> (dropped Andy's Reviewed-by due to this change)
>
> In principle the SEV could be made unnecessary on arm64, but this
> requires a new hook before the wait loop as well as changing
> observe_head and _spin_unlock to use the Acquire/Release instructions
> instead of the non-atomic loads and stores used today, which is a lot
> more refactoring of the generic code than I think we can be bothered
> with at this stage.
>
> 4.6: I'm in two minds about this, the lack of WFE in the ticket
> spinlocks is not a regression (the old locks lacked them as well,
> oops!). On the other hand, spinning like this isn't good. I think
> overall I'm inclined to say this should wait for 4.7 but be a
> candidate for backport to 4.6.1.
I agree
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-08-03 11:51 ` Stefano Stabellini
@ 2015-08-03 11:58 ` Ian Campbell
0 siblings, 0 replies; 9+ messages in thread
From: Ian Campbell @ 2015-08-03 11:58 UTC (permalink / raw)
To: Stefano Stabellini
Cc: wei.liu2, andrew.cooper3, xen-devel, julien.grall, David Vrabel,
jbeulich
On Mon, 2015-08-03 at 12:51 +0100, Stefano Stabellini wrote:
> On Mon, 3 Aug 2015, Ian Campbell wrote:
> > From: David Vrabel <david.vrabel@citrix.com>
> >
> > Instead of cpu_relax() while spinning and observing the ticket head,
> > introduce arch_lock_relax() which executes a WFE instruction. After
> > the ticket head is changed call arch_lock_signal() to execute an SEV
> > instruction (with the required DSB first) to wake any spinners.
> >
> > This should improve power consumption when locks are contended and
> > spinning.
> >
> > For consistency also move arch_lock_(acquire|release)_barrier to
> > asm/spinlock.h.
> >
> > Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> > only on amd64.
> >
> > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> > [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier,
> > test]
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>
> I don't know why you are moving arch_lock_(acquire|release)_barrier to
> spinlock.h as part of this patch,
It's literally explained two paragraphs above... "For consistency", i.e. to
put arch_lock_* in the same place.
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-08-03 11:29 [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV Ian Campbell
2015-08-03 11:51 ` Stefano Stabellini
@ 2015-08-03 12:00 ` Andrew Cooper
2015-08-11 15:07 ` Jan Beulich
2015-09-15 10:28 ` Ian Campbell
3 siblings, 0 replies; 9+ messages in thread
From: Andrew Cooper @ 2015-08-03 12:00 UTC (permalink / raw)
To: Ian Campbell, stefano.stabellini, julien.grall, xen-devel, jbeulich
Cc: wei.liu2, David Vrabel
On 03/08/15 12:29, Ian Campbell wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Instead of cpu_relax() while spinning and observing the ticket head,
> introduce arch_lock_relax() which executes a WFE instruction. After
> the ticket head is changed call arch_lock_signal() to execute an SEV
> instruction (with the required DSB first) to wake any spinners.
>
> This should improve power consumption when locks are contended and
> spinning.
>
> For consistency also move arch_lock_(acquire|release)_barrier to
> asm/spinlock.h.
>
> Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> only on amd64.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
x86 bits Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-08-03 11:29 [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV Ian Campbell
2015-08-03 11:51 ` Stefano Stabellini
2015-08-03 12:00 ` Andrew Cooper
@ 2015-08-11 15:07 ` Jan Beulich
2015-09-15 11:17 ` Ian Campbell
2015-09-15 10:28 ` Ian Campbell
3 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2015-08-11 15:07 UTC (permalink / raw)
To: David Vrabel, Ian Campbell
Cc: julien.grall, andrew.cooper3, xen-devel, wei.liu2, stefano.stabellini
>>> On 03.08.15 at 13:29, <ian.campbell@citrix.com> wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Instead of cpu_relax() while spinning and observing the ticket head,
> introduce arch_lock_relax() which executes a WFE instruction. After
> the ticket head is changed call arch_lock_signal() to execute an SEV
> instruction (with the required DSB first) to wake any spinners.
>
> This should improve power consumption when locks are contended and
> spinning.
So why not use MONITOR/MWAIT on x86 for the same purpose?
Jan
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-08-03 11:29 [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV Ian Campbell
` (2 preceding siblings ...)
2015-08-11 15:07 ` Jan Beulich
@ 2015-09-15 10:28 ` Ian Campbell
2015-09-15 10:39 ` Stefano Stabellini
3 siblings, 1 reply; 9+ messages in thread
From: Ian Campbell @ 2015-09-15 10:28 UTC (permalink / raw)
To: stefano.stabellini, julien.grall, xen-devel, jbeulich, andrew.cooper3
Cc: wei.liu2, David Vrabel
On Mon, 2015-08-03 at 12:29 +0100, Ian Campbell wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> Instead of cpu_relax() while spinning and observing the ticket head,
> introduce arch_lock_relax() which executes a WFE instruction. After
> the ticket head is changed call arch_lock_signal() to execute an SEV
> instruction (with the required DSB first) to wake any spinners.
>
> > This should improve power consumption when locks are contended and
> spinning.
>
> For consistency also move arch_lock_(acquire|release)_barrier to
> asm/spinlock.h.
>
> Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> only on amd64.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Andy has given his Reviewed-by, but this change still lacks an Ack from
the ARM side. Stefano, was your previous "The code should work though."
intended as an Ack?
* Re: [PATCH v2] arm: reduce power use by contended spin locks with WFE/SEV
2015-09-15 10:28 ` Ian Campbell
@ 2015-09-15 10:39 ` Stefano Stabellini
2015-09-15 11:17 ` Ian Campbell
0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2015-09-15 10:39 UTC (permalink / raw)
To: Ian Campbell
Cc: wei.liu2, stefano.stabellini, andrew.cooper3, xen-devel,
julien.grall, David Vrabel, jbeulich
On Tue, 15 Sep 2015, Ian Campbell wrote:
> On Mon, 2015-08-03 at 12:29 +0100, Ian Campbell wrote:
> > From: David Vrabel <david.vrabel@citrix.com>
> >
> > Instead of cpu_relax() while spinning and observing the ticket head,
> > introduce arch_lock_relax() which executes a WFE instruction. After
> > the ticket head is changed call arch_lock_signal() to execute an SEV
> > instruction (with the required DSB first) to wake any spinners.
> >
> > This should improve power consumption when locks are contended and
> > spinning.
> >
> > For consistency also move arch_lock_(acquire|release)_barrier to
> > asm/spinlock.h.
> >
> > Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> > only on amd64.
> >
> > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> > [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Andy has Reviewed-by but this change still lacks an Ack from the ARM side.
> Stefano, was your previous "The code should work though." intended as an
> Ack?
Yes
> > ---
> > v2 (ijc):
> > Add dsb(ishst) to spin_relax.
> > s/spin_(relax|signal)/arch_lock_\1/
> > Move arch_lock_(acquire|release)_barrier to asm/spinlock.h
> > (dropped Andy's Reviewed-by due to this change)
> >
> > In principal the SEV could be made unnecessary on arm64, but this
> > requires a new hook before the wait loop as well as changing
> > observe_head and _spin_unlock to use the Acquire/Release instructions
> > instead of the non-atomic loads and stores used today, which is a lot
> > more refactoring of the generic code than I think we can be bothered
> > with at this stage.
> >
> > 4.6: I'm in two minds about this, the lack of WFE in the ticket
> > spinlocks is not a regression (the old locks lacked them as well,
> > oops!). On the otherhand spinning like this isn't good. I think
> > overall I'm inclined to say this should wait for 4.7 but be a
> > candidate for backport to 4.6.1.
> > ---
> > xen/common/spinlock.c | 5 +++--
> > xen/include/asm-arm/spinlock.h | 9 ++++++++-
> > xen/include/asm-arm/system.h | 3 ---
> > xen/include/asm-x86/spinlock.h | 14 ++++++++++++++
> > xen/include/asm-x86/system.h | 11 -----------
> > 5 files changed, 25 insertions(+), 17 deletions(-)
> >
> > diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
> > index 29149d1..7f89694 100644
> > --- a/xen/common/spinlock.c
> > +++ b/xen/common/spinlock.c
> > @@ -141,7 +141,7 @@ void _spin_lock(spinlock_t *lock)
> > while ( tickets.tail != observe_head(&lock->tickets) )
> > {
> > LOCK_PROFILE_BLOCK;
> > - cpu_relax();
> > + arch_lock_relax();
> > }
> > LOCK_PROFILE_GOT;
> > preempt_disable();
> > @@ -170,6 +170,7 @@ void _spin_unlock(spinlock_t *lock)
> > preempt_enable();
> > LOCK_PROFILE_REL;
> > add_sized(&lock->tickets.head, 1);
> > + arch_lock_signal();
> > }
> >
> > void _spin_unlock_irq(spinlock_t *lock)
> > @@ -228,7 +229,7 @@ void _spin_barrier(spinlock_t *lock)
> > if ( sample.head != sample.tail )
> > {
> > while ( observe_head(&lock->tickets) == sample.head )
> > - cpu_relax();
> > + arch_lock_relax();
> > #ifdef LOCK_PROFILE
> > if ( lock->profile )
> > {
> > diff --git a/xen/include/asm-arm/spinlock.h b/xen/include/asm-arm/spinlock.h
> > index 81955d1..8cdf9e1 100644
> > --- a/xen/include/asm-arm/spinlock.h
> > +++ b/xen/include/asm-arm/spinlock.h
> > @@ -1,6 +1,13 @@
> > #ifndef __ASM_SPINLOCK_H
> > #define __ASM_SPINLOCK_H
> >
> > -/* Nothing ARM specific. */
> > +#define arch_lock_acquire_barrier() smp_mb()
> > +#define arch_lock_release_barrier() smp_mb()
> > +
> > +#define arch_lock_relax() wfe()
> > +#define arch_lock_signal() do { \
> > + dsb(ishst); \
> > + sev(); \
> > +} while(0)
> >
> > #endif /* __ASM_SPINLOCK_H */
> > diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> > index f0e222f..2eb96e8 100644
> > --- a/xen/include/asm-arm/system.h
> > +++ b/xen/include/asm-arm/system.h
> > @@ -53,9 +53,6 @@
> >
> > #define arch_fetch_and_add(x, v) __sync_fetch_and_add(x, v)
> >
> > -#define arch_lock_acquire_barrier() smp_mb()
> > -#define arch_lock_release_barrier() smp_mb()
> > -
> > extern struct vcpu *__context_switch(struct vcpu *prev, struct vcpu *next);
> >
> > #endif
> > diff --git a/xen/include/asm-x86/spinlock.h b/xen/include/asm-x86/spinlock.h
> > index 7d69e75..70a85af 100644
> > --- a/xen/include/asm-x86/spinlock.h
> > +++ b/xen/include/asm-x86/spinlock.h
> > @@ -4,4 +4,18 @@
> > #define _raw_read_unlock(l) \
> > asm volatile ( "lock; dec%z0 %0" : "+m" ((l)->lock) :: "memory" )
> >
> > +/*
> > + * On x86 the only reordering is of reads with older writes. In the
> > + * lock case, the read in observe_head() can only be reordered with
> > + * writes that precede it, and moving a write _into_ a locked section
> > + * is OK. In the release case, the write in add_sized() can only be
> > + * reordered with reads that follow it, and hoisting a read _into_ a
> > + * locked region is OK.
> > + */
> > +#define arch_lock_acquire_barrier() barrier()
> > +#define arch_lock_release_barrier() barrier()
> > +
> > +#define arch_lock_relax() cpu_relax()
> > +#define arch_lock_signal()
> > +
> > #endif /* __ASM_SPINLOCK_H */
> > diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
> > index 25a6a2a..9fb70f5 100644
> > --- a/xen/include/asm-x86/system.h
> > +++ b/xen/include/asm-x86/system.h
> > @@ -185,17 +185,6 @@ static always_inline unsigned long __xadd(
> > #define set_mb(var, value) do { xchg(&var, value); } while (0)
> > #define set_wmb(var, value) do { var = value; wmb(); } while (0)
> >
> > -/*
> > - * On x86 the only reordering is of reads with older writes. In the
> > - * lock case, the read in observe_head() can only be reordered with
> > - * writes that precede it, and moving a write _into_ a locked section
> > - * is OK. In the release case, the write in add_sized() can only be
> > - * reordered with reads that follow it, and hoisting a read _into_ a
> > - * locked region is OK.
> > - */
> > -#define arch_lock_acquire_barrier() barrier()
> > -#define arch_lock_release_barrier() barrier()
> > -
> > #define local_irq_disable() asm volatile ( "cli" : : : "memory" )
> > #define local_irq_enable() asm volatile ( "sti" : : : "memory" )
> >
>
* Re: [PATCH v2] arm: reduce power use by contented spin locks with WFE/SEV
2015-09-15 10:39 ` Stefano Stabellini
@ 2015-09-15 11:17 ` Ian Campbell
0 siblings, 0 replies; 9+ messages in thread
From: Ian Campbell @ 2015-09-15 11:17 UTC (permalink / raw)
To: Stefano Stabellini
Cc: wei.liu2, andrew.cooper3, xen-devel, julien.grall, David Vrabel,
jbeulich
On Tue, 2015-09-15 at 11:39 +0100, Stefano Stabellini wrote:
> On Tue, 15 Sep 2015, Ian Campbell wrote:
> > On Mon, 2015-08-03 at 12:29 +0100, Ian Campbell wrote:
> > > From: David Vrabel <david.vrabel@citrix.com>
> > >
> > > Instead of cpu_relax() while spinning and observing the ticket head,
> > > introduce arch_lock_relax() which executes a WFE instruction. After
> > > the ticket head is changed call arch_lock_signal() to execute an SEV
> > > instruction (with the required DSB first) to wake any spinners.
> > >
> > > This should improve power consumption when locks are contented and
> > > spinning.
> > >
> > > For consistency also move arch_lock_(acquire|release)_barrier to
> > > asm/spinlock.h.
> > >
> > > Booted the result on arm32 (Midway) and arm64 (Mustang). Build test
> > > only on amd64.
> > >
> > > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> > > [ijc: add barrier, rename as arch_lock_*, move arch_lock_*_barrier, test]
> > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > Cc: Jan Beulich <jbeulich@suse.com>
> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >
> > Andy has Reviewed-by but this change still lacks an Ack from the ARM side.
> > Stefano, was your previous "The code should work though." intended as an
> > Ack?
>
> Yes
Applied then. thanks.
Ian.
* Re: [PATCH v2] arm: reduce power use by contented spin locks with WFE/SEV
2015-08-11 15:07 ` Jan Beulich
@ 2015-09-15 11:17 ` Ian Campbell
0 siblings, 0 replies; 9+ messages in thread
From: Ian Campbell @ 2015-09-15 11:17 UTC (permalink / raw)
To: Jan Beulich, David Vrabel
Cc: julien.grall, andrew.cooper3, stefano.stabellini, wei.liu2, xen-devel
On Tue, 2015-08-11 at 09:07 -0600, Jan Beulich wrote:
> > > > On 03.08.15 at 13:29, <ian.campbell@citrix.com> wrote:
> > From: David Vrabel <david.vrabel@citrix.com>
> >
> > Instead of cpu_relax() while spinning and observing the ticket head,
> > introduce arch_lock_relax() which executes a WFE instruction. After
> > the ticket head is changed call arch_lock_signal() to execute an SEV
> > instruction (with the required DSB first) to wake any spinners.
> >
> > This should improve power consumption when locks are contented and
> > spinning.
>
> So why not use MONITOR/MWAIT on x86 for the same purpose?
I'll leave this for an x86 person to decide/implement.
Ian.
Thread overview: 9+ messages
2015-08-03 11:29 [PATCH v2] arm: reduce power use by contented spin locks with WFE/SEV Ian Campbell
2015-08-03 11:51 ` Stefano Stabellini
2015-08-03 11:58 ` Ian Campbell
2015-08-03 12:00 ` Andrew Cooper
2015-08-11 15:07 ` Jan Beulich
2015-09-15 11:17 ` Ian Campbell
2015-09-15 10:28 ` Ian Campbell
2015-09-15 10:39 ` Stefano Stabellini
2015-09-15 11:17 ` Ian Campbell