* [RFC][PATCH] spin loop arch primitives for busy waiting
@ 2017-04-03  8:13 Nicholas Piggin
  2017-04-03 15:31 ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-03  8:13 UTC (permalink / raw)
  To: linux-arch; +Cc: Nicholas Piggin, linux-kernel, torvalds, Anton Blanchard

Hi,

I would like to revisit this again and see if people are opposed to this
arch primitive. We have attributed cases of suboptimal performance on
real customer workloads to this, so I'd like to find a solution.

Since last posting, I promised the s390 people I'd consider hypervisor
yield additions. I'd like to punt that until after getting the basic
primitives in. HV yielding can still be done within these loops, but I
haven't yet found a simple recipe that would be generally useful to add.
We have cpu_relax_yield, but it's barely used so I prefer to avoid
over-engineering something that's not well tested.

Thanks,
Nick


Current busy-wait loops are implemented by calling cpu_relax() once per
iteration, to give the architecture a low-latency option for improving
power and/or SMT resource consumption.

This poses some difficulties for powerpc, which has SMT priority setting
instructions where relative priorities between threads determine how
ifetch cycles are apportioned. cpu_relax() is implemented by setting a
low priority then setting normal priority. This has several problems:

 - Changing thread priority can have some execution cost and potential
   impact to other threads in the core. It's inefficient to execute them
   every time around a busy-wait loop.

 - Depending on implementation details, a `low ; medium` sequence may
   not have much if any effect. Some software with a similar pattern
   actually inserts a lot of nops in between, in order to get a few
   fetch cycles at the low priority.

 - The rest of the busy-wait loop runs with regular priority. This
   might only be a few fetch cycles, but if there are several threads
   running such loops, they could cause a noticeable impact on a non-idle
   thread.
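
For reference, the current CONFIG_PPC64 definition of cpu_relax() in
arch/powerpc/include/asm/processor.h (the code this patch builds on) is:

	#define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)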

Implement spin_do {} spin_while() and spin_do {} spin_until()
primitives ("until" tends to be quite readable for busy-wait loops),
which allow powerpc to enter low SMT priority, run the loop, then return
to normal SMT priority.

The loops have some restrictions on what can be used, but they are
intended to be small and simple so it's not generally a problem:
 - Don't use cpu_relax.
 - Don't use return or goto.
 - Don't use sleeping or spinning primitives.
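
As an illustration (not part of this patch; "flag" is just a made-up
variable for the example), a wait that is currently written as

	while (!READ_ONCE(*flag))
		cpu_relax();

could instead be written as

	spin_do {
		/* body runs at low SMT priority on powerpc */
	} spin_until(READ_ONCE(*flag));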

---
 include/linux/processor.h            | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)
 create mode 100644 include/linux/processor.h

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..282457cd9b67
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,43 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * Begin a busy-wait loop, terminated with spin_while() or spin_until().
+ * This can be used in place of cpu_relax, and is intended for use where
+ * wait times are expected to be less than the cost of a context
+ * switch.
+ *
+ * These loops should be very simple, and avoid calling cpu_relax, or
+ * any "spin" or sleep type of primitive. Violating this should not cause
+ * a bug, just possibly suboptimal behaviour on some implementations.
+ *
+ * The most common / fastpath exit case(s) should be tested in the
+ * spin_while / spin_until conditions. Uncommon cases can use break from
+ * within the loop. Return and goto must not be used to exit from the
+ * loop.
+ *
+ * Guest yielding and such techniques are to be implemented by the caller.
+ */
+#ifndef spin_do
+#define spin_do								\
+do {									\
+	do {								\
+		cpu_relax();
+#endif
+
+#ifndef spin_while
+#define spin_while(cond)						\
+	} while (cond);							\
+} while (0)
+#endif
+
+#ifndef spin_until
+#define spin_until(cond)						\
+	} while (!(cond));						\
+} while (0)
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
-- 
2.11.0


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-03  8:13 [RFC][PATCH] spin loop arch primitives for busy waiting Nicholas Piggin
@ 2017-04-03 15:31 ` Linus Torvalds
  2017-04-03 23:50   ` Nicholas Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2017-04-03 15:31 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-arch, Linux Kernel Mailing List, Anton Blanchard

On Mon, Apr 3, 2017 at 1:13 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> The loops have some restrictions on what can be used, but they are
> intended to be small and simple so it's not generally a problem:
>  - Don't use cpu_relax.
>  - Don't use return or goto.
>  - Don't use sleeping or spinning primitives.

So you're supposed to "break" out of the loop if you want to exit
early? Or what?

One of the issues is that with a do-while/until loop, at least the way
you've coded it, it's always done at least once.

Which means that people will have to code the condition as

    if (cond) {
        .. fast case..
        return;
    }

    spin_do {
       ...
    } spin_until (cond);
    .. slow case ..

because "cpu_relax()" itself can be slightly slow.

And the way you've done it, even if there's a "break" in the loop, the
cpu_relax() is still done (because it's done at the top).

So quite frankly, I think "while(cond) ()" semantics would be better
than "do { } while (cond)".

Now, a lot of loops *are* of the kind where we've already handled the
fast case earlier, so by the time we get into the loop we really are
waiting for the condition to become true (but we knew it started out
false). But not all.

Doing a quick

    git grep -2 cpu_relax

for existing users of cpu_relax() does imply that most of the current
users are very much of the "cpu_relax() at the _end_ of the loop
tests" type.

So I don't know. I think the interface sucks.

What is it that POWER _actually_ wants? Not the loop - the
"cpu_relax()" kind of thing. Can we abstract *that* better?

                    Linus


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-03 15:31 ` Linus Torvalds
@ 2017-04-03 23:50   ` Nicholas Piggin
  2017-04-04  0:43     ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-03 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev

On Mon, 3 Apr 2017 08:31:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 3, 2017 at 1:13 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > The loops have some restrictions on what can be used, but they are
> > intended to be small and simple so it's not generally a problem:
> >  - Don't use cpu_relax.
> >  - Don't use return or goto.
> >  - Don't use sleeping or spinning primitives.  
> 
> So you're supposed to "break" out of the loop if you want to exit
> early? Or what?

Yes, break (I put that as a comment somewhere).

> One of the issues is that with a do-while/until loop, at least the way
> you've coded it, it's always done at least once.
> 
> Which means that people will have to code the condition as
> 
>     if (cond) {
>         .. fast case..
>         return;
>     }
> 
>     spin_do {
>        ...
>     } spin_until (cond);
>     .. slow case ..
> 
> because "cpu_relax()" itself can be slightly slow.
> 
> And the way you've done it, even if there's a "break" in the loop, the
> cpu_relax() is still done (because it's done at the top).
> 
> So quite frankly, I think "while(cond) ()" semantics would be better
> than "do { } while (cond)".
> 
> Now, a lot of loops *are* of the kind where we've already handled the
> fast case earlier, so by the time we get into the loop we really are
> waiting for the condition to become true (but we knew it started out
> false). But not all.

Yeah that's a good point, I see what you mean. The important cases I've
been looking at (e.g., mutex spinning, bit spinlock, idle loops) are the
slowpath cases. But this would need to be explicit in the API so fastpaths
don't use it.

> Doing a quick
> 
>     git grep -2 cpu_relax
> 
> for existing users of cpu_relax() does imply that most of the current
> users are very much of the "cpu_relax() at the _end_ of the loop
> tests" type.
> 
> So I don't know. I think the interface sucks.

It's a bit clunky.

> What is it that POWER _actually_ wants? Not the loop - the
> "cpu_relax()" kind of thing. Can we abstract *that* better?

POWER does not have an instruction like pause. We can only set current
thread priority, and current implementations do something like allocate
issue cycles to threads based on relative priorities. So there should
be at least one or two issue cycles at low priority, but ideally we
would not be changing priority in the busy-wait loop because it can
impact other threads in the core.

I couldn't think of a good way to improve cpu_relax. Our (open source)
firmware has a cpu_relax, and it puts a bunch of nops between low and
normal priority instructions so we get some fetch cycles at low prio.
That isn't ideal though.
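
To sketch roughly what I mean (purely illustrative, not the actual
firmware code; the nop count is arbitrary):

	static inline void firmware_style_relax(void)
	{
		HMT_low();	/* drop SMT ifetch priority */
		asm volatile("nop; nop; nop; nop; nop; nop; nop; nop");
		HMT_medium();	/* back to normal priority */
	}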

If you have any ideas, I'd be open to them.

I had a secondary motive for defining it as a loop, and that was using
branch hints or even asm goto to arrange the loop and exit branch
nicely. E.g., if we statically predict loop exits, then we don't take a
branch miss as the first thing when the condition changes. This is a
second-order concern though.

Thanks,
Nick


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-03 23:50   ` Nicholas Piggin
@ 2017-04-04  0:43     ` Linus Torvalds
  2017-04-04  3:02       ` Nicholas Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2017-04-04  0:43 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev

On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> POWER does not have an instruction like pause. We can only set current
> thread priority, and current implementations do something like allocate
> issue cycles to threads based on relative priorities. So there should
> be at least one or two issue cycles at low priority, but ideally we
> would not be changing priority in the busy-wait loop because it can
> impact other threads in the core.
>
> I couldn't think of a good way to improve cpu_relax. Our (open source)
> firmware has a cpu_relax, and it puts a bunch of nops between low and
> normal priority instructions so we get some fetch cycles at low prio.
> That isn't ideal though.
>
> If you have any ideas, I'd be open to them.

So the idea would be that maybe we can just make those things
explicit. IOW, instead of having that magical looping construct that
does other magical hidden things as part of the loop, maybe we can
just have a

   begin_cpu_relax();
   while (!cond)
       cpu_relax();
   end_cpu_relax();

and then architectures can decide how they implement it. So for x86,
the begin/end macros would be empty. For ppc, maybe begin/end would be
the "lower and raise priority", while cpu_relax() itself is an empty
thing.
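
For ppc that could be something like (just a sketch, reusing the
existing HMT_low()/HMT_medium() priority macros):

	#define begin_cpu_relax()	HMT_low()
	#define cpu_relax()		barrier()
	#define end_cpu_relax()		HMT_medium()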

Or maybe "begin" just clears a counter, while "cpu_relax()" does some
"increase iterations, and lower priority after X iterations", and then
"end" raises the priority again.

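As a very rough sketch of that counter idea (illustrative only; the
per-cpu counter and the threshold of 16 are made up, and HMT_low() /
HMT_medium() just stand in for "lower/raise priority"):

	static DEFINE_PER_CPU(unsigned int, relax_count);

	#define begin_cpu_relax()	__this_cpu_write(relax_count, 0)

	#define cpu_relax()						\
	do {								\
		if (__this_cpu_inc_return(relax_count) == 16)		\
			HMT_low();					\
		barrier();						\
	} while (0)

	#define end_cpu_relax()		HMT_medium()
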
The "do magic having a special loop" approach disturbs me. I'd much
rather have more explicit hooks that allow people to do their own loop
semantics (including having a "return" to exit early).

But that depends on architectures having some pattern that we *can*
abstract. Would some "begin/in-loop/end" pattern like the above be
sufficient? The pure "in-loop" case we have now (ie "cpu_relax()")
clearly isn't sufficient.

I think s390 might have issues too, since they tried to have that
"cpu_relax_yield" thing (which is only used by stop_machine), and
they've tried cpu_relax_lowlatency() and other games.

                    Linus


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-04  0:43     ` Linus Torvalds
@ 2017-04-04  3:02       ` Nicholas Piggin
  2017-04-04  4:11         ` Nicholas Piggin
  2017-04-05 14:01         ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-04  3:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev

On Mon, 3 Apr 2017 17:43:05 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 3, 2017 at 4:50 PM, Nicholas Piggin <npiggin@gmail.com> wrote:

> > If you have any ideas, I'd be open to them.  
> 
> So the idea would be that maybe we can just make those things
> explicit. IOW, instead of having that magical looping construct that
> does other magical hidden things as part of the loop, maybe we can
> just have a
> 
>    begin_cpu_relax();
>    while (!cond)
>        cpu_relax();
>    end_cpu_relax();
> 
> and then architectures can decide how they implement it. So for x86,
> the begin/end macros would be empty. For ppc, maybe begin/end would be
> the "lower and raise priority", while cpu_relax() itself is an empty
> thing.
> 
> Or maybe "begin" just clears a counter, while "cpu_relax()" does some
> "increase iterations, and lower priority after X iterations", and then
> "end" raises the priority again.
> 
> The "do magic having a special loop" approach disturbs me. I'd much
> rather have more explicit hooks that allow people to do their own loop
> semantics (including having a "return" to exit early).

I guess so. Users will still have to read the documentation rather than
just throw it in ad hoc because it seems like a good idea.

For example powerpc must not use any other primitives that might change
SMT priority in the idle loop. For x86, if you do too much work, the
rep ; nop component becomes relatively small and you lose the benefit
(10 cycles latency pre-Skylake).

I would suggest keeping standalone cpu_relax() for incidental code in
drivers and things (powerpc may add some extra nops between SMT low
and SMT normal priorities to improve it a little), and this would
be used by important core code and primitives.

> But that depends on architectures having some pattern that we *can*
> abstract. Would some "begin/in-loop/end" pattern like the above be
> sufficient?

Yes. begin/in/end would be sufficient for powerpc SMT priority, and
for x86, and it looks like sparc64 too. So we could do that if you
prefer.

After this, I might look at optimizing the loop code itself (optimizing
exit condition, de-pipelining, etc). That would require spin loop
primitives like this again. BUT they would not have funny restrictions
on exit conditions because they wouldn't do SMT priority. If I get
positive numbers for that, would you be opposed to having such
primitives (just for the important core spin loops)?

> I think s390 might have issues too, since they tried to have that
> "cpu_relax_yield" thing (which is only used by stop_machine), and
> they've tried cpu_relax_lowlatency() and other games.

There are some considerations with hv/guest yielding, yes. I'm not
touching that yet, but it needs to be looked at. Generic code has no
chance of looking at primitives available and deciding which should
be best used (not least because they aren't documented).

I think for the most part, busy loops shouldn't be done if they tend
to be more expensive than a context switch + associated cache costs,
so hypervisors are okay there. It's just rare cases where we *have*
to spin. Directed spinning or yielding for a resource is possibly a more
general pattern and something to look at adding to these spin APIs, but
for now they should be okay to just remain within the loop body.

Summary: hypervisor guests should not be affected by the new
APIs, but we could look at augmenting them later with some hv hints.

Thanks,
Nick


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-04  3:02       ` Nicholas Piggin
@ 2017-04-04  4:11         ` Nicholas Piggin
  2017-04-05 14:01         ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-04  4:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev

On Tue, 4 Apr 2017 13:02:33 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Mon, 3 Apr 2017 17:43:05 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 

> > But that depends on architectures having some pattern that we *can*
> > abstract. Would some "begin/in-loop/end" pattern like the above be
> > sufficient?  
> 
> Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> for x86, and it looks like sparc64 too. So we could do that if you
> prefer.

How's this? I changed your naming a bit just so we have a common spin_
prefix. With an example powerpc implementation and one caller converted
to see the effect.

---
 arch/powerpc/include/asm/processor.h | 17 +++++++++++++
 include/linux/processor.h            | 48 ++++++++++++++++++++++++++++++++++++
 kernel/sched/idle.c                  |  7 +++++-
 3 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/processor.h

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index e9bbd450d966..1274dc818e74 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -402,6 +402,23 @@ static inline unsigned long __pack_fe01(unsigned int fpmode)
 
 #ifdef CONFIG_PPC64
 #define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)
+
+#ifndef spin_begin
+#define spin_begin()	HMT_low()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax()	barrier()
+#endif
+
+#ifndef spin_cpu_yield
+#define spin_cpu_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()	HMT_medium()
+#endif
+
 #else
 #define cpu_relax()	barrier()
 #endif
diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..65e5635d0069
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,48 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive, including nested uses
+ * of these primitives. It should not lock or take any other resource.
+ * Violations of this will not cause a bug, but may cause suboptimal
+ * performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the busy loop body if necessary.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
+
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d5176463d..99a032d9f4a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/stackprotector.h>
 #include <linux/suspend.h>
+#include <linux/processor.h>
 
 #include <asm/tlb.h>
 
@@ -63,9 +64,13 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
 	stop_critical_timings();
+
+	spin_begin();
 	while (!tif_need_resched() &&
 		(cpu_idle_force_poll || tick_check_broadcast_expired()))
-		cpu_relax();
+		spin_cpu_relax();
+	spin_end();
+
 	start_critical_timings();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
-- 
2.11.0


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-04  3:02       ` Nicholas Piggin
  2017-04-04  4:11         ` Nicholas Piggin
@ 2017-04-05 14:01         ` David Miller
  2017-04-06  0:59           ` Nicholas Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2017-04-05 14:01 UTC (permalink / raw)
  To: npiggin; +Cc: torvalds, linux-arch, linux-kernel, anton, linuxppc-dev

From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 4 Apr 2017 13:02:33 +1000

> On Mon, 3 Apr 2017 17:43:05 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> But that depends on architectures having some pattern that we *can*
>> abstract. Would some "begin/in-loop/end" pattern like the above be
>> sufficient?
> 
> Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> for x86, and it looks like sparc64 too. So we could do that if you
> prefer.

Sparc64 has two cases, on older chips we can induce a cpu thread yield
with a special sequence of instructions, and on newer chips we have
a bonafide pause instruction.

So cpu_relax() all by itself pretty much works for us.


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-05 14:01         ` David Miller
@ 2017-04-06  0:59           ` Nicholas Piggin
  2017-04-06 14:13             ` Will Deacon
  0 siblings, 1 reply; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-06  0:59 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, linux-arch, linux-kernel, anton, linuxppc-dev

On Wed, 05 Apr 2017 07:01:57 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Nicholas Piggin <npiggin@gmail.com>
> Date: Tue, 4 Apr 2017 13:02:33 +1000
> 
> > On Mon, 3 Apr 2017 17:43:05 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >   
> >> But that depends on architectures having some pattern that we *can*
> >> abstract. Would some "begin/in-loop/end" pattern like the above be
> >> sufficient?  
> > 
> > Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> > for x86, and it looks like sparc64 too. So we could do that if you
> > prefer.  
> 
> Sparc64 has two cases, on older chips we can induce a cpu thread yield
> with a special sequence of instructions, and on newer chips we have
> a bonafide pause instruction.
> 
> So cpu_relax() all by itself pretty much works for us.
> 

Thanks for taking a look. The default spin primitives should just
continue to do the right thing for you in that case.

Arm has a yield instruction, ia64 has a pause... No unusual
requirements that I can see.

If there are no objections, I'll send the arch-independent part of
this through the powerpc tree (the last one I sent, which follows
Linus' preferred pattern).

Thanks,
Nick


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06  0:59           ` Nicholas Piggin
@ 2017-04-06 14:13             ` Will Deacon
  2017-04-06 15:16               ` Linus Torvalds
  2017-04-06 15:30               ` Nicholas Piggin
  0 siblings, 2 replies; 19+ messages in thread
From: Will Deacon @ 2017-04-06 14:13 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: David Miller, torvalds, linux-arch, linux-kernel, anton,
	linuxppc-dev, peterz

Hi Nick,

On Thu, Apr 06, 2017 at 10:59:58AM +1000, Nicholas Piggin wrote:
> On Wed, 05 Apr 2017 07:01:57 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Nicholas Piggin <npiggin@gmail.com>
> > Date: Tue, 4 Apr 2017 13:02:33 +1000
> > 
> > > On Mon, 3 Apr 2017 17:43:05 -0700
> > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >   
> > >> But that depends on architectures having some pattern that we *can*
> > >> abstract. Would some "begin/in-loop/end" pattern like the above be
> > >> sufficient?  
> > > 
> > > Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> > > for x86, and it looks like sparc64 too. So we could do that if you
> > > prefer.  
> > 
> > Sparc64 has two cases, on older chips we can induce a cpu thread yield
> > with a special sequence of instructions, and on newer chips we have
> > a bonafide pause instruction.
> > 
> > So cpu_relax() all by itself pretty much works for us.
> > 
> 
> Thanks for taking a look. The default spin primitives should just
> continue to do the right thing for you in that case.
> 
> Arm has a yield instruction, ia64 has a pause... No unusual
> requirements that I can see.

Yield tends to be implemented as a NOP in practice, since it's in the
architecture for SMT CPUs and most ARM CPUs are single-threaded. We do have
the WFE instruction (wait for event) which is used in our implementation of
smp_cond_load_acquire, but I don't think we'd be able to use it with the
proposals here.

WFE can stop the clock for the CPU until an "event" is signalled by
another CPU. This could be done by an explicit SEV (send event) instruction,
but that tends to require heavy barriers on the signalling side. Instead,
the preferred way to generate an event is to clear the exclusive monitor
reservation for the CPU executing the WFE. That means that the waiter
does something like:

	LDXR x0, [some_address]	// Load exclusive from some_address
	CMP  x0, some value	// If the value matches what I want
	B.EQ out		// then we're done
	WFE			// otherwise, wait

at this point, the waiter will stop on the WFE until its monitor is cleared,
which happens if another CPU writes to some_address.

We've wrapped this up in the arm64 code as __cmpwait, and we use that
to build smp_cond_load_acquire. It would be nice to use the same machinery
for the conditional spinning here, unless you anticipate that we're only
going to be spinning for a handful of iterations anyway?
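
For reference, the arm64 wiring is roughly the following shape
(paraphrased from arch/arm64/include/asm/barrier.h):

	#define __smp_cond_load_acquire(ptr, cond_expr)			\
	({								\
		typeof(ptr) __PTR = (ptr);				\
		typeof(*ptr) VAL;					\
		for (;;) {						\
			VAL = smp_load_acquire(__PTR);			\
			if (cond_expr)					\
				break;					\
			__cmpwait_relaxed(__PTR, VAL);			\
		}							\
		VAL;							\
	})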

Cheers,

Will


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 14:13             ` Will Deacon
@ 2017-04-06 15:16               ` Linus Torvalds
  2017-04-06 16:36                 ` Peter Zijlstra
  2017-04-06 15:30               ` Nicholas Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2017-04-06 15:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list,
	Peter Zijlstra

On Thu, Apr 6, 2017 at 7:13 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> We've wrapped this up in the arm64 code as __cmpwait, and we use that
> to build smp_cond_load_acquire. It would be nice to use the same machinery
> for the conditional spinning here, unless you anticipate that we're only
> going to be spinning for a handful of iterations anyway?

I suspect most of these loops aren't set up for the WFE kind of
spinning, because they look for more than one variable.

.. and the ones that _are_ set up for this probably should just be
rewritten to use smp_cond_load_acquire() anyway, because the "wait for
value" special case is fairly special.

In theory x86 could use monitor/mwait for it too, in practice I think
it tends to still be too high latency (because it was originally just
designed for the idle loop). mwait got extended to actually be useful,
but I'm not sure what the latency is for the modern one.

               Linus


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 14:13             ` Will Deacon
  2017-04-06 15:16               ` Linus Torvalds
@ 2017-04-06 15:30               ` Nicholas Piggin
  2017-04-07 16:13                 ` Will Deacon
  1 sibling, 1 reply; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-06 15:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Miller, torvalds, linux-arch, linux-kernel, anton,
	linuxppc-dev, peterz

On Thu, 6 Apr 2017 15:13:53 +0100
Will Deacon <will.deacon@arm.com> wrote:

> Hi Nick,
> 
> On Thu, Apr 06, 2017 at 10:59:58AM +1000, Nicholas Piggin wrote:
> > On Wed, 05 Apr 2017 07:01:57 -0700 (PDT)
> > David Miller <davem@davemloft.net> wrote:
> >   
> > > From: Nicholas Piggin <npiggin@gmail.com>
> > > Date: Tue, 4 Apr 2017 13:02:33 +1000
> > >   
> > > > On Mon, 3 Apr 2017 17:43:05 -0700
> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > >     
> > > >> But that depends on architectures having some pattern that we *can*
> > > >> abstract. Would some "begin/in-loop/end" pattern like the above be
> > > >> sufficient?    
> > > > 
> > > > Yes. begin/in/end would be sufficient for powerpc SMT priority, and
> > > > for x86, and it looks like sparc64 too. So we could do that if you
> > > > prefer.    
> > > 
> > > Sparc64 has two cases, on older chips we can induce a cpu thread yield
> > > with a special sequence of instructions, and on newer chips we have
> > > a bonafide pause instruction.
> > > 
> > > So cpu_relax() all by itself pretty much works for us.
> > >   
> > 
> > Thanks for taking a look. The default spin primitives should just
> > continue to do the right thing for you in that case.
> > 
> > Arm has a yield instruction, ia64 has a pause... No unusual
> > requirements that I can see.  
> 
> Yield tends to be implemented as a NOP in practice, since it's in the
> architecture for SMT CPUs and most ARM CPUs are single-threaded. We do have
> the WFE instruction (wait for event) which is used in our implementation of
> smp_cond_load_acquire, but I don't think we'd be able to use it with the
> proposals here.
> 
> WFE can stop the clock for the CPU until an "event" is signalled by
> another CPU. This could be done by an explicit SEV (send event) instruction,
> but that tends to require heavy barriers on the signalling side. Instead,
> the preferred way to generate an event is to clear the exclusive monitor
> reservation for the CPU executing the WFE. That means that the waiter
> does something like:
> 
> 	LDXR x0, [some_address]	// Load exclusive from some_address
> 	CMP  x0, some value	// If the value matches what I want
> 	B.EQ out		// then we're done
> 	WFE			// otherwise, wait
> 
> at this point, the waiter will stop on the WFE until its monitor is cleared,
> which happens if another CPU writes to some_address.
> 
> We've wrapped this up in the arm64 code as __cmpwait, and we use that
> to build smp_cond_load_acquire. It would be nice to use the same machinery
> for the conditional spinning here, unless you anticipate that we're only
> going to be spinning for a handful of iterations anyway?

So I do want to look at adding spin loop primitives as well as the
begin/in/end primitives to help with powerpc's SMT priorities.

So we'd have:

  spin_begin();
  spin_do {
    if (blah) {
        spin_end();
        return;
    }
  } spin_until(!locked);
  spin_end();

So you could implement your monitor with that. There's a handful of core
places: mutex, bit spinlock, seqlock, polling idle, etc. So I think if it
is beneficial for you in smp_cond_load_acquire, it should be useful in
those too.


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 15:16               ` Linus Torvalds
@ 2017-04-06 16:36                 ` Peter Zijlstra
  2017-04-06 17:31                   ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-04-06 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, Apr 06, 2017 at 08:16:19AM -0700, Linus Torvalds wrote:

> In theory x86 could use monitor/mwait for it too, in practice I think
> it tends to still be too high latency (because it was originally just
> designed for the idle loop). mwait got extended to actually be useful,
> but I'm not sure what the latency is for the modern one.

I've been meaning to test mwait-c0 for this, but never got around to it.

Something like the below, which is ugly (because I couldn't be bothered
to resolve the header recursion and thus duplicates the monitor/mwait
functions) and broken (because it hard assumes the hardware can do
monitor/mwait).

But it builds and boots, no clue if it's better or worse. Changing mwait
eax to 0 would give us C1 and might also be worth a try I suppose.

---
 arch/x86/include/asm/barrier.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index bfb28ca..faab9cd 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -80,6 +80,36 @@ do {									\
 #define __smp_mb__before_atomic()	barrier()
 #define __smp_mb__after_atomic()	barrier()
 
+static inline void ___monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void ___mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	for (;;) {							\
+		___monitor(__PTR, 0, 0);				\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		___mwait(0xf0 /* C0 */, 0x01 /* INT */);		\
+	}								\
+	smp_acquire__after_ctrl_dep();					\
+	VAL;								\
+})
+
 #include <asm-generic/barrier.h>
 
 #endif /* _ASM_X86_BARRIER_H */


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 16:36                 ` Peter Zijlstra
@ 2017-04-06 17:31                   ` Linus Torvalds
  2017-04-06 19:23                     ` Peter Zijlstra
  2017-04-07  9:43                     ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2017-04-06 17:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, Apr 6, 2017 at 9:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Something like the below, which is ugly (because I couldn't be bothered
> to resolve the header recursion and thus duplicates the monitor/mwait
> functions) and broken (because it hard assumes the hardware can do
> monitor/mwait).

Yeah, I think it needs to be conditional not just on mwait support,
but on the "new" mwait support (ie "CPUID.05H:ECX[bit 1] = 1").

And we'd probably want to make it even more strict, in that some mwait
implementations might simply not be very good for short waits.

Because I think the latency was hundreds of cycles at some point (but
that may have been the original version that wouldn't have had the
"new mwait" bit set), and there are also issues with virtualization
(ie we may not want to do this in a guest because it causes a VM
exit).

> But it builds and boots, no clue if it's better or worse. Changing mwait
> eax to 0 would give us C1 and might also be worth a try I suppose.

Hmm. Also:

> +               ___mwait(0xf0 /* C0 */, 0x01 /* INT */);                \

Do you actually want that "INT" bit set? It's only meaningful if
interrupts are _masked_, afaik. Which doesn't necessarily make sense
for this case.

If interrupts would actually get delivered to software, mwait will
exit regardless.

So I think __mwait(0,0) might be the right thing at least in some
cases. Or at least worth timing at some point.

Of course, we really shouldn't have very many places that actually
need this. We currently use it in three places, I think:

 - spinlocks. This is likely the big case.

 - the per-cpu cross-cpu calling (call_single_data) exclusivity waiting

 - the magical on_cpu waiting in ttwu. I'm not sure how often this
actually triggers, the original argument for this was to avoid an
expensive barrier - the loop itself probably very seldom actually
triggers.

It may be, for example, that just the fact that your implementation
does the "__monitor()" part before doing the load and test, might be
already too expensive for the (common) cases where we don't expect to
loop.

But maybe "monitor" is really cheap. I suspect it's microcoded,
though, which implies "no".

                      Linus


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 17:31                   ` Linus Torvalds
@ 2017-04-06 19:23                     ` Peter Zijlstra
  2017-04-06 19:41                       ` Linus Torvalds
  2017-04-07  9:43                     ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-04-06 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, Apr 06, 2017 at 10:31:46AM -0700, Linus Torvalds wrote:
> And we'd probably want to make it even more strict, in that some mwait
> implementations might simply not be very good for short waits.

Yeah, we need to find something that works; assuming it's beneficial at
all on modern chips.

> > But it builds and boots, no clue if it's better or worse. Changing mwait
> > eax to 0 would give us C1 and might also be worth a try I suppose.
> 
> Hmm. Also:
> 
> > +               ___mwait(0xf0 /* C0 */, 0x01 /* INT */);                \
> 
> Do you actually want that "INT" bit set? It's only meaningful if
> interrupts are _masked_, afaik. Which doesn't necessarily make sense
> for this case.

Correct, in fact its quite broken. Corrected.

> But maybe "monitor" is really cheap. I suspect it's microcoded,
> though, which implies "no".

Something like so then. According to the SDM mwait is a no-op if we do
not execute monitor first. So this variant should get the first
iteration without expensive instructions.

And this variant is actually quite amenable to alternative(), with which
we can pick between PAUSE and this.

---
 arch/x86/include/asm/barrier.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index bfb28ca..2b5d5a2 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -80,6 +80,36 @@ do {									\
 #define __smp_mb__before_atomic()	barrier()
 #define __smp_mb__after_atomic()	barrier()
 
+static inline void ___monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void ___mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+#define smp_cond_load_acquire(ptr, cond_expr)				\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	typeof(*ptr) VAL;						\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		___mwait(0xf0 /* C0 */, 0);				\
+		___monitor(__PTR, 0, 0);				\
+	}								\
+	smp_acquire__after_ctrl_dep();					\
+	VAL;								\
+})
+
 #include <asm-generic/barrier.h>
 
 #endif /* _ASM_X86_BARRIER_H */


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 19:23                     ` Peter Zijlstra
@ 2017-04-06 19:41                       ` Linus Torvalds
  2017-04-07  3:31                         ` Nicholas Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2017-04-06 19:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, Apr 6, 2017 at 12:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Something like so then. According to the SDM mwait is a no-op if we do
> not execute monitor first. So this variant should get the first
> iteration without expensive instructions.

No, the problem is that we *would* have executed a prior monitor that
could still be pending - from a previous invocation of
smp_cond_load_acquire().

Especially with spinlocks, these things can very much happen back-to-back.

And it would be pending with a different address (the previous
spinlock) that might not have changed since then (and might not be
changing), so now we might actually be pausing in mwait waiting for
that *other* thing to change.

So it would probably need to do something complicated like

  #define smp_cond_load_acquire(ptr, cond_expr)                         \
  ({                                                                    \
        typeof(ptr) __PTR = (ptr);                                      \
        typeof(*ptr) VAL;                                               \
        do {                                                            \
                VAL = READ_ONCE(*__PTR);                                \
                if (cond_expr)                                          \
                        break;                                          \
                for (;;) {                                              \
                        ___monitor(__PTR, 0, 0);                        \
                        VAL = READ_ONCE(*__PTR);                        \
                        if (cond_expr) break;                           \
                        ___mwait(0xf0 /* C0 */, 0);                     \
                }                                                       \
        } while (0)                                                     \
        smp_acquire__after_ctrl_dep();                                  \
        VAL;                                                            \
  })

which might just generate nasty enough code to not be worth it.

I dunno

             Linus


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 19:41                       ` Linus Torvalds
@ 2017-04-07  3:31                         ` Nicholas Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-07  3:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Will Deacon, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, 6 Apr 2017 12:41:52 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Apr 6, 2017 at 12:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Something like so then. According to the SDM mwait is a no-op if we do
> > not execute monitor first. So this variant should get the first
> > iteration without expensive instructions.  
> 
> No, the problem is that we *would* have executed a prior monitor that
> could still be pending - from a previous invocation of
> smp_cond_load_acquire().
> 
> Especially with spinlocks, these things can very much happen back-to-back.
> 
> And it would be pending with a different address (the previous
> spinlock) that might not have changed since then (and might not be
> changing), so now we might actually be pausing in mwait waiting for
> that *other* thing to change.
> 
> So it would probably need to do something complicated like
> 
>   #define smp_cond_load_acquire(ptr, cond_expr)                         \
>   ({                                                                    \
>         typeof(ptr) __PTR = (ptr);                                      \
>         typeof(*ptr) VAL;                                               \
>         do {                                                            \
>                 VAL = READ_ONCE(*__PTR);                                \
>                 if (cond_expr)                                          \
>                         break;                                          \
>                 for (;;) {                                              \
>                         ___monitor(__PTR, 0, 0);                        \
>                         VAL = READ_ONCE(*__PTR);                        \
>                         if (cond_expr) break;                           \
>                         ___mwait(0xf0 /* C0 */, 0);                     \
>                 }                                                       \
>         } while (0)                                                     \
>         smp_acquire__after_ctrl_dep();                                  \
>         VAL;                                                            \
>   })
> 
> which might just generate nasty enough code to not be worth it.

Okay, that's a bit of an issue I had with the spin_begin/in/end primitives.
The problem is that some of these loops are fastpaths that expect success on
the first iteration when not spinning. seqlock is another example.

powerpc does not want to go to low priority until after the initial test
fails, but cpu_relax based architectures do not want a pointless extra branch.

I added a spin_until_cond_likely() primitive that catches a number of these.
Not sure how well people would like that though.

Most loops actually are slowpath ones and work fine with begin/in/end though.
Attached a combined patch for reference with powerpc implementation and some
sites converted.

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..0a058aaa9bab
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,62 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive, including nested uses
+ * of these primitives. It should not lock or take any other resource.
+ * Violations of these guidelines will not cause a bug, but may cause
+ * suboptimal performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the loop body.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+/*
+ * spin_until_cond_likely can be used to wait for a condition to become true. It
+ * may be expected that the condition will be true on the first test in the
+ * common case (no spinning).
+ */
+#ifndef spin_until_cond_likely
+#define spin_until_cond_likely(cond)					\
+do {								\
+	spin_begin();						\
+	while (unlikely(!(cond)))					\
+		spin_cpu_relax();				\
+	spin_end();						\
+} while (0)
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index e0fecbcea2a2..47d134755c02 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -398,6 +398,33 @@ static inline unsigned long __pack_fe01(unsigned int fpmode)
 
 #ifdef CONFIG_PPC64
 #define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)
+
+#ifndef spin_begin
+#define spin_begin()	HMT_low()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax()	barrier()
+#endif
+
+#ifndef spin_cpu_yield
+#define spin_cpu_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()	HMT_medium()
+#endif
+
+#define spin_until_cond_likely(cond)					\
+do {								\
+	if (unlikely(!(cond))) {				\
+		spin_begin();					\
+		while (!(cond))					\
+			spin_cpu_relax();			\
+		spin_end();					\
+	}							\
+} while (0)
+
 #else
 #define cpu_relax()	barrier()
 #endif
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 07b90725855e..5d2fec2d43ed 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -442,6 +442,7 @@ void __delay(unsigned long loops)
 	unsigned long start;
 	int diff;
 
+	spin_begin();
 	if (__USE_RTC()) {
 		start = get_rtcl();
 		do {
@@ -449,13 +450,14 @@ void __delay(unsigned long loops)
 			diff = get_rtcl() - start;
 			if (diff < 0)
 				diff += 1000000000;
+			spin_cpu_relax();
 		} while (diff < loops);
 	} else {
 		start = get_tbl();
 		while (get_tbl() - start < loops)
-			HMT_low();
-		HMT_medium();
+			spin_cpu_relax();
 	}
+	spin_end();
 }
 EXPORT_SYMBOL(__delay);
 
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index cc332608e656..6175855e5535 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -181,8 +181,10 @@ static inline void native_lock_hpte(struct hash_pte *hptep)
 	while (1) {
 		if (!test_and_set_bit_lock(HPTE_LOCK_BIT, word))
 			break;
+		spin_begin();
 		while(test_bit(HPTE_LOCK_BIT, word))
-			cpu_relax();
+			spin_cpu_relax();
+		spin_end();
 	}
 }
 
diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index 3b5bafce4337..a9093ad8b5a2 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -25,9 +25,11 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
 	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
 		preempt_enable();
+		spin_begin();
 		do {
-			cpu_relax();
+			spin_cpu_relax();
 		} while (test_bit(bitnum, addr));
+		spin_end();
 		preempt_disable();
 	}
 #endif
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 1828900c9411..323ed2f272c4 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -80,8 +80,7 @@ static inline void *netpoll_poll_lock(struct napi_struct *napi)
 	if (dev && dev->npinfo) {
 		int owner = smp_processor_id();
 
-		while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
-			cpu_relax();
+		spin_until_cond_likely(cmpxchg(&napi->poll_owner, -1, owner) == -1);
 
 		return napi;
 	}
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index ead97654c4e9..3991e2927c13 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -32,6 +32,7 @@
  * by Keith Owens and Andrea Arcangeli
  */
 
+#include <linux/processor.h>
 #include <linux/spinlock.h>
 #include <linux/preempt.h>
 #include <linux/lockdep.h>
@@ -108,12 +109,8 @@ static inline unsigned __read_seqcount_begin(const seqcount_t *s)
 {
 	unsigned ret;
 
-repeat:
-	ret = READ_ONCE(s->sequence);
-	if (unlikely(ret & 1)) {
-		cpu_relax();
-		goto repeat;
-	}
+	spin_until_cond_likely(!((ret = READ_ONCE(s->sequence)) & 1));
+
 	return ret;
 }
 
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 6a385aabcce7..f10b3fca6efd 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -27,8 +27,10 @@ struct mcs_spinlock {
  */
 #define arch_mcs_spin_lock_contended(l)					\
 do {									\
+	spin_begin();							\
 	while (!(smp_load_acquire(l)))					\
-		cpu_relax();						\
+		spin_cpu_relax();					\
+	spin_end();							\
 } while (0)
 #endif
 
@@ -107,8 +109,10 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 		if (likely(cmpxchg_release(lock, node, NULL) == node))
 			return;
 		/* Wait until the next pointer is set */
+		spin_begin();
 		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
+			spin_cpu_relax();
+		spin_end();
 	}
 
 	/* Pass lock to next waiter. */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 198527a62149..a8e42bfa2eef 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -427,6 +427,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 	bool ret = true;
 
 	rcu_read_lock();
+	spin_begin();
 	while (__mutex_owner(lock) == owner) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
@@ -450,8 +451,9 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 			break;
 		}
 
-		cpu_relax();
+		spin_cpu_relax();
 	}
+	spin_end();
 	rcu_read_unlock();
 
 	return ret;
@@ -532,20 +534,25 @@ mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
 			goto fail;
 	}
 
+	spin_begin();
 	for (;;) {
 		struct task_struct *owner;
 
 		/* Try to acquire the mutex... */
 		owner = __mutex_trylock_or_owner(lock);
-		if (!owner)
+		if (!owner) {
+			spin_end();
 			break;
+		}
 
 		/*
 		 * There's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-		if (!mutex_spin_on_owner(lock, owner, ww_ctx, waiter))
+		if (!mutex_spin_on_owner(lock, owner, ww_ctx, waiter)) {
+			spin_end();
 			goto fail_unlock;
+		}
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
@@ -553,7 +560,7 @@ mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
 		 * memory barriers as we'll eventually observe the right
 		 * values at the cost of a few extra spins.
 		 */
-		cpu_relax();
+		spin_cpu_relax();
 	}
 
 	if (!waiter)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index a3167941093b..bb83dccf277c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -53,6 +53,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 	 */
 	old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
 
+	spin_begin();
 	for (;;) {
 		if (atomic_read(&lock->tail) == curr &&
 		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
@@ -80,8 +81,9 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 				break;
 		}
 
-		cpu_relax();
+		spin_cpu_relax();
 	}
+	spin_end();
 
 	return next;
 }
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index cc3ed0ccdfa2..786f26453a4a 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -53,10 +53,12 @@ struct __qrwlock {
 static __always_inline void
 rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
 {
+	spin_begin();
 	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
-		cpu_relax();
+		spin_cpu_relax();
 		cnts = atomic_read_acquire(&lock->cnts);
 	}
+	spin_end();
 }
 
 /**
@@ -123,6 +125,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	 * Set the waiting flag to notify readers that a writer is pending,
 	 * or wait for a previous writer to go away.
 	 */
+	spin_begin();
 	for (;;) {
 		struct __qrwlock *l = (struct __qrwlock *)lock;
 
@@ -130,7 +133,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 		   (cmpxchg_relaxed(&l->wmode, 0, _QW_WAITING) == 0))
 			break;
 
-		cpu_relax();
+		spin_cpu_relax();
 	}
 
 	/* When no more readers, set the locked flag */
@@ -141,8 +144,10 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 					    _QW_LOCKED) == _QW_WAITING))
 			break;
 
-		cpu_relax();
+		spin_cpu_relax();
 	}
+	spin_end();
+
 unlock:
 	arch_spin_unlock(&lock->wait_lock);
 }
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b2caec7315af..4cd2c148e989 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -361,6 +361,7 @@ void queued_spin_unlock_wait(struct qspinlock *lock)
 {
 	u32 val;
 
+	spin_begin();
 	for (;;) {
 		val = atomic_read(&lock->val);
 
@@ -371,14 +372,15 @@ void queued_spin_unlock_wait(struct qspinlock *lock)
 			break;
 
 		/* not locked, but pending, wait until we observe the lock */
-		cpu_relax();
+		spin_cpu_relax();
 	}
 
 	/* any unlock is good */
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
-		cpu_relax();
+		spin_cpu_relax();
 
 done:
+	spin_end();
 	smp_acquire__after_ctrl_dep();
 }
 EXPORT_SYMBOL(queued_spin_unlock_wait);
@@ -427,8 +429,10 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * 0,1,0 -> 0,0,1
 	 */
 	if (val == _Q_PENDING_VAL) {
+		spin_begin();
 		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
+			spin_cpu_relax();
+		spin_end();
 	}
 
 	/*
@@ -608,8 +612,10 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * contended path; wait for next if not observed yet, release.
 	 */
 	if (!next) {
+		spin_begin();
 		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
+			spin_cpu_relax();
+		spin_end();
 	}
 
 	arch_mcs_spin_unlock_contended(&next->locked);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7ad3e51..0dd12c61220d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -292,15 +292,19 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 	bool wait_early;
 
 	for (;;) {
+		spin_begin();
 		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
-			if (READ_ONCE(node->locked))
+			if (READ_ONCE(node->locked)) {
+				spin_end();
 				return;
+			}
 			if (pv_wait_early(pp, loop)) {
 				wait_early = true;
 				break;
 			}
-			cpu_relax();
+			spin_cpu_relax();
 		}
+		spin_end();
 
 		/*
 		 * Order pn->state vs pn->locked thusly:
@@ -416,11 +420,15 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		 * disable lock stealing before attempting to acquire the lock.
 		 */
 		set_pending(lock);
+		spin_begin();
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
-			if (trylock_clear_pending(lock))
+			if (trylock_clear_pending(lock)) {
+				spin_end();
 				goto gotlock;
-			cpu_relax();
+			}
+			spin_cpu_relax();
 		}
+		spin_end();
 		clear_pending(lock);
 
 
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 34e727f18e49..2d0e539f1a95 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -358,6 +358,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		goto out;
 
 	rcu_read_lock();
+	spin_begin();
 	while (sem->owner == owner) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
@@ -373,12 +374,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		if (!owner->on_cpu || need_resched() ||
 				vcpu_is_preempted(task_cpu(owner))) {
+			spin_end();
 			rcu_read_unlock();
 			return false;
 		}
 
-		cpu_relax();
+		spin_cpu_relax();
 	}
+	spin_end();
 	rcu_read_unlock();
 out:
 	/*
@@ -408,6 +411,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
+	spin_begin();
 	while (rwsem_spin_on_owner(sem)) {
 		/*
 		 * Try to acquire the lock
@@ -432,8 +436,9 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		 * memory barriers as we'll eventually observe the right
 		 * values at the cost of a few extra spins.
 		 */
-		cpu_relax();
+		spin_cpu_relax();
 	}
+	spin_end();
 	osq_unlock(&sem->osq);
 done:
 	preempt_enable();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d5176463d..99a032d9f4a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/stackprotector.h>
 #include <linux/suspend.h>
+#include <linux/processor.h>
 
 #include <asm/tlb.h>
 
@@ -63,9 +64,13 @@ static noinline int __cpuidle cpu_idle_poll(void)
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
 	stop_critical_timings();
+
+	spin_begin();
 	while (!tif_need_resched() &&
 		(cpu_idle_force_poll || tick_check_broadcast_expired()))
-		cpu_relax();
+		spin_cpu_relax();
+	spin_end();
+
 	start_critical_timings();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
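
For context, the conversions above only change behaviour on architectures
that provide their own spin_begin()/spin_cpu_relax()/spin_end(). A minimal
sketch of the generic fallbacks they rely on (illustrative only; the real
definitions live in the new <linux/processor.h> included above):

	/*
	 * Hypothetical generic fallbacks: spin_begin()/spin_end() do
	 * nothing, and spin_cpu_relax() is just the old cpu_relax(),
	 * so converted loops behave exactly as before by default.
	 */
	#ifndef spin_begin
	#define spin_begin()
	#endif

	#ifndef spin_cpu_relax
	#define spin_cpu_relax()	cpu_relax()
	#endif

	#ifndef spin_end
	#define spin_end()
	#endif

Architectures with something better to do around a busy-wait loop (powerpc's
SMT priorities, discussed later in the thread) can override any of these.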


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 17:31                   ` Linus Torvalds
  2017-04-06 19:23                     ` Peter Zijlstra
@ 2017-04-07  9:43                     ` Peter Zijlstra
  2017-04-07 11:26                       ` Nicholas Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-04-07  9:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Nicholas Piggin, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Thu, Apr 06, 2017 at 10:31:46AM -0700, Linus Torvalds wrote:
> But maybe "monitor" is really cheap. I suspect it's microcoded,
> though, which implies "no".

On my IVB-EP (will also try on something newer):

MONITOR	~332 cycles
MWAIT	~224 cycles (C0, explicitly invalidated MONITOR)

So yes, expensive.


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-07  9:43                     ` Peter Zijlstra
@ 2017-04-07 11:26                       ` Nicholas Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nicholas Piggin @ 2017-04-07 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Will Deacon, David Miller, linux-arch,
	Linux Kernel Mailing List, Anton Blanchard, linuxppc-dev list

On Fri, 7 Apr 2017 11:43:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Apr 06, 2017 at 10:31:46AM -0700, Linus Torvalds wrote:
> > But maybe "monitor" is really cheap. I suspect it's microcoded,
> > though, which implies "no".  
> 
> On my IVB-EP (will also try on something newer):
> 
> MONITOR	~332 cycles
> MWAIT	~224 cycles (C0, explicitly invalidated MONITOR)
> 
> So yes, expensive.

Interestingly, the Intel optimization manual says:

  The latency of PAUSE instruction in prior generation microarchitecture
  is about 10 cycles, whereas on Skylake microarchitecture it has been
  extended to as many as 140 cycles.

Elsewhere the manual claims this was done as an efficiency improvement.
Even at 140 cycles, PAUSE is still much cheaper than the MONITOR+MWAIT
numbers on your IVB, but if Skylake's MONITOR+MWAIT is a bit faster it
might become worth it.
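
For reference, the relax primitive in question here is PAUSE itself:
cpu_relax() on x86 comes down to a "rep; nop" (PAUSE) plus a compiler
barrier, so those per-iteration cycle counts apply directly to loops like
the ones being converted. A minimal sketch of that kind of relax (the
function name here is illustrative, not the kernel's):

	static inline void pause_relax(void)
	{
		/* "rep; nop" is the encoding of the PAUSE instruction */
		asm volatile("rep; nop" ::: "memory");
	}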


* Re: [RFC][PATCH] spin loop arch primitives for busy waiting
  2017-04-06 15:30               ` Nicholas Piggin
@ 2017-04-07 16:13                 ` Will Deacon
  0 siblings, 0 replies; 19+ messages in thread
From: Will Deacon @ 2017-04-07 16:13 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: David Miller, torvalds, linux-arch, linux-kernel, anton,
	linuxppc-dev, peterz

On Fri, Apr 07, 2017 at 01:30:11AM +1000, Nicholas Piggin wrote:
> On Thu, 6 Apr 2017 15:13:53 +0100
> Will Deacon <will.deacon@arm.com> wrote:
> > On Thu, Apr 06, 2017 at 10:59:58AM +1000, Nicholas Piggin wrote:
> > > Thanks for taking a look. The default spin primitives should just
> > > continue to do the right thing for you in that case.
> > > 
> > > Arm has a yield instruction, ia64 has a pause... No unusual
> > > requirements that I can see.  
> > 
> > Yield tends to be implemented as a NOP in practice, since it's in the
> > architecture for SMT CPUs and most ARM CPUs are single-threaded. We do have
> > the WFE instruction (wait for event) which is used in our implementation of
> > smp_cond_load_acquire, but I don't think we'd be able to use it with the
> > proposals here.
> > 
> > WFE can stop the clock for the CPU until an "event" is signalled by
> > another CPU. This could be done by an explicit SEV (send event) instruction,
> > but that tends to require heavy barriers on the signalling side. Instead,
> > the preferred way to generate an event is to clear the exclusive monitor
> > reservation for the CPU executing the WFE. That means that the waiter
> > does something like:
> > 
> > 	LDXR x0, [some_address]	// Load exclusive from some_address
> > 	CMP  x0, some value	// If the value matches what I want
> > 	B.EQ out		// then we're done
> > 	WFE			// otherwise, wait
> > 
> > at this point, the waiter will stop on the WFE until its monitor is cleared,
> > which happens if another CPU writes to some_address.
> > 
> > We've wrapped this up in the arm64 code as __cmpwait, and we use that
> > to build smp_cond_load_acquire. It would be nice to use the same machinery
> > for the conditional spinning here, unless you anticipate that we're only
> > going to be spinning for a handful of iterations anyway?
> 
> So I do want to look at adding spin loop primitives as well as the
> begin/in/end primitives to help with powerpc's SMT priorities.
> 
> So we'd have:
> 
>   spin_begin();
>   spin_do {
>     if (blah) {
>         spin_end();
>         return;
>     }
>   } spin_until(!locked);
>   spin_end();
> 
> So you could implement your monitor with that. There are a handful of core
> places: mutex, bit spinlock, seqlock, polling idle, etc. So I think if it
> is beneficial for you in smp_cond_load_acquire, it should be useful in
> those too.

Yeah, I think we should be able to implement spin_until like we do for
smp_cond_load_acquire, although it means we need to pass in the pointer as
well.

Will
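
To make the "pass in the pointer" point concrete, here is a rough sketch of
how an arm64-flavoured spin_until could be built on a __cmpwait()-style
helper taking the watched address and the last value seen (the names and
interface here are illustrative, not a settled API):

	/*
	 * Illustrative only: reload *ptr and test the condition; if it
	 * still fails, wait (WFE) until the exclusive monitor armed on
	 * ptr is cleared, i.e. until another CPU writes to *ptr.
	 */
	#define spin_until_cond(ptr, cond_expr)				\
	do {								\
		typeof(*(ptr)) VAL;					\
		for (;;) {						\
			VAL = READ_ONCE(*(ptr));			\
			if (cond_expr)	/* may refer to VAL */		\
				break;					\
			__cmpwait(ptr, VAL);				\
		}							\
	} while (0)

Unlike plain spin_begin()/spin_end() bracketing, this form needs the
address being watched, which is the extra pointer argument mentioned above.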


Thread overview: 19+ messages
2017-04-03  8:13 [RFC][PATCH] spin loop arch primitives for busy waiting Nicholas Piggin
2017-04-03 15:31 ` Linus Torvalds
2017-04-03 23:50   ` Nicholas Piggin
2017-04-04  0:43     ` Linus Torvalds
2017-04-04  3:02       ` Nicholas Piggin
2017-04-04  4:11         ` Nicholas Piggin
2017-04-05 14:01         ` David Miller
2017-04-06  0:59           ` Nicholas Piggin
2017-04-06 14:13             ` Will Deacon
2017-04-06 15:16               ` Linus Torvalds
2017-04-06 16:36                 ` Peter Zijlstra
2017-04-06 17:31                   ` Linus Torvalds
2017-04-06 19:23                     ` Peter Zijlstra
2017-04-06 19:41                       ` Linus Torvalds
2017-04-07  3:31                         ` Nicholas Piggin
2017-04-07  9:43                     ` Peter Zijlstra
2017-04-07 11:26                       ` Nicholas Piggin
2017-04-06 15:30               ` Nicholas Piggin
2017-04-07 16:13                 ` Will Deacon
