* [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
@ 2023-12-22 22:00 Andrew Cooper
  2024-01-04 13:24 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Andrew Cooper @ 2023-12-22 22:00 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Konrad Rzeszutek Wilk, Ross Lagerwall,
	Jan Beulich, Roger Pau Monné,
	Wei Liu

When livepatching is enabled, this function is used all the time.  Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).

This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
the optimiser has an easier time too.  Bloat-o-meter reports:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
  Function                                     old     new   delta
  check_for_livepatch_work.cold               1201    1183     -18
  check_for_livepatch_work                    1021     982     -39

which isn't too shabby for no logical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

I'm still a little disappointed with the code generation.  GCC still chooses
to set up the full stack frame (6 regs, +3 more slots) intermixed with the
per-cpu calculations.

In isolation, GCC can check the boolean without creating a stack frame:

  <work_to_do>:
    48 89 e2                mov    %rsp,%rdx
    48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
    48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
    8b 4a c1                mov    -0x3f(%rdx),%ecx
    48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
    48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
    0f b6 04 02             movzbl (%rdx,%rax,1),%eax
    c3                      retq

but I can't find a way to convince GCC that it would be worth not setting up a
stack frame in the common case, and having a few extra mov reg/reg's later
in the uncommon case.

I haven't tried manually splitting the function into a check() and a do()
function.  Views on whether that might be acceptable?  At a guess, do() would
need to be a static noinline to avoid it turning back into what it currently
is.
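For the avoidance of doubt, a single-file sketch of the split being mooted
(the demo_* names and flag are stand-ins for illustration only, not the real
per-CPU machinery):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU work_to_do flag and a way to
 * observe the slow path; the real code operates on Xen per-CPU state. */
bool demo_work_to_do;
int demo_slow_calls;

/* The do() half: static noinline, so it stays out of line and the
 * caller does not inherit its stack frame. */
static __attribute__((noinline)) void demo_do_livepatch_work(void)
{
    /* ... quiesce the system and apply/revert the payload ... */
    demo_slow_calls++;
}

/* The check() half: a single flag test, false essentially always. */
void demo_check_for_livepatch_work(void)
{
    if ( __builtin_expect(!demo_work_to_do, 1) ) /* likely(!work_to_do) */
        return;

    demo_do_livepatch_work();
}
```

The point is only the shape: with the slow half forced out of line, the fast
half is a bare flag test which needs no frame of its own.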
---
 xen/common/livepatch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..b6275339f663 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1706,15 +1706,15 @@ void check_for_livepatch_work(void)
     s_time_t timeout;
     unsigned long flags;
 
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
     /* Only do any work when invoked in truly idle state. */
     if ( system_state != SYS_STATE_active ||
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )

base-commit: 49818cde637b5ec20383e46b71f93b2e7d867686
-- 
2.30.2




* Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
  2023-12-22 22:00 [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case Andrew Cooper
@ 2024-01-04 13:24 ` Jan Beulich
  2024-01-04 13:32   ` Andrew Cooper
  2024-01-05 15:24 ` Ross Lagerwall
  2024-01-11 20:11 ` [PATCH v1-alt] " Andrew Cooper
  2 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2024-01-04 13:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Konrad Rzeszutek Wilk, Ross Lagerwall, Roger Pau Monné,
	Wei Liu, Xen-devel

On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> 
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
> 
> which isn't too shabby for no logical change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I'm still a little disappointed with the code generation.  GCC still chooses
> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
> per-cpu calculations.
> 
> In isolation, GCC can check the boolean without creating a stack frame:
> 
>   <work_to_do>:
>     48 89 e2                mov    %rsp,%rdx
>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>     c3                      retq
> 
> but I can't find a way to convince GCC that it would be worth not setting up a
> stack frame in the common case, and having a few extra mov reg/reg's later
> in the uncommon case.
> 
> I haven't tried manually splitting the function into a check() and a do()
> function.  Views on whether that might be acceptable?  At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when the fast-path check passes? Downside
would be that the per-CPU work_to_do variable then couldn't be static anymore.
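For illustration, a single-file sketch of that wrapper variant (demo_* names
are hypothetical; in real code the inline would live in a header and the flag
in livepatch.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Would need to lose its static, so the header-side inline can see it
 * (stand-in for per_cpu(work_to_do, cpu)). */
bool demo_work_to_do;
int demo_slow_calls;

/* The renamed out-of-line function. */
void demo_do_livepatch_work(void)
{
    demo_slow_calls++;
}

/* Header-side wrapper: the fast-path test inlines into every caller;
 * the out-of-line body is reached only when there is work to do. */
static inline void demo_check_for_livepatch_work(void)
{
    if ( __builtin_expect(!demo_work_to_do, 1) )
        return;

    demo_do_livepatch_work();
}
```

The flag test would then compile directly into each caller, at the cost of
exporting the work_to_do variable.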

Jan



* Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
  2024-01-04 13:24 ` Jan Beulich
@ 2024-01-04 13:32   ` Andrew Cooper
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Cooper @ 2024-01-04 13:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Konrad Rzeszutek Wilk, Ross Lagerwall, Roger Pau Monné,
	Wei Liu, Xen-devel

On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time.  Really do
>> check the fastpath first, and annotate it likely() as this is the right answer
>> 100% of the time (to many significant figures).
>>
>> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
>> the optimiser has an easier time too.  Bloat-o-meter reports:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>>   Function                                     old     new   delta
>>   check_for_livepatch_work.cold               1201    1183     -18
>>   check_for_livepatch_work                    1021     982     -39
>>
>> which isn't too shabby for no logical change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> I'm still a little disappointed with the code generation.  GCC still chooses
>> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
>> per-cpu calculations.
>>
>> In isolation, GCC can check the boolean without creating a stack frame:
>>
>>   <work_to_do>:
>>     48 89 e2                mov    %rsp,%rdx
>>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # ffff82d0405b6068 <per_cpu__work_to_do>
>>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # ffff82d0405d28e0 <__per_cpu_offset>
>>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>>     c3                      retq
>>
>> but I can't find a way to convince GCC that it would be worth not setting up a
>> stack frame in the common case, and having a few extra mov reg/reg's later
>> in the uncommon case.
>>
>> I haven't tried manually splitting the function into a check() and a do()
>> function.  Views on whether that might be acceptable?  At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that unfortunately.  It's called from assembly in
reset_stack_and_*().

But, I've had another idea.  I think attribute cold can help here, as it
will move the majority of the implementation into a separate section.
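A minimal sketch of that idea, with hypothetical demo_* stand-ins for the real
per-CPU state:

```c
#include <assert.h>
#include <stdbool.h>

bool demo_work_to_do;  /* stand-in for the per-CPU work_to_do flag */
int demo_cold_calls;

/* attribute((cold)): GCC treats every call site as unlikely and can
 * place the body in a separate cold text section, keeping the hot
 * caller's text small. */
static __attribute__((cold)) void demo_do_livepatch_work(void)
{
    demo_cold_calls++;
}

void demo_check_for_livepatch_work(void)
{
    /* The callee being cold also biases this branch as not-taken. */
    if ( !demo_work_to_do )
        return;

    demo_do_livepatch_work();
}
```

Whether GCC then avoids the stack frame in the hot caller, rather than just
emitting a stub, is the open question.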

~Andrew



* Re: [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case
  2023-12-22 22:00 [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case Andrew Cooper
  2024-01-04 13:24 ` Jan Beulich
@ 2024-01-05 15:24 ` Ross Lagerwall
  2024-01-11 20:11 ` [PATCH v1-alt] " Andrew Cooper
  2 siblings, 0 replies; 6+ messages in thread
From: Ross Lagerwall @ 2024-01-05 15:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Konrad Rzeszutek Wilk, Jan Beulich,
	Roger Pau Monné,
	Wei Liu

On Fri, Dec 22, 2023 at 10:01 PM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
>
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
>
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
>
> which isn't too shabby for no logical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>



* [PATCH v1-alt] xen/livepatch: Make check_for_livepatch_work() faster in the common case
  2023-12-22 22:00 [PATCH] xen/livepatch: Make check_for_livepatch_work() faster in the common case Andrew Cooper
  2024-01-04 13:24 ` Jan Beulich
  2024-01-05 15:24 ` Ross Lagerwall
@ 2024-01-11 20:11 ` Andrew Cooper
  2024-01-12 10:31   ` Jan Beulich
  2 siblings, 1 reply; 6+ messages in thread
From: Andrew Cooper @ 2024-01-11 20:11 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Konrad Rzeszutek Wilk, Ross Lagerwall,
	Jan Beulich, Roger Pau Monné,
	Wei Liu

When livepatching is enabled, this function is used all the time.  Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).  This cuts out 3 pointer
dereferences in the "nothing to do path".

However, GCC still needs some help to persuade it not to set up the full stack
frame (6 spilled registers, 3 slots of locals) even on the fastpath.

Create a new check_for_livepatch_work() with the fastpath only, and make the
"new" do_livepatch_work() noinline.  This causes the fastpath to need no stack
frame, making it faster still.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

v1-alt:
 * Manually split the functions.

Experimenting with __attribute__((cold)) was disappointing.  Vs this patch, it
creates an extra check_for_livepatch_work.cold function (and section) which is
just `jmp do_livepatch_work`.
---
 xen/common/livepatch.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..2c4b84382798 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1693,7 +1693,7 @@ static int livepatch_spin(atomic_t *counter, s_time_t timeout,
  * The main function which manages the work of quiescing the system and
  * patching code.
  */
-void check_for_livepatch_work(void)
+static void noinline do_livepatch_work(void)
 {
 #define ACTION(x) [LIVEPATCH_ACTION_##x] = #x
     static const char *const names[] = {
@@ -1711,10 +1711,6 @@ void check_for_livepatch_work(void)
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )
@@ -1864,6 +1860,17 @@ void check_for_livepatch_work(void)
     }
 }
 
+void check_for_livepatch_work(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
+    do_livepatch_work();
+}
+
 /*
  * Only allow dependent payload is applied on top of the correct
  * build-id.
-- 
2.30.2




* Re: [PATCH v1-alt] xen/livepatch: Make check_for_livepatch_work() faster in the common case
  2024-01-11 20:11 ` [PATCH v1-alt] " Andrew Cooper
@ 2024-01-12 10:31   ` Jan Beulich
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2024-01-12 10:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Konrad Rzeszutek Wilk, Ross Lagerwall, Roger Pau Monné,
	Wei Liu, Xen-devel

On 11.01.2024 21:11, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).  This cuts out 3 pointer
> dereferences in the "nothing to do path".
> 
> However, GCC still needs some help to persuade it not to set up the full stack
> frame (6 spilled registers, 3 slots of locals) even on the fastpath.
> 
> Create a new check_for_livepatch_work() with the fastpath only, and make the
> "new" do_livepatch_work() noinline.  This causes the fastpath to need no stack
> frame, making it faster still.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




