All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] xen: various function pointer cleanups
@ 2021-11-11 17:57 Andrew Cooper
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis

Passing CI runs:
  https://cirrus-ci.com/build/6095362789212160
  https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/407123417

Andrew Cooper (5):
  xen/domain: Remove function pointers from domain pause helpers
  xen/domain: Improve pirq handling
  xen/sort: Switch to an extern inline implementation
  xen/wait: Remove indirect jump
  x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()

 xen/arch/arm/bootfdt.c   |  9 ++++-
 xen/arch/arm/io.c        |  9 ++++-
 xen/arch/x86/io_apic.c   | 30 +++++++++++-----
 xen/common/domain.c      | 93 ++++++++++++++++++++++++++++--------------------
 xen/common/wait.c        | 19 +++++-----
 xen/include/xen/domain.h |  1 -
 xen/include/xen/sched.h  | 15 +++-----
 xen/include/xen/sort.h   | 55 +++++++++++++++++++++++++++-
 xen/lib/sort.c           | 80 ++---------------------------------------
 9 files changed, 162 insertions(+), 149 deletions(-)

-- 
2.11.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
@ 2021-11-11 17:57 ` Andrew Cooper
  2021-11-12  9:36   ` Julien Grall
                     ` (2 more replies)
  2021-11-11 17:57 ` [PATCH 2/5] xen/domain: Improve pirq handling Andrew Cooper
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis

Retpolines are expensive, and all these do are select between the sync and
nosync helpers.  Pass a boolean instead, and use direct calls everywhere.

Pause/unpause operations on behalf of dom0 are not fastpaths, so avoid
exposing the __domain_pause_by_systemcontroller() internal.

This actually compiles smaller than before:

  $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
  add/remove: 3/1 grow/shrink: 0/5 up/down: 250/-273 (-23)
  Function                                     old     new   delta
  _domain_pause                                  -     115    +115
  domain_pause_by_systemcontroller               -      69     +69
  domain_pause_by_systemcontroller_nosync        -      66     +66
  domain_kill                                  426     398     -28
  domain_resume                                246     214     -32
  domain_pause_except_self                     189     141     -48
  domain_pause                                  59      10     -49
  domain_pause_nosync                           59       7     -52
  __domain_pause_by_systemcontroller            64       -     -64

despite GCC's best efforts.  The new _domain_pause_by_systemcontroller()
really should not be inlined, considering that the difference is only the
setup of the sync boolean to pass to _domain_pause(), and there are plenty of
registers to spare.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
---
 xen/common/domain.c     | 31 ++++++++++++++++++++++---------
 xen/include/xen/sched.h | 15 +++++----------
 2 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 56d47dd66478..562872cdf87a 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
     return 0;
 }
 
-static void do_domain_pause(struct domain *d,
-                            void (*sleep_fn)(struct vcpu *v))
+static void _domain_pause(struct domain *d, bool sync /* or nosync */)
 {
     struct vcpu *v;
 
     atomic_inc(&d->pause_count);
 
-    for_each_vcpu( d, v )
-        sleep_fn(v);
+    if ( sync )
+        for_each_vcpu ( d, v )
+            vcpu_sleep_sync(v);
+    else
+        for_each_vcpu ( d, v )
+            vcpu_sleep_nosync(v);
 
     arch_domain_pause(d);
 }
@@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
 void domain_pause(struct domain *d)
 {
     ASSERT(d != current->domain);
-    do_domain_pause(d, vcpu_sleep_sync);
+    _domain_pause(d, true /* sync */);
 }
 
 void domain_pause_nosync(struct domain *d)
 {
-    do_domain_pause(d, vcpu_sleep_nosync);
+    _domain_pause(d, false /* nosync */);
 }
 
 void domain_unpause(struct domain *d)
@@ -1269,8 +1272,8 @@ void domain_unpause(struct domain *d)
             vcpu_wake(v);
 }
 
-int __domain_pause_by_systemcontroller(struct domain *d,
-                                       void (*pause_fn)(struct domain *d))
+static int _domain_pause_by_systemcontroller(
+    struct domain *d, bool sync /* or nosync */)
 {
     int old, new, prev = d->controller_pause_count;
 
@@ -1289,11 +1292,21 @@ int __domain_pause_by_systemcontroller(struct domain *d,
         prev = cmpxchg(&d->controller_pause_count, old, new);
     } while ( prev != old );
 
-    pause_fn(d);
+    _domain_pause(d, sync);
 
     return 0;
 }
 
+int domain_pause_by_systemcontroller(struct domain *d)
+{
+    return _domain_pause_by_systemcontroller(d, true /* sync */);
+}
+
+int domain_pause_by_systemcontroller_nosync(struct domain *d)
+{
+    return _domain_pause_by_systemcontroller(d, false /* nosync */);
+}
+
 int domain_unpause_by_systemcontroller(struct domain *d)
 {
     int old, new, prev = d->controller_pause_count;
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..37f78cc4c4c9 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -920,26 +920,21 @@ static inline bool vcpu_cpu_dirty(const struct vcpu *v)
 
 void vcpu_block(void);
 void vcpu_unblock(struct vcpu *v);
+
 void vcpu_pause(struct vcpu *v);
 void vcpu_pause_nosync(struct vcpu *v);
 void vcpu_unpause(struct vcpu *v);
+
 int vcpu_pause_by_systemcontroller(struct vcpu *v);
 int vcpu_unpause_by_systemcontroller(struct vcpu *v);
 
 void domain_pause(struct domain *d);
 void domain_pause_nosync(struct domain *d);
 void domain_unpause(struct domain *d);
+
+int domain_pause_by_systemcontroller(struct domain *d);
+int domain_pause_by_systemcontroller_nosync(struct domain *d);
 int domain_unpause_by_systemcontroller(struct domain *d);
-int __domain_pause_by_systemcontroller(struct domain *d,
-                                       void (*pause_fn)(struct domain *d));
-static inline int domain_pause_by_systemcontroller(struct domain *d)
-{
-    return __domain_pause_by_systemcontroller(d, domain_pause);
-}
-static inline int domain_pause_by_systemcontroller_nosync(struct domain *d)
-{
-    return __domain_pause_by_systemcontroller(d, domain_pause_nosync);
-}
 
 /* domain_pause() but safe against trying to pause current. */
 int __must_check domain_pause_except_self(struct domain *d);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/5] xen/domain: Improve pirq handling
  2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
@ 2021-11-11 17:57 ` Andrew Cooper
  2021-11-12 10:16   ` Jan Beulich
  2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis

free_pirq_struct() has no external users, so shouldn't be exposed.  Making it
static necessitates moving the function as domain_destroy() uses it.

Rework pirq_get_info() to have easier-to-follow logic.  The one functional
change is to the insertion failure path; we should not be using a full
call_rcu() chain to free an otherwise local structure we failed to insert into
the radix tree to begin with.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
---
 xen/common/domain.c      | 62 ++++++++++++++++++++++++++----------------------
 xen/include/xen/domain.h |  1 -
 2 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 562872cdf87a..a53dd114d5ba 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -365,6 +365,39 @@ static int __init parse_extra_guest_irqs(const char *s)
 }
 custom_param("extra_guest_irqs", parse_extra_guest_irqs);
 
+static void _free_pirq_struct(struct rcu_head *head)
+{
+    xfree(container_of(head, struct pirq, rcu_head));
+}
+
+static void free_pirq_struct(void *ptr)
+{
+    struct pirq *pirq = ptr;
+
+    call_rcu(&pirq->rcu_head, _free_pirq_struct);
+}
+
+struct pirq *pirq_get_info(struct domain *d, int pirq)
+{
+    struct pirq *info = pirq_info(d, pirq);
+
+    if ( likely(info) )
+        return info;
+
+    info = alloc_pirq_struct(d);
+    if ( unlikely(!info) )
+        return NULL;
+
+    info->pirq = pirq;
+    if ( likely(radix_tree_insert(&d->pirq_tree, pirq, info) == 0) )
+        return info; /* Success. */
+
+    /* Don't use call_rcu() to free a struct we failed to insert. */
+    _free_pirq_struct(&info->rcu_head);
+
+    return NULL;
+}
+
 /*
  * Release resources held by a domain.  There may or may not be live
  * references to the domain, and it may or may not be fully constructed.
@@ -1789,35 +1822,6 @@ long do_vm_assist(unsigned int cmd, unsigned int type)
 }
 #endif
 
-struct pirq *pirq_get_info(struct domain *d, int pirq)
-{
-    struct pirq *info = pirq_info(d, pirq);
-
-    if ( !info && (info = alloc_pirq_struct(d)) != NULL )
-    {
-        info->pirq = pirq;
-        if ( radix_tree_insert(&d->pirq_tree, pirq, info) )
-        {
-            free_pirq_struct(info);
-            info = NULL;
-        }
-    }
-
-    return info;
-}
-
-static void _free_pirq_struct(struct rcu_head *head)
-{
-    xfree(container_of(head, struct pirq, rcu_head));
-}
-
-void free_pirq_struct(void *ptr)
-{
-    struct pirq *pirq = ptr;
-
-    call_rcu(&pirq->rcu_head, _free_pirq_struct);
-}
-
 struct migrate_info {
     long (*func)(void *data);
     void *data;
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 160c8dbdab33..b4d202fda9fd 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -44,7 +44,6 @@ void free_vcpu_struct(struct vcpu *v);
 #ifndef alloc_pirq_struct
 struct pirq *alloc_pirq_struct(struct domain *);
 #endif
-void free_pirq_struct(void *);
 
 /*
  * Initialise/destroy arch-specific details of a VCPU.
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
  2021-11-11 17:57 ` [PATCH 2/5] xen/domain: Improve pirq handling Andrew Cooper
@ 2021-11-11 17:57 ` Andrew Cooper
  2021-11-11 18:15   ` Julien Grall
                     ` (2 more replies)
  2021-11-11 17:57 ` [PATCH 4/5] xen/wait: Remove indirect jump Andrew Cooper
  2021-11-11 17:57 ` [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry() Andrew Cooper
  4 siblings, 3 replies; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis

There are exactly 3 callers of sort() in the hypervisor.

Both arm callers pass in NULL for the swap function.  While this might seem
like an attractive option at first, it causes generic_swap() to be used which
forced a byte-wise copy.  Provide real swap functions which the compiler can
optimise sensibly.

Furthermore, use of function pointers in tight loops like that can be very bad
for performance.  Implement sort() as extern inline, so the optimiser can
judge whether to inline things or not.

On x86, the diffstat shows how much of a better job the compiler can do when
it is able to see the cmp/swap implementations.

  $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
  add/remove: 0/5 grow/shrink: 1/1 up/down: 505/-735 (-230)
  Function                                     old     new   delta
  sort_exception_table                          31     536    +505
  u32_swap                                       9       -      -9
  generic_swap                                  34       -     -34
  cmp_ex                                        36       -     -36
  swap_ex                                       41       -     -41
  sort_exception_tables                         90      38     -52
  sort                                         563       -    -563

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
---
 xen/arch/arm/bootfdt.c |  9 +++++-
 xen/arch/arm/io.c      |  9 +++++-
 xen/include/xen/sort.h | 55 +++++++++++++++++++++++++++++++++-
 xen/lib/sort.c         | 80 ++------------------------------------------------
 4 files changed, 72 insertions(+), 81 deletions(-)

diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
index afaa0e249b71..e318ef960386 100644
--- a/xen/arch/arm/bootfdt.c
+++ b/xen/arch/arm/bootfdt.c
@@ -448,6 +448,13 @@ static int __init cmp_memory_node(const void *key, const void *elem)
     return 0;
 }
 
+static void __init swap_memory_node(void *_a, void *_b, size_t size)
+{
+    struct membank *a = _a, *b = _b;
+
+    SWAP(*a, *b);
+}
+
 /**
  * boot_fdt_info - initialize bootinfo from a DTB
  * @fdt: flattened device tree binary
@@ -472,7 +479,7 @@ size_t __init boot_fdt_info(const void *fdt, paddr_t paddr)
      * the banks sorted in ascending order. So sort them through.
      */
     sort(bootinfo.mem.bank, bootinfo.mem.nr_banks, sizeof(struct membank),
-         cmp_memory_node, NULL);
+         cmp_memory_node, swap_memory_node);
 
     early_print_info();
 
diff --git a/xen/arch/arm/io.c b/xen/arch/arm/io.c
index 729287e37c59..1a066f9ae502 100644
--- a/xen/arch/arm/io.c
+++ b/xen/arch/arm/io.c
@@ -80,6 +80,13 @@ static int cmp_mmio_handler(const void *key, const void *elem)
     return 0;
 }
 
+static void swap_mmio_handler(void *_a, void *_b, size_t size)
+{
+    struct mmio_handler *a = _a, *b = _b;
+
+    SWAP(*a, *b);
+}
+
 static const struct mmio_handler *find_mmio_handler(struct domain *d,
                                                     paddr_t gpa)
 {
@@ -170,7 +177,7 @@ void register_mmio_handler(struct domain *d,
 
     /* Sort mmio handlers in ascending order based on base address */
     sort(vmmio->handlers, vmmio->num_entries, sizeof(struct mmio_handler),
-         cmp_mmio_handler, NULL);
+         cmp_mmio_handler, swap_mmio_handler);
 
     write_unlock(&vmmio->lock);
 }
diff --git a/xen/include/xen/sort.h b/xen/include/xen/sort.h
index a403652948e7..01479ea44606 100644
--- a/xen/include/xen/sort.h
+++ b/xen/include/xen/sort.h
@@ -3,8 +3,61 @@
 
 #include <xen/types.h>
 
+/*
+ * sort - sort an array of elements
+ * @base: pointer to data to sort
+ * @num: number of elements
+ * @size: size of each element
+ * @cmp: pointer to comparison function
+ * @swap: pointer to swap function or NULL
+ *
+ * This function does a heapsort on the given array. You may provide a
+ * swap function optimized to your element type.
+ *
+ * Sorting time is O(n log n) both on average and worst-case. While
+ * qsort is about 20% faster on average, it suffers from exploitable
+ * O(n*n) worst-case behavior and extra memory requirements that make
+ * it less suitable for kernel use.
+ */
+#ifndef SORT_IMPLEMENTATION
+extern gnu_inline
+#endif
 void sort(void *base, size_t num, size_t size,
           int (*cmp)(const void *, const void *),
-          void (*swap)(void *, void *, size_t));
+          void (*swap)(void *, void *, size_t))
+{
+    /* pre-scale counters for performance */
+    size_t i = (num / 2) * size, n = num * size, c, r;
+
+    /* heapify */
+    while ( i > 0 )
+    {
+        for ( r = i -= size; r * 2 + size < n; r = c )
+        {
+            c = r * 2 + size;
+            if ( (c < n - size) && (cmp(base + c, base + c + size) < 0) )
+                c += size;
+            if ( cmp(base + r, base + c) >= 0 )
+                break;
+            swap(base + r, base + c, size);
+        }
+    }
+
+    /* sort */
+    for ( i = n; i > 0; )
+    {
+        i -= size;
+        swap(base, base + i, size);
+        for ( r = 0; r * 2 + size < i; r = c )
+        {
+            c = r * 2 + size;
+            if ( (c < i - size) && (cmp(base + c, base + c + size) < 0) )
+                c += size;
+            if ( cmp(base + r, base + c) >= 0 )
+                break;
+            swap(base + r, base + c, size);
+        }
+    }
+}
 
 #endif /* __XEN_SORT_H__ */
diff --git a/xen/lib/sort.c b/xen/lib/sort.c
index 35ce0d7abdec..b7e78cc0e8d2 100644
--- a/xen/lib/sort.c
+++ b/xen/lib/sort.c
@@ -4,81 +4,5 @@
  * Jan 23 2005  Matt Mackall <mpm@selenic.com>
  */
 
-#include <xen/types.h>
-
-static void u32_swap(void *a, void *b, size_t size)
-{
-    uint32_t t = *(uint32_t *)a;
-
-    *(uint32_t *)a = *(uint32_t *)b;
-    *(uint32_t *)b = t;
-}
-
-static void generic_swap(void *a, void *b, size_t size)
-{
-    char t;
-
-    do {
-        t = *(char *)a;
-        *(char *)a++ = *(char *)b;
-        *(char *)b++ = t;
-    } while ( --size > 0 );
-}
-
-/*
- * sort - sort an array of elements
- * @base: pointer to data to sort
- * @num: number of elements
- * @size: size of each element
- * @cmp: pointer to comparison function
- * @swap: pointer to swap function or NULL
- *
- * This function does a heapsort on the given array. You may provide a
- * swap function optimized to your element type.
- *
- * Sorting time is O(n log n) both on average and worst-case. While
- * qsort is about 20% faster on average, it suffers from exploitable
- * O(n*n) worst-case behavior and extra memory requirements that make
- * it less suitable for kernel use.
- */
-
-void sort(void *base, size_t num, size_t size,
-          int (*cmp)(const void *, const void *),
-          void (*swap)(void *, void *, size_t size))
-{
-    /* pre-scale counters for performance */
-    size_t i = (num / 2) * size, n = num * size, c, r;
-
-    if ( !swap )
-        swap = (size == 4 ? u32_swap : generic_swap);
-
-    /* heapify */
-    while ( i > 0 )
-    {
-        for ( r = i -= size; r * 2 + size < n; r = c )
-        {
-            c = r * 2 + size;
-            if ( (c < n - size) && (cmp(base + c, base + c + size) < 0) )
-                c += size;
-            if ( cmp(base + r, base + c) >= 0 )
-                break;
-            swap(base + r, base + c, size);
-        }
-    }
-
-    /* sort */
-    for ( i = n; i > 0; )
-    {
-        i -= size;
-        swap(base, base + i, size);
-        for ( r = 0; r * 2 + size < i; r = c )
-        {
-            c = r * 2 + size;
-            if ( (c < i - size) && (cmp(base + c, base + c + size) < 0) )
-                c += size;
-            if ( cmp(base + r, base + c) >= 0 )
-                break;
-            swap(base + r, base + c, size);
-        }
-    }
-}
+#define SORT_IMPLEMENTATION
+#include <xen/sort.h>
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/5] xen/wait: Remove indirect jump
  2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
                   ` (2 preceding siblings ...)
  2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
@ 2021-11-11 17:57 ` Andrew Cooper
  2021-11-12 10:35   ` Jan Beulich
  2021-11-11 17:57 ` [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry() Andrew Cooper
  4 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

There is no need for this to be an indirect jump at all.  Execution always
returns to a specific location, so use a direct jump instead.

Use a named label for the jump.  As both 1 and 2 have disappeared as labels,
rename 3 to skip to better describe its purpose.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/common/wait.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/xen/common/wait.c b/xen/common/wait.c
index 24716e76768b..9276d76464fb 100644
--- a/xen/common/wait.c
+++ b/xen/common/wait.c
@@ -144,18 +144,16 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
         "push %%r8;  push %%r9;  push %%r10; push %%r11;"
         "push %%r12; push %%r13; push %%r14; push %%r15;"
 
-        "call 1f;"
-        "1: addq $2f-1b,(%%rsp);"
         "sub %%esp,%%ecx;"
         "cmp %3,%%ecx;"
-        "ja 3f;"
+        "ja .L_skip;"
         "mov %%rsp,%%rsi;"
 
         /* check_wakeup_from_wait() longjmp()'s to this point. */
-        "2: rep movsb;"
+        ".L_wq_resume: rep movsb;"
         "mov %%rsp,%%rsi;"
-        "3: pop %%rax;"
 
+        ".L_skip:"
         "pop %%r15; pop %%r14; pop %%r13; pop %%r12;"
         "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
         "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
@@ -204,15 +202,14 @@ void check_wakeup_from_wait(void)
     }
 
     /*
-     * Hand-rolled longjmp().  Returns to the pointer on the top of
-     * wqv->stack, and lands on a `rep movs` instruction.  All other GPRs are
-     * restored from the stack, so are available for use here.
+     * Hand-rolled longjmp().  Returns to __prepare_to_wait(), and lands on a
+     * `rep movs` instruction.  All other GPRs are restored from the stack, so
+     * are available for use here.
      */
     asm volatile (
-        "mov %1,%%"__OP"sp; INDIRECT_JMP %[ip]"
+        "mov %1,%%"__OP"sp; jmp .L_wq_resume;"
         : : "S" (wqv->stack), "D" (wqv->esp),
-          "c" ((char *)get_cpu_info() - (char *)wqv->esp),
-          [ip] "r" (*(unsigned long *)wqv->stack)
+          "c" ((char *)get_cpu_info() - (char *)wqv->esp)
         : "memory" );
     unreachable();
 }
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
                   ` (3 preceding siblings ...)
  2021-11-11 17:57 ` [PATCH 4/5] xen/wait: Remove indirect jump Andrew Cooper
@ 2021-11-11 17:57 ` Andrew Cooper
  2021-11-12 10:43   ` Jan Beulich
  4 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-11 17:57 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu

Function pointers are expensive, and the raw parameter is a constant from all
callers, meaning that it predicts very well with local branch history.

Furthermore, the knock-on effects are quite impressive.

  $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
  add/remove: 0/4 grow/shrink: 3/9 up/down: 459/-823 (-364)
  Function                                     old     new   delta
  __ioapic_write_entry                          73     286    +213
  __ioapic_read_entry                           75     276    +201
  save_IO_APIC_setup                           182     227     +45
  eoi_IO_APIC_irq                              241     229     -12
  disable_IO_APIC                              296     280     -16
  mask_IO_APIC_setup                           272     240     -32
  __io_apic_write                               46       -     -46
  __io_apic_read                                46       -     -46
  io_apic_set_pci_routing                      985     930     -55
  __io_apic_eoi.part                           223     161     -62
  io_apic_write                                 69       -     -69
  io_apic_read                                  69       -     -69
  restore_IO_APIC_setup                        325     253     -72
  ioapic_guest_write                          1413    1333     -80
  clear_IO_APIC_pin                            447     343    -104
  setup_IO_APIC                               5148    4988    -160

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/io_apic.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c
index c3ad9efac88d..1c49a0fe1478 100644
--- a/xen/arch/x86/io_apic.c
+++ b/xen/arch/x86/io_apic.c
@@ -235,11 +235,19 @@ union entry_union {
 struct IO_APIC_route_entry __ioapic_read_entry(
     unsigned int apic, unsigned int pin, bool raw)
 {
-    unsigned int (*read)(unsigned int, unsigned int)
-        = raw ? __io_apic_read : io_apic_read;
     union entry_union eu;
-    eu.w1 = (*read)(apic, 0x10 + 2 * pin);
-    eu.w2 = (*read)(apic, 0x11 + 2 * pin);
+
+    if ( raw )
+    {
+        eu.w1 = __io_apic_read(apic, 0x10 + 2 * pin);
+        eu.w2 = __io_apic_read(apic, 0x11 + 2 * pin);
+    }
+    else
+    {
+        eu.w1 = io_apic_read(apic, 0x10 + 2 * pin);
+        eu.w2 = io_apic_read(apic, 0x11 + 2 * pin);
+    }
+
     return eu.entry;
 }
 
@@ -259,12 +267,18 @@ void __ioapic_write_entry(
     unsigned int apic, unsigned int pin, bool raw,
     struct IO_APIC_route_entry e)
 {
-    void (*write)(unsigned int, unsigned int, unsigned int)
-        = raw ? __io_apic_write : io_apic_write;
     union entry_union eu = { .entry = e };
 
-    (*write)(apic, 0x11 + 2*pin, eu.w2);
-    (*write)(apic, 0x10 + 2*pin, eu.w1);
+    if ( raw )
+    {
+        __io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
+        __io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
+    }
+    else
+    {
+        io_apic_write(apic, 0x11 + 2 * pin, eu.w2);
+        io_apic_write(apic, 0x10 + 2 * pin, eu.w1);
+    }
 }
 
 static void ioapic_write_entry(
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
@ 2021-11-11 18:15   ` Julien Grall
  2021-11-16  0:36     ` Stefano Stabellini
  2021-11-12  9:39   ` Julien Grall
  2021-11-12 10:25   ` Jan Beulich
  2 siblings, 1 reply; 34+ messages in thread
From: Julien Grall @ 2021-11-11 18:15 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

Hi Andrew,

On 11/11/2021 17:57, Andrew Cooper wrote:
> There are exactly 3 callers of sort() in the hypervisor.
> 
> Both arm callers pass in NULL for the swap function.  While this might seem
> like an attractive option at first, it causes generic_swap() to be used which
> forced a byte-wise copy.  Provide real swap functions which the compiler can
> optimise sensibly.

I understand the theory, but none of the two calls are in hot paths or 
deal with large set on Arm. So I am rather hesitant to introduce 
specialised swap in each case as it doesn't seem we will gain much from 
this change.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
@ 2021-11-12  9:36   ` Julien Grall
  2021-11-18  1:47     ` Andrew Cooper
  2021-11-12  9:57   ` Jan Beulich
  2021-11-15 10:13   ` Bertrand Marquis
  2 siblings, 1 reply; 34+ messages in thread
From: Julien Grall @ 2021-11-12  9:36 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

  ~/works/oss/linux/scripts/bloat-o-meter xen-syms-old xen-syms

On 11/11/2021 17:57, Andrew Cooper wrote:
> Retpolines are expensive, and all these do are select between the sync and
> nosync helpers.  Pass a boolean instead, and use direct calls everywhere.

To be honest, I much prefer to read the old code. I am totally not 
against the change but I can see how I would be ready to introduce new 
function pointers use in the future.

So I think we need some guidelines on when to use function pointers in 
Xen. The more...

> 
> Pause/unpause operations on behalf of dom0 are not fastpaths, so avoid
> exposing the __domain_pause_by_systemcontroller() internal.
> 
> This actually compiles smaller than before:

... the code doesn't really compile smaller on Arm:

42sh>  ../scripts/bloat-o-meter xen-syms-old xen-syms

add/remove: 4/2 grow/shrink: 0/6 up/down: 272/-252 (20)
Function                                     old     new   delta
_domain_pause                                  -     136    +136
_domain_pause_by_systemcontroller              -     120    +120
domain_pause_by_systemcontroller_nosync        -       8      +8
domain_pause_by_systemcontroller               -       8      +8
domain_resume                                136     132      -4
domain_pause_nosync                           12       8      -4
domain_pause                                  12       8      -4
domain_pause_except_self                     188     180      -8
do_domctl                                   5480    5472      -8
domain_kill                                  372     356     -16
do_domain_pause                               88       -     -88
__domain_pause_by_systemcontroller           120       -    -120
Total: Before=966919, After=966939, chg +0.00%

> 
>    $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
>    add/remove: 3/1 grow/shrink: 0/5 up/down: 250/-273 (-23)
>    Function                                     old     new   delta
>    _domain_pause                                  -     115    +115
>    domain_pause_by_systemcontroller               -      69     +69
>    domain_pause_by_systemcontroller_nosync        -      66     +66
>    domain_kill                                  426     398     -28
>    domain_resume                                246     214     -32
>    domain_pause_except_self                     189     141     -48
>    domain_pause                                  59      10     -49
>    domain_pause_nosync                           59       7     -52
>    __domain_pause_by_systemcontroller            64       -     -64
> 
> despite GCC's best efforts.  The new _domain_pause_by_systemcontroller()
> really should not be inlined, considering that the difference is only the
> setup of the sync boolean to pass to _domain_pause(), and there are plenty of
> registers to spare.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
  2021-11-11 18:15   ` Julien Grall
@ 2021-11-12  9:39   ` Julien Grall
  2021-11-12 10:25   ` Jan Beulich
  2 siblings, 0 replies; 34+ messages in thread
From: Julien Grall @ 2021-11-12  9:39 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

Hi Andrew,

On 11/11/2021 17:57, Andrew Cooper wrote:
> There are exactly 3 callers of sort() in the hypervisor.
> 
> Both arm callers pass in NULL for the swap function.  While this might seem
> like an attractive option at first, it causes generic_swap() to be used which
> forced a byte-wise copy.  Provide real swap functions which the compiler can
> optimise sensibly.
> 
> Furthermore, use of function pointers in tight loops like that can be very bad
> for performance.  Implement sort() as extern inline, so the optimiser can
> judge whether to inline things or not.
> 
> On x86, the diffstat shows how much of a better job the compiler can do when
> it is able to see the cmp/swap implementations.

For completness, here the Arm bloat-o-meter:

add/remove: 0/5 grow/shrink: 2/0 up/down: 928/-660 (268)
Function                                     old     new   delta
boot_fdt_info                                640    1132    +492
register_mmio_handler                        292     728    +436
u32_swap                                      20       -     -20
generic_swap                                  40       -     -40
cmp_mmio_handler                              44       -     -44
cmp_memory_node                               44       -     -44
sort                                         512       -    -512
Total: Before=966915, After=967183, chg +0.03%

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
  2021-11-12  9:36   ` Julien Grall
@ 2021-11-12  9:57   ` Jan Beulich
  2021-11-17 23:31     ` Andrew Cooper
  2021-11-15 10:13   ` Bertrand Marquis
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2021-11-12  9:57 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Xen-devel

On 11.11.2021 18:57, Andrew Cooper wrote:
> Retpolines are expensive, and all these do are select between the sync and
> nosync helpers.  Pass a boolean instead, and use direct calls everywhere.
> 
> Pause/unpause operations on behalf of dom0 are not fastpaths, so avoid
> exposing the __domain_pause_by_systemcontroller() internal.
> 
> This actually compiles smaller than before:
> 
>   $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
>   add/remove: 3/1 grow/shrink: 0/5 up/down: 250/-273 (-23)
>   Function                                     old     new   delta
>   _domain_pause                                  -     115    +115
>   domain_pause_by_systemcontroller               -      69     +69
>   domain_pause_by_systemcontroller_nosync        -      66     +66
>   domain_kill                                  426     398     -28
>   domain_resume                                246     214     -32
>   domain_pause_except_self                     189     141     -48
>   domain_pause                                  59      10     -49
>   domain_pause_nosync                           59       7     -52
>   __domain_pause_by_systemcontroller            64       -     -64
> 
> despite GCC's best efforts.  The new _domain_pause_by_systemcontroller()
> really should not be inlined, considering that the difference is only the
> setup of the sync boolean to pass to _domain_pause(), and there are plenty of
> registers to spare.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit without meaning to override Julien's concerns in any way.

Also a question:

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>      return 0;
>  }
>  
> -static void do_domain_pause(struct domain *d,
> -                            void (*sleep_fn)(struct vcpu *v))
> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>  {
>      struct vcpu *v;
>  
>      atomic_inc(&d->pause_count);
>  
> -    for_each_vcpu( d, v )
> -        sleep_fn(v);
> +    if ( sync )
> +        for_each_vcpu ( d, v )
> +            vcpu_sleep_sync(v);
> +    else
> +        for_each_vcpu ( d, v )
> +            vcpu_sleep_nosync(v);

Is this really better (for whichever reason) than 

    for_each_vcpu ( d, v )
    {
        if ( sync )
            vcpu_sleep_sync(v);
        else
            vcpu_sleep_nosync(v);
    }

?

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] xen/domain: Improve pirq handling
  2021-11-11 17:57 ` [PATCH 2/5] xen/domain: Improve pirq handling Andrew Cooper
@ 2021-11-12 10:16   ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2021-11-12 10:16 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Xen-devel

On 11.11.2021 18:57, Andrew Cooper wrote:
> free_pirq_struct() has no external users, so shouldn't be exposed.

This has been the case from its very introduction. Which iirc was done
that way because its alloc counterpart is non-static. Not an objection,
just an observation.

>  Making it
> static necessitates moving the function as domain_destroy() uses it.
> 
> Rework pirq_get_info() to have easier-to-follow logic.

That's a matter of taste; I for one would prefer the original form with
just a single return statement. I'm (obviously) not going to nack this,
but I'm not sure yet whether I'm willing to (eventually) ack it.

>  The one functional
> change is to the insertion failure path; we should not be using a full
> call_rcu() chain to free an otherwise local structure we failed to insert into
> the radix tree to begin with.

This makes an assumption on the radix tree implementation, in that failure
there may not occur after publication of the pointer. This perhaps is not
a requirement that would easily get violated considering the present code
structure, but I'm still not sure we want to have such hidden dependencies.
At the very least I seem to vaguely recall that at the time of
introduction it wasn't just an oversight to use the RCU approach there as
well.

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -365,6 +365,39 @@ static int __init parse_extra_guest_irqs(const char *s)
>  }
>  custom_param("extra_guest_irqs", parse_extra_guest_irqs);
>  
> +static void _free_pirq_struct(struct rcu_head *head)
> +{
> +    xfree(container_of(head, struct pirq, rcu_head));
> +}
> +
> +static void free_pirq_struct(void *ptr)
> +{
> +    struct pirq *pirq = ptr;
> +
> +    call_rcu(&pirq->rcu_head, _free_pirq_struct);
> +}
> +
> +struct pirq *pirq_get_info(struct domain *d, int pirq)
> +{
> +    struct pirq *info = pirq_info(d, pirq);
> +
> +    if ( likely(info) )
> +        return info;
> +
> +    info = alloc_pirq_struct(d);
> +    if ( unlikely(!info) )
> +        return NULL;

Are the unlikely() here and ...

> +    info->pirq = pirq;
> +    if ( likely(radix_tree_insert(&d->pirq_tree, pirq, info) == 0) )
> +        return info; /* Success. */

... the likely() here really warranted? Iirc you're generally advocating
for avoiding their use unless strongly indicated, and if I'm not mistaken
the compiler's heuristics result in such NULL / 0 checks to get assumed
to be unlikely / likely anyway.

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
  2021-11-11 18:15   ` Julien Grall
  2021-11-12  9:39   ` Julien Grall
@ 2021-11-12 10:25   ` Jan Beulich
  2 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2021-11-12 10:25 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Xen-devel

On 11.11.2021 18:57, Andrew Cooper wrote:
> There are exactly 3 callers of sort() in the hypervisor.
> 
> Both arm callers pass in NULL for the swap function.  While this might seem
> like an attractive option at first, it causes generic_swap() to be used which
> forced a byte-wise copy.  Provide real swap functions which the compiler can
> optimise sensibly.
> 
> Furthermore, use of function pointers in tight loops like that can be very bad
> for performance.  Implement sort() as extern inline, so the optimiser can
> judge whether to inline things or not.
> 
> On x86, the diffstat shows how much of a better job the compiler can do when
> it is able to see the cmp/swap implementations.
> 
>   $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
>   add/remove: 0/5 grow/shrink: 1/1 up/down: 505/-735 (-230)
>   Function                                     old     new   delta
>   sort_exception_table                          31     536    +505
>   u32_swap                                       9       -      -9
>   generic_swap                                  34       -     -34
>   cmp_ex                                        36       -     -36
>   swap_ex                                       41       -     -41
>   sort_exception_tables                         90      38     -52
>   sort                                         563       -    -563
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Technically
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Yet again without the intention of overriding Julien's concerns in any
way. To address one of them, how about retaining generic_swap() (as an
inline function), ...

> --- a/xen/include/xen/sort.h
> +++ b/xen/include/xen/sort.h
> @@ -3,8 +3,61 @@
>  
>  #include <xen/types.h>
>  
> +/*
> + * sort - sort an array of elements
> + * @base: pointer to data to sort
> + * @num: number of elements
> + * @size: size of each element
> + * @cmp: pointer to comparison function
> + * @swap: pointer to swap function or NULL
> + *
> + * This function does a heapsort on the given array. You may provide a
> + * swap function optimized to your element type.
> + *
> + * Sorting time is O(n log n) both on average and worst-case. While
> + * qsort is about 20% faster on average, it suffers from exploitable
> + * O(n*n) worst-case behavior and extra memory requirements that make
> + * it less suitable for kernel use.
> + */
> +#ifndef SORT_IMPLEMENTATION
> +extern gnu_inline
> +#endif
>  void sort(void *base, size_t num, size_t size,
>            int (*cmp)(const void *, const void *),
> -          void (*swap)(void *, void *, size_t));
> +          void (*swap)(void *, void *, size_t))
> +{
> +    /* pre-scale counters for performance */
> +    size_t i = (num / 2) * size, n = num * size, c, r;
> +
> +    /* heapify */
> +    while ( i > 0 )
> +    {
> +        for ( r = i -= size; r * 2 + size < n; r = c )
> +        {
> +            c = r * 2 + size;
> +            if ( (c < n - size) && (cmp(base + c, base + c + size) < 0) )
> +                c += size;
> +            if ( cmp(base + r, base + c) >= 0 )
> +                break;
> +            swap(base + r, base + c, size);

... doing

            if ( swap )
                swap(base + r, base + c, size);
            else
                generic_swap(base + r, base + c, size);

here and below. The compiler would then still be able to eliminate the
indirect calls (as well as the added conditional), I think.

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/5] xen/wait: Remove indirect jump
  2021-11-11 17:57 ` [PATCH 4/5] xen/wait: Remove indirect jump Andrew Cooper
@ 2021-11-12 10:35   ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2021-11-12 10:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 11.11.2021 18:57, Andrew Cooper wrote:
> There is no need for this to be an indirect jump at all.  Execution always
> returns to a specific location, so use a direct jump instead.
> 
> Use a named label for the jump.  As both 1 and 2 have disappeared as labels,
> rename 3 to skip to better describe its purpose.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-11 17:57 ` [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry() Andrew Cooper
@ 2021-11-12 10:43   ` Jan Beulich
  2021-11-18  0:32     ` Andrew Cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2021-11-12 10:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 11.11.2021 18:57, Andrew Cooper wrote:
> Function pointers are expensive, and the raw parameter is a constant from all
> callers, meaning that it predicts very well with local branch history.

The code change is fine, but I'm having trouble with "all" here: Both
functions aren't even static, so while callers in io_apic.c may
benefit (perhaps with the exception of ioapic_{read,write}_entry(),
depending on whether the compiler views inlining them as warranted),
I'm in no way convinced this extends to the callers in VT-d code.

Further ISTR clang being quite a bit less aggressive about inlining,
so the effects might not be quite as good there even for the call
sites in io_apic.c.

Can you clarify this for me please?

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
  2021-11-12  9:36   ` Julien Grall
  2021-11-12  9:57   ` Jan Beulich
@ 2021-11-15 10:13   ` Bertrand Marquis
  2021-11-15 10:20     ` Jan Beulich
  2 siblings, 1 reply; 34+ messages in thread
From: Bertrand Marquis @ 2021-11-15 10:13 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk

Hi Andrew,

> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
> Retpolines are expensive, and all these do are select between the sync and
> nosync helpers.  Pass a boolean instead, and use direct calls everywhere.
> 
> Pause/unpause operations on behalf of dom0 are not fastpaths, so avoid
> exposing the __domain_pause_by_systemcontroller() internal.
> 
> This actually compiles smaller than before:
> 
>  $ ../scripts/bloat-o-meter xen-syms-before xen-syms-after
>  add/remove: 3/1 grow/shrink: 0/5 up/down: 250/-273 (-23)
>  Function                                     old     new   delta
>  _domain_pause                                  -     115    +115
>  domain_pause_by_systemcontroller               -      69     +69
>  domain_pause_by_systemcontroller_nosync        -      66     +66
>  domain_kill                                  426     398     -28
>  domain_resume                                246     214     -32
>  domain_pause_except_self                     189     141     -48
>  domain_pause                                  59      10     -49
>  domain_pause_nosync                           59       7     -52
>  __domain_pause_by_systemcontroller            64       -     -64
> 
> despite GCC's best efforts.  The new _domain_pause_by_systemcontroller()
> really should not be inlined, considering that the difference is only the
> setup of the sync boolean to pass to _domain_pause(), and there are plenty of
> registers to spare.

To add an argument to the discussion I would say that preventing to use function
pointers is something that is good for FuSa so I am in favour here.

> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> CC: Bertrand Marquis <bertrand.marquis@arm.com>
> ---
> xen/common/domain.c     | 31 ++++++++++++++++++++++---------
> xen/include/xen/sched.h | 15 +++++----------
> 2 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 56d47dd66478..562872cdf87a 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>     return 0;
> }
> 
> -static void do_domain_pause(struct domain *d,
> -                            void (*sleep_fn)(struct vcpu *v))
> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)

Here you use comments inside the function definition.
I think this is something that should be avoided and in this specific case a
boolean sync is clear enough not to need to explain that false is nosing.

> {
>     struct vcpu *v;
> 
>     atomic_inc(&d->pause_count);
> 
> -    for_each_vcpu( d, v )
> -        sleep_fn(v);
> +    if ( sync )
> +        for_each_vcpu ( d, v )
> +            vcpu_sleep_sync(v);
> +    else
> +        for_each_vcpu ( d, v )
> +            vcpu_sleep_nosync(v);
> 
>     arch_domain_pause(d);
> }
> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
> void domain_pause(struct domain *d)
> {
>     ASSERT(d != current->domain);
> -    do_domain_pause(d, vcpu_sleep_sync);
> +    _domain_pause(d, true /* sync */);
Same here.

> }
> 
> void domain_pause_nosync(struct domain *d)
> {
> -    do_domain_pause(d, vcpu_sleep_nosync);
> +    _domain_pause(d, false /* nosync */);
Same here.

> }
> 
> void domain_unpause(struct domain *d)
> @@ -1269,8 +1272,8 @@ void domain_unpause(struct domain *d)
>             vcpu_wake(v);
> }
> 
> -int __domain_pause_by_systemcontroller(struct domain *d,
> -                                       void (*pause_fn)(struct domain *d))
> +static int _domain_pause_by_systemcontroller(
> +    struct domain *d, bool sync /* or nosync */)
Same here.

> {
>     int old, new, prev = d->controller_pause_count;
> 
> @@ -1289,11 +1292,21 @@ int __domain_pause_by_systemcontroller(struct domain *d,
>         prev = cmpxchg(&d->controller_pause_count, old, new);
>     } while ( prev != old );
> 
> -    pause_fn(d);
> +    _domain_pause(d, sync);
> 
>     return 0;
> }
> 
> +int domain_pause_by_systemcontroller(struct domain *d)
> +{
> +    return _domain_pause_by_systemcontroller(d, true /* sync */);
Same here.

> +}
> +
> +int domain_pause_by_systemcontroller_nosync(struct domain *d)
> +{
> +    return _domain_pause_by_systemcontroller(d, false /* nosync */);
Same here.

Cheers
Bertrand

> +}
> +
> int domain_unpause_by_systemcontroller(struct domain *d)
> {
>     int old, new, prev = d->controller_pause_count;
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 28146ee404e6..37f78cc4c4c9 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -920,26 +920,21 @@ static inline bool vcpu_cpu_dirty(const struct vcpu *v)
> 
> void vcpu_block(void);
> void vcpu_unblock(struct vcpu *v);
> +
> void vcpu_pause(struct vcpu *v);
> void vcpu_pause_nosync(struct vcpu *v);
> void vcpu_unpause(struct vcpu *v);
> +
> int vcpu_pause_by_systemcontroller(struct vcpu *v);
> int vcpu_unpause_by_systemcontroller(struct vcpu *v);
> 
> void domain_pause(struct domain *d);
> void domain_pause_nosync(struct domain *d);
> void domain_unpause(struct domain *d);
> +
> +int domain_pause_by_systemcontroller(struct domain *d);
> +int domain_pause_by_systemcontroller_nosync(struct domain *d);
> int domain_unpause_by_systemcontroller(struct domain *d);
> -int __domain_pause_by_systemcontroller(struct domain *d,
> -                                       void (*pause_fn)(struct domain *d));
> -static inline int domain_pause_by_systemcontroller(struct domain *d)
> -{
> -    return __domain_pause_by_systemcontroller(d, domain_pause);
> -}
> -static inline int domain_pause_by_systemcontroller_nosync(struct domain *d)
> -{
> -    return __domain_pause_by_systemcontroller(d, domain_pause_nosync);
> -}
> 
> /* domain_pause() but safe against trying to pause current. */
> int __must_check domain_pause_except_self(struct domain *d);
> -- 
> 2.11.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 10:13   ` Bertrand Marquis
@ 2021-11-15 10:20     ` Jan Beulich
  2021-11-15 10:23       ` Bertrand Marquis
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2021-11-15 10:20 UTC (permalink / raw)
  To: Bertrand Marquis
  Cc: Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Andrew Cooper

On 15.11.2021 11:13, Bertrand Marquis wrote:
>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>     return 0;
>> }
>>
>> -static void do_domain_pause(struct domain *d,
>> -                            void (*sleep_fn)(struct vcpu *v))
>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
> 
> Here you use comments inside the function definition.
> I think this is something that should be avoided and in this specific case a
> boolean sync is clear enough not to need to explain that false is nosing.

While I agree the comment here isn't overly useful, I think ...

>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>> void domain_pause(struct domain *d)
>> {
>>     ASSERT(d != current->domain);
>> -    do_domain_pause(d, vcpu_sleep_sync);
>> +    _domain_pause(d, true /* sync */);
> Same here.

... comments like this one are pretty useful to disambiguate the plain
"true" or "false" (without the reader needing to go look at the function
declaration or definition).

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 10:20     ` Jan Beulich
@ 2021-11-15 10:23       ` Bertrand Marquis
  2021-11-15 10:55         ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Bertrand Marquis @ 2021-11-15 10:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Andrew Cooper

Hi Jan,

> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 15.11.2021 11:13, Bertrand Marquis wrote:
>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>>    return 0;
>>> }
>>> 
>>> -static void do_domain_pause(struct domain *d,
>>> -                            void (*sleep_fn)(struct vcpu *v))
>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>> 
>> Here you use comments inside the function definition.
>> I think this is something that should be avoided and in this specific case a
>> boolean sync is clear enough not to need to explain that false is nosing.
> 
> While I agree the comment here isn't overly useful, I think ...
> 
>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>>> void domain_pause(struct domain *d)
>>> {
>>>    ASSERT(d != current->domain);
>>> -    do_domain_pause(d, vcpu_sleep_sync);
>>> +    _domain_pause(d, true /* sync */);
>> Same here.
> 
> ... comments like this one are pretty useful to disambiguate the plain
> "true" or "false" (without the reader needing to go look at the function
> declaration or definition).

I agree with that but the comment here is useful, it could be added before
the call instead of inside it.

Bertrand

> 
> Jan
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 10:23       ` Bertrand Marquis
@ 2021-11-15 10:55         ` Jan Beulich
  2021-11-15 11:23           ` Bertrand Marquis
  2021-11-16  0:41           ` Stefano Stabellini
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Beulich @ 2021-11-15 10:55 UTC (permalink / raw)
  To: Bertrand Marquis
  Cc: Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Andrew Cooper

On 15.11.2021 11:23, Bertrand Marquis wrote:
> Hi Jan,
> 
>> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 15.11.2021 11:13, Bertrand Marquis wrote:
>>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>>>    return 0;
>>>> }
>>>>
>>>> -static void do_domain_pause(struct domain *d,
>>>> -                            void (*sleep_fn)(struct vcpu *v))
>>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>>>
>>> Here you use comments inside the function definition.
>>> I think this is something that should be avoided and in this specific case a
>>> boolean sync is clear enough not to need to explain that false is nosing.
>>
>> While I agree the comment here isn't overly useful, I think ...
>>
>>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>>>> void domain_pause(struct domain *d)
>>>> {
>>>>    ASSERT(d != current->domain);
>>>> -    do_domain_pause(d, vcpu_sleep_sync);
>>>> +    _domain_pause(d, true /* sync */);
>>> Same here.
>>
>> ... comments like this one are pretty useful to disambiguate the plain
>> "true" or "false" (without the reader needing to go look at the function
>> declaration or definition).
> 
> I agree with that but the comment here is useful, it could be added before
> the call instead of inside it.

Except the form Andrew has used is the one we've been using elsewhere
for some time.

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 10:55         ` Jan Beulich
@ 2021-11-15 11:23           ` Bertrand Marquis
  2021-11-15 14:11             ` Julien Grall
  2021-11-16  0:41           ` Stefano Stabellini
  1 sibling, 1 reply; 34+ messages in thread
From: Bertrand Marquis @ 2021-11-15 11:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Andrew Cooper

Hi Jan,

> On 15 Nov 2021, at 10:55, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 15.11.2021 11:23, Bertrand Marquis wrote:
>> Hi Jan,
>> 
>>> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 15.11.2021 11:13, Bertrand Marquis wrote:
>>>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>> --- a/xen/common/domain.c
>>>>> +++ b/xen/common/domain.c
>>>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>>>>   return 0;
>>>>> }
>>>>> 
>>>>> -static void do_domain_pause(struct domain *d,
>>>>> -                            void (*sleep_fn)(struct vcpu *v))
>>>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>>>> 
>>>> Here you use comments inside the function definition.
>>>> I think this is something that should be avoided and in this specific case a
>>>> boolean sync is clear enough not to need to explain that false is nosing.
>>> 
>>> While I agree the comment here isn't overly useful, I think ...
>>> 
>>>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>>>>> void domain_pause(struct domain *d)
>>>>> {
>>>>>   ASSERT(d != current->domain);
>>>>> -    do_domain_pause(d, vcpu_sleep_sync);
>>>>> +    _domain_pause(d, true /* sync */);
>>>> Same here.
>>> 
>>> ... comments like this one are pretty useful to disambiguate the plain
>>> "true" or "false" (without the reader needing to go look at the function
>>> declaration or definition).
>> 
>> I agree with that but the comment here is useful, it could be added before
>> the call instead of inside it.
> 
> Except the form Andrew has used is the one we've been using elsewhere
> for some time.

I know I found some other examples and that why I say “should” and not must.
If other consider that this is the right way to go and should not be changed this
is ok with me but I wanted to make the comment as this could ease the work
with FuSa and Misra compliance in the future.

Bertrand

> 
> Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 11:23           ` Bertrand Marquis
@ 2021-11-15 14:11             ` Julien Grall
  2021-11-15 14:45               ` Bertrand Marquis
  0 siblings, 1 reply; 34+ messages in thread
From: Julien Grall @ 2021-11-15 14:11 UTC (permalink / raw)
  To: Bertrand Marquis, Jan Beulich
  Cc: Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Andrew Cooper

Hi Bertrand,

On 15/11/2021 11:23, Bertrand Marquis wrote:
>> On 15 Nov 2021, at 10:55, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 15.11.2021 11:23, Bertrand Marquis wrote:
>>> Hi Jan,
>>>
>>>> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 15.11.2021 11:13, Bertrand Marquis wrote:
>>>>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>>> --- a/xen/common/domain.c
>>>>>> +++ b/xen/common/domain.c
>>>>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>>>>>    return 0;
>>>>>> }
>>>>>>
>>>>>> -static void do_domain_pause(struct domain *d,
>>>>>> -                            void (*sleep_fn)(struct vcpu *v))
>>>>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>>>>>
>>>>> Here you use comments inside the function definition.
>>>>> I think this is something that should be avoided and in this specific case a
>>>>> boolean sync is clear enough not to need to explain that false is nosing.
>>>>
>>>> While I agree the comment here isn't overly useful, I think ...
>>>>
>>>>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>>>>>> void domain_pause(struct domain *d)
>>>>>> {
>>>>>>    ASSERT(d != current->domain);
>>>>>> -    do_domain_pause(d, vcpu_sleep_sync);
>>>>>> +    _domain_pause(d, true /* sync */);
>>>>> Same here.
>>>>
>>>> ... comments like this one are pretty useful to disambiguate the plain
>>>> "true" or "false" (without the reader needing to go look at the function
>>>> declaration or definition).
>>>
>>> I agree with that but the comment here is useful, it could be added before
>>> the call instead of inside it.
>>
>> Except the form Andrew has used is the one we've been using elsewhere
>> for some time.
> 
> I know I found some other examples and that why I say “should” and not must.
> If other consider that this is the right way to go and should not be changed this
> is ok with me

Adding the comment after the parameter is a lot easier to read.

What is Misra/FuSA trying to solve by preventing to comment just after 
the parameters?

> but I wanted to make the comment as this could ease the work
> with FuSa and Misra compliance in the future.

This will need to be part of a larger discussion on how the community 
want to integrate FuSa/Misra rules. I do expect a few of the rules may 
be quite controversial to adopt (like the one above) and therefore we 
would need to discuss the pros/cons of them.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 14:11             ` Julien Grall
@ 2021-11-15 14:45               ` Bertrand Marquis
  0 siblings, 0 replies; 34+ messages in thread
From: Bertrand Marquis @ 2021-11-15 14:45 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Andrew Cooper

Hi,

> On 15 Nov 2021, at 14:11, Julien Grall <julien@xen.org> wrote:
> 
> Hi Bertrand,
> 
> On 15/11/2021 11:23, Bertrand Marquis wrote:
>>> On 15 Nov 2021, at 10:55, Jan Beulich <jbeulich@suse.com> wrote:
>>> 
>>> On 15.11.2021 11:23, Bertrand Marquis wrote:
>>>> Hi Jan,
>>>> 
>>>>> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
>>>>> 
>>>>> On 15.11.2021 11:13, Bertrand Marquis wrote:
>>>>>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>>>> --- a/xen/common/domain.c
>>>>>>> +++ b/xen/common/domain.c
>>>>>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>>>>>>   return 0;
>>>>>>> }
>>>>>>> 
>>>>>>> -static void do_domain_pause(struct domain *d,
>>>>>>> -                            void (*sleep_fn)(struct vcpu *v))
>>>>>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>>>>>> 
>>>>>> Here you use comments inside the function definition.
>>>>>> I think this is something that should be avoided and in this specific case a
>>>>>> boolean sync is clear enough not to need to explain that false is nosing.
>>>>> 
>>>>> While I agree the comment here isn't overly useful, I think ...
>>>>> 
>>>>>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
>>>>>>> void domain_pause(struct domain *d)
>>>>>>> {
>>>>>>>   ASSERT(d != current->domain);
>>>>>>> -    do_domain_pause(d, vcpu_sleep_sync);
>>>>>>> +    _domain_pause(d, true /* sync */);
>>>>>> Same here.
>>>>> 
>>>>> ... comments like this one are pretty useful to disambiguate the plain
>>>>> "true" or "false" (without the reader needing to go look at the function
>>>>> declaration or definition).
>>>> 
>>>> I agree with that but the comment here is useful, it could be added before
>>>> the call instead of inside it.
>>> 
>>> Except the form Andrew has used is the one we've been using elsewhere
>>> for some time.
>> I know I found some other examples and that why I say “should” and not must.
>> If other consider that this is the right way to go and should not be changed this
>> is ok with me
> 
> Adding the comment after the parameter is a lot easier to read.
> 
> What is Misra/FuSA trying to solve by preventing to comment just after the parameters?

I do not think Misra is always trying to prevent “bugs”, sometimes it is just trying to make
the code easier to read and review by making it always look the same. Here is to saying that
this should not be done but that comment should be written one or several lines using /* */ form.

> 
>> but I wanted to make the comment as this could ease the work
>> with FuSa and Misra compliance in the future.
> 
> This will need to be part of a larger discussion on how the community want to integrate FuSa/Misra rules. I do expect a few of the rules may be quite controversial to adopt (like the one above) and therefore we would need to discuss the pros/cons of them.

I definitely think this will require a discussion and maybe some more explanation from us on why.
Your comment asking “What is prevented” is interesting and I will try to keep that in mind when we start the discussion.


Cheers
Bertrand

> 
> Cheers,
> 
> -- 
> Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-11 18:15   ` Julien Grall
@ 2021-11-16  0:36     ` Stefano Stabellini
  2021-11-16  0:41       ` Andrew Cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2021-11-16  0:36 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

On Thu, 11 Nov 2021, Julien Grall wrote:
> On 11/11/2021 17:57, Andrew Cooper wrote:
> > There are exactly 3 callers of sort() in the hypervisor.
> > 
> > Both arm callers pass in NULL for the swap function.  While this might seem
> > like an attractive option at first, it causes generic_swap() to be used
> > which
> > forced a byte-wise copy.  Provide real swap functions which the compiler can
> > optimise sensibly.
> 
> I understand the theory, but none of the two calls are in hot paths or deal
> with large set on Arm. So I am rather hesitant to introduce specialised swap
> in each case as it doesn't seem we will gain much from this change.

While I like Andrew's optimization, like Julien, I would also rather not
have to introduce specialized swap functions any time a sort() is
called. Why not keeping around generic_swap as extern gnu_inline in
sort.h as well so that the ARM callers can simply:

    sort(bootinfo.mem.bank, bootinfo.mem.nr_banks, sizeof(struct membank),
         cmp_memory_node, generic_swap);

?

That looks like a good option, although it would still result in a
small increase in bloat.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-16  0:36     ` Stefano Stabellini
@ 2021-11-16  0:41       ` Andrew Cooper
  2021-12-17 15:56         ` Andrew Cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-16  0:41 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Andrew Cooper, Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Volodymyr Babchuk, Bertrand Marquis

On 16/11/2021 00:36, Stefano Stabellini wrote:
> On Thu, 11 Nov 2021, Julien Grall wrote:
>> On 11/11/2021 17:57, Andrew Cooper wrote:
>>> There are exactly 3 callers of sort() in the hypervisor.
>>>
>>> Both arm callers pass in NULL for the swap function.  While this might seem
>>> like an attractive option at first, it causes generic_swap() to be used
>>> which
>>> forced a byte-wise copy.  Provide real swap functions which the compiler can
>>> optimise sensibly.
>> I understand the theory, but none of the two calls are in hot paths or deal
>> with large set on Arm. So I am rather hesitant to introduce specialised swap
>> in each case as it doesn't seem we will gain much from this change.
> While I like Andrew's optimization, like Julien, I would also rather not
> have to introduce specialized swap functions any time a sort() is
> called. Why not keeping around generic_swap as extern gnu_inline in
> sort.h as well so that the ARM callers can simply:
>
>     sort(bootinfo.mem.bank, bootinfo.mem.nr_banks, sizeof(struct membank),
>          cmp_memory_node, generic_swap);
>
> ?
>
> That looks like a good option, although it would still result in a
> small increase in bloat.

That is basically what Jan suggested.

I'm still unconvinced that you actually want to be doing byte-wise
swapping, even if it isn't a hotpath.  A custom swap function will
compile to less code than using generic_swap() like this, and run faster.

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-15 10:55         ` Jan Beulich
  2021-11-15 11:23           ` Bertrand Marquis
@ 2021-11-16  0:41           ` Stefano Stabellini
  2021-11-16  7:15             ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Stefano Stabellini @ 2021-11-16  0:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Bertrand Marquis, Xen-devel, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Andrew Cooper

On Mon, 15 Nov 2021, Jan Beulich wrote:
> On 15.11.2021 11:23, Bertrand Marquis wrote:
> > Hi Jan,
> > 
> >> On 15 Nov 2021, at 10:20, Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 15.11.2021 11:13, Bertrand Marquis wrote:
> >>>> On 11 Nov 2021, at 17:57, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>>> --- a/xen/common/domain.c
> >>>> +++ b/xen/common/domain.c
> >>>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
> >>>>    return 0;
> >>>> }
> >>>>
> >>>> -static void do_domain_pause(struct domain *d,
> >>>> -                            void (*sleep_fn)(struct vcpu *v))
> >>>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
> >>>
> >>> Here you use comments inside the function definition.
> >>> I think this is something that should be avoided and in this specific case a
> >>> boolean sync is clear enough not to need to explain that false is nosing.
> >>
> >> While I agree the comment here isn't overly useful, I think ...
> >>
> >>>> @@ -1250,12 +1253,12 @@ static void do_domain_pause(struct domain *d,
> >>>> void domain_pause(struct domain *d)
> >>>> {
> >>>>    ASSERT(d != current->domain);
> >>>> -    do_domain_pause(d, vcpu_sleep_sync);
> >>>> +    _domain_pause(d, true /* sync */);
> >>> Same here.
> >>
> >> ... comments like this one are pretty useful to disambiguate the plain
> >> "true" or "false" (without the reader needing to go look at the function
> >> declaration or definition).
> > 
> > I agree with that but the comment here is useful, it could be added before
> > the call instead of inside it.
> 
> Except the form Andrew has used is the one we've been using elsewhere
> for some time.

Bertrand's comment about MISRA aside, this style of comments doesn't
seem to be covered/allowed by our current CODING_STYLE. It would be good
to keep the CODING_STYLE document up to date: not only it helps us
during reviews, it also helps contributors making sure they don't
violate our guidelines. Moreover CODING_STYLE will certainly turn out
useful when we are going to have MISRA discussion so that we can have an
up-to-date reference of what we currently do and support.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-16  0:41           ` Stefano Stabellini
@ 2021-11-16  7:15             ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2021-11-16  7:15 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Bertrand Marquis, Xen-devel, Roger Pau Monné,
	Wei Liu, Julien Grall, Volodymyr Babchuk, Andrew Cooper

On 16.11.2021 01:41, Stefano Stabellini wrote:
> Bertrand's comment about MISRA aside, this style of comments doesn't
> seem to be covered/allowed by our current CODING_STYLE. It would be good
> to keep the CODING_STYLE document up to date: not only it helps us
> during reviews, it also helps contributors making sure they don't
> violate our guidelines. Moreover CODING_STYLE will certainly turn out
> useful when we are going to have MISRA discussion so that we can have an
> up-to-date reference of what we currently do and support.

Well, yes. My track record of getting changes to this file accepted is
so poor that I'm afraid I'm not willing to make a patch.

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-12  9:57   ` Jan Beulich
@ 2021-11-17 23:31     ` Andrew Cooper
  0 siblings, 0 replies; 34+ messages in thread
From: Andrew Cooper @ 2021-11-17 23:31 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Xen-devel

On 12/11/2021 09:57, Jan Beulich wrote:
> On 11.11.2021 18:57, Andrew Cooper wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1234,15 +1234,18 @@ int vcpu_unpause_by_systemcontroller(struct vcpu *v)
>>       return 0;
>>   }
>>   
>> -static void do_domain_pause(struct domain *d,
>> -                            void (*sleep_fn)(struct vcpu *v))
>> +static void _domain_pause(struct domain *d, bool sync /* or nosync */)
>>   {
>>       struct vcpu *v;
>>   
>>       atomic_inc(&d->pause_count);
>>   
>> -    for_each_vcpu( d, v )
>> -        sleep_fn(v);
>> +    if ( sync )
>> +        for_each_vcpu ( d, v )
>> +            vcpu_sleep_sync(v);
>> +    else
>> +        for_each_vcpu ( d, v )
>> +            vcpu_sleep_nosync(v);
> Is this really better (for whichever reason) than
>
>      for_each_vcpu ( d, v )
>      {
>          if ( sync )
>              vcpu_sleep_sync(v);
>          else
>              vcpu_sleep_nosync(v);
>      }
>
> ?

Yes.  For cases where it can't be optimised out via constant 
propagation, it removes a conditional branch from the middle of a loop.  
I forget what the name for the compiler pass which does this is, but it 
makes a big difference given the way that L0 instruction caches and 
loop-stream-detectors/etc are build.

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-12 10:43   ` Jan Beulich
@ 2021-11-18  0:32     ` Andrew Cooper
  2021-11-18  9:06       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-18  0:32 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 12/11/2021 10:43, Jan Beulich wrote:
> On 11.11.2021 18:57, Andrew Cooper wrote:
>> Function pointers are expensive, and the raw parameter is a constant from all
>> callers, meaning that it predicts very well with local branch history.
> The code change is fine, but I'm having trouble with "all" here: Both
> functions aren't even static, so while callers in io_apic.c may
> benefit (perhaps with the exception of ioapic_{read,write}_entry(),
> depending on whether the compiler views inlining them as warranted),
> I'm in no way convinced this extends to the callers in VT-d code.
>
> Further ISTR clang being quite a bit less aggressive about inlining,
> so the effects might not be quite as good there even for the call
> sites in io_apic.c.
>
> Can you clarify this for me please?

The way the compiler lays out the code is unrelated to why this form is 
an improvement.

Branch history is a function of "the $N most recently taken branches".  
This is because "how you got here" is typically relevant to "where you 
should go next".

Trivial schemes maintain a shift register of taken / not-taken results.  
Less trivial schemes maintain a rolling hash of (src addr, dst addr) 
tuples of all taken branches (direct and indirect).  In both cases, the 
instantaneous branch history is an input into the final prediction, and 
is commonly used to select which saturating counter (or bank of 
counters) is used.

Consider something like

while ( cond )
{
     memcpy(dst1, src1, 64);
     memcpy(dst2, src2, 7);
}

Here, the conditional jump inside memcpy() coping with the tail of the 
copy flips result 50% of the time, which is fiendish to predict for.

However, because the branch history differs (by memcpy()'s return 
address which was accumulated by the call instruction), the predictor 
can actually use two different taken/not-taken counters for the two 
different "instances" if the tail jump.  After a few iterations to warm 
up, the predictor will get every jump perfect despite the fact that 
memcpy() is a library call and the branches would otherwise alias.


Bringing it back to the code in question.  The "raw" parameter is an 
explicit true or false at the top of all call paths leading into these 
functions.  Therefore, an individual branch history has a high 
correlation with said true or false, irrespective of the absolute code 
layout.  As a consequence, the correct result of the prediction is 
highly correlated with the branch history, and it will predict 
perfectly[1] after a few times the path has been used.

~Andrew

[1] Obviously, it's not actually perfect outside of a synthetic 
example.  Aliasing in the predictor is a necessary property of keeping 
the logic small enough to provide an answer fast, but the less 
accidental aliasing there is, the faster the CPU performance in 
benchmarks, so incentives are in our favour here.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-12  9:36   ` Julien Grall
@ 2021-11-18  1:47     ` Andrew Cooper
  2021-11-18  9:28       ` Julien Grall
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-11-18  1:47 UTC (permalink / raw)
  To: Julien Grall, Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

On 12/11/2021 09:36, Julien Grall wrote:
> On 11/11/2021 17:57, Andrew Cooper wrote:
>> Retpolines are expensive, and all these do are select between the 
>> sync and
>> nosync helpers.  Pass a boolean instead, and use direct calls 
>> everywhere.
>
> To be honest, I much prefer to read the old code. I am totally not 
> against the change but I can see how I would be ready to introduce new 
> function pointers use in the future.

Really?  The only reason there are function points to begin with was 
because of a far more naive (and far pre-spectre) me fixing a reference 
counting mess in 2014 by consolidating the logic.  My mistake was not 
spotting that the function pointers weren't actually necessary in the 
first place.

> So I think we need some guidelines on when to use function pointers in 
> Xen.

It's easy.  If you are in any way unsure, they're probably the wrong 
answer.  (Ok - I'm being a little facetious)

There are concrete security improvements from not using function 
pointers, demonstrated by fact that JOP/COP attacks are so pervasive 
that all major hardware and software vendors are working on techniques 
(both hardware and software) to prevent forward-edge control flow 
integrity violations.  (The mandate from the NSA to make this happen 
certainly helped spur things on, too.)

There are also concrete performance improvements too.  All competitive 
processors in the market today can cope with direct branches more 
efficiently than indirect branches, and a key principle behind 
profile-guided-optimsiation is to rearrange your `ptr()` function 
pointer call into `if ( ptr == a ) a(); else if ( ptr == b ) b(); else 
ptr()` based on the frequency of destinations, because this really does 
make orders of magnitude improvements in some cases.

We have some shockingly inappropriate uses of function pointers in Xen 
right now (patches 4 and 5 in particular, and "x86/hvm: Remove callback 
from paging->flush_tlb() hook" posted today).  While this specific 
example doesn't fall into shockingly inappropriate in my books, it is 
firmly in the "not appropriate" category.

> The more...
>
>>
>> Pause/unpause operations on behalf of dom0 are not fastpaths, so avoid
>> exposing the __domain_pause_by_systemcontroller() internal.
>>
>> This actually compiles smaller than before:
>
> ... the code doesn't really compile smaller on Arm:
>
> 42sh>  ../scripts/bloat-o-meter xen-syms-old xen-syms
>
> add/remove: 4/2 grow/shrink: 0/6 up/down: 272/-252 (20)
> Function old     new   delta
> _domain_pause                                  -     136    +136
> _domain_pause_by_systemcontroller              -     120    +120
> domain_pause_by_systemcontroller_nosync        -       8      +8
> domain_pause_by_systemcontroller               -       8      +8
> domain_resume                                136     132      -4
> domain_pause_nosync                           12       8      -4
> domain_pause                                  12       8      -4
> domain_pause_except_self                     188     180      -8
> do_domctl                                   5480    5472      -8
> domain_kill                                  372     356     -16
> do_domain_pause                               88       -     -88
> __domain_pause_by_systemcontroller           120       -    -120
> Total: Before=966919, After=966939, chg +0.00%


ARM, like x86, compiles for speed, not size.  "it got a bit larger" is 
generally not as interesting as "it got smaller, despite everything the 
compiler would normally do in the opposite direction".  Obviously, if 
the build of Xen got 10x larger then we'd have a problem, but it hasn't.

Conversely, if we were compiling for size not speed, then "it got 
smaller" is the uninteresting direction.

The truth is that Xen (both x86 and ARM) is a couple of megabytes in RAM 
and this literally doesn't matter these days where RAM is measured in GB 
and TB.  We care to an extent for efficient use of the cache/etc, but 
noone would bat an eyelid at several MB more for a want-to-have feature.


Here, you're saying that for an added 5 instructions, totalling 0.00% 
delta in the size of the hypervisor, you've got some reasonably frequent 
codepaths which will execute faster, and cannot be hijacked with a 
JOP/COP attack.

That's a clear all-around improvement on ARM, just like it is on x86.

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-18  0:32     ` Andrew Cooper
@ 2021-11-18  9:06       ` Jan Beulich
  2021-11-18  9:07         ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2021-11-18  9:06 UTC (permalink / raw)
  To: Andrew Cooper, Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 18.11.2021 01:32, Andrew Cooper wrote:
> On 12/11/2021 10:43, Jan Beulich wrote:
>> On 11.11.2021 18:57, Andrew Cooper wrote:
>>> Function pointers are expensive, and the raw parameter is a constant from all
>>> callers, meaning that it predicts very well with local branch history.
>> The code change is fine, but I'm having trouble with "all" here: Both
>> functions aren't even static, so while callers in io_apic.c may
>> benefit (perhaps with the exception of ioapic_{read,write}_entry(),
>> depending on whether the compiler views inlining them as warranted),
>> I'm in no way convinced this extends to the callers in VT-d code.
>>
>> Further ISTR clang being quite a bit less aggressive about inlining,
>> so the effects might not be quite as good there even for the call
>> sites in io_apic.c.
>>
>> Can you clarify this for me please?
> 
> The way the compiler lays out the code is unrelated to why this form is 
> an improvement.
> 
> Branch history is a function of "the $N most recently taken branches".  
> This is because "how you got here" is typically relevant to "where you 
> should go next".
> 
> Trivial schemes maintain a shift register of taken / not-taken results.  
> Less trivial schemes maintain a rolling hash of (src addr, dst addr) 
> tuples of all taken branches (direct and indirect).  In both cases, the 
> instantaneous branch history is an input into the final prediction, and 
> is commonly used to select which saturating counter (or bank of 
> counters) is used.
> 
> Consider something like
> 
> while ( cond )
> {
>      memcpy(dst1, src1, 64);
>      memcpy(dst2, src2, 7);
> }
> 
> Here, the conditional jump inside memcpy() coping with the tail of the 
> copy flips result 50% of the time, which is fiendish to predict for.
> 
> However, because the branch history differs (by memcpy()'s return 
> address which was accumulated by the call instruction), the predictor 
> can actually use two different taken/not-taken counters for the two 
> different "instances" if the tail jump.  After a few iterations to warm 
> up, the predictor will get every jump perfect despite the fact that 
> memcpy() is a library call and the branches would otherwise alias.
> 
> 
> Bringing it back to the code in question.  The "raw" parameter is an 
> explicit true or false at the top of all call paths leading into these 
> functions.  Therefore, an individual branch history has a high 
> correlation with said true or false, irrespective of the absolute code 
> layout.  As a consequence, the correct result of the prediction is 
> highly correlated with the branch history, and it will predict 
> perfectly[1] after a few times the path has been used.

Thanks a lot for the explanation. May I suggest to make this less
ambiguous in the description, e.g. by saying "the raw parameter is a
constant at the root of all call trees"?

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-18  9:06       ` Jan Beulich
@ 2021-11-18  9:07         ` Jan Beulich
  2021-11-18 17:33           ` Andrew Cooper
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2021-11-18  9:07 UTC (permalink / raw)
  To: Andrew Cooper, Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 18.11.2021 10:06, Jan Beulich wrote:
> On 18.11.2021 01:32, Andrew Cooper wrote:
>> On 12/11/2021 10:43, Jan Beulich wrote:
>>> On 11.11.2021 18:57, Andrew Cooper wrote:
>>>> Function pointers are expensive, and the raw parameter is a constant from all
>>>> callers, meaning that it predicts very well with local branch history.
>>> The code change is fine, but I'm having trouble with "all" here: Both
>>> functions aren't even static, so while callers in io_apic.c may
>>> benefit (perhaps with the exception of ioapic_{read,write}_entry(),
>>> depending on whether the compiler views inlining them as warranted),
>>> I'm in no way convinced this extends to the callers in VT-d code.
>>>
>>> Further ISTR clang being quite a bit less aggressive about inlining,
>>> so the effects might not be quite as good there even for the call
>>> sites in io_apic.c.
>>>
>>> Can you clarify this for me please?
>>
>> The way the compiler lays out the code is unrelated to why this form is 
>> an improvement.
>>
>> Branch history is a function of "the $N most recently taken branches".  
>> This is because "how you got here" is typically relevant to "where you 
>> should go next".
>>
>> Trivial schemes maintain a shift register of taken / not-taken results.  
>> Less trivial schemes maintain a rolling hash of (src addr, dst addr) 
>> tuples of all taken branches (direct and indirect).  In both cases, the 
>> instantaneous branch history is an input into the final prediction, and 
>> is commonly used to select which saturating counter (or bank of 
>> counters) is used.
>>
>> Consider something like
>>
>> while ( cond )
>> {
>>      memcpy(dst1, src1, 64);
>>      memcpy(dst2, src2, 7);
>> }
>>
>> Here, the conditional jump inside memcpy() coping with the tail of the 
>> copy flips result 50% of the time, which is fiendish to predict for.
>>
>> However, because the branch history differs (by memcpy()'s return 
>> address which was accumulated by the call instruction), the predictor 
>> can actually use two different taken/not-taken counters for the two 
>> different "instances" if the tail jump.  After a few iterations to warm 
>> up, the predictor will get every jump perfect despite the fact that 
>> memcpy() is a library call and the branches would otherwise alias.
>>
>>
>> Bringing it back to the code in question.  The "raw" parameter is an 
>> explicit true or false at the top of all call paths leading into these 
>> functions.  Therefore, an individual branch history has a high 
>> correlation with said true or false, irrespective of the absolute code 
>> layout.  As a consequence, the correct result of the prediction is 
>> highly correlated with the branch history, and it will predict 
>> perfectly[1] after a few times the path has been used.
> 
> Thanks a lot for the explanation. May I suggest to make this less
> ambiguous in the description, e.g. by saying "the raw parameter is a
> constant at the root of all call trees"?

Oh, forgot to say that then:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers
  2021-11-18  1:47     ` Andrew Cooper
@ 2021-11-18  9:28       ` Julien Grall
  0 siblings, 0 replies; 34+ messages in thread
From: Julien Grall @ 2021-11-18  9:28 UTC (permalink / raw)
  To: Andrew Cooper, Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

Hi Andrew,

On 18/11/2021 01:47, Andrew Cooper wrote:
> On 12/11/2021 09:36, Julien Grall wrote:
>> On 11/11/2021 17:57, Andrew Cooper wrote:
>>> Retpolines are expensive, and all these do are select between the 
>>> sync and
>>> nosync helpers.  Pass a boolean instead, and use direct calls 
>>> everywhere.
>>
>> To be honest, I much prefer to read the old code. I am totally not 
>> against the change but I can see how I would be ready to introduce new 
>> function pointers use in the future.
> 
> Really?  The only reason there are function points to begin with was 
> because of a far more naive (and far pre-spectre) me fixing a reference 
> counting mess in 2014 by consolidating the logic.  My mistake was not 
> spotting that the function pointers weren't actually necessary in the 
> first place.
> 
>> So I think we need some guidelines on when to use function pointers in 
>> Xen.
> 
> It's easy.  If you are in any way unsure, they're probably the wrong 
> answer.  (Ok - I'm being a little facetious)
> 
> There are concrete security improvements from not using function 
> pointers, demonstrated by fact that JOP/COP attacks are so pervasive 
> that all major hardware and software vendors are working on techniques 
> (both hardware and software) to prevent forward-edge control flow 
> integrity violations.  (The mandate from the NSA to make this happen 
> certainly helped spur things on, too.)
> 
> There are also concrete performance improvements too.  All competitive 
> processors in the market today can cope with direct branches more 
> efficiently than indirect branches, and a key principle behind 
> profile-guided-optimsiation is to rearrange your `ptr()` function 
> pointer call into `if ( ptr == a ) a(); else if ( ptr == b ) b(); else 
> ptr()` based on the frequency of destinations, because this really does 
> make orders of magnitude improvements in some cases.
> 
> We have some shockingly inappropriate uses of function pointers in Xen 
> right now (patches 4 and 5 in particular, and "x86/hvm: Remove callback 
> from paging->flush_tlb() hook" posted today).  While this specific 
> example doesn't fall into shockingly inappropriate in my books, it is 
> firmly in the "not appropriate" category.

Thanks for the full explanation. What I am looking for is an update of 
CODING_STYLE to make clear the function pointers should be avoided in 
Xen and when we would accept them.

>>> This actually compiles smaller than before:
>>
>> ... the code doesn't really compile smaller on Arm:
>>
>> 42sh>  ../scripts/bloat-o-meter xen-syms-old xen-syms
>>
>> add/remove: 4/2 grow/shrink: 0/6 up/down: 272/-252 (20)
>> Function old     new   delta
>> _domain_pause                                  -     136    +136
>> _domain_pause_by_systemcontroller              -     120    +120
>> domain_pause_by_systemcontroller_nosync        -       8      +8
>> domain_pause_by_systemcontroller               -       8      +8
>> domain_resume                                136     132      -4
>> domain_pause_nosync                           12       8      -4
>> domain_pause                                  12       8      -4
>> domain_pause_except_self                     188     180      -8
>> do_domctl                                   5480    5472      -8
>> domain_kill                                  372     356     -16
>> do_domain_pause                               88       -     -88
>> __domain_pause_by_systemcontroller           120       -    -120
>> Total: Before=966919, After=966939, chg +0.00%
> 
> 
> ARM, like x86, compiles for speed, not size.  "it got a bit larger" is 
> generally not as interesting as "it got smaller, despite everything the 
> compiler would normally do in the opposite direction".

My point is you have a generic section "this compiles smaller" section 
in your commit message when in fact this was only tested with one x86 
compiler version.

So at the minimum you should specify the version/architecture because 
otherwise this sounds like you claim smaller Xen for everyone.

But to be honest, I don't really see the value to mention them as this 
is depending on your compiler (e.g. it may be bigger or smaller) and as 
you wrote it yourself "you're saying that for an removed 5 instructions".

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry()
  2021-11-18  9:07         ` Jan Beulich
@ 2021-11-18 17:33           ` Andrew Cooper
  0 siblings, 0 replies; 34+ messages in thread
From: Andrew Cooper @ 2021-11-18 17:33 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: Roger Pau Monné, Wei Liu, Xen-devel

On 18/11/2021 09:07, Jan Beulich wrote:
> On 18.11.2021 10:06, Jan Beulich wrote:
>> On 18.11.2021 01:32, Andrew Cooper wrote:
>>> On 12/11/2021 10:43, Jan Beulich wrote:
>>>> On 11.11.2021 18:57, Andrew Cooper wrote:
>>>>> Function pointers are expensive, and the raw parameter is a constant from all
>>>>> callers, meaning that it predicts very well with local branch history.
>>>> The code change is fine, but I'm having trouble with "all" here: Both
>>>> functions aren't even static, so while callers in io_apic.c may
>>>> benefit (perhaps with the exception of ioapic_{read,write}_entry(),
>>>> depending on whether the compiler views inlining them as warranted),
>>>> I'm in no way convinced this extends to the callers in VT-d code.
>>>>
>>>> Further ISTR clang being quite a bit less aggressive about inlining,
>>>> so the effects might not be quite as good there even for the call
>>>> sites in io_apic.c.
>>>>
>>>> Can you clarify this for me please?
>>> The way the compiler lays out the code is unrelated to why this form is 
>>> an improvement.
>>>
>>> Branch history is a function of "the $N most recently taken branches".  
>>> This is because "how you got here" is typically relevant to "where you 
>>> should go next".
>>>
>>> Trivial schemes maintain a shift register of taken / not-taken results.  
>>> Less trivial schemes maintain a rolling hash of (src addr, dst addr) 
>>> tuples of all taken branches (direct and indirect).  In both cases, the 
>>> instantaneous branch history is an input into the final prediction, and 
>>> is commonly used to select which saturating counter (or bank of 
>>> counters) is used.
>>>
>>> Consider something like
>>>
>>> while ( cond )
>>> {
>>>      memcpy(dst1, src1, 64);
>>>      memcpy(dst2, src2, 7);
>>> }
>>>
>>> Here, the conditional jump inside memcpy() coping with the tail of the 
>>> copy flips result 50% of the time, which is fiendish to predict for.
>>>
>>> However, because the branch history differs (by memcpy()'s return 
>>> address which was accumulated by the call instruction), the predictor 
>>> can actually use two different taken/not-taken counters for the two 
>>> different "instances" if the tail jump.  After a few iterations to warm 
>>> up, the predictor will get every jump perfect despite the fact that 
>>> memcpy() is a library call and the branches would otherwise alias.
>>>
>>>
>>> Bringing it back to the code in question.  The "raw" parameter is an 
>>> explicit true or false at the top of all call paths leading into these 
>>> functions.  Therefore, an individual branch history has a high 
>>> correlation with said true or false, irrespective of the absolute code 
>>> layout.  As a consequence, the correct result of the prediction is 
>>> highly correlated with the branch history, and it will predict 
>>> perfectly[1] after a few times the path has been used.
>> Thanks a lot for the explanation. May I suggest to make this less
>> ambiguous in the description, e.g. by saying "the raw parameter is a
>> constant at the root of all call trees"?

Done.

> Oh, forgot to say that then:
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-11-16  0:41       ` Andrew Cooper
@ 2021-12-17 15:56         ` Andrew Cooper
  2021-12-17 16:15           ` Julien Grall
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2021-12-17 15:56 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Andrew Cooper, Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Volodymyr Babchuk, Bertrand Marquis

On 16/11/2021 00:41, Andrew Cooper wrote:
> On 16/11/2021 00:36, Stefano Stabellini wrote:
>> On Thu, 11 Nov 2021, Julien Grall wrote:
>>> On 11/11/2021 17:57, Andrew Cooper wrote:
>>>> There are exactly 3 callers of sort() in the hypervisor.
>>>>
>>>> Both arm callers pass in NULL for the swap function.  While this might seem
>>>> like an attractive option at first, it causes generic_swap() to be used
>>>> which
>>>> forced a byte-wise copy.  Provide real swap functions which the compiler can
>>>> optimise sensibly.
>>> I understand the theory, but none of the two calls are in hot paths or deal
>>> with large set on Arm. So I am rather hesitant to introduce specialised swap
>>> in each case as it doesn't seem we will gain much from this change.
>> While I like Andrew's optimization, like Julien, I would also rather not
>> have to introduce specialized swap functions any time a sort() is
>> called. Why not keeping around generic_swap as extern gnu_inline in
>> sort.h as well so that the ARM callers can simply:
>>
>>     sort(bootinfo.mem.bank, bootinfo.mem.nr_banks, sizeof(struct membank),
>>          cmp_memory_node, generic_swap);
>>
>> ?
>>
>> That looks like a good option, although it would still result in a
>> small increase in bloat.
> That is basically what Jan suggested.
>
> I'm still unconvinced that you actually want to be doing byte-wise
> swapping, even if it isn't a hotpath.  A custom swap function will
> compile to less code than using generic_swap() like this, and run faster.

ARM Ping.

I really think this is a bad course of action.  If you're going to
insist on it, I want something in writing.

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] xen/sort: Switch to an extern inline implementation
  2021-12-17 15:56         ` Andrew Cooper
@ 2021-12-17 16:15           ` Julien Grall
  0 siblings, 0 replies; 34+ messages in thread
From: Julien Grall @ 2021-12-17 16:15 UTC (permalink / raw)
  To: Andrew Cooper, Stefano Stabellini
  Cc: Andrew Cooper, Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Volodymyr Babchuk, Bertrand Marquis

Hi Andrew,

On 17/12/2021 15:56, Andrew Cooper wrote:
> On 16/11/2021 00:41, Andrew Cooper wrote:
>> On 16/11/2021 00:36, Stefano Stabellini wrote:
>>> On Thu, 11 Nov 2021, Julien Grall wrote:
>>>> On 11/11/2021 17:57, Andrew Cooper wrote:
>>>>> There are exactly 3 callers of sort() in the hypervisor.
>>>>>
>>>>> Both arm callers pass in NULL for the swap function.  While this might seem
>>>>> like an attractive option at first, it causes generic_swap() to be used
>>>>> which
>>>>> forced a byte-wise copy.  Provide real swap functions which the compiler can
>>>>> optimise sensibly.
>>>> I understand the theory, but none of the two calls are in hot paths or deal
>>>> with large set on Arm. So I am rather hesitant to introduce specialised swap
>>>> in each case as it doesn't seem we will gain much from this change.
>>> While I like Andrew's optimization, like Julien, I would also rather not
>>> have to introduce specialized swap functions any time a sort() is
>>> called. Why not keeping around generic_swap as extern gnu_inline in
>>> sort.h as well so that the ARM callers can simply:
>>>
>>>      sort(bootinfo.mem.bank, bootinfo.mem.nr_banks, sizeof(struct membank),
>>>           cmp_memory_node, generic_swap);
>>>
>>> ?
>>>
>>> That looks like a good option, although it would still result in a
>>> small increase in bloat.
>> That is basically what Jan suggested.
>>
>> I'm still unconvinced that you actually want to be doing byte-wise
>> swapping, even if it isn't a hotpath.  A custom swap function will
>> compile to less code than using generic_swap() like this, and run faster.
> 
> ARM Ping.
> 
> I really think this is a bad course of action.  If you're going to
> insist on it, I want something in writing.

I agree with all the points you wrote. However, you also have to put 
into perspective how this is used. On Arm, the two callers happen either 
during boot a domain creation. Neither of them have a significant impact 
on the time spent. In fact, I would be surprised if you notice the change.

So unless there are other (good) reasons to suppress generic_swap(), I 
still want to be able to use generic_swap() when the performance doesn't 
matter (like the two Arm callers).

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-12-17 16:15 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-11 17:57 [PATCH 0/5] xen: various function pointer cleanups Andrew Cooper
2021-11-11 17:57 ` [PATCH 1/5] xen/domain: Remove function pointers from domain pause helpers Andrew Cooper
2021-11-12  9:36   ` Julien Grall
2021-11-18  1:47     ` Andrew Cooper
2021-11-18  9:28       ` Julien Grall
2021-11-12  9:57   ` Jan Beulich
2021-11-17 23:31     ` Andrew Cooper
2021-11-15 10:13   ` Bertrand Marquis
2021-11-15 10:20     ` Jan Beulich
2021-11-15 10:23       ` Bertrand Marquis
2021-11-15 10:55         ` Jan Beulich
2021-11-15 11:23           ` Bertrand Marquis
2021-11-15 14:11             ` Julien Grall
2021-11-15 14:45               ` Bertrand Marquis
2021-11-16  0:41           ` Stefano Stabellini
2021-11-16  7:15             ` Jan Beulich
2021-11-11 17:57 ` [PATCH 2/5] xen/domain: Improve pirq handling Andrew Cooper
2021-11-12 10:16   ` Jan Beulich
2021-11-11 17:57 ` [PATCH 3/5] xen/sort: Switch to an extern inline implementation Andrew Cooper
2021-11-11 18:15   ` Julien Grall
2021-11-16  0:36     ` Stefano Stabellini
2021-11-16  0:41       ` Andrew Cooper
2021-12-17 15:56         ` Andrew Cooper
2021-12-17 16:15           ` Julien Grall
2021-11-12  9:39   ` Julien Grall
2021-11-12 10:25   ` Jan Beulich
2021-11-11 17:57 ` [PATCH 4/5] xen/wait: Remove indirect jump Andrew Cooper
2021-11-12 10:35   ` Jan Beulich
2021-11-11 17:57 ` [PATCH 5/5] x86/ioapic: Drop function pointers from __ioapic_{read,write}_entry() Andrew Cooper
2021-11-12 10:43   ` Jan Beulich
2021-11-18  0:32     ` Andrew Cooper
2021-11-18  9:06       ` Jan Beulich
2021-11-18  9:07         ` Jan Beulich
2021-11-18 17:33           ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.