* [PATCH 0/3] xen/domain: More structured teardown
@ 2020-12-21 18:14 Andrew Cooper
  2020-12-21 18:14 ` [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create() Andrew Cooper
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 18:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk

Follow up from both the investigation leading to "Fix memory leak in
vcpu_create() error path", and the older work to deduplicate the domain
teardown paths.

Andrew Cooper (3):
  xen/domain: Reorder trivial initialisation in early domain_create()
  xen/domain: Introduce domain_teardown()
  xen/evtchn: Clean up teardown handling

 xen/common/domain.c        | 113 ++++++++++++++++++++++++++++++++++-----------
 xen/common/event_channel.c |   8 ++--
 xen/include/xen/sched.h    |  12 ++++-
 3 files changed, 101 insertions(+), 32 deletions(-)

-- 
2.11.0




* [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create()
  2020-12-21 18:14 [PATCH 0/3] xen/domain: More structured teardown Andrew Cooper
@ 2020-12-21 18:14 ` Andrew Cooper
  2020-12-22 10:10   ` Jan Beulich
  2020-12-21 18:14 ` [PATCH 2/3] xen/domain: Introduce domain_teardown() Andrew Cooper
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 18:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk

This improves the robustness of the error paths.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
---
 xen/common/domain.c | 41 ++++++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 5ec48c3e19..ce3667f1b4 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -391,25 +391,7 @@ struct domain *domain_create(domid_t domid,
 
     TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
 
-    /*
-     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
-     * resources want to be sized based on max_vcpus.
-     */
-    if ( !is_system_domain(d) )
-    {
-        err = -ENOMEM;
-        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
-        if ( !d->vcpu )
-            goto fail;
-
-        d->max_vcpus = config->max_vcpus;
-    }
-
-    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
-
-    if ( (err = xsm_alloc_security_domain(d)) != 0 )
-        goto fail;
-
+    /* Trivial initialisation. */
     atomic_set(&d->refcnt, 1);
     RCU_READ_LOCK_INIT(&d->rcu_lock);
     spin_lock_init_prof(d, domain_lock);
@@ -434,6 +416,27 @@ struct domain *domain_create(domid_t domid,
     INIT_LIST_HEAD(&d->pdev_list);
 #endif
 
+    /* All error paths can depend on the above setup. */
+
+    /*
+     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
+     * resources want to be sized based on max_vcpus.
+     */
+    if ( !is_system_domain(d) )
+    {
+        err = -ENOMEM;
+        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
+        if ( !d->vcpu )
+            goto fail;
+
+        d->max_vcpus = config->max_vcpus;
+    }
+
+    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
+
+    if ( (err = xsm_alloc_security_domain(d)) != 0 )
+        goto fail;
+
     err = -ENOMEM;
     if ( !zalloc_cpumask_var(&d->dirty_cpumask) )
         goto fail;
-- 
2.11.0




* [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-21 18:14 [PATCH 0/3] xen/domain: More structured teardown Andrew Cooper
  2020-12-21 18:14 ` [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create() Andrew Cooper
@ 2020-12-21 18:14 ` Andrew Cooper
  2020-12-21 18:36   ` Julien Grall
  2020-12-22 10:35   ` Jan Beulich
  2020-12-21 18:14 ` [PATCH 3/3] xen/evtchn: Clean up teardown handling Andrew Cooper
  2020-12-21 19:36 ` Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown) Andrew Cooper
  3 siblings, 2 replies; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 18:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk

There is no common equivelent of domain_reliquish_resources(), which has
caused various pieces of common cleanup to live in inappropriate
places.

Perhaps most obviously, evtchn_destroy() is called for every continuation of
domain_reliquish_resources(), which can easily be thousands of times.

Create domain_teardown() to be a new top level facility, and call it from the
appropriate positions in domain_kill() and domain_create()'s error path.

No change in behaviour yet.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
---
 xen/common/domain.c     | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 +++++++
 2 files changed, 67 insertions(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index ce3667f1b4..ef1987335b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -273,6 +273,59 @@ static int __init parse_extra_guest_irqs(const char *s)
 custom_param("extra_guest_irqs", parse_extra_guest_irqs);
 
 /*
+ * Release resources held by a domain.  There may or may not be live
+ * references to the domain, and it may or may not be fully constructed.
+ *
+ * d->is_dying differing between DOMDYING_dying and DOMDYING_dead can be used
+ * to determine if live references to the domain exist, and also whether
+ * continuations are permitted.
+ *
+ * If d->is_dying is DOMDYING_dead, this must not return non-zero.
+ */
+static int domain_teardown(struct domain *d)
+{
+    BUG_ON(!d->is_dying);
+
+    /*
+     * This hypercall can take minutes of wallclock time to complete.  This
+     * logic implements a co-routine, stashing state in struct domain across
+     * hypercall continuation boundaries.
+     */
+    switch ( d->teardown.val )
+    {
+        /*
+         * Record the current progress.  Subsequent hypercall continuations
+         * will logically restart work from this point.
+         *
+         * PROGRESS() markers must not be in the middle of loops.  The loop
+         * variable isn't preserved across a continuation.
+         *
+         * To avoid redundant work, there should be a marker before each
+         * function which may return -ERESTART.
+         */
+#define PROGRESS(x)                             \
+        d->teardown.val = PROG_ ## x;           \
+        /* Fallthrough */                       \
+    case PROG_ ## x
+
+        enum {
+            PROG_done = 1,
+        };
+
+    case 0:
+    PROGRESS(done):
+        break;
+
+#undef PROGRESS
+
+    default:
+        BUG();
+    }
+
+    return 0;
+}
+
+/*
  * Destroy a domain once all references to it have been dropped.  Used either
  * from the RCU path, or from the domain_create() error path before the domain
  * is inserted into the domlist.
@@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
     if ( init_status & INIT_watchdog )
         watchdog_domain_destroy(d);
 
+    /* Must not hit a continuation in this context. */
+    ASSERT(domain_teardown(d) == 0);
+
     _domain_destroy(d);
 
     return ERR_PTR(err);
@@ -733,6 +789,9 @@ int domain_kill(struct domain *d)
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
     case DOMDYING_dying:
+        rc = domain_teardown(d);
+        if ( rc )
+            break;
         rc = evtchn_destroy(d);
         if ( rc )
             break;
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index faf5fda36f..3f35c537b8 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -525,6 +525,14 @@ struct domain
     /* Argo interdomain communication support */
     struct argo_domain *argo;
 #endif
+
+    /*
+     * Continuation information for domain_teardown().  All fields entirely
+     * private.
+     */
+    struct {
+        unsigned int val;
+    } teardown;
 };
 
 static inline struct page_list_head *page_to_list(
-- 
2.11.0




* [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-21 18:14 [PATCH 0/3] xen/domain: More structured teardown Andrew Cooper
  2020-12-21 18:14 ` [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create() Andrew Cooper
  2020-12-21 18:14 ` [PATCH 2/3] xen/domain: Introduce domain_teardown() Andrew Cooper
@ 2020-12-21 18:14 ` Andrew Cooper
  2020-12-22 10:48   ` Jan Beulich
  2020-12-21 19:36 ` Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown) Andrew Cooper
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 18:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk

First of all, rename the evtchn APIs:
 * evtchn_destroy       => evtchn_teardown
 * evtchn_destroy_final => evtchn_destroy

Move both calls into appropriate positions in domain_teardown() and
_domain_destroy(), which avoids having different cleanup logic depending on
the cause of the cleanup.

In particular, this avoids evtchn_teardown() (previously named
evtchn_destroy()) being called redundantly thousands of times on a typical
XEN_DOMCTL_destroydomain hypercall.

No net change in behaviour.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>

RFC.  While testing this, I observed this, after faking up an -ENOMEM in
dom0's construction:

  (XEN) [2020-12-21 16:31:20] NX (Execute Disable) protection active
  (XEN) [2020-12-21 16:33:04]
  (XEN) [2020-12-21 16:33:04] ****************************************
  (XEN) [2020-12-21 16:33:04] Panic on CPU 0:
  (XEN) [2020-12-21 16:33:04] Error creating domain 0
  (XEN) [2020-12-21 16:33:04] ****************************************

XSA-344 appears to have added nearly 2 minutes of wallclock time into the
domain_create() error path, which isn't ok.

Considering that event channels haven't even been initialised in this
particular scenario, it ought to take ~0 time.  Even if event channels have
been initialised, none can be active as the domain isn't visible to the system.
---
 xen/common/domain.c        | 17 ++++++++---------
 xen/common/event_channel.c |  8 ++++----
 xen/include/xen/sched.h    |  4 ++--
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index ef1987335b..701747b9d9 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -284,6 +284,8 @@ custom_param("extra_guest_irqs", parse_extra_guest_irqs);
  */
 static int domain_teardown(struct domain *d)
 {
+    int rc;
+
     BUG_ON(!d->is_dying);
 
     /*
@@ -313,6 +315,10 @@ static int domain_teardown(struct domain *d)
         };
 
     case 0:
+        rc = evtchn_teardown(d);
+        if ( rc )
+            return rc;
+
     PROGRESS(done):
         break;
 
@@ -335,6 +341,8 @@ static void _domain_destroy(struct domain *d)
     BUG_ON(!d->is_dying);
     BUG_ON(atomic_read(&d->refcnt) != DOMAIN_DESTROYED);
 
+    evtchn_destroy(d);
+
     xfree(d->pbuf);
 
     argo_destroy(d);
@@ -598,11 +606,7 @@ struct domain *domain_create(domid_t domid,
     if ( init_status & INIT_gnttab )
         grant_table_destroy(d);
     if ( init_status & INIT_evtchn )
-    {
-        evtchn_destroy(d);
-        evtchn_destroy_final(d);
         radix_tree_destroy(&d->pirq_tree, free_pirq_struct);
-    }
     if ( init_status & INIT_watchdog )
         watchdog_domain_destroy(d);
 
@@ -792,9 +796,6 @@ int domain_kill(struct domain *d)
         rc = domain_teardown(d);
         if ( rc )
             break;
-        rc = evtchn_destroy(d);
-        if ( rc )
-            break;
         rc = domain_relinquish_resources(d);
         if ( rc != 0 )
             break;
@@ -987,8 +988,6 @@ static void complete_domain_destroy(struct rcu_head *head)
     if ( d->target != NULL )
         put_domain(d->target);
 
-    evtchn_destroy_final(d);
-
     radix_tree_destroy(&d->pirq_tree, free_pirq_struct);
 
     xfree(d->vcpu);
diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index 4a48094356..c1af54eed5 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -1401,7 +1401,7 @@ void free_xen_event_channel(struct domain *d, int port)
     {
         /*
          * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing
-         * with the spin_barrier() and BUG_ON() in evtchn_destroy().
+         * with the spin_barrier() and BUG_ON() in evtchn_teardown().
          */
         smp_rmb();
         BUG_ON(!d->is_dying);
@@ -1421,7 +1421,7 @@ void notify_via_xen_event_channel(struct domain *ld, int lport)
     {
         /*
          * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing
-         * with the spin_barrier() and BUG_ON() in evtchn_destroy().
+         * with the spin_barrier() and BUG_ON() in evtchn_teardown().
          */
         smp_rmb();
         ASSERT(ld->is_dying);
@@ -1499,7 +1499,7 @@ int evtchn_init(struct domain *d, unsigned int max_port)
     return 0;
 }
 
-int evtchn_destroy(struct domain *d)
+int evtchn_teardown(struct domain *d)
 {
     unsigned int i;
 
@@ -1534,7 +1534,7 @@ int evtchn_destroy(struct domain *d)
 }
 
 
-void evtchn_destroy_final(struct domain *d)
+void evtchn_destroy(struct domain *d)
 {
     unsigned int i, j;
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 3f35c537b8..bb22eeca38 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -142,8 +142,8 @@ struct evtchn
 } __attribute__((aligned(64)));
 
 int  evtchn_init(struct domain *d, unsigned int max_port);
-int  evtchn_destroy(struct domain *d); /* from domain_kill */
-void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
+int  evtchn_teardown(struct domain *d);
+void evtchn_destroy(struct domain *d);
 
 struct waitqueue_vcpu;
 
-- 
2.11.0




* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-21 18:14 ` [PATCH 2/3] xen/domain: Introduce domain_teardown() Andrew Cooper
@ 2020-12-21 18:36   ` Julien Grall
  2020-12-21 18:45     ` Andrew Cooper
  2020-12-22 10:35   ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Julien Grall @ 2020-12-21 18:36 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk

Hi Andrew,

On 21/12/2020 18:14, Andrew Cooper wrote:
> There is no common equivelent of domain_reliquish_resources(), which has

s/equivelent/equivalent/

> caused various pieces of common cleanup to live in inappropriate
> places.
> 
> Perhaps most obviously, evtchn_destroy() is called for every continuation of
> domain_reliquish_resources(), which can easily be thousands of times.

s/reliquish/relinquish/

> 
> Create domain_teardown() to be a new top level facility, and call it from the
> appropriate positions in domain_kill() and domain_create()'s error path.
> 
> No change in behaviour yet.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> ---
>   xen/common/domain.c     | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/sched.h |  8 +++++++
>   2 files changed, 67 insertions(+)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index ce3667f1b4..ef1987335b 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -273,6 +273,59 @@ static int __init parse_extra_guest_irqs(const char *s)
>   custom_param("extra_guest_irqs", parse_extra_guest_irqs);
>   
>   /*
> + * Release resources held by a domain.  There may or may not be live
> + * references to the domain, and it may or may not be fully constructed.
> + *
> + * d->is_dying differing between DOMDYING_dying and DOMDYING_dead can be used
> + * to determine if live references to the domain exist, and also whether
> + * continuations are permitted.
> + *
> + * If d->is_dying is DOMDYING_dead, this must not return non-zero.
> + */
> +static int domain_teardown(struct domain *d)
> +{
> +    BUG_ON(!d->is_dying);
> +
> +    /*
> +     * This hypercall can take minutes of wallclock time to complete.  This
> +     * logic implements a co-routine, stashing state in struct domain across
> +     * hypercall continuation boundaries.
> +     */
> +    switch ( d->teardown.val )
> +    {
> +        /*
> +         * Record the current progress.  Subsequent hypercall continuations
> +         * will logically restart work from this point.
> +         *
> +         * PROGRESS() markers must not be in the middle of loops.  The loop
> +         * variable isn't preserved across a continuation.
> +         *
> +         * To avoid redundant work, there should be a marker before each
> +         * function which may return -ERESTART.
> +         */
> +#define PROGRESS(x)                             \
> +        d->teardown.val = PROG_ ## x;           \
> +        /* Fallthrough */                       \
> +    case PROG_ ## x
> +
> +        enum {
> +            PROG_done = 1,
> +        };
> +
> +    case 0:
> +    PROGRESS(done):
> +        break;
> +
> +#undef PROGRESS
> +
> +    default:
> +        BUG();
> +    }
> +
> +    return 0;
> +}
> +
> +/*
>    * Destroy a domain once all references to it have been dropped.  Used either
>    * from the RCU path, or from the domain_create() error path before the domain
>    * is inserted into the domlist.
> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>       if ( init_status & INIT_watchdog )
>           watchdog_domain_destroy(d);
>   
> +    /* Must not hit a continuation in this context. */
> +    ASSERT(domain_teardown(d) == 0);
The ASSERT() will become a NOP in production build, so 
domain_teardown() will not be called.

However, I think it would be better if we pass an extra argument to 
indicate whether the code is allowed to preempt. This would make the 
preemption check more obvious in evtchn_destroy() compared to the current 
d->is_dying != DOMDYING_dead.
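
Something along these lines (only a sketch, names and plumbing to be 
discussed):

    static int domain_teardown(struct domain *d, bool may_preempt)
    {
        ...
    }

    /* domain_kill(): live references exist, continuations are fine. */
    rc = domain_teardown(d, true);

    /* domain_create() error path: must not continue. */
    rc = domain_teardown(d, false);

The flag, rather than d->is_dying, would then be checked wherever we may 
want to return -ERESTART.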

> +
>       _domain_destroy(d);
>   
>       return ERR_PTR(err);
> @@ -733,6 +789,9 @@ int domain_kill(struct domain *d)
>           domain_set_outstanding_pages(d, 0);
>           /* fallthrough */
>       case DOMDYING_dying:
> +        rc = domain_teardown(d);
> +        if ( rc )
> +            break;
>           rc = evtchn_destroy(d);
>           if ( rc )
>               break;
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index faf5fda36f..3f35c537b8 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -525,6 +525,14 @@ struct domain
>       /* Argo interdomain communication support */
>       struct argo_domain *argo;
>   #endif
> +
> +    /*
> +     * Continuation information for domain_teardown().  All fields entirely
> +     * private.
> +     */
> +    struct {
> +        unsigned int val;
> +    } teardown;
>   };
>   
>   static inline struct page_list_head *page_to_list(
> 

Cheers,

-- 
Julien Grall



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-21 18:36   ` Julien Grall
@ 2020-12-21 18:45     ` Andrew Cooper
  2020-12-22  7:50       ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 18:45 UTC (permalink / raw)
  To: Julien Grall, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk

On 21/12/2020 18:36, Julien Grall wrote:
>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>       if ( init_status & INIT_watchdog )
>>           watchdog_domain_destroy(d);
>>   +    /* Must not hit a continuation in this context. */
>> +    ASSERT(domain_teardown(d) == 0);
> The ASSERT() will become a NOP in production build, so
> domain_teardown_down() will not be called.

Urgh - it's not really a nop, but its evaluation isn't symmetric between
debug and release builds.  I'll need an extra local variable.
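
I.e. something along the lines of (sketch):

    int rc = domain_teardown(d);

    /* Must not hit a continuation in this context. */
    ASSERT(rc == 0);

so the call is always evaluated, and only the check is dropped in release
builds.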

>
> However, I think it would be better if we pass an extra argument to
> indicated wheter the code is allowed to preempt. This would make the
> preemption check more obvious in evtchn_destroy() compare to the
> current d->is_dying != DOMDYING_dead.

We can have a predicate if you'd prefer, but plumbing an extra parameter
is wasteful, and can only cause confusion if it is out of sync with
d->is_dying.
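
E.g. (name entirely up for debate):

    /* May teardown operations preempt, or must they run to completion? */
    static inline bool domain_teardown_may_preempt(const struct domain *d)
    {
        return d->is_dying != DOMDYING_dead;
    }

which keeps d->is_dying as the single source of truth.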

~Andrew



* Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-21 18:14 [PATCH 0/3] xen/domain: More structured teardown Andrew Cooper
                   ` (2 preceding siblings ...)
  2020-12-21 18:14 ` [PATCH 3/3] xen/evtchn: Clean up teardown handling Andrew Cooper
@ 2020-12-21 19:36 ` Andrew Cooper
  2020-12-22 10:00   ` Jan Beulich
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-21 19:36 UTC (permalink / raw)
  To: Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross

Hello,

We have some very complicated hypercalls - createdomain, with max_vcpus a
close second - with immense complexity and very hard-to-test error handling.

It is no surprise that the error handling is riddled with bugs.

Random failures from core functions is one way, but I'm not sure that
will be especially helpful.  In particular, we'd need a way to exclude
"dom0 critical" operations so we've got a usable system to run testing on.

As an alternative, how about adding a fault_ttl field into the hypercall?

The exact paths taken in {domain,vcpu}_create() are sensitive to the
hardware, Xen Kconfig, and other parameters passed into the
hypercall(s).  The testing logic doesn't really want to care about what
failed; simply that the error was handled correctly.

So a test for this might look like:

cfg = { ... };
while ( xc_create_domain(xch, cfg) < 0 )
    cfg.fault_ttl++;


The pro's of this approach is that for a specific build of Xen on a
piece of hardware, it ought to check every failure path in
domain_create(), until the ttl finally gets higher than the number of
fail-able actions required to construct a domain.  Also, the test
doesn't need changing as the complexity of domain_create() changes.

The main con will most likely be the invasiveness of code in Xen, but
I suppose any fault injection is going to be invasive to a certain extent.
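
For concreteness, the Xen side could be little more than a counter copied
out of the hypercall (all names invented):

    /* Should this fallible action be forced to fail? */
    static bool inject_fault(struct domain *d)
    {
        return d->fault_ttl && !--d->fault_ttl;
    }

used as e.g.:

    err = -ENOMEM;
    if ( inject_fault(d) || !zalloc_cpumask_var(&d->dirty_cpumask) )
        goto fail;

sprinkled across the fail-able steps of {domain,vcpu}_create().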

Fault injection like this would also want pairing with some other plans
I had for dalloc() & friends which wrap the current allocators, and
count (in struct domain) the number and/or size of domain allocations,
so we can a) check for leaks, and b) report how much memory a domain
object (and all its descendant objects) actually takes (seeing as we
don't know this value at all).

Thoughts?

~Andrew



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-21 18:45     ` Andrew Cooper
@ 2020-12-22  7:50       ` Jan Beulich
  2020-12-22 10:25         ` Julien Grall
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22  7:50 UTC (permalink / raw)
  To: Andrew Cooper, Julien Grall
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Xen-devel

On 21.12.2020 19:45, Andrew Cooper wrote:
> On 21/12/2020 18:36, Julien Grall wrote:
>>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>>       if ( init_status & INIT_watchdog )
>>>           watchdog_domain_destroy(d);
>>>   +    /* Must not hit a continuation in this context. */
>>> +    ASSERT(domain_teardown(d) == 0);
>> The ASSERT() will become a NOP in production build, so
>> domain_teardown_down() will not be called.
> 
> Urgh - its not really a nop, but it's evaluation isn't symmetric between
> debug and release builds.  I'll need an extra local variable.

Or use ASSERT_UNREACHABLE(). (I admit I don't really like the
resulting constructs, and would like to propose an alternative,
even if I fear it'll be controversial.)

>> However, I think it would be better if we pass an extra argument to
>> indicated wheter the code is allowed to preempt. This would make the
>> preemption check more obvious in evtchn_destroy() compare to the
>> current d->is_dying != DOMDYING_dead.
> 
> We can have a predicate if you'd prefer, but plumbing an extra parameter
> is wasteful, and can only cause confusion if it is out of sync with
> d->is_dying.

I agree here - it wasn't so long ago that event_channel.c gained
a DOMDYING_dead check, and I don't see why we shouldn't extend
this approach to here and elsewhere.

Jan



* Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-21 19:36 ` Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown) Andrew Cooper
@ 2020-12-22 10:00   ` Jan Beulich
  2020-12-22 11:14     ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:00 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross, Xen-devel

On 21.12.2020 20:36, Andrew Cooper wrote:
> Hello,
> 
> We have some very complicated hypercalls, createdomain, and max_vcpus a
> close second, with immense complexity, and very hard-to-test error handling.
> 
> It is no surprise that the error handling is riddled with bugs.
> 
> Random failures from core functions is one way, but I'm not sure that
> will be especially helpful.  In particular, we'd need a way to exclude
> "dom0 critical" operations so we've got a usable system to run testing on.
> 
> As an alternative, how about adding a fault_ttl field into the hypercall?
> 
> The exact paths taken in {domain,vcpu}_create() are sensitive to the
> hardware, Xen Kconfig, and other parameters passed into the
> hypercall(s).  The testing logic doesn't really want to care about what
> failed; simply that the error was handled correctly.
> 
> So a test for this might look like:
> 
> cfg = { ... };
> while ( xc_create_domain(xch, cfg) < 0 )
>     cfg.fault_ttl++;
> 
> 
> The pro's of this approach is that for a specific build of Xen on a
> piece of hardware, it ought to check every failure path in
> domain_create(), until the ttl finally gets higher than the number of
> fail-able actions required to construct a domain.  Also, the test
> doesn't need changing as the complexity of domain_create() changes.
> 
> The main con will mostly likely be the invasiveness of code in Xen, but
> I suppose any fault injection is going to be invasive to a certain extent.

While I like the idea in principle, the innocent looking

cfg = { ... };

is quite a bit of a concern here as well: Depending on the precise
settings, paths taken in the hypervisor may heavily vary, and hence
such a test will only end up being useful if it covers a wide
variety of settings. Even if the number of tests to execute turned
out to still be manageable today, it may quickly turn out not
sufficiently scalable as we add new settings controllable right at
domain creation (which I understand is the plan).

Jan



* Re: [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create()
  2020-12-21 18:14 ` [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create() Andrew Cooper
@ 2020-12-22 10:10   ` Jan Beulich
  2020-12-22 10:24     ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 21.12.2020 19:14, Andrew Cooper wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -391,25 +391,7 @@ struct domain *domain_create(domid_t domid,
>  
>      TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>  
> -    /*
> -     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
> -     * resources want to be sized based on max_vcpus.
> -     */
> -    if ( !is_system_domain(d) )
> -    {
> -        err = -ENOMEM;
> -        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
> -        if ( !d->vcpu )
> -            goto fail;
> -
> -        d->max_vcpus = config->max_vcpus;
> -    }
> -
> -    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);

Wouldn't this also count as "trivial initialization", and hence while
moving want to at least be placed ...

> -    if ( (err = xsm_alloc_security_domain(d)) != 0 )
> -        goto fail;
> -
> +    /* Trivial initialisation. */
>      atomic_set(&d->refcnt, 1);
>      RCU_READ_LOCK_INIT(&d->rcu_lock);
>      spin_lock_init_prof(d, domain_lock);
> @@ -434,6 +416,27 @@ struct domain *domain_create(domid_t domid,
>      INIT_LIST_HEAD(&d->pdev_list);
>  #endif
>  
> +    /* All error paths can depend on the above setup. */

... ahead of this comment?

> +    /*
> +     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
> +     * resources want to be sized based on max_vcpus.
> +     */
> +    if ( !is_system_domain(d) )
> +    {
> +        err = -ENOMEM;
> +        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
> +        if ( !d->vcpu )
> +            goto fail;
> +
> +        d->max_vcpus = config->max_vcpus;
> +    }
> +
> +    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
> +
> +    if ( (err = xsm_alloc_security_domain(d)) != 0 )
> +        goto fail;
> +
>      err = -ENOMEM;
>      if ( !zalloc_cpumask_var(&d->dirty_cpumask) )
>          goto fail;
> 

Just as an observation (i.e. unlikely for this patch) I doubt
system domains need ->dirty_cpumask set up, and hence this
allocation may also want moving down a few lines.
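
(I.e. eventually something like

    if ( !is_system_domain(d) )
    {
        err = -ENOMEM;
        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
        if ( !d->vcpu )
            goto fail;

        d->max_vcpus = config->max_vcpus;

        if ( !zalloc_cpumask_var(&d->dirty_cpumask) )
            goto fail;
    }

merely as a sketch, and as said not necessarily for this patch.)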

Jan



* Re: [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create()
  2020-12-22 10:10   ` Jan Beulich
@ 2020-12-22 10:24     ` Andrew Cooper
  2020-12-22 10:50       ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 10:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22/12/2020 10:10, Jan Beulich wrote:
> On 21.12.2020 19:14, Andrew Cooper wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -391,25 +391,7 @@ struct domain *domain_create(domid_t domid,
>>  
>>      TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>>  
>> -    /*
>> -     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
>> -     * resources want to be sized based on max_vcpus.
>> -     */
>> -    if ( !is_system_domain(d) )
>> -    {
>> -        err = -ENOMEM;
>> -        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
>> -        if ( !d->vcpu )
>> -            goto fail;
>> -
>> -        d->max_vcpus = config->max_vcpus;
>> -    }
>> -
>> -    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
> Wouldn't this also count as "trivial initialization", and hence while
> moving want to at least be placed ...
>
>> -    if ( (err = xsm_alloc_security_domain(d)) != 0 )
>> -        goto fail;
>> -
>> +    /* Trivial initialisation. */
>>      atomic_set(&d->refcnt, 1);
>>      RCU_READ_LOCK_INIT(&d->rcu_lock);
>>      spin_lock_init_prof(d, domain_lock);
>> @@ -434,6 +416,27 @@ struct domain *domain_create(domid_t domid,
>>      INIT_LIST_HEAD(&d->pdev_list);
>>  #endif
>>  
>> +    /* All error paths can depend on the above setup. */
> ... ahead of this comment?

Can do.

>
>> +    /*
>> +     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
>> +     * resources want to be sized based on max_vcpus.
>> +     */
>> +    if ( !is_system_domain(d) )
>> +    {
>> +        err = -ENOMEM;
>> +        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
>> +        if ( !d->vcpu )
>> +            goto fail;
>> +
>> +        d->max_vcpus = config->max_vcpus;
>> +    }
>> +
>> +    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
>> +
>> +    if ( (err = xsm_alloc_security_domain(d)) != 0 )
>> +        goto fail;
>> +
>>      err = -ENOMEM;
>>      if ( !zalloc_cpumask_var(&d->dirty_cpumask) )
>>          goto fail;
>>
> Just as an observation (i.e. unlikely for this patch) I doubt
> system domains need ->dirty_cpumask set up, and hence this
> allocation may also want moving down a few lines.

I agree in principle.  However, something does (or at least did)
reference this for system domains when I did the is_system_domain() cleanup.

The fix might not be as trivial as just moving the allocation.

~Andrew



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22  7:50       ` Jan Beulich
@ 2020-12-22 10:25         ` Julien Grall
  2020-12-22 10:53           ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Julien Grall @ 2020-12-22 10:25 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Xen-devel

Hi,

On 22/12/2020 07:50, Jan Beulich wrote:
> On 21.12.2020 19:45, Andrew Cooper wrote:
>> On 21/12/2020 18:36, Julien Grall wrote:
>>>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>>>        if ( init_status & INIT_watchdog )
>>>>            watchdog_domain_destroy(d);
>>>>    +    /* Must not hit a continuation in this context. */
>>>> +    ASSERT(domain_teardown(d) == 0);
>>> The ASSERT() will become a NOP in production build, so
>>> domain_teardown_down() will not be called.
>>
>> Urgh - its not really a nop, but it's evaluation isn't symmetric between
>> debug and release builds.  I'll need an extra local variable.
> 
> Or use ASSERT_UNREACHABLE(). (I admit I don't really like the
> resulting constructs, and would like to propose an alternative,
> even if I fear it'll be controversial.)
> 
>>> However, I think it would be better if we pass an extra argument to
>>> indicated wheter the code is allowed to preempt. This would make the
>>> preemption check more obvious in evtchn_destroy() compare to the
>>> current d->is_dying != DOMDYING_dead.
>>
>> We can have a predicate if you'd prefer, but plumbing an extra parameter
>> is wasteful, and can only cause confusion if it is out of sync with
>> d->is_dying.
> 
> I agree here - it wasn't so long ago that event_channel.c gained
> a DOMDYING_dead check, and I don't see why we shouldn't extend
> this approach to here and elsewhere.

I think the d->is_dying != DOMDYING_dead check is difficult to understand even 
with the comment on top. This was ok in one place, but now it will 
spread everywhere. So at the least, I would suggest introducing a wrapper 
that is better named.

There is also a futureproofing concern. At the moment, we are assuming that 
preemption will not be needed in domain_create(). I am ready to bet 
that the assumption is going to be broken sooner or later.

As we don't have an extra parameter, we would need to investigate all 
the callers in helpers to see where we need to drop the d->is_dying != 
DOMDYING_dead check. I am sure you will agree that there is a risk of screwing up.

With an extra parameter, it is easier to re-enable preemption when needed.

Anyway, both of you seem to agree on the approach here. If that's what 
you want, then so be it, but that's not something I will feel confident 
to Ack. Although, I will not Nack it.

Cheers,

-- 
Julien Grall



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-21 18:14 ` [PATCH 2/3] xen/domain: Introduce domain_teardown() Andrew Cooper
  2020-12-21 18:36   ` Julien Grall
@ 2020-12-22 10:35   ` Jan Beulich
  2020-12-22 11:46     ` Andrew Cooper
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:35 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 21.12.2020 19:14, Andrew Cooper wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -273,6 +273,59 @@ static int __init parse_extra_guest_irqs(const char *s)
>  custom_param("extra_guest_irqs", parse_extra_guest_irqs);
>  
>  /*
> + * Release resources held by a domain.  There may or may not be live
> + * references to the domain, and it may or may not be fully constructed.
> + *
> + * d->is_dying differing between DOMDYING_dying and DOMDYING_dead can be used
> + * to determine if live references to the domain exist, and also whether
> + * continuations are permitted.
> + *
> + * If d->is_dying is DOMDYING_dead, this must not return non-zero.
> + */
> +static int domain_teardown(struct domain *d)
> +{
> +    BUG_ON(!d->is_dying);
> +
> +    /*
> +     * This hypercall can take minutes of wallclock time to complete.  This
> +     * logic implements a co-routine, stashing state in struct domain across
> +     * hypercall continuation boundaries.
> +     */
> +    switch ( d->teardown.val )
> +    {
> +        /*
> +         * Record the current progress.  Subsequent hypercall continuations
> +         * will logically restart work from this point.
> +         *
> +         * PROGRESS() markers must not be in the middle of loops.  The loop
> +         * variable isn't preserved across a continuation.
> +         *
> +         * To avoid redundant work, there should be a marker before each
> +         * function which may return -ERESTART.
> +         */
> +#define PROGRESS(x)                             \
> +        d->teardown.val = PROG_ ## x;           \
> +        /* Fallthrough */                       \
> +    case PROG_ ## x
> +
> +        enum {
> +            PROG_done = 1,
> +        };
> +
> +    case 0:
> +    PROGRESS(done):
> +        break;
> +
> +#undef PROGRESS
> +
> +    default:
> +        BUG();
> +    }
> +
> +    return 0;
> +}

While as an initial step this may be okay, I envision ordering issues
with domain_relinquish_resources() - there may be per-arch things that
want doing before certain common ones, and others which should only
be done when certain common teardown was already performed. Therefore
rather than this being a "common equivalent of
domain_relinquish_resources()", I'd rather see (and call) it a
(designated) replacement, where individual per-arch functions would
get called at appropriate times.
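
To illustrate, with entirely made up hook names and with the PROG_*
enumeration extended accordingly, I'd expect the switch() to eventually
gain markers along the lines of

    case 0:
        rc = arch_domain_teardown_early(d);
        if ( rc )
            return rc;

    PROGRESS(arch_early):
        rc = evtchn_teardown(d);
        if ( rc )
            return rc;

    PROGRESS(evtchn):
        rc = arch_domain_teardown_late(d);
        if ( rc )
            return rc;

    PROGRESS(done):
        break;

such that per-arch steps can be interleaved with the common ones as needed.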

Jan



* Re: [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-21 18:14 ` [PATCH 3/3] xen/evtchn: Clean up teardown handling Andrew Cooper
@ 2020-12-22 10:48   ` Jan Beulich
  2020-12-22 11:28     ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:48 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 21.12.2020 19:14, Andrew Cooper wrote:
> First of all, rename the evtchn APIs:
>  * evtchn_destroy       => evtchn_teardown
>  * evtchn_destroy_final => evtchn_destroy

I wonder to what extent this is going to cause confusion with backports
down the road. May I suggest doing only the first of the two renames,
at least until a couple of years from now? Or making the second rename
to e.g. evtchn_cleanup() or evtchn_deinit()?

> RFC.  While testing this, I observed this, after faking up an -ENOMEM in
> dom0's construction:
> 
>   (XEN) [2020-12-21 16:31:20] NX (Execute Disable) protection active
>   (XEN) [2020-12-21 16:33:04]
>   (XEN) [2020-12-21 16:33:04] ****************************************
>   (XEN) [2020-12-21 16:33:04] Panic on CPU 0:
>   (XEN) [2020-12-21 16:33:04] Error creating domain 0
>   (XEN) [2020-12-21 16:33:04] ****************************************
> 
> XSA-344 appears to have added nearly 2 minutes of wallclock time into the
> domain_create() error path, which isn't ok.
> 
> Considering that event channels haven't even been initialised in this
> particular scenario, it ought to take ~0 time.  Even if event channels have
> been initalised, none can be active as the domain isn't visible to the system.

evtchn_init() sets d->valid_evtchns to EVTCHNS_PER_BUCKET. Are you
suggesting cleaning up one bucket's worth of unused event channels
takes two minutes? If this is really the case, and considering there
could at most be unbound Xen channels, perhaps we could avoid
calling evtchn_teardown() from domain_create()'s error path? We'd
need to take care of the then missing accounting (->active_evtchns
and ->xen_evtchns), but this should be doable.

Jan



* Re: [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create()
  2020-12-22 10:24     ` Andrew Cooper
@ 2020-12-22 10:50       ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:50 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22.12.2020 11:24, Andrew Cooper wrote:
> On 22/12/2020 10:10, Jan Beulich wrote:
>> On 21.12.2020 19:14, Andrew Cooper wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -391,25 +391,7 @@ struct domain *domain_create(domid_t domid,
>>>  
>>>      TRACE_1D(TRC_DOM0_DOM_ADD, d->domain_id);
>>>  
>>> -    /*
>>> -     * Allocate d->vcpu[] and set ->max_vcpus up early.  Various per-domain
>>> -     * resources want to be sized based on max_vcpus.
>>> -     */
>>> -    if ( !is_system_domain(d) )
>>> -    {
>>> -        err = -ENOMEM;
>>> -        d->vcpu = xzalloc_array(struct vcpu *, config->max_vcpus);
>>> -        if ( !d->vcpu )
>>> -            goto fail;
>>> -
>>> -        d->max_vcpus = config->max_vcpus;
>>> -    }
>>> -
>>> -    lock_profile_register_struct(LOCKPROF_TYPE_PERDOM, d, domid);
>> Wouldn't this also count as "trivial initialization", and hence while
>> moving want to at least be placed ...
>>
>>> -    if ( (err = xsm_alloc_security_domain(d)) != 0 )
>>> -        goto fail;
>>> -
>>> +    /* Trivial initialisation. */
>>>      atomic_set(&d->refcnt, 1);
>>>      RCU_READ_LOCK_INIT(&d->rcu_lock);
>>>      spin_lock_init_prof(d, domain_lock);
>>> @@ -434,6 +416,27 @@ struct domain *domain_create(domid_t domid,
>>>      INIT_LIST_HEAD(&d->pdev_list);
>>>  #endif
>>>  
>>> +    /* All error paths can depend on the above setup. */
>> ... ahead of this comment?
> 
> Can do.

At which point

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22 10:25         ` Julien Grall
@ 2020-12-22 10:53           ` Jan Beulich
  2020-12-22 11:05             ` Julien Grall
  2020-12-22 11:11             ` Andrew Cooper
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 10:53 UTC (permalink / raw)
  To: Julien Grall, Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Xen-devel

On 22.12.2020 11:25, Julien Grall wrote:
> On 22/12/2020 07:50, Jan Beulich wrote:
>> On 21.12.2020 19:45, Andrew Cooper wrote:
>>> On 21/12/2020 18:36, Julien Grall wrote:
>>>>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>>>>        if ( init_status & INIT_watchdog )
>>>>>            watchdog_domain_destroy(d);
>>>>>    +    /* Must not hit a continuation in this context. */
>>>>> +    ASSERT(domain_teardown(d) == 0);
>>>> The ASSERT() will become a NOP in production build, so
>>>> domain_teardown_down() will not be called.
>>>
>>> Urgh - its not really a nop, but it's evaluation isn't symmetric between
>>> debug and release builds.  I'll need an extra local variable.
>>
>> Or use ASSERT_UNREACHABLE(). (I admit I don't really like the
>> resulting constructs, and would like to propose an alternative,
>> even if I fear it'll be controversial.)
>>
>>>> However, I think it would be better if we pass an extra argument to
>>>> indicated wheter the code is allowed to preempt. This would make the
>>>> preemption check more obvious in evtchn_destroy() compare to the
>>>> current d->is_dying != DOMDYING_dead.
>>>
>>> We can have a predicate if you'd prefer, but plumbing an extra parameter
>>> is wasteful, and can only cause confusion if it is out of sync with
>>> d->is_dying.
>>
>> I agree here - it wasn't so long ago that event_channel.c gained
>> a DOMDYING_dead check, and I don't see why we shouldn't extend
>> this approach to here and elsewhere.
> 
> I think the d->is_dying != DOMYING_dead is difficult to understand even 
> with the comment on top. This was ok in one place, but now it will 
> spread everywhere. So at least, I would suggest to introduce a wrapper 
> that is better named.
> 
> There is also a futureproof concern. At the moment, we are considering 
> the preemption will not be needed in domain_create(). I am ready to bet 
> that the assumption is going to be broken sooner or later.

This is a fair consideration, yet I'm having trouble seeing what it
might be that would cause domain_create() to require preemption.
The function is supposed to only produce an empty container. But yes,
if e.g. vCPU creation was to move here, the situation would indeed
change.

Jan



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22 10:53           ` Jan Beulich
@ 2020-12-22 11:05             ` Julien Grall
  2020-12-22 11:11             ` Andrew Cooper
  1 sibling, 0 replies; 28+ messages in thread
From: Julien Grall @ 2020-12-22 11:05 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Xen-devel

Hi Jan,

On 22/12/2020 10:53, Jan Beulich wrote:
> On 22.12.2020 11:25, Julien Grall wrote:
>> On 22/12/2020 07:50, Jan Beulich wrote:
>>> On 21.12.2020 19:45, Andrew Cooper wrote:
>>>> On 21/12/2020 18:36, Julien Grall wrote:
>>>>>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>>>>>         if ( init_status & INIT_watchdog )
>>>>>>             watchdog_domain_destroy(d);
>>>>>>     +    /* Must not hit a continuation in this context. */
>>>>>> +    ASSERT(domain_teardown(d) == 0);
>>>>> The ASSERT() will become a NOP in production build, so
>>>>> domain_teardown_down() will not be called.
>>>>
>>>> Urgh - its not really a nop, but it's evaluation isn't symmetric between
>>>> debug and release builds.  I'll need an extra local variable.
>>>
>>> Or use ASSERT_UNREACHABLE(). (I admit I don't really like the
>>> resulting constructs, and would like to propose an alternative,
>>> even if I fear it'll be controversial.)
>>>
>>>>> However, I think it would be better if we pass an extra argument to
>>>>> indicated wheter the code is allowed to preempt. This would make the
>>>>> preemption check more obvious in evtchn_destroy() compare to the
>>>>> current d->is_dying != DOMDYING_dead.
>>>>
>>>> We can have a predicate if you'd prefer, but plumbing an extra parameter
>>>> is wasteful, and can only cause confusion if it is out of sync with
>>>> d->is_dying.
>>>
>>> I agree here - it wasn't so long ago that event_channel.c gained
>>> a DOMDYING_dead check, and I don't see why we shouldn't extend
>>> this approach to here and elsewhere.
>>
>> I think the d->is_dying != DOMYING_dead is difficult to understand even
>> with the comment on top. This was ok in one place, but now it will
>> spread everywhere. So at least, I would suggest to introduce a wrapper
>> that is better named.
>>
>> There is also a futureproof concern. At the moment, we are considering
>> the preemption will not be needed in domain_create(). I am ready to bet
>> that the assumption is going to be broken sooner or later.
> 
> This is a fair consideration, yet I'm having trouble seeing what it
> might be that would cause domain_create() to require preemption.
> The function is supposed to only produce an empty container. But yes,
> if e.g. vCPU creation was to move here, the situation would indeed
> change.
We are not really creating an "empty container". There is at least one 
initial pool of memory allocated for HAP (256 pages) which needs to be 
freed if we fail to create the domain.

There is a second pool in discussion for the IOMMU (see [1]) which will 
also initially allocate 256 pages.

To me it looks like domain_create() is getting quite big and preemption 
is going to be required sooner or later.

Cheers,

[1] <20201005094905.2929-6-paul@xen.org>

> 
> Jan
> 

-- 
Julien Grall



* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22 10:53           ` Jan Beulich
  2020-12-22 11:05             ` Julien Grall
@ 2020-12-22 11:11             ` Andrew Cooper
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 11:11 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Volodymyr Babchuk, Xen-devel

On 22/12/2020 10:53, Jan Beulich wrote:
> On 22.12.2020 11:25, Julien Grall wrote:
>> On 22/12/2020 07:50, Jan Beulich wrote:
>>> On 21.12.2020 19:45, Andrew Cooper wrote:
>>>> On 21/12/2020 18:36, Julien Grall wrote:
>>>>>> @@ -553,6 +606,9 @@ struct domain *domain_create(domid_t domid,
>>>>>>        if ( init_status & INIT_watchdog )
>>>>>>            watchdog_domain_destroy(d);
>>>>>>    +    /* Must not hit a continuation in this context. */
>>>>>> +    ASSERT(domain_teardown(d) == 0);
>>>>> The ASSERT() will become a NOP in production build, so
>>>>> domain_teardown_down() will not be called.
>>>> Urgh - its not really a nop, but it's evaluation isn't symmetric between
>>>> debug and release builds.  I'll need an extra local variable.
>>> Or use ASSERT_UNREACHABLE(). (I admit I don't really like the
>>> resulting constructs, and would like to propose an alternative,
>>> even if I fear it'll be controversial.)
>>>
>>>>> However, I think it would be better if we pass an extra argument to
>>>>> indicated wheter the code is allowed to preempt. This would make the
>>>>> preemption check more obvious in evtchn_destroy() compare to the
>>>>> current d->is_dying != DOMDYING_dead.
>>>> We can have a predicate if you'd prefer, but plumbing an extra parameter
>>>> is wasteful, and can only cause confusion if it is out of sync with
>>>> d->is_dying.
>>> I agree here - it wasn't so long ago that event_channel.c gained
>>> a DOMDYING_dead check, and I don't see why we shouldn't extend
>>> this approach to here and elsewhere.
>> I think the d->is_dying != DOMYING_dead is difficult to understand even 
>> with the comment on top. This was ok in one place, but now it will 
>> spread everywhere. So at least, I would suggest to introduce a wrapper 
>> that is better named.
>>
>> There is also a futureproof concern. At the moment, we are considering 
>> the preemption will not be needed in domain_create(). I am ready to bet 
>> that the assumption is going to be broken sooner or later.
> This is a fair consideration, yet I'm having trouble seeing what it
> might be that would cause domain_create() to require preemption.
> The function is supposed to only produce an empty container. But yes,
> if e.g. vCPU creation was to move here, the situation would indeed
> change.

As discussed, I no longer think that is a good plan, especially if we
want a sane mechanism for not allocating AMX memory for domains not
configured to use AMX.

domain_create() (and vcpu_create() to a lesser extent) are the functions
which can't become preemptible, because they are the allocation and
setup of the objects which would be used to store continuation
information for other hypercalls.

The only option if these get too complicated is to split the complexity
out into other hypercalls.

~Andrew



* Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-22 10:00   ` Jan Beulich
@ 2020-12-22 11:14     ` Andrew Cooper
  2020-12-22 15:47       ` Tamas K Lengyel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 11:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross, Xen-devel, Tamas K Lengyel

On 22/12/2020 10:00, Jan Beulich wrote:
> On 21.12.2020 20:36, Andrew Cooper wrote:
>> Hello,
>>
>> We have some very complicated hypercalls, createdomain, and max_vcpus a
>> close second, with immense complexity, and very hard-to-test error handling.
>>
>> It is no surprise that the error handling is riddled with bugs.
>>
>> Random failures from core functions is one way, but I'm not sure that
>> will be especially helpful.  In particular, we'd need a way to exclude
>> "dom0 critical" operations so we've got a usable system to run testing on.
>>
>> As an alternative, how about adding a fault_ttl field into the hypercall?
>>
>> The exact paths taken in {domain,vcpu}_create() are sensitive to the
>> hardware, Xen Kconfig, and other parameters passed into the
>> hypercall(s).  The testing logic doesn't really want to care about what
>> failed; simply that the error was handled correctly.
>>
>> So a test for this might look like:
>>
>> cfg = { ... };
>> while ( xc_create_domain(xch, cfg) < 0 )
>>     cfg.fault_ttl++;
>>
>>
>> The pro's of this approach is that for a specific build of Xen on a
>> piece of hardware, it ought to check every failure path in
>> domain_create(), until the ttl finally gets higher than the number of
>> fail-able actions required to construct a domain.  Also, the test
>> doesn't need changing as the complexity of domain_create() changes.
>>
>> The main con will mostly likely be the invasiveness of code in Xen, but
>> I suppose any fault injection is going to be invasive to a certain extent.
> While I like the idea in principle, the innocent looking
>
> cfg = { ... };
>
> is quite a bit of a concern here as well: Depending on the precise
> settings, paths taken in the hypervisor may heavily vary, and hence
> such a test will only end up being useful if it covers a wide
> variety of settings. Even if the number of tests to execute turned
> out to still be manageable today, it may quickly turn out not
> sufficiently scalable as we add new settings controllable right at
> domain creation (which I understand is the plan).

Well - there are two aspects here.

First, 99% of all VMs in practice are one of 3 or 4 configurations.  An
individual configuration is O(n) time complexity to test with fault_ttl,
depending on the size of Xen's logic, and we absolutely want to be able
to test these deterministically and to completion.

For the plethora of other configurations, I agree that it is infeasible
to test them all.  However, a hypercall like this is easy to wire up
into a fuzzing harness.

TBH, I was thinking of something like
https://github.com/intel/kernel-fuzzer-for-xen-project with a PVH Xen
and XTF "dom0" poking just this hypercall.  All the other complicated
bits of wiring AFL up appear to have been done.

Perhaps when we exhaust that as a source of bugs, we move on to fuzzing
the L0 Xen, because running on native will give it more paths to
explore.  We'd need some way of reporting path/trace data back to AFL in
dom0, which might require a bit of plumbing.

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-22 10:48   ` Jan Beulich
@ 2020-12-22 11:28     ` Andrew Cooper
  2020-12-22 11:52       ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 11:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22/12/2020 10:48, Jan Beulich wrote:
> On 21.12.2020 19:14, Andrew Cooper wrote:
>> First of all, rename the evtchn APIs:
>>  * evtchn_destroy       => evtchn_teardown
>>  * evtchn_destroy_final => evtchn_destroy
> I wonder in how far this is going to cause confusion with backports
> down the road. May I suggest to do only the first of the two renames,
> at least until in a couple of year's time? Or make the second rename
> to e.g. evtchn_cleanup() or evtchn_deinit()?

I considered backports, but I don't think it will be an issue.  The
contents of the two functions are very different, and we're not likely
to be moving the callers in backports.

I'm not fussed about the exact naming, so long as we can make an
agreement and adhere to it strictly.  The current APIs are a total mess.

I used teardown/destroy because that seems to be one common theme in the
APIs, but it will require some functions to change their names.

>> RFC.  While testing this, I observed this, after faking up an -ENOMEM in
>> dom0's construction:
>>
>>   (XEN) [2020-12-21 16:31:20] NX (Execute Disable) protection active
>>   (XEN) [2020-12-21 16:33:04]
>>   (XEN) [2020-12-21 16:33:04] ****************************************
>>   (XEN) [2020-12-21 16:33:04] Panic on CPU 0:
>>   (XEN) [2020-12-21 16:33:04] Error creating domain 0
>>   (XEN) [2020-12-21 16:33:04] ****************************************
>>
>> XSA-344 appears to have added nearly 2 minutes of wallclock time into the
>> domain_create() error path, which isn't ok.
>>
>> Considering that event channels haven't even been initialised in this
>> particular scenario, it ought to take ~0 time.  Even if event channels have
>> been initalised, none can be active as the domain isn't visible to the system.
> evtchn_init() sets d->valid_evtchns to EVTCHNS_PER_BUCKET. Are you
> suggesting cleaning up one bucket's worth of unused event channels
> takes two minutes? If this is really the case, and considering there
> could at most be unbound Xen channels, perhaps we could avoid
> calling evtchn_teardown() from domain_create()'s error path? We'd
> need to take care of the then missing accounting (->active_evtchns
> and ->xen_evtchns), but this should be doable.

Actually, it's a bug in this patch.  evtchn_init() hasn't been called, so
d->valid_evtchns is 0, so the loop is 2^32 iterations long.  Luckily,
this is easy to fix.
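
For context, an illustrative sketch of the failure mode and one possible
guard.  This is not the actual evtchn teardown code; close_one() and the
loop shape are stand-ins:

    /* Illustrative only -- not the actual evtchn teardown loop. */
    static void teardown_buggy(struct domain *d)
    {
        unsigned int i = d->valid_evtchns;   /* 0: evtchn_init() never ran */

        while ( --i )                /* 0 - 1 wraps to 0xffffffff: ~2^32 laps */
            close_one(d, i);         /* close_one() is a stand-in helper */
    }

    static void teardown_fixed(struct domain *d)
    {
        unsigned int i;

        for ( i = d->valid_evtchns; i-- > 0; )   /* 0 iterations when 0 */
            close_one(d, i);
    }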

As for avoiding the call: specifically not.  Part of the problem we're
trying to fix is that we've got two different destruction paths, and
making domain_teardown() fully idempotent is key to fixing that.

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22 10:35   ` Jan Beulich
@ 2020-12-22 11:46     ` Andrew Cooper
  2020-12-22 11:55       ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 11:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22/12/2020 10:35, Jan Beulich wrote:
> On 21.12.2020 19:14, Andrew Cooper wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -273,6 +273,59 @@ static int __init parse_extra_guest_irqs(const char *s)
>>  custom_param("extra_guest_irqs", parse_extra_guest_irqs);
>>  
>>  /*
>> + * Release resources held by a domain.  There may or may not be live
>> + * references to the domain, and it may or may not be fully constructed.
>> + *
>> + * d->is_dying differing between DOMDYING_dying and DOMDYING_dead can be used
>> + * to determine if live references to the domain exist, and also whether
>> + * continuations are permitted.
>> + *
>> + * If d->is_dying is DOMDYING_dead, this must not return non-zero.
>> + */
>> +static int domain_teardown(struct domain *d)
>> +{
>> +    BUG_ON(!d->is_dying);
>> +
>> +    /*
>> +     * This hypercall can take minutes of wallclock time to complete.  This
>> +     * logic implements a co-routine, stashing state in struct domain across
>> +     * hypercall continuation boundaries.
>> +     */
>> +    switch ( d->teardown.val )
>> +    {
>> +        /*
>> +         * Record the current progress.  Subsequent hypercall continuations
>> +         * will logically restart work from this point.
>> +         *
>> +         * PROGRESS() markers must not be in the middle of loops.  The loop
>> +         * variable isn't preserved across a continuation.
>> +         *
>> +         * To avoid redundant work, there should be a marker before each
>> +         * function which may return -ERESTART.
>> +         */
>> +#define PROGRESS(x)                             \
>> +        d->teardown.val = PROG_ ## x;           \
>> +        /* Fallthrough */                       \
>> +    case PROG_ ## x
>> +
>> +        enum {
>> +            PROG_done = 1,
>> +        };
>> +
>> +    case 0:
>> +    PROGRESS(done):
>> +        break;
>> +
>> +#undef PROGRESS
>> +
>> +    default:
>> +        BUG();
>> +    }
>> +
>> +    return 0;
>> +}
> While as an initial step this may be okay, I envision ordering issues
> with domain_relinquish_resources() - there may be per-arch things that
> want doing before certain common ones, and others which should only
> be done when certain common teardown was already performed. Therefore
> rather than this being a "common equivalent of
> domain_relinquish_resources()", I'd rather see (and call) it a
> (designated) replacement, where individual per-arch functions would
> get called at appropriate times.

Over time, I do expect it to replace domain_relinquish_resources().  I'm
not sure if that will take the form of a specific arch_domain_teardown()
call (and reserving some space in teardown.val for arch use), or whether
it will be a load of possibly-CONFIG'd stubs.
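
As a purely illustrative sketch of the first option (the PROG_arch marker
and arch_domain_teardown() are assumptions, not part of the posted patch),
it might slot into domain_teardown()'s switch like so:

    /* Sketch only: a fragment following the posted PROGRESS() pattern. */
        enum {
            PROG_arch = 1,
            PROG_done,
        };

    case 0:
    PROGRESS(arch):
        rc = arch_domain_teardown(d);   /* hypothetical resumable arch hook */
        if ( rc )
            return rc;

    PROGRESS(done):
        break;

Equally, a range of teardown.val values above some threshold could be
reserved for arch-private progress markers instead.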

I do also have work in progress supporting per-vcpu continuations, to be
able to remove the TODOs introduced into the paging() path.  However,
there is a bit more plumbing work required than I have time for right
now, so this will have to wait a little until I've got some of the more
urgent 4.15 work done.

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-22 11:28     ` Andrew Cooper
@ 2020-12-22 11:52       ` Jan Beulich
  2020-12-22 13:33         ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 11:52 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22.12.2020 12:28, Andrew Cooper wrote:
> On 22/12/2020 10:48, Jan Beulich wrote:
>> On 21.12.2020 19:14, Andrew Cooper wrote:
>>> First of all, rename the evtchn APIs:
>>>  * evtchn_destroy       => evtchn_teardown
>>>  * evtchn_destroy_final => evtchn_destroy
>> I wonder in how far this is going to cause confusion with backports
>> down the road. May I suggest to do only the first of the two renames,
>> at least until in a couple of year's time? Or make the second rename
>> to e.g. evtchn_cleanup() or evtchn_deinit()?
> 
> I considered backports, but I don't think it will be an issue.  The
> contents of the two functions are very different, and we're not likely
> to be moving the callers in backports.

Does the same also apply to the old and new call sites of the functions?

> I'm not fussed about the exact naming, so long as we can make and
> agreement and adhere to it strictly.  The current APIs are a total mess.
> 
> I used teardown/destroy because that seems to be one common theme in the
> APIs, but it will require some to change their name.

So for domains, "teardown" and "destroy" pair up with "create". I don't
think evtchn_create() is a sensible name (the function doesn't really
"create" anything); evtchn_init() seems quite a bit better to me, and
hence evtchn_deinit() could be its counterpart. In particular I don't
think all smaller entity functions involved in doing "xyz" for a
larger entity need to have "xyz" in their names.

Jan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] xen/domain: Introduce domain_teardown()
  2020-12-22 11:46     ` Andrew Cooper
@ 2020-12-22 11:55       ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 11:55 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22.12.2020 12:46, Andrew Cooper wrote:
> On 22/12/2020 10:35, Jan Beulich wrote:
>> On 21.12.2020 19:14, Andrew Cooper wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -273,6 +273,59 @@ static int __init parse_extra_guest_irqs(const char *s)
>>>  custom_param("extra_guest_irqs", parse_extra_guest_irqs);
>>>  
>>>  /*
>>> + * Release resources held by a domain.  There may or may not be live
>>> + * references to the domain, and it may or may not be fully constructed.
>>> + *
>>> + * d->is_dying differing between DOMDYING_dying and DOMDYING_dead can be used
>>> + * to determine if live references to the domain exist, and also whether
>>> + * continuations are permitted.
>>> + *
>>> + * If d->is_dying is DOMDYING_dead, this must not return non-zero.
>>> + */
>>> +static int domain_teardown(struct domain *d)
>>> +{
>>> +    BUG_ON(!d->is_dying);
>>> +
>>> +    /*
>>> +     * This hypercall can take minutes of wallclock time to complete.  This
>>> +     * logic implements a co-routine, stashing state in struct domain across
>>> +     * hypercall continuation boundaries.
>>> +     */
>>> +    switch ( d->teardown.val )
>>> +    {
>>> +        /*
>>> +         * Record the current progress.  Subsequent hypercall continuations
>>> +         * will logically restart work from this point.
>>> +         *
>>> +         * PROGRESS() markers must not be in the middle of loops.  The loop
>>> +         * variable isn't preserved across a continuation.
>>> +         *
>>> +         * To avoid redundant work, there should be a marker before each
>>> +         * function which may return -ERESTART.
>>> +         */
>>> +#define PROGRESS(x)                             \
>>> +        d->teardown.val = PROG_ ## x;           \
>>> +        /* Fallthrough */                       \
>>> +    case PROG_ ## x
>>> +
>>> +        enum {
>>> +            PROG_done = 1,
>>> +        };
>>> +
>>> +    case 0:
>>> +    PROGRESS(done):
>>> +        break;
>>> +
>>> +#undef PROGRESS
>>> +
>>> +    default:
>>> +        BUG();
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> While as an initial step this may be okay, I envision ordering issues
>> with domain_relinquish_resources() - there may be per-arch things that
>> want doing before certain common ones, and others which should only
>> be done when certain common teardown was already performed. Therefore
>> rather than this being a "common equivalent of
>> domain_relinquish_resources()", I'd rather see (and call) it a
>> (designated) replacement, where individual per-arch functions would
>> get called at appropriate times.
> 
> Over time, I do expect it to replace domain_relinquish_resources().  I'm
> not sure if that will take the form of a specific arch_domain_teardown()
> call (and reserving some space in teardown.val for arch use), or whether
> it will be a load of possibly-CONFIG'd stubs.

I'd consider it helpful if you could slightly re-word the description then.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-22 11:52       ` Jan Beulich
@ 2020-12-22 13:33         ` Andrew Cooper
  2020-12-22 13:45           ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 13:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22/12/2020 11:52, Jan Beulich wrote:
> On 22.12.2020 12:28, Andrew Cooper wrote:
>> On 22/12/2020 10:48, Jan Beulich wrote:
>>> On 21.12.2020 19:14, Andrew Cooper wrote:
>>>> First of all, rename the evtchn APIs:
>>>>  * evtchn_destroy       => evtchn_teardown
>>>>  * evtchn_destroy_final => evtchn_destroy
>>> I wonder in how far this is going to cause confusion with backports
>>> down the road. May I suggest to do only the first of the two renames,
>>> at least until in a couple of year's time? Or make the second rename
>>> to e.g. evtchn_cleanup() or evtchn_deinit()?
>> I considered backports, but I don't think it will be an issue.  The
>> contents of the two functions are very different, and we're not likely
>> to be moving the callers in backports.
> Does the same also apply to the old and new call sites of the functions?

I don't understand your question.  I don't intend the new callsites to
ever move again, now they're part of the properly idempotent path, and
any movement in the older trees would be wrong for anything other than
backporting this fix, which clearly isn't a backport candidate.

(That said - there's a memory leak I need to create a backport for...)

>> I'm not fussed about the exact naming, so long as we can make and
>> agreement and adhere to it strictly.  The current APIs are a total mess.
>>
>> I used teardown/destroy because that seems to be one common theme in the
>> APIs, but it will require some to change their name.
> So for domains "teardown" and "destroy" pair up with "create". I don't
> think evtchn_create() is a sensible name (the function doesn't really
> "create" anything); evtchn_init() seems quite a bit better to me, and
> hence evtchn_deinit() could be its counterpart.

You're never going to find a perfect name for all cases, and in this
proposal, you've still got evtchn_init/deinit/destroy() as a triple.

What we do need is some clear rules, which will live in the forthcoming
"lifecycle of a domain" document.

> In particular I don't
> think all smaller entity functions involved in doing "xyz" for a
> larger entity need to have "xyz" in their names.

While in principle I agree, I'm not sure the value of having perfect
names outweighs that of having a simple set of guidelines.

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] xen/evtchn: Clean up teardown handling
  2020-12-22 13:33         ` Andrew Cooper
@ 2020-12-22 13:45           ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2020-12-22 13:45 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Xen-devel

On 22.12.2020 14:33, Andrew Cooper wrote:
> On 22/12/2020 11:52, Jan Beulich wrote:
>> On 22.12.2020 12:28, Andrew Cooper wrote:
>>> On 22/12/2020 10:48, Jan Beulich wrote:
>>>> On 21.12.2020 19:14, Andrew Cooper wrote:
>>>>> First of all, rename the evtchn APIs:
>>>>>  * evtchn_destroy       => evtchn_teardown
>>>>>  * evtchn_destroy_final => evtchn_destroy
>>>> I wonder in how far this is going to cause confusion with backports
>>>> down the road. May I suggest to do only the first of the two renames,
>>>> at least until in a couple of year's time? Or make the second rename
>>>> to e.g. evtchn_cleanup() or evtchn_deinit()?
>>> I considered backports, but I don't think it will be an issue.  The
>>> contents of the two functions are very different, and we're not likely
>>> to be moving the callers in backports.
>> Does the same also apply to the old and new call sites of the functions?
> 
> I don't understand your question.  I don't intend the new callsites to
> ever move again, now they're part of the properly idempotent path, and
> any movement in the older trees would be wrong for anything other than
> backporting this fix, which clearly isn't a backport candidate.
> 
> (That said - there's a memory leak I need to create a backport for...)

My thinking was that call sites of functions also serve as references
or anchors when you do backports. Having identically named functions
with different purposes may mislead people - both those doing backports
on a very occasional basis, and those of us who do this regularly, but
only on halfway recent trees. I, for one, keep
forgetting to check for bool/true/false when moving to 4.7, or the
-ERESTART <=> -EAGAIN change after 4.4(?). For the former I'll be
saved by the compiler yelling at me, but for the latter one needs to
recognize the need for an adjustment. I'm afraid of the same thing
(granted at a lower probability) potentially happening here, down the
road.

Jan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-22 11:14     ` Andrew Cooper
@ 2020-12-22 15:47       ` Tamas K Lengyel
  2020-12-22 17:17         ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Tamas K Lengyel @ 2020-12-22 15:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross, Xen-devel


On Tue, Dec 22, 2020 at 6:14 AM Andrew Cooper <andrew.cooper3@citrix.com>
wrote:

> On 22/12/2020 10:00, Jan Beulich wrote:
> > On 21.12.2020 20:36, Andrew Cooper wrote:
> >> Hello,
> >>
> >> We have some very complicated hypercalls, createdomain, and max_vcpus a
> >> close second, with immense complexity, and very hard-to-test error
> handling.
> >>
> >> It is no surprise that the error handling is riddled with bugs.
> >>
> >> Random failures from core functions is one way, but I'm not sure that
> >> will be especially helpful.  In particular, we'd need a way to exclude
> >> "dom0 critical" operations so we've got a usable system to run testing
> on.
> >>
> >> As an alternative, how about adding a fault_ttl field into the
> hypercall?
> >>
> >> The exact paths taken in {domain,vcpu}_create() are sensitive to the
> >> hardware, Xen Kconfig, and other parameters passed into the
> >> hypercall(s).  The testing logic doesn't really want to care about what
> >> failed; simply that the error was handled correctly.
> >>
> >> So a test for this might look like:
> >>
> >> cfg = { ... };
> >> while ( xc_create_domain(xch, cfg) < 0 )
> >>     cfg.fault_ttl++;
> >>
> >>
> >> The pro's of this approach is that for a specific build of Xen on a
> >> piece of hardware, it ought to check every failure path in
> >> domain_create(), until the ttl finally gets higher than the number of
> >> fail-able actions required to construct a domain.  Also, the test
> >> doesn't need changing as the complexity of domain_create() changes.
> >>
> >> The main con will mostly likely be the invasiveness of code in Xen, but
> >> I suppose any fault injection is going to be invasive to a certain
> extent.
> > While I like the idea in principle, the innocent looking
> >
> > cfg = { ... };
> >
> > is quite a bit of a concern here as well: Depending on the precise
> > settings, paths taken in the hypervisor may heavily vary, and hence
> > such a test will only end up being useful if it covers a wide
> > variety of settings. Even if the number of tests to execute turned
> > out to still be manageable today, it may quickly turn out not
> > sufficiently scalable as we add new settings controllable right at
> > domain creation (which I understand is the plan).
>
> Well - there are two aspects here.
>
> First, 99% of all VMs in practice are one of 3 or 4 configurations.  An
> individual configuration is O(n) time complexity to test with fault_ttl,
> depending on the size of Xen's logic, and we absolutely want to be able
> to test these deterministically and to completion.
>
> For the plethora of other configurations, I agree that it is infeasible
> to test them all.  However, a hypercall like this is easy to wire up
> into a fuzzing harness.
>
> TBH, I was thinking of something like
> https://github.com/intel/kernel-fuzzer-for-xen-project with a PVH Xen
> and XTF "dom0" poking just this hypercall.  All the other complicated
> bits of wiring AFL up appear to have been done.
>
> Perhaps when we exhaust that as a source of bugs, we move onto fuzzing
> the L0 Xen, because running on native will give it more paths to
> explore.  We'd need some way of reporting path/trace data back to AFL in
> dom0 which might require a bit plumbing.


This is a pretty cool idea; I would be very interested in trying this out.
If running Xen nested in an HVM domain is possible (my experiments with
nested setups using Xen have only worked on ancient hw the last time I
tried), then running the fuzzer would be entirely possible using VM forks.
You don't even need a special "dom0"; you could just add the fuzzer's CPUID
harness to Xen's hypercall handler, and the only thing needed from the
nested dom0 would be to trigger the hypercall with a normal config. The
fuzzer would take it from there and replace the config with the fuzzed
version directly in VM forks. What to report as a "crash" to AFL would
still need to be defined manually for Xen, as the current sink points
are Linux-specific (
https://github.com/intel/kernel-fuzzer-for-xen-project/blob/master/src/sink.h),
but that should be straightforward.

Also, running the fuzzer with PVH guests hasn't been tested, but since all
VM forking needs is EPT, it should work.
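
For illustration, a harness marker in the nested Xen's hypercall path
might look roughly like this.  The magic leaf value, names and placement
are placeholders for whatever the fuzzer actually expects, not KF/x or
Xen code:

    /* Sketch only: HARNESS_CPUID_MAGIC is a made-up value. */
    #define HARNESS_CPUID_MAGIC 0x13371337

    static inline void harness_marker(void)
    {
        unsigned int eax = HARNESS_CPUID_MAGIC, ebx, ecx, edx;

        /* The L0 fuzzer intercepts this CPUID, forks the VM, and injects
         * the fuzzed createdomain config before resuming. */
        asm volatile ( "cpuid"
                       : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) );
    }

Placed around e.g. the XEN_DOMCTL_createdomain case in do_domctl(), the
nested dom0 would then only need to issue one ordinary hypercall.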

Tamas


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-22 15:47       ` Tamas K Lengyel
@ 2020-12-22 17:17         ` Andrew Cooper
  2020-12-22 18:24           ` Tamas K Lengyel
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2020-12-22 17:17 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross, Xen-devel

On 22/12/2020 15:47, Tamas K Lengyel wrote:
>
>
> On Tue, Dec 22, 2020 at 6:14 AM Andrew Cooper
> <andrew.cooper3@citrix.com <mailto:andrew.cooper3@citrix.com>> wrote:
>
>     On 22/12/2020 10:00, Jan Beulich wrote:
>     > On 21.12.2020 20:36, Andrew Cooper wrote:
>     >> Hello,
>     >>
>     >> We have some very complicated hypercalls, createdomain, and
>     max_vcpus a
>     >> close second, with immense complexity, and very hard-to-test
>     error handling.
>     >>
>     >> It is no surprise that the error handling is riddled with bugs.
>     >>
>     >> Random failures from core functions is one way, but I'm not
>     sure that
>     >> will be especially helpful.  In particular, we'd need a way to
>     exclude
>     >> "dom0 critical" operations so we've got a usable system to run
>     testing on.
>     >>
>     >> As an alternative, how about adding a fault_ttl field into the
>     hypercall?
>     >>
>     >> The exact paths taken in {domain,vcpu}_create() are sensitive
>     to the
>     >> hardware, Xen Kconfig, and other parameters passed into the
>     >> hypercall(s).  The testing logic doesn't really want to care
>     about what
>     >> failed; simply that the error was handled correctly.
>     >>
>     >> So a test for this might look like:
>     >>
>     >> cfg = { ... };
>     >> while ( xc_create_domain(xch, cfg) < 0 )
>     >>     cfg.fault_ttl++;
>     >>
>     >>
>     >> The pro's of this approach is that for a specific build of Xen on a
>     >> piece of hardware, it ought to check every failure path in
>     >> domain_create(), until the ttl finally gets higher than the
>     number of
>     >> fail-able actions required to construct a domain.  Also, the test
>     >> doesn't need changing as the complexity of domain_create() changes.
>     >>
>     >> The main con will mostly likely be the invasiveness of code in
>     Xen, but
>     >> I suppose any fault injection is going to be invasive to a
>     certain extent.
>     > While I like the idea in principle, the innocent looking
>     >
>     > cfg = { ... };
>     >
>     > is quite a bit of a concern here as well: Depending on the precise
>     > settings, paths taken in the hypervisor may heavily vary, and hence
>     > such a test will only end up being useful if it covers a wide
>     > variety of settings. Even if the number of tests to execute turned
>     > out to still be manageable today, it may quickly turn out not
>     > sufficiently scalable as we add new settings controllable right at
>     > domain creation (which I understand is the plan).
>
>     Well - there are two aspects here.
>
>     First, 99% of all VMs in practice are one of 3 or 4
>     configurations.  An
>     individual configuration is O(n) time complexity to test with
>     fault_ttl,
>     depending on the size of Xen's logic, and we absolutely want to be
>     able
>     to test these deterministically and to completion.
>
>     For the plethora of other configurations, I agree that it is
>     infeasible
>     to test them all.  However, a hypercall like this is easy to wire up
>     into a fuzzing harness.
>
>     TBH, I was thinking of something like
>     https://github.com/intel/kernel-fuzzer-for-xen-project with a PVH Xen
>     and XTF "dom0" poking just this hypercall.  All the other complicated
>     bits of wiring AFL up appear to have been done.
>
>     Perhaps when we exhaust that as a source of bugs, we move onto fuzzing
>     the L0 Xen, because running on native will give it more paths to
>     explore.  We'd need some way of reporting path/trace data back to
>     AFL in
>     dom0 which might require a bit plumbing.
>
>
> This is a pretty cool idea, I would be very interested in trying this
> out. If running Xen nested in a HVM domain is possible (my experiments
> with nested setups using Xen have only worked on ancient hw last time
> I tried) then running the fuzzer would be entirely possible using VM
> forks. You don't even need a special "dom0", you could just add the
> fuzzer's CPUID harness to Xen's hypercall handler and the only thing
> needed from the nested dom0 would be to trigger the hypercall with a
> normal config. The fuzzer would take it from there and replace the
> config with the fuzzed version directly in VM forks. Defining what to
> report as a "crash" to AFL would still need to be defined manually for
> Xen as the current sink points are Linux specific
> (https://github.com/intel/kernel-fuzzer-for-xen-project/blob/master/src/sink.h),
> but that should be straight forward.
>
> Also, running the fuzzer with PVH guests hasn't been tested but since
> all VM forking needs is EPT it should work.

Xen running inside Xen definitely works, and is even supported as far as
PV-Shim goes (i.e. no nested virt).  That would limit testing to just
creation of PV guests at L1, which is plenty to get started with.

Xen nested under Xen does work to a first approximation, and for the
purposes of fuzzing areas other than the nested-virt logic, might even
be ok.  (I use this configuration for a fair chunk of hvm development).

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
  2020-12-22 17:17         ` Andrew Cooper
@ 2020-12-22 18:24           ` Tamas K Lengyel
  0 siblings, 0 replies; 28+ messages in thread
From: Tamas K Lengyel @ 2020-12-22 18:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Juergen Gross, Xen-devel

On Tue, Dec 22, 2020 at 12:18 PM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 22/12/2020 15:47, Tamas K Lengyel wrote:
>
>
>
> On Tue, Dec 22, 2020 at 6:14 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>
>> On 22/12/2020 10:00, Jan Beulich wrote:
>> > On 21.12.2020 20:36, Andrew Cooper wrote:
>> >> Hello,
>> >>
>> >> We have some very complicated hypercalls, createdomain, and max_vcpus a
>> >> close second, with immense complexity, and very hard-to-test error handling.
>> >>
>> >> It is no surprise that the error handling is riddled with bugs.
>> >>
>> >> Random failures from core functions is one way, but I'm not sure that
>> >> will be especially helpful.  In particular, we'd need a way to exclude
>> >> "dom0 critical" operations so we've got a usable system to run testing on.
>> >>
>> >> As an alternative, how about adding a fault_ttl field into the hypercall?
>> >>
>> >> The exact paths taken in {domain,vcpu}_create() are sensitive to the
>> >> hardware, Xen Kconfig, and other parameters passed into the
>> >> hypercall(s).  The testing logic doesn't really want to care about what
>> >> failed; simply that the error was handled correctly.
>> >>
>> >> So a test for this might look like:
>> >>
>> >> cfg = { ... };
>> >> while ( xc_create_domain(xch, cfg) < 0 )
>> >>     cfg.fault_ttl++;
>> >>
>> >>
>> >> The pro's of this approach is that for a specific build of Xen on a
>> >> piece of hardware, it ought to check every failure path in
>> >> domain_create(), until the ttl finally gets higher than the number of
>> >> fail-able actions required to construct a domain.  Also, the test
>> >> doesn't need changing as the complexity of domain_create() changes.
>> >>
>> >> The main con will mostly likely be the invasiveness of code in Xen, but
>> >> I suppose any fault injection is going to be invasive to a certain extent.
>> > While I like the idea in principle, the innocent looking
>> >
>> > cfg = { ... };
>> >
>> > is quite a bit of a concern here as well: Depending on the precise
>> > settings, paths taken in the hypervisor may heavily vary, and hence
>> > such a test will only end up being useful if it covers a wide
>> > variety of settings. Even if the number of tests to execute turned
>> > out to still be manageable today, it may quickly turn out not
>> > sufficiently scalable as we add new settings controllable right at
>> > domain creation (which I understand is the plan).
>>
>> Well - there are two aspects here.
>>
>> First, 99% of all VMs in practice are one of 3 or 4 configurations.  An
>> individual configuration is O(n) time complexity to test with fault_ttl,
>> depending on the size of Xen's logic, and we absolutely want to be able
>> to test these deterministically and to completion.
>>
>> For the plethora of other configurations, I agree that it is infeasible
>> to test them all.  However, a hypercall like this is easy to wire up
>> into a fuzzing harness.
>>
>> TBH, I was thinking of something like
>> https://github.com/intel/kernel-fuzzer-for-xen-project with a PVH Xen
>> and XTF "dom0" poking just this hypercall.  All the other complicated
>> bits of wiring AFL up appear to have been done.
>>
>> Perhaps when we exhaust that as a source of bugs, we move onto fuzzing
>> the L0 Xen, because running on native will give it more paths to
>> explore.  We'd need some way of reporting path/trace data back to AFL in
>> dom0 which might require a bit plumbing.
>
>
> This is a pretty cool idea, I would be very interested in trying this out. If running Xen nested in a HVM domain is possible (my experiments with nested setups using Xen have only worked on ancient hw last time I tried) then running the fuzzer would be entirely possible using VM forks. You don't even need a special "dom0", you could just add the fuzzer's CPUID harness to Xen's hypercall handler and the only thing needed from the nested dom0 would be to trigger the hypercall with a normal config. The fuzzer would take it from there and replace the config with the fuzzed version directly in VM forks. Defining what to report as a "crash" to AFL would still need to be defined manually for Xen as the current sink points are Linux specific (https://github.com/intel/kernel-fuzzer-for-xen-project/blob/master/src/sink.h), but that should be straight forward.
>
> Also, running the fuzzer with PVH guests hasn't been tested but since all VM forking needs is EPT it should work.
>
>
> Xen running inside Xen definitely works, and is even supported as far as PV-Shim goes (i.e. no nested virt).  That would limit testing to just creation of PV guests at L1, which is plenty to get started with.
>
> Xen nested under Xen does work to a first approximation, and for the purposes of fuzzing areas other than the nested-virt logic, might even be ok.  (I use this configuration for a fair chunk of hvm development).
>

Sounds like there is no blocker on fuzzing any hypercall handler then.
Just add the harness, decide which code-paths count as a sink, and off
you go. It should work with PT-based coverage, no problem. If you need to
fuzz multiple hypercalls, that may require some tweaking, as the PT-based
coverage doesn't support a change in CR3 right now (it's just a limitation
of libxdc, which does the PT decoding). You could always fall back to the
disassemble/breakpoint/singlestep coverage option, but you would need to
add the vmcall/vmenter instructions to the control-flow instruction list
here: https://github.com/intel/kernel-fuzzer-for-xen-project/blob/master/src/tracer.c#L46.
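
Assuming the breakpoint tracer leans on Capstone for the disassembly (an
assumption on my part -- the real tracer.c layout may well differ), the
addition might amount to little more than treating the VMX transition
instructions as control flow too:

    /* Sketch only, not the actual tracer.c code. */
    #include <stdbool.h>
    #include <capstone/capstone.h>

    static bool insn_is_control_flow(const cs_insn *insn)
    {
        switch ( insn->id )
        {
        /* ... the existing jmp/call/ret/etc. cases ... */
        case X86_INS_VMCALL:    /* guest -> hypervisor transition */
        case X86_INS_VMLAUNCH:
        case X86_INS_VMRESUME:  /* hypervisor -> guest transition */
            return true;
        default:
            return false;
        }
    }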

Tamas


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-12-22 18:25 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-21 18:14 [PATCH 0/3] xen/domain: More structured teardown Andrew Cooper
2020-12-21 18:14 ` [PATCH 1/3] xen/domain: Reorder trivial initialisation in early domain_create() Andrew Cooper
2020-12-22 10:10   ` Jan Beulich
2020-12-22 10:24     ` Andrew Cooper
2020-12-22 10:50       ` Jan Beulich
2020-12-21 18:14 ` [PATCH 2/3] xen/domain: Introduce domain_teardown() Andrew Cooper
2020-12-21 18:36   ` Julien Grall
2020-12-21 18:45     ` Andrew Cooper
2020-12-22  7:50       ` Jan Beulich
2020-12-22 10:25         ` Julien Grall
2020-12-22 10:53           ` Jan Beulich
2020-12-22 11:05             ` Julien Grall
2020-12-22 11:11             ` Andrew Cooper
2020-12-22 10:35   ` Jan Beulich
2020-12-22 11:46     ` Andrew Cooper
2020-12-22 11:55       ` Jan Beulich
2020-12-21 18:14 ` [PATCH 3/3] xen/evtchn: Clean up teardown handling Andrew Cooper
2020-12-22 10:48   ` Jan Beulich
2020-12-22 11:28     ` Andrew Cooper
2020-12-22 11:52       ` Jan Beulich
2020-12-22 13:33         ` Andrew Cooper
2020-12-22 13:45           ` Jan Beulich
2020-12-21 19:36 ` Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown) Andrew Cooper
2020-12-22 10:00   ` Jan Beulich
2020-12-22 11:14     ` Andrew Cooper
2020-12-22 15:47       ` Tamas K Lengyel
2020-12-22 17:17         ` Andrew Cooper
2020-12-22 18:24           ` Tamas K Lengyel
