All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Containing AER unrecoverable errors
@ 2017-07-27  0:16 Venu Busireddy
  2017-07-27  0:16 ` [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors Venu Busireddy
  2017-07-27  0:16 ` [PATCH v2 2/2] xl: Register the AER event handler to handle " Venu Busireddy
  0 siblings, 2 replies; 12+ messages in thread
From: Venu Busireddy @ 2017-07-27  0:16 UTC (permalink / raw)
  To: venu.busireddy, xen-devel, Ian Jackson, Wei Liu
  Cc: Andrew Cooper, Jan Beulich

This patch set is part of a set of patches that together allow containment
of unrecoverable AER errors from PCIe devices assigned to guests in
passthrough mode. The containment is achieved by forcibly removing the
erring PCIe device from the guest.

The xen-pciback patch corresponding to this patch set is:
https://lists.xen.org/archives/html/xen-devel/2017-06/msg03274.html

Changes in v2:
  - Instead of killing the guest and hiding the device, forcibly remove
    the device from the guest.

Venu Busireddy (2):
  libxl: Implement the handler to handle unrecoverable AER errors.
  xl: Register the AER event handler to handle AER errors.

 tools/libxl/libxl_event.h |  2 ++
 tools/libxl/libxl_pci.c   | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 tools/xl/xl_vmcontrol.c   | 11 ++++++
 3 files changed, 98 insertions(+)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-27  0:16 [PATCH v2 0/2] Containing AER unrecoverable errors Venu Busireddy
@ 2017-07-27  0:16 ` Venu Busireddy
  2017-07-28 15:58   ` Wei Liu
  2017-07-28 16:39   ` Ian Jackson
  2017-07-27  0:16 ` [PATCH v2 2/2] xl: Register the AER event handler to handle " Venu Busireddy
  1 sibling, 2 replies; 12+ messages in thread
From: Venu Busireddy @ 2017-07-27  0:16 UTC (permalink / raw)
  To: venu.busireddy, xen-devel, Ian Jackson, Wei Liu
  Cc: Andrew Cooper, Jan Beulich

Implement the callback function to handle unrecoverable AER errors, and
also the public APIs that can be used to register/unregister the handler.
When an AER error occurs, the handler will forcibly remove the erring
PCIe device from the guest.

Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
---
 tools/libxl/libxl_event.h |  2 ++
 tools/libxl/libxl_pci.c   | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/tools/libxl/libxl_event.h b/tools/libxl/libxl_event.h
index 1ea789e..d932432 100644
--- a/tools/libxl/libxl_event.h
+++ b/tools/libxl/libxl_event.h
@@ -178,6 +178,8 @@ void libxl_event_register_callbacks(libxl_ctx *ctx,
 typedef struct libxl__evgen_domain_death libxl_evgen_domain_death;
 int libxl_evenable_domain_death(libxl_ctx *ctx, uint32_t domid,
                          libxl_ev_user, libxl_evgen_domain_death **evgen_out);
+int libxl_reg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
+void libxl_unreg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
 void libxl_evdisable_domain_death(libxl_ctx *ctx, libxl_evgen_domain_death*);
   /* Arranges for the generation of DOMAIN_SHUTDOWN and DOMAIN_DEATH
    * events.  A domain which is destroyed before it shuts down
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
index 65ad5e5..15d9f0c 100644
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -1678,6 +1678,91 @@ static int libxl_device_pci_compare(libxl_device_pci *d1,
     return COMPARE_PCI(d1, d2);
 }
 
+typedef struct {
+    uint32_t domid;
+    libxl__ev_xswatch watch;
+} libxl_aer_watch;
+static libxl_aer_watch aer_watch;
+
+static void aer_backend_watch_callback(libxl__egc *egc,
+                                       libxl__ev_xswatch *watch,
+                                       const char *watch_path,
+                                       const char *event_path)
+{
+    EGC_GC;
+    libxl_aer_watch *l_aer_watch = CONTAINER_OF(watch, *l_aer_watch, watch);
+    uint32_t domid = l_aer_watch->domid;
+    uint32_t dom, bus, dev, fn;
+    int rc;
+    char *p, *path, *dst_path;
+    const char *aerFailedSBDF;
+    struct xs_permissions rwperm[1];
+    libxl_device_pci pcidev;
+
+    /* Extract the backend directory. */
+    path = libxl__strdup(gc, event_path);
+    p = strrchr(path, '/');
+    if (p == NULL)
+        goto skip;
+    if (strcmp(p, "/aerFailedSBDF") != 0)
+        goto skip;
+    /* Truncate the string so it points to the backend directory. */
+    *p = '\0';
+
+    /* Fetch the value of the failed PCI device. */
+    rc = libxl__xs_read_checked(gc, XBT_NULL,
+            GCSPRINTF("%s/aerFailedSBDF", path), &aerFailedSBDF);
+    if (rc || !aerFailedSBDF)
+        goto skip;
+    sscanf(aerFailedSBDF, "%x:%x:%x.%x", &dom, &bus, &dev, &fn);
+
+    libxl_device_pci_init(&pcidev);
+    pcidev_struct_fill(&pcidev, dom, bus, dev, fn, 0);
+    /* Forcibly remove the device from the guest */
+    rc = libxl__device_pci_remove_common(gc, domid, &pcidev, 1);
+
+    rwperm[0].id = 0;
+    rwperm[0].perms = XS_PERM_READ | XS_PERM_WRITE;
+    dst_path = GCSPRINTF("/local/domain/0/backend/pci/0/0/%s", "aerFailedPCIs");
+    rc = libxl__xs_mknod(gc, XBT_NULL, dst_path, rwperm, 1);
+    if (rc) {
+        LOGD(ERROR, domid, " libxl__xs_mknod() failed, rc = %d", rc);
+        goto skip;
+    }
+
+    rc = libxl__xs_write_checked(gc, XBT_NULL, dst_path, aerFailedSBDF);
+    if (rc)
+        LOGD(ERROR, domid, " libxl__xs_write_checked() failed, rc = %d", rc);
+
+skip:
+    return;
+}
+
+int libxl_reg_aer_events_handler(libxl_ctx *ctx, uint32_t domid)
+{
+    int rc = 0;
+    char *be_path;
+    GC_INIT(ctx);
+
+    aer_watch.domid = domid;
+    be_path = GCSPRINTF("/local/domain/0/backend/pci/%u/0/aerFailedSBDF", domid);
+    rc = libxl__ev_xswatch_register(gc, &aer_watch.watch,
+                                    aer_backend_watch_callback, be_path);
+
+    GC_FREE;
+    return rc;
+}
+
+void libxl_unreg_aer_events_handler(libxl_ctx *ctx, uint32_t domid)
+{
+    GC_INIT(ctx);
+
+    libxl__ev_xswatch_deregister(gc, &aer_watch.watch);
+
+    GC_FREE;
+    return;
+}
+
 DEFINE_DEVICE_TYPE_STRUCT_X(pcidev, pci);
 
 /*

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/2] xl: Register the AER event handler to handle AER errors.
  2017-07-27  0:16 [PATCH v2 0/2] Containing AER unrecoverable errors Venu Busireddy
  2017-07-27  0:16 ` [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors Venu Busireddy
@ 2017-07-27  0:16 ` Venu Busireddy
  2017-07-28 15:58   ` Wei Liu
  1 sibling, 1 reply; 12+ messages in thread
From: Venu Busireddy @ 2017-07-27  0:16 UTC (permalink / raw)
  To: venu.busireddy, xen-devel, Ian Jackson, Wei Liu
  Cc: Andrew Cooper, Jan Beulich

When a guest is created, register the AER event handler to handle the
AER errors. When an AER error occurs, the handler will forcibly remove
the erring PCIe device from the guest.

Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
---
 tools/xl/xl_vmcontrol.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 89c2b25..10a48a9 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -966,6 +966,15 @@ start:
     LOG("Waiting for domain %s (domid %u) to die [pid %ld]",
         d_config.c_info.name, domid, (long)getpid());
 
+    ret = libxl_reg_aer_events_handler(ctx, domid);
+    if (ret) {
+        /*
+         * This error may not be severe enough to fail the creation of the VM.
+         * Log the error, and continue with the creation.
+         */
+        LOG("libxl_reg_aer_events_handler() failed, ret = 0x%08x", ret);
+    }
+
     ret = libxl_evenable_domain_death(ctx, domid, 0, &deathw);
     if (ret) goto out;
 
@@ -993,6 +1002,7 @@ start:
             LOG("Domain %u has shut down, reason code %d 0x%x", domid,
                 event->u.domain_shutdown.shutdown_reason,
                 event->u.domain_shutdown.shutdown_reason);
+            libxl_unreg_aer_events_handler(ctx, domid);
             switch (handle_domain_death(&domid, event, &d_config)) {
             case DOMAIN_RESTART_SOFT_RESET:
                 domid_soft_reset = domid;
@@ -1059,6 +1069,7 @@ start:
 
         case LIBXL_EVENT_TYPE_DOMAIN_DEATH:
             LOG("Domain %u has been destroyed.", domid);
+            libxl_unreg_aer_events_handler(ctx, domid);
             libxl_event_free(ctx, event);
             ret = 0;
             goto out;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-27  0:16 ` [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors Venu Busireddy
@ 2017-07-28 15:58   ` Wei Liu
  2017-07-28 17:15     ` Venu Busireddy
  2017-07-28 16:39   ` Ian Jackson
  1 sibling, 1 reply; 12+ messages in thread
From: Wei Liu @ 2017-07-28 15:58 UTC (permalink / raw)
  To: Venu Busireddy
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Wed, Jul 26, 2017 at 07:16:38PM -0500, Venu Busireddy wrote:
> Implement the callback function to handle unrecoverable AER errors, and
> also the public APIs that can be used to register/unregister the handler.
> When an AER error occurs, the handler will forcibly remove the erring
> PCIe device from the guest.
> 
> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> ---
>  tools/libxl/libxl_event.h |  2 ++
>  tools/libxl/libxl_pci.c   | 85 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 87 insertions(+)
> 

Please also add a LIBXL_HAVE macro to libxl.h. There are plenty of
examples there.

> +
> +int libxl_reg_aer_events_handler(libxl_ctx *ctx, uint32_t domid)
> +{
> +    int rc = 0;
> +    char *be_path;
> +    GC_INIT(ctx);
> +
> +    aer_watch.domid = domid;
> +    be_path = GCSPRINTF("/local/domain/0/backend/pci/%u/0/aerFailedSBDF", domid);

I think the best thing to do is you get the domid using
libxl__get_domid. Try not to hard-code 0.

Same for your callback function. And there are quite a few 0's that I'm
not sure what they stand for.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/2] xl: Register the AER event handler to handle AER errors.
  2017-07-27  0:16 ` [PATCH v2 2/2] xl: Register the AER event handler to handle " Venu Busireddy
@ 2017-07-28 15:58   ` Wei Liu
  2017-07-28 16:06     ` Venu Busireddy
  0 siblings, 1 reply; 12+ messages in thread
From: Wei Liu @ 2017-07-28 15:58 UTC (permalink / raw)
  To: Venu Busireddy
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Wed, Jul 26, 2017 at 07:16:39PM -0500, Venu Busireddy wrote:
> When a guest is created, register the AER event handler to handle the
> AER errors. When an AER error occurs, the handler will forcibly remove
> the erring PCIe device from the guest.
> 
> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> ---
>  tools/xl/xl_vmcontrol.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
> index 89c2b25..10a48a9 100644
> --- a/tools/xl/xl_vmcontrol.c
> +++ b/tools/xl/xl_vmcontrol.c
> @@ -966,6 +966,15 @@ start:
>      LOG("Waiting for domain %s (domid %u) to die [pid %ld]",
>          d_config.c_info.name, domid, (long)getpid());
>  
> +    ret = libxl_reg_aer_events_handler(ctx, domid);
> +    if (ret) {
> +        /*
> +         * This error may not be severe enough to fail the creation of the VM.
> +         * Log the error, and continue with the creation.
> +         */

What does this comment mean? This is called after the guest is created,
right?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/2] xl: Register the AER event handler to handle AER errors.
  2017-07-28 15:58   ` Wei Liu
@ 2017-07-28 16:06     ` Venu Busireddy
  2017-07-28 16:14       ` Wei Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Venu Busireddy @ 2017-07-28 16:06 UTC (permalink / raw)
  To: Wei Liu; +Cc: Andrew Cooper, Ian Jackson, Jan Beulich, xen-devel

On 2017-07-28 16:58:16 +0100, Wei Liu wrote:
> On Wed, Jul 26, 2017 at 07:16:39PM -0500, Venu Busireddy wrote:
> > When a guest is created, register the AER event handler to handle the
> > AER errors. When an AER error occurs, the handler will forcibly remove
> > the erring PCIe device from the guest.
> > 
> > Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> > ---
> >  tools/xl/xl_vmcontrol.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
> > index 89c2b25..10a48a9 100644
> > --- a/tools/xl/xl_vmcontrol.c
> > +++ b/tools/xl/xl_vmcontrol.c
> > @@ -966,6 +966,15 @@ start:
> >      LOG("Waiting for domain %s (domid %u) to die [pid %ld]",
> >          d_config.c_info.name, domid, (long)getpid());
> >  
> > +    ret = libxl_reg_aer_events_handler(ctx, domid);
> > +    if (ret) {
> > +        /*
> > +         * This error may not be severe enough to fail the creation of the VM.
> > +         * Log the error, and continue with the creation.
> > +         */
> 
> What does this comment mean? This is called after the guest is created,
> right?

Maybe too many words. I can change it as:

/* Log the error, and move on...  */

Is that okay?

Venu


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/2] xl: Register the AER event handler to handle AER errors.
  2017-07-28 16:06     ` Venu Busireddy
@ 2017-07-28 16:14       ` Wei Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Liu @ 2017-07-28 16:14 UTC (permalink / raw)
  To: Venu Busireddy
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Fri, Jul 28, 2017 at 09:06:03AM -0700, Venu Busireddy wrote:
> On 2017-07-28 16:58:16 +0100, Wei Liu wrote:
> > On Wed, Jul 26, 2017 at 07:16:39PM -0500, Venu Busireddy wrote:
> > > When a guest is created, register the AER event handler to handle the
> > > AER errors. When an AER error occurs, the handler will forcibly remove
> > > the erring PCIe device from the guest.
> > > 
> > > Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> > > ---
> > >  tools/xl/xl_vmcontrol.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
> > > index 89c2b25..10a48a9 100644
> > > --- a/tools/xl/xl_vmcontrol.c
> > > +++ b/tools/xl/xl_vmcontrol.c
> > > @@ -966,6 +966,15 @@ start:
> > >      LOG("Waiting for domain %s (domid %u) to die [pid %ld]",
> > >          d_config.c_info.name, domid, (long)getpid());
> > >  
> > > +    ret = libxl_reg_aer_events_handler(ctx, domid);
> > > +    if (ret) {
> > > +        /*
> > > +         * This error may not be severe enough to fail the creation of the VM.
> > > +         * Log the error, and continue with the creation.
> > > +         */
> > 
> > What does this comment mean? This is called after the guest is created,
> > right?
> 
> Maybe too many words. I can change it as:
> 
> /* Log the error, and move on...  */
> 
> Is that okay?
> 

Fine.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-27  0:16 ` [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors Venu Busireddy
  2017-07-28 15:58   ` Wei Liu
@ 2017-07-28 16:39   ` Ian Jackson
  2017-07-28 23:56     ` Venu Busireddy
  1 sibling, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2017-07-28 16:39 UTC (permalink / raw)
  To: Venu Busireddy; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, xen-devel

Venu Busireddy writes ("[PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors."):
> Implement the callback function to handle unrecoverable AER errors, and
> also the public APIs that can be used to register/unregister the handler.
> When an AER error occurs, the handler will forcibly remove the erring
> PCIe device from the guest.

Why is this only sometimes the right thing to do ?  On what basis
might a user choose ?

If this is always the right thing to do then maybe we need to recast
this as a general "please run monitoring for this domain" call ?

> +int libxl_reg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
> +void libxl_unreg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;

I think these names are very unintuitive.  They describe the
implementation, not the effect.

The API seems awkward.  Inside libxl, events are only processed while
the application is inside libxl.  So for these functions to be
effective, the calling application must arrange to be running the
libxl event loop.  This should be documented, at least.

What happens if more than one process calls this at once ?

I looked at the message referred to in the 0/2
  https://lists.xen.org/archives/html/xen-devel/2017-06/msg03274.html
  https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
and they says that the approach taken is to kill the guest.
But the approach in these patches is not to kill the guest.

What am I missing ?

> +typedef struct {
> +    uint32_t domid;
> +    libxl__ev_xswatch watch;
> +} libxl_aer_watch;
> +static libxl_aer_watch aer_watch;

The global variable is completely unacceptable, I'm afraid.

> +static void aer_backend_watch_callback(libxl__egc *egc,
> +                                       libxl__ev_xswatch *watch,
> +                                       const char *watch_path,
> +                                       const char *event_path)
> +{
...

I haven't read this code in detail at this stage.

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-28 15:58   ` Wei Liu
@ 2017-07-28 17:15     ` Venu Busireddy
  2017-07-31 14:37       ` Wei Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Venu Busireddy @ 2017-07-28 17:15 UTC (permalink / raw)
  To: Wei Liu; +Cc: Andrew Cooper, Ian Jackson, Jan Beulich, xen-devel

On 2017-07-28 16:58:13 +0100, Wei Liu wrote:
> On Wed, Jul 26, 2017 at 07:16:38PM -0500, Venu Busireddy wrote:
> > Implement the callback function to handle unrecoverable AER errors, and
> > also the public APIs that can be used to register/unregister the handler.
> > When an AER error occurs, the handler will forcibly remove the erring
> > PCIe device from the guest.
> > 
> > Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> > ---
> >  tools/libxl/libxl_event.h |  2 ++
> >  tools/libxl/libxl_pci.c   | 85 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 87 insertions(+)
> > 
> 
> Please also add a LIBXL_HAVE macro to libxl.h. There are plenty of
> examples there.

I assume you meant, for example, something like:

/* LIBXL_HAVE_UNREG_AER_EVENTS_HANDLER
 *
 * If it is defined, libxl has a library function called
 * libxl_unreg_aer_events_handler.
 */
#define LIBXL_HAVE_UNREG_AER_EVENTS_HANDLER 1

If so, I will add them in the next revision.

> > +
> > +int libxl_reg_aer_events_handler(libxl_ctx *ctx, uint32_t domid)
> > +{
> > +    int rc = 0;
> > +    char *be_path;
> > +    GC_INIT(ctx);
> > +
> > +    aer_watch.domid = domid;
> > +    be_path = GCSPRINTF("/local/domain/0/backend/pci/%u/0/aerFailedSBDF", domid);
> 
> I think the best thing to do is you get the domid using
> libxl__get_domid. Try not to hard-code 0.
> 
> Same for your callback function. And there are quite a few 0's that I'm
> not sure what they stand for.

All those 0's are saying dom0's domid. Isn't dom0's domid always 0?
If that is not the case, I can use libxl__get_domid(). Please let me
know.

Venu


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-28 16:39   ` Ian Jackson
@ 2017-07-28 23:56     ` Venu Busireddy
  2017-07-31 15:03       ` Wei Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Venu Busireddy @ 2017-07-28 23:56 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, xen-devel

On 2017-07-28 17:39:52 +0100, Ian Jackson wrote:
> Venu Busireddy writes ("[PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors."):
> > Implement the callback function to handle unrecoverable AER errors, and
> > also the public APIs that can be used to register/unregister the handler.
> > When an AER error occurs, the handler will forcibly remove the erring
> > PCIe device from the guest.
> 
> Why is this only sometimes the right thing to do ?  On what basis
> might a user choose ?

This is not an "only sometimes" thing. User doesn't choose it. We always
want to watch for AER errors.

> If this is always the right thing to do then maybe we need to recast
> this as a general "please run monitoring for this domain" call ?

What does "recast" imply? Which "call" are you referring to?

> > +int libxl_reg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
> > +void libxl_unreg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
> 
> I think these names are very unintuitive.  They describe the
> implementation, not the effect.

These names can be changed to anything you want. Please suggest any names
of your choice, and I will change them. That ensures that we don't spend
more review revisions in fine tuning those names.

> The API seems awkward.  Inside libxl, events are only processed while
> the application is inside libxl.  So for these functions to be
> effective, the calling application must arrange to be running the
> libxl event loop.  This should be documented, at least.

I am afraid I don't follow you. This scheme is tested and it works. So, I
do not follow you when you say, "...for these functions to be effective,..."

> What happens if more than one process calls this at once ?

Each call is handled within a separate 'xl' process's context. I don't
see a problem there. Do you have anything specific in mind?

> I looked at the message referred to in the 0/2
>   https://lists.xen.org/archives/html/xen-devel/2017-06/msg03274.html
>   https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
> and they says that the approach taken is to kill the guest.
> But the approach in these patches is not to kill the guest.
> 
> What am I missing ?

My error. The first revision of these patches were coded to kill the
guest, and hence the commit message in that patch is still referring to
killing the guest. I will repost the xen_pciback patch with the correct
commit message, once we conclude the xen patch.

> > +typedef struct {
> > +    uint32_t domid;
> > +    libxl__ev_xswatch watch;
> > +} libxl_aer_watch;
> > +static libxl_aer_watch aer_watch;
> 
> The global variable is completely unacceptable, I'm afraid.

It was much easier to code and understand this way. I can avoid the
global variable. I will fix this in the next revision.

Thanks,

Venu

> > +static void aer_backend_watch_callback(libxl__egc *egc,
> > +                                       libxl__ev_xswatch *watch,
> > +                                       const char *watch_path,
> > +                                       const char *event_path)
> > +{
> ...
> 
> I haven't read this code in detail at this stage.
> 
> Thanks,
> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-28 17:15     ` Venu Busireddy
@ 2017-07-31 14:37       ` Wei Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Liu @ 2017-07-31 14:37 UTC (permalink / raw)
  To: Venu Busireddy
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Fri, Jul 28, 2017 at 10:15:40AM -0700, Venu Busireddy wrote:
> On 2017-07-28 16:58:13 +0100, Wei Liu wrote:
> > On Wed, Jul 26, 2017 at 07:16:38PM -0500, Venu Busireddy wrote:
> > > Implement the callback function to handle unrecoverable AER errors, and
> > > also the public APIs that can be used to register/unregister the handler.
> > > When an AER error occurs, the handler will forcibly remove the erring
> > > PCIe device from the guest.
> > > 
> > > Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> > > ---
> > >  tools/libxl/libxl_event.h |  2 ++
> > >  tools/libxl/libxl_pci.c   | 85 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 87 insertions(+)
> > > 
> > 
> > Please also add a LIBXL_HAVE macro to libxl.h. There are plenty of
> > examples there.
> 
> I assume you meant, for example, something like:
> 
> /* LIBXL_HAVE_UNREG_AER_EVENTS_HANDLER
>  *
>  * If it is defined, libxl has a library function called
>  * libxl_unreg_aer_events_handler.
>  */
> #define LIBXL_HAVE_UNREG_AER_EVENTS_HANDLER 1
> 
> If so, I will add them in the next revision.
> 
> > > +
> > > +int libxl_reg_aer_events_handler(libxl_ctx *ctx, uint32_t domid)
> > > +{
> > > +    int rc = 0;
> > > +    char *be_path;
> > > +    GC_INIT(ctx);
> > > +
> > > +    aer_watch.domid = domid;
> > > +    be_path = GCSPRINTF("/local/domain/0/backend/pci/%u/0/aerFailedSBDF", domid);
> > 
> > I think the best thing to do is you get the domid using
> > libxl__get_domid. Try not to hard-code 0.
> > 
> > Same for your callback function. And there are quite a few 0's that I'm
> > not sure what they stand for.
> 
> All those 0's are saying dom0's domid. Isn't dom0's domid always 0?
> If that is not the case, I can use libxl__get_domid(). Please let me
> know.

Don't assume the code will always run in dom0, so use get_domid is best.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors.
  2017-07-28 23:56     ` Venu Busireddy
@ 2017-07-31 15:03       ` Wei Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Liu @ 2017-07-31 15:03 UTC (permalink / raw)
  To: Venu Busireddy
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Fri, Jul 28, 2017 at 04:56:56PM -0700, Venu Busireddy wrote:
> On 2017-07-28 17:39:52 +0100, Ian Jackson wrote:
> > Venu Busireddy writes ("[PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors."):
> > > Implement the callback function to handle unrecoverable AER errors, and
> > > also the public APIs that can be used to register/unregister the handler.
> > > When an AER error occurs, the handler will forcibly remove the erring
> > > PCIe device from the guest.
> > 
> > Why is this only sometimes the right thing to do ?  On what basis
> > might a user choose ?
> 
> This is not an "only sometimes" thing. User doesn't choose it. We always
> want to watch for AER errors.
> 
> > If this is always the right thing to do then maybe we need to recast
> > this as a general "please run monitoring for this domain" call ?
> 
> What does "recast" imply? Which "call" are you referring to?
> 
> > > +int libxl_reg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
> > > +void libxl_unreg_aer_events_handler(libxl_ctx *, uint32_t) LIBXL_EXTERNAL_CALLERS_ONLY;
> > 
> > I think these names are very unintuitive.  They describe the
> > implementation, not the effect.
> 
> These names can be changed to anything you want. Please suggest any names
> of your choice, and I will change them. That ensures that we don't spend
> more review revisions in fine tuning those names.
> 
> > The API seems awkward.  Inside libxl, events are only processed while
> > the application is inside libxl.  So for these functions to be
> > effective, the calling application must arrange to be running the
> > libxl event loop.  This should be documented, at least.
> 
> I am afraid I don't follow you. This scheme is tested and it works. So, I
> do not follow you when you say, "...for these functions to be effective,..."

Libxl has an internal event loop. The code as-is registers a watch which
runs on the internal event loop. The event loop's event is only
processed when the process enters libxl (calls libxl functions).

Imagine a toolstack which doesn't use libxl's internal event loop -- for
example libvirt has its own event loop, or a toolstack which only calls
the register function then nothing else. Your API would not work for
those cases.

Ian, please correct me if I'm wrong.

> 
> > What happens if more than one process calls this at once ?
> 
> Each call is handled within a separate 'xl' process's context. I don't
> see a problem there. Do you have anything specific in mind?
> 

It is possible that both xl processes see its watch fires and try to
write to the same node at the same time.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-07-31 15:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-27  0:16 [PATCH v2 0/2] Containing AER unrecoverable errors Venu Busireddy
2017-07-27  0:16 ` [PATCH v2 1/2] libxl: Implement the handler to handle unrecoverable AER errors Venu Busireddy
2017-07-28 15:58   ` Wei Liu
2017-07-28 17:15     ` Venu Busireddy
2017-07-31 14:37       ` Wei Liu
2017-07-28 16:39   ` Ian Jackson
2017-07-28 23:56     ` Venu Busireddy
2017-07-31 15:03       ` Wei Liu
2017-07-27  0:16 ` [PATCH v2 2/2] xl: Register the AER event handler to handle " Venu Busireddy
2017-07-28 15:58   ` Wei Liu
2017-07-28 16:06     ` Venu Busireddy
2017-07-28 16:14       ` Wei Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.