* [PATCH v7] Fix interrupt latency of HVM PCI passthrough devices.
@ 2014-09-27  1:33 Konrad Rzeszutek Wilk
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7) Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-27  1:33 UTC (permalink / raw)
  To: xen-devel, jbeulich, tim, andrew.cooper3, keir, ian.campbell,
	ian.jackson


Changelog:
since v6 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg03208.html)
 - Squashed #1 + #2.
 - Added more comments, redid it based on Jan's feedback.
since v5 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg02868.html)
 - Redid the series based on Jan's feedback
since v4 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg01676.html)
 - Ditch the domain-centric mechanism.
 - Fix issues raised by Jan.


These patches are performance bug-fixes for PCI passthrough on machines
with many sockets. On those machines we have observed awful latency
for interrupts and high steal time on idle guests. The root cause was
the tasklet lock, which was shared across all sockets. Each interrupt
that was to be delivered to a guest took the tasklet lock - and with many
guests and many PCI passthrough devices - the performance and latency were
atrocious. These two patches fix the outstanding issues.
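
For illustration, a minimal sketch of the contention pattern (simplified,
not the actual Xen tasklet code; names are only meant to evoke it):

    /* Illustrative only: one lock and one list shared by every CPU. */
    static DEFINE_SPINLOCK(tasklet_lock);
    static LIST_HEAD(tasklet_list);

    void tasklet_schedule(struct tasklet *t)
    {
        spin_lock(&tasklet_lock);        /* contended across all sockets */
        list_add_tail(&t->list, &tasklet_list);
        spin_unlock(&tasklet_lock);
        raise_softirq(TASKLET_SOFTIRQ);
    }

    static void do_tasklet_work(void)
    {
        spin_lock(&tasklet_lock);        /* taken again to dequeue */
        /* ... pop one entry and run its callback ... */
        spin_unlock(&tasklet_lock);
    }

Every guest-bound interrupt pays for both acquisitions, so the cache line
holding the lock bounces between sockets.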


 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 208 ++++++++++++++++++++++++++++++++++++------
 xen/drivers/passthrough/pci.c |  29 ++++--
 xen/include/xen/hvm/irq.h     |   3 +-
 xen/include/xen/pci.h         |   2 +-
 xen/include/xen/softirq.h     |   3 +
 6 files changed, 211 insertions(+), 38 deletions(-)


Konrad Rzeszutek Wilk (2):
      dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
      dpci: Replace tasklet with an softirq (v7)


* [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-09-27  1:33 [PATCH v7] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
@ 2014-09-27  1:33 ` Konrad Rzeszutek Wilk
  2014-09-29 10:16   ` Andrew Cooper
  2014-09-29 13:22   ` Jan Beulich
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7) Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-27  1:33 UTC (permalink / raw)
  To: xen-devel, jbeulich, tim, andrew.cooper3, keir, ian.campbell,
	ian.jackson
  Cc: Konrad Rzeszutek Wilk

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
a bit, and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
domain', from which it iterates over all of the 'hvm_dirq_dpci'
structures which are mapped to the guest. However, it is important to
note that when we set up the 'hvm_dirq_dpci' we have a field
for the 'struct domain' that we can use instead of
having to pass in the 'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_dirq_dpci'
and sets the 'dom' field to the domain. We also double-check
that the '->dom' is not reset before using it.

We have to be careful with this as that means we MUST have
'dom' set before pirq_guest_bind() is called. As such
we add the 'pirq_dpci->dom = d;' to cover for all such
cases.

The mechanism to tear it down is more complex as there
are two ways it can be executed:

 a) pci_clean_dpci_irq. This gets called when the guest is
    being destroyed. We end up calling 'tasklet_kill'.

    The scenarios in which the 'struct pirq' (and subsequently
    the 'hvm_pirq_dpci') gets destroyed is when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill'
    which will spin on the tasklet if it is running.

 b) pt_irq_destroy_bind (guest disables the MSI). We double-check
    that the softirq has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra call to 'pt_pirq_softirq_active' in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels first unbind the
    event channel from the PIRQ, so 'pt_pirq_cleanup_check'
    won't be called as eventch is set to zero. In that case
    we clean it up via the a) mechanism. It is OK
    to re-use the tasklet when 'pt_irq_create_bind' is called
    afterwards.

    There is an extra scenario regardless of eventch being
    set or not: the guest did 'pt_irq_destroy_bind' while an
    interrupt was triggered and tasklet was scheduled (but had not
    been run). It is OK to still run the tasklet as
    hvm_dirq_assist won't do anything (as the flags are
    set to zero). As such we can exit out of hvm_dirq_assist
    without doing anything.
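
The guard that implements this, condensed from the hunk further below:

    static void hvm_dirq_assist(unsigned long arg)
    {
        struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
        struct domain *d = pirq_dpci->dom;

        /*
         * Racing with pt_irq_destroy_bind: '->dom' was cleared, so just
         * acknowledge the (now moot) interrupt and bail out.
         */
        if ( !d )
        {
            pirq_dpci->masked = 0;
            return;
        }
        /* ... otherwise deliver the interrupt under d->event_lock ... */
    }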

We also stop using pt_pirq_iterate in hvm_dirq_assist.

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
a bit, and schedule a tasklet.

As the tasklet is now called with the outstanding
'struct hvm_pirq_dpci' there is no need to call pt_pirq_iterate,
which would iterate over all of the PIRQs and call us with every
one that is mapped, leaving _hvm_dirq_assist to figure out
which one to execute.

This is inefficient and not fair as:
 - We iterate starting at PIRQ 0 and up every time. That means
   the PCIe devices that have lower PIRQs get to be called
   first.
 - If we have many PCIe devices passed in with many PIRQs and
   most of the time only the highest numbered PIRQ gets an
   interrupt - we iterate over many PIRQs.

Since we know which 'hvm_pirq_dpci' to check - as the tasklet is
called for a specific 'struct hvm_pirq_dpci' - we do that
in this patch.
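
Schematically (condensed from the hunks below):

    /* Before: the per-domain tasklet rescanned every mapped PIRQ. */
    static void hvm_dirq_assist(unsigned long _d)
    {
        struct domain *d = (struct domain *)_d;

        pt_pirq_iterate(d, _hvm_dirq_assist, NULL);   /* walks PIRQ 0..N */
    }

    /* After: the per-pirq tasklet carries exactly the PIRQ that fired. */
    static void hvm_dirq_assist(unsigned long arg)
    {
        struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;

        /* ... act on this one PIRQ only ... */
    }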

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 76 +++++++++++++++++++++++++++++--------------
 xen/drivers/passthrough/pci.c |  4 +--
 xen/include/xen/hvm/irq.h     |  2 +-
 3 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..8534d63 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -27,7 +27,7 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(unsigned long arg);
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +114,6 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -130,6 +127,18 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
+    /*
+     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
+     * was called. The 'pirq_cleanup_check' which would free the structure
+     * is only called if the event channel for the PIRQ is active. However
+     * OS-es that use event channels usually bind the PIRQ to an event channel
+     * and also unbind it before 'pt_irq_destroy_bind' is called which means
+     * we end up re-using the 'dpci' structure. This can be easily reproduced
+     * with unloading and loading the driver for the device.
+     *
+     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
+     */
+    pirq_dpci->dom = d;
 
     switch ( pt_irq_bind->irq_type )
     {
@@ -156,6 +165,7 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
+                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
@@ -232,7 +242,6 @@ int pt_irq_create_bind(
         {
             unsigned int share;
 
-            pirq_dpci->dom = d;
             if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
             {
                 pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
@@ -391,6 +400,7 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
+        /* hvm_dirq_assist can handle this. */
         pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
         pirq_cleanup_check(pirq, d);
@@ -415,11 +425,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
+    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    return !dpci->flags;
+    if ( !dpci->flags )
+    {
+        tasklet_kill(&dpci->tasklet);
+        dpci->dom = NULL;
+        return 1;
+    }
+    return 0;
 }
 
 int pt_pirq_iterate(struct domain *d,
@@ -459,7 +476,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    tasklet_schedule(&pirq_dpci->tasklet);
     return 1;
 }
 
@@ -513,9 +530,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
-                            void *arg)
+static void hvm_dirq_assist(unsigned long arg)
 {
+    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
+    struct domain *d = pirq_dpci->dom;
+
+    /*
+     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
+     * right before 'pirq_guest_unbind' gets called - but not yet executed.
+     *
+     * And '->dom' gets cleared later in the destroy path. We exit and clear
+     * 'masked' - which is OK as later in this code we would
+     * do nothing except clear the ->masked field anyhow.
+     */
+    if ( !d )
+    {
+        pirq_dpci->masked = 0;
+        return;
+    }
+    ASSERT(d->arch.hvm_domain.irq.dpci);
+
+    spin_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -526,13 +561,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-                return 0;
+            {
+                spin_unlock(&d->event_lock);
+                return;
+            }
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -545,7 +584,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         /*
@@ -558,18 +598,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         ASSERT(pt_irq_need_timer(pirq_dpci->flags));
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
-
-    return 0;
-}
-
-static void hvm_dirq_assist(unsigned long _d)
-{
-    struct domain *d = (struct domain *)_d;
-
-    ASSERT(d->arch.hvm_domain.irq.dpci);
-
-    spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
     spin_unlock(&d->event_lock);
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..81e8a3a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
+    tasklet_kill(&pirq_dpci->tasklet);
+
     return 0;
 }
 
@@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..94a550a 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
+    struct tasklet tasklet;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
-- 
1.9.3


* [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7)
  2014-09-27  1:33 [PATCH v7] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
@ 2014-09-27  1:33 ` Konrad Rzeszutek Wilk
  2014-09-29 10:49   ` Andrew Cooper
  2014-09-29 13:40   ` Jan Beulich
  1 sibling, 2 replies; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-27  1:33 UTC (permalink / raw)
  To: xen-devel, jbeulich, tim, andrew.cooper3, keir, ian.campbell,
	ian.jackson
  Cc: Konrad Rzeszutek Wilk

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to be in sub-1ms intervals (causing said tasklet
to be scheduled for such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts,
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) being tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' for any
one 'hvm_pirq_dpci' is also preserved by having
'schedule_softirq_for' ignore any subsequent calls for a pirq
which has already been scheduled.
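
Condensed from the hunks below, the scheduling side looks like this:

    static DEFINE_PER_CPU(struct list_head, dpci_list);

    static void schedule_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
    {
        unsigned long flags;

        if ( pirq_dpci->masked )    /* already queued - ignore repeats */
            return;

        pirq_dpci->masked = 1;

        /* Queue on this CPU's private list; only IRQs are disabled around
         * the list operation - no global lock is involved. */
        local_irq_save(flags);
        list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
        local_irq_restore(flags);

        raise_softirq(HVM_DPCI_SOFTIRQ);
    }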

Since the 'hvm_pirq_dpci' structure is set up in
'pt_irq_create_bind', we check there via 'pt_pirq_softirq_active'
whether the 'hvm_dirq_assist' is still outstanding on the
'pirq' and if so spin up to a second, calling the softirq
to flush it out. This replaces the 'tasklet_kill' which
had been previously done in 'pt_irq_destroy_bind' via
the 'pt_pirq_cleanup_check'.

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns -EAGAIN that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

   Note that before this patch it would spin in tasklet_kill, waiting
   for the 'hvm_dirq_assist' to flush out. We now do it via
   -EAGAIN to allow continuation.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels first unbind the
   event channel from the PIRQ, so 'pt_pirq_cleanup_check'
   won't be called as eventch is set to zero. In that case
   we clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of eventch being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero).

   Note that before this patch we would spin forever in
   tasklet_kill waiting for the 'hvm_dirq_assist' to flush out.
   That 'spin' has been moved to 'pt_irq_create_bind' with
   an -EAGAIN mechanism to not spin forever.

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We check if the
       softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

     - we hit once the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and softirq was scheduled.

In all of those cases we will spin up to a second on
'pt_pirq_softirq_active' (at the start of 'pt_irq_create_bind')
with the event_lock spinlock dropped, calling
'process_pending_softirqs'. hvm_dirq_assist() will be executed and
then the softirq will clear 'masked', which signals that we
can re-use the 'hvm_pirq_dpci' structure.

Note that the 'flags' variable is cleared at that
point, so hvm_dirq_assist won't actually do anything.
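
The wait loop itself, condensed from the pt_irq_create_bind hunk below:

    s_time_t start = NOW();

 restart:
    spin_lock(&d->event_lock);
    /* ... look up pirq_dpci ... */
    if ( pt_pirq_softirq_active(pirq_dpci) )
    {
        spin_unlock(&d->event_lock);
        process_pending_softirqs();     /* let hvm_dirq_assist flush out */
        if ( ( NOW() - start ) > SECONDS(1) )
            return -EAGAIN;             /* give up rather than hang */
        goto restart;
    }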

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -EAGAIN, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 152 ++++++++++++++++++++++++++++++++++++++----
 xen/drivers/passthrough/pci.c |  31 ++++++---
 xen/include/xen/hvm/irq.h     |   3 +-
 xen/include/xen/pci.h         |   2 +-
 xen/include/xen/softirq.h     |   3 +
 6 files changed, 170 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8cfd1ca..8f6e17a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1938,7 +1938,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 8534d63..40ad1dc 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,60 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'masked' as a gate to thwart multiple interrupts.
+ *
+ * The 'masked' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist'.
+ *
+ */
+static void schedule_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( pirq_dpci->masked )
+        return;
+
+    pirq_dpci->masked = 1;
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with dpci_softirq (masked is still set) we return
+ * -EAGAIN. Otherwise we return 0.
+ *
+ *  If it is -EAGAIN, it is the caller's responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has run. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->masked )
+        return -EAGAIN;
+    /*
+     * If in the future we would call 'schedule_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -97,10 +143,12 @@ int pt_irq_create_bind(
     struct hvm_pirq_dpci *pirq_dpci;
     struct pirq *info;
     int rc, pirq = pt_irq_bind->machine_irq;
+    s_time_t start = NOW();
 
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -128,6 +176,25 @@ int pt_irq_create_bind(
     }
     pirq_dpci = pirq_dpci(info);
     /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the softirq_dpci a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. We do this up to one second at which point we
+     * give up. Note that this code replaced tasklet_kill which would have
+     * spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls) - said tasklet_kill had been
+     * in 'pt_pirq_cleanup_check' (called during pt_irq_destroy_bind)
+     * and 'pci_clean_dpci_irq' (domain destroyed).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        process_pending_softirqs();
+        if ( ( NOW() - start ) > SECONDS(1) )
+            return -EAGAIN;
+        goto restart;
+    }
+    /*
      * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
      * was called. The 'pirq_cleanup_check' which would free the structure
      * is only called if the event channel for the PIRQ is active. However
@@ -425,14 +492,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -475,8 +540,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    schedule_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -530,9 +594,8 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
     struct domain *d = pirq_dpci->dom;
 
     /*
@@ -544,14 +607,12 @@ static void hvm_dirq_assist(unsigned long arg)
      * do nothing except clear the ->masked field anyhow.
      */
     if ( !d )
-    {
-        pirq_dpci->masked = 0;
         return;
-    }
+
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->masked )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -653,3 +714,68 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    struct hvm_pirq_dpci *pirq_dpci;
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        hvm_dirq_assist(pirq_dpci);
+        pirq_dpci->masked = 0;
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding work (by virtue of the idle loop
+         * calling the softirq later). In CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..d147189 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci);
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ESRCH;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -EAGAIN )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..4fc6dcf 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -99,7 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+int pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0895a16..f3da5df 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+    HVM_DPCI_SOFTIRQ,
+#endif
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3


* Re: [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
@ 2014-09-29 10:16   ` Andrew Cooper
  2014-09-29 11:52     ` Jan Beulich
  2014-09-29 13:22   ` Jan Beulich
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2014-09-29 10:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, xen-devel, jbeulich, tim, keir,
	ian.campbell, ian.jackson

On 27/09/14 02:33, Konrad Rzeszutek Wilk wrote:
> When an interrupt for a PCI (or PCIe) passthrough device
> is to be sent to a guest, we find the appropriate
> 'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
> a bit, and schedule a tasklet.
>
> Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
> domain', from which it iterates over all of the 'hvm_dirq_dpci'
> structures which are mapped to the guest. However, it is important to
> note that when we set up the 'hvm_dirq_dpci' we have a field
> for the 'struct domain' that we can use instead of
> having to pass in the 'struct domain'.
>
> As such this patch moves the tasklet to the 'struct hvm_dirq_dpci'
> and sets the 'dom' field to the domain. We also double-check
> that the '->dom' is not reset before using it.
>
> We have to be careful with this as that means we MUST have
> 'dom' set before pirq_guest_bind() is called. As such
> we add the 'pirq_dpci->dom = d;' to cover for all such
> cases.
>
> The mechanism to tear it down is more complex as there
> are two ways it can be executed:
>
>  a) pci_clean_dpci_irq. This gets called when the guest is
>     being destroyed. We end up calling 'tasklet_kill'.
>
>     The scenarios in which the 'struct pirq' (and subsequently
>     the 'hvm_pirq_dpci') gets destroyed is when:
>
>     - guest did not use the pirq at all after setup.
>     - guest did use pirq, but decided to mask and left it in that
>       state.
>     - guest did use pirq, but crashed.
>
>     In all of those scenarios we end up calling 'tasklet_kill'
>     which will spin on the tasklet if it is running.
>
>  b) pt_irq_destroy_bind (guest disables the MSI). We double-check
>     that the softirq has run by piggy-backing on the existing
>     'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
>     We add the extra call to 'pt_pirq_softirq_active' in
>     'pt_pirq_cleanup_check'.
>
>     NOTE: Guests that use event channels first unbind the
>     event channel from the PIRQ, so 'pt_pirq_cleanup_check'
>     won't be called as eventch is set to zero. In that case
>     we clean it up via the a) mechanism. It is OK
>     to re-use the tasklet when 'pt_irq_create_bind' is called
>     afterwards.
>
>     There is an extra scenario regardless of eventch being
>     set or not: the guest did 'pt_irq_destroy_bind' while an
>     interrupt was triggered and tasklet was scheduled (but had not
>     been run). It is OK to still run the tasklet as
>     hvm_dirq_assist won't do anything (as the flags are
>     set to zero). As such we can exit out of hvm_dirq_assist
>     without doing anything.
>
> We also stop using pt_pirq_iterate in hvm_dirq_assist.
>
> When an interrupt for a PCI (or PCIe) passthrough device
> is to be sent to a guest, we find the appropriate
> 'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
> a bit, and schedule a tasklet.

This paragraph is repeated from the top of this commit message.  Did you
rewrite the commit message after squashing?  In fact, most of the text
below looks to be duplicated.

>
> As the tasklet is now called with the outstanding
> 'struct hvm_pirq_dpci' there is no need to call pt_pirq_iterate,
> which would iterate over all of the PIRQs and call us with every
> one that is mapped, leaving _hvm_dirq_assist to figure out
> which one to execute.
>
> This is inefficient and not fair as:
>  - We iterate starting at PIRQ 0 and up every time. That means
>    the PCIe devices that have lower PIRQs get to be called
>    first.
>  - If we have many PCIe devices passed in with many PIRQs and
>    most of the time only the highest numbered PIRQ gets an
>    interrupt - we iterate over many PIRQs.
>
> Since we know which 'hvm_pirq_dpci' to check - as the tasklet is
> called for a specific 'struct hvm_pirq_dpci' - we do that
> in this patch.
>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/drivers/passthrough/io.c  | 76 +++++++++++++++++++++++++++++--------------
>  xen/drivers/passthrough/pci.c |  4 +--
>  xen/include/xen/hvm/irq.h     |  2 +-
>  3 files changed, 55 insertions(+), 27 deletions(-)
>
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 4cd32b5..8534d63 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -27,7 +27,7 @@
>  #include <xen/hvm/irq.h>
>  #include <xen/tasklet.h>
>  
> -static void hvm_dirq_assist(unsigned long _d);
> +static void hvm_dirq_assist(unsigned long arg);
>  
>  bool_t pt_irq_need_timer(uint32_t flags)
>  {
> @@ -114,9 +114,6 @@ int pt_irq_create_bind(
>              spin_unlock(&d->event_lock);
>              return -ENOMEM;
>          }
> -        softirq_tasklet_init(
> -            &hvm_irq_dpci->dirq_tasklet,
> -            hvm_dirq_assist, (unsigned long)d);
>          for ( i = 0; i < NR_HVM_IRQS; i++ )
>              INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
>  
> @@ -130,6 +127,18 @@ int pt_irq_create_bind(
>          return -ENOMEM;
>      }
>      pirq_dpci = pirq_dpci(info);
> +    /*
> +     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
> +     * was called. The 'pirq_cleanup_check' which would free the structure
> +     * is only called if the event channel for the PIRQ is active. However
> +     * OS-es that use event channels usually bind the PIRQ to an event channel
> +     * and also unbind it before 'pt_irq_destroy_bind' is called which means
> +     * we end up re-using the 'dpci' structure. This can be easily reproduced
> +     * with unloading and loading the driver for the device.
> +     *
> +     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
> +     */
> +    pirq_dpci->dom = d;
>  
>      switch ( pt_irq_bind->irq_type )
>      {
> @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>              {
>                  pirq_dpci->gmsi.gflags = 0;
>                  pirq_dpci->gmsi.gvec = 0;
> +                pirq_dpci->dom = NULL;
>                  pirq_dpci->flags = 0;
>                  pirq_cleanup_check(info, d);
>                  spin_unlock(&d->event_lock);
> @@ -232,7 +242,6 @@ int pt_irq_create_bind(
>          {
>              unsigned int share;
>  
> -            pirq_dpci->dom = d;
>              if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
>              {
>                  pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
> @@ -391,6 +400,7 @@ int pt_irq_destroy_bind(
>          msixtbl_pt_unregister(d, pirq);
>          if ( pt_irq_need_timer(pirq_dpci->flags) )
>              kill_timer(&pirq_dpci->timer);
> +        /* hvm_dirq_assist can handle this. */

Given an equivalent "pirq_dpci->dom = NULL;" in pt_irq_create_bind(), is
this comment necessary?  FWIW, I feel that the comment in
hvm_dirq_assist() is sufficient.

>          pirq_dpci->dom   = NULL;
>          pirq_dpci->flags = 0;
>          pirq_cleanup_check(pirq, d);
> @@ -415,11 +425,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
>  {
>      INIT_LIST_HEAD(&dpci->digl_list);
>      dpci->gmsi.dest_vcpu_id = -1;
> +    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
>  }
>  
>  bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
>  {
> -    return !dpci->flags;
> +    if ( !dpci->flags )
> +    {
> +        tasklet_kill(&dpci->tasklet);
> +        dpci->dom = NULL;
> +        return 1;
> +    }
> +    return 0;
>  }
>  
>  int pt_pirq_iterate(struct domain *d,
> @@ -459,7 +476,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
>          return 0;
>  
>      pirq_dpci->masked = 1;
> -    tasklet_schedule(&dpci->dirq_tasklet);
> +    tasklet_schedule(&pirq_dpci->tasklet);
>      return 1;
>  }
>  
> @@ -513,9 +530,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
>      spin_unlock(&d->event_lock);
>  }
>  
> -static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
> -                            void *arg)
> +static void hvm_dirq_assist(unsigned long arg)
>  {
> +    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
> +    struct domain *d = pirq_dpci->dom;
> +
> +    /*
> +     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
> +     * right before 'pirq_guest_unbind' gets called - but not yet executed.
> +     *
> +     * And '->dom' gets cleared later in the destroy path. We exit and clear
> +     * 'masked' - which is OK as later in this code we would
> +     * do nothing except clear the ->masked field anyhow.
> +     */
> +    if ( !d )
> +    {
> +        pirq_dpci->masked = 0;
> +        return;
> +    }
> +    ASSERT(d->arch.hvm_domain.irq.dpci);
> +
> +    spin_lock(&d->event_lock);
>      if ( test_and_clear_bool(pirq_dpci->masked) )
>      {
>          struct pirq *pirq = dpci_pirq(pirq_dpci);
> @@ -526,13 +561,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>              send_guest_pirq(d, pirq);
>  
>              if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
> -                return 0;
> +            {
> +                spin_unlock(&d->event_lock);
> +                return;
> +            }

perhaps "goto out_unlock;" ?  It would help identify any return
statements as bogus while the lock is held.
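
Something like this (an illustrative sketch of the suggestion, not code
from the patch):

    if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
        goto out_unlock;
    /* ... */
 out_unlock:
    spin_unlock(&d->event_lock);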

~Andrew

>          }
>  
>          if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
>          {
>              vmsi_deliver_pirq(d, pirq_dpci);
> -            return 0;
> +            spin_unlock(&d->event_lock);
> +            return;
>          }
>  
>          list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
> @@ -545,7 +584,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>          {
>              /* for translated MSI to INTx interrupt, eoi as early as possible */
>              __msi_pirq_eoi(pirq_dpci);
> -            return 0;
> +            spin_unlock(&d->event_lock);
> +            return;
>          }
>  
>          /*
> @@ -558,18 +598,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>          ASSERT(pt_irq_need_timer(pirq_dpci->flags));
>          set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
>      }
> -
> -    return 0;
> -}
> -
> -static void hvm_dirq_assist(unsigned long _d)
> -{
> -    struct domain *d = (struct domain *)_d;
> -
> -    ASSERT(d->arch.hvm_domain.irq.dpci);
> -
> -    spin_lock(&d->event_lock);
> -    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
>      spin_unlock(&d->event_lock);
>  }
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 1eba833..81e8a3a 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
>          xfree(digl);
>      }
>  
> +    tasklet_kill(&pirq_dpci->tasklet);
> +
>      return 0;
>  }
>  
> @@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
>      hvm_irq_dpci = domain_get_irq_dpci(d);
>      if ( hvm_irq_dpci != NULL )
>      {
> -        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
> -
>          pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
>  
>          d->arch.hvm_domain.irq.dpci = NULL;
> diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
> index c89f4b1..94a550a 100644
> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -88,7 +88,6 @@ struct hvm_irq_dpci {
>      DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
>      /* Record of mapped Links */
>      uint8_t link_cnt[NR_LINK];
> -    struct tasklet dirq_tasklet;
>  };
>  
>  /* Machine IRQ to guest device/intx mapping. */
> @@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
>      struct domain *dom;
>      struct hvm_gmsi_info gmsi;
>      struct timer timer;
> +    struct tasklet tasklet;
>  };
>  
>  void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);


* Re: [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7)
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7) Konrad Rzeszutek Wilk
@ 2014-09-29 10:49   ` Andrew Cooper
  2014-09-29 11:59     ` Jan Beulich
  2014-09-29 13:40   ` Jan Beulich
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2014-09-29 10:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, xen-devel, jbeulich, tim, keir,
	ian.campbell, ian.jackson

On 27/09/14 02:33, Konrad Rzeszutek Wilk wrote:
> The existing tasklet mechanism has a single global
> spinlock that is taken every time the global list
> is touched. And we use this lock quite a lot - when
> we call do_tasklet_work, which is called via a softirq
> and from the idle loop. We take the lock on any
> operation on the tasklet_list.
>
> The problem we are facing is that there are quite a lot of
> tasklets scheduled. The most common one that is invoked is
> the one injecting the VIRQ_TIMER in the guest. Guests
> are not insane and don't set the one-shot or periodic
> clocks to be in sub-1ms intervals (causing said tasklet
> to be scheduled for such small intervals).
>
> The problem appears when PCI passthrough devices are used
> over many sockets and we have a mix of heavy-interrupt
> guests and idle guests. The idle guests end up seeing
> 1/10 of their RUNNING timeslice eaten by the hypervisor
> (and 40% steal time).
>
> The mechanism by which we inject PCI interrupts is by
> hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
> tasklet every time an interrupt is received.
> The callchain is:
>
> _asm_vmexit_handler
>  -> vmx_vmexit_handler
>     ->vmx_do_extint
>         -> do_IRQ
>             -> __do_IRQ_guest
>                 -> hvm_do_IRQ_dpci
>                    tasklet_schedule(&dpci->dirq_tasklet);
>                    [takes lock to put the tasklet on]
>
> [later on the schedule_tail is invoked which is 'vmx_do_resume']
>
> vmx_do_resume
>  -> vmx_asm_do_vmentry
>         -> call vmx_intr_assist
>           -> vmx_process_softirqs
>             -> do_softirq
>               [executes the tasklet function, takes the
>                lock again]
>
> Meanwhile other CPUs might be sitting in an idle loop
> and be invoked to deliver a VIRQ_TIMER, which also ends
> up taking the lock twice: first to schedule the
> v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
> the guest's BLOCKED_state); then to execute it - which is
> accounted for in the guest's RUNTIME_state.
>
> The end result is that on an 8-socket machine with
> PCI passthrough, where four sockets are busy with interrupts,
> and the other sockets have idle guests - we end up with
> the idle guests having around 40% steal time and 1/10
> of their timeslice (3ms out of 30ms) being tied up
> taking the lock. The latency of the PCI interrupts delivered
> to the guest is also hindered.
>
> With this patch the problem disappears completely.
> That is, we remove the lock for the PCI passthrough use-case
> (the 'hvm_dirq_assist' case) by not using tasklets at all.
>
> The patch is simple - instead of scheduling a tasklet
> we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
> take care of running 'hvm_dirq_assist'. The information we need
> on each CPU is which 'struct hvm_pirq_dpci' structure the
> 'hvm_dirq_assist' needs to run on. That is simply solved by
> threading the 'struct hvm_pirq_dpci' through a linked list.
> The rule of running only one 'hvm_dirq_assist' for any
> one 'hvm_pirq_dpci' is also preserved by having
> 'schedule_softirq_for' ignore any subsequent calls for a pirq
> which has already been scheduled.
>
> Since the 'hvm_pirq_dpci' structure is set up in
> 'pt_irq_create_bind', we check there via 'pt_pirq_softirq_active'
> whether the 'hvm_dirq_assist' is still outstanding on the
> 'pirq' and if so spin up to a second, calling the softirq
> to flush it out. This replaces the 'tasklet_kill' which
> had been previously done in 'pt_irq_destroy_bind' via
> the 'pt_pirq_cleanup_check'.
>
> The mechanism to tear it down is more complex as there
> are three ways it can be executed. To make it simpler
> everything revolves around 'pt_pirq_softirq_active'. If it
> returns -EAGAIN that means there is an outstanding softirq
> that needs to finish running before we can continue tearing
> down. With that in mind:
>
> a) pci_clean_dpci_irq. This gets called when the guest is
>    being destroyed. We end up calling 'pt_pirq_softirq_active'
>    to see if it is OK to continue the destruction.
>
>    The scenarios in which the 'struct pirq' (and subsequently
>    the 'hvm_pirq_dpci') gets destroyed is when:
>
>    - guest did not use the pirq at all after setup.
>    - guest did use pirq, but decided to mask and left it in that
>      state.
>    - guest did use pirq, but crashed.
>
>    In all of those scenarios we end up calling
>    'pt_pirq_softirq_active' to check if the softirq is still
>    active. Read below on the 'pt_pirq_softirq_active' loop.
>
>    Note that before this patch it would spin in tasklet_kill, waiting
>    for the 'hvm_dirq_assist' to flush out. We now do it via
>    -EAGAIN to allow continuation.
>
> b) pt_irq_destroy_bind (guest disables the MSI). We double-check
>    that the softirq has run by piggy-backing on the existing
>    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
>    We add the extra call to 'pt_pirq_softirq_active' in
>    'pt_pirq_cleanup_check'.
>
>    NOTE: Guests that use event channels first unbind the
>    event channel from the PIRQ, so 'pt_pirq_cleanup_check'
>    won't be called as eventch is set to zero. In that case
>    we clean it up via the a) or c) mechanism.
>
>    There is an extra scenario regardless of eventch being
>    set or not: the guest did 'pt_irq_destroy_bind' while an
>    interrupt was triggered and softirq was scheduled (but had not
>    been run). It is OK to still run the softirq as
>    hvm_dirq_assist won't do anything (as the flags are
>    set to zero).
>
>    Note that before this patch we would spin forever in
>    tasklet_kill waiting for the 'hvm_dirq_assist' to flush out.
>    That 'spin' has been moved to 'pt_irq_create_bind' with
>    an -EAGAIN mechanism to not spin forever.
>
> c) pt_irq_create_bind (not a typo). The scenarios are:
>
>      - guest disables the MSI and then enables it
>        (rmmod and modprobe in a loop). We check if the
>        softirq has been scheduled.
>        Imagine the 'b)' with interrupts in flight and c) getting
>        called in a loop.
>
>      - we hit once the error paths in 'pt_irq_create_bind' while
>        an interrupt was triggered and softirq was scheduled.
>
> In all of those cases we will spin up to a second on
> 'pt_pirq_softirq_active' (at the start of 'pt_irq_create_bind')
> with the event_lock spinlock dropped, calling
> 'process_pending_softirqs'. hvm_dirq_assist() will be executed and
> then the softirq will clear 'masked', which signals that we
> can re-use the 'hvm_pirq_dpci' structure.
>
> Note that the 'flags' variable is cleared at that
> point, so hvm_dirq_assist won't actually do anything.
>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Suggested-by: Jan Beulich <JBeulich@suse.com>
>
> ---
> v2: On top of ref-cnts also have wait loop for the outstanding
>     'struct domain' that need to be processed.
> v3: Add -EAGAIN, fix up StyleGuide issues
> v4: Clean it up more, redo per_cpu, this_cpu logic
> v5: Instead of threading struct domain, use hvm_pirq_dpci.
> v6: Ditch the 'state' bit, expand description, simplify
>     softirq and teardown sequence.
> v7: Flesh out the comments. Drop the use of domain refcounts
> ---
>  xen/arch/x86/domain.c         |   4 +-
>  xen/drivers/passthrough/io.c  | 152 ++++++++++++++++++++++++++++++++++++++----
>  xen/drivers/passthrough/pci.c |  31 ++++++---
>  xen/include/xen/hvm/irq.h     |   3 +-
>  xen/include/xen/pci.h         |   2 +-
>  xen/include/xen/softirq.h     |   3 +
>  6 files changed, 170 insertions(+), 25 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 8cfd1ca..8f6e17a 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1938,7 +1938,9 @@ int domain_relinquish_resources(struct domain *d)
>      switch ( d->arch.relmem )
>      {
>      case RELMEM_not_started:
> -        pci_release_devices(d);
> +        ret = pci_release_devices(d);
> +        if ( ret )
> +            return ret;
>  
>          /* Tear down paging-assistance stuff. */
>          paging_teardown(d);
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 8534d63..40ad1dc 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -20,14 +20,60 @@
>  
>  #include <xen/event.h>
>  #include <xen/iommu.h>
> +#include <xen/cpu.h>
>  #include <xen/irq.h>
>  #include <asm/hvm/irq.h>
>  #include <asm/hvm/iommu.h>
>  #include <asm/hvm/support.h>
>  #include <xen/hvm/irq.h>
> -#include <xen/tasklet.h>
>  
> -static void hvm_dirq_assist(unsigned long arg);
> +static DEFINE_PER_CPU(struct list_head, dpci_list);
> +
> +/*
> + * Should only be called from hvm_do_IRQ_dpci. We use the
> + * 'masked' as a gate to thwart multiple interrupts.
> + *
> + * The 'masked' is cleared by 'dpci_softirq' when it has
> + * completed executing 'hvm_dirq_assist'.
> + *
> + */
> +static void schedule_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    unsigned long flags;
> +
> +    if ( pirq_dpci->masked )
> +        return;
> +
> +    pirq_dpci->masked = 1;
> +
> +    local_irq_save(flags);
> +    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +    local_irq_restore(flags);
> +
> +    raise_softirq(HVM_DPCI_SOFTIRQ);
> +}
> +
> +/*
> + * If we are racing with dpci_softirq (masked is still set) we return
> + * -EAGAIN. Otherwise we return 0.
> + *
> + *  If it is -EAGAIN, it is the caller's responsibility to make sure
> + *  that the softirq (with the event_lock dropped) has run. We need
> + *  to flush out the outstanding 'dpci_softirq' (no more of them
> + *  will be added for this pirq as the IRQ action handler has been
> + *  reset in pt_irq_destroy_bind).
> + */
> +int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    if ( pirq_dpci->masked )
> +        return -EAGAIN;
> +    /*
> +     * If in the future we would call 'schedule_softirq_for' right away
> +     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
> +     * might have stale data).
> +     */
> +    return 0;
> +}
>  
>  bool_t pt_irq_need_timer(uint32_t flags)
>  {
> @@ -97,10 +143,12 @@ int pt_irq_create_bind(
>      struct hvm_pirq_dpci *pirq_dpci;
>      struct pirq *info;
>      int rc, pirq = pt_irq_bind->machine_irq;
> +    s_time_t start = NOW();
>  
>      if ( pirq < 0 || pirq >= d->nr_pirqs )
>          return -EINVAL;
>  
> + restart:
>      spin_lock(&d->event_lock);
>  
>      hvm_irq_dpci = domain_get_irq_dpci(d);
> @@ -128,6 +176,25 @@ int pt_irq_create_bind(
>      }
>      pirq_dpci = pirq_dpci(info);
>      /*
> +     * A crude 'while' loop with us dropping the spinlock and giving
> +     * the softirq_dpci a chance to run.
> +     * We MUST check for this condition as the softirq could be scheduled
> +     * and hasn't run yet. We do this up to one second at which point we
> +     * give up. Note that this code replaced tasklet_kill which would have
> +     * spun forever and would do the same thing (wait to flush out
> +     * outstanding hvm_dirq_assist calls) - said tasklet_kill had been
> +     * in 'pt_pirq_cleanup_check' (called during pt_irq_destroy_bind)
> +     * and 'pci_clean_dpci_irq' (domain destroyed).
> +     */
> +    if ( pt_pirq_softirq_active(pirq_dpci) )
> +    {
> +        spin_unlock(&d->event_lock);
> +        process_pending_softirqs();
> +        if ( ( NOW() - start ) > SECONDS(1) )
> +            return -EAGAIN;
> +        goto restart;
> +    }

1s seems like overkill here, and far too long to loop in Xen context
even if we are processing softirqs.

Why is it so long?  Surely the softirq will be processed after the first
call to process_pending_softirqs(), or are we possibly waiting on
another pcpu to run its softirqs?

> +    /*
>       * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
>       * was called. The 'pirq_cleanup_check' which would free the structure
>       * is only called if the event channel for the PIRQ is active. However
> @@ -425,14 +492,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
>  {
>      INIT_LIST_HEAD(&dpci->digl_list);
>      dpci->gmsi.dest_vcpu_id = -1;
> -    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
>  }
>  
>  bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
>  {
> -    if ( !dpci->flags )
> +    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
>      {
> -        tasklet_kill(&dpci->tasklet);
>          dpci->dom = NULL;
>          return 1;
>      }
> @@ -475,8 +540,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
>           !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
>          return 0;
>  
> -    pirq_dpci->masked = 1;
> -    tasklet_schedule(&pirq_dpci->tasklet);
> +    schedule_softirq_for(pirq_dpci);
>      return 1;
>  }
>  
> @@ -530,9 +594,8 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
>      spin_unlock(&d->event_lock);
>  }
>  
> -static void hvm_dirq_assist(unsigned long arg)
> +static void hvm_dirq_assist(struct hvm_pirq_dpci *pirq_dpci)
>  {
> -    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
>      struct domain *d = pirq_dpci->dom;
>  
>      /*
> @@ -544,14 +607,12 @@ static void hvm_dirq_assist(unsigned long arg)
>       * do nothing except clear the ->masked field anyhow.
>       */
>      if ( !d )
> -    {
> -        pirq_dpci->masked = 0;
>          return;
> -    }
> +
>      ASSERT(d->arch.hvm_domain.irq.dpci);
>  
>      spin_lock(&d->event_lock);
> -    if ( test_and_clear_bool(pirq_dpci->masked) )
> +    if ( pirq_dpci->masked )
>      {
>          struct pirq *pirq = dpci_pirq(pirq_dpci);
>          const struct dev_intx_gsi_link *digl;
> @@ -653,3 +714,68 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
>  unlock:
>      spin_unlock(&d->event_lock);
>  }
> +
> +static void dpci_softirq(void)
> +{
> +    struct hvm_pirq_dpci *pirq_dpci;
> +    unsigned int cpu = smp_processor_id();
> +    LIST_HEAD(our_list);
> +
> +    local_irq_disable();
> +    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
> +    local_irq_enable();
> +
> +    while ( !list_empty(&our_list) )
> +    {
> +        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
> +        list_del(&pirq_dpci->softirq_list);
> +
> +        hvm_dirq_assist(pirq_dpci);
> +        pirq_dpci->masked = 0;
> +    }
> +}
> +
> +static int cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned int cpu = (unsigned long)hcpu;
> +
> +    switch ( action )
> +    {
> +    case CPU_UP_PREPARE:
> +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> +        break;
> +    case CPU_UP_CANCELED:
> +    case CPU_DEAD:
> +        /*
> +         * On CPU_DYING this callback is called (on the CPU that is dying)
> +         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
> +         * clear out any outstanding domains (by virtue of the idle loop
> +         * calling the softirq later). In the CPU_DEAD case the CPU is deaf
> +         * and there are no pending softirqs for us to handle.
> +         */
> +        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block cpu_nfb = {
> +    .notifier_call = cpu_callback,
> +};
> +
> +static int __init setup_dpci_softirq(void)
> +{
> +    unsigned int cpu;
> +
> +    for_each_online_cpu(cpu)
> +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));

Hmm - it's loops like this that make me think we should have a function
to explicitly initialise the percpu area of cpus at the point at which
they come up.  Still, this is fine for now, and another item for the
todo list.
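
Something like the below is what I have in mind (entirely hypothetical -
no such interface exists today, and all the names are made up):

    /* Hypothetical: registered callback run for each CPU as it comes up. */
    static int dpci_percpu_init(unsigned int cpu)
    {
        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
        return 0;
    }
    percpu_initcall(dpci_percpu_init);

which would subsume both this loop and the CPU_UP_PREPARE case below.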

> +
> +    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
> +    register_cpu_notifier(&cpu_nfb);
> +    return 0;
> +}
> +__initcall(setup_dpci_softirq);
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 81e8a3a..d147189 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
>          xfree(digl);
>      }
>  
> -    tasklet_kill(&pirq_dpci->tasklet);
> -
> -    return 0;
> +    return pt_pirq_softirq_active(pirq_dpci);
>  }
>  
> -static void pci_clean_dpci_irqs(struct domain *d)
> +static int pci_clean_dpci_irqs(struct domain *d)
>  {
>      struct hvm_irq_dpci *hvm_irq_dpci = NULL;
>  
>      if ( !iommu_enabled )
> -        return;
> +        return -ESRCH;

I realise that our return codes are inconsistent, but can we avoid
further overloading of return values?  -ESRCH is generally "no such domain".

Certain IOMMU related failures currently use -ENODEV, which is perhaps
more appropriate.

>  
>      if ( !is_hvm_domain(d) )
> -        return;
> +        return -EINVAL;
>  
>      spin_lock(&d->event_lock);
>      hvm_irq_dpci = domain_get_irq_dpci(d);
>      if ( hvm_irq_dpci != NULL )
>      {
> -        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
> +        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
> +
> +        if ( ret )
> +        {
> +            spin_unlock(&d->event_lock);
> +            return ret;
> +        }
>  
>          d->arch.hvm_domain.irq.dpci = NULL;
>          free_hvm_irq_dpci(hvm_irq_dpci);
>      }
>      spin_unlock(&d->event_lock);
> +    return 0;
>  }
>  
> -void pci_release_devices(struct domain *d)
> +int pci_release_devices(struct domain *d)
>  {
>      struct pci_dev *pdev;
>      u8 bus, devfn;
> +    int ret;
>  
>      spin_lock(&pcidevs_lock);
> -    pci_clean_dpci_irqs(d);
> +    ret = pci_clean_dpci_irqs(d);
> +    if ( ret == -EAGAIN )
> +    {
> +        spin_unlock(&pcidevs_lock);
> +        return ret;
> +    }
>      while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
>      {
>          bus = pdev->bus;
> @@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
>                     PCI_SLOT(devfn), PCI_FUNC(devfn));
>      }
>      spin_unlock(&pcidevs_lock);
> +
> +    return 0;
>  }
>  
>  #define PCI_CLASS_BRIDGE_HOST    0x0600
> diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
> index 94a550a..4fc6dcf 100644
> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -99,7 +99,7 @@ struct hvm_pirq_dpci {
>      struct domain *dom;
>      struct hvm_gmsi_info gmsi;
>      struct timer timer;
> -    struct tasklet tasklet;
> +    struct list_head softirq_list;
>  };
>  
>  void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
> @@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
>                                struct hvm_pirq_dpci *, void *arg),
>                      void *arg);
>  
> +int pt_pirq_softirq_active(struct hvm_pirq_dpci *);

"struct hvm_pirq_dpci *pirq_dpci" please.

~Andrew

>  /* Modify state of a PCI INTx wire. */
>  void hvm_pci_intx_assert(
>      struct domain *d, unsigned int device, unsigned int intx);
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 91520bc..5f295f3 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
>  
>  void setup_hwdom_pci_devices(struct domain *,
>                              int (*)(u8 devfn, struct pci_dev *));
> -void pci_release_devices(struct domain *d);
> +int pci_release_devices(struct domain *d);
>  int pci_add_segment(u16 seg);
>  const unsigned long *pci_get_ro_map(u16 seg);
>  int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
> diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
> index 0895a16..f3da5df 100644
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -8,6 +8,9 @@ enum {
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,
>      TASKLET_SOFTIRQ,
> +#ifdef HAS_PASSTHROUGH
> +    HVM_DPCI_SOFTIRQ,
> +#endif
>      NR_COMMON_SOFTIRQS
>  };
>  

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-09-29 10:16   ` Andrew Cooper
@ 2014-09-29 11:52     ` Jan Beulich
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2014-09-29 11:52 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

>>> On 29.09.14 at 12:16, <andrew.cooper3@citrix.com> wrote:
> On 27/09/14 02:33, Konrad Rzeszutek Wilk wrote:
>> @@ -526,13 +561,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>>              send_guest_pirq(d, pirq);
>>  
>>              if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
>> -                return 0;
>> +            {
>> +                spin_unlock(&d->event_lock);
>> +                return;
>> +            }
> 
> perhaps "goto out_unlock;" ?  It would help identify any return
> statements as bogus while the lock is held.

I know there are differing opinions on the use of goto in situations
like this, but I'm rather happy that Konrad went the goto-free
route.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7)
  2014-09-29 10:49   ` Andrew Cooper
@ 2014-09-29 11:59     ` Jan Beulich
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2014-09-29 11:59 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

>>> On 29.09.14 at 12:49, <andrew.cooper3@citrix.com> wrote:
> On 27/09/14 02:33, Konrad Rzeszutek Wilk wrote:
>> @@ -128,6 +176,25 @@ int pt_irq_create_bind(
>>      }
>>      pirq_dpci = pirq_dpci(info);
>>      /*
>> +     * A crude 'while' loop with us dropping the spinlock and giving
>> +     * the dpci_softirq handler a chance to run.
>> +     * We MUST check for this condition as the softirq could have been
>> +     * scheduled but not yet run. We do this for up to one second, at
>> +     * which point we give up. Note that this code replaced tasklet_kill
>> +     * which would have spun forever and would do the same thing (wait to
>> +     * flush out outstanding hvm_dirq_assist calls) - said tasklet_kill
>> +     * had been in 'pt_pirq_cleanup_check' (called during
>> +     * pt_irq_destroy_bind) and 'pci_clean_dpci_irq' (domain destroyed).
>> +     */
>> +    if ( pt_pirq_softirq_active(pirq_dpci) )
>> +    {
>> +        spin_unlock(&d->event_lock);
>> +        process_pending_softirqs();
>> +        if ( ( NOW() - start ) > SECONDS(1) )
>> +            return -EAGAIN;
>> +        goto restart;
>> +    }
> 
> 1s seems like overkill here, and far too long to loop in Xen context
> even if we are processing softirqs.
> 
> Why is it so long?  Surely the softirq will be processed after the first
> call to process_pending_softirqs(), or are we possibly waiting on
> another pcpu to run its softirqs?

We had discussed this before, so I'm surprised this 1s value
survived: iirc we agreed to use an unbounded loop, citing
tasklet_kill() as the precedent for why doing so is valid
(i.e. not dangerous).
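
I.e. (sketch - unchanged from the patch apart from dropping the
timeout):

    if ( pt_pirq_softirq_active(pirq_dpci) )
    {
        spin_unlock(&d->event_lock);
        process_pending_softirqs();
        goto restart;
    }

This then matches what tasklet_kill() did before: spin (while running
pending softirqs) until the outstanding work has been flushed.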

>> +static int __init setup_dpci_softirq(void)
>> +{
>> +    unsigned int cpu;
>> +
>> +    for_each_online_cpu(cpu)
>> +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> 
> Hmm - it's loops like this that make me think we should have a function
> to explicitly initialise the percpu area of cpus at the point at which
> they come up.  Still, this is fine for now, and another item for the
> todo list.

An alternative would be to make the respective initcall a pre-SMP
one (unless it depends on other things that don't get done early
enough).
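
I.e. (sketch):

-__initcall(setup_dpci_softirq);
+presmp_initcall(setup_dpci_softirq);

With the initcall run before the APs get brought up, the
for_each_online_cpu() loop would only ever see the boot CPU, and the
CPU_UP_PREPARE notifier would take care of all the others.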

>> -static void pci_clean_dpci_irqs(struct domain *d)
>> +static int pci_clean_dpci_irqs(struct domain *d)
>>  {
>>      struct hvm_irq_dpci *hvm_irq_dpci = NULL;
>>  
>>      if ( !iommu_enabled )
>> -        return;
>> +        return -ESRCH;
> 
> I realise that our return codes are inconsistent, but can we avoid
> further overloading of return values?  -ESRCH is generally "no such domain".
> 
> Certain IOMMU related failures currently use -ENODEV, which is perhaps
> more appropriate.

+1
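
I.e. (sketch):

    if ( !iommu_enabled )
        return -ENODEV;

(Arguably the !is_hvm_domain() case shouldn't be reported as an error
to the caller at all.)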

>> --- a/xen/include/xen/hvm/irq.h
>> +++ b/xen/include/xen/hvm/irq.h
>> @@ -99,7 +99,7 @@ struct hvm_pirq_dpci {
>>      struct domain *dom;
>>      struct hvm_gmsi_info gmsi;
>>      struct timer timer;
>> -    struct tasklet tasklet;
>> +    struct list_head softirq_list;
>>  };
>>  
>>  void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
>> @@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
>>                                struct hvm_pirq_dpci *, void *arg),
>>                      void *arg);
>>  
>> +int pt_pirq_softirq_active(struct hvm_pirq_dpci *);
> 
> "struct hvm_pirq_dpci *pirq_dpci" please.

I don't think this is something to insist on; I think I'm actually
the one who introduced most of the prototypes without parameter
names that are currently in the tree. I agree names are useful
for parameters of sufficiently generic types, or when there are
multiple parameters of the same type. But when the type name
itself is specific enough, additionally giving it a name is simply
redundant.

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
  2014-09-29 10:16   ` Andrew Cooper
@ 2014-09-29 13:22   ` Jan Beulich
  2014-10-07 15:35     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2014-09-29 13:22 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 27.09.14 at 03:33, <konrad.wilk@oracle.com> wrote:
> @@ -130,6 +127,18 @@ int pt_irq_create_bind(
>          return -ENOMEM;
>      }
>      pirq_dpci = pirq_dpci(info);
> +    /*
> +     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
> +     * was called. The 'pirq_cleanup_check' which would free the structure
> +     * is only called if the event channel for the PIRQ is active. However
> +     * OS-es that use event channels usually bind the PIRQ to an event channel
> +     * and also unbind it before 'pt_irq_destroy_bind' is called which means
> +     * we end up re-using the 'dpci' structure. This can be easily reproduced
> +     * by unloading and loading the driver for the device.
> +     *
> +     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
> +     */
> +    pirq_dpci->dom = d;

I continue to be unconvinced of the correctness of this placement:
As said before, you only need this in place by the time
pirq_guest_bind() gets called. And with the patch applied there's
now at least one error path where this doesn't get zapped to NULL:

        if ( !digl || !girq )
        {
            spin_unlock(&d->event_lock);
            xfree(girq);
            xfree(digl);
            return -ENOMEM;
        }

> @@ -513,9 +530,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
>      spin_unlock(&d->event_lock);
>  }
>  
> -static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
> -                            void *arg)
> +static void hvm_dirq_assist(unsigned long arg)
>  {
> +    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
> +    struct domain *d = pirq_dpci->dom;
> +
> +    /*
> +     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
> +     * right before 'pirq_guest_unbind' gets called - but not yet executed.
> +     *
> +     * And '->dom' gets cleared later in the destroy path. We exit and clear
> +     * 'mapping' - which is OK as later in this code we would

Does this comment mean 'masked' instead of 'mapping'?

> +     * do nothing except clear the ->masked field anyhow.
> +     */
> +    if ( !d )
> +    {
> +        pirq_dpci->masked = 0;
> +        return;
> +    }

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7)
  2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7) Konrad Rzeszutek Wilk
  2014-09-29 10:49   ` Andrew Cooper
@ 2014-09-29 13:40   ` Jan Beulich
  1 sibling, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2014-09-29 13:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 27.09.14 at 03:33, <konrad.wilk@oracle.com> wrote:
> +/*
> + * Should only be called from hvm_do_IRQ_dpci. We use
> + * 'masked' as a gate to thwart multiple interrupts.
> + *
> + * 'masked' is cleared by 'dpci_softirq' once it has
> + * completed executing 'hvm_dirq_assist'.
> + */
> +static void schedule_softirq_for(struct hvm_pirq_dpci *pirq_dpci)

Perhaps better name this raise_... than schedule_...?

> +/*
> + * If we are racing with dpci_softirq (masked is still set) we return
> + * -EAGAIN. Otherwise we return 0.
> + *
> + * If it is -EAGAIN, it is the caller's responsibility to make sure
> + * that the softirq (with the event_lock dropped) has run. We need
> + * to flush out the outstanding 'dpci_softirq' (no more of them
> + * will be added for this pirq as the IRQ action handler has been
> + * reset in pt_irq_destroy_bind).
> + */
> +int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    if ( pirq_dpci->masked )
> +        return -EAGAIN;

I think this would better be -ERESTART with the post-4.4 conversion
from -EAGAIN to -ERESTART.

> @@ -544,14 +607,12 @@ static void hvm_dirq_assist(unsigned long arg)
>       * do nothing except clear the ->masked field anyhow.
>       */
>      if ( !d )
> -    {
> -        pirq_dpci->masked = 0;
>          return;
> -    }

This renders stale the comment right above.

> -void pci_release_devices(struct domain *d)
> +int pci_release_devices(struct domain *d)
>  {
>      struct pci_dev *pdev;
>      u8 bus, devfn;
> +    int ret;
>  
>      spin_lock(&pcidevs_lock);
> -    pci_clean_dpci_irqs(d);
> +    ret = pci_clean_dpci_irqs(d);
> +    if ( ret == -EAGAIN )

I don't think you should special case -EAGAIN here.
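
I.e. (sketch - this assumes pci_clean_dpci_irqs() gets changed to
return 0 in the benign !iommu_enabled / !is_hvm_domain(d) cases):

    ret = pci_clean_dpci_irqs(d);
    if ( ret )
    {
        spin_unlock(&pcidevs_lock);
        return ret;
    }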

> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -8,6 +8,9 @@ enum {
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,
>      TASKLET_SOFTIRQ,
> +#ifdef HAS_PASSTHROUGH
> +    HVM_DPCI_SOFTIRQ,
> +#endif

The conditional is wrong actually (sorry, noticed this just now) - as
io.c is x86-specific, this should be an arch-specific softirq.
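
I.e. move it to asm-x86/softirq.h (sketch, assuming the current x86
numbering there):

    #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
    #define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
    #define NR_ARCH_SOFTIRQS       5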

Also I notice you dropped the domain ref-counting you previously
used, yet I don't clearly see how doing so is safe.
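
What I'd expect is roughly the below (untested sketch) - take a
reference when queuing and drop it once the softirq has processed the
entry:

    /* In schedule_softirq_for(), before queuing: */
    get_knownalive_domain(pirq_dpci->dom);

    /* In dpci_softirq(), once an entry has been handled: */
    hvm_dirq_assist(pirq_dpci);
    pirq_dpci->masked = 0;
    put_domain(pirq_dpci->dom);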

Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-09-29 13:22   ` Jan Beulich
@ 2014-10-07 15:35     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-07 15:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Mon, Sep 29, 2014 at 02:22:06PM +0100, Jan Beulich wrote:
> >>> On 27.09.14 at 03:33, <konrad.wilk@oracle.com> wrote:
> > @@ -130,6 +127,18 @@ int pt_irq_create_bind(
> >          return -ENOMEM;
> >      }
> >      pirq_dpci = pirq_dpci(info);
> > +    /*
> > +     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
> > +     * was called. The 'pirq_cleanup_check' which would free the structure
> > +     * is only called if the event channel for the PIRQ is active. However
> > +     * OS-es that use event channels usually bind the PIRQ to an event channel
> > +     * and also unbind it before 'pt_irq_destroy_bind' is called which means
> > +     * we end up re-using the 'dpci' structure. This can be easily reproduced
> > +     * by unloading and loading the driver for the device.
> > +     *
> > +     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
> > +     */
> > +    pirq_dpci->dom = d;
> 
> I continue to be unconvinced of the correctness of this placement:
> As said before, you only need this in place by the time
> pirq_guest_bind() gets called. And with the patch applied there's

Correct.
> now at least one error path where this doesn't get zapped to NULL:
> 
>         if ( !digl || !girq )
>         {
>             spin_unlock(&d->event_lock);
>             xfree(girq);
>             xfree(digl);
>             return -ENOMEM;
>         }

Right, and the issue I am facing is that if we zap it to NULL we
run into the problem of refcounting the domain - or rather, of not
being able to refcount the domain properly because it got zapped.
I shall respond to your email (this one:
http://mid.gmane.org/542924B1020000780003A403@mail.emea.novell.com).

> 
> > @@ -513,9 +530,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
> >      spin_unlock(&d->event_lock);
> >  }
> >  
> > -static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
> > -                            void *arg)
> > +static void hvm_dirq_assist(unsigned long arg)
> >  {
> > +    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
> > +    struct domain *d = pirq_dpci->dom;
> > +
> > +    /*
> > +     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
> > > +     * right before 'pirq_guest_unbind' gets called - but not yet executed.
> > +     *
> > +     * And '->dom' gets cleared later in the destroy path. We exit and clear
> > +     * 'mapping' - which is OK as later in this code we would
> 
> Does this comment mean 'masked' instead of 'mapping'?

'masked'.
> 
> > +     * do nothing except clear the ->masked field anyhow.
> > +     */
> > +    if ( !d )
> > +    {
> > +        pirq_dpci->masked = 0;
> > +        return;
> > +    }
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-10-07 15:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-27  1:33 [PATCH v7] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
2014-09-29 10:16   ` Andrew Cooper
2014-09-29 11:52     ` Jan Beulich
2014-09-29 13:22   ` Jan Beulich
2014-10-07 15:35     ` Konrad Rzeszutek Wilk
2014-09-27  1:33 ` [PATCH v7 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v7) Konrad Rzeszutek Wilk
2014-09-29 10:49   ` Andrew Cooper
2014-09-29 11:59     ` Jan Beulich
2014-09-29 13:40   ` Jan Beulich
