* [PATCH v8 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices.
@ 2014-10-21 17:19 Konrad Rzeszutek Wilk
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8) Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-21 17:19 UTC (permalink / raw)
  To: xen-devel, JBeulich, tim, keir, ian.campbell, ian.jackson,
	andrew.cooper3


Changelog:
since v7 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg04385.html)
 - Put ref-counting back, added two bits to allow ref-counting from other places.
 
since v6 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg03208.html)
 - Squashed #1 + #2.
 - Added more comments, redid it based on Jan's feedback.
since v5 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg02868.html)
 - Redid the series based on Jan's feedback
since v4
(http://lists.xen.org/archives/html/xen-devel/2014-09/msg01676.html):
 - Ditch the domain-centric mechanism.
 - Fix issues raised by Jan.


These patches are performance bug-fixes for PCI passthrough on machines
with many sockets. On those machines we have observed awful latency issues
with interrupts and high steal time on idle guests. The root cause
was the tasklet lock, which was shared across all sockets. Each interrupt
that was to be delivered to a guest took the tasklet lock - and with many
guests and many PCI passthrough devices - the performance and latency were
atrocious. These two patches fix the outstanding issues.

I am in a weird position: I am the release manager and am also submitting
patches past the feature freeze. This is a fix (a performance one), but
at the same time the work it does adds new code, which could
be considered a feature. It does not have an Ack from anybody, although
Jan and Andrew have been heavily reviewing the patchset.

I don't think we have any specific policy for this - so I am going to
err on the safe side and try to justify it as if it were a feature.

 - It definitely adds to the awesome++ part of Xen 4.5 by lowering
   the latency of interrupts _and_ removing steal time from guests.

 - From the risk perspective - it does not touch the common use-cases,
   but only the 'PCI passthrough' one. Even so, it replaces some of
   that core code, which carries risk. I've done my best
   to mitigate this with various tests - passing in buggy hardware,
   legacy hardware, MSI and MSI-X hardware. All of them work the same
   way as they did before. But of course there is still a risk.

 - If this code is not added in Xen 4.5 the release will still be awesome,
   but not awesome++ :-)


 xen/Makefile                           |   2 +-
 xen/arch/arm/domain.c                  |  12 --
 xen/arch/x86/domain.c                  |  23 +--
 xen/arch/x86/hvm/hvm.c                 |   4 +-
 xen/arch/x86/hvm/viridian.c            |  60 ++-----
 xen/arch/x86/hvm/vlapic.c              |  27 +---
 xen/arch/x86/setup.c                   |   2 +-
 xen/common/domain.c                    |  26 ++-
 xen/drivers/passthrough/io.c           | 284 +++++++++++++++++++++++++++++----
 xen/drivers/passthrough/pci.c          |  29 +++-
 xen/include/asm-x86/hvm/hvm.h          |   9 +-
 xen/include/asm-x86/hvm/viridian.h     |  27 ----
 xen/include/asm-x86/softirq.h          |   3 +-
 xen/include/public/arch-x86/hvm/save.h |   1 -
 xen/include/xen/domain.h               |   4 -
 xen/include/xen/hvm/irq.h              |   5 +-
 xen/include/xen/pci.h                  |   2 +-
 17 files changed, 316 insertions(+), 204 deletions(-)


Konrad Rzeszutek Wilk (2):
      dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
      dpci: Replace tasklet with an softirq (v8)


* [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-21 17:19 [PATCH v8 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
@ 2014-10-21 17:19 ` Konrad Rzeszutek Wilk
  2014-10-23  8:58   ` Jan Beulich
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8) Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-21 17:19 UTC (permalink / raw)
  To: xen-devel, JBeulich, tim, keir, ian.campbell, ian.jackson,
	andrew.cooper3
  Cc: Konrad Rzeszutek Wilk

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
a bit (masked), and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
domain', from which it iterates over the radix-tree of
'hvm_dirq_dpci' entries (from zero to the number of PIRQs allocated)
that are masked to the guest, calling 'hvm_pirq_assist' for each.
If the PIRQ has the bit set (masked) it figures out how to
inject the PIRQ into the guest.

This is inefficient and not fair as:
 - We iterate starting at PIRQ 0 and up every time. That means
   the PCIe devices that have lower PIRQs get to be called
   first.
 - If we have many PCIe devices passed in with many PIRQs and
   if most of the time only the highest-numbered PIRQ gets an
   interrupt (as the initial ones are for control), we end up
   iterating over many PIRQs.

But we could do better - the 'hvm_dirq_dpci' has a
'struct domain' field, which we can use instead of having to pass in
the 'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_dirq_dpci'
and sets the 'dom' field to the domain. We also double-check
that the '->dom' is not reset before using it.

We have to be careful with this, as it means we MUST have
'dom' set before pirq_guest_bind() is called. As such
we add 'pirq_dpci->dom = d;' to cover all such
cases.

The mechanism to tear it down is more complex as there
are two ways it can be executed:

 a) pci_clean_dpci_irq. This gets called when the guest is
    being destroyed. We end up calling 'tasklet_kill'.

     The scenarios in which the 'struct pirq' (and subsequently
     the 'hvm_pirq_dpci') gets destroyed are when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill'
    which will spin on the tasklet if it is running.

 b) pt_irq_destroy_bind (guest disables the MSI). We double-check
    that the softirq has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra call to 'pt_pirq_softirq_active' in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels unbind the
    event channel from PIRQs first, so 'pt_pirq_cleanup_check'
    won't be called as the event channel is set to zero. In that case
    we clean it up via the a) mechanism. It is OK
    to re-use the tasklet when 'pt_irq_create_bind' is called
    afterwards.

    There is an extra scenario regardless of the event channel being
    set or not: the guest did 'pt_irq_destroy_bind' while an
    interrupt was triggered and the tasklet was scheduled (but had
    not been run). It is OK to still run the tasklet:
    hvm_dirq_assist won't do anything (the flags are
    set to zero), so it exits without doing any work.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 75 +++++++++++++++++++++++++++++--------------
 xen/drivers/passthrough/pci.c |  4 +--
 xen/include/xen/hvm/irq.h     |  2 +-
 3 files changed, 54 insertions(+), 27 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..0cf0a52 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -27,7 +27,7 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(unsigned long arg);
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +114,6 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -130,6 +127,18 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
+    /*
+     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
+     * was called. The 'pirq_cleanup_check' which would free the structure
+     * is only called if the event channel for the PIRQ is active. However
+     * OS-es that use event channels usually bind the PIRQ to an event channel
+     * and also unbind it before 'pt_irq_destroy_bind' is called which means
+     * we end up re-using the 'dpci' structure. This can be easily reproduced
+     * with unloading and loading the driver for the device.
+     *
+     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
+     */
+    pirq_dpci->dom = d;
 
     switch ( pt_irq_bind->irq_type )
     {
@@ -156,6 +165,7 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
+                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
@@ -232,7 +242,6 @@ int pt_irq_create_bind(
         {
             unsigned int share;
 
-            pirq_dpci->dom = d;
             if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
             {
                 pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
@@ -415,11 +424,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
+    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    return !dpci->flags;
+    if ( !dpci->flags )
+    {
+        tasklet_kill(&dpci->tasklet);
+        dpci->dom = NULL;
+        return 1;
+    }
+    return 0;
 }
 
 int pt_pirq_iterate(struct domain *d,
@@ -459,7 +475,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    tasklet_schedule(&pirq_dpci->tasklet);
     return 1;
 }
 
@@ -513,9 +529,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
-                            void *arg)
+static void hvm_dirq_assist(unsigned long arg)
 {
+    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
+    struct domain *d = pirq_dpci->dom;
+
+    /*
+     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
+     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
+     *
+     * And '->dom' gets cleared later in the destroy path. We exit and clear
+     * 'masked' - which is OK as later in this code we would
+     * do nothing except clear the ->masked field anyhow.
+     */
+    if ( !d )
+    {
+        pirq_dpci->masked = 0;
+        return;
+    }
+    ASSERT(d->arch.hvm_domain.irq.dpci);
+
+    spin_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -526,13 +560,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-                return 0;
+            {
+                spin_unlock(&d->event_lock);
+                return;
+            }
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -545,7 +583,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         /*
@@ -558,18 +597,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         ASSERT(pt_irq_need_timer(pirq_dpci->flags));
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
-
-    return 0;
-}
-
-static void hvm_dirq_assist(unsigned long _d)
-{
-    struct domain *d = (struct domain *)_d;
-
-    ASSERT(d->arch.hvm_domain.irq.dpci);
-
-    spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
     spin_unlock(&d->event_lock);
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..81e8a3a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
+    tasklet_kill(&pirq_dpci->tasklet);
+
     return 0;
 }
 
@@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..94a550a 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
+    struct tasklet tasklet;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
-- 
1.9.3


* [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-21 17:19 [PATCH v8 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
@ 2014-10-21 17:19 ` Konrad Rzeszutek Wilk
  2014-10-23  9:36   ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-21 17:19 UTC (permalink / raw)
  To: xen-devel, JBeulich, tim, keir, ian.campbell, ian.jackson,
	andrew.cooper3
  Cc: Konrad Rzeszutek Wilk

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is invoked via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that quite a lot of
tasklets are scheduled. The most commonly invoked one is
the one injecting VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted to the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
in taking the lock. The latency of the PCI interrupts
delivered to guests also suffers.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which
takes care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structures
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for an
'hvm_pirq_dpci' which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(the first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind', '->dom' is whatever it was last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value into the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex, as there
are three ways it can be executed. To make it simpler,
everything revolves around 'pt_pirq_softirq_active'. If it
returns -ERESTART, that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed are when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind the
   event channel from PIRQs first, so 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine b) with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
calling 'process_pending_softirqs'. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and the softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we should
let it continue. We can clear the '->dom' field, as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
and the '->dom' field is left intact. The clearing of
'->dom' is then left to the a), b), or again c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo the per_cpu/this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 255 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 +++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 258 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 558d8d5..11739ed 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1946,7 +1946,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 0cf0a52..c55ace7 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,119 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever clears it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only the softirq is allowed to set and clear it. If it has
+ *      been set the hvm_dirq_assist will RUN with an saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never run and that the softirq did not do any
+ * ref-counting.
+ */
+enum {
+    STATE_SCHED, /* Bit 0 */
+    STATE_RUN,
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as an gate to thwart multiple interrupts being scheduled.
+ *
+ * The 'state' is cleared by 'softirq_dpci' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (state is still set) we return
+ * -ERESTART. Otherwise we return 0.
+ *
+ *  If it is -ERESTART, it is the callers responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has ran. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & (STATE_RUN | STATE_SCHED) )
+        return -ERESTART;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do
+ * at the right time and if unschedules the softirq before it has
+ * run it also refcounts (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+    /*
+     * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+     * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
+     * a local variable before it sets STATE_RUN - and therefore will not
+     * dereference '->dom' which would result in a crash.
+     */
+    if ( test_bit(STATE_RUN, &pirq_dpci->state) )
+    {
+        pirq_dpci->dom = NULL;
+        return;
+    }
+    /*
+     * We are going to try to de-schedule the softirq before it goes in
+     * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+     */
+    if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+    {
+        put_domain(d);
+        pirq_dpci->dom = NULL;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +145,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +206,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -128,6 +234,21 @@ int pt_irq_create_bind(
     }
     pirq_dpci = pirq_dpci(info);
     /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the softirq_dpci a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. We do this up to one second at which point we
+     * give up. Note that this code replaced tasklet_kill which would have
+     * spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls.
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        process_pending_softirqs();
+        goto restart;
+    }
+    /*
      * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
      * was called. The 'pirq_cleanup_check' which would free the structure
      * is only called if the event channel for the PIRQ is active. However
@@ -165,8 +286,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between the 'pirq_guest_bind' and before 'pirq_guest_unbind'
+                 * an interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -267,6 +394,14 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * We do not have to deal with the error path that
+                 * PT_IRQ_TYPE_MSI had (between pirq_guest_bind and
+                 * pirq_guest_unbind an softirq could be scheduled) because
+                 * before pirq_guest_bind call and after it (if it failed)
+                 * there is no path for __do_IRQ to schedule as IRQ_GUEST
+                 * is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -400,8 +535,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * Before the 'pirq_guest_unbind' had been called an interrupt could
+         * have been scheduled. No more of them are going to be scheduled after
+         * that but we must deal with the one that were put in the queue.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
         pirq_cleanup_check(pirq, d);
     }
 
@@ -424,14 +564,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -474,8 +612,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -529,28 +666,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -652,3 +773,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    struct hvm_pirq_dpci *pirq_dpci;
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_wmb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..1660750 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci);
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -EAGAIN )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..43b371f 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned long state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+int pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
@ 2014-10-23  8:58   ` Jan Beulich
  2014-10-24  1:58     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-23  8:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> @@ -130,6 +127,18 @@ int pt_irq_create_bind(
>          return -ENOMEM;
>      }
>      pirq_dpci = pirq_dpci(info);
> +    /*
> +     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
> +     * was called. The 'pirq_cleanup_check' which would free the structure
> +     * is only called if the event channel for the PIRQ is active. However
> +     * OS-es that use event channels usually bind the PIRQ to an event channel
> +     * and also unbind it before 'pt_irq_destroy_bind' is called which means
> +     * we end up re-using the 'dpci' structure. This can be easily reproduced
> +     * with unloading and loading the driver for the device.
> +     *
> +     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
> +     */
> +    pirq_dpci->dom = d;

I continue to disagree with the placement of this (only needed right
before calling pirq_guest_bind(), as iirc you actually indicated you
agree with), and I can only re-iterate that with it being here two error
paths (hit before/without pirq_guest_bind() getting called) would need
fixing up too (which wouldn't be needed if that assignment got
deferred as much as possible).

> @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>              {
>                  pirq_dpci->gmsi.gflags = 0;
>                  pirq_dpci->gmsi.gvec = 0;
> +                pirq_dpci->dom = NULL;
>                  pirq_dpci->flags = 0;
>                  pirq_cleanup_check(info, d);
>                  spin_unlock(&d->event_lock);

Just like this error path needing adjustment, the other one following
failure of pirq_guest_bind() (after

> @@ -232,7 +242,6 @@ int pt_irq_create_bind(
>          {
>              unsigned int share;
>  
> -            pirq_dpci->dom = d;
>              if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
>              {
>                  pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |

) would seem to need adjustment too.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8) Konrad Rzeszutek Wilk
@ 2014-10-23  9:36   ` Jan Beulich
  2014-10-24  1:58     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-23  9:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> +/*
> + * These two bit states help to safely schedule, deschedule, and wait until
> + * the softirq has finished.
> + *
> + * The semantics behind these two bits is as follow:
> + *  - STATE_SCHED - whoever clears it has to ref-count the domain (->dom).

s/clears/modifies/

> + *  - STATE_RUN - only the softirq is allowed to set and clear it. If it has
> + *      been set the hvm_dirq_assist will RUN with an saved value of the

s/ an / a / (and perhaps also ditch the first "the" on that line,
and similarly further down)

> +    STATE_SCHED, /* Bit 0 */

Bogus comment (effectively re-stating what the language
specification says)?

> +/*
> + * Should only be called from hvm_do_IRQ_dpci. We use the
> + * 'state' as an gate to thwart multiple interrupts being scheduled.

s/ an / a /

> + *
> + * The 'state' is cleared by 'softirq_dpci' when it has
> + * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
> + * if we want to try to unschedule the softirq before it runs.
> + *

Stray blank comment line.

> +/*
> + * If we are racing with softirq_dpci (state is still set) we return
> + * -ERESTART. Otherwise we return 0.
> + *
> + *  If it is -ERESTART, it is the callers responsibility to make sure
> + *  that the softirq (with the event_lock dropped) has ran. We need
> + *  to flush out the outstanding 'dpci_softirq' (no more of them
> + *  will be added for this pirq as the IRQ action handler has been
> + *  reset in pt_irq_destroy_bind).
> + */
> +int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    if ( pirq_dpci->state & (STATE_RUN | STATE_SCHED) )
> +        return -ERESTART;
> +
> +    /*
> +     * If in the future we would call 'raise_softirq_for' right away
> +     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
> +     * might have stale data).
> +     */
> +    return 0;
> +}

Having this return -ERESTART and 0 rather than a simple boolean
is kind of odd as long as there are no other errors possible here.

> +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    struct domain *d = pirq_dpci->dom;
> +
> +    ASSERT(spin_is_locked(&d->event_lock));
> +    /*
> +     * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> +     * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
> +     * a local variable before it sets STATE_RUN - and therefore will not
> +     * dereference '->dom' which would result in a crash.
> +     */
> +    if ( test_bit(STATE_RUN, &pirq_dpci->state) )
> +    {
> +        pirq_dpci->dom = NULL;
> +        return;
> +    }
> +    /*
> +     * We are going to try to de-schedule the softirq before it goes in
> +     * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> +     */
> +    if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> +    {
> +        put_domain(d);
> +        pirq_dpci->dom = NULL;
> +    }

Would it not be easier to follow if instead of the two if()-s you
used switch(cmpxchg(..., STATE_SCHED, 0)) here?

> @@ -128,6 +234,21 @@ int pt_irq_create_bind(
>      }
>      pirq_dpci = pirq_dpci(info);
>      /*
> +     * A crude 'while' loop with us dropping the spinlock and giving
> +     * the softirq_dpci a chance to run.
> +     * We MUST check for this condition as the softirq could be scheduled
> +     * and hasn't run yet. We do this up to one second at which point we
> +     * give up. Note that this code replaced tasklet_kill which would have
> +     * spun forever and would do the same thing (wait to flush out
> +     * outstanding hvm_dirq_assist calls.
> +     */

Stale comment - there's no 1s timeout here anymore.

> +    if ( pt_pirq_softirq_active(pirq_dpci) )
> +    {
> +        spin_unlock(&d->event_lock);
> +        process_pending_softirqs();

ASSERT_NOT_IN_ATOMIC() between these two (the assertion
process_pending_softirqs() does seems too weak for the purposes
here, as we really shouldn't be holding any other spin locks; otoh
it's not really clear to me why that aspect is different from
do_softirq() - just fired off a separate mail)?

> @@ -400,8 +535,13 @@ int pt_irq_destroy_bind(
>          msixtbl_pt_unregister(d, pirq);
>          if ( pt_irq_need_timer(pirq_dpci->flags) )
>              kill_timer(&pirq_dpci->timer);
> -        pirq_dpci->dom   = NULL;
>          pirq_dpci->flags = 0;
> +        /*
> +         * Before the 'pirq_guest_unbind' had been called an interrupt could
> +         * have been scheduled. No more of them are going to be scheduled after
> +         * that but we must deal with the one that were put in the queue.
> +         */

This is the second or third instance of this or a similar comment.
Please have it just once in full, and have the other(s) simply refer
to the full one.

> @@ -652,3 +773,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
>  unlock:
>      spin_unlock(&d->event_lock);
>  }
> +
> +static void dpci_softirq(void)
> +{
> +    struct hvm_pirq_dpci *pirq_dpci;
> +    unsigned int cpu = smp_processor_id();
> +    LIST_HEAD(our_list);
> +
> +    local_irq_disable();
> +    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
> +    local_irq_enable();
> +
> +    while ( !list_empty(&our_list) )
> +    {
> +        struct domain *d;

Considering that you already have one non-function scope variable
here, please use the other one only needed inside this loop
(pirq_dpci) here too.

> +
> +        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
> +        list_del(&pirq_dpci->softirq_list);
> +
> +        d = pirq_dpci->dom;
> +        smp_wmb(); /* 'd' MUST be saved before we set/clear the bits. */

So you state the right thing, but use the wrong barrier: To order a
read and a write, smp_mb() is your only choice.

> -void pci_release_devices(struct domain *d)
> +int pci_release_devices(struct domain *d)
>  {
>      struct pci_dev *pdev;
>      u8 bus, devfn;
> +    int ret;
>  
>      spin_lock(&pcidevs_lock);
> -    pci_clean_dpci_irqs(d);
> +    ret = pci_clean_dpci_irqs(d);
> +    if ( ret == -EAGAIN )

-ERESTART?

> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -93,13 +93,13 @@ struct hvm_irq_dpci {
>  /* Machine IRQ to guest device/intx mapping. */
>  struct hvm_pirq_dpci {
>      uint32_t flags;
> -    bool_t masked;
> +    unsigned long state;

I think "unsigned int" would be sufficient here?

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-23  8:58   ` Jan Beulich
@ 2014-10-24  1:58     ` Konrad Rzeszutek Wilk
  2014-10-24  9:49       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-24  1:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

[-- Attachment #1: Type: text/plain, Size: 13891 bytes --]

On Thu, Oct 23, 2014 at 09:58:34AM +0100, Jan Beulich wrote:
> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> > @@ -130,6 +127,18 @@ int pt_irq_create_bind(
> >          return -ENOMEM;
> >      }
> >      pirq_dpci = pirq_dpci(info);
> > +    /*
> > +     * The 'pt_irq_create_bind' can be called right after 'pt_irq_destroy_bind'
> > +     * was called. The 'pirq_cleanup_check' which would free the structure
> > +     * is only called if the event channel for the PIRQ is active. However
> > +     * OS-es that use event channels usually bind the PIRQ to an event channel
> > +     * and also unbind it before 'pt_irq_destroy_bind' is called which means
> > +     * we end up re-using the 'dpci' structure. This can be easily reproduced
> > +     * with unloading and loading the driver for the device.
> > +     *
> > +     * As such on every 'pt_irq_create_bind' call we MUST reset the values.
> > +     */
> > +    pirq_dpci->dom = d;
> 
> I continue to disagree with the placement of this (only needed right
> before calling pirq_guest_bind(), as iirc you actually indicated you
> agree with), and I can only re-iterate that with it being here two error
> paths (hit before/without pirq_guest_bind() getting called) would need
> fixing up too (which wouldn't be needed if that assignment got
> deferred as much as possible).

The patch (inline and attached) should be following what you explained
above.
> 
> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
> >              {
> >                  pirq_dpci->gmsi.gflags = 0;
> >                  pirq_dpci->gmsi.gvec = 0;
> > +                pirq_dpci->dom = NULL;
> >                  pirq_dpci->flags = 0;
> >                  pirq_cleanup_check(info, d);
> >                  spin_unlock(&d->event_lock);
> 
> Just like this error path needing adjustment, the other one following
> failure of pirq_guest_bind() (after
> 
> > @@ -232,7 +242,6 @@ int pt_irq_create_bind(
> >          {
> >              unsigned int share;
> >  
> > -            pirq_dpci->dom = d;
> >              if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
> >              {
> >                  pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
> 
> ) would seem to need adjustment too.

However I am at a loss as to what you mean here. If by adjustment you mean
leave it alone, I concur on the latter.

In the error path after the 'msixtbl_pt_register':

157             rc = pirq_guest_bind(d->vcpu[0], info, 0);                          
158             if ( rc == 0 && pt_irq_bind->u.msi.gtable )                         
159             {                                                                   
160                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);   
161                 if ( unlikely(rc) )                                             
162                     pirq_guest_unbind(d, info);                                 
163             }                                         

We can choose to clear 'dom' right away if !pt_irq_bind->u.msi.gtable.

Otherwise we can leave it as is.

But per your previous recommendation I've modified hvm_dirq_assist to deal
with the case of 'dom' being set to NULL, which means we can set 'dom' to
NULL in either case.

Please see attached and inline patch.

From 37e4b8203d701d0340066c9f746f08188e74f679 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 22 Sep 2014 17:46:40 -0400
Subject: [PATCH 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an
 hvm_dirq_dpci model.

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
a bit (masked), and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
domain', from which it iterates over the radix-tree of
'hvm_dirq_dpci' entries (from zero to the number of PIRQs
allocated) and calls 'hvm_pirq_assist' on each. If the PIRQ
has a bit set (masked), it figures out how to inject the PIRQ
into the guest.

This is inefficient and not fair as:
 - We iterate starting at PIRQ 0 and up every time. That means
   the PCIe devices that have lower PIRQs get to be called
   first.
 - If we have many PCIe devices passed in with many PIRQs and
   if most of the time only the highest numbered PIRQ gets an
   interrupt (as the initial ones are for control) we end up
   iterating over many PIRQs.

But we could do better - the 'hvm_dirq_dpci' has a field for
'struct domain', which we can use instead of having to pass in the
'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_dirq_dpci'
and sets the 'dom' field to the domain. We also double-check
that the '->dom' is not reset before using it.

We have to be careful with this as that means we MUST have
'dom' set before pirq_guest_bind() is called. As such
we add the 'pirq_dpci->dom = d;' to cover for such
cases.

The mechanism to tear it down is more complex as there
are two ways it can be executed:

 a) pci_clean_dpci_irq. This gets called when the guest is
    being destroyed. We end up calling 'tasklet_kill'.

    The scenarios in which the 'struct pirq' (and subsequently
    the 'hvm_pirq_dpci') gets destroyed is when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill'
    which will spin on the tasklet if it is running.

 b) pt_irq_destroy_bind (guest disables the MSI). We double-check
    that the softirq has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra call to 'pt_pirq_softirq_active' in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels unbind the event
    channel from PIRQs first, so 'pt_pirq_cleanup_check'
    won't be called as the event is set to zero. In that
    case we clean it up via the a) mechanism. It is OK
    to re-use the tasklet when 'pt_irq_create_bind' is called
    afterwards.

    There is an extra scenario regardless of event being
    set or not: the guest did 'pt_irq_destroy_bind' while an
    interrupt was triggered and tasklet was scheduled (but had not
    been run). It is OK to still run the tasklet as
    hvm_dirq_assist won't do anything (as the flags are
    set to zero). As such we can exit out of hvm_dirq_assist
    without doing anything.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 75 ++++++++++++++++++++++++++++++-------------
 xen/drivers/passthrough/pci.c |  4 +--
 xen/include/xen/hvm/irq.h     |  2 +-
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..dceb17e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -27,7 +27,7 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(unsigned long arg);
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +114,6 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -144,6 +141,18 @@ int pt_irq_create_bind(
                                HVM_IRQ_DPCI_GUEST_MSI;
             pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
             pirq_dpci->gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            /*
+             * 'pt_irq_create_bind' can be called after 'pt_irq_destroy_bind'.
+             * The 'pirq_cleanup_check' which would free the structure is only
+             * called if the event channel for the PIRQ is active. However
+             * OS-es that use event channels usually bind PIRQs to eventds
+             * and unbind them before calling 'pt_irq_destroy_bind' - with the
+             * result that we re-use the 'dpci' structure. This can be
+             * reproduced with unloading and loading the driver for a device.
+             *
+             * As such on every 'pt_irq_create_bind' call we MUST set it.
+             */
+            pirq_dpci->dom = d;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], info, 0);
             if ( rc == 0 && pt_irq_bind->u.msi.gtable )
@@ -156,6 +165,7 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
+                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
@@ -232,6 +242,7 @@ int pt_irq_create_bind(
         {
             unsigned int share;
 
+            /* MUST be set, as the pirq_dpci can be re-used. */
             pirq_dpci->dom = d;
             if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
             {
@@ -415,11 +426,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
+    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    return !dpci->flags;
+    if ( !dpci->flags )
+    {
+        tasklet_kill(&dpci->tasklet);
+        dpci->dom = NULL;
+        return 1;
+    }
+    return 0;
 }
 
 int pt_pirq_iterate(struct domain *d,
@@ -459,7 +477,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    tasklet_schedule(&pirq_dpci->tasklet);
     return 1;
 }
 
@@ -513,9 +531,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
-                            void *arg)
+static void hvm_dirq_assist(unsigned long arg)
 {
+    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
+    struct domain *d = pirq_dpci->dom;
+
+    /*
+     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
+     * right before 'pirq_guest_unbind' gets called - but not yet executed.
+     *
+     * And '->dom' gets cleared later in the destroy path. We exit and clear
+     * 'masked' - which is OK as later in this code we would
+     * do nothing except clear the ->masked field anyhow.
+     */
+    if ( !d )
+    {
+        pirq_dpci->masked = 0;
+        return;
+    }
+    ASSERT(d->arch.hvm_domain.irq.dpci);
+
+    spin_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -526,13 +562,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-                return 0;
+            {
+                spin_unlock(&d->event_lock);
+                return;
+            }
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -545,7 +585,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         /*
@@ -558,18 +599,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         ASSERT(pt_irq_need_timer(pirq_dpci->flags));
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
-
-    return 0;
-}
-
-static void hvm_dirq_assist(unsigned long _d)
-{
-    struct domain *d = (struct domain *)_d;
-
-    ASSERT(d->arch.hvm_domain.irq.dpci);
-
-    spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
     spin_unlock(&d->event_lock);
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..81e8a3a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
+    tasklet_kill(&pirq_dpci->tasklet);
+
     return 0;
 }
 
@@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..94a550a 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
+    struct tasklet tasklet;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
-- 
1.9.3

> 
> Jan
> 

[-- Attachment #2: 0001-dpci-Move-from-an-hvm_irq_dpci-and-struct-domain-to-.patch --]
[-- Type: text/plain, Size: 10596 bytes --]

>From 37e4b8203d701d0340066c9f746f08188e74f679 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 22 Sep 2014 17:46:40 -0400
Subject: [PATCH 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an
 hvm_dirq_dpci model.

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_dirq_dpci' structure for the interrupt (PIRQ), set
a bit (masked), and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
domain', from which it iterates over the radix-tree of
'hvm_dirq_dpci' entries (from zero to the number of PIRQs allocated)
which are masked to the guest and calls each 'hvm_pirq_assist'.
If the PIRQ has a bit set (masked) it figures out how to
inject the PIRQ to the guest.

This is inefficient and not fair as:
 - We iterate starting at PIRQ 0 and up every time. That means
   the PCIe devices that have lower PIRQs get to be called
   first.
 - If we have many PCIe devices passed in with many PIRQs and
   if most of the time only the highest numbered PIRQ gets an
   interrupt (as the initial ones are for control) we end up
   iterating over many PIRQs.
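
To make the cost concrete, here is a minimal stand-alone sketch (not Xen code; all names are hypothetical) contrasting the old scan-everything model with the per-PIRQ model this patch moves to:

```c
#include <assert.h>

#define NR_PIRQS 64

/* Old model: the per-domain tasklet walks every PIRQ from 0 upward,
 * even when only one of them has its 'masked' bit set. */
static unsigned int scan_all(const int masked[NR_PIRQS], unsigned int *handled)
{
    unsigned int visited = 0, i;

    for ( i = 0; i < NR_PIRQS; i++ )
    {
        visited++;            /* every PIRQ is looked at... */
        if ( masked[i] )
            (*handled)++;     /* ...but only the masked one needs work */
    }
    return visited;
}

/* New model: the tasklet hangs off the 'hvm_pirq_dpci' itself, so only
 * the PIRQ that actually fired is visited. */
static unsigned int visit_one(void)
{
    return 1;
}
```

With 64 PIRQs and only the highest one firing, the old model touches all 64 entries per interrupt while the new model touches exactly one.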

But we could do better - the 'hvm_dirq_dpci' has the field for
'struct domain', which we can use instead of having to pass in the
'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_dirq_dpci'
and sets the 'dom' field to the domain. We also double-check
that the '->dom' is not reset before using it.

We have to be careful with this as that means we MUST have
'dom' set before pirq_guest_bind() is called. As such
we add the 'pirq_dpci->dom = d;' to cover such cases.

The mechanism to tear it down is more complex as there
are two ways it can be executed:

 a) pci_clean_dpci_irq. This gets called when the guest is
    being destroyed. We end up calling 'tasklet_kill'.

    The scenarios in which the 'struct pirq' (and subsequently
    the 'hvm_pirq_dpci') gets destroyed is when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill'
    which will spin on the tasklet if it is running.

 b) pt_irq_destroy_bind (guest disables the MSI). We double-check
    that the softirq has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra call to 'pt_pirq_softirq_active' in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels unbind first the
    event channel from PIRQs, so the 'pt_pirq_cleanup_check'
    won't be called as 'event' is set to zero. In that case
    we clean it up via mechanism a). It is OK
    to re-use the tasklet when 'pt_irq_create_bind' is called
    afterwards.

    There is an extra scenario regardless of event being
    set or not: the guest did 'pt_irq_destroy_bind' while an
    interrupt was triggered and tasklet was scheduled (but had not
    been run). It is OK to still run the tasklet as
    hvm_dirq_assist won't do anything (as the flags are
    set to zero). As such we can exit out of hvm_dirq_assist
    without doing anything.
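
As a rough illustration of path b), the reworked cleanup check can be modelled in isolation (hypothetical names and plain fields, not the real Xen types): only once all binding flags are gone may the tasklet be killed and the back-pointer cleared.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for 'struct hvm_pirq_dpci'; fields are illustrative only. */
struct fake_pirq_dpci {
    unsigned int flags;      /* zeroed by the destroy-bind path */
    void *dom;               /* stands in for 'struct domain *dom' */
    int tasklet_killed;      /* records that tasklet_kill() was reached */
};

/* Mirrors the shape of the patched pt_pirq_cleanup_check(): report the
 * structure as reclaimable (and tear the tasklet down) only when no
 * binding flags remain. */
static int fake_cleanup_check(struct fake_pirq_dpci *dpci)
{
    if ( !dpci->flags )
    {
        dpci->tasklet_killed = 1; /* tasklet_kill(&dpci->tasklet) */
        dpci->dom = NULL;
        return 1;
    }
    return 0;
}
```

While any flag is still set the structure is left alone, which is what lets a still-bound PIRQ safely re-use its tasklet on a later 'pt_irq_create_bind'.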

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 75 ++++++++++++++++++++++++++++++-------------
 xen/drivers/passthrough/pci.c |  4 +--
 xen/include/xen/hvm/irq.h     |  2 +-
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..dceb17e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -27,7 +27,7 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(unsigned long arg);
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +114,6 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -144,6 +141,18 @@ int pt_irq_create_bind(
                                HVM_IRQ_DPCI_GUEST_MSI;
             pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
             pirq_dpci->gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            /*
+             * 'pt_irq_create_bind' can be called after 'pt_irq_destroy_bind'.
+             * The 'pirq_cleanup_check' which would free the structure is only
+             * called if the event channel for the PIRQ is active. However
+             * OS-es that use event channels usually bind PIRQs to eventds
+             * and unbind them before calling 'pt_irq_destroy_bind' - with the
+             * result that we re-use the 'dpci' structure. This can be
+             * reproduced with unloading and loading the driver for a device.
+             *
+             * As such on every 'pt_irq_create_bind' call we MUST set it.
+             */
+            pirq_dpci->dom = d;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], info, 0);
             if ( rc == 0 && pt_irq_bind->u.msi.gtable )
@@ -156,6 +165,7 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
+                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
@@ -232,6 +242,7 @@ int pt_irq_create_bind(
         {
             unsigned int share;
 
+            /* MUST be set, as the pirq_dpci can be re-used. */
             pirq_dpci->dom = d;
             if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
             {
@@ -415,11 +426,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
+    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    return !dpci->flags;
+    if ( !dpci->flags )
+    {
+        tasklet_kill(&dpci->tasklet);
+        dpci->dom = NULL;
+        return 1;
+    }
+    return 0;
 }
 
 int pt_pirq_iterate(struct domain *d,
@@ -459,7 +477,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    tasklet_schedule(&pirq_dpci->tasklet);
     return 1;
 }
 
@@ -513,9 +531,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
-                            void *arg)
+static void hvm_dirq_assist(unsigned long arg)
 {
+    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
+    struct domain *d = pirq_dpci->dom;
+
+    /*
+     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
+     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
+     *
+     * And '->dom' gets cleared later in the destroy path. We exit and clear
+     * 'masked' - which is OK as later in this code we would
+     * do nothing except clear the ->masked field anyhow.
+     */
+    if ( !d )
+    {
+        pirq_dpci->masked = 0;
+        return;
+    }
+    ASSERT(d->arch.hvm_domain.irq.dpci);
+
+    spin_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -526,13 +562,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-                return 0;
+            {
+                spin_unlock(&d->event_lock);
+                return;
+            }
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -545,7 +585,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         /*
@@ -558,18 +599,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         ASSERT(pt_irq_need_timer(pirq_dpci->flags));
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
-
-    return 0;
-}
-
-static void hvm_dirq_assist(unsigned long _d)
-{
-    struct domain *d = (struct domain *)_d;
-
-    ASSERT(d->arch.hvm_domain.irq.dpci);
-
-    spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
     spin_unlock(&d->event_lock);
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..81e8a3a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
+    tasklet_kill(&pirq_dpci->tasklet);
+
     return 0;
 }
 
@@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..94a550a 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
+    struct tasklet tasklet;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
-- 
1.9.3


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-23  9:36   ` Jan Beulich
@ 2014-10-24  1:58     ` Konrad Rzeszutek Wilk
  2014-10-24 10:09       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-24  1:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

[-- Attachment #1: Type: text/plain, Size: 31258 bytes --]

On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> > +/*
> > + * These two bit states help to safely schedule, deschedule, and wait until
> > + * the softirq has finished.
> > + *
> > + * The semantics behind these two bits is as follow:
> > + *  - STATE_SCHED - whoever clears it has to ref-count the domain (->dom).
> 
> s/clears/modifies/
> 
> > + *  - STATE_RUN - only the softirq is allowed to set and clear it. If it has
> > + *      been set the hvm_dirq_assist will RUN with an saved value of the
> 
> s/ an / a / (and perhaps also ditch the first "the" on that line,
> and the similarly further down)
> 
> > +    STATE_SCHED, /* Bit 0 */
> 
> Bogus comment (effectively re-stating what the language
> specification says)?
> 
> > +/*
> > + * Should only be called from hvm_do_IRQ_dpci. We use the
> > + * 'state' as an gate to thwart multiple interrupts being scheduled.
> 
> s/ an / a /
> 
> > + *
> > + * The 'state' is cleared by 'softirq_dpci' when it has
> > + * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
> > + * if we want to try to unschedule the softirq before it runs.
> > + *
> 
> Stray blank comment line.

All above fixed.
> 
> > +/*
> > + * If we are racing with softirq_dpci (state is still set) we return
> > + * -ERESTART. Otherwise we return 0.
> > + *
> > + *  If it is -ERESTART, it is the callers responsibility to make sure
> > + *  that the softirq (with the event_lock dropped) has ran. We need
> > + *  to flush out the outstanding 'dpci_softirq' (no more of them
> > + *  will be added for this pirq as the IRQ action handler has been
> > + *  reset in pt_irq_destroy_bind).
> > + */
> > +int pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
> > +{
> > +    if ( pirq_dpci->state & (STATE_RUN | STATE_SCHED) )
> > +        return -ERESTART;
> > +
> > +    /*
> > +     * If in the future we would call 'raise_softirq_for' right away
> > +     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
> > +     * might have stale data).
> > +     */
> > +    return 0;
> > +}
> 
> Having this return -ERESTART and 0 rather than a simple boolean
> is kind of odd as long as there are no other errors possible here.

True. Replaced it with a bool_t. Was not sure whether you prefer 'true'
or 'false' instead of numbers - decided on numbers since most of the code-base
uses numbers.
> 
> > +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> > +{
> > +    struct domain *d = pirq_dpci->dom;
> > +
> > +    ASSERT(spin_is_locked(&d->event_lock));
> > +    /*
> > +     * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> > +     * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
> > +     * a local variable before it sets STATE_RUN - and therefore will not
> > +     * dereference '->dom' which would result in a crash.
> > +     */
> > +    if ( test_bit(STATE_RUN, &pirq_dpci->state) )
> > +    {
> > +        pirq_dpci->dom = NULL;
> > +        return;
> > +    }
> > +    /*
> > +     * We are going to try to de-schedule the softirq before it goes in
> > +     * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> > +     */
> > +    if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> > +    {
> > +        put_domain(d);
> > +        pirq_dpci->dom = NULL;
> > +    }
> 
> Would it not be easier to follow if instead of the two if()-s you
> used switch(cmpxchg(..., STATE_SCHED, 0)) here?

Yes, done. Though I was not clear on how to put the comments there

> 
> > @@ -128,6 +234,21 @@ int pt_irq_create_bind(
> >      }
> >      pirq_dpci = pirq_dpci(info);
> >      /*
> > +     * A crude 'while' loop with us dropping the spinlock and giving
> > +     * the softirq_dpci a chance to run.
> > +     * We MUST check for this condition as the softirq could be scheduled
> > +     * and hasn't run yet. We do this up to one second at which point we
> > +     * give up. Note that this code replaced tasklet_kill which would have
> > +     * spun forever and would do the same thing (wait to flush out
> > +     * outstanding hvm_dirq_assist calls.
> > +     */
> 
> Stale comment - there's no 1s timeout here anymore.

<nods>
> 
> > +    if ( pt_pirq_softirq_active(pirq_dpci) )
> > +    {
> > +        spin_unlock(&d->event_lock);
> > +        process_pending_softirqs();
> 
> ASSERT_NOT_IN_ATOMIC() between these two (the assertion
> process_pending_softirqs() does seems too weak for the purposes
> here, as we really shouldn't be holding any other spin locks; otoh
> it's not really clear to me why that aspect is different from
> do_softirq() - just fired off a separate mail)?

OK, for right now I have ASSERT_NOT_IN_ATOMIC in it.
> 
> > @@ -400,8 +535,13 @@ int pt_irq_destroy_bind(
> >          msixtbl_pt_unregister(d, pirq);
> >          if ( pt_irq_need_timer(pirq_dpci->flags) )
> >              kill_timer(&pirq_dpci->timer);
> > -        pirq_dpci->dom   = NULL;
> >          pirq_dpci->flags = 0;
> > +        /*
> > +         * Before the 'pirq_guest_unbind' had been called an interrupt could
> > +         * have been scheduled. No more of them are going to be scheduled after
> > +         * that but we must deal with the one that were put in the queue.
> > +         */
> 
> This is the second of third instance of this or a similar comment.
> Please have it just once in full, and have the other(s) simply refer
> to the full one.

I trimmed them down and also made reference to the top-most.

> 
> > @@ -652,3 +773,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
> >  unlock:
> >      spin_unlock(&d->event_lock);
> >  }
> > +
> > +static void dpci_softirq(void)
> > +{
> > +    struct hvm_pirq_dpci *pirq_dpci;
> > +    unsigned int cpu = smp_processor_id();
> > +    LIST_HEAD(our_list);
> > +
> > +    local_irq_disable();
> > +    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
> > +    local_irq_enable();
> > +
> > +    while ( !list_empty(&our_list) )
> > +    {
> > +        struct domain *d;
> 
> Considering that you already have one non-function scope variable
> here, please use the other one only needed inside this loop
> (pirq_dpci) here too.
> 
> > +
> > +        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
> > +        list_del(&pirq_dpci->softirq_list);
> > +
> > +        d = pirq_dpci->dom;
> > +        smp_wmb(); /* 'd' MUST be saved before we set/clear the bits. */
> 
> So you state the right thing, but use the wrong barrier: To order a
> read and a write, smp_mb() is your only choice.

Done.
> 
> > -void pci_release_devices(struct domain *d)
> > +int pci_release_devices(struct domain *d)
> >  {
> >      struct pci_dev *pdev;
> >      u8 bus, devfn;
> > +    int ret;
> >  
> >      spin_lock(&pcidevs_lock);
> > -    pci_clean_dpci_irqs(d);
> > +    ret = pci_clean_dpci_irqs(d);
> > +    if ( ret == -EAGAIN )
> 
> -ERESTART?
> 
> > --- a/xen/include/xen/hvm/irq.h
> > +++ b/xen/include/xen/hvm/irq.h
> > @@ -93,13 +93,13 @@ struct hvm_irq_dpci {
> >  /* Machine IRQ to guest device/intx mapping. */
> >  struct hvm_pirq_dpci {
> >      uint32_t flags;
> > -    bool_t masked;
> > +    unsigned long state;
> 
> I think "unsigned int" would be sufficient here?

Yes.

See inline and attached patch pls.

>From c958ee7d4350584a1d6654615303819bf987b8e8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH 2/2] dpci: Replace tasklet with an softirq (v9)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to be in sub 1ms intervals (causing said tasklet
to be scheduled for such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

While other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guests' BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on a 8 socket machine with
PCI passthrough, where four sockets are busy with interrupts,
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of its timeslice (3ms out of 30 ms) being tied up
taking the lock. The latency of PCI interrupts delivered
to guests also suffers.

With this patch the problem disappears completely.
That is, it removes the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of only running one 'hvm_dirq_assist' for only
one 'hvm_pirq_dpci' is also preserved by having
'schedule_dpci_for' ignore any subsequent calls for a domain
which has already been scheduled.
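
The threading described above can be sketched in isolation (single CPU, no locking, hypothetical names; the real code uses Xen's list primitives with interrupts disabled) as a pending list that the softirq handler splices away and drains:

```c
#include <assert.h>
#include <stddef.h>

struct fake_pirq_dpci {
    struct fake_pirq_dpci *next;   /* stands in for 'softirq_list' */
    int assisted;                  /* set when the "assist" work ran */
    int queued;                    /* guards against double-queueing */
};

static struct fake_pirq_dpci *pending; /* stands in for this_cpu(dpci_list) */

/* Like raise_softirq_for(): queue each pirq_dpci at most once. */
static void fake_raise_softirq_for(struct fake_pirq_dpci *dpci)
{
    if ( dpci->queued )
        return;                    /* already scheduled: nothing to do */
    dpci->queued = 1;
    dpci->next = pending;          /* the real code appends under IRQs off */
    pending = dpci;
}

/* Like dpci_softirq(): splice the per-CPU list away, then drain it. */
static void fake_dpci_softirq(void)
{
    struct fake_pirq_dpci *ours = pending; /* list_splice_init() */

    pending = NULL;
    for ( ; ours != NULL; ours = ours->next )
    {
        ours->queued = 0;
        ours->assisted = 1;        /* would call hvm_dirq_assist() here */
    }
}
```

Because the 'queued' guard is checked before linking, a second interrupt arriving before the softirq runs does not add a duplicate entry, mirroring the STATE_SCHED gate in the patch.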

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind' the '->dom' is whatever it had last time.

As this is the initial call (which QEMU ends up calling when the
guest writes a vector value in the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
reports the softirq as still active, there is an outstanding
softirq that needs to finish running before we can continue
tearing down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and us
   doing the ref-counting).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_reset'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
calling 'process_pending_softirqs'. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure.

     - we hit once the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and softirq was scheduled.

If the softirq is in STATE_RUN that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
the '->dom' is left to a), b) or again c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.
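
The STATE_SCHED/STATE_RUN hand-off described above can be modelled single-threaded (hypothetical names; plain ints stand in for the atomic bitops and cmpxchg used in the patch):

```c
#include <assert.h>
#include <stddef.h>

enum { STATE_SCHED = 1u << 0, STATE_RUN = 1u << 1 };

struct fake_pirq_dpci {
    unsigned int state;
    int refs;                  /* stands in for the domain refcount */
    void *dom;                 /* stands in for 'struct domain *dom' */
};

/* Like raise_softirq_for(): only the caller that flips STATE_SCHED on
 * takes a domain reference. */
static int fake_raise(struct fake_pirq_dpci *d)
{
    if ( d->state & STATE_SCHED )
        return 0;              /* already queued: no extra ref */
    d->state |= STATE_SCHED;
    d->refs++;                 /* get_knownalive_domain(d->dom) */
    return 1;
}

/* Like pt_pirq_softirq_reset(): if we de-schedule the softirq before it
 * ran (cmpxchg STATE_SCHED -> 0 succeeds), we drop the reference the
 * softirq would otherwise have dropped; if it is already in STATE_RUN it
 * has stashed 'dom' locally, so clearing the field is safe either way. */
static void fake_reset(struct fake_pirq_dpci *d)
{
    if ( d->state == STATE_SCHED ) /* models cmpxchg(&state, SCHED, 0) */
    {
        d->state = 0;
        d->refs--;             /* put_domain(d->dom) */
        d->dom = NULL;
    }
    else if ( d->state & STATE_RUN )
        d->dom = NULL;
}
```

The invariant this preserves is the one the commit message states: whoever clears STATE_SCHED owns the refcount drop, so '->dom' is always valid while either bit is set.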

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 251 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 253 insertions(+), 43 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..342406e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,119 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never run and that the softirq did not do any
+ * ref-counting.
+ */
+enum {
+    STATE_SCHED,
+    STATE_RUN,
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'softirq_dpci' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ *  If it is false, it is the callers responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has ran. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & (STATE_RUN | STATE_SCHED) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do
+ * at the right time and if unschedules the softirq before it has
+ * run it also refcounts (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
+    {
+        /*
+         * We are going to try to de-schedule the softirq before it goes in
+         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         */
+        case STATE_SCHED:
+            put_domain(d);
+            /* fallthrough. */
+        /*
+         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
+         * a local variable before it sets STATE_RUN - and therefore will not
+         * dereference '->dom' which would result in a crash.
+        */
+        case STATE_RUN:
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +145,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +206,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +233,21 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the dpci_softirq a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and not have run yet. Note that this code replaced tasklet_kill, which
+     * would have spun forever doing the same thing (waiting to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        ASSERT_NOT_IN_ATOMIC();
+        process_pending_softirqs();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +285,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between 'pirq_guest_bind' and 'pirq_guest_unbind' an
+                 * interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +395,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +532,12 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +560,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +608,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +662,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +769,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..d612cfa 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci);
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3


[-- Attachment #2: 0002-dpci-Replace-tasklet-with-an-softirq-v9.patch --]
[-- Type: text/plain, Size: 23906 bytes --]

From c958ee7d4350584a1d6654615303819bf987b8e8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH 2/2] dpci: Replace tasklet with an softirq (v9)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

We inject PCI interrupts via hvm_do_IRQ_dpci, which
schedules the hvm_dirq_assist tasklet every time an
interrupt is received. The call chain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in the idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED state); then to execute it - which is
accounted for in the guest's RUNTIME state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up in
taking the lock. The latency of the PCI interrupts
delivered to the guest is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock from the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structures
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' for any
one 'hvm_pirq_dpci' is also preserved, by having
'raise_softirq_for' ignore any subsequent calls for an
'hvm_pirq_dpci' which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(the first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind' the ->dom is whatever it had last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value in the MSI field) we MUST set the
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns true, that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed are when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and a softirq was scheduled (but had
   not yet run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However, we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active',
       which checks whether the softirq is still scheduled.
       Imagine 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
calling 'process_pending_softirqs'. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and a softirq was scheduled.

If the softirq is in STATE_RUN that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is then left to case a), b), or again c).

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 251 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 253 insertions(+), 43 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..342406e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,119 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never ran and that the softirq did not do any
+ * ref-counting.
+ */
+enum {
+    STATE_SCHED,
+    STATE_RUN,
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist', or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with dpci_softirq (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ * If it is false, it is the caller's responsibility to make sure
+ * that the softirq (with the event_lock dropped) has run. We need
+ * to flush out the outstanding 'dpci_softirq' (no more of them
+ * will be added for this pirq as the IRQ action handler has been
+ * reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do it
+ * at the right time. If it unschedules the softirq before it has run,
+ * it also does the ref-counting (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+        /*
+         * We are going to try to de-schedule the softirq before it goes in
+         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         */
+        case (1 << STATE_SCHED):
+            put_domain(d);
+            /* fallthrough. */
+        /*
+         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
+         * a local variable before it sets STATE_RUN - and therefore will not
+         * dereference '->dom' which would result in a crash.
+         */
+        case (1 << STATE_RUN):
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +145,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +206,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +233,21 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the dpci_softirq a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and not have run yet. Note that this code replaced tasklet_kill, which
+     * would have spun forever doing the same thing (waiting to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        ASSERT_NOT_IN_ATOMIC();
+        process_pending_softirqs();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +285,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between 'pirq_guest_bind' and 'pirq_guest_unbind' an
+                 * interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +395,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +532,12 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +560,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +608,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +662,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +769,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..d612cfa 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci);
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-24  1:58     ` Konrad Rzeszutek Wilk
@ 2014-10-24  9:49       ` Jan Beulich
  2014-10-24 19:09         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-24  9:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> On Thu, Oct 23, 2014 at 09:58:34AM +0100, Jan Beulich wrote:
>> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
>> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>> >              {
>> >                  pirq_dpci->gmsi.gflags = 0;
>> >                  pirq_dpci->gmsi.gvec = 0;
>> > +                pirq_dpci->dom = NULL;
>> >                  pirq_dpci->flags = 0;
>> >                  pirq_cleanup_check(info, d);
>> >                  spin_unlock(&d->event_lock);
>> 
>> Just like this error path needing adjustment, the other one following
>> failure of pirq_guest_bind() (after
>> 
>> > @@ -232,7 +242,6 @@ int pt_irq_create_bind(
>> >          {
>> >              unsigned int share;
>> >  
>> > -            pirq_dpci->dom = d;
>> >              if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
>> >              {
>> >                  pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
>> 
>> ) would seem to need adjustment too.
> 
> However I am at lost of what you mean here. If by adjustment you mean
> leave it alone I concur on the later.

Indeed I managed to overlook that ->dom is being cleared there
already. Over time I've been trying to make the code in this file at
least a little more legible, but it's still lacking: in the case at hand,
proper grouping of the cleanup done on the different data structures
would probably have made this more obvious, e.g. having the clearing
of pirq_dpci->dom and pirq_dpci->flags side by side.

> @@ -144,6 +141,18 @@ int pt_irq_create_bind(
>                                 HVM_IRQ_DPCI_GUEST_MSI;
>              pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>              pirq_dpci->gmsi.gflags = pt_irq_bind->u.msi.gflags;
> +            /*
> +             * 'pt_irq_create_bind' can be called after 'pt_irq_destroy_bind'.
> +             * The 'pirq_cleanup_check' which would free the structure is only
> +             * called if the event channel for the PIRQ is active. However
> +             * OS-es that use event channels usually bind PIRQs to eventds
> +             * and unbind them before calling 'pt_irq_destroy_bind' - with the
> +             * result that we re-use the 'dpci' structure. This can be
> +             * reproduced with unloading and loading the driver for a device.
> +             *
> +             * As such on every 'pt_irq_create_bind' call we MUST set it.
> +             */
> +            pirq_dpci->dom = d;
>              /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
>              rc = pirq_guest_bind(d->vcpu[0], info, 0);
>              if ( rc == 0 && pt_irq_bind->u.msi.gtable )
> @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>              {
>                  pirq_dpci->gmsi.gflags = 0;
>                  pirq_dpci->gmsi.gvec = 0;
> +                pirq_dpci->dom = NULL;
>                  pirq_dpci->flags = 0;
>                  pirq_cleanup_check(info, d);
>                  spin_unlock(&d->event_lock);

Wait - is this correct even when pirq_guest_bind() succeeded but
msixtbl_pt_register() failed? At the first glance I would say no. But
apart from that needing sorting out I think the patch is fine now
(and I hope the latest re-shuffling didn't break anything).

Jan


* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-24  1:58     ` Konrad Rzeszutek Wilk
@ 2014-10-24 10:09       ` Jan Beulich
  2014-10-24 20:55         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-24 10:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
>> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> Was not sure whether you prefer 'true'
> or 'false' instead of numbers - decided on numbers since most of the code-base
> uses numbers.

So far we don't use "true" and "false" in hypervisor code at all (or if
you spotted any such use, it slipped in by mistake). We ought to
consider switching to bool/true/false though.

> +/*
> + * Reset the pirq_dpci->dom parameter to NULL.
> + *
> + * This function checks the different states to make sure it can do
> + * at the right time and if unschedules the softirq before it has
> + * run it also refcounts (which is what the softirq would have done).
> + */
> +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    struct domain *d = pirq_dpci->dom;
> +
> +    ASSERT(spin_is_locked(&d->event_lock));
> +    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
> +    {
> +        /*
> +         * We are going to try to de-schedule the softirq before it goes in
> +         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> +         */
> +        case STATE_SCHED:
> +            put_domain(d);
> +            /* fallthrough. */
> +        /*
> +         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> +         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
> +         * a local variable before it sets STATE_RUN - and therefore will not
> +         * dereference '->dom' which would result in a crash.
> +        */
> +        case STATE_RUN:
> +            pirq_dpci->dom = NULL;
> +            break;
> +        default:
> +            break;
> +    }

Apart from the indentation being wrong (case labels having the same
indentation as the switch()'s opening brace), this doesn't seem to be a
proper equivalent of the former code: There you cleared ->dom when
STATE_RUN regardless of STATE_SCHED. Leaving out the comments
I'd suggest

    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
    {
    case STATE_SCHED:
        put_domain(d);
    case STATE_RUN: case STATE_SCHED|STATE_RUN:
        pirq_dpci->dom = NULL;
        break;
    default:
        BUG();
    case 0:
        break;
    }

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
>          xfree(digl);
>      }
>  
> -    tasklet_kill(&pirq_dpci->tasklet);
> -
> -    return 0;
> +    return pt_pirq_softirq_active(pirq_dpci);

This function returns a bool_t now, but the (indirect) caller of this
function expects -ERESTART.

Jan


* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-24  9:49       ` Jan Beulich
@ 2014-10-24 19:09         ` Konrad Rzeszutek Wilk
  2014-10-27  9:25           ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-24 19:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Fri, Oct 24, 2014 at 10:49:32AM +0100, Jan Beulich wrote:
> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> > On Thu, Oct 23, 2014 at 09:58:34AM +0100, Jan Beulich wrote:
> >> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> >> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
> >> >              {
> >> >                  pirq_dpci->gmsi.gflags = 0;
> >> >                  pirq_dpci->gmsi.gvec = 0;
> >> > +                pirq_dpci->dom = NULL;
> >> >                  pirq_dpci->flags = 0;
> >> >                  pirq_cleanup_check(info, d);
> >> >                  spin_unlock(&d->event_lock);
> >> 
> >> Just like this error path needing adjustment, the other one following
> >> failure of pirq_guest_bind() (after
> >> 
> >> > @@ -232,7 +242,6 @@ int pt_irq_create_bind(
> >> >          {
> >> >              unsigned int share;
> >> >  
> >> > -            pirq_dpci->dom = d;
> >> >              if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
> >> >              {
> >> >                  pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
> >> 
> >> ) would seem to need adjustment too.
> > 
> > However I am at lost of what you mean here. If by adjustment you mean
> > leave it alone I concur on the later.
> 
> Indeed I managed to overlook that ->dom is being cleared there
> already. Over time I've been trying to make the code in this file at
> least a little more legible, but it's still lacking (i.e. in the case here
> proper grouping of the cleanup done on the different data
> structures would probably have made this more obvious, e.g. in
> the case at hand having the clearing of pirq_dpci->dom and
> pirq_dpci->flags side by side).
> 
> > @@ -144,6 +141,18 @@ int pt_irq_create_bind(
> >                                 HVM_IRQ_DPCI_GUEST_MSI;
> >              pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
> >              pirq_dpci->gmsi.gflags = pt_irq_bind->u.msi.gflags;
> > +            /*
> > +             * 'pt_irq_create_bind' can be called after 'pt_irq_destroy_bind'.
> > +             * The 'pirq_cleanup_check' which would free the structure is only
> > +             * called if the event channel for the PIRQ is active. However
> > +             * OS-es that use event channels usually bind PIRQs to eventds
> > +             * and unbind them before calling 'pt_irq_destroy_bind' - with the
> > +             * result that we re-use the 'dpci' structure. This can be
> > +             * reproduced with unloading and loading the driver for a device.
> > +             *
> > +             * As such on every 'pt_irq_create_bind' call we MUST set it.
> > +             */
> > +            pirq_dpci->dom = d;
> >              /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
> >              rc = pirq_guest_bind(d->vcpu[0], info, 0);
> >              if ( rc == 0 && pt_irq_bind->u.msi.gtable )
> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
> >              {
> >                  pirq_dpci->gmsi.gflags = 0;
> >                  pirq_dpci->gmsi.gvec = 0;
> > +                pirq_dpci->dom = NULL;
> >                  pirq_dpci->flags = 0;
> >                  pirq_cleanup_check(info, d);
> >                  spin_unlock(&d->event_lock);
> 
> Wait - is this correct even when pirq_guest_bind() succeeded but
> msixtbl_pt_register() failed? At the first glance I would say no. But

Keep in mind that if 'msixtbl_pt_register' fails we immediately call
'pirq_guest_unbind' and then land in here.

> apart from that needing sorting out I think the patch is fine now
> (and I hope the latest re-shuffling didn't break anything).

Whew. I think so too.
> 
> Jan
> 


* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-24 10:09       ` Jan Beulich
@ 2014-10-24 20:55         ` Konrad Rzeszutek Wilk
  2014-10-25  0:39           ` Konrad Rzeszutek Wilk
  2014-10-27  9:32           ` Jan Beulich
  0 siblings, 2 replies; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-24 20:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Fri, Oct 24, 2014 at 11:09:59AM +0100, Jan Beulich wrote:
> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> > On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
> >> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> > Was not sure whether you prefer 'true'
> > or 'false' instead of numbers - decided on numbers since most of the code-base
> > uses numbers.
> 
> So far we don't use "true" and "false" in hypervisor code at all (or if
> you spotted any such use, it slipped in by mistake). We ought to
> consider switching to bool/true/false though.

The dec_lzma2 had it.
> 
> > +/*
> > + * Reset the pirq_dpci->dom parameter to NULL.
> > + *
> > + * This function checks the different states to make sure it can do
> > + * at the right time and if unschedules the softirq before it has
> > + * run it also refcounts (which is what the softirq would have done).
> > + */
> > +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> > +{
> > +    struct domain *d = pirq_dpci->dom;
> > +
> > +    ASSERT(spin_is_locked(&d->event_lock));
> > +    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
> > +    {
> > +        /*
> > +         * We are going to try to de-schedule the softirq before it goes in
> > +         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> > +         */
> > +        case STATE_SCHED:
> > +            put_domain(d);
> > +            /* fallthrough. */
> > +        /*
> > +         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> > +         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
> > +         * a local variable before it sets STATE_RUN - and therefore will not
> > +         * dereference '->dom' which would result in a crash.
> > +        */
> > +        case STATE_RUN:
> > +            pirq_dpci->dom = NULL;
> > +            break;
> > +        default:
> > +            break;
> > +    }
> 
> Apart from the indentation being wrong (case labels having the same
> indentation as the switch()'s opening brace), this doesn't seem to be a
> proper equivalent of the former code: There you cleared ->dom when
> STATE_RUN regardless of STATE_SCHED. Leaving out the comments
> I'd suggest
> 
>     switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
>     {
>     case STATE_SCHED:
>         put_domain(d);
>     case STATE_RUN: case STATE_SCHED|STATE_RUN:

.. which made me realize that testing of values as opposed to
bit positions requires ditching the 'enum' and introducing
STATE_SCHED_BIT, STATE_SCHED, STATE_RUN_BIT, and STATE_RUN
to complement each other when checking for values or setting bits.

>         pirq_dpci->dom = NULL;
>         break;
>     default:
>         BUG();
>     case 0:
>         break;
>     }
> 
> > --- a/xen/drivers/passthrough/pci.c
> > +++ b/xen/drivers/passthrough/pci.c
> > @@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
> >          xfree(digl);
> >      }
> >  
> > -    tasklet_kill(&pirq_dpci->tasklet);
> > -
> > -    return 0;
> > +    return pt_pirq_softirq_active(pirq_dpci);
> 
> This function returns a bool_t now, but the (indirect) caller of this
> function expects -ERESTART.

Fixed up.


I added some extra code so that I could reliably trigger the error paths
and got:

(XEN) pt_pirq_softirq_active: is 0 (debug: 1)

This is the first ever use of 'pt_irq_create_bind', and for fun the
code returns 'false':

(XEN) Assertion '!preempt_count()' failed at preempt.c:37
(XEN) ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    9
(XEN) RIP:    e008:[<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: ffff82d080320d20   rbx: ffff8302fb126470   rcx: 0000000000000001
(XEN) rdx: 00000032b8d99300   rsi: 000000000000000a   rdi: ffff8303149670b8
(XEN) rbp: ffff8303390a7bd8   rsp: ffff8303390a7bd8   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 00000000fffffffd   r11: ffff82d08023e5e0
(XEN) r12: ffff83025c126700   r13: ffff830314967000   r14: 0000000000000030
(XEN) r15: ffff83025c126728   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 000000031b4b7000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff8303390a7bd8:
(XEN)    ffff8303390a7c98 ffff82d080149335 ffff8303390a7ca8 ffff82d08016d0e6
(XEN)    0000000000000001 0000000000000206 00000003c0027c38 ffff8303390a7d88
(XEN)    00000072390a7c68 ffff830312e5f4d0 0000000000000000 0000001000000003
(XEN)    0000000000000246 ffff8303390a7c58 ffff82d08012ce66 ffff8303390a7e48
(XEN)    0000007f390a7c88 ffff8303149670b8 fffffffffffffffd fffffffffffffffd
(XEN)    00007f81af369004 ffff830314967000 ffff8303390a7e38 0000000000000008
(XEN)    ffff8303390a7d78 ffff82d080160131 ffff8303390a7cd8 ffff830302526f10
(XEN)    ffff8303390a7ce8 0000000000000073 ffff830330907324 ffff830330907300
(XEN)    ffff8303390a7d08 ffff82d08016dc79 ffff8303390a7cf8 ffff82d08013b469
(XEN)    ffff8303390a7d28 ffff82d08016e628 ffff8303390a7d28 ffff830314967000
(XEN)    000000000000007f 0000000000000206 ffff8303390a7dc8 ffff82d0801711dc
(XEN)    ffff8303390a7d88 ffff82d08011e203 0000000000000202 fffffffffffffffd
(XEN)    00007f81af369004 ffff830314967000 ffff8800331eb488 0000000000000008
(XEN)    ffff8303390a7ef8 ffff82d0801048e8 ffff830302526f10 ffff83025c126700
(XEN)    ffff8303390a7dc8 ffff830314967000 00000000ffffffff 0000000000000000
(XEN)    ffff8303390a7e98 ffff8303390a7e70 ffff8303390a7e48 ffff82d080184019
(XEN)    ffff8303390a7f18 ffffffff8158b0de ffff8303390a7e98 ffff8303149670b8
(XEN)    ffff83030000007f ffff82d080191105 000000730000f800 ffff8303390a7e74
(XEN)    ffff8300bf52e000 000000000000000d 00007f81af369004 ffff8300bf52e000
(XEN)    0000000a00000026 0000000000f70002 000000020000007f 00007f8100000002
(XEN) Xen call trace:
(XEN)    [<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
(XEN)    [<ffff82d080149335>] pt_irq_create_bind+0xf7/0x5c2
(XEN)    [<ffff82d080160131>] arch_do_domctl+0x1131/0x23e0
(XEN)    [<ffff82d0801048e8>] do_domctl+0x1934/0x1a9c
(XEN)    [<ffff82d08022c71b>] syscall_enter+0xeb/0x145

which is entirely due to holding the 'domctl_lock_acquire',
'rcu_read_lock', and 'pcidevs_lock' locks.

It seems that the approach of waiting by kicking 'process_pending_softirqs'
is not good, as other softirqs might want to grab any of those locks
at some point and be unhappy.

Ugh.

> 
> Jan
> 


* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-24 20:55         ` Konrad Rzeszutek Wilk
@ 2014-10-25  0:39           ` Konrad Rzeszutek Wilk
  2014-10-27  9:36             ` Jan Beulich
  2014-10-27  9:32           ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-25  0:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Fri, Oct 24, 2014 at 04:55:44PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 24, 2014 at 11:09:59AM +0100, Jan Beulich wrote:
> > >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> > > On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
> > >> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
> > > Was not sure whether you prefer 'true'
> > > or 'false' instead of numbers - decided on numbers since most of the code-base
> > > uses numbers.
> > 
> > So far we don't use "true" and "false" in hypervisor code at all (or if
> > you spotted any such use, it slipped in by mistake). We ought to
> > consider switching to bool/true/false though.
> 
> The dec_lzma2 had it.
> > 
> > > +/*
> > > + * Reset the pirq_dpci->dom parameter to NULL.
> > > + *
> > > + * This function checks the different states to make sure it can do
> > > + * at the right time and if unschedules the softirq before it has
> > > + * run it also refcounts (which is what the softirq would have done).
> > > + */
> > > +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> > > +{
> > > +    struct domain *d = pirq_dpci->dom;
> > > +
> > > +    ASSERT(spin_is_locked(&d->event_lock));
> > > +    switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
> > > +    {
> > > +        /*
> > > +         * We are going to try to de-schedule the softirq before it goes in
> > > +         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> > > +         */
> > > +        case STATE_SCHED:
> > > +            put_domain(d);
> > > +            /* fallthrough. */
> > > +        /*
> > > +         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
> > > +         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom' in
> > > +         * a local variable before it sets STATE_RUN - and therefore will not
> > > +         * dereference '->dom' which would result in a crash.
> > > +        */
> > > +        case STATE_RUN:
> > > +            pirq_dpci->dom = NULL;
> > > +            break;
> > > +        default:
> > > +            break;
> > > +    }
> > 
> > Apart from the indentation being wrong (case labels having the same
> > indentation as the switch()'s opening brace), this doesn't seem to be a
> > proper equivalent of the former code: There you cleared ->dom when
> > STATE_RUN regardless of STATE_SCHED. Leaving out the comments
> > I'd suggest
> > 
> >     switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
> >     {
> >     case STATE_SCHED:
> >         put_domain(d);
> >     case STATE_RUN: case STATE_SCHED|STATE_RUN:
> 
> .. which made me realize that to testing of values as opposed to
> bit positions requires ditching the 'enum' and introducing an
> STATE_SCHED_BIT, STATE_SCHED, STATE_RUN_BIT, and STATE_RUN_BIT
> to complement each other when checking for values or setting bits.
> 
> >         pirq_dpci->dom = NULL;
> >         break;
> >     default:
> >         BUG();
> >     case 0:
> >         break;
> >     }
> > 
> > > --- a/xen/drivers/passthrough/pci.c
> > > +++ b/xen/drivers/passthrough/pci.c
> > > @@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
> > >          xfree(digl);
> > >      }
> > >  
> > > -    tasklet_kill(&pirq_dpci->tasklet);
> > > -
> > > -    return 0;
> > > +    return pt_pirq_softirq_active(pirq_dpci);
> > 
> > This function returns a bool_t now, but the (indirect) caller of this
> > function expects -ERESTART.
> 
> Fixed up.
> 
> 
> I added some extra code so that I could reliabily trigger the error paths
> and got:
> 
> (XEN) pt_pirq_softirq_active: is 0 (debug: 1)
> 
> This is the first ever usage of pt_pirq_create_bind and for fun the
> code returns 'false'
> 
> [... stack trace snipped, identical to the one quoted earlier ...]
> 
> which is entirely due to holding the 'domctl_lock_acquire','
> 'rcu_read_lock', and 'pcidevs_lock' lock.
> 
> It seems that the approach of waiting for kicking 'process_pending_softirq'
> is not good as other softirq might want to grab any of those locks
> at some point and be unhappy.
> 
> Ugh.

The original code's (i.e. the current) life-cycle was that the tasklet would be
initialized the moment the guest tried to set up an MSI (or PCIe) interrupt. It
would then be cleaned up (spinning forever waiting for the tasklet to finish)
when the domain was being destroyed. So it all revolved around 'struct domain'.

With these patches we put this life-cycle around each specific PIRQ. The
life-cycle starts when the interrupt is set up and ends either when the
interrupt is disabled or the guest is shut down. However, due to the
asynchronous nature of this we MUST quiesce the softirq before we set up
the PIRQ. That is:
pt_irq_destroy_bind -> pt_irq_create_bind [must have the softirq stopped].

And that is so that we can do the reset of the 'hvm_pirq_dpci' structure
properly (in case we can actually clean it up in the error paths of
'pt_irq_create_bind').

We can stick the 'quiesce' part at the end of 'pt_irq_destroy_bind' or
at the start of 'pt_irq_create_bind'. And the only way I can think of doing
this quiesce is:

     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
         spin_unlock(&d->event_lock);
-        ASSERT_NOT_IN_ATOMIC();
-        process_pending_softirqs();
+        cpu_relax();
         goto restart;
     }

Which would replicate what a 'tasklet_kill' does.
> 
> > 
> > Jan
> > 


* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-24 19:09         ` Konrad Rzeszutek Wilk
@ 2014-10-27  9:25           ` Jan Beulich
  2014-10-27 16:36             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-27  9:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 24.10.14 at 21:09, <konrad.wilk@oracle.com> wrote:
> On Fri, Oct 24, 2014 at 10:49:32AM +0100, Jan Beulich wrote:
>> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
>> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>> >              {
>> >                  pirq_dpci->gmsi.gflags = 0;
>> >                  pirq_dpci->gmsi.gvec = 0;
>> > +                pirq_dpci->dom = NULL;
>> >                  pirq_dpci->flags = 0;
>> >                  pirq_cleanup_check(info, d);
>> >                  spin_unlock(&d->event_lock);
>> 
>> Wait - is this correct even when pirq_guest_bind() succeeded but
>> msixtbl_pt_register() failed? At the first glance I would say no. But
> 
> Keep in mind that if 'msixtbl_pt_register' fails we immediately call
> 'pirq_guest_unbind' and then land in here.

Of course. But there was a window where the interrupt was
bound (and hence potentially got triggered).

Jan


* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-24 20:55         ` Konrad Rzeszutek Wilk
  2014-10-25  0:39           ` Konrad Rzeszutek Wilk
@ 2014-10-27  9:32           ` Jan Beulich
  2014-10-27 10:40             ` Andrew Cooper
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-27  9:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 24.10.14 at 22:55, <konrad.wilk@oracle.com> wrote:
> On Fri, Oct 24, 2014 at 11:09:59AM +0100, Jan Beulich wrote:
>> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
>> > On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
>> >> >>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
>> > Was not sure whether you prefer 'true'
>> > or 'false' instead of numbers - decided on numbers since most of the 
> code-base
>> > uses numbers.
>> 
>> So far we don't use "true" and "false" in hypervisor code at all (or if
>> you spotted any such use, it slipped in by mistake). We ought to
>> consider switching to bool/true/false though.
> 
> The dec_lzma2 had it.

With its private.h having

#define false 0
#define true 1

>> Apart from the indentation being wrong (case labels having the same
>> indentation as the switch()'s opening brace), this doesn't seem to be a
>> proper equivalent of the former code: There you cleared ->dom when
>> STATE_RUN regardless of STATE_SCHED. Leaving out the comments
>> I'd suggest
>> 
>>     switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
>>     {
>>     case STATE_SCHED:
>>         put_domain(d);
>>     case STATE_RUN: case STATE_SCHED|STATE_RUN:
> 
> .. which made me realize that testing values as opposed to
> bit positions requires ditching the 'enum' and introducing
> STATE_SCHED_BIT, STATE_SCHED, STATE_RUN_BIT, and STATE_RUN
> to complement each other when checking for values or setting bits.

No, keep the enum and just use 1 << STATE_... here.

> I added some extra code so that I could reliably trigger the error paths
> and got:
> 
> (XEN) pt_pirq_softirq_active: is 0 (debug: 1)
> 
> This is the first ever usage of pt_irq_create_bind and for fun the
> code returns 'false'
> 
> (XEN) Assertion '!preempt_count()' failed at preempt.c:37
> (XEN) ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    9
> (XEN) RIP:    e008:[<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
> (XEN) rax: ffff82d080320d20   rbx: ffff8302fb126470   rcx: 0000000000000001
> (XEN) rdx: 00000032b8d99300   rsi: 000000000000000a   rdi: ffff8303149670b8
> (XEN) rbp: ffff8303390a7bd8   rsp: ffff8303390a7bd8   r8:  0000000000000000
> (XEN) r9:  0000000000000000   r10: 00000000fffffffd   r11: ffff82d08023e5e0
> (XEN) r12: ffff83025c126700   r13: ffff830314967000   r14: 0000000000000030
> (XEN) r15: ffff83025c126728   cr0: 0000000080050033   cr4: 00000000000026f0
> (XEN) cr3: 000000031b4b7000   cr2: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff8303390a7bd8:
> (XEN)    ffff8303390a7c98 ffff82d080149335 ffff8303390a7ca8 ffff82d08016d0e6
> (XEN)    0000000000000001 0000000000000206 00000003c0027c38 ffff8303390a7d88
> (XEN)    00000072390a7c68 ffff830312e5f4d0 0000000000000000 0000001000000003
> (XEN)    0000000000000246 ffff8303390a7c58 ffff82d08012ce66 ffff8303390a7e48
> (XEN)    0000007f390a7c88 ffff8303149670b8 fffffffffffffffd fffffffffffffffd
> (XEN)    00007f81af369004 ffff830314967000 ffff8303390a7e38 0000000000000008
> (XEN)    ffff8303390a7d78 ffff82d080160131 ffff8303390a7cd8 ffff830302526f10
> (XEN)    ffff8303390a7ce8 0000000000000073 ffff830330907324 ffff830330907300
> (XEN)    ffff8303390a7d08 ffff82d08016dc79 ffff8303390a7cf8 ffff82d08013b469
> (XEN)    ffff8303390a7d28 ffff82d08016e628 ffff8303390a7d28 ffff830314967000
> (XEN)    000000000000007f 0000000000000206 ffff8303390a7dc8 ffff82d0801711dc
> (XEN)    ffff8303390a7d88 ffff82d08011e203 0000000000000202 fffffffffffffffd
> (XEN)    00007f81af369004 ffff830314967000 ffff8800331eb488 0000000000000008
> (XEN)    ffff8303390a7ef8 ffff82d0801048e8 ffff830302526f10 ffff83025c126700
> (XEN)    ffff8303390a7dc8 ffff830314967000 00000000ffffffff 0000000000000000
> (XEN)    ffff8303390a7e98 ffff8303390a7e70 ffff8303390a7e48 ffff82d080184019
> (XEN)    ffff8303390a7f18 ffffffff8158b0de ffff8303390a7e98 ffff8303149670b8
> (XEN)    ffff83030000007f ffff82d080191105 000000730000f800 ffff8303390a7e74
> (XEN)    ffff8300bf52e000 000000000000000d 00007f81af369004 ffff8300bf52e000
> (XEN)    0000000a00000026 0000000000f70002 000000020000007f 00007f8100000002
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
> (XEN)    [<ffff82d080149335>] pt_irq_create_bind+0xf7/0x5c2
> (XEN)    [<ffff82d080160131>] arch_do_domctl+0x1131/0x23e0
> (XEN)    [<ffff82d0801048e8>] do_domctl+0x1934/0x1a9c
> (XEN)    [<ffff82d08022c71b>] syscall_enter+0xeb/0x145
> 
> which is entirely due to holding the locks taken via 'domctl_lock_acquire',
> 'rcu_read_lock', and 'pcidevs_lock'.
> 
> It seems that the approach of kicking 'process_pending_softirqs' and
> waiting is not good, as other softirqs might want to grab any of those
> locks at some point and be unhappy.
> 
> Ugh.

Ugly. I'd suggest keeping the ASSERT_NOT_IN_ATOMIC() in place
but commented out, briefly explaining that this can't be used here
since the domctl lock is being held. That lock is not really going to be
acquired by softirq handlers, but we also shouldn't adjust the
generic macro to account for this one special case.

Jan

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-25  0:39           ` Konrad Rzeszutek Wilk
@ 2014-10-27  9:36             ` Jan Beulich
  2014-10-27 16:36               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-27  9:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 25.10.14 at 02:39, <konrad.wilk@oracle.com> wrote:
> We can stick the 'quiesce' part at the end of 'pt_pirq_destroy_bind' or
> at the start of 'pt_irq_create_bind'. And the only way I can think of doing
> this quiesce is:
> 
>      if ( pt_pirq_softirq_active(pirq_dpci) )
>      {
>          spin_unlock(&d->event_lock);
> -        ASSERT_NOT_IN_ATOMIC();
> -        process_pending_softirqs();
> +        cpu_relax();
>          goto restart;
>      }
> 
> Which would replicate what a 'tasklet_kill' does.

Ah, right, it was you who added the process_pending_softirqs()
despite tasklet_kill() not having it. With the comment referring to
tasklet_kill()'s behavior I think this is fine to be done as you
suggest above.

Jan

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27  9:32           ` Jan Beulich
@ 2014-10-27 10:40             ` Andrew Cooper
  2014-10-27 10:59               ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2014-10-27 10:40 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

On 27/10/14 09:32, Jan Beulich wrote:
>>>> On 24.10.14 at 22:55, <konrad.wilk@oracle.com> wrote:
>> On Fri, Oct 24, 2014 at 11:09:59AM +0100, Jan Beulich wrote:
>>>>>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
>>>> On Thu, Oct 23, 2014 at 10:36:07AM +0100, Jan Beulich wrote:
>>>>>>>> On 21.10.14 at 19:19, <konrad.wilk@oracle.com> wrote:
>>>> Was not sure whether you prefer 'true'
>>>> or 'false' instead of numbers - decided on numbers since most of the 
>> code-base
>>>> uses numbers.
>>> So far we don't use "true" and "false" in hypervisor code at all (or if
>>> you spotted any such use, it slipped in by mistake). We ought to
>>> consider switching to bool/true/false though.
>> The dec_lzma2 had it.
> With its private.h having
>
> #define false 0
> #define true 1
>
>>> Apart from the indentation being wrong (case labels having the same
>>> indentation as the switch()'s opening brace), this doesn't seem to be a
>>> proper equivalent of the former code: There you cleared ->dom when
>>> STATE_RUN regardless of STATE_SCHED. Leaving out the comments
>>> I'd suggest
>>>
>>>     switch ( cmpxchg(&pirq_dpci->state, STATE_SCHED, 0) )
>>>     {
>>>     case STATE_SCHED:
>>>         put_domain(d);
>>>     case STATE_RUN: case STATE_SCHED|STATE_RUN:
>> .. which made me realize that testing values as opposed to
>> bit positions requires ditching the 'enum' and introducing
>> STATE_SCHED_BIT, STATE_SCHED, STATE_RUN_BIT, and STATE_RUN
>> to complement each other when checking for values or setting bits.
> No, keep the enum and just use 1 << STATE_... here.
>
>> I added some extra code so that I could reliably trigger the error paths
>> and got:
>>
>> (XEN) pt_pirq_softirq_active: is 0 (debug: 1)
>>
>> This is the first ever usage of pt_irq_create_bind and for fun the
>> code returns 'false'
>>
>> (XEN) Assertion '!preempt_count()' failed at preempt.c:37
>> (XEN) ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
>> (XEN) CPU:    9
>> (XEN) RIP:    e008:[<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
>> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
>> (XEN) rax: ffff82d080320d20   rbx: ffff8302fb126470   rcx: 0000000000000001
>> (XEN) rdx: 00000032b8d99300   rsi: 000000000000000a   rdi: ffff8303149670b8
>> (XEN) rbp: ffff8303390a7bd8   rsp: ffff8303390a7bd8   r8:  0000000000000000
>> (XEN) r9:  0000000000000000   r10: 00000000fffffffd   r11: ffff82d08023e5e0
>> (XEN) r12: ffff83025c126700   r13: ffff830314967000   r14: 0000000000000030
>> (XEN) r15: ffff83025c126728   cr0: 0000000080050033   cr4: 00000000000026f0
>> (XEN) cr3: 000000031b4b7000   cr2: 0000000000000000
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff8303390a7bd8:
>> (XEN)    ffff8303390a7c98 ffff82d080149335 ffff8303390a7ca8 ffff82d08016d0e6
>> (XEN)    0000000000000001 0000000000000206 00000003c0027c38 ffff8303390a7d88
>> (XEN)    00000072390a7c68 ffff830312e5f4d0 0000000000000000 0000001000000003
>> (XEN)    0000000000000246 ffff8303390a7c58 ffff82d08012ce66 ffff8303390a7e48
>> (XEN)    0000007f390a7c88 ffff8303149670b8 fffffffffffffffd fffffffffffffffd
>> (XEN)    00007f81af369004 ffff830314967000 ffff8303390a7e38 0000000000000008
>> (XEN)    ffff8303390a7d78 ffff82d080160131 ffff8303390a7cd8 ffff830302526f10
>> (XEN)    ffff8303390a7ce8 0000000000000073 ffff830330907324 ffff830330907300
>> (XEN)    ffff8303390a7d08 ffff82d08016dc79 ffff8303390a7cf8 ffff82d08013b469
>> (XEN)    ffff8303390a7d28 ffff82d08016e628 ffff8303390a7d28 ffff830314967000
>> (XEN)    000000000000007f 0000000000000206 ffff8303390a7dc8 ffff82d0801711dc
>> (XEN)    ffff8303390a7d88 ffff82d08011e203 0000000000000202 fffffffffffffffd
>> (XEN)    00007f81af369004 ffff830314967000 ffff8800331eb488 0000000000000008
>> (XEN)    ffff8303390a7ef8 ffff82d0801048e8 ffff830302526f10 ffff83025c126700
>> (XEN)    ffff8303390a7dc8 ffff830314967000 00000000ffffffff 0000000000000000
>> (XEN)    ffff8303390a7e98 ffff8303390a7e70 ffff8303390a7e48 ffff82d080184019
>> (XEN)    ffff8303390a7f18 ffffffff8158b0de ffff8303390a7e98 ffff8303149670b8
>> (XEN)    ffff83030000007f ffff82d080191105 000000730000f800 ffff8303390a7e74
>> (XEN)    ffff8300bf52e000 000000000000000d 00007f81af369004 ffff8300bf52e000
>> (XEN)    0000000a00000026 0000000000f70002 000000020000007f 00007f8100000002
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d08011d6db>] ASSERT_NOT_IN_ATOMIC+0x22/0x67
>> (XEN)    [<ffff82d080149335>] pt_irq_create_bind+0xf7/0x5c2
>> (XEN)    [<ffff82d080160131>] arch_do_domctl+0x1131/0x23e0
>> (XEN)    [<ffff82d0801048e8>] do_domctl+0x1934/0x1a9c
>> (XEN)    [<ffff82d08022c71b>] syscall_enter+0xeb/0x145
>>
>> which is entirely due to holding the locks taken via 'domctl_lock_acquire',
>> 'rcu_read_lock', and 'pcidevs_lock'.
>>
>> It seems that the approach of kicking 'process_pending_softirqs' and
>> waiting is not good, as other softirqs might want to grab any of those
>> locks at some point and be unhappy.
>>
>> Ugh.
> Ugly. I'd suggest keeping the ASSERT_NOT_IN_ATOMIC() in place
> but commented out, briefly explaining that this can't be used here
> since the domctl lock is being held. That lock is not really going to be
> acquired by softirq handlers, but we also shouldn't adjust the
> generic macro to account for this one special case.
>
> Jan

The domctl lock is a hot lock, and looping with it held like this will
further starve other toolstack operations.

What is the likelihood that this loop will actually be used?  I am
guessing fairly rare, although it looks more likely to happen for an
interrupt which is firing frequently?

~Andrew

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 10:40             ` Andrew Cooper
@ 2014-10-27 10:59               ` Jan Beulich
  2014-10-27 11:09                 ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-27 10:59 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

>>> On 27.10.14 at 11:40, <andrew.cooper3@citrix.com> wrote:
> The domctl lock is a hot lock, and looping with it held like this will
> further starve other toolstack operations.

The domctl lock normally shouldn't be a hot one. It only is if some tool
stack component permanently issues domctl-s to obtain information on
the state of the system. I think it wasn't so long ago that I said I
consider this bad behavior of the tool stack...

> What is the likelihood that this loop will actually be used?  I am
> guessing fairly rare, although it looks more likely to happen for an
> interrupt which is firing frequently?

I think it doesn't so much matter whether the loop will be used
as how long it could end up spinning if used. And that shouldn't
be long, considering that it only waits for interrupt handling to get
finished.

Jan

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 10:59               ` Jan Beulich
@ 2014-10-27 11:09                 ` Andrew Cooper
  2014-10-27 11:24                   ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2014-10-27 11:09 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

On 27/10/14 10:59, Jan Beulich wrote:
>>>> On 27.10.14 at 11:40, <andrew.cooper3@citrix.com> wrote:
>> The domctl lock is a hot lock, and looping with it held like this will
>> further starve other toolstack operations.
> The domctl lock normally shouldn't be a hot one. It only is if some tool
> stack component permanently issues domctl-s to obtain information on
> the state of the system. I think it wasn't so long ago that I said I
> consider this bad behavior of the tool stack...

Or there is a multi-threaded toolstack which is genuinely performing
legitimate toolstack operations on multiple domains.  This behaviour is
very easy to encounter with cloud workloads and automatic load
balancers.  This is not to say that Xapi couldn't be less silly with some
of the operations it issues, but the global domctl lock is fundamentally
a bottleneck to parallel domain operations.

>
>> What is the likelihood that this loop will actually be used?  I am
>> guessing fairly rare, although it looks more likely to happen for an
>> interrupt which is firing frequently?
> I think it doesn't so much matter whether the loop will be used
> as how long it could end up spinning if used. And that shouldn't
> be long, considering that it only waits for interrupt handling to get
> finished.

Yes.  I don't disagree that this is probably the best way of handling
the situation, but it isn't 0-cost either.

Can it ever be the case that we are waiting for a remote pcpu to run its
softirq handler?  If so, the time spent looping here could be up to 1
scheduling timeslice in the worst case, and 30ms is a very long time to
wait.

~Andrew

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 11:09                 ` Andrew Cooper
@ 2014-10-27 11:24                   ` Jan Beulich
  2014-10-27 17:01                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-27 11:24 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

>>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
> Can it ever be the case that we are waiting for a remote pcpu to run its
> softirq handler?  If so, the time spent looping here could be up to 1
> scheduling timeslice in the worst case, and 30ms is a very long time to
> wait.

Good point - I think this can be the case. But there seems to be a
simple counter measure: The first time we get to this point, send an
event check IPI to the CPU in question (or in the worst case
broadcast one if the CPU can't be determined in a race free way).

Jan

* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-27  9:25           ` Jan Beulich
@ 2014-10-27 16:36             ` Konrad Rzeszutek Wilk
  2014-10-27 16:57               ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 16:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Mon, Oct 27, 2014 at 09:25:41AM +0000, Jan Beulich wrote:
> >>> On 24.10.14 at 21:09, <konrad.wilk@oracle.com> wrote:
> > On Fri, Oct 24, 2014 at 10:49:32AM +0100, Jan Beulich wrote:
> >> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
> >> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
> >> >              {
> >> >                  pirq_dpci->gmsi.gflags = 0;
> >> >                  pirq_dpci->gmsi.gvec = 0;
> >> > +                pirq_dpci->dom = NULL;
> >> >                  pirq_dpci->flags = 0;
> >> >                  pirq_cleanup_check(info, d);
> >> >                  spin_unlock(&d->event_lock);
> >> 
> >> Wait - is this correct even when pirq_guest_bind() succeeded but
> >> msixtbl_pt_register() failed? At the first glance I would say no. But
> > 
> > Keep in mind that if 'msixtbl_pt_register' fails we immediately call
> > 'pirq_guest_unbind' and then land in here.
> 
> Of course. But there was a window where the interrupt was
> bound (and hence potentially got triggered).

Correct.

And the hvm_dirq_assist (thanks to your suggestion) would not crash,
instead it will just return as it checks for 'pirq_dpci->dom' being NULL.

I think this patch does not need any more changes?

> 
> Jan
> 

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27  9:36             ` Jan Beulich
@ 2014-10-27 16:36               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 16:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Mon, Oct 27, 2014 at 09:36:04AM +0000, Jan Beulich wrote:
> >>> On 25.10.14 at 02:39, <konrad.wilk@oracle.com> wrote:
> > We can stick the 'quiesce' part at the end of 'pt_pirq_destroy_bind' or
> > at the start of 'pt_irq_create_bind'. And the only way I can think of doing
> > this quiesce is:
> > 
> >      if ( pt_pirq_softirq_active(pirq_dpci) )
> >      {
> >          spin_unlock(&d->event_lock);
> > -        ASSERT_NOT_IN_ATOMIC();
> > -        process_pending_softirqs();
> > +        cpu_relax();
> >          goto restart;
> >      }
> > 
> > Which would replicate what a 'tasklet_kill' does.
> 
> Ah, right, it was you who added the process_pending_softirqs()
> despite tasklet_kill() not having it. With the comment referring to

Yes. I was trying to be clever.

> tasklet_kill()'s behavior I think this is fine to be done as you
> suggest above.
> 
> Jan
> 

* Re: [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-10-27 16:36             ` Konrad Rzeszutek Wilk
@ 2014-10-27 16:57               ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2014-10-27 16:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 27.10.14 at 17:36, <konrad.wilk@oracle.com> wrote:
> On Mon, Oct 27, 2014 at 09:25:41AM +0000, Jan Beulich wrote:
>> >>> On 24.10.14 at 21:09, <konrad.wilk@oracle.com> wrote:
>> > On Fri, Oct 24, 2014 at 10:49:32AM +0100, Jan Beulich wrote:
>> >> >>> On 24.10.14 at 03:58, <konrad.wilk@oracle.com> wrote:
>> >> > @@ -156,6 +165,7 @@ int pt_irq_create_bind(
>> >> >              {
>> >> >                  pirq_dpci->gmsi.gflags = 0;
>> >> >                  pirq_dpci->gmsi.gvec = 0;
>> >> > +                pirq_dpci->dom = NULL;
>> >> >                  pirq_dpci->flags = 0;
>> >> >                  pirq_cleanup_check(info, d);
>> >> >                  spin_unlock(&d->event_lock);
>> >> 
>> >> Wait - is this correct even when pirq_guest_bind() succeeded but
>> >> msixtbl_pt_register() failed? At the first glance I would say no. But
>> > 
>> > Keep in mind that if 'msixtbl_pt_register' fails we immediately call
>> > 'pirq_guest_unbind' and then land in here.
>> 
>> Of course. But there was a window where the interrupt was
>> bound (and hence potentially got triggered).
> 
> Correct.
> 
> And the hvm_dirq_assist (thanks to your suggestion) would not crash,
> instead it will just return as it checks for 'pirq_dpci->dom' being NULL.
> 
> I think this patch does not need any more changes?

Right - the replacement of these assignments with calls to the
helper function was in the other patch iirc.

Jan

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 11:24                   ` Jan Beulich
@ 2014-10-27 17:01                     ` Konrad Rzeszutek Wilk
  2014-10-27 17:36                       ` Andrew Cooper
  2014-10-28  7:53                       ` Jan Beulich
  0 siblings, 2 replies; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 17:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
> >>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
> > Can it ever be the case that we are waiting for a remote pcpu to run its
> > softirq handler?  If so, the time spent looping here could be up to 1
> > scheduling timeslice in the worst case, and 30ms is a very long time to
> > wait.
> 
> Good point - I think this can be the case. But there seems to be a
> simple counter measure: The first time we get to this point, send an
> event check IPI to the CPU in question (or in the worst case
> broadcast one if the CPU can't be determined in a race free way).

I can either do this using the wrapper:

     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
         spin_unlock(&d->event_lock);
	 if ( pirq_dpci->cpu >= 0 )
	 {
         	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
		pirq_dpci->cpu = -1;
	 }
         cpu_relax();
         goto restart;

Ought to do it (cpu_raise_softirq will exit out if
the 'pirq_dpci->cpu == smp_processor_id()'). It also has some batching checks
so that we won't do the IPI if we are already in the middle of IPI-ing
a CPU.

Or just write it out (and bypass some of the checks 'cpu_raise_softirq'
has):

     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
         spin_unlock(&d->event_lock);
	 if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
	 {
		smp_send_event_check_cpu(pirq_dpci->cpu);
		pirq_dpci->cpu = -1;
	 }
         cpu_relax();
         goto restart;


Note:

The 'cpu' is stashed whenever 'raise_softirq_for' has been called.

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 17:01                     ` Konrad Rzeszutek Wilk
@ 2014-10-27 17:36                       ` Andrew Cooper
  2014-10-27 18:00                         ` Konrad Rzeszutek Wilk
  2014-10-28  7:58                         ` Jan Beulich
  2014-10-28  7:53                       ` Jan Beulich
  1 sibling, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2014-10-27 17:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Jan Beulich
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

On 27/10/14 17:01, Konrad Rzeszutek Wilk wrote:
> On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
>>>>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
>>> Can it ever be the case that we are waiting for a remote pcpu to run its
>>> softirq handler?  If so, the time spent looping here could be up to 1
>>> scheduling timeslice in the worst case, and 30ms is a very long time to
>>> wait.
>> Good point - I think this can be the case. But there seems to be a
>> simple counter measure: The first time we get to this point, send an
>> event check IPI to the CPU in question (or in the worst case
>> broadcast one if the CPU can't be determined in a race free way).
> I can either do this using the wrapper:
>
>      if ( pt_pirq_softirq_active(pirq_dpci) )
>      {
>          spin_unlock(&d->event_lock);
> 	 if ( pirq_dpci->cpu >= 0 )
> 	 {
>          	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
> 		pirq_dpci->cpu = -1;
> 	 }
>          cpu_relax();
>          goto restart;
>
> Ought to do it (cpu_raise_softirq will exit out if
> the 'pirq_dpci->cpu == smp_processor_id()'). It also has some batching checks
> so that we won't do the IPI if we are already in the middle of IPI-ing
> a CPU.
>
> Or just write it out (and bypass some of the checks 'cpu_raise_softirq'
> has):
>
>      if ( pt_pirq_softirq_active(pirq_dpci) )
>      {
>          spin_unlock(&d->event_lock);
> 	 if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
> 	 {
> 		smp_send_event_check_cpu(pirq_dpci->cpu);
> 		pirq_dpci->cpu = -1;
> 	 }
>          cpu_relax();
>          goto restart;
>
>
> Note:
>
> The 'cpu' is stashed whenever 'raise_softirq_for' has been called.
>

You need to send at most 1 IPI, or you will be pointlessly spamming the
target pcpu.  Therefore, a blind goto restart seems ill-advised.

The second version doesn't necessarily set HVM_DPCI_SOFTIRQ pending,
while the first version suffers a risk of the softirq being caught in a
batch.

Furthermore, with mwait support, the IPI is elided completely, which is
completely wrong in this situation.

Therefore, I think you need to manually set the HVM_DPCI_SOFTIRQ bit,
then forcibly send the IPI.

~Andrew

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 17:36                       ` Andrew Cooper
@ 2014-10-27 18:00                         ` Konrad Rzeszutek Wilk
  2014-10-27 21:13                           ` Konrad Rzeszutek Wilk
  2014-10-28  7:58                         ` Jan Beulich
  1 sibling, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 18:00 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: keir, ian.campbell, tim, ian.jackson, Jan Beulich, xen-devel

On Mon, Oct 27, 2014 at 05:36:03PM +0000, Andrew Cooper wrote:
> On 27/10/14 17:01, Konrad Rzeszutek Wilk wrote:
> > On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
> >>>>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
> >>> Can it ever be the case that we are waiting for a remote pcpu to run its
> >>> softirq handler?  If so, the time spent looping here could be up to 1
> >>> scheduling timeslice in the worst case, and 30ms is a very long time to
> >>> wait.
> >> Good point - I think this can be the case. But there seems to be a
> >> simple counter measure: The first time we get to this point, send an
> >> event check IPI to the CPU in question (or in the worst case
> >> broadcast one if the CPU can't be determined in a race free way).
> > I can either do this using the wrapper:
> >
> >      if ( pt_pirq_softirq_active(pirq_dpci) )
> >      {
> >          spin_unlock(&d->event_lock);
> > 	 if ( pirq_dpci->cpu >= 0 )
> > 	 {
> >          	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
> > 		pirq_dpci->cpu = -1;
> > 	 }
> >          cpu_relax();
> >          goto restart;
> >
> > Ought to do it (cpu_raise_softirq will exit out if
> > the 'pirq_dpci->cpu == smp_processor_id()'). It also has some batching checks
> > so that we won't do the IPI if we are already in the middle of IPI-ing
> > a CPU.
> >
> > Or just write it out (and bypass some of the checks 'cpu_raise_softirq'
> > has):
> >
> >      if ( pt_pirq_softirq_active(pirq_dpci) )
> >      {
> >          spin_unlock(&d->event_lock);
> > 	 if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
> > 	 {
> > 		smp_send_event_check_cpu(pirq_dpci->cpu);
> > 		pirq_dpci->cpu = -1;
> > 	 }
> >          cpu_relax();
> >          goto restart;
> >
> >
> > Note:
> >
> > The 'cpu' is stashed whenever 'raise_softirq_for' has been called.
> >
> 
> You need to send at most 1 IPI, or you will be pointlessly spamming the
> target pcpu.  Therefore, a blind goto restart seems ill-advised.

Right. That is what it does (it sets pirq_dpci->cpu to a negative value
so that we don't try to spam the target).
> 
> The second version doesn't necessarily set HVM_DPCI_SOFTIRQ pending,

It does not have to, as the target has already done so. That is because
the ->cpu value is set in raise_softirq_for, which also sets
HVM_DPCI_SOFTIRQ pending.

> while the first version suffers a risk of the softirq being caught in a
> batch.

Correct.
> 
> Furthermore, with mwait support, the IPI is elided completely, which is
> completely wrong in this situation.

Wait, where did that come from? If we use mwait, are IPIs ignored?
Oh, you mean with the 'batching' support.
> 
> Therefore, I think you need to manually set the HVM_DPCI_SOFTIRQ bit,
> then forcibly send the IPI.

OK, so the second (smp_send_event_check_cpu). And the bit is already
set - but I will add a comment explaining that.
> 
> ~Andrew
> 

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 18:00                         ` Konrad Rzeszutek Wilk
@ 2014-10-27 21:13                           ` Konrad Rzeszutek Wilk
  2014-10-28 10:43                             ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 21:13 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: keir, ian.campbell, tim, ian.jackson, Jan Beulich, xen-devel

On Mon, Oct 27, 2014 at 02:00:10PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Oct 27, 2014 at 05:36:03PM +0000, Andrew Cooper wrote:
> > On 27/10/14 17:01, Konrad Rzeszutek Wilk wrote:
> > > On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
> > >>>>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
> > >>> Can it ever be the case that we are waiting for a remote pcpu to run its
> > >>> softirq handler?  If so, the time spent looping here could be up to 1
> > >>> scheduling timeslice in the worst case, and 30ms is a very long time to
> > >>> wait.
> > >> Good point - I think this can be the case. But there seems to be a
> > >> simple counter measure: The first time we get to this point, send an
> > >> event check IPI to the CPU in question (or in the worst case
> > >> broadcast one if the CPU can't be determined in a race free way).
> > > I can either do this using the wrapper:
> > >
> > >      if ( pt_pirq_softirq_active(pirq_dpci) )
> > >      {
> > >          spin_unlock(&d->event_lock);
> > > 	 if ( pirq_dpci->cpu >= 0 )
> > > 	 {
> > >          	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
> > > 		pirq_dpci->cpu = -1;
> > > 	 }
> > >          cpu_relax();
> > >          goto restart;
> > >
> > > Ought to do it (cpu_raise_softirq will exit out if
> > > the 'pirq_dpci->cpu == smp_processor_id()'). It also has some batching checks
> > > so that we won't do the IPI if we are already in the middle of IPI-ing
> > > a CPU.
> > >
> > > Or just write it out (and bypass some of the checks 'cpu_raise_softirq'
> > > has):
> > >
> > >      if ( pt_pirq_softirq_active(pirq_dpci) )
> > >      {
> > >          spin_unlock(&d->event_lock);
> > > 	 if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
> > > 	 {
> > > 		smp_send_event_check_cpu(pirq_dpci->cpu);
> > > 		pirq_dpci->cpu = -1;
> > > 	 }
> > >          cpu_relax();
> > >          goto restart;
> > >
> > >
> > > Note:
> > >
> > > The 'cpu' is stashed whenever 'raise_softirq_for' has been called.
> > >
> > 
> > You need to send at most 1 IPI, or you will be pointlessly spamming the
> > target pcpu.  Therefore, a blind goto restart seems ill-advised.
> 
> Right. That is what it does (it sets pirq_dpci->cpu to a negative value
> so that we don't try to spam the target).
> > 
> > The second version doesn't necessarily set HVM_DPCI_SOFTIRQ pending,
> 
> It does not have to as the target has already done so. That is because
> the ->cpu value is set in raise_softirq_for, which also sets
> HVM_DPCI_SOFTIRQ pending.
> 
> > while the first version suffers a risk of the softirq being caught in a
> > batch.
> 
> Correct.
> > 
> > Furthermore, with mwait support, the IPI is elided completely, which is
> > completely wrong in this situation.
> 
> Wait, where did that come from? If we use mwaits IPIs are ignored?
> Oh, you mean with the 'batching' support.
> > 
> > Therefore, I think you need to manually set the HVM_DPCI_SOFTIRQ bit,
> > then forcibly send the IPI.
> 
> OK, so the second (smp_send_event_check_cpu). And the bit is already
> set - but I will add a comment explaining that.

This is what I came up with. Rather a simple change but I believe it
addresses your concerns.

Note that I could have also sprinkled 'pirq_dpci->cpu = -1' in 
pt_pirq_softirq_reset (after the 'put_domain(d)') and in 'dpci_softirq'
(before clear_bit). But it is not needed - because we determine whether
to send an IPI based on the 'STATE_SCHED' bit and if that is unset
we don't care what is there.

Though if it makes it more maintainable I can certainly add it there.

>From 7ca4009e4613078ad603aef7af0cc65e6f8dd90f Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq (v10)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with
interrupts and the other sockets have idle guests - we
end up with the idle guests having around 40% steal time
and 1/10 of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts
delivered to the guest also suffers.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we raise our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structures
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for a
'hvm_pirq_dpci' which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit
is set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind' the ->dom is whatever it had last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value in the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler,
everything revolves around 'pt_pirq_softirq_active'. If it
reports that the softirq is still active, there is an
outstanding softirq that needs to finish running before we
can continue tearing down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed are when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels first unbind the
   event channel from PIRQs, so 'pt_pirq_cleanup_check'
   won't be called, as 'event' is set to zero. In that case
   we clean it up via either the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
giving the softirq a chance to run. hvm_dirq_assist will be
executed and then the softirq will clear 'state', which signals
that we can re-use the 'hvm_pirq_dpci' structure.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and the softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we
should let it continue on. We can clear the '->dom' field, as the
softirq has stashed it beforehand. If the softirq is in STATE_SCHED
and we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is left to the a), b) or again c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan and Andrew.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 266 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 +++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   6 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 269 insertions(+), 43 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..66869e9 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,125 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * 'hvm_dirq_assist' never ran and the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ *
+ * The 'state' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+    unsigned int cpu;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    cpu = smp_processor_id();
+    pirq_dpci->cpu = cpu;
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with 'dpci_softirq' (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ * If it is false, it is the caller's responsibility to make sure
+ * that the softirq (with the event_lock dropped) has run. We need
+ * to flush out the outstanding 'dpci_softirq' (no more of them
+ * will be added for this pirq as the IRQ action handler has been
+ * reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it resets at
+ * the right time, and if it deschedules the softirq before it has run
+ * it also does the ref-counting (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+            /*
+             * We are going to try to de-schedule the softirq before it goes in
+             * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+             */
+        case (1 << STATE_SCHED):
+            put_domain(d);
+            /* fallthrough. */
+            /*
+             * The reason it is OK to reset 'dom' when the STATE_RUN bit is
+             * set is due to a shortcut 'dpci_softirq' implements: it stashes
+             * 'dom' in a local variable before it sets STATE_RUN - and
+             * therefore will not dereference '->dom', which would crash.
+             */
+        case (1 << STATE_RUN):
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +151,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +212,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +239,29 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * 'dpci_softirq' a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and not have run yet. Note that this code replaced tasklet_kill,
+     * which would have spun forever doing the same thing (waiting to
+     * flush out the outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
+        {
+            /*
+             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
+             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
+             */
+            smp_send_event_check_cpu(pirq_dpci->cpu);
+            pirq_dpci->cpu = -1;
+        }
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +299,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between 'pirq_guest_bind' and 'pirq_guest_unbind' an
+                 * interrupt can be scheduled. No more of them are going
+                 * to be scheduled, but we must deal with the one that is
+                 * in the queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +409,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +546,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +575,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +623,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +677,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +784,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is deaf
+         * and there are no pending softirqs for us to handle, so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..8631473 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..1db45b2 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,14 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
+    int cpu;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +110,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.8.0.1


[-- Attachment #2: 0001-dpci-Replace-tasklet-with-an-softirq-v10.patch --]
[-- Type: text/plain, Size: 24502 bytes --]

>From 7ca4009e4613078ad603aef7af0cc65e6f8dd90f Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v10)

The existing tasklet mechanism has a single global
spinlock that is taken every-time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work which is called via an softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to be in sub 1ms intervals (causing said tasklet
to be scheduled for such small intervalls).

The problem appears when PCI passthrough devices are used
over many sockets and we have an mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of its RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

While on other CPUs they might be sitting in a idle loop
and invoked to deliver an VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guests' BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on a 8 socket machine with
PCI passthrough, where four sockets are busy with interrupts,
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of its timeslice (3ms out of 30 ms) being tied up
taking the lock. The latency of the PCI interrupts delieved
to guest is also hindered.

With this patch the problem disappears completly.
That is removing the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling an tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simple solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of only running one 'hvm_dirq_assist' for only
one 'hvm_pirq_dpci' is also preserved by having
'schedule_dpci_for' ignore any subsequent calls for an domain
which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
in to 'pt_irq_create_bind' the ->dom is whatever it had last time.

As this is the initial call (which QEMU ends up calling when the
guest writes an vector value in the MSI field) we MUST set the
'->dom' to a the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns -EAGAIN that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind the
   event channel from PIRQs first, so 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we clean it up via either the a) or the c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and the softirq was scheduled (but had not
   yet run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit, in which
   case we also do the ref-counting).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks whether the softirq is still scheduled or running.
       Imagine the 'b)' case with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped and
'cpu_relax' called before retrying. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and the softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we
should let it continue. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
and leave the '->dom' field intact. The clearing of
'->dom' is then left to the a), b) or again the c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 266 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 +++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   6 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 269 insertions(+), 43 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..66869e9 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,125 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits are as follows:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never ran and that the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+    unsigned int cpu;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    cpu = smp_processor_id();
+    pirq_dpci->cpu = cpu;
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ *  If it is false, it is the caller's responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has run. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do so
+ * at the right time, and if it unschedules the softirq before it has
+ * run, it also ref-counts (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+            /*
+             * We are going to try to de-schedule the softirq before it goes in
+             * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+             */
+        case (1 << STATE_SCHED):
+            put_domain(d);
+            /* fallthrough. */
+            /*
+             * The reason it is OK to reset 'dom' when STATE_RUN bit is set is
+ * due to a shortcut the 'dpci_softirq' implements. It stashes
+ * 'dom' in a local variable before it sets STATE_RUN - and
+             * therefore will not dereference '->dom' which would crash.
+             */
+        case (1 << STATE_RUN):
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +151,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +212,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +239,29 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the 'dpci_softirq' handler a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill which
+     * would have spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
+        {
+            /*
+             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
+             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
+             */
+            smp_send_event_check_cpu(pirq_dpci->cpu);
+            pirq_dpci->cpu = -1;
+        }
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +299,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between the 'pirq_guest_bind' and before 'pirq_guest_unbind'
+                 * an interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +409,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +546,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +575,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +623,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +677,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +784,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..8631473 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..1db45b2 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,14 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
+    int cpu;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +110,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.8.0.1


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 17:01                     ` Konrad Rzeszutek Wilk
  2014-10-27 17:36                       ` Andrew Cooper
@ 2014-10-28  7:53                       ` Jan Beulich
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2014-10-28  7:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

>>> On 27.10.14 at 18:01, <konrad.wilk@oracle.com> wrote:
> On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
>> >>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
>> > Can it ever be the case that we are waiting for a remote pcpu to run its
>> > softirq handler?  If so, the time spent looping here could be up to 1
>> > scheduling timeslice in the worst case, and 30ms is a very long time to
>> > wait.
>> 
>> Good point - I think this can be the case. But there seems to be a
>> simple counter measure: The first time we get to this point, send an
>> event check IPI to the CPU in question (or in the worst case
>> broadcast one if the CPU can't be determined in a race free way).
> 
> I can either do this using the wrapper:
> 
>      if ( pt_pirq_softirq_active(pirq_dpci) )
>      {
>          spin_unlock(&d->event_lock);
> 	 if ( pirq_dpci->cpu >= 0 )
> 	 {
>          	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
> 		pirq_dpci->cpu = -1;
> 	 }
>          cpu_relax();
>          goto restart;

As Andrew said in his reply, this is the variant you should continue
from. Provided the ->cpu use here is race free (which isn't
immediately clear without seeing the other code).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 17:36                       ` Andrew Cooper
  2014-10-27 18:00                         ` Konrad Rzeszutek Wilk
@ 2014-10-28  7:58                         ` Jan Beulich
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2014-10-28  7:58 UTC (permalink / raw)
  To: Andrew Cooper, Konrad Rzeszutek Wilk
  Cc: xen-devel, keir, ian.jackson, ian.campbell, tim

>>> On 27.10.14 at 18:36, <andrew.cooper3@citrix.com> wrote:
> On 27/10/14 17:01, Konrad Rzeszutek Wilk wrote:
>> On Mon, Oct 27, 2014 at 11:24:31AM +0000, Jan Beulich wrote:
>>>>>> On 27.10.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
>>>> Can it ever be the case that we are waiting for a remote pcpu to run its
>>>> softirq handler?  If so, the time spent looping here could be up to 1
>>>> scheduling timeslice in the worst case, and 30ms is a very long time to
>>>> wait.
>>> Good point - I think this can be the case. But there seems to be a
>>> simple counter measure: The first time we get to this point, send an
>>> event check IPI to the CPU in question (or in the worst case
>>> broadcast one if the CPU can't be determined in a race free way).
>> I can either do this using the wrapper:
>>
>>      if ( pt_pirq_softirq_active(pirq_dpci) )
>>      {
>>          spin_unlock(&d->event_lock);
>> 	 if ( pirq_dpci->cpu >= 0 )
>> 	 {
>>          	cpu_raise_softirq(pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
>> 		pirq_dpci->cpu = -1;
>> 	 }
>>          cpu_relax();
>>          goto restart;
>>
>> Ought to do it (cpu_raise_softirq will exit out if
>> the 'pirq_dpci->cpu == smp_processor_id()'). It also has some batching checks
>> so that we won't do the IPI if we are in the middle of IPI-ing already
>> an CPU.
>>
>> Or just write it out (and bypass some of the checks 'cpu_raise_softirq'
>> has):
>>
>>      if ( pt_pirq_softirq_active(pirq_dpci) )
>>      {
>>          spin_unlock(&d->event_lock);
>> 	 if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
>> 	 {
>> 		smp_send_event_check_cpu(pirq_dpci->cpu);
>> 		pirq_dpci->cpu = -1;
>> 	 }
>>          cpu_relax();
>>          goto restart;
>>
>>
>> Note:
>>
>> The 'cpu' is stashed whenever 'raise_softirq_for' has been called.
>>
> 
> You need to send at most 1 IPI, or you will be pointlessly spamming the
> target pcpu.  Therefore, a blind goto restart seems ill-advised.

With ->cpu being set to -1, I don't see how more than one IPI would
get sent here.

> The second version doesn't necessarily set HVM_DPCI_SOFTIRQ pending,

Right.

> while the first version suffers a risk of the softirq being caught in a
> batch.

Not without anyone up the call stack having called
cpu_raise_softirq_batch_begin().

> Furthermore, with mwait support, the IPI is elided completely, which is
> completely wrong in this situation.

As already said on IRC, this isn't the case: An IPI gets avoided only
when we _know_ the remote CPU is MWAITing (or resuming from
MWAIT).

> Therefore, I think you need to manually set the HVM_DPCI_SOFTIRQ bit,
> then forcibly send the IPI.

Open coding things is almost always wrong.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-27 21:13                           ` Konrad Rzeszutek Wilk
@ 2014-10-28 10:43                             ` Jan Beulich
  2014-10-28 20:07                               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-28 10:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

>>> On 27.10.14 at 22:13, <konrad.wilk@oracle.com> wrote:
> +    /*
> +     * A crude 'while' loop with us dropping the spinlock and giving
> +     * the softirq_dpci a chance to run.
> +     * We MUST check for this condition as the softirq could be scheduled
> +     * and hasn't run yet. Note that this code replaced tasklet_kill which
> +     * would have spun forever and would do the same thing (wait to flush out
> +     * outstanding hvm_dirq_assist calls.
> +     */
> +    if ( pt_pirq_softirq_active(pirq_dpci) )
> +    {
> +        spin_unlock(&d->event_lock);
> +        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
> +        {
> +            /*
> +             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
> +             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
> +             */
> +            smp_send_event_check_cpu(pirq_dpci->cpu);
> +            pirq_dpci->cpu = -1;
> +        }
> +        cpu_relax();
> +        goto restart;
> +    }

As said in an earlier reply to Andrew, I think this open coding goes
too far. And with the softirq known to have got sent, I also don't
really see why it needs to be resent _at all_ (and the comments
don't explain this either).

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-28 10:43                             ` Jan Beulich
@ 2014-10-28 20:07                               ` Konrad Rzeszutek Wilk
  2014-10-29  8:28                                 ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-28 20:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

[-- Attachment #1: Type: text/plain, Size: 5565 bytes --]

On Tue, Oct 28, 2014 at 10:43:32AM +0000, Jan Beulich wrote:
> >>> On 27.10.14 at 22:13, <konrad.wilk@oracle.com> wrote:
> > +    /*
> > +     * A crude 'while' loop with us dropping the spinlock and giving
> > +     * the softirq_dpci a chance to run.
> > +     * We MUST check for this condition as the softirq could be scheduled
> > +     * and hasn't run yet. Note that this code replaced tasklet_kill which
> > +     * would have spun forever and would do the same thing (wait to flush out
> > +     * outstanding hvm_dirq_assist calls.
> > +     */
> > +    if ( pt_pirq_softirq_active(pirq_dpci) )
> > +    {
> > +        spin_unlock(&d->event_lock);
> > +        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
> > +        {
> > +            /*
> > +             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
> > +             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
> > +             */
> > +            smp_send_event_check_cpu(pirq_dpci->cpu);
> > +            pirq_dpci->cpu = -1;
> > +        }
> > +        cpu_relax();
> > +        goto restart;
> > +    }
> 
> As said in an earlier reply to Andrew, I think this open coding goes
> too far. And with the softirq known to have got sent, I also don't
> really see why it needs to be resent _at all_ (and the comments
> don't explain this either).

In the other emails you and Andrew said:

	">> > Can it ever be the case that we are waiting for a remote pcpu to run its
	>> > softirq handler?  If so, the time spent looping here could be up to 1
	>> > scheduling timeslice in the worst case, and 30ms is a very long time to
	>> > wait.
	>>
	>> Good point - I think this can be the case. But there seems to be a
	>> simple counter measure: The first time we get to this point, send an
	>> event check IPI to the CPU in question (or in the worst case
	>> broadcast one if the CPU can't be determined in a race free way).
	>
	"

Which is true. That is what this is trying to address.

But if we use 'cpu_raise_softirq', which you advocate, it would inhibit the IPI
if HVM_DPCI_SOFTIRQ is already set on the remote CPU:

void cpu_raise_softirq(unsigned int cpu, unsigned int nr) 
{
    unsigned int this_cpu = smp_processor_id();

    if ( test_and_set_bit(nr, &softirq_pending(cpu))		<=== that will be true
         || (cpu == this_cpu)
         || arch_skip_send_event_check(cpu) )
        return;

    if ( !per_cpu(batching, this_cpu) || in_irq() )
        smp_send_event_check_cpu(cpu);
    else
        set_bit(nr, &per_cpu(batch_mask, this_cpu));
}

In which case we still won't be sending the IPI. The open-coded use of
'smp_send_event_check_cpu' would bypass that check (which would otherwise
inhibit the IPI in this scenario).

Perhaps you are suggesting something like this (on top of this patch) - also
attached is the new patch with this change folded in.

diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 22e417a..2b90316 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,6 +94,14 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
         smp_send_event_check_mask(raise_mask);
 }
 
+void cpu_raise_softirq_ipi(unsigned int this_cpu, unsigned int cpu,
+                           unsigned int nr)
+{
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
+        smp_send_event_check_cpu(cpu);
+    else
+        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+}
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
     unsigned int this_cpu = smp_processor_id();
@@ -103,10 +111,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
          || arch_skip_send_event_check(cpu) )
         return;
 
-    if ( !per_cpu(batching, this_cpu) || in_irq() )
-        smp_send_event_check_cpu(cpu);
-    else
-        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+    cpu_raise_softirq_ipi(this_cpu, cpu, nr);
 }
 
 void cpu_raise_softirq_batch_begin(void)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 66869e9..ddbb890 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -249,14 +249,18 @@ int pt_irq_create_bind(
      */
     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
+        unsigned int cpu;
+
         spin_unlock(&d->event_lock);
-        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
+
+        cpu = smp_processor_id();
+        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != cpu )
         {
             /*
-             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
-             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
+             * We do NOT want to wait for the remote CPU to run its course which
+             * could be a full guest time-slice. As such, send one IPI there.
              */
-            smp_send_event_check_cpu(pirq_dpci->cpu);
+            cpu_raise_softirq_ipi(cpu, pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
             pirq_dpci->cpu = -1;
         }
         cpu_relax();
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0895a16..16f7063 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -27,6 +27,8 @@ void open_softirq(int nr, softirq_handler handler);
 void softirq_init(void);
 
 void cpumask_raise_softirq(const cpumask_t *, unsigned int nr);
+void cpu_raise_softirq_ipi(unsigned int this_cpu, unsigned int cpu,
+                           unsigned int nr);
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 

[-- Attachment #2: 0001-dpci-Replace-tasklet-with-an-softirq-v11.patch --]
[-- Type: text/plain, Size: 26471 bytes --]

>From 5dc87eecb583272c30cfbbed5130474d60eb5c88 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v11)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is invoked via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that quite a lot of
tasklets are scheduled. The most common one invoked is
the one injecting VIRQ_TIMER into the guest. Sane guests
do not set their one-shot or periodic clocks to
sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

While on other CPUs they might be sitting in a idle loop
and invoked to deliver an VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guests' BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8 socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to guests is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for an
'hvm_pirq_dpci' which has already been scheduled.
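The scheduling gate described above can be sketched as a minimal user-space model with C11 atomics. The names ('pirq_model', 'model_raise', 'model_run') are illustrative stand-ins, not the Xen internals; the point is only that an atomic test-and-set of STATE_SCHED collapses an interrupt storm for one pirq into a single queued entry until the softirq clears the gate:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { STATE_SCHED, STATE_RUN };

struct pirq_model {
    atomic_ulong state;   /* mask of STATE_* bits */
    int queued;           /* how many times we actually enqueued */
};

/* Returns true if this call enqueued the pirq; false if the gate
 * (STATE_SCHED) was already set by an earlier interrupt. */
static bool model_raise(struct pirq_model *p)
{
    unsigned long old = atomic_fetch_or(&p->state, 1UL << STATE_SCHED);

    if ( old & (1UL << STATE_SCHED) )
        return false;     /* already on some CPU's dpci_list */
    p->queued++;          /* stand-in for list_add_tail + raise_softirq */
    return true;
}

/* Stand-in for the softirq running: set RUN, clear SCHED (re-opening
 * the gate), do the work, clear RUN. */
static void model_run(struct pirq_model *p)
{
    atomic_fetch_or(&p->state, 1UL << STATE_RUN);
    atomic_fetch_and(&p->state, ~(1UL << STATE_SCHED));
    /* hvm_dirq_assist would run here */
    atomic_fetch_and(&p->state, ~(1UL << STATE_RUN));
}
```

A burst of calls between scheduling and the softirq running results in exactly one queue entry.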

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
in to 'pt_irq_create_bind' the ->dom is whatever it had last time.

As this is the initial call (which QEMU makes when the
guest writes a vector value into the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns true that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and doing
   the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin (at the start of 'pt_irq_create_bind') on
'pt_pirq_softirq_active' with the event_lock spinlock dropped,
waiting. We cannot call 'process_pending_softirqs' as it
might result in a dead-lock. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure. In case this softirq
is scheduled on a remote CPU we send one IPI there to wake it up.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and the softirq was scheduled.

If the softirq is in STATE_RUN that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
the '->dom' is left to a), b) or again c) case.

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.
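The a), b), and c) teardown paths above all funnel through the cmpxchg in 'pt_pirq_softirq_reset'. The race can be sketched as a user-space model (assumed names, C11 atomics; not the Xen source): the cmpxchg succeeds only when the softirq has not yet entered STATE_RUN, in which case the caller inherits the ref-count drop; if the softirq is already running, the reference is left for it, and '->dom' may still be cleared because the softirq stashed its own copy before setting STATE_RUN:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { STATE_SCHED, STATE_RUN };

struct model {
    atomic_ulong state;
    int refs;             /* stands in for the domain ref-count */
    void *dom;
};

/* Returns true if we descheduled the softirq ourselves (and therefore
 * dropped the reference that raise_softirq_for took). */
static bool model_reset(struct model *m)
{
    unsigned long want = 1UL << STATE_SCHED;

    if ( atomic_compare_exchange_strong(&m->state, &want, 0UL) )
    {
        m->refs--;        /* we cleared STATE_SCHED: our put_domain() */
        m->dom = NULL;
        return true;
    }
    /* cmpxchg failed: 'want' now holds the observed state. */
    if ( want & (1UL << STATE_RUN) )
        m->dom = NULL;    /* the softirq stashed its own copy already */
    return false;
}
```

If the state is exactly STATE_SCHED the reset wins the race; if STATE_RUN is set the softirq keeps ownership of the reference.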

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu/this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
v11: IPI the remote CPU (once) if it has the softirq scheduled.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/common/softirq.c          |  13 +-
 xen/drivers/passthrough/io.c  | 268 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 +++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   6 +-
 xen/include/xen/pci.h         |   2 +-
 xen/include/xen/softirq.h     |   2 +
 8 files changed, 283 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 22e417a..2b90316 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,6 +94,14 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
         smp_send_event_check_mask(raise_mask);
 }
 
+void cpu_raise_softirq_ipi(unsigned int this_cpu, unsigned int cpu,
+                           unsigned int nr)
+{
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
+        smp_send_event_check_cpu(cpu);
+    else
+        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+}
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
     unsigned int this_cpu = smp_processor_id();
@@ -103,10 +111,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
          || arch_skip_send_event_check(cpu) )
         return;
 
-    if ( !per_cpu(batching, this_cpu) || in_irq() )
-        smp_send_event_check_cpu(cpu);
-    else
-        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+    cpu_raise_softirq_ipi(this_cpu, cpu, nr);
 }
 
 void cpu_raise_softirq_batch_begin(void)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..ddbb890 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,125 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * 'hvm_dirq_assist' never ran and the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+    unsigned int cpu;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    cpu = smp_processor_id();
+    pirq_dpci->cpu = cpu;
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ *  If it is false, it is the caller's responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has run. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we were to call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it runs
+ * at the right time, and if it unschedules the softirq before it has
+ * run it also refcounts (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+            /*
+             * We are going to try to de-schedule the softirq before it goes in
+             * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+             */
+        case (1 << STATE_SCHED):
+            put_domain(d);
+            /* fallthrough. */
+            /*
+             * The reason it is OK to reset 'dom' when STATE_RUN bit is set is
+             * due to a shortcut the 'dpci_softirq' implements. It stashes
+             * a copy of 'dom' in a local variable before it sets STATE_RUN -
+             * and therefore will not dereference '->dom' which would crash.
+             */
+        case (1 << STATE_RUN):
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +151,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +212,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +239,33 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the dpci_softirq a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaces tasklet_kill, which
+     * would have spun forever while doing the same thing (waiting to flush
+     * out the outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        unsigned int cpu;
+
+        spin_unlock(&d->event_lock);
 
+        cpu = smp_processor_id();
+        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != cpu )
+        {
+            /*
+             * We do NOT want to wait for the remote CPU to run its course which
+             * could be a full guest time-slice. As such, send one IPI there.
+             */
+            cpu_raise_softirq_ipi(cpu, pirq_dpci->cpu, HVM_DPCI_SOFTIRQ);
+            pirq_dpci->cpu = -1;
+        }
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +303,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between the 'pirq_guest_bind' and before 'pirq_guest_unbind'
+                 * an interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +413,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +550,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +579,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +627,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +681,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +788,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is dead
+         * and there are no pending softirqs for us to handle, so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..8631473 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..1db45b2 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,14 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
+    int cpu;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +110,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0895a16..16f7063 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -27,6 +27,8 @@ void open_softirq(int nr, softirq_handler handler);
 void softirq_init(void);
 
 void cpumask_raise_softirq(const cpumask_t *, unsigned int nr);
+void cpu_raise_softirq_ipi(unsigned int this_cpu, unsigned int cpu,
+                           unsigned int nr);
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 
-- 
1.8.4.2


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-28 20:07                               ` Konrad Rzeszutek Wilk
@ 2014-10-29  8:28                                 ` Jan Beulich
  2014-10-29 21:11                                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-29  8:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

>>> On 28.10.14 at 21:07, <konrad.wilk@oracle.com> wrote:
> On Tue, Oct 28, 2014 at 10:43:32AM +0000, Jan Beulich wrote:
>> >>> On 27.10.14 at 22:13, <konrad.wilk@oracle.com> wrote:
>> > +    /*
>> > +     * A crude 'while' loop with us dropping the spinlock and giving
>> > +     * the softirq_dpci a chance to run.
>> > +     * We MUST check for this condition as the softirq could be scheduled
>> > +     * and hasn't run yet. Note that this code replaced tasklet_kill which
>> > +     * would have spun forever and would do the same thing (wait to flush out
>> > +     * outstanding hvm_dirq_assist calls.
>> > +     */
>> > +    if ( pt_pirq_softirq_active(pirq_dpci) )
>> > +    {
>> > +        spin_unlock(&d->event_lock);
>> > +        if ( pirq_dpci->cpu >= 0 && pirq_dpci->cpu != smp_processor_id() )
>> > +        {
>> > +            /*
>> > +             * The 'raise_softirq_for' sets the CPU and raises the softirq bit
>> > +             * so we do not need to set the target CPU's HVM_DPCI_SOFTIRQ.
>> > +             */
>> > +            smp_send_event_check_cpu(pirq_dpci->cpu);
>> > +            pirq_dpci->cpu = -1;
>> > +        }
>> > +        cpu_relax();
>> > +        goto restart;
>> > +    }
>> 
>> As said in an earlier reply to Andrew, I think this open coding goes
>> too far. And with the softirq known to have got sent, I also don't
>> really see why it needs to be resent _at all_ (and the comments
>> don't explain this either).
> 
> In the other emails you and Andrew said:
> 
> 	">> > Can it ever be the case that we are waiting for a remote pcpu to run its
> 	>> > softirq handler?  If so, the time spent looping here could be up to 1
> 	>> > scheduling timeslice in the worst case, and 30ms is a very long time to
> 	>> > wait.
> 	>>
> 	>> Good point - I think this can be the case. But there seems to be a
> 	>> simple counter measure: The first time we get to this point, send an
> 	>> event check IPI to the CPU in question (or in the worst case
> 	>> broadcast one if the CPU can't be determined in a race free way).
> 	>
> 	"

And I was wrong to agree with Andrew. We can't be waiting for a full
time slice - that would be contrary to the concept of softirqs. Locally
raised ones cause them to be executed before returning back to guest
context; remotely raised ones issue an IPI.

> Which is true. That is what this is trying to address.
> 
> But if we use 'cpu_raise_softirq' which you advocate it would inhibit the 
> IPI
> if the  HVM_DPCI_SOFTIRQ is set on the remote CPU:
> 
> void cpu_raise_softirq(unsigned int cpu, unsigned int nr) 
> {
>     unsigned int this_cpu = smp_processor_id();
> 
>     if ( test_and_set_bit(nr, &softirq_pending(cpu))		<=== that will be true
>          || (cpu == this_cpu)
>          || arch_skip_send_event_check(cpu) )
>         return;
> 
>     if ( !per_cpu(batching, this_cpu) || in_irq() )
>         smp_send_event_check_cpu(cpu);
>     else
>         set_bit(nr, &per_cpu(batch_mask, this_cpu));
> }
> 
> In which case we still won't be sending the IPI.

And we don't need to, as it was already done by whoever set that flag.

Jan
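
[Editorial note: the IPI suppression Jan describes - whoever first sets the
pending bit is the one who already sent the IPI, so later raisers need not -
can be sketched as a toy single-CPU model. The names below are illustrative
only, not the Xen implementation.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_ulong softirq_pending_remote; /* remote CPU's pending mask */
static int ipis_sent;

/* Returns true if this call sent the IPI. A second raiser observes the
 * bit already set in the old value and skips the (redundant) IPI. */
static bool model_cpu_raise_softirq(unsigned int nr)
{
    unsigned long bit = 1UL << nr;

    if ( atomic_fetch_or(&softirq_pending_remote, bit) & bit )
        return false;   /* already pending: the earlier raiser IPI'd */
    ipis_sent++;        /* stand-in for smp_send_event_check_cpu() */
    return true;
}
```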

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-29  8:28                                 ` Jan Beulich
@ 2014-10-29 21:11                                   ` Konrad Rzeszutek Wilk
  2014-10-30  9:04                                     ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-29 21:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

. snip..
> > In the other emails you and Andrew said:
> > 
> > 	">> > Can it ever be the case that we are waiting for a remote pcpu to run its
> > 	>> > softirq handler?  If so, the time spent looping here could be up to 1
> > 	>> > scheduling timeslice in the worst case, and 30ms is a very long time to
> > 	>> > wait.
> > 	>>
> > 	>> Good point - I think this can be the case. But there seems to be a
> > 	>> simple counter measure: The first time we get to this point, send an
> > 	>> event check IPI to the CPU in question (or in the worst case
> > 	>> broadcast one if the CPU can't be determined in a race free way).
> > 	>
> > 	"
> 
> And I was wrong to agree with Andrew. We can't be waiting for a full
> time slice - that would be contrary to the concept of softirqs. Locally
> raised ones cause them to be executed before returning back to guest
> context; remotely raised ones issue an IPI.
> 
> > Which is true. That is what this is trying to address.
> > 
> > But if we use 'cpu_raise_softirq' which you advocate it would inhibit the 
> > IPI
> > if the  HVM_DPCI_SOFTIRQ is set on the remote CPU:
.. snip code..
> > In which case we still won't be sending the IPI.
> 
> And we don't need to, as it was already done by whoever set that flag.

OK. In that case this version (the last?) should do it (attached and inline):

Or do you want me to remove the 'goto restart' loop as it
is unlikely to ever be triggered because the softirq would be
executed right away?

From 1fb22a591447830c09c4a6ddf1e0dad6937148c2 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v12)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set their one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests - we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) being tied up
taking the lock. The latency of the PCI interrupts delivered
to guests is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we raise our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structures
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a per-CPU linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for an
'hvm_pirq_dpci' which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL when it is allocated. On subsequent calls
into 'pt_irq_create_bind' the '->dom' is whatever it had last time.
As this is the initial call (which QEMU ends up making when the
guest writes a vector value in the MSI field) we MUST set the
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex as there
are three ways it can be executed. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns true that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use the pirq, but decided to mask it and left it in
     that state.
   - guest did use the pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels first unbind the
   event channel from the PIRQ, so 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we clean it up via mechanism a) or c).

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and softirq was scheduled (but had not
   been run yet). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
waiting (cpu_relax). We cannot call 'process_pending_softirqs' as it
might result in a dead-lock. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure. In case this softirq
is scheduled on a remote CPU, the softirq will run there as the
semantics behind a softirq are that it will execute before
returning to guest context.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and a softirq was scheduled.

If the softirq is in STATE_RUN that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is then left to case a), b), or again c).

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.
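The three-way outcome described above can be modelled with C11 atomics (a
hypothetical sketch, not the Xen implementation - 'reset_sketch' and the
'fake_dpci' fields are invented for illustration, and
atomic_compare_exchange_strong stands in for Xen's cmpxchg):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { STATE_SCHED, STATE_RUN };

struct fake_dpci {
    atomic_uint state;
    void *dom;          /* stand-in for the ref-counted 'struct domain *' */
    int refs_dropped;   /* counts the would-be put_domain() calls */
};

/*
 * Models pt_pirq_softirq_reset(): try to atomically move the state from
 * "scheduled but not yet run" to idle. Whoever clears STATE_SCHED must
 * drop the domain reference; '->dom' may be cleared whenever the softirq
 * either never ran or has already stashed its own copy.
 */
static void reset_sketch(struct fake_dpci *p)
{
    unsigned int expected = 1u << STATE_SCHED;

    if ( atomic_compare_exchange_strong(&p->state, &expected, 0) )
    {
        p->refs_dropped++;          /* we descheduled it: put_domain() */
        p->dom = NULL;
    }
    else if ( expected & (1u << STATE_RUN) )
        p->dom = NULL;              /* softirq saved 'dom' locally already */
    /* expected == 0: the softirq finished earlier; nothing to do. */
}
```

If STATE_RUN (with or without STATE_SCHED) is observed, only '->dom' is
cleared - the softirq itself will do the ref-count drop when it clears
STATE_SCHED.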

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up mode, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
v11: IPI the remote CPU (once) if it has the softirq scheduled.
v12: Remove the IPI for the remote CPU as the softirq mechanism does that.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 254 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 256 insertions(+), 43 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..a781d5f 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,122 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits are as follows:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never ran and that the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'dpci_softirq' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with dpci_softirq (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ *  If it is false, it is the caller's responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has run. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do
+ * so at the right time, and if it unschedules the softirq before it
+ * has run it also does the ref-counting the softirq would have done.
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+            /*
+             * We are going to try to de-schedule the softirq before it goes in
+             * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+             */
+        case (1 << STATE_SCHED):
+            put_domain(d);
+            /* fallthrough. */
+            /*
+             * The reason it is OK to reset 'dom' when STATE_RUN bit is set is
+             * due to a shortcut the 'dpci_softirq' implements. It stashes
+             * the 'dom' in a local variable before it sets STATE_RUN - and
+             * therefore will not dereference '->dom' which would crash.
+             */
+        case (1 << STATE_RUN):
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+            pirq_dpci->dom = NULL;
+            break;
+        default:
+            break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +148,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +209,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +236,20 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the dpci_softirq a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill which
+     * would have spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -165,8 +287,14 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
-                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
+                /*
+                 * Between 'pirq_guest_bind' and 'pirq_guest_unbind'
+                 * an interrupt can be scheduled. No more of them are going to
+                 * be scheduled but we must deal with the one that is in the
+                 * queue.
+                 */
+                pt_pirq_softirq_reset(pirq_dpci);
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
                 return rc;
@@ -269,6 +397,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +534,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +563,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +611,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +665,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +772,81 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..8631473 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ENODEV;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret == -ERESTART )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-29 21:11                                   ` Konrad Rzeszutek Wilk
@ 2014-10-30  9:04                                     ` Jan Beulich
  2014-11-02 20:09                                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2014-10-30  9:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

>>> On 29.10.14 at 22:11, <konrad.wilk@oracle.com> wrote:
> Or do you want me to remove the 'goto restart' loop as it
> is unlikely to ever be triggered because the softirq would be
> executed right away?

No, that one definitely needs to stay.

> +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> +{
> +    struct domain *d = pirq_dpci->dom;
> +
> +    ASSERT(spin_is_locked(&d->event_lock));
> +
> +    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
> +    {
> +            /*
> +             * We are going to try to de-schedule the softirq before it goes in
> +             * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
> +             */
> +        case (1 << STATE_SCHED):
> +            put_domain(d);
> +            /* fallthrough. */
> +            /*
> +             * The reason it is OK to reset 'dom' when STATE_RUN bit is set is
> +             * due to a shortcut the 'dpci_softirq' implements. It stashes the
> +             * a 'dom' in local variable before it sets STATE_RUN - and
> +             * therefore will not dereference '->dom' which would crash.
> +             */
> +        case (1 << STATE_RUN):
> +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> +            pirq_dpci->dom = NULL;
> +            break;
> +        default:
> +            break;

Indentation is still wrong. Also I think the comments are a little
odd in their placement. In particular, a fall-through comment should
imo immediately precede the subsequent case label(s). Hence I think
the larger comment would better go after the two labels involving
STATE_RUN.
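Concretely, the layout being asked for might look like this (a compilable
sketch only meant to show comment placement - the helper 'classify' and its
parameters are invented, and 'put_ref' merely counts the would-be
put_domain() calls):

```c
#include <assert.h>
#include <stddef.h>

enum { STATE_SCHED, STATE_RUN };

/*
 * Demonstrates the requested layout: the fall-through comment sits
 * immediately before the label it falls into, and the STATE_RUN
 * explanation sits with the STATE_RUN labels.
 */
static int classify(unsigned int old, int *put_ref, void **dom)
{
    switch ( old )
    {
    case 1 << STATE_SCHED:
        ++*put_ref;                 /* we descheduled it: put_domain() */
        /* fall through */
        /*
         * Resetting 'dom' with STATE_RUN set is OK because dpci_softirq
         * stashes 'dom' in a local variable before setting STATE_RUN.
         */
    case 1 << STATE_RUN:
    case (1 << STATE_RUN) | (1 << STATE_SCHED):
        *dom = NULL;
        return 1;
    }
    return 0;
}
```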

> @@ -165,8 +287,14 @@ int pt_irq_create_bind(
>              {
>                  pirq_dpci->gmsi.gflags = 0;
>                  pirq_dpci->gmsi.gvec = 0;
> -                pirq_dpci->dom = NULL;
>                  pirq_dpci->flags = 0;
> +                /*
> +                 * Between the 'pirq_guest_bind' and before 'pirq_guest_unbind'
> +                 * an interrupt can be scheduled. No more of them are going to
> +                 * be scheduled but we must deal with the one that is in the
> +                 * queue.
> +                 */
> +                pt_pirq_softirq_reset(pirq_dpci);
>                  pirq_cleanup_check(info, d);
>                  spin_unlock(&d->event_lock);
>                  return rc;

I wonder whether that wouldn't better be moved into the conditional
where pirq_guest_unbind() gets called, while - along the lines of the
subsequent change - handling failure of pirq_guest_bind() the more
straightforward way.

> +static int cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned int cpu = (unsigned long)hcpu;
> +
> +    switch ( action )
> +    {
> +    case CPU_UP_PREPARE:
> +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> +        break;
> +    case CPU_UP_CANCELED:
> +    case CPU_DEAD:
> +        /*
> +         * On CPU_DYING this callback is called (on the CPU that is dying)
> +         * with an possible HVM_DPIC_SOFTIRQ pending - at which point we can
> +         * clear out any outstanding domains (by the virtue of the idle loop
> +         * calling the softirq later). In CPU_DEAD case the CPU is deaf and
> +         * there are no pending softirqs for us to handle so we can chill.
> +         */
> +        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
> +        break;
> +    default:
> +        break;

Could I additionally talk you into omitting unnecessary default cases
like this one?

> -void pci_release_devices(struct domain *d)
> +int pci_release_devices(struct domain *d)
>  {
>      struct pci_dev *pdev;
>      u8 bus, devfn;
> +    int ret;
>  
>      spin_lock(&pcidevs_lock);
> -    pci_clean_dpci_irqs(d);
> +    ret = pci_clean_dpci_irqs(d);
> +    if ( ret == -ERESTART )
> +    {
> +        spin_unlock(&pcidevs_lock);
> +        return ret;
> +    }
>      while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
>      {
>          bus = pdev->bus;

Even if errors other than -ERESTART aren't possible right now, the
code now looks like it is ignoring such. I think it would be better if
you simply dropped the special casing of -ERESTART and propagated
all errors here.

Jan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-10-30  9:04                                     ` Jan Beulich
@ 2014-11-02 20:09                                       ` Konrad Rzeszutek Wilk
  2014-11-03  8:46                                         ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-02 20:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, Andrew Cooper, tim, xen-devel, ian.jackson

> > -void pci_release_devices(struct domain *d)
> > +int pci_release_devices(struct domain *d)
> >  {
> >      struct pci_dev *pdev;
> >      u8 bus, devfn;
> > +    int ret;
> >  
> >      spin_lock(&pcidevs_lock);
> > -    pci_clean_dpci_irqs(d);
> > +    ret = pci_clean_dpci_irqs(d);
> > +    if ( ret == -ERESTART )
> > +    {
> > +        spin_unlock(&pcidevs_lock);
> > +        return ret;
> > +    }
> >      while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
> >      {
> >          bus = pdev->bus;
> 
> Even if errors other than -ERESTART aren't possible right now, the
> code now looks like it is ignoring such. I think it would be better if
> you simply dropped the special casing of -ERESTART and propagated
> all errors here.

That will cause a regression. pci_clean_dpci_irqs will
now return -ENODEV if the iommu is turned off - which will mean
that we won't be able to kill an HVM guest without passthrough.

I can do:

1). if ( ret && !(ret == -EINVAL || ret == -E..))

2). Or remove the various 'return -EINVAL' in pci_clean_dpci_irqs
   for the cases where the IOMMU is off or there are no PCI passthrough
   devices - and just make them return 0.

3). Leave it as is.
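Option 2) could look roughly like this (a hypothetical user-space model
with invented names and a stand-in error code; Xen's real signatures and
locking differ):

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_ERESTART (-85)   /* stand-in for Xen's -ERESTART */

struct fake_domain {
    bool iommu_enabled;
    bool is_hvm;
    bool softirq_pending;     /* an hvm_dirq_assist still outstanding */
};

/* "Nothing to clean" becomes success instead of -ENODEV / -EINVAL. */
static int clean_dpci_irqs(struct fake_domain *d)
{
    if ( !d->iommu_enabled || !d->is_hvm )
        return 0;
    return d->softirq_pending ? FAKE_ERESTART : 0;
}

/* The caller can then propagate every error without special-casing. */
static int release_devices(struct fake_domain *d)
{
    int ret = clean_dpci_irqs(d);

    if ( ret )
        return ret;           /* -ERESTART: retry the relinquish later */
    /* ... detach the PCI devices here ... */
    return 0;
}
```

A plain HVM guest without passthrough then tears down cleanly, while a
pending softirq still forces the retry.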

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8)
  2014-11-02 20:09                                       ` Konrad Rzeszutek Wilk
@ 2014-11-03  8:46                                         ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2014-11-03  8:46 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, Andrew Cooper, ian.jackson, tim, xen-devel

>>> On 02.11.14 at 21:09, <konrad@darnok.org> wrote:
> 2). Or remove the various 'return -EINVAL' in pci_clean_dpci_irqs
>    for the cases where the IOMMU is off or there are no PCI passthrough
>    devices - and just make them return 0.

This one I would say - it seems quite natural for these not to fail
but succeed when there's nothing to do.

Jan


end of thread, other threads:[~2014-11-03  8:46 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-21 17:19 [PATCH v8 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
2014-10-23  8:58   ` Jan Beulich
2014-10-24  1:58     ` Konrad Rzeszutek Wilk
2014-10-24  9:49       ` Jan Beulich
2014-10-24 19:09         ` Konrad Rzeszutek Wilk
2014-10-27  9:25           ` Jan Beulich
2014-10-27 16:36             ` Konrad Rzeszutek Wilk
2014-10-27 16:57               ` Jan Beulich
2014-10-21 17:19 ` [PATCH v8 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq (v8) Konrad Rzeszutek Wilk
2014-10-23  9:36   ` Jan Beulich
2014-10-24  1:58     ` Konrad Rzeszutek Wilk
2014-10-24 10:09       ` Jan Beulich
2014-10-24 20:55         ` Konrad Rzeszutek Wilk
2014-10-25  0:39           ` Konrad Rzeszutek Wilk
2014-10-27  9:36             ` Jan Beulich
2014-10-27 16:36               ` Konrad Rzeszutek Wilk
2014-10-27  9:32           ` Jan Beulich
2014-10-27 10:40             ` Andrew Cooper
2014-10-27 10:59               ` Jan Beulich
2014-10-27 11:09                 ` Andrew Cooper
2014-10-27 11:24                   ` Jan Beulich
2014-10-27 17:01                     ` Konrad Rzeszutek Wilk
2014-10-27 17:36                       ` Andrew Cooper
2014-10-27 18:00                         ` Konrad Rzeszutek Wilk
2014-10-27 21:13                           ` Konrad Rzeszutek Wilk
2014-10-28 10:43                             ` Jan Beulich
2014-10-28 20:07                               ` Konrad Rzeszutek Wilk
2014-10-29  8:28                                 ` Jan Beulich
2014-10-29 21:11                                   ` Konrad Rzeszutek Wilk
2014-10-30  9:04                                     ` Jan Beulich
2014-11-02 20:09                                       ` Konrad Rzeszutek Wilk
2014-11-03  8:46                                         ` Jan Beulich
2014-10-28  7:58                         ` Jan Beulich
2014-10-28  7:53                       ` Jan Beulich
