* [RFC PATCH v1] Replace tasklets with per-cpu implementation.
@ 2014-08-27 17:58 Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq Konrad Rzeszutek Wilk
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel

Hey,

With the Xen 4.5 feature freeze right on the doorstep I am not
expecting this to go in, because:
 1) It touches core code,
 2) It has never been tested on ARM,
 3) It is an RFC for right now.

With those expectations out of the way, I am submitting for review
an overhaul of the tasklet code. We found that on one large machine
with a small number of guests (12), the steal time for idle
guests was excessively high. Further debugging revealed that the
global tasklet lock was taken across all sockets at an excessively
high rate - to the point that 1/10th of a guest's idle time was
taken up by it (and actually accounted for as RUNNING state!).

The ideal situation to reproduce this behavior is:
 1) Allocate twelve guests, each with one to four SR-IOV VFs.
 2) Have half of them (six) heavily use the SR-IOV VF devices.
 3) Monitor the rest (which are idle) and despair.

As I discovered under the hood, we have two tasklets that are
scheduled and executed quite often - the VIRQ_TIMER one:
assert_evtchn_irq_tasklet, and the one in charge of injecting
a PCI interrupt into the guest: hvm_do_IRQ_dpci.

The 'hvm_do_IRQ_dpci' tasklet is the one that is most often scheduled
and run. The performance bottleneck comes from the fact that
we take the same spinlock three times: in tasklet_schedule,
when we are about to execute the tasklet, and when we are
done executing the tasklet.
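
For illustration, this is roughly where that lock shows up in the
current xen/common/tasklet.c (a heavily condensed sketch, not the
verbatim source):

    void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
    {
        unsigned long flags;

        spin_lock_irqsave(&tasklet_lock, flags);    /* 1st: enqueue */
        /* ... put the tasklet on the globally locked list ... */
        spin_unlock_irqrestore(&tasklet_lock, flags);
    }

    static void tasklet_softirq_action(void)
    {
        struct tasklet *t;

        spin_lock_irq(&tasklet_lock);               /* 2nd: pick the tasklet */
        /* ... t = first entry on the list; mark it running ... */
        spin_unlock_irq(&tasklet_lock);

        t->func(t->data);

        spin_lock_irq(&tasklet_lock);               /* 3rd: mark it done */
        /* ... clear is_running, possibly re-queue ... */
        spin_unlock_irq(&tasklet_lock);
    }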

This patchset throws away the global list and lock for all
tasklets. Instead there are two per-cpu lists: one for
softirq tasklets, and one run when the scheduler decides to. There
is also a global list and lock for cross-CPU tasklet scheduling
- which thankfully happens rarely (microcode updates and
hypercall continuations).

Insertion into and removal from the lists is done with
interrupts disabled - for short bursts of time. The behavior
of only executing one tasklet per iteration is also
preserved (the Linux code would run through all
of its tasklets).
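
Condensed from the first patch below (fragments only, with the
bookkeeping elided), the new per-cpu fast path is:

    /* Enqueue: no lock, just a short IRQ-off window on the local CPU. */
    local_irq_save(flags);
    list_add_tail(&t->list, &__get_cpu_var(softirq_list));
    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
    local_irq_restore(flags);

    /* Softirq handler: pop exactly one tasklet with IRQs off ... */
    local_irq_disable();
    head = &__get_cpu_var(softirq_list);
    if ( !list_empty(head) )
        t = list_entry(head->next, struct tasklet, list);
    /* ... unlink 't', note whether more work remains ... */
    local_irq_enable();

    /* ... and run it with IRQs enabled, re-raising the softirq if the
     * list was not emptied. */
    t->func(t->data);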

The performance benefits of this patchset were astounding and
removed the issues we saw. It also decreased the latency
of delivering an interrupt to a guest.

In terms of the patchset I chose a path in which:
 0) The first patch fixes the performance bug we saw and
    was easy to backport.
 1) It is bisectable.
 2) If something breaks it should be fairly easy to figure
    out which patch broke it.
 3) It is split up in a somewhat unusual fashion: scaffolding code
    is added to keep things ticking (as at some point we have
    the old and the new implementation co-existing and both in use)
    and is then removed later on. This is how Greg KH added
    kref and kobjects to the kernel a long time ago and it
    worked - so I figured I would borrow that workflow.

I would appreciate feedback from the maintainers on whether they
would like this to be organized differently.

 xen/common/tasklet.c      | 305 +++++++++++++++++++++++++++++++++-------------
 xen/include/xen/tasklet.h |  52 +++++++-
 2 files changed, 271 insertions(+), 86 deletions(-)

Konrad Rzeszutek Wilk (5):
      tasklet: Introduce per-cpu tasklet for softirq.
      tasklet: Add cross CPU feeding of per-cpu tasklets.
      tasklet: Remove the old-softirq implementation.
      tasklet: Introduce per-cpu tasklet for schedule tasklet.
      tasklet: Remove the scaffolding.


* [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
@ 2014-08-27 17:58 ` Konrad Rzeszutek Wilk
  2014-08-27 18:53   ` Andrew Cooper
  2014-08-27 17:58 ` [RFC PATCH v1 2/5] tasklet: Add cross CPU feeding of per-cpu tasklets Konrad Rzeszutek Wilk
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel; +Cc: Konrad Rzeszutek Wilk

This implements a lockless per-cpu tasklet mechanism.

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is by
hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest also suffers.

With this patch the problem disappears completely.
That is, it removes the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case).

As such this patch introduces the code to set up
softirq per-cpu tasklets and only modifies the PCI
passthrough cases instead of doing it wholesale. This
is done because:
 - We want to be able to easily bisect it if things break.
 - We modify the code one section at a time to
   make it easier to review this core code.

Now on to the code itself. The Linux code (softirq.c)
has a per-cpu implementation of tasklets on which
this is based. However there are differences:
 - This patch executes one tasklet at a time - similar
   to how the existing implementation does it.
 - We use a doubly-linked list instead of a singly-linked
   list. We could use a singly-linked list but folks are
   more familiar with the 'list_*' type macros.
 - This patch does not have the cross-CPU feeders
   implemented. That code is in the patch
   titled "tasklet: Add cross CPU feeding of per-cpu
   tasklets". This is done to support
   tasklet_schedule_on_cpu.
 - We add a temporary 'TASKLET_SOFTIRQ_PERCPU' which
   can co-exist with TASKLET_SOFTIRQ. It will be
   replaced in "tasklet: Remove the old-softirq
   implementation".

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/hvm/hvm.c       |   2 +-
 xen/common/tasklet.c         | 129 +++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/io.c |   2 +-
 xen/include/xen/softirq.h    |   1 +
 xen/include/xen/tasklet.h    |  61 ++++++++++++++++++--
 5 files changed, 184 insertions(+), 11 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 94b18ba..4b4cad1 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2270,7 +2270,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
     if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
         goto fail3;
 
-    softirq_tasklet_init(
+    percpu_tasklet_init(
         &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
         (void(*)(unsigned long))hvm_assert_evtchn_irq,
         (unsigned long)v);
diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
index 4e42fa7..319866f 100644
--- a/xen/common/tasklet.c
+++ b/xen/common/tasklet.c
@@ -31,10 +31,30 @@ static DEFINE_PER_CPU(struct list_head, softirq_tasklet_list);
 /* Protects all lists and tasklet structures. */
 static DEFINE_SPINLOCK(tasklet_lock);
 
+static DEFINE_PER_CPU(struct list_head, softirq_list);
+
 static void tasklet_enqueue(struct tasklet *t)
 {
     unsigned int cpu = t->scheduled_on;
 
+    if ( t->is_percpu )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        INIT_LIST_HEAD(&t->list);
+        BUG_ON( !t->is_softirq );
+        BUG_ON( cpu != smp_processor_id() ); /* Not implemented yet. */
+
+        local_irq_save(flags);
+
+        list = &__get_cpu_var(softirq_list);
+        list_add_tail(&t->list, list);
+        raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+
+        local_irq_restore(flags);
+        return;
+    }
     if ( t->is_softirq )
     {
         struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
@@ -56,16 +76,25 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
 {
     unsigned long flags;
 
-    spin_lock_irqsave(&tasklet_lock, flags);
+    if ( !tasklets_initialised || t->is_dead )
+        return;
 
-    if ( tasklets_initialised && !t->is_dead )
+    if ( t->is_percpu )
     {
-        t->scheduled_on = cpu;
-        if ( !t->is_running )
+        if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
         {
-            list_del(&t->list);
+            t->scheduled_on = cpu;
             tasklet_enqueue(t);
         }
+        return;
+    }
+    spin_lock_irqsave(&tasklet_lock, flags);
+
+    t->scheduled_on = cpu;
+    if ( !t->is_running )
+    {
+        list_del(&t->list);
+        tasklet_enqueue(t);
     }
 
     spin_unlock_irqrestore(&tasklet_lock, flags);
@@ -104,6 +133,66 @@ static void do_tasklet_work(unsigned int cpu, struct list_head *list)
     }
 }
 
+void do_tasklet_work_percpu(void)
+{
+    struct tasklet *t = NULL;
+    struct list_head *head;
+    bool_t poke = 0;
+
+    local_irq_disable();
+    head = &__get_cpu_var(softirq_list);
+
+    if ( !list_empty(head) )
+    {
+        t = list_entry(head->next, struct tasklet, list);
+
+        if ( head->next == head->prev ) /* One singular item. Re-init head. */
+            INIT_LIST_HEAD(&__get_cpu_var(softirq_list));
+        else
+        {
+            /* Multiple items, update head to skip 't'. */
+            struct list_head *list;
+
+            /* One item past 't'. */
+            list = head->next->next;
+
+            BUG_ON(list == NULL);
+
+            /* And update head to skip 't'. Note that t->list.prev still
+             * points to head, but we don't care as we only process one tasklet
+             * and once done the tasklet list is re-init one way or another.
+             */
+            head->next = list;
+            poke = 1;
+        }
+    }
+    local_irq_enable();
+
+    if ( !t )
+        return; /* Never saw it happend, but we might have a spurious case? */
+
+    if ( tasklet_trylock(t) )
+    {
+        if ( !test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) )
+                BUG();
+        sync_local_execstate();
+        t->func(t->data);
+        tasklet_unlock(t);
+        if ( poke )
+            raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+        /* We could reinit the t->list but tasklet_enqueue does it for us. */
+        return;
+    }
+
+    local_irq_disable();
+
+    INIT_LIST_HEAD(&t->list);
+    list_add_tail(&t->list, &__get_cpu_var(softirq_list));
+    smp_wmb();
+    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+    local_irq_enable();
+}
+
 /* VCPU context work */
 void do_tasklet(void)
 {
@@ -147,10 +236,29 @@ static void tasklet_softirq_action(void)
     spin_unlock_irq(&tasklet_lock);
 }
 
+/* Per CPU softirq context work. */
+static void tasklet_softirq_percpu_action(void)
+{
+    do_tasklet_work_percpu();
+}
+
 void tasklet_kill(struct tasklet *t)
 {
     unsigned long flags;
 
+    if ( t->is_percpu )
+    {
+        while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
+        {
+            do {
+                process_pending_softirqs();
+            } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
+        }
+        tasklet_unlock_wait(t);
+        clear_bit(TASKLET_STATE_SCHED, &t->state);
+        t->is_dead = 1;
+        return;
+    }
     spin_lock_irqsave(&tasklet_lock, flags);
 
     if ( !list_empty(&t->list) )
@@ -208,6 +316,14 @@ void softirq_tasklet_init(
     t->is_softirq = 1;
 }
 
+void percpu_tasklet_init(
+    struct tasklet *t, void (*func)(unsigned long), unsigned long data)
+{
+    tasklet_init(t, func, data);
+    t->is_softirq = 1;
+    t->is_percpu = 1;
+}
+
 static int cpu_callback(
     struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
@@ -218,11 +334,13 @@ static int cpu_callback(
     case CPU_UP_PREPARE:
         INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
         INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
+        INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
         migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
+        migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
         break;
     default:
         break;
@@ -242,6 +360,7 @@ void __init tasklet_subsys_init(void)
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(TASKLET_SOFTIRQ, tasklet_softirq_action);
+    open_softirq(TASKLET_SOFTIRQ_PERCPU, tasklet_softirq_percpu_action);
     tasklets_initialised = 1;
 }
 
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..740cee5 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -114,7 +114,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
+        percpu_tasklet_init(
             &hvm_irq_dpci->dirq_tasklet,
             hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..8b15c8c 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -7,6 +7,7 @@ enum {
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
+    TASKLET_SOFTIRQ_PERCPU,
     TASKLET_SOFTIRQ,
     NR_COMMON_SOFTIRQS
 };
diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
index 8c3de7e..9497c47 100644
--- a/xen/include/xen/tasklet.h
+++ b/xen/include/xen/tasklet.h
@@ -17,21 +17,24 @@
 struct tasklet
 {
     struct list_head list;
+    unsigned long state;
     int scheduled_on;
     bool_t is_softirq;
     bool_t is_running;
     bool_t is_dead;
+    bool_t is_percpu;
     void (*func)(unsigned long);
     unsigned long data;
 };
 
-#define _DECLARE_TASKLET(name, func, data, softirq)                     \
+#define _DECLARE_TASKLET(name, func, data, softirq, percpu)             \
     struct tasklet name = {                                             \
-        LIST_HEAD_INIT(name.list), -1, softirq, 0, 0, func, data }
+        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0, percpu,        \
+        func, data }
 #define DECLARE_TASKLET(name, func, data)               \
-    _DECLARE_TASKLET(name, func, data, 0)
+    _DECLARE_TASKLET(name, func, data, 0, 0)
 #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
-    _DECLARE_TASKLET(name, func, data, 1)
+    _DECLARE_TASKLET(name, func, data, 1, 0)
 
 /* Indicates status of tasklet work on each CPU. */
 DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
@@ -40,6 +43,54 @@ DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
 #define TASKLET_enqueued   (1ul << _TASKLET_enqueued)
 #define TASKLET_scheduled  (1ul << _TASKLET_scheduled)
 
+/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in tasklet life-time.
+ *  a) 0x0: Completely empty (not scheduled nor running).
+ *  b) 0x1: Scheduled but not running. Used to guard in 'tasklet_schedule'
+ *     such that we will only schedule one. If it is scheduled and had never
+ *     run (hence never clearing STATE_SCHED bit), tasklet_kill will spin
+ *     forever on said tasklet. However 'tasklet_schedule' raises the
+ *     softirq associated with the per-cpu - so it will run, albeit there might
+ *     be a race (tasklet_kill spinning until the softirq handler runs).
+ *  c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ *     CPU. The bit 0 - scheduled is cleared at this stage allowing
+ *     'tasklet_schedule' to succesfully schedule.
+ *  d) 0x3: scheduled and running - only possible if the running tasklet calls
+ *     tasklet_schedule (on same CPU) or the tasklet is scheduled from another
+ *     CPU while the tasklet is running on another CPU.
+ *
+ * The two bits play a vital role in assuring that the tasklet is scheduled
+ * once and runs only once. The steps are:
+ *
+ *  1) tasklet_schedule: STATE_SCHED bit set (0x1), added on the per cpu list.
+ *  2) tasklet_softirq_percpu_action picks one tasklet from the list. Schedules
+ *  itself later if there are more tasklets on it. Tries to set STATE_RUN bit
+ *  (0x3) - if it fails adds tasklet back to the per-cpu list. If it succeeds
+ *  clears the STATE_SCHED bit (0x2). Once tasklet completed, unsets STATE_RUN
+ *  (0x0 or 0x1 if tasklet called tasklet_schedule).
+ */
+enum {
+    TASKLET_STATE_SCHED, /* Bit 0 */
+    TASKLET_STATE_RUN
+};
+
+static inline int tasklet_trylock(struct tasklet *t)
+{
+    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
+}
+
+static inline void tasklet_unlock(struct tasklet *t)
+{
+    barrier();
+    clear_bit(TASKLET_STATE_RUN, &(t)->state);
+}
+static inline void tasklet_unlock_wait(struct tasklet *t)
+{
+    while (test_bit(TASKLET_STATE_RUN, &(t)->state))
+    {
+        barrier();
+    }
+}
 void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu);
 void tasklet_schedule(struct tasklet *t);
 void do_tasklet(void);
@@ -48,6 +99,8 @@ void tasklet_init(
     struct tasklet *t, void (*func)(unsigned long), unsigned long data);
 void softirq_tasklet_init(
     struct tasklet *t, void (*func)(unsigned long), unsigned long data);
+void percpu_tasklet_init(
+    struct tasklet *t, void (*func)(unsigned long), unsigned long data);
 void tasklet_subsys_init(void);
 
 #endif /* __XEN_TASKLET_H__ */
-- 
1.9.3


* [RFC PATCH v1 2/5] tasklet: Add cross CPU feeding of per-cpu tasklets.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq Konrad Rzeszutek Wilk
@ 2014-08-27 17:58 ` Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 3/5] tasklet: Remove the old-softirq implementation Konrad Rzeszutek Wilk
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel; +Cc: Konrad Rzeszutek Wilk

Since the per-cpu tasklets are lockfree and run only
within their own CPU context, we need a mechanism
for 'tasklet_schedule_on_cpu' called from other CPUs to
insert tasklets into the destination's per-cpu lists.

We use an IPI mechanism whose handler implements a
per-cpu 'feeding' list protected by a global lock.

When the IPI arrives on the target CPU it drains
its per-cpu feeding list and adds the entries to the per-cpu
tasklet list. With that in place we can now swap
over all of the softirq_tasklet users to use it.
That means we can also eliminate the percpu_tasklet
scaffolding.
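
Condensed from the diff below (with the BUG_ON checks and the
empty-list handling elided), the cross-CPU path is:

    /* Scheduling CPU: park the tasklet on the target CPU's feeder list
     * and send it an IPI. */
    spin_lock_irqsave(&feeder_lock, flags);
    list_add_tail(&t->list, &per_cpu(tasklet_feeder, cpu));
    spin_unlock_irqrestore(&feeder_lock, flags);
    on_selected_cpus(cpumask_of(cpu), percpu_tasklet_feed, NULL, 1);

    /* Target CPU, in the IPI handler (percpu_tasklet_feed): move every
     * entry from the feeder list onto the local softirq list and kick
     * the softirq. */
    spin_lock_irqsave(&feeder_lock, flags);
    while ( !list_empty(list) )
    {
        t = list_entry(list->next, struct tasklet, list);
        list_del(&t->list);
        list_add_tail(&t->list, &__get_cpu_var(softirq_list));
    }
    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
    spin_unlock_irqrestore(&feeder_lock, flags);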

Note that we don't schedule tasklets on other CPUs
that often - only during microcode updates and when
doing hypercall continuations.

This could be squashed into "tasklet: Introduce per-cpu
tasklet for softirq." but the author thought the code
would be easier to understand with these parts
split out.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/hvm/hvm.c       |  2 +-
 xen/common/tasklet.c         | 50 +++++++++++++++++++++++++++++++++++++-------
 xen/drivers/passthrough/io.c |  2 +-
 xen/include/xen/tasklet.h    |  2 --
 4 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 4b4cad1..94b18ba 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2270,7 +2270,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
     if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
         goto fail3;
 
-    percpu_tasklet_init(
+    softirq_tasklet_init(
         &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
         (void(*)(unsigned long))hvm_assert_evtchn_irq,
         (unsigned long)v);
diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
index 319866f..d8f3cb3 100644
--- a/xen/common/tasklet.c
+++ b/xen/common/tasklet.c
@@ -30,8 +30,36 @@ static DEFINE_PER_CPU(struct list_head, softirq_tasklet_list);
 
 /* Protects all lists and tasklet structures. */
 static DEFINE_SPINLOCK(tasklet_lock);
+static DEFINE_SPINLOCK(feeder_lock);
 
 static DEFINE_PER_CPU(struct list_head, softirq_list);
+static DEFINE_PER_CPU(struct list_head, tasklet_feeder);
+
+static void percpu_tasklet_feed(void *arg)
+{
+    unsigned long flags;
+    struct tasklet *t;
+    struct list_head *dst_list;
+    struct list_head *list = &__get_cpu_var(tasklet_feeder);
+
+    spin_lock_irqsave(&feeder_lock, flags);
+
+    if ( list_empty(list) )
+        goto out;
+
+    while ( !list_empty(list) )
+    {
+        t = list_entry(list->next, struct tasklet, list);
+        BUG_ON(!t->is_percpu);
+        list_del(&t->list);
+
+        dst_list = &__get_cpu_var(softirq_list);
+        list_add_tail(&t->list, dst_list);
+    }
+    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+out:
+    spin_unlock_irqrestore(&feeder_lock, flags);
+}
 
 static void tasklet_enqueue(struct tasklet *t)
 {
@@ -44,7 +72,18 @@ static void tasklet_enqueue(struct tasklet *t)
 
         INIT_LIST_HEAD(&t->list);
         BUG_ON( !t->is_softirq );
-        BUG_ON( cpu != smp_processor_id() ); /* Not implemented yet. */
+
+        if ( cpu != smp_processor_id() )
+        {
+            spin_lock_irqsave(&feeder_lock, flags);
+
+            list = &per_cpu(tasklet_feeder, cpu);
+            list_add_tail(&t->list, list);
+
+            spin_unlock_irqrestore(&feeder_lock, flags);
+            on_selected_cpus(cpumask_of(cpu), percpu_tasklet_feed, NULL, 1);
+            return;
+        }
 
         local_irq_save(flags);
 
@@ -314,13 +353,6 @@ void softirq_tasklet_init(
 {
     tasklet_init(t, func, data);
     t->is_softirq = 1;
-}
-
-void percpu_tasklet_init(
-    struct tasklet *t, void (*func)(unsigned long), unsigned long data)
-{
-    tasklet_init(t, func, data);
-    t->is_softirq = 1;
     t->is_percpu = 1;
 }
 
@@ -335,12 +367,14 @@ static int cpu_callback(
         INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
         INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
         INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
+        INIT_LIST_HEAD(&per_cpu(tasklet_feeder, cpu));
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
         migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
+        migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_feeder, cpu));
         break;
     default:
         break;
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 740cee5..ef75b94 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -114,7 +114,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        percpu_tasklet_init(
+        softirq_tasklet_init(
             &hvm_irq_dpci->dirq_tasklet,
             hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
index 9497c47..530a5e7 100644
--- a/xen/include/xen/tasklet.h
+++ b/xen/include/xen/tasklet.h
@@ -99,8 +99,6 @@ void tasklet_init(
     struct tasklet *t, void (*func)(unsigned long), unsigned long data);
 void softirq_tasklet_init(
     struct tasklet *t, void (*func)(unsigned long), unsigned long data);
-void percpu_tasklet_init(
-    struct tasklet *t, void (*func)(unsigned long), unsigned long data);
 void tasklet_subsys_init(void);
 
 #endif /* __XEN_TASKLET_H__ */
-- 
1.9.3


* [RFC PATCH v1 3/5] tasklet: Remove the old-softirq implementation.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 2/5] tasklet: Add cross CPU feeding of per-cpu tasklets Konrad Rzeszutek Wilk
@ 2014-08-27 17:58 ` Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 4/5] tasklet: Introduce per-cpu tasklet for schedule tasklet Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel; +Cc: Konrad Rzeszutek Wilk

With the new percpu tasklets (see "tasklet: Introduce per-cpu tasklet
for softirq." and "tasklet: Add cross CPU feeding of per-cpu tasklets.")
we now have in place a working version of per-cpu softirq tasklets.

We can now remove the old implementation of the
softirq tasklet. We also remove the temporary scaffolding
of TASKLET_SOFTIRQ_PERCPU. Further removal of code will
be done in "tasklet: Remove the scaffolding." once the
schedule tasklet code is in.

This could be squashed into "tasklet: Introduce per-cpu
tasklet for softirq." but the author thought the code
would be easier to understand with these parts
split out.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/common/tasklet.c      | 35 ++++++-----------------------------
 xen/include/xen/softirq.h |  1 -
 xen/include/xen/tasklet.h |  2 +-
 3 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
index d8f3cb3..ab23338 100644
--- a/xen/common/tasklet.c
+++ b/xen/common/tasklet.c
@@ -26,7 +26,6 @@ static bool_t tasklets_initialised;
 DEFINE_PER_CPU(unsigned long, tasklet_work_to_do);
 
 static DEFINE_PER_CPU(struct list_head, tasklet_list);
-static DEFINE_PER_CPU(struct list_head, softirq_tasklet_list);
 
 /* Protects all lists and tasklet structures. */
 static DEFINE_SPINLOCK(tasklet_lock);
@@ -56,7 +55,7 @@ static void percpu_tasklet_feed(void *arg)
         dst_list = &__get_cpu_var(softirq_list);
         list_add_tail(&t->list, dst_list);
     }
-    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+    raise_softirq(TASKLET_SOFTIRQ);
 out:
     spin_unlock_irqrestore(&feeder_lock, flags);
 }
@@ -89,18 +88,14 @@ static void tasklet_enqueue(struct tasklet *t)
 
         list = &__get_cpu_var(softirq_list);
         list_add_tail(&t->list, list);
-        raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+        raise_softirq(TASKLET_SOFTIRQ);
 
         local_irq_restore(flags);
         return;
     }
     if ( t->is_softirq )
     {
-        struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
-        bool_t was_empty = list_empty(list);
-        list_add_tail(&t->list, list);
-        if ( was_empty )
-            cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+        BUG();
     }
     else
     {
@@ -218,7 +213,7 @@ void do_tasklet_work_percpu(void)
         t->func(t->data);
         tasklet_unlock(t);
         if ( poke )
-            raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+            raise_softirq(TASKLET_SOFTIRQ);
         /* We could reinit the t->list but tasklet_enqueue does it for us. */
         return;
     }
@@ -228,7 +223,7 @@ void do_tasklet_work_percpu(void)
     INIT_LIST_HEAD(&t->list);
     list_add_tail(&t->list, &__get_cpu_var(softirq_list));
     smp_wmb();
-    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
+    raise_softirq(TASKLET_SOFTIRQ);
     local_irq_enable();
 }
 
@@ -259,24 +254,9 @@ void do_tasklet(void)
     spin_unlock_irq(&tasklet_lock);
 }
 
-/* Softirq context work */
-static void tasklet_softirq_action(void)
-{
-    unsigned int cpu = smp_processor_id();
-    struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
-
-    spin_lock_irq(&tasklet_lock);
-
-    do_tasklet_work(cpu, list);
-
-    if ( !list_empty(list) && !cpu_is_offline(cpu) )
-        raise_softirq(TASKLET_SOFTIRQ);
-
-    spin_unlock_irq(&tasklet_lock);
-}
 
 /* Per CPU softirq context work. */
-static void tasklet_softirq_percpu_action(void)
+static void tasklet_softirq_action(void)
 {
     do_tasklet_work_percpu();
 }
@@ -365,14 +345,12 @@ static int cpu_callback(
     {
     case CPU_UP_PREPARE:
         INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
-        INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
         INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
         INIT_LIST_HEAD(&per_cpu(tasklet_feeder, cpu));
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
         migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
-        migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_feeder, cpu));
         break;
@@ -394,7 +372,6 @@ void __init tasklet_subsys_init(void)
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(TASKLET_SOFTIRQ, tasklet_softirq_action);
-    open_softirq(TASKLET_SOFTIRQ_PERCPU, tasklet_softirq_percpu_action);
     tasklets_initialised = 1;
 }
 
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 8b15c8c..0c0d481 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -7,7 +7,6 @@ enum {
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
-    TASKLET_SOFTIRQ_PERCPU,
     TASKLET_SOFTIRQ,
     NR_COMMON_SOFTIRQS
 };
diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
index 530a5e7..21efe7b 100644
--- a/xen/include/xen/tasklet.h
+++ b/xen/include/xen/tasklet.h
@@ -34,7 +34,7 @@ struct tasklet
 #define DECLARE_TASKLET(name, func, data)               \
     _DECLARE_TASKLET(name, func, data, 0, 0)
 #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
-    _DECLARE_TASKLET(name, func, data, 1, 0)
+    _DECLARE_TASKLET(name, func, data, 1, 1)
 
 /* Indicates status of tasklet work on each CPU. */
 DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
-- 
1.9.3


* [RFC PATCH v1 4/5] tasklet: Introduce per-cpu tasklet for schedule tasklet.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2014-08-27 17:58 ` [RFC PATCH v1 3/5] tasklet: Remove the old-softirq implementation Konrad Rzeszutek Wilk
@ 2014-08-27 17:58 ` Konrad Rzeszutek Wilk
  2014-08-27 17:58 ` [RFC PATCH v1 5/5] tasklet: Remove the scaffolding Konrad Rzeszutek Wilk
  2014-08-28 12:39 ` [RFC PATCH v1] Replace tasklets with per-cpu implementation Jan Beulich
  5 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel; +Cc: Konrad Rzeszutek Wilk

With the "tasklet: Introduce per-cpu tasklet for softirq."
and "tasklet: Add cross CPU feeding of per-cpu tasklets."
we have all the neccessary pieces to swap over the
schedule tasklet. We re-use the tasklet_list.

It also removes the old implementation in one swoop.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/common/tasklet.c | 193 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 112 insertions(+), 81 deletions(-)

diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
index ab23338..618e73b 100644
--- a/xen/common/tasklet.c
+++ b/xen/common/tasklet.c
@@ -25,12 +25,10 @@ static bool_t tasklets_initialised;
 
 DEFINE_PER_CPU(unsigned long, tasklet_work_to_do);
 
-static DEFINE_PER_CPU(struct list_head, tasklet_list);
-
-/* Protects all lists and tasklet structures. */
-static DEFINE_SPINLOCK(tasklet_lock);
+/* Protects the tasklet_feeder list. */
 static DEFINE_SPINLOCK(feeder_lock);
 
+static DEFINE_PER_CPU(struct list_head, tasklet_list);
 static DEFINE_PER_CPU(struct list_head, softirq_list);
 static DEFINE_PER_CPU(struct list_head, tasklet_feeder);
 
@@ -40,6 +38,8 @@ static void percpu_tasklet_feed(void *arg)
     struct tasklet *t;
     struct list_head *dst_list;
     struct list_head *list = &__get_cpu_var(tasklet_feeder);
+    unsigned long *work_to_do = &__get_cpu_var(tasklet_work_to_do);
+    bool_t poke = 0;
 
     spin_lock_irqsave(&feeder_lock, flags);
 
@@ -52,10 +52,23 @@ static void percpu_tasklet_feed(void *arg)
         BUG_ON(!t->is_percpu);
         list_del(&t->list);
 
-        dst_list = &__get_cpu_var(softirq_list);
+        if ( t->is_softirq )
+        {
+            dst_list = &__get_cpu_var(softirq_list);
+            poke = 1;
+        }
+        else
+            dst_list = &__get_cpu_var(tasklet_list);
+
         list_add_tail(&t->list, dst_list);
     }
-    raise_softirq(TASKLET_SOFTIRQ);
+    if ( poke )
+        raise_softirq(TASKLET_SOFTIRQ);
+    else
+    {
+        if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
+            raise_softirq(SCHEDULE_SOFTIRQ);
+    }
 out:
     spin_unlock_irqrestore(&feeder_lock, flags);
 }
@@ -70,7 +83,6 @@ static void tasklet_enqueue(struct tasklet *t)
         struct list_head *list;
 
         INIT_LIST_HEAD(&t->list);
-        BUG_ON( !t->is_softirq );
 
         if ( cpu != smp_processor_id() )
         {
@@ -83,15 +95,32 @@ static void tasklet_enqueue(struct tasklet *t)
             on_selected_cpus(cpumask_of(cpu), percpu_tasklet_feed, NULL, 1);
             return;
         }
+        if ( t->is_softirq )
+        {
 
-        local_irq_save(flags);
+            local_irq_save(flags);
 
-        list = &__get_cpu_var(softirq_list);
-        list_add_tail(&t->list, list);
-        raise_softirq(TASKLET_SOFTIRQ);
+            list = &__get_cpu_var(softirq_list);
+            list_add_tail(&t->list, list);
+            raise_softirq(TASKLET_SOFTIRQ);
 
-        local_irq_restore(flags);
-        return;
+            local_irq_restore(flags);
+            return;
+        }
+        else
+        {
+            unsigned long *work_to_do = &__get_cpu_var(tasklet_work_to_do);
+
+            local_irq_save(flags);
+
+            list = &__get_cpu_var(tasklet_list);
+            list_add_tail(&t->list, list);
+            if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
+                raise_softirq(SCHEDULE_SOFTIRQ);
+
+            local_irq_restore(flags);
+            return;
+        }
     }
     if ( t->is_softirq )
     {
@@ -99,17 +128,12 @@ static void tasklet_enqueue(struct tasklet *t)
     }
     else
     {
-        unsigned long *work_to_do = &per_cpu(tasklet_work_to_do, cpu);
-        list_add_tail(&t->list, &per_cpu(tasklet_list, cpu));
-        if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
-            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+        BUG();
     }
 }
 
 void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
 {
-    unsigned long flags;
-
     if ( !tasklets_initialised || t->is_dead )
         return;
 
@@ -122,16 +146,7 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
         }
         return;
     }
-    spin_lock_irqsave(&tasklet_lock, flags);
-
-    t->scheduled_on = cpu;
-    if ( !t->is_running )
-    {
-        list_del(&t->list);
-        tasklet_enqueue(t);
-    }
-
-    spin_unlock_irqrestore(&tasklet_lock, flags);
+    BUG();
 }
 
 void tasklet_schedule(struct tasklet *t)
@@ -139,32 +154,67 @@ void tasklet_schedule(struct tasklet *t)
     tasklet_schedule_on_cpu(t, smp_processor_id());
 }
 
-static void do_tasklet_work(unsigned int cpu, struct list_head *list)
+/* Return 0 if there is more work to be done. */
+static int do_tasklet_work(void)
 {
-    struct tasklet *t;
+    struct tasklet *t = NULL;
+    struct list_head *head;
+    int rc = 1; /* All done. */
 
-    if ( unlikely(list_empty(list) || cpu_is_offline(cpu)) )
-        return;
+    local_irq_disable();
+    head = &__get_cpu_var(tasklet_list);
 
-    t = list_entry(list->next, struct tasklet, list);
-    list_del_init(&t->list);
+    if ( !list_empty(head) )
+    {
+        t = list_entry(head->next, struct tasklet, list);
 
-    BUG_ON(t->is_dead || t->is_running || (t->scheduled_on != cpu));
-    t->scheduled_on = -1;
-    t->is_running = 1;
+        if ( head->next == head->prev ) /* One singular item. Re-init head. */
+            INIT_LIST_HEAD(&__get_cpu_var(tasklet_list));
+        else
+        {
+            /* Multiple items, update head to skip 't'. */
+            struct list_head *list;
 
-    spin_unlock_irq(&tasklet_lock);
-    sync_local_execstate();
-    t->func(t->data);
-    spin_lock_irq(&tasklet_lock);
+            /* One item past 't'. */
+            list = head->next->next;
 
-    t->is_running = 0;
+            BUG_ON(list == NULL);
+
+            /* And update head to skip 't'. Note that t->list.prev still
+             * points to head, but we don't care as we only process one tasklet
+             * and once done the tasklet list is re-init one way or another.
+             */
+            head->next = list;
+            rc = 0; /* More work to be done. */
+        }
+    }
+    local_irq_enable();
 
-    if ( t->scheduled_on >= 0 )
+    if ( !t )
+        return 1; /* Never saw it happend, but we might have a spurious case? */
+
+    if ( tasklet_trylock(t) )
     {
-        BUG_ON(t->is_dead || !list_empty(&t->list));
-        tasklet_enqueue(t);
+        if ( !test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) )
+                BUG();
+        sync_local_execstate();
+        t->func(t->data);
+        tasklet_unlock(t);
+        if ( rc == 0 )
+            raise_softirq(TASKLET_SOFTIRQ);
+        /* We could reinit the t->list but tasklet_enqueue does it for us. */
+        return rc;
     }
+
+    local_irq_disable();
+
+    INIT_LIST_HEAD(&t->list);
+    list_add_tail(&t->list, &__get_cpu_var(tasklet_list));
+    smp_wmb();
+    raise_softirq(TASKLET_SOFTIRQ);
+    local_irq_enable();
+
+    return 0; /* More to do. */
 }
 
 void do_tasklet_work_percpu(void)
@@ -232,7 +282,6 @@ void do_tasklet(void)
 {
     unsigned int cpu = smp_processor_id();
     unsigned long *work_to_do = &per_cpu(tasklet_work_to_do, cpu);
-    struct list_head *list = &per_cpu(tasklet_list, cpu);
 
     /*
      * Work must be enqueued *and* scheduled. Otherwise there is no work to
@@ -241,17 +290,11 @@ void do_tasklet(void)
     if ( likely(*work_to_do != (TASKLET_enqueued|TASKLET_scheduled)) )
         return;
 
-    spin_lock_irq(&tasklet_lock);
-
-    do_tasklet_work(cpu, list);
-
-    if ( list_empty(list) )
+    if ( do_tasklet_work() )
     {
-        clear_bit(_TASKLET_enqueued, work_to_do);        
+        clear_bit(_TASKLET_enqueued, work_to_do);
         raise_softirq(SCHEDULE_SOFTIRQ);
     }
-
-    spin_unlock_irq(&tasklet_lock);
 }
 
 
@@ -263,8 +306,6 @@ static void tasklet_softirq_action(void)
 
 void tasklet_kill(struct tasklet *t)
 {
-    unsigned long flags;
-
     if ( t->is_percpu )
     {
         while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
@@ -278,25 +319,6 @@ void tasklet_kill(struct tasklet *t)
         t->is_dead = 1;
         return;
     }
-    spin_lock_irqsave(&tasklet_lock, flags);
-
-    if ( !list_empty(&t->list) )
-    {
-        BUG_ON(t->is_dead || t->is_running || (t->scheduled_on < 0));
-        list_del_init(&t->list);
-    }
-
-    t->scheduled_on = -1;
-    t->is_dead = 1;
-
-    while ( t->is_running )
-    {
-        spin_unlock_irqrestore(&tasklet_lock, flags);
-        cpu_relax();
-        spin_lock_irqsave(&tasklet_lock, flags);
-    }
-
-    spin_unlock_irqrestore(&tasklet_lock, flags);
 }
 
 static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
@@ -304,7 +326,7 @@ static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
     unsigned long flags;
     struct tasklet *t;
 
-    spin_lock_irqsave(&tasklet_lock, flags);
+    spin_lock_irqsave(&feeder_lock, flags);
 
     while ( !list_empty(list) )
     {
@@ -315,7 +337,7 @@ static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
         tasklet_enqueue(t);
     }
 
-    spin_unlock_irqrestore(&tasklet_lock, flags);
+    spin_unlock_irqrestore(&feeder_lock, flags);
 }
 
 void tasklet_init(
@@ -326,6 +348,7 @@ void tasklet_init(
     t->scheduled_on = -1;
     t->func = func;
     t->data = data;
+    t->is_percpu = 1;
 }
 
 void softirq_tasklet_init(
@@ -333,26 +356,34 @@ void softirq_tasklet_init(
 {
     tasklet_init(t, func, data);
     t->is_softirq = 1;
-    t->is_percpu = 1;
 }
 
 static int cpu_callback(
     struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
     unsigned int cpu = (unsigned long)hcpu;
+    unsigned long *work_to_do;
 
     switch ( action )
     {
     case CPU_UP_PREPARE:
-        INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
         INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
         INIT_LIST_HEAD(&per_cpu(tasklet_feeder, cpu));
+        INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
         break;
     case CPU_UP_CANCELED:
     case CPU_DEAD:
-        migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
         migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_feeder, cpu));
+        migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
+
+        work_to_do = &per_cpu(tasklet_work_to_do, cpu);
+        if ( test_bit(_TASKLET_enqueued, work_to_do) )
+        {
+            work_to_do = &__get_cpu_var(tasklet_work_to_do);
+            if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
+                raise_softirq(SCHEDULE_SOFTIRQ);
+        }
         break;
     default:
         break;
-- 
1.9.3


* [RFC PATCH v1 5/5] tasklet: Remove the scaffolding.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2014-08-27 17:58 ` [RFC PATCH v1 4/5] tasklet: Introduce per-cpu tasklet for schedule tasklet Konrad Rzeszutek Wilk
@ 2014-08-27 17:58 ` Konrad Rzeszutek Wilk
  2014-08-28 12:39 ` [RFC PATCH v1] Replace tasklets with per-cpu implementation Jan Beulich
  5 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 17:58 UTC (permalink / raw)
  To: jbeulich, keir, xen-devel; +Cc: Konrad Rzeszutek Wilk

To catch any bisection issues, we had been replacing parts of
the tasklet code one piece of functionality at a time. Now
that all of it is per-cpu and working we can remove the
old scaffolding and collapse functions.

We also remove the 'is_percpu' flag, which is not needed
anymore. Most of this is code deletion and code
motion. No new functionality is added.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/common/tasklet.c      | 110 ++++++++++++++++++----------------------------
 xen/include/xen/tasklet.h |   9 ++--
 2 files changed, 46 insertions(+), 73 deletions(-)

diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
index 618e73b..192fa79 100644
--- a/xen/common/tasklet.c
+++ b/xen/common/tasklet.c
@@ -49,7 +49,6 @@ static void percpu_tasklet_feed(void *arg)
     while ( !list_empty(list) )
     {
         t = list_entry(list->next, struct tasklet, list);
-        BUG_ON(!t->is_percpu);
         list_del(&t->list);
 
         if ( t->is_softirq )
@@ -76,59 +75,44 @@ out:
 static void tasklet_enqueue(struct tasklet *t)
 {
     unsigned int cpu = t->scheduled_on;
+    unsigned long flags;
+    struct list_head *list;
 
-    if ( t->is_percpu )
-    {
-        unsigned long flags;
-        struct list_head *list;
-
-        INIT_LIST_HEAD(&t->list);
-
-        if ( cpu != smp_processor_id() )
-        {
-            spin_lock_irqsave(&feeder_lock, flags);
-
-            list = &per_cpu(tasklet_feeder, cpu);
-            list_add_tail(&t->list, list);
-
-            spin_unlock_irqrestore(&feeder_lock, flags);
-            on_selected_cpus(cpumask_of(cpu), percpu_tasklet_feed, NULL, 1);
-            return;
-        }
-        if ( t->is_softirq )
-        {
-
-            local_irq_save(flags);
-
-            list = &__get_cpu_var(softirq_list);
-            list_add_tail(&t->list, list);
-            raise_softirq(TASKLET_SOFTIRQ);
+    INIT_LIST_HEAD(&t->list);
 
-            local_irq_restore(flags);
-            return;
-        }
-        else
-        {
-            unsigned long *work_to_do = &__get_cpu_var(tasklet_work_to_do);
+    if ( cpu != smp_processor_id() )
+    {
+        spin_lock_irqsave(&feeder_lock, flags);
 
-            local_irq_save(flags);
+        list = &per_cpu(tasklet_feeder, cpu);
+        list_add_tail(&t->list, list);
 
-            list = &__get_cpu_var(tasklet_list);
-            list_add_tail(&t->list, list);
-            if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
-                raise_softirq(SCHEDULE_SOFTIRQ);
+        spin_unlock_irqrestore(&feeder_lock, flags);
+        on_selected_cpus(cpumask_of(cpu), percpu_tasklet_feed, NULL, 1);
+        return;
+     }
+     if ( t->is_softirq )
+     {
+         local_irq_save(flags);
+
+         list = &__get_cpu_var(softirq_list);
+         list_add_tail(&t->list, list);
+         raise_softirq(TASKLET_SOFTIRQ);
+
+         local_irq_restore(flags);
+     }
+     else
+     {
+          unsigned long *work_to_do = &__get_cpu_var(tasklet_work_to_do);
+
+          local_irq_save(flags);
+
+          list = &__get_cpu_var(tasklet_list);
+          list_add_tail(&t->list, list);
+          if ( !test_and_set_bit(_TASKLET_enqueued, work_to_do) )
+            raise_softirq(SCHEDULE_SOFTIRQ);
 
-            local_irq_restore(flags);
-            return;
-        }
-    }
-    if ( t->is_softirq )
-    {
-        BUG();
-    }
-    else
-    {
-        BUG();
+          local_irq_restore(flags);
     }
 }
 
@@ -137,16 +121,11 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
     if ( !tasklets_initialised || t->is_dead )
         return;
 
-    if ( t->is_percpu )
+    if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
     {
-        if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
-        {
-            t->scheduled_on = cpu;
-            tasklet_enqueue(t);
-        }
-        return;
+        t->scheduled_on = cpu;
+        tasklet_enqueue(t);
     }
-    BUG();
 }
 
 void tasklet_schedule(struct tasklet *t)
@@ -306,19 +285,15 @@ static void tasklet_softirq_action(void)
 
 void tasklet_kill(struct tasklet *t)
 {
-    if ( t->is_percpu )
+    while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
     {
-        while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
-        {
-            do {
+        do {
                 process_pending_softirqs();
-            } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
-        }
-        tasklet_unlock_wait(t);
-        clear_bit(TASKLET_STATE_SCHED, &t->state);
-        t->is_dead = 1;
-        return;
+        } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
     }
+    tasklet_unlock_wait(t);
+    clear_bit(TASKLET_STATE_SCHED, &t->state);
+    t->is_dead = 1;
 }
 
 static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
@@ -348,7 +323,6 @@ void tasklet_init(
     t->scheduled_on = -1;
     t->func = func;
     t->data = data;
-    t->is_percpu = 1;
 }
 
 void softirq_tasklet_init(
diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
index 21efe7b..b7a6a81 100644
--- a/xen/include/xen/tasklet.h
+++ b/xen/include/xen/tasklet.h
@@ -22,19 +22,18 @@ struct tasklet
     bool_t is_softirq;
     bool_t is_running;
     bool_t is_dead;
-    bool_t is_percpu;
     void (*func)(unsigned long);
     unsigned long data;
 };
 
-#define _DECLARE_TASKLET(name, func, data, softirq, percpu)             \
+#define _DECLARE_TASKLET(name, func, data, softirq)                     \
     struct tasklet name = {                                             \
-        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0, percpu,        \
+        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0,                \
         func, data }
 #define DECLARE_TASKLET(name, func, data)               \
-    _DECLARE_TASKLET(name, func, data, 0, 0)
+    _DECLARE_TASKLET(name, func, data, 0)
 #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
-    _DECLARE_TASKLET(name, func, data, 1, 1)
+    _DECLARE_TASKLET(name, func, data, 1)
 
 /* Indicates status of tasklet work on each CPU. */
 DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
-- 
1.9.3


* Re: [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.
  2014-08-27 17:58 ` [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq Konrad Rzeszutek Wilk
@ 2014-08-27 18:53   ` Andrew Cooper
  2014-08-27 19:06     ` Konrad Rzeszutek Wilk
  2014-08-28  8:17     ` Jan Beulich
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Cooper @ 2014-08-27 18:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, jbeulich, keir, xen-devel

On 27/08/14 18:58, Konrad Rzeszutek Wilk wrote:
> This implements a lockless per-cpu tasklet mechanism.
>
> The existing tasklet mechanism has a single global
> spinlock that is taken every time the global list
> is touched. And we use this lock quite a lot - when
> we call do_tasklet_work, which is called via a softirq
> and from the idle loop. We take the lock on any
> operation on the tasklet_list.
>
> The problem we are facing is that there are quite a lot of
> tasklets scheduled. The most common one that is invoked is
> the one injecting the VIRQ_TIMER into the guest. Guests
> are not insane and don't set the one-shot or periodic
> clocks to sub-1ms intervals (which would cause said tasklet
> to be scheduled at such small intervals).
>
> The problem appears when PCI passthrough devices are used
> over many sockets and we have a mix of heavy-interrupt
> guests and idle guests. The idle guests end up seeing
> 1/10 of their RUNNING timeslice eaten by the hypervisor
> (and 40% steal time).
>
> The mechanism by which we inject PCI interrupts is by
> hvm_do_IRQ_dpci which schedules the hvm_dirq_assist
> tasklet every time an interrupt is received.
> The callchain is:
>
> _asm_vmexit_handler
>  -> vmx_vmexit_handler
>     ->vmx_do_extint
>         -> do_IRQ
>             -> __do_IRQ_guest
>                 -> hvm_do_IRQ_dpci
>                    tasklet_schedule(&dpci->dirq_tasklet);
>                    [takes lock to put the tasklet on]
>
> [later on the schedule_tail is invoked which is 'vmx_do_resume']
>
> vmx_do_resume
>  -> vmx_asm_do_vmentry
>         -> call vmx_intr_assist
>           -> vmx_process_softirqs
>             -> do_softirq
>               [executes the tasklet function, takes the
>                lock again]
>
> Meanwhile other CPUs might be sitting in an idle loop
> and be invoked to deliver a VIRQ_TIMER, which also ends
> up taking the lock twice: first to schedule the
> v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
> the guest's BLOCKED_state); then to execute it - which is
> accounted for in the guest's RUNTIME_state.
>
> The end result is that on an 8-socket machine with
> PCI passthrough, where four sockets are busy with interrupts
> and the other sockets have idle guests, we end up with
> the idle guests having around 40% steal time and 1/10
> of their timeslice (3ms out of 30ms) tied up
> taking the lock. The latency of the PCI interrupts delivered
> to the guest also suffers.
>
> With this patch the problem disappears completely.
> That is, it removes the lock for the PCI passthrough use-case
> (the 'hvm_dirq_assist' case).
>
> As such this patch introduces the code to set up
> softirq per-cpu tasklets and only modifies the PCI
> passthrough cases instead of doing it wholesale. This
> is done because:
>  - We want to be able to easily bisect it if things break.
>  - We modify the code one section at a time to
>    make it easier to review this core code.
>
> Now on to the code itself. The Linux code (softirq.c)
> has a per-cpu implementation of tasklets on which
> this is based. However there are differences:
>  - This patch executes one tasklet at a time - similar
>    to how the existing implementation does it.
>  - We use a doubly-linked list instead of a singly-linked
>    list. We could use a singly-linked list but folks are
>    more familiar with the 'list_*' type macros.
>  - This patch does not have the cross-CPU feeders
>    implemented. That code is in the patch
>    titled "tasklet: Add cross CPU feeding of per-cpu
>    tasklets". This is done to support
>    tasklet_schedule_on_cpu.
>  - We add a temporary 'TASKLET_SOFTIRQ_PERCPU' which
>    can co-exist with TASKLET_SOFTIRQ. It will be
>    replaced in "tasklet: Remove the old-softirq
>    implementation".
>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/arch/x86/hvm/hvm.c       |   2 +-
>  xen/common/tasklet.c         | 129 +++++++++++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/io.c |   2 +-
>  xen/include/xen/softirq.h    |   1 +
>  xen/include/xen/tasklet.h    |  61 ++++++++++++++++++--
>  5 files changed, 184 insertions(+), 11 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 94b18ba..4b4cad1 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2270,7 +2270,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>      if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
>          goto fail3;
>  
> -    softirq_tasklet_init(
> +    percpu_tasklet_init(
>          &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
>          (void(*)(unsigned long))hvm_assert_evtchn_irq,
>          (unsigned long)v);
> diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
> index 4e42fa7..319866f 100644
> --- a/xen/common/tasklet.c
> +++ b/xen/common/tasklet.c
> @@ -31,10 +31,30 @@ static DEFINE_PER_CPU(struct list_head, softirq_tasklet_list);
>  /* Protects all lists and tasklet structures. */
>  static DEFINE_SPINLOCK(tasklet_lock);
>  
> +static DEFINE_PER_CPU(struct list_head, softirq_list);
> +
>  static void tasklet_enqueue(struct tasklet *t)
>  {
>      unsigned int cpu = t->scheduled_on;
>  
> +    if ( t->is_percpu )
> +    {
> +        unsigned long flags;
> +        struct list_head *list;
> +
> +        INIT_LIST_HEAD(&t->list);
> +        BUG_ON( !t->is_softirq );
> +        BUG_ON( cpu != smp_processor_id() ); /* Not implemented yet. */
> +
> +        local_irq_save(flags);
> +
> +        list = &__get_cpu_var(softirq_list);
> +        list_add_tail(&t->list, list);
> +        raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +
> +        local_irq_restore(flags);

The raise_softirq() call can be done with interrupts re-enabled, which
reduces the critical window.

__get_cpu_var() does some inspection of the stack pointer behind your
back.  It would be far more efficient in the critical window to take
"unsigned int cpu = smp_processor_id();"  outside and use "per_cpu($FOO,
cpu)" inside.

> +        return;
> +    }
>      if ( t->is_softirq )
>      {
>          struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
> @@ -56,16 +76,25 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
>  {
>      unsigned long flags;
>  
> -    spin_lock_irqsave(&tasklet_lock, flags);
> +    if ( !tasklets_initialised || t->is_dead )
> +        return;
>  
> -    if ( tasklets_initialised && !t->is_dead )
> +    if ( t->is_percpu )
>      {
> -        t->scheduled_on = cpu;
> -        if ( !t->is_running )
> +        if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
>          {
> -            list_del(&t->list);
> +            t->scheduled_on = cpu;
>              tasklet_enqueue(t);
>          }
> +        return;
> +    }
> +    spin_lock_irqsave(&tasklet_lock, flags);
> +
> +    t->scheduled_on = cpu;
> +    if ( !t->is_running )
> +    {
> +        list_del(&t->list);
> +        tasklet_enqueue(t);
>      }
>  
>      spin_unlock_irqrestore(&tasklet_lock, flags);
> @@ -104,6 +133,66 @@ static void do_tasklet_work(unsigned int cpu, struct list_head *list)
>      }
>  }
>  
> +void do_tasklet_work_percpu(void)
> +{
> +    struct tasklet *t = NULL;
> +    struct list_head *head;
> +    bool_t poke = 0;
> +
> +    local_irq_disable();
> +    head = &__get_cpu_var(softirq_list);
> +
> +    if ( !list_empty(head) )
> +    {
> +        t = list_entry(head->next, struct tasklet, list);
> +
> +        if ( head->next == head->prev ) /* One singular item. Re-init head. */
> +            INIT_LIST_HEAD(&__get_cpu_var(softirq_list));

It would be most efficient to hoist "struct list_head *this_softirq_list
= &this_cpu(softirq_list);" outside the critical region.
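
i.e. something like (sketch only, the tail of the loop elided):

    struct list_head *this_softirq_list = &this_cpu(softirq_list);

    local_irq_disable();
    if ( !list_empty(this_softirq_list) )
    {
        t = list_entry(this_softirq_list->next, struct tasklet, list);

        if ( this_softirq_list->next == this_softirq_list->prev )
            INIT_LIST_HEAD(this_softirq_list); /* One singular item. */
        ...
    }
    local_irq_enable();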

> +        else
> +        {
> +            /* Multiple items, update head to skip 't'. */
> +            struct list_head *list;
> +
> +            /* One item past 't'. */
> +            list = head->next->next;
> +
> +            BUG_ON(list == NULL);
> +
> +            /* And update head to skip 't'. Note that t->list.prev still
> +             * points to head, but we don't care as we only process one tasklet
> +             * and once done the tasklet list is re-init one way or another.
> +             */
> +            head->next = list;
> +            poke = 1;
> +        }
> +    }
> +    local_irq_enable();
> +
> +    if ( !t )
> +        return; /* Never saw it happend, but we might have a spurious case? */
> +
> +    if ( tasklet_trylock(t) )
> +    {
> +        if ( !test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) )
> +                BUG();
> +        sync_local_execstate();
> +        t->func(t->data);
> +        tasklet_unlock(t);
> +        if ( poke )
> +            raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +        /* We could reinit the t->list but tasklet_enqueue does it for us. */
> +        return;
> +    }
> +
> +    local_irq_disable();
> +
> +    INIT_LIST_HEAD(&t->list);
> +    list_add_tail(&t->list, &__get_cpu_var(softirq_list));
> +    smp_wmb();

Is this needed? all of this infrastructure is local to the cpu.

> +    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +    local_irq_enable();
> +}
> +
>  /* VCPU context work */
>  void do_tasklet(void)
>  {
> @@ -147,10 +236,29 @@ static void tasklet_softirq_action(void)
>      spin_unlock_irq(&tasklet_lock);
>  }
>  
> +/* Per CPU softirq context work. */
> +static void tasklet_softirq_percpu_action(void)
> +{
> +    do_tasklet_work_percpu();
> +}
> +
>  void tasklet_kill(struct tasklet *t)
>  {
>      unsigned long flags;
>  
> +    if ( t->is_percpu )
> +    {
> +        while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
> +        {
> +            do {
> +                process_pending_softirqs();
> +            } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
> +        }
> +        tasklet_unlock_wait(t);
> +        clear_bit(TASKLET_STATE_SCHED, &t->state);
> +        t->is_dead = 1;
> +        return;
> +    }
>      spin_lock_irqsave(&tasklet_lock, flags);
>  
>      if ( !list_empty(&t->list) )
> @@ -208,6 +316,14 @@ void softirq_tasklet_init(
>      t->is_softirq = 1;
>  }
>  
> +void percpu_tasklet_init(
> +    struct tasklet *t, void (*func)(unsigned long), unsigned long data)
> +{
> +    tasklet_init(t, func, data);
> +    t->is_softirq = 1;
> +    t->is_percpu = 1;
> +}
> +
>  static int cpu_callback(
>      struct notifier_block *nfb, unsigned long action, void *hcpu)
>  {
> @@ -218,11 +334,13 @@ static int cpu_callback(
>      case CPU_UP_PREPARE:
>          INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
>          INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
> +        INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
>          break;
>      case CPU_UP_CANCELED:
>      case CPU_DEAD:
>          migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
>          migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
> +        migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
>          break;
>      default:
>          break;
> @@ -242,6 +360,7 @@ void __init tasklet_subsys_init(void)
>      cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
>      register_cpu_notifier(&cpu_nfb);
>      open_softirq(TASKLET_SOFTIRQ, tasklet_softirq_action);
> +    open_softirq(TASKLET_SOFTIRQ_PERCPU, tasklet_softirq_percpu_action);
>      tasklets_initialised = 1;
>  }
>  
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ef75b94..740cee5 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -114,7 +114,7 @@ int pt_irq_create_bind(
>              spin_unlock(&d->event_lock);
>              return -ENOMEM;
>          }
> -        softirq_tasklet_init(
> +        percpu_tasklet_init(
>              &hvm_irq_dpci->dirq_tasklet,
>              hvm_dirq_assist, (unsigned long)d);
>          for ( i = 0; i < NR_HVM_IRQS; i++ )
> diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
> index 0c0d481..8b15c8c 100644
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -7,6 +7,7 @@ enum {
>      SCHEDULE_SOFTIRQ,
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,
> +    TASKLET_SOFTIRQ_PERCPU,
>      TASKLET_SOFTIRQ,
>      NR_COMMON_SOFTIRQS
>  };
> diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
> index 8c3de7e..9497c47 100644
> --- a/xen/include/xen/tasklet.h
> +++ b/xen/include/xen/tasklet.h
> @@ -17,21 +17,24 @@
>  struct tasklet
>  {
>      struct list_head list;
> +    unsigned long state;
>      int scheduled_on;
>      bool_t is_softirq;
>      bool_t is_running;
>      bool_t is_dead;
> +    bool_t is_percpu;
>      void (*func)(unsigned long);
>      unsigned long data;
>  };
>  
> -#define _DECLARE_TASKLET(name, func, data, softirq)                     \
> +#define _DECLARE_TASKLET(name, func, data, softirq, percpu)             \
>      struct tasklet name = {                                             \
> -        LIST_HEAD_INIT(name.list), -1, softirq, 0, 0, func, data }
> +        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0, percpu,        \
> +        func, data }
>  #define DECLARE_TASKLET(name, func, data)               \
> -    _DECLARE_TASKLET(name, func, data, 0)
> +    _DECLARE_TASKLET(name, func, data, 0, 0)
>  #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
> -    _DECLARE_TASKLET(name, func, data, 1)
> +    _DECLARE_TASKLET(name, func, data, 1, 0)
>  
>  /* Indicates status of tasklet work on each CPU. */
>  DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
> @@ -40,6 +43,54 @@ DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
>  #define TASKLET_enqueued   (1ul << _TASKLET_enqueued)
>  #define TASKLET_scheduled  (1ul << _TASKLET_scheduled)
>  
> +/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
> + * operation allow us to have four stages in tasklet life-time.
> + *  a) 0x0: Completely empty (not scheduled nor running).
> + *  b) 0x1: Scheduled but not running. Used to guard in 'tasklet_schedule'
> + *     such that we will only schedule one. If it is scheduled and had never
> + *     run (hence never clearing STATE_SCHED bit), tasklet_kill will spin
> + *     forever on said tasklet. However 'tasklet_schedule' raises the
> + *     softirq associated with the per-cpu - so it will run, albeit there might
> + *     be a race (tasklet_kill spinning until the softirq handler runs).
> + *  c) 0x2: it is running (only on one CPU) and can be scheduled on any
> + *     CPU. The bit 0 - scheduled is cleared at this stage allowing
> + *     'tasklet_schedule' to succesfully schedule.
> + *  d) 0x3: scheduled and running - only possible if the running tasklet calls
> + *     tasklet_schedule (on same CPU) or the tasklet is scheduled from another
> + *     CPU while the tasklet is running on another CPU.
> + *
> + * The two bits play a vital role in assuring that the tasklet is scheduled
> + * once and runs only once. The steps are:
> + *
> + *  1) tasklet_schedule: STATE_SCHED bit set (0x1), added on the per cpu list.
> + *  2) tasklet_softirq_percpu_action picks one tasklet from the list. Schedules
> + *  itself later if there are more tasklets on it. Tries to set STATE_RUN bit
> + *  (0x3) - if it fails adds tasklet back to the per-cpu list. If it succeeds
> + *  clears the STATE_SCHED bit (0x2). Once tasklet completed, unsets STATE_RUN
> + *  (0x0 or 0x1 if tasklet called tasklet_schedule).
> + */
> +enum {
> +    TASKLET_STATE_SCHED, /* Bit 0 */
> +    TASKLET_STATE_RUN
> +};
> +
> +static inline int tasklet_trylock(struct tasklet *t)
> +{
> +    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);

No need for brackets around (t) for these static inlines.

> +}
> +
> +static inline void tasklet_unlock(struct tasklet *t)
> +{
> +    barrier();

clear_bit() has a memory clobber.  This barrier() is entirely redundant.

> +    clear_bit(TASKLET_STATE_RUN, &(t)->state);
> +}
> +static inline void tasklet_unlock_wait(struct tasklet *t)
> +{
> +    while (test_bit(TASKLET_STATE_RUN, &(t)->state))
> +    {
> +        barrier();

cpu_relax();

~Andrew

> +    }
> +}
>  void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu);
>  void tasklet_schedule(struct tasklet *t);
>  void do_tasklet(void);
> @@ -48,6 +99,8 @@ void tasklet_init(
>      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
>  void softirq_tasklet_init(
>      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
> +void percpu_tasklet_init(
> +    struct tasklet *t, void (*func)(unsigned long), unsigned long data);
>  void tasklet_subsys_init(void);
>  
>  #endif /* __XEN_TASKLET_H__ */

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.
  2014-08-27 18:53   ` Andrew Cooper
@ 2014-08-27 19:06     ` Konrad Rzeszutek Wilk
  2014-08-28  8:17     ` Jan Beulich
  1 sibling, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-27 19:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, keir, jbeulich

> >  static void tasklet_enqueue(struct tasklet *t)
> >  {
> >      unsigned int cpu = t->scheduled_on;
> >  
> > +    if ( t->is_percpu )
> > +    {
> > +        unsigned long flags;
> > +        struct list_head *list;
> > +
> > +        INIT_LIST_HEAD(&t->list);
> > +        BUG_ON( !t->is_softirq );
> > +        BUG_ON( cpu != smp_processor_id() ); /* Not implemented yet. */
> > +
> > +        local_irq_save(flags);
> > +
> > +        list = &__get_cpu_var(softirq_list);
> > +        list_add_tail(&t->list, list);
> > +        raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> > +
> > +        local_irq_restore(flags);
> 
> The raise_softirq() call can be done with interrupts re-enabled, which
> reduces the critical window.
> 
> __get_cpu_var() does some inspection of the stack pointer behind your
> back.  It would be far more efficient in the critical window to take

Exactly 4 lines of assembler code :-)

> "unsigned int cpu = smp_processor_id();"  outside and use "per_cpu($FOO,
> cpu)" inside.

<nods> That would indeed be better.
> 
> > +        return;
> > +    }
> >      if ( t->is_softirq )
> >      {
> >          struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
> > @@ -56,16 +76,25 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
> >  {
> >      unsigned long flags;
> >  
> > -    spin_lock_irqsave(&tasklet_lock, flags);
> > +    if ( !tasklets_initialised || t->is_dead )
> > +        return;
> >  
> > -    if ( tasklets_initialised && !t->is_dead )
> > +    if ( t->is_percpu )
> >      {
> > -        t->scheduled_on = cpu;
> > -        if ( !t->is_running )
> > +        if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
> >          {
> > -            list_del(&t->list);
> > +            t->scheduled_on = cpu;
> >              tasklet_enqueue(t);
> >          }
> > +        return;
> > +    }
> > +    spin_lock_irqsave(&tasklet_lock, flags);
> > +
> > +    t->scheduled_on = cpu;
> > +    if ( !t->is_running )
> > +    {
> > +        list_del(&t->list);
> > +        tasklet_enqueue(t);
> >      }
> >  
> >      spin_unlock_irqrestore(&tasklet_lock, flags);
> > @@ -104,6 +133,66 @@ static void do_tasklet_work(unsigned int cpu, struct list_head *list)
> >      }
> >  }
> >  
> > +void do_tasklet_work_percpu(void)
> > +{
> > +    struct tasklet *t = NULL;
> > +    struct list_head *head;
> > +    bool_t poke = 0;
> > +
> > +    local_irq_disable();
> > +    head = &__get_cpu_var(softirq_list);
> > +
> > +    if ( !list_empty(head) )
> > +    {
> > +        t = list_entry(head->next, struct tasklet, list);
> > +
> > +        if ( head->next == head->prev ) /* One singular item. Re-init head. */
> > +            INIT_LIST_HEAD(&__get_cpu_var(softirq_list));
> 
> It would be most efficient to hoist "struct list_head *this_softirq_list
> = &this_cpu(softirq_list);" outside the critical region.

<nods>
> 
> > +        else
> > +        {
> > +            /* Multiple items, update head to skip 't'. */
> > +            struct list_head *list;
> > +
> > +            /* One item past 't'. */
> > +            list = head->next->next;
> > +
> > +            BUG_ON(list == NULL);
> > +
> > +            /* And update head to skip 't'. Note that t->list.prev still
> > +             * points to head, but we don't care as we only process one tasklet
> > +             * and once done the tasklet list is re-init one way or another.
> > +             */
> > +            head->next = list;
> > +            poke = 1;
> > +        }
> > +    }
> > +    local_irq_enable();
> > +
> > +    if ( !t )
> > +        return; /* Never saw it happend, but we might have a spurious case? */
> > +
> > +    if ( tasklet_trylock(t) )
> > +    {
> > +        if ( !test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) )
> > +                BUG();
> > +        sync_local_execstate();
> > +        t->func(t->data);
> > +        tasklet_unlock(t);
> > +        if ( poke )
> > +            raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> > +        /* We could reinit the t->list but tasklet_enqueue does it for us. */
> > +        return;
> > +    }
> > +
> > +    local_irq_disable();
> > +
> > +    INIT_LIST_HEAD(&t->list);
> > +    list_add_tail(&t->list, &__get_cpu_var(softirq_list));
> > +    smp_wmb();
> 
> Is this needed? all of this infrastructure is local to the cpu.

An artifact of debugging. Though I do wonder what to do about
ARM. As I understand it, the world of ARM is a wild place
where there is a need for these barriers to exist. But maybe
I have just heard too many stories.

> 
> > +    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> > +    local_irq_enable();
> > +}
> > +
> >  /* VCPU context work */
> >  void do_tasklet(void)
> >  {
> > @@ -147,10 +236,29 @@ static void tasklet_softirq_action(void)
> >      spin_unlock_irq(&tasklet_lock);
> >  }
> >  
> > +/* Per CPU softirq context work. */
> > +static void tasklet_softirq_percpu_action(void)
> > +{
> > +    do_tasklet_work_percpu();
> > +}
> > +
> >  void tasklet_kill(struct tasklet *t)
> >  {
> >      unsigned long flags;
> >  
> > +    if ( t->is_percpu )
> > +    {
> > +        while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
> > +        {
> > +            do {
> > +                process_pending_softirqs();
> > +            } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
> > +        }
> > +        tasklet_unlock_wait(t);
> > +        clear_bit(TASKLET_STATE_SCHED, &t->state);
> > +        t->is_dead = 1;
> > +        return;
> > +    }
> >      spin_lock_irqsave(&tasklet_lock, flags);
> >  
> >      if ( !list_empty(&t->list) )
> > @@ -208,6 +316,14 @@ void softirq_tasklet_init(
> >      t->is_softirq = 1;
> >  }
> >  
> > +void percpu_tasklet_init(
> > +    struct tasklet *t, void (*func)(unsigned long), unsigned long data)
> > +{
> > +    tasklet_init(t, func, data);
> > +    t->is_softirq = 1;
> > +    t->is_percpu = 1;
> > +}
> > +
> >  static int cpu_callback(
> >      struct notifier_block *nfb, unsigned long action, void *hcpu)
> >  {
> > @@ -218,11 +334,13 @@ static int cpu_callback(
> >      case CPU_UP_PREPARE:
> >          INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
> >          INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
> > +        INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
> >          break;
> >      case CPU_UP_CANCELED:
> >      case CPU_DEAD:
> >          migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
> >          migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
> > +        migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
> >          break;
> >      default:
> >          break;
> > @@ -242,6 +360,7 @@ void __init tasklet_subsys_init(void)
> >      cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
> >      register_cpu_notifier(&cpu_nfb);
> >      open_softirq(TASKLET_SOFTIRQ, tasklet_softirq_action);
> > +    open_softirq(TASKLET_SOFTIRQ_PERCPU, tasklet_softirq_percpu_action);
> >      tasklets_initialised = 1;
> >  }
> >  
> > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> > index ef75b94..740cee5 100644
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -114,7 +114,7 @@ int pt_irq_create_bind(
> >              spin_unlock(&d->event_lock);
> >              return -ENOMEM;
> >          }
> > -        softirq_tasklet_init(
> > +        percpu_tasklet_init(
> >              &hvm_irq_dpci->dirq_tasklet,
> >              hvm_dirq_assist, (unsigned long)d);
> >          for ( i = 0; i < NR_HVM_IRQS; i++ )
> > diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
> > index 0c0d481..8b15c8c 100644
> > --- a/xen/include/xen/softirq.h
> > +++ b/xen/include/xen/softirq.h
> > @@ -7,6 +7,7 @@ enum {
> >      SCHEDULE_SOFTIRQ,
> >      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
> >      RCU_SOFTIRQ,
> > +    TASKLET_SOFTIRQ_PERCPU,
> >      TASKLET_SOFTIRQ,
> >      NR_COMMON_SOFTIRQS
> >  };
> > diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
> > index 8c3de7e..9497c47 100644
> > --- a/xen/include/xen/tasklet.h
> > +++ b/xen/include/xen/tasklet.h
> > @@ -17,21 +17,24 @@
> >  struct tasklet
> >  {
> >      struct list_head list;
> > +    unsigned long state;
> >      int scheduled_on;
> >      bool_t is_softirq;
> >      bool_t is_running;
> >      bool_t is_dead;
> > +    bool_t is_percpu;
> >      void (*func)(unsigned long);
> >      unsigned long data;
> >  };
> >  
> > -#define _DECLARE_TASKLET(name, func, data, softirq)                     \
> > +#define _DECLARE_TASKLET(name, func, data, softirq, percpu)             \
> >      struct tasklet name = {                                             \
> > -        LIST_HEAD_INIT(name.list), -1, softirq, 0, 0, func, data }
> > +        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0, percpu,        \
> > +        func, data }
> >  #define DECLARE_TASKLET(name, func, data)               \
> > -    _DECLARE_TASKLET(name, func, data, 0)
> > +    _DECLARE_TASKLET(name, func, data, 0, 0)
> >  #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
> > -    _DECLARE_TASKLET(name, func, data, 1)
> > +    _DECLARE_TASKLET(name, func, data, 1, 0)
> >  
> >  /* Indicates status of tasklet work on each CPU. */
> >  DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
> > @@ -40,6 +43,54 @@ DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
> >  #define TASKLET_enqueued   (1ul << _TASKLET_enqueued)
> >  #define TASKLET_scheduled  (1ul << _TASKLET_scheduled)
> >  
> > +/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
> > + * operation allow us to have four stages in tasklet life-time.

s/./:/

> > + *  a) 0x0: Completely empty (not scheduled nor running).
> > + *  b) 0x1: Scheduled but not running. Used to guard in 'tasklet_schedule'

s/to guard/as a guard/
> > + *     such that we will only schedule one. If it is scheduled and had never
> > + *     run (hence never clearing STATE_SCHED bit), tasklet_kill will spin
> > + *     forever on said tasklet. However 'tasklet_schedule' raises the
> > + *     softirq associated with the per-cpu - so it will run, albeit there might
> > + *     be a race (tasklet_kill spinning until the softirq handler runs).
> > + *  c) 0x2: it is running (only on one CPU) and can be scheduled on any
> > + *     CPU. The bit 0 - scheduled is cleared at this stage allowing
> > + *     'tasklet_schedule' to succesfully schedule.
> > + *  d) 0x3: scheduled and running - only possible if the running tasklet calls
> > + *     tasklet_schedule (on same CPU) or the tasklet is scheduled from another
> > + *     CPU while the tasklet is running on another CPU.
> > + *
> > + * The two bits play a vital role in assuring that the tasklet is scheduled
> > + * once and runs only once. The steps are:
> > + *
> > + *  1) tasklet_schedule: STATE_SCHED bit set (0x1), added on the per cpu list.
> > + *  2) tasklet_softirq_percpu_action picks one tasklet from the list. Schedules
> > + *  itself later if there are more tasklets on it. Tries to set STATE_RUN bit
> > + *  (0x3) - if it fails adds tasklet back to the per-cpu list. If it succeeds
> > + *  clears the STATE_SCHED bit (0x2). Once tasklet completed, unsets STATE_RUN

s/completed/completes/

> > + *  (0x0 or 0x1 if tasklet called tasklet_schedule).
> > + */
> > +enum {
> > +    TASKLET_STATE_SCHED, /* Bit 0 */
> > +    TASKLET_STATE_RUN
> > +};
> > +
> > +static inline int tasklet_trylock(struct tasklet *t)
> > +{
> > +    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
> 
> No need for brackets around (t) for these static inlines.

<nods>
> 
> > +}
> > +
> > +static inline void tasklet_unlock(struct tasklet *t)
> > +{
> > +    barrier();
> 
> clear_bit() has a memory clobber.  This barrier() is entirely redundant.
> 
> > +    clear_bit(TASKLET_STATE_RUN, &(t)->state);
> > +}
> > +static inline void tasklet_unlock_wait(struct tasklet *t)
> > +{
> > +    while (test_bit(TASKLET_STATE_RUN, &(t)->state))
> > +    {
> > +        barrier();
> 
> cpu_relax();

Ah yes!
> 
> ~Andrew

Thank you for your quick review! (Less than an hour after posting?!
Is that a record?)
> 
> > +    }
> > +}
> >  void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu);
> >  void tasklet_schedule(struct tasklet *t);
> >  void do_tasklet(void);
> > @@ -48,6 +99,8 @@ void tasklet_init(
> >      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
> >  void softirq_tasklet_init(
> >      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
> > +void percpu_tasklet_init(
> > +    struct tasklet *t, void (*func)(unsigned long), unsigned long data);
> >  void tasklet_subsys_init(void);
> >  
> >  #endif /* __XEN_TASKLET_H__ */
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.
  2014-08-27 18:53   ` Andrew Cooper
  2014-08-27 19:06     ` Konrad Rzeszutek Wilk
@ 2014-08-28  8:17     ` Jan Beulich
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2014-08-28  8:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, keir

>>> On 27.08.14 at 20:53, <andrew.cooper3@citrix.com> wrote:
> On 27/08/14 18:58, Konrad Rzeszutek Wilk wrote:
>> +static inline void tasklet_unlock(struct tasklet *t)
>> +{
>> +    barrier();
> 
> clear_bit() has a memory clobber.  This barrier() is entirely redundant.

I think this is just an implementation artifact of clear_bit(), and
hence users shouldn't rely on that behavior.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
                   ` (4 preceding siblings ...)
  2014-08-27 17:58 ` [RFC PATCH v1 5/5] tasklet: Remove the scaffolding Konrad Rzeszutek Wilk
@ 2014-08-28 12:39 ` Jan Beulich
  2014-08-29 13:46   ` Konrad Rzeszutek Wilk
  5 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-08-28 12:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 27.08.14 at 19:58, <konrad.wilk@oracle.com> wrote:
> The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
> and run. The performance bottleneck comes from the fact that
> we take the same spinlock three times: tasklet_schedule,
> when we are about to execute the tasklet, and when we are
> done executing the tasklet.

Before starting all the work here, did you investigate alternatives
to this specific use of a tasklet? E.g., it being a softirq one, making
it have its own softirq?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-08-28 12:39 ` [RFC PATCH v1] Replace tasklets with per-cpu implementation Jan Beulich
@ 2014-08-29 13:46   ` Konrad Rzeszutek Wilk
  2014-08-29 14:10     ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-29 13:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Thu, Aug 28, 2014 at 01:39:54PM +0100, Jan Beulich wrote:
> >>> On 27.08.14 at 19:58, <konrad.wilk@oracle.com> wrote:
> > The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
> > and run. The performance bottleneck comes from the fact that
> > we take the same spinlock three times: tasklet_schedule,
> > when we are about to execute the tasklet, and when we are
> > done executing the tasklet.
> 
> Before starting all the work here, did you investigate alternatives
> to this specific use of a tasklet? E.g., it being a softirq one, making
> it have its own softirq?

If I understand you right, you mean implement a tasklet API that
would only be used by hvm_do_IRQ_dpci? Its own spinlock,
list, and a separate tasklet_schedule?

I did think about it a bit at the start, but discarded it since
I figured it would be a no-go upstream - as it is a one-off and seems
hackish.

I can certainly prototype one up and see if it matches the performance
of this implementation, if you would like.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-08-29 13:46   ` Konrad Rzeszutek Wilk
@ 2014-08-29 14:10     ` Jan Beulich
  2014-09-02 20:28       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-08-29 14:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 29.08.14 at 15:46, <konrad.wilk@oracle.com> wrote:
> On Thu, Aug 28, 2014 at 01:39:54PM +0100, Jan Beulich wrote:
>> >>> On 27.08.14 at 19:58, <konrad.wilk@oracle.com> wrote:
>> > The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
>> > and run. The performance bottleneck comes from the fact that
>> > we take the same spinlock three times: tasklet_schedule,
>> > when we are about to execute the tasklet, and when we are
>> > done executing the tasklet.
>> 
>> Before starting all the work here, did you investigate alternatives
>> to this specific use of a tasklet? E.g., it being a softirq one, making
>> it have its own softirq?
> 
> If I understand you right, you mean implement a tasklet API that
> would only be used by hvm_do_IRQ_dpci? Its own spinlock,
> list, and a separate tasklet_schedule?

No, just a new softirq type, e.g. HVM_DPCI_SOFTIRQ (added to
the enum in xen/include/xen/softirq.h and all the necessary
handling put in place).
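
Roughly (sketch only; the exact name doesn't matter):

    enum {
        ...
        TASKLET_SOFTIRQ,
        HVM_DPCI_SOFTIRQ,
        NR_COMMON_SOFTIRQS
    };

plus an open_softirq(HVM_DPCI_SOFTIRQ, handler) registered at init time,
with a handler that runs hvm_dirq_assist() for whatever needs servicing,
and hvm_do_IRQ_dpci() calling raise_softirq(HVM_DPCI_SOFTIRQ) instead of
tasklet_schedule().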

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-08-29 14:10     ` Jan Beulich
@ 2014-09-02 20:28       ` Konrad Rzeszutek Wilk
  2014-09-03  8:03         ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-02 20:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Fri, Aug 29, 2014 at 03:10:19PM +0100, Jan Beulich wrote:
> >>> On 29.08.14 at 15:46, <konrad.wilk@oracle.com> wrote:
> > On Thu, Aug 28, 2014 at 01:39:54PM +0100, Jan Beulich wrote:
> >> >>> On 27.08.14 at 19:58, <konrad.wilk@oracle.com> wrote:
> >> > The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
> >> > and run. The performance bottleneck comes from the fact that
> >> > we take the same spinlock three times: tasklet_schedule,
> >> > when we are about to execute the tasklet, and when we are
> >> > done executing the tasklet.
> >> 
> >> Before starting all the work here, did you investigate alternatives
> >> to this specific use of a tasklet? E.g., it being a softirq one, making
> >> it have its own softirq?
> > 
> > If I understand you right, you mean implement a tasklet API that
> > would only be used by hvm_do_IRQ_dpci? Its own spinlock,
> > list, and a separate tasklet_schedule?
> 
> No, just a new softirq type, e.g. HVM_DPCI_SOFTIRQ (added to
> the enum in xen/include/xen/softirq.h and all the necessary
> handling put in place).

I typed this prototype up and asked folks with the right hardware to
test it. It _ought_ to work, though I think that the tasklet code
could still use an overhaul.


From deecf148e0061027c61af30882eee76a66299686 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq

PROTOTYPE.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 149 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   5 +-
 xen/include/xen/hvm/irq.h     |   3 +
 xen/include/xen/sched.h       |   3 +
 xen/include/xen/softirq.h     |   1 +
 5 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index c83af68..4c8ff3b 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -26,10 +26,105 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
+bool_t use_softirq;
+boolean_param("use_softirq", use_softirq);
+
 struct rangeset *__read_mostly mmio_ro_ranges;
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+void dpci_kill(struct domain *d)
+{
+    s_time_t s, now;
+    int i = 0;
+
+    s = NOW();
+    while ( test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        do {
+            now = NOW();
+            process_pending_softirqs();
+            if ( ((now - s) >> 30) > 5 )
+            {
+                s = now;
+                printk("%s stuck .. \n", __func__);
+                i++;
+            }
+            if ( i > 12 )
+                BUG();
+        } while ( test_bit(STATE_SCHED, &d->state) );
+    }
+    while (test_bit(STATE_RUN, &(d)->state))
+    {
+        cpu_relax();
+    }
+    clear_bit(STATE_SCHED, &d->state);
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            hvm_dirq_assist((unsigned long)d);
+            clear_bit(STATE_RUN, &(d)->state);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -119,9 +214,13 @@ int pt_irq_create_bind_vtd(
             return -ENOMEM;
         }
         memset(hvm_irq_dpci, 0, sizeof(*hvm_irq_dpci));
-        percpu_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+        if ( !use_softirq )
+            percpu_tasklet_init(
+                &hvm_irq_dpci->dirq_tasklet,
+                hvm_dirq_assist, (unsigned long)d);
+        else
+            printk("%s: Using HVM_DPCI_SOFTIRQ for d%d.\n", __func__, d->domain_id);
+
         hvm_irq_dpci->mirq = xmalloc_array(struct hvm_mirq_dpci_mapping,
                                            d->nr_pirqs);
         hvm_irq_dpci->dirq_mask = xmalloc_array(unsigned long,
@@ -398,7 +497,10 @@ int hvm_do_IRQ_dpci(struct domain *d, unsigned int mirq)
         return 0;
 
     set_bit(mirq, dpci->dirq_mask);
-    tasklet_schedule(&dpci->dirq_tasklet);
+    if ( !use_softirq )
+        tasklet_schedule(&dpci->dirq_tasklet);
+    else
+        schedule_dpci_for(d);
     return 1;
 }
 
@@ -574,11 +676,50 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+#include <xen/cpu.h>
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    printk("%s for CPU%d\n", __func__, cpu);
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        BUG(); /* To implement. */
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+    .priority = 99
+};
 
 static int __init setup_mmio_ro_ranges(void)
 {
     mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
                                   RANGESETF_prettyprint_hex);
+    printk("HVM_DPCI_SOFTIRQ is %s\n", use_softirq ? "active" : "offline");
+    if ( use_softirq )
+    {
+        int cpu;
+
+        for_each_online_cpu(cpu)
+            INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+        open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+        register_cpu_notifier(&cpu_nfb);
+    }
     return 0;
 }
 __initcall(setup_mmio_ro_ranges);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 111ac96..24900da 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -296,7 +296,10 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        if ( !use_softirq )
+            tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        else
+            dpci_kill(d);
 
         for ( i = find_first_bit(hvm_irq_dpci->mapping, d->nr_pirqs);
               i < d->nr_pirqs;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index f21b02c..340293c 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -102,6 +102,9 @@ struct hvm_irq_dpci {
     struct tasklet dirq_tasklet;
 };
 
+extern bool_t use_softirq;
+void dpci_kill(struct domain *d);
+
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c04b25d..ba9982e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -332,6 +332,9 @@ struct domain
     nodemask_t node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
+    /* For HVM_DPCI_SOFTIRQ. Locking is bit wonky. */
+    struct list_head list;
+    unsigned long state;
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c5b429c..7134727 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -9,6 +9,7 @@ enum {
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ_PERCPU,
     TASKLET_SOFTIRQ,
+    HVM_DPCI_SOFTIRQ,
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3

> 
> Jan
> 
> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-02 20:28       ` Konrad Rzeszutek Wilk
@ 2014-09-03  8:03         ` Jan Beulich
  2014-09-08 19:01           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-09-03  8:03 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 02.09.14 at 22:28, <konrad.wilk@oracle.com> wrote:
> I typed this prototype up and asked folks with the right hardware to
> test it. It _ought_ to work, though I think that the tasklet code
> could still use an overhaul.

Apart from needing general cleaning up, the main question I have
is whether the dpci_kill() part couldn't be replaced by obtaining a
proper reference on the domain when scheduling a softirq, and
dropping that reference after having handled it.
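
I.e. something like (sketch):

    /* when queueing */
    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
    {
        get_knownalive_domain(d);   /* reference held while queued */
        /* ... add d to the per-cpu list and raise the softirq ... */
    }

    /* in the softirq handler, once hvm_dirq_assist(d) has been run */
    put_domain(d);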

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-03  8:03         ` Jan Beulich
@ 2014-09-08 19:01           ` Konrad Rzeszutek Wilk
  2014-09-09  9:01             ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-08 19:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Wed, Sep 03, 2014 at 09:03:45AM +0100, Jan Beulich wrote:
> >>> On 02.09.14 at 22:28, <konrad.wilk@oracle.com> wrote:
> > I typed this prototype up and asked folks with the right hardware to
> > test it. It _ought_ to, thought I think that the tasklet code
> > still could use an overhaul.
> 
> Apart from needing general cleaning up, the main question I have
> is whether the dpci_kill() part couldn't be replaced by obtaining a
> proper reference on the domain when scheduling a softirq, and
> dropping that reference after having handled it.


The one fatal corner case is when we call 'hvm_dirq_assist' with
d->arch.hvm_domain.irq.dpci set to NULL. There is even an ASSERT in
'hvm_dirq_assist' for that.

The one edge condition I am troubled by is when we receive an
interrupt and process it - while on another CPU we get a hypercall
for 'domain_kill'.

That is:
 a) 'do_IRQ'-> .. schedule_dpci_for(d)' had been called (and so
    d->arch.hvm_domain.irq.dpci is still valid).

 b) on another CPU 'domain_kill' calls domain_relinquish_resources
   >pci_release_devices->pci_clean_dpci_irqs which then makes 
   d->arch.hvm_domain.irq.dpci NULL.

 c). the 'schedule_tail' (vmx_do_resume) right after a) is called 
   which means we call the do_softirq. The 'dpci_softirq' is called
   which calls 'hvm_dirq_assist' and blows up.

 d). the 'domain_relinquish_resources' on another CPU continues on trying
   to tear down the domain.


What we need is a form of semantic barrier in 'pci_clean_dpci_irqs' that
will enforce that all outstanding 'dpci_softirq' have been executed.

The refcnt does not enforce that - unless we:

 1). Add a new type of refcnt (and spin in pci_clean_dpci_irqs to make sure
     that the refcnt is at zero). But that would still require a 'dpci_kill'
     function - albeit with a different implementation.
 2). Add a check in 'dpci_softirq' to see if the domain is going through
     destruction, ie

	if ( d->is_dying )
		/* Skip it. */
	
 3). Leave it as the prototype had.

Here is a draft patch based on the 2) idea. Compile tested and going to
test it overnight.

From 8c5d3c6b9ec25eb0f921917f36a29d758816a44f Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq (v1)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is invoked via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

PCI interrupts are injected via hvm_do_IRQ_dpci, which
schedules the hvm_dirq_assist tasklet every time an
interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) being tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest is also hindered.

With this patch the problem disappears completely.
That is, it removes the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we raise our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct domain' structure 'hvm_dirq_assist'
needs to run on. That is simply solved by threading the
'struct domain' through a linked list. The rule of only running
one 'hvm_dirq_assist' at a time per 'struct domain' is also preserved
by having 'schedule_dpci_for' ignore any subsequent calls for
a domain which has already been scheduled.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 137 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   2 -
 xen/include/xen/hvm/irq.h     |   1 -
 xen/include/xen/sched.h       |   6 ++
 xen/include/xen/softirq.h     |   1 +
 5 files changed, 139 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..d87f7dc 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,15 +20,93 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        get_domain(d); /* To be put by 'dpci_softirq' */
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            /*
+             * Quite nasty corner case: do_IRQ is called which schedules the
+             * dpci. Right as it is ready to call 'do_softirq', on another CPU
+             * we call 'domain_kill' which calls 'pci_clean_dpci_irqs'. The
+             * d->arch.hvm_domain.irq.dpci is NULL, and hvm_dirq_assist
+             * hits the ASSERT. If we detect that we are in the shutdown
+             * situation we won't process anymore interrupts.
+             */
+            if ( d->is_dying )
+                goto skip;
+            hvm_dirq_assist((unsigned long)d);
+ skip:
+            clear_bit(STATE_RUN, &(d)->state);
+            put_domain(d);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -114,9 +192,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +535,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -631,3 +707,54 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
+{
+    struct domain *d;
+
+    while ( !list_empty(list) )
+    {
+        d = list_entry(list->next, struct domain, list);
+        list_del(&d->list);
+        schedule_dpci_for(d);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+    .priority = 99
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..59c4386 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -784,8 +784,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..a08dbeb 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..bdc607c 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,7 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+    HVM_DPCI_SOFTIRQ,
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-08 19:01           ` Konrad Rzeszutek Wilk
@ 2014-09-09  9:01             ` Jan Beulich
  2014-09-09 14:37               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-09-09  9:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 08.09.14 at 21:01, <konrad.wilk@oracle.com> wrote:
> On Wed, Sep 03, 2014 at 09:03:45AM +0100, Jan Beulich wrote:
>> >>> On 02.09.14 at 22:28, <konrad.wilk@oracle.com> wrote:
>> > I typed this prototype up and asked folks with the right hardware to
>> > test it. It _ought_ to work, though I think that the tasklet code
>> > could still use an overhaul.
>> 
>> Apart from needing general cleaning up, the main question I have
>> is whether the dpci_kill() part couldn't be replaced by obtaining a
>> proper reference on the domain when scheduling a softirq, and
>> dropping that reference after having handled it.
> 
> 
> The one fatal corner case is when we call 'hvm_dirq_assist' with
> d->arch.hvm_domain.irq.dpci set to NULL. There is even an ASSERT in
> 'hvm_dirq_assist' for that.
>
> The one edge condition I am troubled by is when we receive an
> interrupt and process it - while on another CPU we get a hypercall
> for 'domain_kill'.
> 
> That is:
>  a) 'do_IRQ'-> .. schedule_dpci_for(d)' had been called (and so
>     d->arch.hvm_domain.irq.dpci is still valid).
> 
>  b) on another CPU 'domain_kill' calls domain_relinquish_resources
>    >pci_release_devices->pci_clean_dpci_irqs which then makes 
>    d->arch.hvm_domain.irq.dpci NULL.
> 
>  c). the 'schedule_tail' (vmx_do_resume) right after a) is called 
>    which means we call the do_softirq. The 'dpci_softirq' is called
>    which calls 'hvm_dirq_assist' and blows up.
> 
>  d). the 'domain_relinquish_resources' on another CPU continues on trying
>    to tear down the domain.

I have to admit that I don't immediately see why this would be a
problem with the new softirq approach, but not with the previous
tasklet one. In any event, couldn't domain_relinquish_resources()
wait for the list of scheduled entities to become empty, while not
inserting new entries onto the list when ->is_dying?
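
I.e. (sketch):

    /* schedule_dpci_for() */
    if ( d->is_dying )
        return;                     /* no new entries once dying */

    /* pci_clean_dpci_irqs() on the domain_relinquish_resources() path */
    while ( test_bit(STATE_SCHED, &d->state) ||
            test_bit(STATE_RUN, &d->state) )
        process_pending_softirqs(); /* drain what is already queued */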

> +static void schedule_dpci_for(struct domain *d)
> +{
> +    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
> +    {
> +        unsigned long flags;
> +        struct list_head *list;
> +
> +        local_irq_save(flags);
> +        INIT_LIST_HEAD(&d->list);
> +        get_domain(d); /* To be put by 'dpci_softirq' */

get_knownalive_domain()?

> +static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)

tasklets?

> +{
> +    struct domain *d;
> +
> +    while ( !list_empty(list) )
> +    {
> +        d = list_entry(list->next, struct domain, list);
> +        list_del(&d->list);
> +        schedule_dpci_for(d);
> +    }
> +}
> +
> +static int cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned int cpu = (unsigned long)hcpu;
> +
> +    switch ( action )
> +    {
> +    case CPU_UP_PREPARE:
> +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> +        break;
> +    case CPU_UP_CANCELED:
> +    case CPU_DEAD:
> +        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));

Can CPUs go down while softirqs are pending on them?

> +        break;
> +    default:
> +        break;
> +    }
> +
> +    return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block cpu_nfb = {
> +    .notifier_call = cpu_callback,
> +    .priority = 99

If the whole notification is needed, this seemingly arbitrary number
would need a comment.

> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -8,6 +8,7 @@ enum {
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,
>      TASKLET_SOFTIRQ,
> +    HVM_DPCI_SOFTIRQ,
>      NR_COMMON_SOFTIRQS
>  };

This should of course also be conditional upon HAS_PASSTHROUGH.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-09  9:01             ` Jan Beulich
@ 2014-09-09 14:37               ` Konrad Rzeszutek Wilk
  2014-09-09 16:37                 ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-09 14:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Tue, Sep 09, 2014 at 10:01:09AM +0100, Jan Beulich wrote:
> >>> On 08.09.14 at 21:01, <konrad.wilk@oracle.com> wrote:
> > On Wed, Sep 03, 2014 at 09:03:45AM +0100, Jan Beulich wrote:
> >> >>> On 02.09.14 at 22:28, <konrad.wilk@oracle.com> wrote:
> >> > I typed this prototype up and asked folks with the right hardware to
> >> > test it. It _ought_ to work, though I think that the tasklet code
> >> > could still use an overhaul.
> >> 
> >> Apart from needing general cleaning up, the main question I have
> >> is whether the dpci_kill() part couldn't be replaced by obtaining a
> >> proper reference on the domain when scheduling a softirq, and
> >> dropping that reference after having handled it.
> > 
> > 
> > The one fatal corner case is when we call 'hvm_dirq_assist' with
> > d->arch.hvm_domain.irq.dpci set to NULL. There is even an ASSERT in
> > 'hvm_dirq_assist' for that.
> >
> > The one edge condition I am troubled by is when we receive an
> > interrupt and process it - while on another CPU we get a hypercall
> > for 'domain_kill'.
> > 
> > That is:
> >  a) 'do_IRQ'-> .. schedule_dpci_for(d)' had been called (and so
> >     d->arch.hvm_domain.irq.dpci is still valid).
> > 
> >  b) on another CPU 'domain_kill' calls domain_relinquish_resources
> >    >pci_release_devices->pci_clean_dpci_irqs which then makes 
> >    d->arch.hvm_domain.irq.dpci NULL.
> > 
> >  c). the 'schedule_tail' (vmx_do_resume) right after a) is called 
> >    which means we call the do_softirq. The 'dpci_softirq' is called
> >    which calls 'hvm_dirq_assist' and blows up.
> > 
> >  d). the 'domain_relinquish_resources' on another CPU continues on trying
> >    to tear down the domain.
> 
> I have to admit that I don't immediately see why this would be a
> problem with the new softirq approach, but not with the previous
> tasklet one. In any event, couldn't domain_relinquish_resources()


Before this patch, pci_clean_dpci_irqs used 'tasklet_kill', which
would wait for the tasklet (if it had been scheduled) to finish up.

In the patch you just reviewed (the prototype) you asked whether
we could ditch 'dpci_kill' altogether and depend on refcnts.
'dpci_kill' replaced the 'tasklet_kill' in pci_clean_dpci_irqs.

That would work, but we need some mechanism to stall the
'domain_kill' sequence to make sure that 'hvm_dirq_assist' runs
before we nuke d->arch.hvm_domain.irq.dpci. Or we fix the softirq
to skip running it if the guest is in the process of dying.

> wait for the list of scheduled entities to become empty, while not
> inserting new entries onto the list when ->is_dying?

Of course. That would require pci_clean_dpci_irqs to call
'dpci_kill', and 'dpci_kill' would wait for the list to become empty.
(That was the first idea I had in the prototype and in the code
you just reviewed.)

Going back to your original question:
you were wondering whether 'dpci_kill' could be removed outright in
favour of ref-counts. The answer is yes, but we MUST check
for d->is_dying == DOMDYING_dying in the softirq, otherwise we can blow up.

Or we can go back to what the prototype did, which had a wait mechanism.

See attached patch (which has been tested).
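
For clarity, the 'skip if dying' alternative would boil down to roughly
this inside the 'dpci_softirq' loop (a sketch only - the attached patch
takes the dpci_kill/wait route instead, and the exact placement of the
check is my assumption):

    while ( !list_empty(&our_list) )
    {
        d = list_entry(our_list.next, struct domain, list);
        list_del(&d->list);

        /* Sketch: bail out early for a dying domain, as
         * d->arch.hvm_domain.irq.dpci may already have been torn
         * down by pci_clean_dpci_irqs on another CPU. */
        if ( d->is_dying == DOMDYING_dying )
        {
            clear_bit(STATE_SCHED, &d->state);
            put_domain(d); /* Drop the ref taken when scheduling. */
            continue;
        }

        /* ... rest as in the attached patch ... */
    }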
> 
> > +static void schedule_dpci_for(struct domain *d)
> > +{
> > +    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
> > +    {
> > +        unsigned long flags;
> > +        struct list_head *list;
> > +
> > +        local_irq_save(flags);
> > +        INIT_LIST_HEAD(&d->list);
> > +        get_domain(d); /* To be put by 'dpci_softirq' */
> 
> get_knownalive_domain()?

<nods>
> 
> > +static void migrate_tasklets_from_cpu(unsigned int cpu, struct list_head *list)
> 
> tasklets?

<blushes>
> 
> > +{
> > +    struct domain *d;
> > +
> > +    while ( !list_empty(list) )
> > +    {
> > +        d = list_entry(list->next, struct domain, list);
> > +        list_del(&d->list);
> > +        schedule_dpci_for(d);
> > +    }
> > +}
> > +
> > +static int cpu_callback(
> > +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> > +{
> > +    unsigned int cpu = (unsigned long)hcpu;
> > +
> > +    switch ( action )
> > +    {
> > +    case CPU_UP_PREPARE:
> > +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> > +        break;
> > +    case CPU_UP_CANCELED:
> > +    case CPU_DEAD:
> > +        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
> 
> Can CPUs go down while softirqs are pending on them?

No. By the time we get here, the CPU is no longer "hearing" them.
> 
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +
> > +    return NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block cpu_nfb = {
> > +    .notifier_call = cpu_callback,
> > +    .priority = 99
> 
> If the whole notification is needed, this seemingly arbitrary number
> would need a comment.

I copied it from the tasklet code, which didn't have a comment
either. I will remove the priority and use the default.
> 
> > --- a/xen/include/xen/softirq.h
> > +++ b/xen/include/xen/softirq.h
> > @@ -8,6 +8,7 @@ enum {
> >      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
> >      RCU_SOFTIRQ,
> >      TASKLET_SOFTIRQ,
> > +    HVM_DPCI_SOFTIRQ,
> >      NR_COMMON_SOFTIRQS
> >  };
> 
> This should of course also be conditional upon HAS_PASSTHROUGH.

Aye!

Please see the attached patch, which I hope follows what you asked for.
Compiled and tested with PCI passthrough:


>From f8705ed031d500ce190916969ae63b23bb487cf2 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq (v2)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest also suffers.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct domain' structure 'hvm_dirq_assist'
needs to run on. That is simply solved by threading the
'struct domain' through a linked list. The rule of running only
one 'hvm_dirq_assist' per 'struct domain' at a time is also preserved
by having 'schedule_dpci_for' ignore any subsequent calls for
a domain which has already been scheduled.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
---
 xen/drivers/passthrough/io.c  | 168 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |   2 +-
 xen/include/xen/hvm/irq.h     |   2 +-
 xen/include/xen/sched.h       |   6 ++
 xen/include/xen/softirq.h     |   3 +
 5 files changed, 174 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..d52db0b 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,15 +20,125 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in dpci life-time.
+ *   a) 0x0: Completely empty (not scheduled nor running).
+ *   b) 0x1: Scheduled but not running. Used to guard in 'schedule_dpci_for'
+ *      such that we will only schedule one. If it is scheduled and had never
+ *      run (hence never clearing STATE_SCHED bit), dpci_kill will spin
+ *      forever on said domain. However 'schedule_dpci_for' raises the
+ *      softirq  so it will run, albeit there might be a race (dpci_kill
+ *      spinning until the softirq handler runs).
+ *   c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ *      CPU. The bit 0 - scheduled is cleared at this stage allowing
+ *      'schedule_dpci_for' to successfully schedule.
+ *   d) 0x3: scheduled and running - only possible if the running hvm_dirq_assist
+ *      calls 'schedule_dpci_for'. It does not do that, but it could in the
+ *      future.
+ *
+ *  The two bits play a vital role in assuring that the 'hvm_dirq_assist' is
+ *  scheduled once and runs only once for a domain. The steps are:
+ *
+ *   1) schedule_dpci_for: STATE_SCHED bit set (0x1), added on the per cpu list.
+ *   2) dpci_softirq picks one domain from the list. Schedules
+ *   itself later if there are more domain's on it. Tries to set STATE_RUN bit
+ *   (0x3) - if it fails adds domain back to the per-cpu list. If it succeeds
+ *   clears the STATE_SCHED bit (0x2). Once hvm_dirq_assist for an domain
+ *   completes, it unsets STATE_RUN (0x0 or 0x1 if 'schedule_dpci_for' is called
+ *   from 'hvm_dirq_assist').
+ */
+enum {
+    STATE_SCHED, /* Bit 0 */
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        get_knownalive_domain(d); /* To be put by 'dpci_softirq' */
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+void dpci_kill(struct domain *d)
+{
+    while ( test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        do {
+            process_pending_softirqs();
+        } while ( test_bit(STATE_SCHED, &d->state) );
+    }
+
+    while ( test_bit(STATE_RUN, &d->state) )
+    {
+        cpu_relax();
+    }
+    clear_bit(STATE_SCHED, &d->state);
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            hvm_dirq_assist((unsigned long)d);
+            clear_bit(STATE_RUN, &(d)->state);
+            put_domain(d);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -114,9 +224,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +567,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -631,3 +739,53 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void migrate_domains_from_cpu(unsigned int cpu, struct list_head *list)
+{
+    struct domain *d;
+
+    while ( !list_empty(list) )
+    {
+        d = list_entry(list->next, struct domain, list);
+        list_del(&d->list);
+        schedule_dpci_for(d);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        migrate_domains_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..71d0fbf 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -784,7 +784,7 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        dpci_kill(d);
 
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..5dff010 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -109,6 +108,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+void dpci_kill(struct domain *d);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..64ae25f 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+    HVM_DPCI_SOFTIRQ,
+#endif
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-09 14:37               ` Konrad Rzeszutek Wilk
@ 2014-09-09 16:37                 ` Jan Beulich
  2014-09-10 16:03                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-09-09 16:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 09.09.14 at 16:37, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 09, 2014 at 10:01:09AM +0100, Jan Beulich wrote:
>> >>> On 08.09.14 at 21:01, <konrad.wilk@oracle.com> wrote:
>> > +static int cpu_callback(
>> > +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>> > +{
>> > +    unsigned int cpu = (unsigned long)hcpu;
>> > +
>> > +    switch ( action )
>> > +    {
>> > +    case CPU_UP_PREPARE:
>> > +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
>> > +        break;
>> > +    case CPU_UP_CANCELED:
>> > +    case CPU_DEAD:
>> > +        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
>> 
>> Can CPUs go down while softirqs are pending on them?
> 
> No. By the time we get here, the CPU is no longer "hearing" them.

So what's that code (also still present in the newer patch you
had attached here) for then?

> +void dpci_kill(struct domain *d)
> +{
> +    while ( test_and_set_bit(STATE_SCHED, &d->state) )
> +    {
> +        do {
> +            process_pending_softirqs();
> +        } while ( test_bit(STATE_SCHED, &d->state) );
> +    }
> +
> +    while ( test_bit(STATE_RUN, &d->state) )
> +    {
> +        cpu_relax();
> +    }
> +    clear_bit(STATE_SCHED, &d->state);

Does all this perhaps need preemption handling? The caller
(pci_release_devices()) is direct descendant from
domain_relinquish_resources(), so even bubbling -EAGAIN or
-ERESTART back up instead of spinning would seem like an
option.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-09 16:37                 ` Jan Beulich
@ 2014-09-10 16:03                   ` Konrad Rzeszutek Wilk
  2014-09-10 16:25                     ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-10 16:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Tue, Sep 09, 2014 at 05:37:41PM +0100, Jan Beulich wrote:
> >>> On 09.09.14 at 16:37, <konrad.wilk@oracle.com> wrote:
> > On Tue, Sep 09, 2014 at 10:01:09AM +0100, Jan Beulich wrote:
> >> >>> On 08.09.14 at 21:01, <konrad.wilk@oracle.com> wrote:
> >> > +static int cpu_callback(
> >> > +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> >> > +{
> >> > +    unsigned int cpu = (unsigned long)hcpu;
> >> > +
> >> > +    switch ( action )
> >> > +    {
> >> > +    case CPU_UP_PREPARE:
> >> > +        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
> >> > +        break;
> >> > +    case CPU_UP_CANCELED:
> >> > +    case CPU_DEAD:
> >> > +        migrate_tasklets_from_cpu(cpu, (&per_cpu(dpci_list, cpu)));
> >> 
> >> Can CPUs go down while softirqs are pending on them?
> > 
> > No. By the time we get here, the CPU is no longer "hearing" them.
> 
> So what's that code (also still present in the newer patch you
> had attached here) for then?

I should have been more specific. This callback is called during
CPU shutdown with CPU_DYING. At that point the softirq can
still be active. After the CPU_DYING callbacks have completed, the
CPU becomes "deaf" and has no more softirqs pending.

As such we can safely remove the 'migrate_..' part for
this specific case. I will add a verbose comment explaining why
it is safe to ignore the CPU_DEAD case, and add an ASSERT.
> 
> > +void dpci_kill(struct domain *d)
> > +{
> > +    while ( test_and_set_bit(STATE_SCHED, &d->state) )
> > +    {
> > +        do {
> > +            process_pending_softirqs();
> > +        } while ( test_bit(STATE_SCHED, &d->state) );
> > +    }
> > +
> > +    while ( test_bit(STATE_RUN, &d->state) )
> > +    {
> > +        cpu_relax();
> > +    }
> > +    clear_bit(STATE_SCHED, &d->state);
> 
> Does all this perhaps need preemption handling? The caller
> (pci_release_devices()) is direct descendant from
> domain_relinquish_resources(), so even bubbling -EAGAIN or
> -ERESTART back up instead of spinning would seem like an
> option.

Excellent idea!

I am attaching the patch here with your suggestions, and if you
would like a break from reading my code - by all means - please
feel free to ignore it and I can repost it in a week or two.

>From bb5879c24b2f4140fb6b0c5b35e299b139b92365 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 29 Aug 2014 13:40:09 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq (v3)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER in the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest also suffers.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct domain' structure 'hvm_dirq_assist'
needs to run on. That is simply solved by threading the
'struct domain' through a linked list. The rule of running only
one 'hvm_dirq_assist' per 'struct domain' at a time is also preserved
by having 'schedule_dpci_for' ignore any subsequent calls for
a domain which has already been scheduled.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 161 ++++++++++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/pci.c |  24 +++++--
 xen/include/xen/hvm/irq.h     |   2 +-
 xen/include/xen/pci.h         |   2 +-
 xen/include/xen/sched.h       |   6 ++
 xen/include/xen/softirq.h     |   3 +
 7 files changed, 188 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index f7e0e78..f14c611 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1921,7 +1921,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ef75b94..f19d56e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,15 +20,123 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
 static void hvm_dirq_assist(unsigned long _d);
 
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in dpci life-time.
+ *   a) 0x0: Completely empty (not scheduled nor running).
+ *   b) 0x1: Scheduled but not running. Used to guard in 'schedule_dpci_for'
+ *      such that we will only schedule one. If it is scheduled and had never
+ *      run (hence never clearing STATE_SCHED bit), dpci_kill will spin
+ *      forever on said domain. However 'schedule_dpci_for' raises the
+ *      softirq  so it will run, albeit there might be a race (dpci_kill
+ *      spinning until the softirq handler runs).
+ *   c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ *      CPU. The bit 0 - scheduled is cleared at this stage allowing
+ *      'schedule_dpci_for' to successfully schedule.
+ *   d) 0x3: scheduled and running - only possible if the running hvm_dirq_assist
+ *      calls 'schedule_dpci_for'. It does not do that, but it could in the
+ *      future.
+ *
+ *  The two bits play a vital role in assuring that the 'hvm_dirq_assist' is
+ *  scheduled once and runs only once for a domain. The steps are:
+ *
+ *   1) schedule_dpci_for: STATE_SCHED bit set (0x1), added on the per cpu list.
+ *   2) dpci_softirq picks one domain from the list. Schedules
+ *   itself later if there are more domain's on it. Tries to set STATE_RUN bit
+ *   (0x3) - if it fails adds domain back to the per-cpu list. If it succeeds
+ *   clears the STATE_SCHED bit (0x2). Once hvm_dirq_assist for an domain
+ *   completes, it unsets STATE_RUN (0x0 or 0x1 if 'schedule_dpci_for' is called
+ *   from 'hvm_dirq_assist').
+ */
+enum {
+    STATE_SCHED, /* Bit 0 */
+    STATE_RUN
+};
+
+static void schedule_dpci_for(struct domain *d)
+{
+    if ( !test_and_set_bit(STATE_SCHED, &d->state) )
+    {
+        unsigned long flags;
+        struct list_head *list;
+
+        local_irq_save(flags);
+        INIT_LIST_HEAD(&d->list);
+        get_knownalive_domain(d); /* To be put by 'dpci_softirq' */
+        list = &__get_cpu_var(dpci_list);
+        list_add_tail(&d->list, list);
+
+        local_irq_restore(flags);
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
+int dpci_kill(struct domain *d)
+{
+    if ( test_and_set_bit(STATE_SCHED, &d->state) )
+        return -EAGAIN;
+
+    if ( test_bit(STATE_RUN, &d->state) )
+        return -EAGAIN;
+
+    clear_bit(STATE_SCHED, &d->state);
+
+    return 0;
+}
+
+static void dpci_softirq(void)
+{
+
+    struct domain *d;
+    struct list_head *list;
+    struct list_head our_list;
+
+    local_irq_disable();
+    list = &__get_cpu_var(dpci_list);
+
+    INIT_LIST_HEAD(&our_list);
+    list_splice(list, &our_list);
+
+    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
+
+    local_irq_enable();
+
+    while (!list_empty(&our_list))
+    {
+        d = list_entry(our_list.next, struct domain, list);
+        list_del(&d->list);
+
+        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
+        {
+            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
+                BUG();
+            hvm_dirq_assist((unsigned long)d);
+            clear_bit(STATE_RUN, &(d)->state);
+            put_domain(d);
+            continue;
+        }
+
+        local_irq_disable();
+
+        INIT_LIST_HEAD(&d->list);
+        list_add_tail(&d->list, &__get_cpu_var(dpci_list));
+        local_irq_enable();
+
+        raise_softirq(HVM_DPCI_SOFTIRQ);
+    }
+}
+
 bool_t pt_irq_need_timer(uint32_t flags)
 {
     return !(flags & (HVM_IRQ_DPCI_GUEST_MSI | HVM_IRQ_DPCI_TRANSLATE));
@@ -114,9 +222,7 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
+
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -459,7 +565,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    schedule_dpci_for(d);
     return 1;
 }
 
@@ -631,3 +737,48 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle, so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..655c97c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -770,37 +770,47 @@ static int pci_clean_dpci_irq(struct domain *d,
     return 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return -ESRCH;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return -EINVAL;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+        int ret = dpci_kill(d);
 
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret && ret == -EAGAIN )
+        return ret;
+
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +821,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..ccd2966 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -109,6 +108,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+int dpci_kill(struct domain *d);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c5157e6..7984263 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -456,6 +456,12 @@ struct domain
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
+
+#ifdef HAS_PASSTHROUGH
+    /* For HVM_DPCI_SOFTIRQ. We use refcnt when scheduled to run. */
+    struct list_head list;
+    unsigned long state;
+#endif
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0c0d481..64ae25f 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
     TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+    HVM_DPCI_SOFTIRQ,
+#endif
     NR_COMMON_SOFTIRQS
 };
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-10 16:03                   ` Konrad Rzeszutek Wilk
@ 2014-09-10 16:25                     ` Jan Beulich
  2014-09-10 16:35                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2014-09-10 16:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir

>>> On 10.09.14 at 18:03, <konrad.wilk@oracle.com> wrote:
> +static void dpci_softirq(void)
> +{
> +
> +    struct domain *d;
> +    struct list_head *list;
> +    struct list_head our_list;
> +
> +    local_irq_disable();
> +    list = &__get_cpu_var(dpci_list);
> +
> +    INIT_LIST_HEAD(&our_list);
> +    list_splice(list, &our_list);
> +
> +    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
> +
> +    local_irq_enable();
> +
> +    while (!list_empty(&our_list))
> +    {
> +        d = list_entry(our_list.next, struct domain, list);
> +        list_del(&d->list);
> +
> +        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
> +        {
> +            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
> +                BUG();
> +            hvm_dirq_assist((unsigned long)d);

You surely want to change the parameter type of that function to
no longer need such a cast.

Also the parentheses around d (above and below) as well as the
formatting of the while() above will need cleaning up. And the
__get_cpu_var()s want replacing with this_cpu() or per_cpu().

But apart from that it all looks quite good now. Does it also do
what we hope it would?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-09-10 16:25                     ` Jan Beulich
@ 2014-09-10 16:35                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-10 16:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, keir

On Wed, Sep 10, 2014 at 05:25:33PM +0100, Jan Beulich wrote:
> >>> On 10.09.14 at 18:03, <konrad.wilk@oracle.com> wrote:
> > +static void dpci_softirq(void)
> > +{
> > +
> > +    struct domain *d;
> > +    struct list_head *list;
> > +    struct list_head our_list;
> > +
> > +    local_irq_disable();
> > +    list = &__get_cpu_var(dpci_list);
> > +
> > +    INIT_LIST_HEAD(&our_list);
> > +    list_splice(list, &our_list);
> > +
> > +    INIT_LIST_HEAD(&__get_cpu_var(dpci_list));
> > +
> > +    local_irq_enable();
> > +
> > +    while (!list_empty(&our_list))
> > +    {
> > +        d = list_entry(our_list.next, struct domain, list);
> > +        list_del(&d->list);
> > +
> > +        if ( !test_and_set_bit(STATE_RUN, &(d)->state) )
> > +        {
> > +            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
> > +                BUG();
> > +            hvm_dirq_assist((unsigned long)d);
> 
> You surely want to change the parameter type of that function to
> no longer need such a cast.

Duh!
> 
> Also the parentheses around d (above and below) as well as the
> formatting of the while() above will need cleaning up. And the
> __get_cpu_var()s want replacing with this_cpu() or per_cpu().

Right.
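
Something along these lines, I think - just a sketch of the cleaned-up
loop (not the final version), assuming 'hvm_dirq_assist' is changed to
take a 'struct domain *' directly:

    while ( !list_empty(&our_list) )
    {
        d = list_entry(our_list.next, struct domain, list);
        list_del(&d->list);

        if ( !test_and_set_bit(STATE_RUN, &d->state) )
        {
            if ( !test_and_clear_bit(STATE_SCHED, &d->state) )
                BUG();
            hvm_dirq_assist(d);           /* Now takes struct domain *. */
            clear_bit(STATE_RUN, &d->state);
            put_domain(d);
            continue;
        }

        /* Could not grab STATE_RUN - queue it back and retry later. */
        local_irq_disable();
        list_add_tail(&d->list, &this_cpu(dpci_list));
        local_irq_enable();

        raise_softirq(HVM_DPCI_SOFTIRQ);
    }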
> 
> But apart from that it all looks quite good now. Does it also do
> what we hope it would?

Oh yes! The test results show that this patch and the original
("[RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.",
 see http://lists.xen.org/archives/html/xen-devel/2014-08/msg02662.html)

have the same performance effect - and make a huge improvement with
PCI passthrough.

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
  2014-08-29 20:58 Arianna Avanzini
@ 2014-09-02 20:10 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-02 20:10 UTC (permalink / raw)
  To: Arianna Avanzini; +Cc: xen-devel, Keir Fraser, Jan Beulich

On Fri, Aug 29, 2014 at 10:58:06PM +0200, Arianna Avanzini wrote:
> > Hey,
> >
> > With the Xen 4.5 feature freeze being right on the doorsteps I am not
> > expecting this to go in as:
> >  1) It touches core code,
> >  2) It has never been tested on ARM,
> 
> Sorry if I intrude - for what it's worth, the patchset works on my setup. I am
> running Xen from the development repository, plus this patchset, with a Linux
> 3.15 dom0 (linux-sunxi) on a cubieboard2.

Woot! Thank you!
> 
> 
> >  3) It is RFC for right now.
> >
> > With those expectations out of the way, I am submitting for review
> > an over-haul of the tasklet code. We had found one large machines
> > with a small size of guests (12) that the steal time for idle
> > guests was excessively high. Further debugging revealed that the
> > global tasklet lock was taken across all sockets at an excessively
> > high rate. To the point that 1/10th of a guest idle time was
> > taken (and actually accounted for as in RUNNING state!).
> >
> > The ideal situation to reproduce this behavior is:
> >  1). Allocate a twelve guests with one to four SR-IOV VFs.
> >  2). Have half of them (six) heavily use the SR-IOV VFs devices.
> >  3). Monitor the rest (which are in idle) and despair.
> >
> > As I discovered under the hood, we have two tasklets that are
> > scheduled and executed quite often - the VIRQ_TIMER one:
> > aassert_evtchn_irq_taskle, and the one in charge of injecting
> > an PCI interrupt in the guest: hvm_do_IRQ_dpci.
> >
> > The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
> > and run. The performance bottleneck comes from the fact that
> > we take the same spinlock three times: tasklet_schedule,
> > when we are about to execute the tasklet, and when we are
> > done executing the tasklet.
> >
> > This patchset throws away the global list and lock for all
> > tasklets. Instead there are two per-cpu lists: one for
> > softirq, and one run when scheduler decides it. There is also
> > an global list and lock when we have cross-CPU tasklet scheduling
> > - which thankfully rarely happens (microcode update and
> > hypercall continuation).
> >
> > The insertion and removal from the list is done by disabling
> > interrupts - which are short bursts of time. The behavior
> > of the code to only execute one tasklet per iteration is
> > also preserved (the Linux code would run through all
> > of its tasklets).
> >
> > The performance benefit of this patch were astounding and
> > removed the issues we saw. It also decreased the IRQ
> > latency of delievering an interrupt to a guest.
> >
> > In terms of the patchset I choose an path in which:
> >  0) The first patch fixes the performance bug we saw and it
> >     was easy to backport.
> >  1) It is bisectable.
> >  2) If something breaks it should be fairly easy to figure
> >     out which patch broke it.
> >  3) It is spit up in a bit weird fashion with scaffolding code
> >     was added to keep it ticking (as at some point we have
> >     the old and the new implementation existing and used).
> >     And then later on removed. This is how Greg KH added
> >     kref and kobjects long time ago in the kernel and it had
> >     worked - so I figured I would borrow from this workflow.
> >
> > I would appreciate feedback from the maintainers if they
> > would like this to be organized better.
> >
> >  xen/common/tasklet.c      | 305 +++++++++++++++++++++++++++++++++-------------
> >  xen/include/xen/tasklet.h |  52 +++++++-
> >  2 files changed, 271 insertions(+), 86 deletions(-)
> >
> > Konrad Rzeszutek Wilk (5):
> >       tasklet: Introduce per-cpu tasklet for softirq.
> >       tasklet: Add cross CPU feeding of per-cpu tasklets.
> >       tasklet: Remove the old-softirq implementation.
> >       tasklet: Introduce per-cpu tasklet for schedule tasklet.
> >       tasklet: Remove the scaffolding.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v1] Replace tasklets with per-cpu implementation.
@ 2014-08-29 20:58 Arianna Avanzini
  2014-09-02 20:10 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Arianna Avanzini @ 2014-08-29 20:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Keir Fraser, Jan Beulich

> Hey,
>
> With the Xen 4.5 feature freeze being right on the doorsteps I am not
> expecting this to go in as:
>  1) It touches core code,
>  2) It has never been tested on ARM,

Sorry if I intrude - for what it's worth, the patchset works on my setup. I am
running Xen from the development repository, plus this patchset, with a Linux
3.15 dom0 (linux-sunxi) on a cubieboard2.


>  3) It is RFC for right now.
>
> With those expectations out of the way, I am submitting for review
> an over-haul of the tasklet code. We had found one large machines
> with a small size of guests (12) that the steal time for idle
> guests was excessively high. Further debugging revealed that the
> global tasklet lock was taken across all sockets at an excessively
> high rate. To the point that 1/10th of a guest idle time was
> taken (and actually accounted for as in RUNNING state!).
>
> The ideal situation to reproduce this behavior is:
>  1). Allocate a twelve guests with one to four SR-IOV VFs.
>  2). Have half of them (six) heavily use the SR-IOV VFs devices.
>  3). Monitor the rest (which are in idle) and despair.
>
> As I discovered under the hood, we have two tasklets that are
> scheduled and executed quite often - the VIRQ_TIMER one:
> aassert_evtchn_irq_taskle, and the one in charge of injecting
> an PCI interrupt in the guest: hvm_do_IRQ_dpci.
>
> The 'hvm_do_IRQ_dpci' is the on that is most often scheduled
> and run. The performance bottleneck comes from the fact that
> we take the same spinlock three times: tasklet_schedule,
> when we are about to execute the tasklet, and when we are
> done executing the tasklet.
>
> This patchset throws away the global list and lock for all
> tasklets. Instead there are two per-cpu lists: one for
> softirq, and one run when scheduler decides it. There is also
> an global list and lock when we have cross-CPU tasklet scheduling
> - which thankfully rarely happens (microcode update and
> hypercall continuation).
>
> The insertion and removal from the list is done by disabling
> interrupts - which are short bursts of time. The behavior
> of the code to only execute one tasklet per iteration is
> also preserved (the Linux code would run through all
> of its tasklets).
>
> The performance benefit of this patch were astounding and
> removed the issues we saw. It also decreased the IRQ
> latency of delievering an interrupt to a guest.
>
> In terms of the patchset I choose an path in which:
>  0) The first patch fixes the performance bug we saw and it
>     was easy to backport.
>  1) It is bisectable.
>  2) If something breaks it should be fairly easy to figure
>     out which patch broke it.
>  3) It is spit up in a bit weird fashion with scaffolding code
>     was added to keep it ticking (as at some point we have
>     the old and the new implementation existing and used).
>     And then later on removed. This is how Greg KH added
>     kref and kobjects long time ago in the kernel and it had
>     worked - so I figured I would borrow from this workflow.
>
> I would appreciate feedback from the maintainers if they
> would like this to be organized better.
>
>  xen/common/tasklet.c      | 305 +++++++++++++++++++++++++++++++++-------------
>  xen/include/xen/tasklet.h |  52 +++++++-
>  2 files changed, 271 insertions(+), 86 deletions(-)
>
> Konrad Rzeszutek Wilk (5):
>       tasklet: Introduce per-cpu tasklet for softirq.
>       tasklet: Add cross CPU feeding of per-cpu tasklets.
>       tasklet: Remove the old-softirq implementation.
>       tasklet: Introduce per-cpu tasklet for schedule tasklet.
>       tasklet: Remove the scaffolding.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-09-10 16:35 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-27 17:58 [RFC PATCH v1] Replace tasklets with per-cpu implementation Konrad Rzeszutek Wilk
2014-08-27 17:58 ` [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq Konrad Rzeszutek Wilk
2014-08-27 18:53   ` Andrew Cooper
2014-08-27 19:06     ` Konrad Rzeszutek Wilk
2014-08-28  8:17     ` Jan Beulich
2014-08-27 17:58 ` [RFC PATCH v1 2/5] tasklet: Add cross CPU feeding of per-cpu tasklets Konrad Rzeszutek Wilk
2014-08-27 17:58 ` [RFC PATCH v1 3/5] tasklet: Remove the old-softirq implementation Konrad Rzeszutek Wilk
2014-08-27 17:58 ` [RFC PATCH v1 4/5] tasklet: Introduce per-cpu tasklet for schedule tasklet Konrad Rzeszutek Wilk
2014-08-27 17:58 ` [RFC PATCH v1 5/5] tasklet: Remove the scaffolding Konrad Rzeszutek Wilk
2014-08-28 12:39 ` [RFC PATCH v1] Replace tasklets with per-cpu implementation Jan Beulich
2014-08-29 13:46   ` Konrad Rzeszutek Wilk
2014-08-29 14:10     ` Jan Beulich
2014-09-02 20:28       ` Konrad Rzeszutek Wilk
2014-09-03  8:03         ` Jan Beulich
2014-09-08 19:01           ` Konrad Rzeszutek Wilk
2014-09-09  9:01             ` Jan Beulich
2014-09-09 14:37               ` Konrad Rzeszutek Wilk
2014-09-09 16:37                 ` Jan Beulich
2014-09-10 16:03                   ` Konrad Rzeszutek Wilk
2014-09-10 16:25                     ` Jan Beulich
2014-09-10 16:35                       ` Konrad Rzeszutek Wilk
2014-08-29 20:58 Arianna Avanzini
2014-09-02 20:10 ` Konrad Rzeszutek Wilk
