* [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long
@ 2017-05-11  6:04 Chao Gao
  2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jan Beulich,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Jun Nakajima, Chao Gao

VT-d PI introduces a per-pCPU blocking list to track the vCPUs blocked
on each pCPU. Theoretically, there can be 32K domains on a single host
with 128 vCPUs each. If all of those vCPUs are blocked on the same pCPU,
4M vCPUs end up in the same list, and traversing this list consumes too
much time. More discussion can be found in [1,2,3].

To mitigate this issue, this series implements the following two methods:
1. Put vCPUs on another pCPU's list when the local pCPU's list length
reaches an upper bound. The upper bound is determined by the total vCPU
count and the total pCPU count in the system.
2. Don't put blocked vCPUs which won't be woken by the wakeup interrupt
into the list.
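
As a rough illustration of method 1 (taking K = 128 from patch 3/5 and
assuming, say, 128 pCPUs; these are worst-case numbers, not measurements):
with v_tot = 4M vCPUs and p_tot = 128 pCPUs, the per-pCPU upper bound
works out to

    limit = v_tot / p_tot + K = 4194304 / 128 + 128 = 32768 + 128 = 32896

so even in the worst case no single list comes close to holding all 4M
vCPUs.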

Patch 1/5 adds a trace event for adding an entry to the PI blocking
list. With this patch, data can be gathered to help validate the
following patches.

Patch 2/5 uses a global variable to track how many HVM vCPUs exist on
the system. It is used to calculate the limit on the number of blocked
vCPUs on a given pCPU.

Patch 3/5 applies a policy to restrict the vCPU count on a given pCPU's
PI blocking list, so that the list cannot grow arbitrarily long.

Patch 4/5 adds a refcount to the vCPU's pi_desc. Each time the pi_desc
is recorded in an IRTE the refcount increases by 1, and each time it is
cleared from an IRTE the refcount decreases by 1.

Patch 5/5 adds a vCPU to the PI blocking list only if its pi_desc is
referenced by at least one IRTE.
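
Taken together, the admission logic in vmx_vcpu_block() conceptually
becomes something like the sketch below (illustrative only; the names
follow the patches, except add_to_blocking_list(), which stands in for
the real list manipulation):

    static void pi_block_sketch(struct vcpu *v)
    {
        unsigned int cpu;

        /* Patch 5/5: skip vCPUs whose pi_desc no IRTE refers to. */
        if ( !vmx_pi_in_use(v) )
            return;

        /* Patch 3/5: pick a pCPU whose list length is below the limit. */
        cpu = vmx_pi_choose_dest_cpu(v);

        /* Patch 1/5: this addition is traced and counted. */
        add_to_blocking_list(v, cpu);
    }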

[1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
[2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567
[3] https://lists.gt.net/xen/devel/472749?search_string=enable%20vt-d%20pi%20by%20default;#472749

Chao Gao (5):
  xentrace: add TRC_HVM_PI_LIST_ADD
  vcpu: track hvm vcpu number on the system
  VT-d PI: restrict the vcpu number on a given pcpu
  VT-d PI: Adding reference count to pi_desc
  VT-d PI: Don't add vCPU to PI blocking list for a case

 tools/xentrace/formats                 |   1 +
 xen/arch/x86/hvm/vmx/vmx.c             | 133 ++++++++++++++++++++++++++++++---
 xen/common/domain.c                    |   8 ++
 xen/drivers/passthrough/io.c           |   2 +-
 xen/drivers/passthrough/vtd/intremap.c |  59 ++++++++++++++-
 xen/include/asm-x86/hvm/domain.h       |   6 ++
 xen/include/asm-x86/hvm/trace.h        |   1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h     |   3 +
 xen/include/asm-x86/iommu.h            |   2 +-
 xen/include/asm-x86/msi.h              |   2 +-
 xen/include/public/trace.h             |   1 +
 xen/include/xen/sched.h                |   2 +
 12 files changed, 203 insertions(+), 17 deletions(-)

-- 
1.8.3.1



* [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD
  2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
@ 2017-05-11  6:04 ` Chao Gao
  2017-05-15  1:33   ` Tian, Kevin
  2017-05-15 15:14   ` George Dunlap
  2017-05-11  6:04 ` [PATCH v2 2/5] vcpu: track hvm vcpu number on the system Chao Gao
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

This patch adds TRC_HVM_PI_LIST_ADD to track adding one entry to
the per-pcpu blocking list. Also introduce a 'counter' to track
the number of entries in the list.
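
(As a usage note: the new event should be visible via the usual xentrace
workflow, e.g.

    xentrace -e 0x0008f000 trace.bin      # 0x0008... is the HVM class
    xentrace_format formats < trace.bin

though the exact options may vary with the tools version in use.)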

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 tools/xentrace/formats          |  1 +
 xen/arch/x86/hvm/vmx/vmx.c      | 12 +++++++++++-
 xen/include/asm-x86/hvm/trace.h |  1 +
 xen/include/public/trace.h      |  1 +
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..999ca8c 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,7 @@
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_LIST_ADD [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c8ef18a..efff6cd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    atomic_t             counter;
 };
 
 /*
@@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
+    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, v->processor,
+                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -186,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        atomic_dec(&container_of(pi_blocking_list_lock,
+                                 struct vmx_pi_blocking_vcpu, lock)->counter);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -234,6 +240,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
@@ -258,6 +265,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
 
             list_move(&vmx->pi_blocking.list,
                       &per_cpu(vmx_pi_blocking, new_cpu).list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
+            atomic_inc(&per_cpu(vmx_pi_blocking, new_cpu).counter);
             vmx->pi_blocking.lock = new_lock;
 
             spin_unlock(new_lock);
@@ -2360,7 +2369,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     struct arch_vmx_struct *vmx, *tmp;
     spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
     struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
@@ -2377,6 +2386,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter);
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..b74ffdd 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,7 @@
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_LIST_ADD      DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..c716d57 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -234,6 +234,7 @@
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_PI_LIST_ADD      (TRC_HVM_HANDLER + 0x26)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
-- 
1.8.3.1



* [PATCH v2 2/5] vcpu: track hvm vcpu number on the system
  2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
@ 2017-05-11  6:04 ` Chao Gao
  2017-05-11 11:35   ` Wei Liu
  2017-05-11  6:04 ` [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, Chao Gao

This number is used to calculate how many HVM vCPUs run on a pCPU on average.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/common/domain.c     | 8 ++++++++
 xen/include/xen/sched.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index b22aacc..d433d9e 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -71,6 +71,9 @@ struct vcpu *idle_vcpu[NR_CPUS] __read_mostly;
 
 vcpu_info_t dummy_vcpu_info;
 
+/* How many HVM vCPUs on this system? */
+atomic_t num_hvm_vcpus;
+
 static void __domain_finalise_shutdown(struct domain *d)
 {
     struct vcpu *v;
@@ -193,6 +196,9 @@ struct vcpu *alloc_vcpu(
     if ( !is_idle_domain(d) )
         domain_update_node_affinity(d);
 
+    if ( is_hvm_domain(d) )
+        atomic_inc(&num_hvm_vcpus);
+
     return v;
 }
 
@@ -803,6 +809,8 @@ static void complete_domain_destroy(struct rcu_head *head)
         vcpu_destroy(v);
         sched_destroy_vcpu(v);
         destroy_waitqueue_vcpu(v);
+        if ( is_hvm_domain(d) )
+            atomic_dec(&num_hvm_vcpus);
     }
 
     grant_table_destroy(d);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 1127ca9..5fb492d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -139,6 +139,8 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
 
 struct waitqueue_vcpu;
 
+extern atomic_t num_hvm_vcpus;
+
 struct vcpu
 {
     int              vcpu_id;
-- 
1.8.3.1



* [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
  2017-05-11  6:04 ` [PATCH v2 2/5] vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-05-11  6:04 ` Chao Gao
  2017-05-15  5:24   ` Tian, Kevin
  2017-05-15 15:48   ` George Dunlap
  2017-05-11  6:04 ` [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc Chao Gao
  2017-05-11  6:04 ` [PATCH v2 5/5] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
  4 siblings, 2 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

Currently, a blocked vCPU is put on its pCPU's PI blocking list. If too
many vCPUs are blocked on a given pCPU, the list grows too long. A simple
analysis shows that with 32k domains and 128 vCPUs per domain, about 4M
vCPUs may be blocked in one pCPU's PI blocking list. When a wakeup
interrupt arrives, the list is traversed to find the specific vCPUs to
wake up, and in that case the traversal would consume too much time.

To mitigate this issue, this patch limits the number of vCPUs on a given
pCPU's list, taking factors such as the performance of the common case,
the current HVM vCPU count and the current pCPU count into consideration.
With this method, the common case stays fast and, in extreme cases, the
list length is kept under control.

The change in vmx_pi_unblock_vcpu() is for the following case: the vCPU
is running -> it tries to block (this patch may change NDST to another
pCPU), but the notification comes in time, so the vCPU goes back to the
running state -> VM-entry (we should set NDST again, reverting the
change made to NDST in vmx_vcpu_block())
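
In short, the intended NDST pairing is (sketch, not the literal diff):

    /* vmx_vcpu_block(): wakeups may be redirected to another pCPU */
    dest = cpu_physical_id(dest_cpu);
    write_atomic(&pi_desc->ndst,
                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

    /* vmx_pi_unblock_vcpu(): unconditionally restore the vCPU's own pCPU */
    dest = cpu_physical_id(v->processor);
    write_atomic(&pi_desc->ndst,
                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

Restoring unconditionally also covers the case where the notification
came in time and the vCPU never actually landed on any blocking list.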

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 78 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index efff6cd..c0d0b58 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+/*
+ * Choose an appropriate pcpu to receive wakeup interrupt.
+ * By default, the local pcpu is chosen as the destination. But if the
+ * vcpu number of the local pcpu exceeds a limit, another pcpu is chosen.
+ *
+ * Currently, choose (v_tot/p_tot) + K as the per-pcpu vcpu limit, where
+ * v_tot is the total number of vcpus on the system, p_tot is the total
+ * number of pcpus in the system, and K is a fixed number. Experiments show
+ * that the maximal time to wake up a vcpu from a 128-entry blocking list
+ * is acceptable. So choose 128 as the fixed number K.
+ *
+ * This policy makes sure:
+ * 1) for common cases, the limit won't be reached and the local pcpu is used
+ * which is beneficial to performance (at least, it avoids an IPI when
+ * unblocking a vcpu).
+ * 2) for the worst case, the blocking list length scales with the vcpu count
+ * divided by the pcpu count.
+ */
+#define PI_LIST_FIXED_NUM 128
+#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
+                           PI_LIST_FIXED_NUM)
+
+static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
+{
+    int count, limit = PI_LIST_LIMIT;
+    unsigned int dest = v->processor;
+
+    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
+    while ( unlikely(count >= limit) )
+    {
+        dest = cpumask_cycle(dest, &cpu_online_map);
+        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
+    }
+    return dest;
+}
+
 static void vmx_vcpu_block(struct vcpu *v)
 {
     unsigned long flags;
-    unsigned int dest;
+    unsigned int dest, dest_cpu;
     spinlock_t *old_lock;
-    spinlock_t *pi_blocking_list_lock =
-		&per_cpu(vmx_pi_blocking, v->processor).lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    spinlock_t *pi_blocking_list_lock;
+
+    /*
+     * After pCPU goes down, the per-cpu PI blocking list is cleared.
+     * To make sure the parameter vCPU is added to the chosen pCPU's
+     * PI blocking list before the list is cleared, just retry when
+     * finding the pCPU has gone down. Also retry to choose another
+     * pCPU when finding the list length reaches the limit.
+     */
+ retry:
+    dest_cpu = vmx_pi_choose_dest_cpu(v);
+    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
 
     spin_lock_irqsave(pi_blocking_list_lock, flags);
+    if ( unlikely((!cpu_online(dest_cpu)) ||
+                  (atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter) >=
+                   PI_LIST_LIMIT)) )
+    {
+        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+        goto retry;
+    }
+
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
 
@@ -120,11 +174,11 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
-    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
-    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, v->processor,
-                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
+    atomic_inc(&per_cpu(vmx_pi_blocking, dest_cpu).counter);
+    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, dest_cpu,
+                atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
-                  &per_cpu(vmx_pi_blocking, v->processor).list);
+                  &per_cpu(vmx_pi_blocking, dest_cpu).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 
     ASSERT(!pi_test_sn(pi_desc));
@@ -134,6 +188,13 @@ static void vmx_vcpu_block(struct vcpu *v)
     ASSERT(pi_desc->ndst ==
            (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
 
+    if ( unlikely(dest_cpu != v->processor) )
+    {
+        dest = cpu_physical_id(dest_cpu);
+        write_atomic(&pi_desc->ndst,
+                 (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
+    }
+
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
 }
 
@@ -163,6 +224,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned int dest = cpu_physical_id(v->processor);
 
     /*
      * Set 'NV' field back to posted_intr_vector, so the
@@ -170,6 +232,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
      * it is running in non-root mode.
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
+    write_atomic(&pi_desc->ndst,
+                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
 
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
-- 
1.8.3.1



* [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc
  2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (2 preceding siblings ...)
  2017-05-11  6:04 ` [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
@ 2017-05-11  6:04 ` Chao Gao
  2017-05-15 14:42   ` George Dunlap
  2017-05-11  6:04 ` [PATCH v2 5/5] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
  4 siblings, 1 reply; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

This patch introduces a 'refcnt' field in vmx_pi_blocking to track the
reference count of the vCPU's 'pi_desc'. This field changes every time
an IRTE is re-programmed.
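
In short, the discipline when re-programming an IRTE is (sketch, not the
literal diff):

    old_pi_desc = msi_desc->pi_desc;

    pi_get_ref(pi_desc);                /* take the new reference first */
    msi_desc->pi_desc = pi_desc;
    rc = iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
    if ( !rc )
        pi_put_ref(old_pi_desc);        /* drop the old one on success */

Taking the new reference before dropping the old one means the refcount
never transiently reports the pi_desc as unused while an IRTE still
points at it.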

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c             | 29 ++++++++++++++++++++++++
 xen/drivers/passthrough/io.c           |  2 +-
 xen/drivers/passthrough/vtd/intremap.c | 41 ++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/hvm/domain.h       |  6 +++++
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  3 +++
 xen/include/asm-x86/iommu.h            |  2 +-
 xen/include/asm-x86/msi.h              |  2 +-
 7 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c0d0b58..45a372e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -100,6 +100,23 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+void vmx_pi_get_ref(struct vcpu *v)
+{
+    ASSERT(atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt) >= 0);
+    atomic_inc(&v->arch.hvm_vmx.pi_blocking.refcnt);
+}
+
+void vmx_pi_put_ref(struct vcpu *v)
+{
+    atomic_dec(&v->arch.hvm_vmx.pi_blocking.refcnt);
+    ASSERT(atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt) >= 0);
+}
+
+bool vmx_pi_in_use(struct vcpu *v)
+{
+    return !!atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt);
+}
+
 /*
  * Choose an appropriate pcpu to receive wakeup interrupt.
  * By default, the local pcpu is chosen as the destination. But if the
@@ -422,6 +439,9 @@ void vmx_pi_hooks_assign(struct domain *d)
 
     d->arch.hvm_domain.pi_ops.vcpu_block = vmx_vcpu_block;
     d->arch.hvm_domain.pi_ops.do_resume = vmx_pi_do_resume;
+    d->arch.hvm_domain.pi_ops.get_ref = vmx_pi_get_ref;
+    d->arch.hvm_domain.pi_ops.put_ref = vmx_pi_put_ref;
+    d->arch.hvm_domain.pi_ops.in_use = vmx_pi_in_use;
 }
 
 /* This function is called when pcidevs_lock is held */
@@ -460,6 +480,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
     d->arch.hvm_domain.pi_ops.vcpu_block = NULL;
     d->arch.hvm_domain.pi_ops.switch_from = NULL;
     d->arch.hvm_domain.pi_ops.do_resume = NULL;
+    d->arch.hvm_domain.pi_ops.get_ref = NULL;
+    d->arch.hvm_domain.pi_ops.put_ref = NULL;
+    d->arch.hvm_domain.pi_ops.in_use = NULL;
+    /*
+     * If the device is still in use by the guest but we forcibly deassign
+     * it, then the 'refcnt' is not zero here. Clear it for re-assignment.
+     */
+    for_each_vcpu ( d, v )
+        atomic_set(&v->arch.hvm_vmx.pi_blocking.refcnt, 0);
 
     for_each_vcpu ( d, v )
         vmx_pi_unblock_vcpu(v);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index e5a43e5..979be77 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -319,7 +319,7 @@ int pt_irq_create_bind(
     {
         uint8_t dest, dest_mode, delivery_mode;
         int dest_vcpu_id;
-        const struct vcpu *vcpu;
+        struct vcpu *vcpu;
 
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         {
diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 1e0317c..99f1cce 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -596,6 +596,32 @@ static int remap_entry_to_msi_msg(
     return 0;
 }
 
+static void pi_get_ref(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.get_ref )
+        v->domain->arch.hvm_domain.pi_ops.get_ref(v);
+}
+
+static void pi_put_ref(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.put_ref )
+        v->domain->arch.hvm_domain.pi_ops.put_ref(v);
+}
+
 static int msi_msg_to_remap_entry(
     struct iommu *iommu, struct pci_dev *pdev,
     struct msi_desc *msi_desc, struct msi_msg *msg)
@@ -619,6 +645,7 @@ static int msi_msg_to_remap_entry(
         {
             free_remap_entry(iommu, msi_desc->remap_index + i);
             msi_desc[i].irte_initialized = false;
+            pi_put_ref(msi_desc[i].pi_desc);
         }
         spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
         return 0;
@@ -962,11 +989,12 @@ void iommu_disable_x2apic_IR(void)
  * This function is used to update the IRTE for posted-interrupt
  * when guest changes MSI/MSI-X information.
  */
-int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
+int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     const uint8_t gvec)
 {
     struct irq_desc *desc;
     struct msi_desc *msi_desc;
+    struct pi_desc *old_pi_desc;
     int rc;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -979,13 +1007,22 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
         rc = -ENODEV;
         goto unlock_out;
     }
+    old_pi_desc = msi_desc->pi_desc;
+
+    pi_get_ref(pi_desc);
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
 
     spin_unlock_irq(&desc->lock);
 
     ASSERT(pcidevs_locked());
-    return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
+    rc = iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
+    if ( !rc )
+        pi_put_ref(old_pi_desc);
+    else
+        ASSERT_UNREACHABLE();
+
+    return rc;
 
  unlock_out:
     spin_unlock_irq(&desc->lock);
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9..6fc97c4 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -98,6 +98,12 @@ struct hvm_pi_ops {
 
     /* Hook into the vmentry path. */
     void (*do_resume)(struct vcpu *v);
+
+    /* Get/put the refcount of this vCPU's pi_desc */
+    void (*get_ref)(struct vcpu *v);
+    void (*put_ref)(struct vcpu *v);
+    /* Is this vCPU's pi_desc referenced by any IRTE? */
+    bool (*in_use)(struct vcpu *v);
 };
 
 struct hvm_domain {
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 9507bd2..7cb1a92 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -85,6 +85,7 @@ struct pi_desc {
 struct pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           *lock;
+    atomic_t             refcnt;  /* How many IRTEs refer to this vCPU? */
 };
 
 struct arch_vmx_struct {
@@ -160,6 +161,8 @@ struct arch_vmx_struct {
     struct pi_blocking_vcpu pi_blocking;
 };
 
+#define pi_desc_to_vcpu(a) container_of(a, struct vcpu, arch.hvm_vmx.pi_desc)
+
 int vmx_create_vmcs(struct vcpu *v);
 void vmx_destroy_vmcs(struct vcpu *v);
 void vmx_vmcs_enter(struct vcpu *v);
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 0431233..cfa0058 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -92,7 +92,7 @@ bool_t iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
-int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
+int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
                    const uint8_t gvec);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index a5de6a1..fbf1793 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -105,7 +105,7 @@ struct msi_desc {
 
 	bool irte_initialized;
 	uint8_t gvec;			/* guest vector. valid when pi_desc isn't NULL */
-	const struct pi_desc *pi_desc;	/* pointer to posted descriptor */
+	struct pi_desc *pi_desc;	/* pointer to posted descriptor */
 
 	struct list_head list;
 
-- 
1.8.3.1



* [PATCH v2 5/5] VT-d PI: Don't add vCPU to PI blocking list for a case
  2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (3 preceding siblings ...)
  2017-05-11  6:04 ` [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc Chao Gao
@ 2017-05-11  6:04 ` Chao Gao
  4 siblings, 0 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-11  6:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

Currently, a blocked vCPU is added to the PI blocking list if its
domain has assigned devices. Actually, some blocked vCPUs will never be
woken up by the wakeup interrupt generated by VT-d hardware; they may
be woken up by IPIs or by interrupts from emulated devices. Don't add
such vCPUs to the PI blocking list.

If a vCPU is blocked before being bound to an IRTE, we need to add it
to the blocking list when binding it to an IRTE. In that case,
arch_vcpu_block() may be called from another vCPU, which the current
implementation can't handle. This patch expands arch_vcpu_block(),
removing some restrictions expressed by assertions and handling the
target vCPU according to its status and its PI blocking list lock
(v->arch.hvm_vmx.pi_blocking.lock).
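
The cmpxchg() on pi_blocking.lock is what makes the cross-vCPU call safe
(sketch, not the literal diff): whichever caller installs the lock
pointer first wins, and the loser backs out instead of hitting the old
ASSERT():

    old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                       pi_blocking_list_lock);
    if ( old_lock )     /* someone else already queued this vCPU */
    {
        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
        return;
    }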

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c             | 20 +++++++++++++-------
 xen/drivers/passthrough/vtd/intremap.c | 18 ++++++++++++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 45a372e..03d5fce 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -161,6 +161,14 @@ static void vmx_vcpu_block(struct vcpu *v)
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
     spinlock_t *pi_blocking_list_lock;
 
+    /* If no IRTE refers to 'pi_desc', no further operation is needed. */
+    if ( v->domain->arch.hvm_domain.pi_ops.in_use &&
+         !v->domain->arch.hvm_domain.pi_ops.in_use(v) )
+        return;
+
+    if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+        return;
+
     /*
      * After pCPU goes down, the per-cpu PI blocking list is cleared.
      * To make sure the parameter vCPU is added to the chosen pCPU's
@@ -183,13 +191,11 @@ static void vmx_vcpu_block(struct vcpu *v)
 
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
-
-    /*
-     * 'v->arch.hvm_vmx.pi_blocking.lock' should be NULL before
-     * being assigned to a new value, since the vCPU is currently
-     * running and it cannot be on any blocking list.
-     */
-    ASSERT(old_lock == NULL);
+    if ( old_lock )
+    {
+        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+        return;
+    }
 
     atomic_inc(&per_cpu(vmx_pi_blocking, dest_cpu).counter);
     HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, dest_cpu,
diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 99f1cce..806e397 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -622,6 +622,20 @@ static void pi_put_ref(struct pi_desc *pi_desc)
         v->domain->arch.hvm_domain.pi_ops.put_ref(v);
 }
 
+static bool pi_in_use(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return 0;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.in_use )
+        return v->domain->arch.hvm_domain.pi_ops.in_use(v);
+    return 0;
+}
+
 static int msi_msg_to_remap_entry(
     struct iommu *iommu, struct pci_dev *pdev,
     struct msi_desc *msi_desc, struct msi_msg *msg)
@@ -996,6 +1010,7 @@ int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     struct msi_desc *msi_desc;
     struct pi_desc *old_pi_desc;
     int rc;
+    bool first_ref;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
     if ( !desc )
@@ -1009,7 +1024,10 @@ int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     }
     old_pi_desc = msi_desc->pi_desc;
 
+    first_ref = !pi_in_use(pi_desc);
     pi_get_ref(pi_desc);
+    if ( pi_desc && first_ref )
+        arch_vcpu_block(pi_desc_to_vcpu(pi_desc));
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
 
-- 
1.8.3.1



* Re: [PATCH v2 2/5] vcpu: track hvm vcpu number on the system
  2017-05-11  6:04 ` [PATCH v2 2/5] vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-05-11 11:35   ` Wei Liu
  2017-05-11 11:37     ` Wei Liu
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Liu @ 2017-05-11 11:35 UTC (permalink / raw)
  To: Chao Gao
  Cc: Tim Deegan, Stefano Stabellini, Wei Liu, George Dunlap,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Thu, May 11, 2017 at 02:04:09PM +0800, Chao Gao wrote:
> This number is used to calculate how many hvm vcpu on a pcpu on average.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  xen/common/domain.c     | 8 ++++++++
>  xen/include/xen/sched.h | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index b22aacc..d433d9e 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -71,6 +71,9 @@ struct vcpu *idle_vcpu[NR_CPUS] __read_mostly;
>  
>  vcpu_info_t dummy_vcpu_info;
>  
> +/* how many hvm vcpu on this system? */
> +atomic_t num_hvm_vcpus;
> +

This is x86 specific and should go to x86/domain.c


* Re: [PATCH v2 2/5] vcpu: track hvm vcpu number on the system
  2017-05-11 11:35   ` Wei Liu
@ 2017-05-11 11:37     ` Wei Liu
  2017-05-12  8:23       ` Chao Gao
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Liu @ 2017-05-11 11:37 UTC (permalink / raw)
  To: Chao Gao
  Cc: Tim Deegan, Stefano Stabellini, Wei Liu, George Dunlap,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich

On Thu, May 11, 2017 at 12:35:11PM +0100, Wei Liu wrote:
> On Thu, May 11, 2017 at 02:04:09PM +0800, Chao Gao wrote:
> > This number is used to calculate how many hvm vcpu on a pcpu on average.
> > 
> > Signed-off-by: Chao Gao <chao.gao@intel.com>
> > ---
> >  xen/common/domain.c     | 8 ++++++++
> >  xen/include/xen/sched.h | 2 ++
> >  2 files changed, 10 insertions(+)
> > 
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > index b22aacc..d433d9e 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -71,6 +71,9 @@ struct vcpu *idle_vcpu[NR_CPUS] __read_mostly;
> >  
> >  vcpu_info_t dummy_vcpu_info;
> >  
> > +/* how many hvm vcpu on this system? */
> > +atomic_t num_hvm_vcpus;
> > +
> 
> This is x86 specific and should go to x86/domain.c

... as with all the code that manipulates it. I'm sure you can find the
appropriate places like arch_initialise/destroy_vcpu.
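
Something like this (untested sketch, exact hook names may differ):

    /* xen/arch/x86/domain.c */
    atomic_t num_hvm_vcpus;

    /* in vcpu_initialise(): */
    if ( is_hvm_domain(v->domain) )
        atomic_inc(&num_hvm_vcpus);

    /* in vcpu_destroy(): */
    if ( is_hvm_domain(v->domain) )
        atomic_dec(&num_hvm_vcpus);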


* Re: [PATCH v2 2/5] vcpu: track hvm vcpu number on the system
  2017-05-11 11:37     ` Wei Liu
@ 2017-05-12  8:23       ` Chao Gao
  0 siblings, 0 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-12  8:23 UTC (permalink / raw)
  To: Wei Liu
  Cc: Tim Deegan, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, xen-devel, Jan Beulich

On Thu, May 11, 2017 at 12:37:37PM +0100, Wei Liu wrote:
>On Thu, May 11, 2017 at 12:35:11PM +0100, Wei Liu wrote:
>> On Thu, May 11, 2017 at 02:04:09PM +0800, Chao Gao wrote:
>> > This number is used to calculate how many hvm vcpu on a pcpu on average.
>> > 
>> > Signed-off-by: Chao Gao <chao.gao@intel.com>
>> > ---
>> >  xen/common/domain.c     | 8 ++++++++
>> >  xen/include/xen/sched.h | 2 ++
>> >  2 files changed, 10 insertions(+)
>> > 
>> > diff --git a/xen/common/domain.c b/xen/common/domain.c
>> > index b22aacc..d433d9e 100644
>> > --- a/xen/common/domain.c
>> > +++ b/xen/common/domain.c
>> > @@ -71,6 +71,9 @@ struct vcpu *idle_vcpu[NR_CPUS] __read_mostly;
>> >  
>> >  vcpu_info_t dummy_vcpu_info;
>> >  
>> > +/* how many hvm vcpu on this system? */
>> > +atomic_t num_hvm_vcpus;
>> > +
>> 
>> This is x86 specific and should go to x86/domain.c
>
>... as with all the code that manipulates it. I'm sure you can find the
>appropriate places like arch_initialise/destroy_vcpu.

Agreed. I could have made this better by thinking about it more.

Thanks
Chao


* Re: [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD
  2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
@ 2017-05-15  1:33   ` Tian, Kevin
  2017-05-15  8:57     ` Chao Gao
  2017-05-15 15:14   ` George Dunlap
  1 sibling, 1 reply; 17+ messages in thread
From: Tian, Kevin @ 2017-05-15  1:33 UTC (permalink / raw)
  To: Gao, Chao, xen-devel
  Cc: Wei Liu, Nakajima, Jun, George Dunlap, Andrew Cooper,
	Ian Jackson, Jan Beulich

> From: Gao, Chao
> Sent: Thursday, May 11, 2017 2:04 PM
> 
> This patch adds TRC_HVM_PI_LIST_ADD to track adding one entry to
> the per-pcpu blocking list. Also introduce a 'counter' to track
> the number of entries in the list.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  tools/xentrace/formats          |  1 +
>  xen/arch/x86/hvm/vmx/vmx.c      | 12 +++++++++++-
>  xen/include/asm-x86/hvm/trace.h |  1 +
>  xen/include/public/trace.h      |  1 +
>  4 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/xentrace/formats b/tools/xentrace/formats
> index 8b31780..999ca8c 100644
> --- a/tools/xentrace/formats
> +++ b/tools/xentrace/formats
> @@ -125,6 +125,7 @@
>  0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value =
> 0x%(1)08x ]
>  0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa =
> 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt =
> 0x%(6)04x ]
>  0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector =
> 0x%(1)02x ]
> +0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_LIST_ADD [ domid =
> 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
> 
>  0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map
> [ domid = %(1)d ]
>  0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap
> [ domid = %(1)d ]
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index c8ef18a..efff6cd 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs
> *regs);
>  struct vmx_pi_blocking_vcpu {
>      struct list_head     list;
>      spinlock_t           lock;
> +    atomic_t             counter;
>  };
> 
>  /*
> @@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
>       */
>      ASSERT(old_lock == NULL);
> 
> +    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
> +    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, v-
> >processor,
> +                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
>      list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
>                    &per_cpu(vmx_pi_blocking, v->processor).list);
>      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> @@ -186,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>      {
>          ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
>          list_del(&v->arch.hvm_vmx.pi_blocking.list);
> +        atomic_dec(&container_of(pi_blocking_list_lock,
> +                                 struct vmx_pi_blocking_vcpu, lock)->counter);
>          v->arch.hvm_vmx.pi_blocking.lock = NULL;
>      }
> 
> @@ -234,6 +240,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
>          if ( pi_test_on(&vmx->pi_desc) )
>          {
>              list_del(&vmx->pi_blocking.list);
> +            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
>              vmx->pi_blocking.lock = NULL;
>              vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
>          }
> @@ -258,6 +265,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
> 
>              list_move(&vmx->pi_blocking.list,
>                        &per_cpu(vmx_pi_blocking, new_cpu).list);
> +            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
> +            atomic_inc(&per_cpu(vmx_pi_blocking, new_cpu).counter);

Don't you also need a trace here?

and from completeness p.o.v, is it useful to trace both dec/inc?

>              vmx->pi_blocking.lock = new_lock;
> 
>              spin_unlock(new_lock);
> @@ -2360,7 +2369,7 @@ static void pi_wakeup_interrupt(struct
> cpu_user_regs *regs)
>      struct arch_vmx_struct *vmx, *tmp;
>      spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
>      struct list_head *blocked_vcpus =
> -		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
> +               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
> 
>      ack_APIC_irq();
>      this_cpu(irq_count)++;
> @@ -2377,6 +2386,7 @@ static void pi_wakeup_interrupt(struct
> cpu_user_regs *regs)
>          if ( pi_test_on(&vmx->pi_desc) )
>          {
>              list_del(&vmx->pi_blocking.list);
> +            atomic_dec(&per_cpu(vmx_pi_blocking,
> smp_processor_id()).counter);
>              ASSERT(vmx->pi_blocking.lock == lock);
>              vmx->pi_blocking.lock = NULL;
>              vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
> diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-
> x86/hvm/trace.h
> index de802a6..b74ffdd 100644
> --- a/xen/include/asm-x86/hvm/trace.h
> +++ b/xen/include/asm-x86/hvm/trace.h
> @@ -54,6 +54,7 @@
>  #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
>  #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
>  #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
> +#define DO_TRC_HVM_PI_LIST_ADD      DEFAULT_HVM_MISC
> 
> 
>  #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
> diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
> index 7f2e891..c716d57 100644
> --- a/xen/include/public/trace.h
> +++ b/xen/include/public/trace.h
> @@ -234,6 +234,7 @@
>  #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
>  #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
>  #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
> +#define TRC_HVM_PI_LIST_ADD      (TRC_HVM_HANDLER + 0x26)
> 
>  #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
>  #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
> --
> 1.8.3.1



* Re: [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-11  6:04 ` [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
@ 2017-05-15  5:24   ` Tian, Kevin
  2017-05-15  9:27     ` Chao Gao
  2017-05-15 15:48   ` George Dunlap
  1 sibling, 1 reply; 17+ messages in thread
From: Tian, Kevin @ 2017-05-15  5:24 UTC (permalink / raw)
  To: Gao, Chao, xen-devel; +Cc: Andrew Cooper, Jan Beulich, Nakajima, Jun

> From: Gao, Chao
> Sent: Thursday, May 11, 2017 2:04 PM
> 
> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
> too many vCPUs are blocked on a given pCPU, it will incur that the list
> grows too long. After a simple analysis, there are 32k domains and
> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
> PI blocking list. When a wakeup interrupt arrives, the list is
> traversed to find some specific vCPUs to wake them up. This traversal in
> that case would consume much time.
> 
> To mitigate this issue, this patch limits the vcpu number on a given
> pCPU, taking factors such as perfomance of common case, current hvm vcpu
> count and current pcpu count into consideration. With this method, for
> the common case, it works fast and for some extreme cases, the list
> length is under control.
> 
> The change in vmx_pi_unblock_vcpu() is for the following case:
> vcpu is running -> try to block (this patch may change NSDT to
> another pCPU) but notification comes in time, thus the vcpu
> goes back to running station -> VM-entry (we should set NSDT again,
> reverting the change we make to NSDT in vmx_vcpu_block())
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 78
> +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index efff6cd..c0d0b58 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
> 
> +/*
> + * Choose an appropriate pcpu to receive wakeup interrupt.
> + * By default, the local pcpu is chosen as the destination. But if the
> + * vcpu number of the local pcpu exceeds a limit, another pcpu is chosen.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu, where
> + * v_tot is the total number of vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. Experments
> shows
> + * the maximal time to wakeup a vcpu from a 128-entry blocking list is
> + * considered acceptable. So choose 128 as the fixed number K.

better to provide your experimental data here so others have a
gut-feeling for why it's acceptable...

> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is
> used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) /
> num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +
> +static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
> +{
> +    int count, limit = PI_LIST_LIMIT;
> +    unsigned int dest = v->processor;
> +
> +    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    while ( unlikely(count >= limit) )
> +    {
> +        dest = cpumask_cycle(dest, &cpu_online_map);
> +        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    }

is it possible to hit an infinite loop here?

> +    return dest;
> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
>      unsigned long flags;
> -    unsigned int dest;
> +    unsigned int dest, dest_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -		&per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +
> +    /*
> +     * After pCPU goes down, the per-cpu PI blocking list is cleared.
> +     * To make sure the parameter vCPU is added to the chosen pCPU's
> +     * PI blocking list before the list is cleared, just retry when
> +     * finding the pCPU has gone down. Also retry to choose another
> +     * pCPU when finding the list length reachs the limit.
> +     */
> + retry:
> +    dest_cpu = vmx_pi_choose_dest_cpu(v);
> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
> 
>      spin_lock_irqsave(pi_blocking_list_lock, flags);
> +    if ( unlikely((!cpu_online(dest_cpu)) ||
> +                  (atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter) >=
> +                   PI_LIST_LIMIT)) )
> +    {
> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> +        goto retry;
> +    }
> +
>      old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
>                         pi_blocking_list_lock);
> 
> @@ -120,11 +174,11 @@ static void vmx_vcpu_block(struct vcpu *v)
>       */
>      ASSERT(old_lock == NULL);
> 
> -    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
> -    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, v-
> >processor,
> -                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
> +    atomic_inc(&per_cpu(vmx_pi_blocking, dest_cpu).counter);
> +    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id,
> dest_cpu,
> +                atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter));
>      list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
> -                  &per_cpu(vmx_pi_blocking, v->processor).list);
> +                  &per_cpu(vmx_pi_blocking, dest_cpu).list);
>      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> 
>      ASSERT(!pi_test_sn(pi_desc));
> @@ -134,6 +188,13 @@ static void vmx_vcpu_block(struct vcpu *v)
>      ASSERT(pi_desc->ndst ==
>             (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
> 
> +    if ( unlikely(dest_cpu != v->processor) )
> +    {
> +        dest = cpu_physical_id(dest_cpu);
> +        write_atomic(&pi_desc->ndst,
> +                 (x2apic_enabled ? dest : MASK_INSR(dest,
> PI_xAPIC_NDST_MASK)));
> +    }
> +
>      write_atomic(&pi_desc->nv, pi_wakeup_vector);
>  }
> 
> @@ -163,6 +224,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>      unsigned long flags;
>      spinlock_t *pi_blocking_list_lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    unsigned int dest = cpu_physical_id(v->processor);
> 
>      /*
>       * Set 'NV' field back to posted_intr_vector, so the
> @@ -170,6 +232,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>       * it is running in non-root mode.
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> +    write_atomic(&pi_desc->ndst,
> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
> 
>      pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
> 
> --
> 1.8.3.1



* Re: [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD
  2017-05-15  1:33   ` Tian, Kevin
@ 2017-05-15  8:57     ` Chao Gao
  0 siblings, 0 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-15  8:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wei Liu, Nakajima, Jun, George Dunlap, Andrew Cooper,
	Ian Jackson, xen-devel, Jan Beulich

On Mon, May 15, 2017 at 09:33:04AM +0800, Tian, Kevin wrote:
>> From: Gao, Chao
>> Sent: Thursday, May 11, 2017 2:04 PM
>> 
>> This patch adds TRC_HVM_PI_LIST_ADD to track adding one entry to
>> the per-pcpu blocking list. Also introduce a 'counter' to track
>> the number of entries in the list.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>> @@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
>>       */
>>      ASSERT(old_lock == NULL);
>> 
>> +    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
>> +    HVMTRACE_4D(PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, v-
>> >processor,
>> +                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
>>      list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
>>                    &per_cpu(vmx_pi_blocking, v->processor).list);
>>      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
>> @@ -186,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>>      {
>>          ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
>>          list_del(&v->arch.hvm_vmx.pi_blocking.list);
>> +        atomic_dec(&container_of(pi_blocking_list_lock,
>> +                                 struct vmx_pi_blocking_vcpu, lock)->counter);
>>          v->arch.hvm_vmx.pi_blocking.lock = NULL;
>>      }
>> 
>> @@ -234,6 +240,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
>>          if ( pi_test_on(&vmx->pi_desc) )
>>          {
>>              list_del(&vmx->pi_blocking.list);
>> +            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
>>              vmx->pi_blocking.lock = NULL;
>>              vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
>>          }
>> @@ -258,6 +265,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
>> 
>>              list_move(&vmx->pi_blocking.list,
>>                        &per_cpu(vmx_pi_blocking, new_cpu).list);
>> +            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
>> +            atomic_inc(&per_cpu(vmx_pi_blocking, new_cpu).counter);
>
>Don't you also need a trace here?

Yes, it is needed.

>
>and from completeness p.o.v, is it useful to trace both dec/inc?
>

I tried to do this. Assuming the log should show which pcpu's list
has decreased, I don't know how to get the pcpu id in vmx_pi_unblock_vcpu(),
though we can reference the per-cpu structure.

Thanks
Chao


* Re: [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-15  5:24   ` Tian, Kevin
@ 2017-05-15  9:27     ` Chao Gao
  0 siblings, 0 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-15  9:27 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Andrew Cooper, Jan Beulich, Nakajima, Jun, xen-devel

On Mon, May 15, 2017 at 01:24:45PM +0800, Tian, Kevin wrote:
>> From: Gao, Chao
>> Sent: Thursday, May 11, 2017 2:04 PM
>> 
>> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
>> too many vCPUs are blocked on a given pCPU, it will incur that the list
>> grows too long. After a simple analysis, there are 32k domains and
>> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
>> PI blocking list. When a wakeup interrupt arrives, the list is
>> traversed to find some specific vCPUs to wake them up. This traversal in
>> that case would consume much time.
>> 
>> To mitigate this issue, this patch limits the vcpu number on a given
>> pCPU, taking factors such as perfomance of common case, current hvm vcpu
>> count and current pcpu count into consideration. With this method, for
>> the common case, it works fast and for some extreme cases, the list
>> length is under control.
>> 
>> The change in vmx_pi_unblock_vcpu() is for the following case:
>> vcpu is running -> try to block (this patch may change NSDT to
>> another pCPU) but notification comes in time, thus the vcpu
>> goes back to running station -> VM-entry (we should set NSDT again,
>> reverting the change we make to NSDT in vmx_vcpu_block())
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>>  xen/arch/x86/hvm/vmx/vmx.c | 78
>> +++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 71 insertions(+), 7 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index efff6cd..c0d0b58 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>>  }
>> 
>> +/*
>> + * Choose an appropriate pcpu to receive wakeup interrupt.
>> + * By default, the local pcpu is chosen as the destination. But if the
>> + * vcpu number of the local pcpu exceeds a limit, another pcpu is chosen.
>> + *
>> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu, where
>> + * v_tot is the total number of vcpus on the system, p_tot is the total
>> + * number of pcpus in the system, and K is a fixed number. Experments
>> shows
>> + * the maximal time to wakeup a vcpu from a 128-entry blocking list is
>> + * considered acceptable. So choose 128 as the fixed number K.
>
>better you can provide your experimental data here so others have
>a gut-feeling why it's acceptable...

Will add this.

>
>> + *
>> + * This policy makes sure:
>> + * 1) for common cases, the limit won't be reached and the local pcpu is
>> used
>> + * which is beneficial to performance (at least, avoid an IPI when unblocking
>> + * vcpu).
>> + * 2) for the worst case, the blocking list length scales with the vcpu count
>> + * divided by the pcpu count.
>> + */
>> +#define PI_LIST_FIXED_NUM 128
>> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) /
>> num_online_cpus() + \
>> +                           PI_LIST_FIXED_NUM)
>> +
>> +static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
>> +{
>> +    int count, limit = PI_LIST_LIMIT;
>> +    unsigned int dest = v->processor;
>> +
>> +    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
>> +    while ( unlikely(count >= limit) )
>> +    {
>> +        dest = cpumask_cycle(dest, &cpu_online_map);
>> +        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
>> +    }
>
>is it possible to hit infinite loop here?
>

Theoretically, it will not, because cpumask_cycle() iterates through all
online pcpus and it is impossible for every online pcpu to have reached
the upper bound: if all p_tot pcpus held at least v_tot/p_tot + 128
blocked vcpus, the total would exceed the v_tot vcpus that exist.
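
For concreteness, a minimal standalone sketch of that argument (plain C,
not Xen code; an array stands in for the per-cpu counters, which are
assumed frozen during the scan):

#include <assert.h>

#define FIXED_NUM 128   /* mirrors PI_LIST_FIXED_NUM */

/*
 * If every one of p_tot pcpus held at least v_tot/p_tot + FIXED_NUM
 * blocked vcpus, the total would exceed v_tot (flooring v_tot/p_tot
 * loses at most p_tot - 1, far less than the FIXED_NUM * p_tot added),
 * so at least one pcpu is always below the limit and the scan
 * terminates within p_tot steps.
 */
static unsigned int choose_dest(const unsigned int *counters,
                                unsigned int p_tot, unsigned int v_tot,
                                unsigned int start)
{
    unsigned int limit = v_tot / p_tot + FIXED_NUM;
    unsigned int dest = start, steps = 0;

    while ( counters[dest] >= limit )
    {
        dest = (dest + 1) % p_tot;   /* stands in for cpumask_cycle() */
        assert(++steps <= p_tot);    /* pigeonhole: cannot loop forever */
    }
    return dest;
}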

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc
  2017-05-11  6:04 ` [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc Chao Gao
@ 2017-05-15 14:42   ` George Dunlap
  0 siblings, 0 replies; 17+ messages in thread
From: George Dunlap @ 2017-05-15 14:42 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, xen-devel

On Thu, May 11, 2017 at 7:04 AM, Chao Gao <chao.gao@intel.com> wrote:
> This patch introduces a 'refcnt' field in vmx_pi_blocking to track
> the reference count of the vCPU's 'pi_desc'. This field is updated
> every time we re-program an IRTE.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c             | 29 ++++++++++++++++++++++++
>  xen/drivers/passthrough/io.c           |  2 +-
>  xen/drivers/passthrough/vtd/intremap.c | 41 ++++++++++++++++++++++++++++++++--
>  xen/include/asm-x86/hvm/domain.h       |  6 +++++
>  xen/include/asm-x86/hvm/vmx/vmcs.h     |  3 +++
>  xen/include/asm-x86/iommu.h            |  2 +-
>  xen/include/asm-x86/msi.h              |  2 +-

This doesn't apply to staging anymore:

error: while searching for:
int iommu_enable_x2apic_IR(void);
void iommu_disable_x2apic_IR(void);

int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
                   const uint8_t gvec);

#endif /* !__ARCH_X86_IOMMU_H__ */

error: patch failed: xen/include/asm-x86/iommu.h:92


 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD
  2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
  2017-05-15  1:33   ` Tian, Kevin
@ 2017-05-15 15:14   ` George Dunlap
  1 sibling, 0 replies; 17+ messages in thread
From: George Dunlap @ 2017-05-15 15:14 UTC (permalink / raw)
  To: Chao Gao, xen-devel
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Ian Jackson, Jan Beulich

On 11/05/17 07:04, Chao Gao wrote:
> This patch adds TRC_HVM_PI_LIST_ADD to track adding one entry to
> the per-pcpu blocking list. Also introduce a 'counter' to track
> the number of entries in the list.

So first of all, you have the order of importance backwards here.
The most important thing this patch does is add a counter to see
how many vcpus are on the list; tracing how that counter moves is
secondary.

Secondly...

> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  tools/xentrace/formats          |  1 +
>  xen/arch/x86/hvm/vmx/vmx.c      | 12 +++++++++++-
>  xen/include/asm-x86/hvm/trace.h |  1 +
>  xen/include/public/trace.h      |  1 +
>  4 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/xentrace/formats b/tools/xentrace/formats
> index 8b31780..999ca8c 100644
> --- a/tools/xentrace/formats
> +++ b/tools/xentrace/formats
> @@ -125,6 +125,7 @@
>  0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
>  0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
>  0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
> +0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_LIST_ADD [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
>  
>  0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
>  0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index c8ef18a..efff6cd 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>  struct vmx_pi_blocking_vcpu {
>      struct list_head     list;
>      spinlock_t           lock;
> +    atomic_t             counter;

Why is this atomic?  There's already a lock for this structure, and as
far as I can tell every access throughout the series is (or could be)
protected by a lock.
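
That is, presumably the counter could be a plain integer, touched only
where the list lock is already held; an untested sketch of that
direction (field and call-site names assumed from the existing vmx.c
code, not taken from this patch):

struct vmx_pi_blocking_vcpu {
    struct list_head     list;
    spinlock_t           lock;
    unsigned int         counter;    /* protected by 'lock' */
};

    /* e.g. on the add path, inside the pi_blocking_list_lock critical
     * section that vmx_vcpu_block() already holds: */
    list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                  &per_cpu(vmx_pi_blocking, v->processor).list);
    per_cpu(vmx_pi_blocking, v->processor).counter++;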

Finally, please add an entry to tools/xentrace/xenalyze.c to interpret
this value as well.
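
(For reference, the record payload per the formats entry above is four
32-bit words: domid, vcpu, pcpu, list length. Whatever shape the
xenalyze hook takes, it boils down to decoding and printing those
words, roughly as in this hypothetical, standalone sketch; it does not
reflect xenalyze's actual plumbing:)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout, implied by the formats entry above */
struct pi_list_add_rec {
    uint32_t domid, vcpu, pcpu, counter;
};

static void dump_pi_list_add(const struct pi_list_add_rec *r)
{
    printf("pi_list_add d%uv%u -> pcpu %u (#entry now %u)\n",
           r->domid, r->vcpu, r->pcpu, r->counter);
}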

Thanks,
 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-11  6:04 ` [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
  2017-05-15  5:24   ` Tian, Kevin
@ 2017-05-15 15:48   ` George Dunlap
  2017-05-15 16:13     ` Chao Gao
  1 sibling, 1 reply; 17+ messages in thread
From: George Dunlap @ 2017-05-15 15:48 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, xen-devel

On Thu, May 11, 2017 at 7:04 AM, Chao Gao <chao.gao@intel.com> wrote:
> Currently, a blocked vCPU is put in its pCPU's PI blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow too
> long. A simple worst-case analysis: with 32K domains and 128 vCPUs
> per domain, about 4M vCPUs may be blocked in one pCPU's PI blocking
> list. When a wakeup interrupt arrives, the list is traversed to find
> the specific vCPUs to wake up, and in that case the traversal would
> consume much time.
>
> To mitigate this issue, this patch limits the vcpu number on a given
> pCPU, taking factors such as the performance of the common case, the
> current hvm vcpu count and the current pcpu count into consideration.
> With this method, the common case stays fast and, for extreme cases,
> the list length is kept under control.
>
> The change in vmx_pi_unblock_vcpu() is for the following case:
> the vcpu is running -> tries to block (this patch may change NDST to
> another pCPU) but the notification comes in time, so the vcpu
> goes back to the running state -> VM-entry (we should set NDST again,
> reverting the change made to NDST in vmx_vcpu_block())
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 78 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 7 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index efff6cd..c0d0b58 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * Choose an appropriate pcpu to receive wakeup interrupt.
> + * By default, the local pcpu is chosen as the destination. But if the
> + * vcpu number of the local pcpu exceeds a limit, another pcpu is chosen.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the per-pcpu vcpu limit, where
> + * v_tot is the total number of vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. Experiments
> + * show the maximal time to wake up a vcpu from a 128-entry blocking
> + * list is acceptable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is
> + * used, which is beneficial to performance (at least, it avoids an IPI
> + * when unblocking the vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu
> + * count divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +
> +static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
> +{
> +    int count, limit = PI_LIST_LIMIT;
> +    unsigned int dest = v->processor;
> +
> +    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    while ( unlikely(count >= limit) )
> +    {
> +        dest = cpumask_cycle(dest, &cpu_online_map);
> +        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
> +    }
> +    return dest;
> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
>      unsigned long flags;
> -    unsigned int dest;
> +    unsigned int dest, dest_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +
> +    /*
> +     * After a pCPU goes down, its per-cpu PI blocking list is cleared.
> +     * To make sure the parameter vCPU is added to the chosen pCPU's
> +     * PI blocking list before the list is cleared, just retry when we
> +     * find the pCPU has gone down. Also retry choosing another pCPU
> +     * when the list length reaches the limit.
> +     */
> + retry:
> +    dest_cpu = vmx_pi_choose_dest_cpu(v);
> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
>
>      spin_lock_irqsave(pi_blocking_list_lock, flags);
> +    if ( unlikely((!cpu_online(dest_cpu)) ||
> +                  (atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter) >=
> +                   PI_LIST_LIMIT)) )
> +    {
> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> +        goto retry;
> +    }

Algorithmically I think this is on the right track. But all these
atomic reads and writes are a mess.  Atomic accesses aren't free; and
the vast majority of the time you're doing things with the
pi_blocking_list_lock held anyway.

Why not do something like this at the top of vmx_vcpu_block()
(replacing dest_cpu with pi_cpu for clarity)?

    pi_cpu = v->processor;
retry:
    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
    spin_lock_irqsave(pi_blocking_list_lock, flags);
    /*
     * Since pi_cpu may now be different from the one v is currently
     * running on, check to make sure that it's still up.
     */
    if ( unlikely((!cpu_online(pi_cpu)) ||
                  pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )
    {
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
        goto retry;
    }

where we define pi_over_limit() like this:

static bool pi_over_limit(int count) {
    /* Compare w/ constant first to save an atomic read in the common case */
    return (count > PI_LIST_FIXED_NUM)
         && (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
             PI_LIST_FIXED_NUM);
}

That way, in the incredibly common case where count < 128, you simply
grab the lock once and don't do *any* atomic reads, rather than doing
at least four atomic reads.

[snip]
> @@ -163,6 +224,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>      unsigned long flags;
>      spinlock_t *pi_blocking_list_lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    unsigned int dest = cpu_physical_id(v->processor);
>
>      /*
>       * Set 'NV' field back to posted_intr_vector, so the
> @@ -170,6 +232,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>       * it is running in non-root mode.
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> +    write_atomic(&pi_desc->ndst,
> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

Just checking -- if an interrupt is raised between these two lines,
what will happen?  Will the interrupt be queued up to be delivered to
the vcpu the next time it does a VMENTRY?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-15 15:48   ` George Dunlap
@ 2017-05-15 16:13     ` Chao Gao
  0 siblings, 0 replies; 17+ messages in thread
From: Chao Gao @ 2017-05-15 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, xen-devel

On Mon, May 15, 2017 at 04:48:47PM +0100, George Dunlap wrote:
>On Thu, May 11, 2017 at 7:04 AM, Chao Gao <chao.gao@intel.com> wrote:
>> Currently, a blocked vCPU is put in its pCPU's PI blocking list. If
>> too many vCPUs are blocked on a given pCPU, the list can grow too
>> long. A simple worst-case analysis: with 32K domains and 128 vCPUs
>> per domain, about 4M vCPUs may be blocked in one pCPU's PI blocking
>> list. When a wakeup interrupt arrives, the list is traversed to find
>> the specific vCPUs to wake up, and in that case the traversal would
>> consume much time.
>>
>> To mitigate this issue, this patch limits the vcpu number on a given
>> pCPU, taking factors such as the performance of the common case, the
>> current hvm vcpu count and the current pcpu count into consideration.
>> With this method, the common case stays fast and, for extreme cases,
>> the list length is kept under control.
>>
>> The change in vmx_pi_unblock_vcpu() is for the following case:
>> the vcpu is running -> tries to block (this patch may change NDST to
>> another pCPU) but the notification comes in time, so the vcpu
>> goes back to the running state -> VM-entry (we should set NDST again,
>> reverting the change made to NDST in vmx_vcpu_block())
>>
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>>  xen/arch/x86/hvm/vmx/vmx.c | 78 +++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 71 insertions(+), 7 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index efff6cd..c0d0b58 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -100,16 +100,70 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>>  }
>>
>> +/*
>> + * Choose an appropriate pcpu to receive wakeup interrupt.
>> + * By default, the local pcpu is chosen as the destination. But if the
>> + * vcpu number of the local pcpu exceeds a limit, another pcpu is chosen.
>> + *
>> + * Currently, choose (v_tot/p_tot) + K as the per-pcpu vcpu limit, where
>> + * v_tot is the total number of vcpus on the system, p_tot is the total
>> + * number of pcpus in the system, and K is a fixed number. Experiments
>> + * show the maximal time to wake up a vcpu from a 128-entry blocking
>> + * list is acceptable. So choose 128 as the fixed number K.
>> + *
>> + * This policy makes sure:
>> + * 1) for common cases, the limit won't be reached and the local pcpu is
>> + * used, which is beneficial to performance (at least, it avoids an IPI
>> + * when unblocking the vcpu).
>> + * 2) for the worst case, the blocking list length scales with the vcpu
>> + * count divided by the pcpu count.
>> + */
>> +#define PI_LIST_FIXED_NUM 128
>> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
>> +                           PI_LIST_FIXED_NUM)
>> +
>> +static unsigned int vmx_pi_choose_dest_cpu(struct vcpu *v)
>> +{
>> +    int count, limit = PI_LIST_LIMIT;
>> +    unsigned int dest = v->processor;
>> +
>> +    count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
>> +    while ( unlikely(count >= limit) )
>> +    {
>> +        dest = cpumask_cycle(dest, &cpu_online_map);
>> +        count = atomic_read(&per_cpu(vmx_pi_blocking, dest).counter);
>> +    }
>> +    return dest;
>> +}
>> +
>>  static void vmx_vcpu_block(struct vcpu *v)
>>  {
>>      unsigned long flags;
>> -    unsigned int dest;
>> +    unsigned int dest, dest_cpu;
>>      spinlock_t *old_lock;
>> -    spinlock_t *pi_blocking_list_lock =
>> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> +    spinlock_t *pi_blocking_list_lock;
>> +
>> +    /*
>> +     * After a pCPU goes down, its per-cpu PI blocking list is cleared.
>> +     * To make sure the parameter vCPU is added to the chosen pCPU's
>> +     * PI blocking list before the list is cleared, just retry when we
>> +     * find the pCPU has gone down. Also retry choosing another pCPU
>> +     * when the list length reaches the limit.
>> +     */
>> + retry:
>> +    dest_cpu = vmx_pi_choose_dest_cpu(v);
>> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
>>
>>      spin_lock_irqsave(pi_blocking_list_lock, flags);
>> +    if ( unlikely((!cpu_online(dest_cpu)) ||
>> +                  (atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter) >=
>> +                   PI_LIST_LIMIT)) )
>> +    {
>> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
>> +        goto retry;
>> +    }
>
>Algorithmically I think this is on the right track. But all these
>atomic reads and writes are a mess.  Atomic accesses aren't free; and
>the vast majority of the time you're doing things with the
>pi_blocking_list_lock held anyway.
>
>Why not do something like this at the top of vmx_vcpu_block()
>(replacing dest_cpu with pi_cpu for clarity)?
>
>    pi_cpu = v->processor;
>retry:
>    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
>    spin_lock_irqsave(pi_blocking_list_lock, flags);
>    /*
>     * Since pi_cpu may now be different from the one v is currently
>     * running on, check to make sure that it's still up.
>     */
>    if ( unlikely((!cpu_online(pi_cpu)) ||
>                  pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )
>    {
>        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
>        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
>        goto retry;
>    }
>
>where we define pi_over_limit() like this:
>
>static bool pi_over_limit(int count) {
>    /* Compare w/ constant first to save an atomic read in the common case */
>    return (count > PI_LIST_FIXED_NUM)
>         && (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
>             PI_LIST_FIXED_NUM);
>}
>
>That way, in the incredibly common case where count < 128, you simply
>grab the lock once and don't do *any* atomic reads, rather than doing
>at least four atomic reads.

Agreed. We should make things fast where possible. May I add you as
Suggested-by? Thanks for your help.

>
>[snip]
>> @@ -163,6 +224,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>>      unsigned long flags;
>>      spinlock_t *pi_blocking_list_lock;
>>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> +    unsigned int dest = cpu_physical_id(v->processor);
>>
>>      /*
>>       * Set 'NV' field back to posted_intr_vector, so the
>> @@ -170,6 +232,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>>       * it is running in non-root mode.
>>       */
>>      write_atomic(&pi_desc->nv, posted_intr_vector);
>> +    write_atomic(&pi_desc->ndst,
>> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
>
>Just checking -- if an interrupt is raised between these two lines,
>what will happen?  Will the interrupt be queued up to be delivered to
>the vcpu the next time it does a VMENTRY?

I think an interrupt raised between the two lines results in a spurious
interrupt to the stale destination pcpu. The interrupt will still be
injected at this VM-entry, since we sync PIR to vIRR there.
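
To illustrate my understanding (annotating the two writes from the
patch; the interleaving notes are my reading of the posting flow, not
verified against the spec):

    /* vmx_pi_unblock_vcpu(), with the posting window in between: */
    write_atomic(&pi_desc->nv, posted_intr_vector);
    /*
     * An interrupt posted here sets a PIR bit and sends the
     * notification with the new vector but the old NDST, i.e. a
     * spurious posted_intr_vector interrupt on the stale pcpu.
     */
    write_atomic(&pi_desc->ndst,
                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
    /*
     * The PIR bit survives, and PIR is synced into vIRR at the next
     * VM-entry, so the guest still receives the interrupt.
     */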

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-05-15 16:13 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-11  6:04 [PATCH v2 0/5] mitigate the per-pCPU blocking list may be too long Chao Gao
2017-05-11  6:04 ` [PATCH v2 1/5] xentrace: add TRC_HVM_PI_LIST_ADD Chao Gao
2017-05-15  1:33   ` Tian, Kevin
2017-05-15  8:57     ` Chao Gao
2017-05-15 15:14   ` George Dunlap
2017-05-11  6:04 ` [PATCH v2 2/5] vcpu: track hvm vcpu number on the system Chao Gao
2017-05-11 11:35   ` Wei Liu
2017-05-11 11:37     ` Wei Liu
2017-05-12  8:23       ` Chao Gao
2017-05-11  6:04 ` [PATCH v2 3/5] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
2017-05-15  5:24   ` Tian, Kevin
2017-05-15  9:27     ` Chao Gao
2017-05-15 15:48   ` George Dunlap
2017-05-15 16:13     ` Chao Gao
2017-05-11  6:04 ` [PATCH v2 4/5] VT-d PI: Adding reference count to pi_desc Chao Gao
2017-05-15 14:42   ` George Dunlap
2017-05-11  6:04 ` [PATCH v2 5/5] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
