* [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long
@ 2017-07-07  6:48 Chao Gao
  2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-07  6:48 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

Changes in v4:
 - In Patch 3, use a new lock to avoid adding a blocked vcpu to an
 offline pcpu's blocking list.
 - divide Patch 1 in v3 into a trace part (Patch 4) and a non-trace
 part (Patch 1)
 - move the place we increase/decrease the hvm vcpu number to
 hvm_vcpu_{initialise, destroy}

VT-d PI introduces a per-pCPU blocking list to track the vCPUs blocked
on a given pCPU. Theoretically, there can be 32K domains on a single
host, with 128 vCPUs per domain. If all vCPUs are blocked on the same
pCPU, 4M vCPUs end up on the same list, and traversing this list
consumes too much time. More discussion can be found in [1,2,3].

To mitigate this issue, this series puts vcpus on another pcpu's list
when the local pcpu's list length reaches an upper bound, which is the
average number of vcpus per pcpu plus a constant.
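
For a back-of-the-envelope feel for the bound, here is a stand-alone
sketch using the cover letter's numbers (pi_list_limit() is a
hypothetical helper for illustration; the real computation is the
PI_LIST_LIMIT macro in PATCH 3/4):

    #include <stdio.h>

    /* K, the fixed headroom chosen in PATCH 3/4. */
    #define PI_LIST_FIXED_NUM 128

    /* Upper bound: average vcpus-per-pcpu ratio plus a constant. */
    static unsigned int pi_list_limit(unsigned int hvm_vcpus,
                                      unsigned int online_pcpus)
    {
        return hvm_vcpus / online_pcpus + PI_LIST_FIXED_NUM;
    }

    int main(void)
    {
        /* Worst case above: 32K domains x 128 vCPUs on a 112-pcpu host. */
        printf("limit = %u\n", pi_list_limit(32768u * 128u, 112u));
        return 0;
    }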

PATCH 1/4 adds a counter to track the per-pCPU blocking list's length,
so it can be compared with the upper bound every time an entry is added
to the list.

PATCH 2/4 uses a global variable to track how many hvm vcpus are on the
system. It is used to calculate the limit on the number of blocked
vcpus on a given pcpu, namely the upper bound.

PATCH 3/4 employs a policy to restrict the vcpu count on a given
pcpu's pi blocking list in case the list grows too long. In a word,
if the list length is smaller than the upper bound, the vcpu is added
to the pi blocking list of the pcpu it is running on. Otherwise,
another online pcpu is chosen to accept the vcpu.

PATCH 4/4 adds some relevant events to xentrace to aid analysis and
obtain the list length. With this patch, data can be acquired to
validate PATCH 3/4.

[1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
[2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567
[3] https://lists.gt.net/xen/devel/472749?search_string=enable%20vt-d%20pi%20by%20default;#472749


Chao Gao (4):
  VT-d PI: track the vcpu number on pi blocking list
  x86/vcpu: track hvm vcpu number on the system
  VT-d PI: restrict the vcpu number on a given pcpu
  Xentrace: add support for HVM's PI blocking list operation

 tools/xentrace/formats          |   3 +
 tools/xentrace/xenalyze.c       | 154 +++++++++++++++++++++++++++++++++---
 xen/arch/x86/hvm/hvm.c          |   6 ++
 xen/arch/x86/hvm/vmx/vmx.c      | 169 +++++++++++++++++++++++++++++++++-------
 xen/include/asm-x86/hvm/hvm.h   |   3 +
 xen/include/asm-x86/hvm/trace.h |   1 +
 xen/include/public/trace.h      |   5 ++
 7 files changed, 305 insertions(+), 36 deletions(-)

-- 
1.8.3.1



* [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list
  2017-07-07  6:48 [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
@ 2017-07-07  6:48 ` Chao Gao
  2017-07-07 15:41   ` Jan Beulich
  2017-07-21 15:04   ` George Dunlap
  2017-07-07  6:48 ` [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system Chao Gao
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-07  6:48 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, Chao Gao

This patch adds a field, counter, in struct vmx_pi_blocking_vcpu to track
how many entries are on the pi blocking list.
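
The unblock paths recover the enclosing per-cpu structure from the
stored lock pointer via container_of() so that the right counter can be
decremented. A stand-alone sketch of that pattern (struct blocking_list
is a hypothetical model, and the container_of() here mirrors the common
definition rather than Xen's exact macro):

    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct blocking_list {    /* model of struct vmx_pi_blocking_vcpu */
        int lock;             /* stand-in for spinlock_t */
        unsigned int counter;
    };

    int main(void)
    {
        struct blocking_list b = { .counter = 3 };
        int *lock = &b.lock;  /* what pi_blocking.lock would hold */

        /* Recover the enclosing list from the lock pointer, decrement. */
        container_of(lock, struct blocking_list, lock)->counter--;
        printf("counter = %u\n", b.counter);   /* prints 2 */
        return 0;
    }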

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v4:
 - non-trace part of Patch 1 in v3

---
 xen/arch/x86/hvm/vmx/vmx.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 69ce3aa..ecd6485 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -83,6 +83,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    unsigned int         counter;
 };
 
 /*
@@ -120,6 +121,7 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    per_cpu(vmx_pi_blocking, v->processor).counter++;
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -187,6 +189,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        container_of(pi_blocking_list_lock,
+                     struct vmx_pi_blocking_vcpu, lock)->counter--;
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -235,6 +239,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
@@ -259,6 +264,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
 
             list_move(&vmx->pi_blocking.list,
                       &per_cpu(vmx_pi_blocking, new_cpu).list);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
+            per_cpu(vmx_pi_blocking, new_cpu).counter++;
             vmx->pi_blocking.lock = new_lock;
 
             spin_unlock(new_lock);
@@ -2358,9 +2365,9 @@ static struct hvm_function_table __initdata vmx_function_table = {
 static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
 {
     struct arch_vmx_struct *vmx, *tmp;
-    spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
-    struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+    unsigned int cpu = smp_processor_id();
+    spinlock_t *lock = &per_cpu(vmx_pi_blocking, cpu).lock;
+    struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
@@ -2377,6 +2384,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
-- 
1.8.3.1



* [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system
  2017-07-07  6:48 [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
@ 2017-07-07  6:48 ` Chao Gao
  2017-07-07 15:42   ` Jan Beulich
  2017-07-07  6:48 ` [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
  2017-07-07  6:49 ` [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation Chao Gao
  3 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2017-07-07  6:48 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich, Chao Gao

This number is used to calculate the average vcpus per pcpu ratio.
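
A stand-alone sketch of the counting pattern using C11 atomics
(illustrative only: Xen uses its own atomic_t with atomic_inc() and
atomic_dec(), and the *_model() names below are hypothetical):

    #include <stdatomic.h>
    #include <stdio.h>

    /* Global count of HVM vCPUs, updated without a lock. */
    static atomic_uint num_hvm_vcpus;

    static void hvm_vcpu_initialise_model(void)
    {
        atomic_fetch_add_explicit(&num_hvm_vcpus, 1, memory_order_relaxed);
    }

    static void hvm_vcpu_destroy_model(void)
    {
        atomic_fetch_sub_explicit(&num_hvm_vcpus, 1, memory_order_relaxed);
    }

    int main(void)
    {
        hvm_vcpu_initialise_model();
        hvm_vcpu_initialise_model();
        hvm_vcpu_destroy_model();
        printf("hvm vcpus: %u\n",
               atomic_load_explicit(&num_hvm_vcpus, memory_order_relaxed));
        return 0;
    }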

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v4:
 - move the place we increase/decrease the hvm vcpu number to
 hvm_vcpu_{initialise, destroy}

---
 xen/arch/x86/hvm/hvm.c        | 6 ++++++
 xen/include/asm-x86/hvm/hvm.h | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 3ed6ec4..6a510b3 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -109,6 +109,9 @@ static const char __initconst warning_hvm_fep[] =
 static bool_t __initdata opt_altp2m_enabled = 0;
 boolean_param("altp2m", opt_altp2m_enabled);
 
+/* Total number of HVM vCPUs on this system */
+atomic_t num_hvm_vcpus;
+
 static int cpu_callback(
     struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
@@ -1512,6 +1515,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     hvm_update_guest_vendor(v);
 
+    atomic_inc(&num_hvm_vcpus);
     return 0;
 
  fail6:
@@ -1530,6 +1534,8 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
 void hvm_vcpu_destroy(struct vcpu *v)
 {
+    atomic_dec(&num_hvm_vcpus);
+
     viridian_vcpu_deinit(v);
 
     hvm_all_ioreq_servers_remove_vcpu(v->domain, v);
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index b687e03..c51bd9f 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -25,6 +25,7 @@
 #include <asm/hvm/asid.h>
 #include <public/domctl.h>
 #include <public/hvm/save.h>
+#include <xen/atomic.h>
 #include <xen/mm.h>
 
 #ifdef CONFIG_HVM_FEP
@@ -233,6 +234,8 @@ extern bool_t hvm_enabled;
 extern bool_t cpu_has_lmsl;
 extern s8 hvm_port80_allowed;
 
+extern atomic_t num_hvm_vcpus;
+
 extern const struct hvm_function_table *start_svm(void);
 extern const struct hvm_function_table *start_vmx(void);
 
-- 
1.8.3.1



* [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-07  6:48 [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
  2017-07-07  6:48 ` [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-07-07  6:48 ` Chao Gao
  2017-07-07 15:57   ` Jan Beulich
  2017-07-21 15:43   ` George Dunlap
  2017-07-07  6:49 ` [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation Chao Gao
  3 siblings, 2 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-07  6:48 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, Chao Gao

Currently, a blocked vCPU is put on its pCPU's pi blocking list. If
too many vCPUs are blocked on a given pCPU, the list can grow too
long. A simple analysis shows that with 32k domains and 128 vcpus
per domain, about 4M vCPUs may be blocked in one pCPU's PI blocking
list. When a wakeup interrupt arrives, the list is traversed to find
the specific vCPUs to wake up, and in that case the traversal would
consume much time.

To mitigate this issue, this patch limits the number of vCPUs tracked on a
given pCPU's blocking list, taking factors such as performance of the
common case and the current hvm vCPU and pCPU counts into consideration.
With this method, the common case stays fast and, in the extreme cases,
the list length is kept under control.

With this patch, when a vcpu is about to block, we check whether the pi
blocking list of the pcpu the vcpu is running on is longer than the limit,
which is the average number of vcpus per pcpu plus a constant. If not, the
vcpu is added to this pcpu's pi blocking list. Otherwise, another online
pcpu is chosen to accept the vcpu.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v4:
 - use a new lock to avoid adding a blocked vcpu to an offline pcpu's blocking
 list.

---
 xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 114 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index ecd6485..04e9aa6 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
 uint8_t __read_mostly posted_intr_vector;
 static uint8_t __read_mostly pi_wakeup_vector;
 
+/*
+ * Protect critical sections to avoid adding a blocked vcpu to a destroyed
+ * blocking list.
+ */
+static DEFINE_SPINLOCK(remote_pbl_operation);
+
+#define remote_pbl_operation_begin(flags)                   \
+({                                                          \
+    spin_lock_irqsave(&remote_pbl_operation, flags);        \
+})
+
+#define remote_pbl_operation_done(flags)                    \
+({                                                          \
+    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
+})
+
 void vmx_pi_per_cpu_init(unsigned int cpu)
 {
     INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+/*
+ * By default, the local pcpu (i.e. the one the vcpu is currently running on)
+ * is chosen as the destination of the wakeup interrupt. But if the number of
+ * vcpus on that pcpu exceeds a limit, other pcpus are tried until a suitable
+ * one is found.
+ *
+ * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
+ * v_tot is the total number of hvm vcpus on the system, p_tot is the total
+ * number of pcpus in the system, and K is a fixed number. An experiment on
+ * a skylake server with 112 cpus and 64G memory shows that the maximum time
+ * to wake up a vcpu from a 128-entry blocking list is about 22us, which is
+ * tolerable. So choose 128 as the fixed number K.
+ *
+ * This policy makes sure:
+ * 1) for common cases, the limit won't be reached and the local pcpu is used
+ * which is beneficial to performance (at least, avoid an IPI when unblocking
+ * vcpu).
+ * 2) for the worst case, the blocking list length scales with the vcpu count
+ * divided by the pcpu count.
+ */
+#define PI_LIST_FIXED_NUM 128
+#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
+                           PI_LIST_FIXED_NUM)
+static inline bool pi_over_limit(int cpu)
+{
+    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;
+}
+
 static void vmx_vcpu_block(struct vcpu *v)
 {
-    unsigned long flags;
-    unsigned int dest;
+    unsigned long flags[2];
+    unsigned int dest, pi_cpu;
     spinlock_t *old_lock;
-    spinlock_t *pi_blocking_list_lock =
-		&per_cpu(vmx_pi_blocking, v->processor).lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    spinlock_t *pi_blocking_list_lock;
+    bool in_remote_operation = false;
+
+    pi_cpu = v->processor;
+
+    if ( unlikely(pi_over_limit(pi_cpu)) )
+    {
+        remote_pbl_operation_begin(flags[0]);
+        in_remote_operation = true;
+        while (true)
+        {
+            /*
+             * An online pCPU's blocking list is always usable, for it is
+             * only destroyed after its bit in cpu_online_map is cleared.
+             */
+            pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
+            pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
+            spin_lock_irqsave(pi_blocking_list_lock, flags[1]);
+            if ( !pi_over_limit(pi_cpu) )
+                break;
+            spin_unlock_irqrestore(pi_blocking_list_lock, flags[1]);
+        }
+    }
+    else
+    {
+        pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
+        spin_lock_irqsave(pi_blocking_list_lock, flags[1]);
+    }
 
-    spin_lock_irqsave(pi_blocking_list_lock, flags);
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
 
@@ -123,8 +192,10 @@ static void vmx_vcpu_block(struct vcpu *v)
 
     per_cpu(vmx_pi_blocking, v->processor).counter++;
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
-                  &per_cpu(vmx_pi_blocking, v->processor).list);
-    spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+                  &per_cpu(vmx_pi_blocking, pi_cpu).list);
+    spin_unlock_irqrestore(pi_blocking_list_lock, flags[1]);
+    if ( unlikely(in_remote_operation) )
+        remote_pbl_operation_done(flags[0]);
 
     ASSERT(!pi_test_sn(pi_desc));
 
@@ -132,6 +203,19 @@ static void vmx_vcpu_block(struct vcpu *v)
 
     ASSERT(pi_desc->ndst ==
            (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
+    if ( unlikely(pi_cpu != v->processor) )
+    {
+        /*
+         * The vcpu is put on another pCPU's blocking list. Change the
+         * 'NDST' field to that pCPU to make sure it can wake up the vcpu
+         * when an interrupt arrives. The 'NDST' field will be set back to
+         * the pCPU the vcpu is running on during VM-Entry; see
+         * vmx_pi_unblock_vcpu().
+         */
+        dest = cpu_physical_id(pi_cpu);
+        write_atomic(&pi_desc->ndst,
+                  x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+    }
 
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
 }
@@ -162,13 +246,17 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned int dest = cpu_physical_id(v->processor);
 
     /*
-     * Set 'NV' field back to posted_intr_vector, so the
-     * Posted-Interrupts can be delivered to the vCPU when
-     * it is running in non-root mode.
+     * Set 'NV' field back to posted_intr_vector and 'NDST' field to the pCPU
+     * where the vcpu is running (for this field may now point to another
+     * pCPU), so the Posted-Interrupts can be delivered to the vCPU when it
+     * is running in non-root mode.
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
+    write_atomic(&pi_desc->ndst,
+                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
 
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
@@ -215,13 +303,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
     if ( !iommu_intpost )
         return;
 
-    /*
-     * We are in the context of CPU_DEAD or CPU_UP_CANCELED notification,
-     * and it is impossible for a second CPU go down in parallel. So we
-     * can safely acquire the old cpu's lock and then acquire the new_cpu's
-     * lock after that.
-     */
-    spin_lock_irqsave(old_lock, flags);
+    remote_pbl_operation_begin(flags);
+    spin_lock(old_lock);
 
     list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
     {
@@ -245,16 +328,24 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         }
         else
         {
+            new_cpu = cpu;
             /*
              * We need to find an online cpu as the NDST of the PI descriptor, it
              * doesn't matter whether it is within the cpupool of the domain or
              * not. As long as it is online, the vCPU will be woken up once the
-             * notification event arrives.
+             * notification event arrives. Use a while-loop to find a target
+             * pCPU whose PI blocking list length isn't over the limit.
              */
-            new_cpu = cpumask_any(&cpu_online_map);
-            new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
+            while (true)
+            {
+                new_cpu = cpumask_cycle(new_cpu, &cpu_online_map);
+                new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
 
-            spin_lock(new_lock);
+                spin_lock(new_lock);
+                if ( !pi_over_limit(new_cpu) )
+                    break;
+                spin_unlock(new_lock);
+            }
 
             ASSERT(vmx->pi_blocking.lock == old_lock);
 
@@ -274,7 +365,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         pi_clear_sn(&vmx->pi_desc);
     }
 
-    spin_unlock_irqrestore(old_lock, flags);
+    spin_unlock(old_lock);
+    remote_pbl_operation_done(flags);
 }
 
 /*
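
As a stand-alone model of the pcpu-selection loop in vmx_vcpu_block()
(illustrative only: locking and the per-cpu lists are stubbed out with
plain arrays, and cpu_cycle() is a hypothetical stand-in for Xen's
cpumask_cycle() over cpu_online_map):

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 8
    #define LIMIT   2   /* stand-in for PI_LIST_LIMIT */

    /* Per-cpu blocking-list lengths; cpu 3 is the first under LIMIT. */
    static unsigned int counter[NR_CPUS] = { 5, 4, 3, 1, 3, 3, 3, 3 };

    static bool pi_over_limit(unsigned int cpu)
    {
        return counter[cpu] > LIMIT;
    }

    /* Next "online" cpu after @cpu, wrapping (all cpus online here). */
    static unsigned int cpu_cycle(unsigned int cpu)
    {
        return (cpu + 1) % NR_CPUS;
    }

    int main(void)
    {
        unsigned int pi_cpu = 0;   /* v->processor */

        if ( pi_over_limit(pi_cpu) )
        {
            /* The patch takes remote_pbl_operation and the candidate
             * list's lock before re-checking the limit; elided here. */
            do
                pi_cpu = cpu_cycle(pi_cpu);
            while ( pi_over_limit(pi_cpu) );
        }

        counter[pi_cpu]++;   /* list_add_tail() in the real code */
        printf("blocked vcpu goes to pcpu %u\n", pi_cpu);   /* pcpu 3 */
        return 0;
    }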
-- 
1.8.3.1



* [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation
  2017-07-07  6:48 [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (2 preceding siblings ...)
  2017-07-07  6:48 ` [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
@ 2017-07-07  6:49 ` Chao Gao
  2017-07-07 15:37   ` Jan Beulich
  2017-07-21 16:26   ` George Dunlap
  3 siblings, 2 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-07  6:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

In order to analyze the PI blocking list operation frequency and obtain
the list length, add some relevant events to xentrace and some
associated code in xenalyze. The ASYNC_PI_LIST_DEL event may happen in
interrupt context, which means the current assumptions checked in
toplevel_assert_check() are no longer suitable. Thus, this patch extends
toplevel_assert_check() to drop such assumptions for events of type
ASYNC_PI_LIST_DEL.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v4:
 - trace part of Patch 1 in v3

---
 tools/xentrace/formats          |   3 +
 tools/xentrace/xenalyze.c       | 154 +++++++++++++++++++++++++++++++++++++---
 xen/arch/x86/hvm/vmx/vmx.c      |  21 +++++-
 xen/include/asm-x86/hvm/trace.h |   1 +
 xen/include/public/trace.h      |   5 ++
 5 files changed, 172 insertions(+), 12 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..54e0b11 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,9 @@
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_LIST_ADD [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
+
+0x00088001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  ASYNC_PI_LIST_DEL [ domid = 0x%(1)04x vcpu = 0x%(2)04x ]
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index fa608ad..fbc2429 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -296,6 +296,7 @@ struct symbol_struct {
 };
 
 void error(enum error_level l, struct record_info *ri);
+struct vcpu_data * vcpu_find(int did, int vid);
 
 void parse_symbol_file(char *fn) {
     unsigned long long last_addr = 0;
@@ -944,6 +945,7 @@ enum {
     HVM_EVENT_TRAP,
     HVM_EVENT_TRAP_DEBUG,
     HVM_EVENT_VLAPIC,
+    HVM_EVENT_PI_LIST_ADD,
     HVM_EVENT_HANDLER_MAX
 };
 char * hvm_event_handler_name[HVM_EVENT_HANDLER_MAX] = {
@@ -979,13 +981,15 @@ char * hvm_event_handler_name[HVM_EVENT_HANDLER_MAX] = {
     "realmode_emulate",
     "trap",
     "trap_debug",
-    "vlapic"
+    "vlapic",
+    "pi_list_add",
 };
 
 enum {
     HVM_VOL_VMENTRY,
     HVM_VOL_VMEXIT,
     HVM_VOL_HANDLER,
+    HVM_VOL_ASYNC,
     HVM_VOL_MAX
 };
 
@@ -1012,6 +1016,7 @@ char *hvm_vol_name[HVM_VOL_MAX] = {
     [HVM_VOL_VMENTRY]="vmentry",
     [HVM_VOL_VMEXIT] ="vmexit",
     [HVM_VOL_HANDLER]="handler",
+    [HVM_VOL_ASYNC]="async",
 };
 
 enum {
@@ -1337,6 +1342,9 @@ struct hvm_data {
         struct {
             struct io_address *mmio, *pio;
         } io;
+        struct {
+            int pi_list_add, pi_list_del;
+        } pi;
     } summary;
 
     /* In-flight accumulation information */
@@ -1391,6 +1399,9 @@ struct hvm_data {
 
     /* Historical info */
     tsc_t last_rdtsc;
+
+    /* Destination pcpu of posted interrupt's wakeup interrupt */
+    int pi_cpu;
 };
 
 enum {
@@ -1457,6 +1468,8 @@ void init_hvm_data(struct hvm_data *h, struct vcpu_data *v) {
     }
     for(i=0; i<GUEST_INTERRUPT_MAX+1; i++)
         h->summary.guest_interrupt[i].count=0;
+
+    h->pi_cpu = -1;
 }
 
 /* PV data */
@@ -1766,6 +1779,14 @@ char * toplevel_name[TOPLEVEL_MAX] = {
     [TOPLEVEL_HW]="hw",
 };
 
+enum {
+    SUBLEVEL_HVM_ENTRYEXIT=1,
+    SUBLEVEL_HVM_HANDLER,
+    SUBLEVEL_HVM_EMUL,
+    SUBLEVEL_HVM_ASYNC,
+    SUBLEVEL_HVM_MAX=SUBLEVEL_HVM_ASYNC+1,
+};
+
 struct trace_volume {
     unsigned long long toplevel[TOPLEVEL_MAX];
     unsigned long long sched_verbose;
@@ -1852,6 +1873,9 @@ struct pcpu_info {
         tsc_t tsc;
         struct cycle_summary idle, running, lost;
     } time;
+
+    /* Posted Interrupt List Length */
+    int pi_list_length;
 };
 
 void __fill_in_record_info(struct pcpu_info *p);
@@ -4726,6 +4750,71 @@ void hvm_generic_dump(struct record_info *ri, char * prefix)
     printf(" ]\n");
 }
 
+void hvm_pi_list_add_process(struct record_info *ri, struct hvm_data *h)
+{
+    struct {
+        int did;
+        int vid;
+        int pcpu;
+        int len;
+    } *data;
+    struct vcpu_data *v;
+
+    data = (typeof(data))ri->rec.u.tsc.data;
+    v = vcpu_find(data->did, data->vid);
+    if ( !v->hvm.init )
+        init_hvm_data(&v->hvm, v);
+
+    if ( opt.dump_all )
+        printf("d%uv%u is added to pi blocking list of pcpu%u. "
+               "The list length is now %d\n",
+               data->did, data->vid, data->pcpu, data->len);
+
+    v->hvm.pi_cpu = data->pcpu;
+    v->hvm.summary.pi.pi_list_add++;
+    if ( data->pcpu > P.max_active_pcpu || !P.pcpu[data->pcpu].active )
+        fprintf(stderr, "Strange! pcpu%u is inactive but a vcpu is added "
+                "to it\n", data->pcpu);
+    else if ( P.pcpu[data->pcpu].pi_list_length == -1 )
+        P.pcpu[data->pcpu].pi_list_length = data->len;
+    else if ( data->len != ++P.pcpu[data->pcpu].pi_list_length )
+        /*
+         * Correct the pi list length. Removing a vcpu that was already on
+         * the list before tracing started would not decrease the tracked
+         * length, so the tracked length would be inaccurate.
+         */
+        P.pcpu[data->pcpu].pi_list_length = data->len;
+}
+
+void hvm_pi_list_del_process(struct record_info *ri)
+{
+    struct {
+        int did;
+        int vid;
+    } *data;
+    struct vcpu_data *v;
+
+    data = (typeof(data))ri->rec.u.tsc.data;
+    v = vcpu_find(data->did, data->vid);
+    if ( !v->hvm.init )
+        init_hvm_data(&v->hvm, v);
+
+    if ( opt.dump_all )
+    {
+        if ( v->hvm.pi_cpu != -1 )
+            printf("d%uv%u is removed from pi blocking list of pcpu%u\n",
+                   data->did, data->vid, v->hvm.pi_cpu);
+        else
+            printf("d%uv%u is removed from pi blocking list\n",
+                   data->did, data->vid);
+    }
+
+    if ( v->hvm.pi_cpu != -1 && P.pcpu[v->hvm.pi_cpu].pi_list_length != -1 )
+        P.pcpu[v->hvm.pi_cpu].pi_list_length--;
+    v->hvm.pi_cpu = -1;
+    v->hvm.summary.pi.pi_list_del++;
+}
+
 void hvm_handler_process(struct record_info *ri, struct hvm_data *h) {
     /* Wait for first vmexit to initialize */
     if(!h->init)
@@ -4763,6 +4852,9 @@ void hvm_handler_process(struct record_info *ri, struct hvm_data *h) {
     case TRC_HVM_INTR_WINDOW:
         hvm_intr_window_process(ri, h);
         break;
+    case TRC_HVM_PI_LIST_ADD:
+        hvm_pi_list_add_process(ri, h);
+        break;
     case TRC_HVM_OP_DESTROY_PROC:
         if(h->v->cr3.data) {
             struct cr3_value_struct *cur = h->v->cr3.data;
@@ -4862,7 +4954,6 @@ needs_vmexit:
 void vcpu_next_update(struct pcpu_info *p, struct vcpu_data *next, tsc_t tsc);
 void vcpu_prev_update(struct pcpu_info *p, struct vcpu_data *prev,
                       tsc_t tsc, int new_runstate);
-struct vcpu_data * vcpu_find(int did, int vid);
 void lose_vcpu(struct vcpu_data *v, tsc_t tsc);
 
 int domain_runstate(struct domain_data *d) {
@@ -5267,6 +5358,21 @@ void hvm_process(struct pcpu_info *p)
     struct vcpu_data *v = p->current;
     struct hvm_data *h = &v->hvm;
 
+    if(ri->evt.sub == 8)
+    {
+        UPDATE_VOLUME(p, hvm[HVM_VOL_ASYNC], ri->size);
+
+        switch(ri->event) {
+        case TRC_HVM_ASYNC_PI_LIST_DEL:
+            hvm_pi_list_del_process(ri);
+            break;
+
+        default:
+            fprintf(warn, "Unknown hvm event: %x\n", ri->event);
+        }
+        return;
+    }
+
     assert(p->current);
 
     if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
@@ -5359,6 +5465,10 @@ void hvm_summary(struct hvm_data *h) {
                   i, h->summary.ipi_count[i]);
    hvm_io_address_summary(h->summary.io.pio, "IO address summary:");
    hvm_io_address_summary(h->summary.io.mmio, "MMIO address summary:");
+
+   printf("Posted Interrupt:\n");
+   printf(" List Add: %u\n", h->summary.pi.pi_list_add);
+   printf(" List Del: %u\n", h->summary.pi.pi_list_del);
 }
 
 /* ---- Shadow records ---- */
@@ -8962,6 +9072,7 @@ off_t scan_for_new_pcpu(off_t offset) {
 
         p->file_offset = offset;
         p->next_cpu_change_offset = offset;
+        p->pi_list_length = -1;
 
         record_order_insert(p);
 
@@ -9142,7 +9253,6 @@ int find_toplevel_event(struct record_info *ri)
     return toplevel;
 }
 
-
 void process_cpu_change(struct pcpu_info *p) {
     struct record_info *ri = &p->ri;
     struct cpu_change_data *r = (typeof(r))ri->d;
@@ -9190,6 +9300,7 @@ void process_cpu_change(struct pcpu_info *p) {
         p2->ri.d = p2->ri.rec.u.notsc.data;
         p2->file_offset = p->file_offset;
         p2->next_cpu_change_offset = p->file_offset;
+        p2->pi_list_length = -1;
 
         fprintf(warn, "%s: Activating pcpu %d at offset %lld\n",
                 __func__, r->cpu, (unsigned long long)p->file_offset);
@@ -9276,26 +9387,52 @@ void process_cpu_change(struct pcpu_info *p) {
     }
 }
 
-struct tl_assert_mask {
+struct assert_mask {
     unsigned p_current:1,
-        not_idle_domain:1;
+        not_idle_domain:1,
+        check_sublevel:1;
     int vcpu_data_mode;
+    int sub_max;
+    struct assert_mask *sub;
 };
-static struct tl_assert_mask tl_assert_checks[TOPLEVEL_MAX] = {
-    [TRC_HVM_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+
+static struct assert_mask sl_hvm_assert_mask[SUBLEVEL_HVM_MAX] = {
+    [SUBLEVEL_HVM_ENTRYEXIT] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+    [SUBLEVEL_HVM_HANDLER] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+    [SUBLEVEL_HVM_EMUL] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+};
+
+static struct assert_mask tl_assert_checks[TOPLEVEL_MAX] = {
+    [TRC_HVM_MAIN]={ .check_sublevel=1, .sub=sl_hvm_assert_mask, .sub_max=SUBLEVEL_HVM_MAX },
     [TRC_SHADOW_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
     [TRC_PV_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_PV },
 };
 
+/* Other sub types are reserved */
+static int sublevel_to_index[16] = {
+    [1] = 1,
+    [2] = 2,
+    [4] = 3,
+    [8] = 4,
+};
+
 /* There are a lot of common assumptions for the various processing
  * routines.  Check them all in one place, doing something else if
  * they don't pass. */
 int toplevel_assert_check(int toplevel, struct pcpu_info *p)
 {
-    struct tl_assert_mask mask;
+    struct assert_mask mask;
 
     mask = tl_assert_checks[toplevel];
 
+    if (mask.check_sublevel)
+    {
+        int sub = sublevel_to_index[p->ri.evt.sub];
+
+        assert(mask.sub && (sub < mask.sub_max));
+        mask = mask.sub[sub];
+    }
+
     if (mask.p_current && p->current == NULL)
     {
         fprintf(warn, "WARNING: p->current null!  Not processing\n");
@@ -9362,7 +9499,6 @@ void process_record(struct pcpu_info *p) {
     if(opt.dump_all)
         create_dump_header(ri, p);
 
-
     toplevel = find_toplevel_event(ri);
     if ( toplevel < 0 )
         return;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 04e9aa6..ccb1c8d 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -190,7 +190,9 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
-    per_cpu(vmx_pi_blocking, v->processor).counter++;
+    per_cpu(vmx_pi_blocking, pi_cpu).counter++;
+    TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, pi_cpu,
+             per_cpu(vmx_pi_blocking, pi_cpu).counter);
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, pi_cpu).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags[1]);
@@ -279,6 +281,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
         container_of(pi_blocking_list_lock,
                      struct vmx_pi_blocking_vcpu, lock)->counter--;
+        TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id, v->vcpu_id);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -308,6 +311,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
 
     list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
     {
+        struct vcpu *v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+
         /*
          * Suppress notification or we may miss an interrupt when the
          * target cpu is dying.
@@ -322,9 +327,11 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
             per_cpu(vmx_pi_blocking, cpu).counter--;
             vmx->pi_blocking.lock = NULL;
-            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+            vcpu_unblock(v);
         }
         else
         {
@@ -357,6 +364,10 @@ void vmx_pi_desc_fixup(unsigned int cpu)
                       &per_cpu(vmx_pi_blocking, new_cpu).list);
             per_cpu(vmx_pi_blocking, cpu).counter--;
             per_cpu(vmx_pi_blocking, new_cpu).counter++;
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
+            TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id,
+                     new_cpu, per_cpu(vmx_pi_blocking, new_cpu).counter);
             vmx->pi_blocking.lock = new_lock;
 
             spin_unlock(new_lock);
@@ -2475,11 +2486,15 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     {
         if ( pi_test_on(&vmx->pi_desc) )
         {
+            struct vcpu *v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+
             list_del(&vmx->pi_blocking.list);
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
             per_cpu(vmx_pi_blocking, cpu).counter--;
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
-            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+            vcpu_unblock(v);
         }
     }
 
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..b74ffdd 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,7 @@
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_LIST_ADD      DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..15ea87c 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -53,6 +53,7 @@
 #define TRC_HVM_ENTRYEXIT   0x00081000   /* VMENTRY and #VMEXIT       */
 #define TRC_HVM_HANDLER     0x00082000   /* various HVM handlers      */
 #define TRC_HVM_EMUL        0x00084000   /* emulated devices */
+#define TRC_HVM_ASYNC       0x00088000   /* Asynchronous events */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
 #define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
@@ -234,6 +235,7 @@
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_PI_LIST_ADD      (TRC_HVM_HANDLER + 0x26)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
@@ -257,6 +259,9 @@
 #define TRC_HVM_EMUL_PIC_PEND_IRQ_CALL (TRC_HVM_EMUL + 0x10)
 #define TRC_HVM_EMUL_LAPIC_PIC_INTR    (TRC_HVM_EMUL + 0x11)
 
+/* Trace asynchronous events for HVM */
+#define TRC_HVM_ASYNC_PI_LIST_DEL      (TRC_HVM_ASYNC + 0x1)
+
 /* trace events for per class */
 #define TRC_PM_FREQ_CHANGE      (TRC_HW_PM + 0x01)
 #define TRC_PM_IDLE_ENTRY       (TRC_HW_PM + 0x02)
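
For reference, a small stand-alone program showing how the event ids
above split into the class/subclass/minor fields that xenalyze keys off
(the bit layout is inferred from the values above and the
ri->evt.sub == 8 check in hvm_process(); decode() is a hypothetical
helper):

    #include <stdio.h>

    /* Values from xen/include/public/trace.h as extended by this patch. */
    #define TRC_HVM_HANDLER           0x00082000u
    #define TRC_HVM_ASYNC             0x00088000u
    #define TRC_HVM_PI_LIST_ADD       (TRC_HVM_HANDLER + 0x26)
    #define TRC_HVM_ASYNC_PI_LIST_DEL (TRC_HVM_ASYNC + 0x1)

    static void decode(unsigned int evt)
    {
        /* Main class in bits 31-16 (0x8 == HVM), subclass in bits 15-12
         * (8 == the new ASYNC group), minor event in bits 11-0. */
        printf("0x%08x: main=0x%x sub=0x%x minor=0x%x\n",
               evt, evt >> 16, (evt >> 12) & 0xf, evt & 0xfff);
    }

    int main(void)
    {
        decode(TRC_HVM_PI_LIST_ADD);        /* sub=2 (handler) */
        decode(TRC_HVM_ASYNC_PI_LIST_DEL);  /* sub=8 (async)   */
        return 0;
    }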
-- 
1.8.3.1



* Re: [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation
  2017-07-07  6:49 ` [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation Chao Gao
@ 2017-07-07 15:37   ` Jan Beulich
  2017-07-10  0:45     ` Chao Gao
  2017-07-21 16:26   ` George Dunlap
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2017-07-07 15:37 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

>>> On 07.07.17 at 08:49, <chao.gao@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -190,7 +190,9 @@ static void vmx_vcpu_block(struct vcpu *v)
>       */
>      ASSERT(old_lock == NULL);
>  
> -    per_cpu(vmx_pi_blocking, v->processor).counter++;
> +    per_cpu(vmx_pi_blocking, pi_cpu).counter++;

Isn't this an unrelated change, which likely would belong in an
earlier patch (if it's really intended in the first place)?

Jan



* Re: [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list
  2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
@ 2017-07-07 15:41   ` Jan Beulich
  2017-07-10  0:50     ` Chao Gao
  2017-07-21 15:04   ` George Dunlap
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2017-07-07 15:41 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
> This patch adds a field, counter, in struct vmx_pi_blocking_vcpu to track
> how many entries are on the pi blocking list.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Looks okay now, but didn't you have ASSERT()s in place earlier on,
checking that the counter isn't zero before decrements or after
increments?

Jan



* Re: [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system
  2017-07-07  6:48 ` [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-07-07 15:42   ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2017-07-07 15:42 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, xen-devel

>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
> This number is used to calculate the average vcpus per pcpu ratio.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-07  6:48 ` [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
@ 2017-07-07 15:57   ` Jan Beulich
  2017-07-10  1:17     ` Chao Gao
  2017-07-21 15:43   ` George Dunlap
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2017-07-07 15:57 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, 
> vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>  
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);

What is "pbl" supposed to stand for?

> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})

No need for the ({ }) here.

But then I don't understand what this is needed for in the first
place. If this is once again about CPU offlining, then I can only
repeat that such happens in stop_machine context. Otherwise
I'm afraid the comment ahead of this code section needs
adjustment, as I can't interpret it in another way.

> +/*
> + * By default, the local pcpu (i.e. the one the vcpu is currently running on)
> + * is chosen as the destination of the wakeup interrupt. But if the number of
> + * vcpus on that pcpu exceeds a limit, other pcpus are tried until a suitable
> + * one is found.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experiment on
> + * a skylake server with 112 cpus and 64G memory shows that the maximum time
> + * to wake up a vcpu from a 128-entry blocking list is about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)

unsigned int

> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;
> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];

??? You almost never need two flags instances in a function.

> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -		&per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {
> +        remote_pbl_operation_begin(flags[0]);
> +        in_remote_operation = true;
> +        while (true)

Coding style (and "for ( ; ; )" is generally better anyway).

Jan



* Re: [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation
  2017-07-07 15:37   ` Jan Beulich
@ 2017-07-10  0:45     ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-10  0:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

On Fri, Jul 07, 2017 at 09:37:04AM -0600, Jan Beulich wrote:
>>>> On 07.07.17 at 08:49, <chao.gao@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -190,7 +190,9 @@ static void vmx_vcpu_block(struct vcpu *v)
>>       */
>>      ASSERT(old_lock == NULL);
>>  
>> -    per_cpu(vmx_pi_blocking, v->processor).counter++;
>> +    per_cpu(vmx_pi_blocking, pi_cpu).counter++;
>
>Isn't this an unrelated change, which likely would belong in an
>earlier patch (if it's really intended in the first place)?

You are right. This change should be in patch 3/4.

Thanks
Chao


* Re: [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list
  2017-07-07 15:41   ` Jan Beulich
@ 2017-07-10  0:50     ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-10  0:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Fri, Jul 07, 2017 at 09:41:18AM -0600, Jan Beulich wrote:
>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>> This patch adds a field, counter, in struct vmx_pi_blocking_vcpu to track
>> how many entries are on the pi blocking list.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>
>Looks okay now, but didn't you have ASSERT()s in place earlier on,
>checking that the counter isn't zero before decrements or after
>increments?

No, I didn't. But I obtained the list length from xentrace and
the counter was always equal to or greater than 0. I will check
the counter in the next version.

Thanks
Chao


* Re: [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-07 15:57   ` Jan Beulich
@ 2017-07-10  1:17     ` Chao Gao
  2017-07-10  9:36       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Gao @ 2017-07-10  1:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, 
>> vmx_pi_blocking);
>>  uint8_t __read_mostly posted_intr_vector;
>>  static uint8_t __read_mostly pi_wakeup_vector;
>>  
>> +/*
>> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
>> + * blocking list.
>> + */
>> +static DEFINE_SPINLOCK(remote_pbl_operation);
>
>What is "pbl" supposed to stand for?

pi blocking list.

>
>> +#define remote_pbl_operation_begin(flags)                   \
>> +({                                                          \
>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>> +})
>> +
>> +#define remote_pbl_operation_done(flags)                    \
>> +({                                                          \
>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>> +})
>
>No need for the ({ }) here.
>
>But then I don't understand what this is needed for in the first
>place. If this is once again about CPU offlining, then I can only
>repeat that such happens in stop_machine context. Otherwise

But I don't think vmx_pi_desc_fixup() happens in stop_machine context;
please refer to the cpu_callback() function in hvm.c and the point where
notifier_call_chain(CPU_DEAD) is called in cpu_down().

Our goal here is to avoid adding an entry to a destroyed list.
To avoid destruction happening during adding, we can put these two
processes in critical sections, like

add:
	remote_pbl_operation_begin()
	add one entry to the list
	remote_pbl_operation_end()

destroy:
	remote_pbl_operation_begin()
	destruction
	remote_pbl_operation_end()

Destruction may happen before we enter the critical section.
So adding should be:

add:
	remote_pbl_operation_begin()
	check the list is still valid
	add one entry to the list
	remote_pbl_operation_end()

In this patch, we choose an online cpu's list. The list should be valid,
for a list is only destroyed after its cpu has gone offline.
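
A stand-alone sketch of that protocol (hypothetical: a pthread mutex
stands in for the remote_pbl_operation spinlock, and a bool stands in
for list validity):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t remote_pbl_operation = PTHREAD_MUTEX_INITIALIZER;

    static bool list_valid = true;   /* cleared when the pcpu goes offline */
    static unsigned int entries;

    static void add_entry(void)
    {
        pthread_mutex_lock(&remote_pbl_operation);
        if ( list_valid )            /* re-check: destroy may have run */
            entries++;
        pthread_mutex_unlock(&remote_pbl_operation);
    }

    static void destroy_list(void)
    {
        pthread_mutex_lock(&remote_pbl_operation);
        list_valid = false;
        entries = 0;
        pthread_mutex_unlock(&remote_pbl_operation);
    }

    int main(void)
    {
        add_entry();
        destroy_list();
        add_entry();                 /* harmless: validity re-checked */
        printf("entries = %u\n", entries);
        return 0;
    }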

>I'm afraid the comment ahead of this code section needs
>adjustment, as I can't interpret it in another way.

Thanks
Chao


* Re: [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-10  1:17     ` Chao Gao
@ 2017-07-10  9:36       ` Jan Beulich
  2017-07-10 11:42         ` Chao Gao
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2017-07-10  9:36 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 10.07.17 at 03:17, <chao.gao@intel.com> wrote:
> On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>>> +#define remote_pbl_operation_begin(flags)                   \
>>> +({                                                          \
>>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>>> +})
>>> +
>>> +#define remote_pbl_operation_done(flags)                    \
>>> +({                                                          \
>>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>>> +})
>>
>>No need for the ({ }) here.
>>
>>But then I don't understand what this is needed for in the first
>>place. If this is once again about CPU offlining, then I can only
>>repeat that such happens in stop_machine context. Otherwise
> 
> But I don't think vmx_pi_desc_fixup() happens in stop_machine context;
> please refer to the cpu_callback() function in hvm.c and the point where
> notifier_call_chain(CPU_DEAD) is called in cpu_down().

While that's true, the CPU at that point is no longer marked
online, so it shouldn't be a candidate anyway.

> Our goal here is to avoid adding an entry to a destroyed list.
> To avoid destruction happening during adding, we can put these two
> processes in critical sections, like
> 
> add:
> 	remote_pbl_operation_begin()
> 	add one entry to the list
> 	remote_pbl_operation_end()
> 
> destroy:
> 	remote_pbl_operation_begin()
> 	destruction
> 	remote_pbl_operation_end()
> 
> Destruction may happen before we enter the critical section.

I don't think so, no: Xen is not preemptible, and stop-machine logic
involves scheduling a tasklet on each pCPU and waiting for it to
gain control. So as long as you don't "manually" force tasklets to
be run, I still don't see the need for this extra locking.

Jan



* Re: [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-10  9:36       ` Jan Beulich
@ 2017-07-10 11:42         ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-10 11:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Mon, Jul 10, 2017 at 03:36:52AM -0600, Jan Beulich wrote:
>>>> On 10.07.17 at 03:17, <chao.gao@intel.com> wrote:
>> On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>>>> +#define remote_pbl_operation_begin(flags)                   \
>>>> +({                                                          \
>>>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>>>> +})
>>>> +
>>>> +#define remote_pbl_operation_done(flags)                    \
>>>> +({                                                          \
>>>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>>>> +})
>>>
>>>No need for the ({ }) here.
>>>
>>>But then I don't understand what this is needed for in the first
>>>place. If this is once again about CPU offlining, then I can only
>>>repeat that such happens in stop_machine context. Otherwise
>> 
>> But I don't think vmx_pi_desc_fixup() happens in stop_machine context;
>> please refer to the cpu_callback() function in hvm.c and the point where
>> notifier_call_chain(CPU_DEAD) is called in cpu_down().
>
>While that's true, the CPU at that point is no longer marked
>online, so it shouldn't be a candidate anyway.
>
>> Our goal here is to avoid adding an entry to a destroyed list.
>> To avoid destruction happening during adding, we can put these two
>> processes in critical sections, like
>> 
>> add:
>> 	remote_pbl_operation_begin()
>> 	add one entry to the list
>> 	remote_pbl_operation_end()
>> 
>> destroy:
>> 	remote_pbl_operation_begin()
>> 	destruction
>> 	remote_pbl_operation_end()
>> 
>> Destruction may happen before we enter the critical section.
>
>I don't think so, no: Xen is not preemptible, and stop-machine logic
>involves scheduling a tasklet on each pCPU and waiting for it to
>gain control. So as long as you don't "manually" force tasklets to
>be run, I still don't see the need for this extra locking.

Impressive! Now I understand why you repeatedly mentioned stop_machine()
context to me.

Thanks
Chao


* Re: [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list
  2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
  2017-07-07 15:41   ` Jan Beulich
@ 2017-07-21 15:04   ` George Dunlap
  1 sibling, 0 replies; 18+ messages in thread
From: George Dunlap @ 2017-07-21 15:04 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, xen-devel

On Fri, Jul 7, 2017 at 7:48 AM, Chao Gao <chao.gao@intel.com> wrote:
> This patch adds a field, counter, in struct vmx_pi_blocking_vcpu to track
> how many entries are on the pi blocking list.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Minor nit:  The grammar in the title isn't quite right; "vcpu number"
would be "the number identifying a particular vcpu", not "the number
of vcpus".  It should be, "VT-d PI: Track the number of vcpus on pi
blocking list".

With that:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> v4:
>  - non-trace part of Patch 1 in v3
>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 69ce3aa..ecd6485 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -83,6 +83,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>  struct vmx_pi_blocking_vcpu {
>      struct list_head     list;
>      spinlock_t           lock;
> +    unsigned int         counter;
>  };
>
>  /*
> @@ -120,6 +121,7 @@ static void vmx_vcpu_block(struct vcpu *v)
>       */
>      ASSERT(old_lock == NULL);
>
> +    per_cpu(vmx_pi_blocking, v->processor).counter++;
>      list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
>                    &per_cpu(vmx_pi_blocking, v->processor).list);
>      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> @@ -187,6 +189,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>      {
>          ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
>          list_del(&v->arch.hvm_vmx.pi_blocking.list);
> +        container_of(pi_blocking_list_lock,
> +                     struct vmx_pi_blocking_vcpu, lock)->counter--;
>          v->arch.hvm_vmx.pi_blocking.lock = NULL;
>      }
>
> @@ -235,6 +239,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
>          if ( pi_test_on(&vmx->pi_desc) )
>          {
>              list_del(&vmx->pi_blocking.list);
> +            per_cpu(vmx_pi_blocking, cpu).counter--;
>              vmx->pi_blocking.lock = NULL;
>              vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
>          }
> @@ -259,6 +264,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
>
>              list_move(&vmx->pi_blocking.list,
>                        &per_cpu(vmx_pi_blocking, new_cpu).list);
> +            per_cpu(vmx_pi_blocking, cpu).counter--;
> +            per_cpu(vmx_pi_blocking, new_cpu).counter++;
>              vmx->pi_blocking.lock = new_lock;
>
>              spin_unlock(new_lock);
> @@ -2358,9 +2365,9 @@ static struct hvm_function_table __initdata vmx_function_table = {
>  static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>  {
>      struct arch_vmx_struct *vmx, *tmp;
> -    spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
> -    struct list_head *blocked_vcpus =
> -               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
> +    unsigned int cpu = smp_processor_id();
> +    spinlock_t *lock = &per_cpu(vmx_pi_blocking, cpu).lock;
> +    struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
>
>      ack_APIC_irq();
>      this_cpu(irq_count)++;
> @@ -2377,6 +2384,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>          if ( pi_test_on(&vmx->pi_desc) )
>          {
>              list_del(&vmx->pi_blocking.list);
> +            per_cpu(vmx_pi_blocking, cpu).counter--;
>              ASSERT(vmx->pi_blocking.lock == lock);
>              vmx->pi_blocking.lock = NULL;
>              vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
> --
> 1.8.3.1
>
>

* Re: [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu
  2017-07-07  6:48 ` [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
  2017-07-07 15:57   ` Jan Beulich
@ 2017-07-21 15:43   ` George Dunlap
  1 sibling, 0 replies; 18+ messages in thread
From: George Dunlap @ 2017-07-21 15:43 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, xen-devel

On Fri, Jul 7, 2017 at 7:48 AM, Chao Gao <chao.gao@intel.com> wrote:
> Currently, a blocked vCPU is put on its pCPU's pi blocking list. If
> too many vCPUs are blocked on a given pCPU, the list can grow too
> long. In the worst case, there can be 32K domains with 128 vCPUs per
> domain, so about 4M vCPUs may be blocked on one pCPU's PI blocking
> list. When a wakeup interrupt arrives, the list is traversed to find
> the specific vCPUs to wake up, and in that case the traversal would
> consume too much time.
>
> To mitigate this issue, this patch limits the number of vCPUs tracked on a
> given pCPU's blocking list, taking common-case performance, the current
> hvm vCPU count and the current pCPU count into consideration. With this
> method, the common case stays fast and, for extreme cases, the list
> length is kept under control.
>
> With this patch, when a vcpu is about to block, we check whether the
> length of the pi blocking list of the pcpu the vcpu is running on
> exceeds the limit, which is the average number of vcpus per pcpu plus
> a constant. If not, the vcpu is added to this pcpu's pi blocking list.
> Otherwise, another online pcpu is chosen to accept the vcpu.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> v4:
>  - use a new lock to avoid adding a blocked vcpu to a offline pcpu's blocking
>  list.
>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 114 insertions(+), 22 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index ecd6485..04e9aa6 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);
> +
> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})
> +
>  void vmx_pi_per_cpu_init(unsigned int cpu)
>  {
>      INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * By default, the local pcpu (meaning the one the vcpu is currently
> + * running on) is chosen as the destination of the wakeup interrupt. But
> + * if the number of vcpus on that pcpu exceeds a limit, other pcpus are
> + * tried until a suitable one is found.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit on the vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experiment on
> + * a Skylake server with 112 cpus and 64G memory shows that the maximum time
> + * to wake up a vcpu from a 128-entry blocking list is about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure that:
> + * 1) for common cases, the limit won't be reached and the local pcpu is
> + * used, which is beneficial to performance (at least, it avoids an IPI
> + * when unblocking a vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu
> + * count divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)
> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;

Is there any reason to hide this calculation behind a #define, when
it's only used once anyway?

Also -- the vast majority of the time, .counter will be <
PI_LIST_FIXED_NUM; there's no reason to do an atomic read and an
integer division in that case.  I would do this:

if ( likely(per_cpu(vmx_pi_blocking, cpu).counter <= PI_LIST_FIXED_NUM) )
    return 0;

return per_cpu(vmx_pi_blocking, cpu).counter >
       PI_LIST_FIXED_NUM + (atomic_read(&num_hvm_vcpus) / num_online_cpus());

Also, I personally think it would make the code more readable to say,
"pi_under_limit()" instead; that way...

> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {

...here you put the most common thing first, and the exceptional case
second (which I think makes the code easier to understand).

In fact, you might consider putting this whole thing in a function;
something like:

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    if ( pi_under_limit(pi_cpu) )
    {
        spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock, *flags);
        return pi_cpu;
    }

    /* Loop looking for a pCPU whose list is still under the limit */
}

Probably also worth mentioning briefly in a comment why this loop is
guaranteed to terminate.
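
For illustration, the loop might look something like the sketch below
(hypothetical, not code from the patch; it assumes a cpumask_cycle()
walk of the online map and a pi_under_limit() helper as above):

    /* Hypothetical continuation of pi_get_blocking_cpu(). */
    for ( ; ; )
    {
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
        if ( pi_under_limit(pi_cpu) )
        {
            spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock,
                              *flags);
            /* The counter may have grown since the check; real code
             * would re-check it here before settling on this pCPU. */
            return pi_cpu;
        }
    }

The loop terminates because the limit is the average list length
(num_hvm_vcpus / num_online_cpus) plus a constant: the blocked vcpus
cannot all sit above the average at once, so, modulo concurrent
updates, at least one online pcpu is always under the limit.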

 -George


* Re: [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation
  2017-07-07  6:49 ` [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation Chao Gao
  2017-07-07 15:37   ` Jan Beulich
@ 2017-07-21 16:26   ` George Dunlap
  2017-07-28  8:23     ` Chao Gao
  1 sibling, 1 reply; 18+ messages in thread
From: George Dunlap @ 2017-07-21 16:26 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich

On Fri, Jul 7, 2017 at 7:49 AM, Chao Gao <chao.gao@intel.com> wrote:
> In order to analyze the frequency of PI blocking list operations and
> obtain the list length, add some relevant events to xentrace and some
> associated code in xenalyze. Event ASYNC_PI_LIST_DEL may happen in
> interrupt context, which means the current assumptions checked in
> toplevel_assert_check() no longer hold. Thus, this patch extends
> toplevel_assert_check() to drop those assumptions for events of type
> ASYNC_PI_LIST_DEL.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Hey Chao Gao,

Thanks for doing the work to add this tracing support to xentrace --
and in particular taking the effort to adapt the assert mechanism to
be able to handle asynchronous events.

I think in this case though, having a separate HVM sub-class for
asynchronous events isn't really the right approach.  The main purpose
of sub-classes is to help filter the events you want; and I can't
think of any time you'd want to trace PI_LIST_DEL and not PI_LIST_ADD
(or vice versa).  Secondly, the "asynchronous event" problem will be
an issue for other contexts as well, and the solution will be the
same.

I think a better solution would be to do something similar to
TRC_64_FLAG and TRC_HVM_IOMEM_[read,write], and claim another bit to
create a TRC_ASYNC_FLAG (0x400 probably).  Then we can filter the
"not_idle_domain" and "vcpu_data_mode" asserts on that.

What do you think?

 -George


* Re: [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation
  2017-07-21 16:26   ` George Dunlap
@ 2017-07-28  8:23     ` Chao Gao
  0 siblings, 0 replies; 18+ messages in thread
From: Chao Gao @ 2017-07-28  8:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich

On Fri, Jul 21, 2017 at 05:26:47PM +0100, George Dunlap wrote:
>On Fri, Jul 7, 2017 at 7:49 AM, Chao Gao <chao.gao@intel.com> wrote:
>> In order to analyze the frequency of PI blocking list operations and
>> obtain the list length, add some relevant events to xentrace and some
>> associated code in xenalyze. Event ASYNC_PI_LIST_DEL may happen in
>> interrupt context, which means the current assumptions checked in
>> toplevel_assert_check() no longer hold. Thus, this patch extends
>> toplevel_assert_check() to drop those assumptions for events of type
>> ASYNC_PI_LIST_DEL.
>>
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>
>Hey Chao Gao,
>
>Thanks for doing the work to add this tracing support to xentrace --
>and in particular taking the effort to adapt the assert mechanism to
>be able to handle asynchronous events.
>
>I think in this case though, having a separate HVM sub-class for
>asynchronous events isn't really the right approach.  The main purpose
>of sub-classes is to help filter the events you want; and I can't
>think of any time you'd want to trace PI_LIST_DEL and not PI_LIST_ADD
>(or vice versa).  Secondly, the "asynchronous event" problem will be
>an issue for other contexts as well, and the solution will be the
>same.
>
>I think a better solution would be to do something similar to
>TRC_64_FLAG and TRC_HVM_IOMEM_[read,write], and claim another bit to
>create a TRC_ASYNC_FLAG (0x400 probably).  Then we can filter the
>"not_idle_domain" and "vcpu_data_mode" asserts on that.
>
>What do you think?

It makes sense to me. Your other comments on this series are also fine
to me. I will cook another version based on your suggestions.

Thanks
Chao


Thread overview: 18+ messages
2017-07-07  6:48 [PATCH v4 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
2017-07-07  6:48 ` [PATCH v4 1/4] VT-d PI: track the vcpu number on pi blocking list Chao Gao
2017-07-07 15:41   ` Jan Beulich
2017-07-10  0:50     ` Chao Gao
2017-07-21 15:04   ` George Dunlap
2017-07-07  6:48 ` [PATCH v4 2/4] x86/vcpu: track hvm vcpu number on the system Chao Gao
2017-07-07 15:42   ` Jan Beulich
2017-07-07  6:48 ` [PATCH v4 3/4] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
2017-07-07 15:57   ` Jan Beulich
2017-07-10  1:17     ` Chao Gao
2017-07-10  9:36       ` Jan Beulich
2017-07-10 11:42         ` Chao Gao
2017-07-21 15:43   ` George Dunlap
2017-07-07  6:49 ` [PATCH v4 4/4] Xentrace: add support for HVM's PI blocking list operation Chao Gao
2017-07-07 15:37   ` Jan Beulich
2017-07-10  0:45     ` Chao Gao
2017-07-21 16:26   ` George Dunlap
2017-07-28  8:23     ` Chao Gao
