* [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
@ 2017-04-26  0:52 Chao Gao
  2017-04-26  0:52 ` [PATCH 1/4] xentrace: add TRC_HVM_VT_D_PI_BLOCK Chao Gao
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-26  0:52 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

VT-d PI introduces a per-pCPU blocking list to track the vCPUs blocked
on that pCPU. Theoretically, a single host can have 32K domains with
128 vCPUs each. If all vCPUs are blocked on the same pCPU, 4M vCPUs end
up in the same list, and traversing such a list consumes too much time.
We have discussed this issue in [1,2,3].

To mitigate this issue, we proposed the following two methods [3]:
1. Evenly distribute all the blocked vCPUs among all pCPUs.
2. Don't put blocked vCPUs which won't be woken by the wakeup
interrupt into the per-pCPU list.

Patch 1/4 adds a trace event for adding an entry to the PI blocking
list. With the patch, data can be collected to help validate the
following patches.

Patch 2/4 randomly distributes entries (vCPUs) among all online
pCPUs, which can theoretically decrease the maximum number of entries
in a list by a factor of N, where N is the number of pCPUs.

Patch 3/4 adds a refcount to a vCPU's pi_desc. The refcount is
incremented when the pi_desc is recorded in an IRTE and decremented
when the pi_desc is cleared from an IRTE.

In Patch 4/4, a vCPU is added to the PI blocking list only if its
pi_desc is referenced by at least one IRTE.
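
Taken together, the effect of patches 2-4 on the block path can be
pictured roughly as below. This is an illustrative sketch only, not the
actual code (the real changes are in the patches that follow);
pi_desc_referenced_by_irte() and add_to_pi_blocking_list() are
hypothetical helpers standing in for the real refcount check and list
insertion:

/* Illustrative sketch of the block path after this series. */
static void vmx_vcpu_block_sketch(struct vcpu *v)
{
    unsigned int dest_cpu;

    /* Patch 4/4: skip vCPUs whose pi_desc is not referenced by any
     * IRTE; they can never receive a VT-d wakeup interrupt. */
    if ( !pi_desc_referenced_by_irte(v) )       /* hypothetical helper */
        return;

    /* Patch 2/4: spread entries over all online pCPUs instead of
     * always using v->processor, bounding each list's length. */
    dest_cpu = cpumask_any(&cpu_online_map);

    add_to_pi_blocking_list(v, dest_cpu);       /* hypothetical helper */
}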

I tested this series in the following scenario:
* one guest with 128 vCPUs and a NIC assigned to it
* all 128 vCPUs pinned to one pCPU
* xentrace used to collect events for 5 minutes

I compared the maximum number of entries in one list and the number of
events (adding an entry to the PI blocking list) with and without the
latter three patches. Here is the result:
-------------------------------------------------------------
|               |                      |                    |
|    Items      |   Maximum of #entry  |      #event        |
|               |                      |                    |
-------------------------------------------------------------
|               |                      |                    |
|W/ the patches |         6            |       22740        |
|               |                      |                    |
-------------------------------------------------------------
|               |                      |                    |
|W/O the patches|        128           |       46481        |
|               |                      |                    |
-------------------------------------------------------------

This is a very simple test, but with this patch series the maximum
number of entries decreases significantly and the number of additions
to the PI blocking list drops by about half. From this aspect, the
patch series is effective.

[1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
[2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567
[3] https://lists.gt.net/xen/devel/472749?search_string=enable%20vt-d%20pi%20by%20default;#472749

Chao Gao (4):
  xentrace: add TRC_HVM_VT_D_PI_BLOCK
  VT-d PI: Randomly Distribute entries to all online pCPUs' pi blocking
    list
  VT-d PI: Add reference count to pi_desc
  VT-d PI: Don't add vCPU to PI blocking list for a case

 tools/xentrace/formats                 |  1 +
 xen/arch/x86/hvm/vmx/vmx.c             | 90 +++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/io.c           |  2 +-
 xen/drivers/passthrough/vtd/intremap.c | 60 ++++++++++++++++++++++-
 xen/include/asm-x86/hvm/domain.h       |  6 +++
 xen/include/asm-x86/hvm/trace.h        |  1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  3 ++
 xen/include/asm-x86/iommu.h            |  2 +-
 xen/include/asm-x86/msi.h              |  2 +-
 xen/include/public/trace.h             |  1 +
 10 files changed, 151 insertions(+), 17 deletions(-)

-- 
1.8.3.1



* [PATCH 1/4] xentrace: add TRC_HVM_VT_D_PI_BLOCK
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
@ 2017-04-26  0:52 ` Chao Gao
  2017-04-26  0:52 ` [PATCH 2/4] VT-d PI: Randomly Distribute entries to all online pCPUs' pi blocking list Chao Gao
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-26  0:52 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

This patch adds TRC_HVM_VT_D_PI_BLOCK to track adding one entry to
the per-pcpu blocking list. Also introduce a 'counter' to track
the number of entries in the list.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 tools/xentrace/formats          |  1 +
 xen/arch/x86/hvm/vmx/vmx.c      | 10 +++++++++-
 xen/include/asm-x86/hvm/trace.h |  1 +
 xen/include/public/trace.h      |  1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..ecfa955 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,7 @@
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_BLOCK_LIST  [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index f8d3c17..a2dac56 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    atomic_t             counter;
 };
 
 /*
@@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
+    HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, v->processor,
+                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -186,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        atomic_dec(&container_of(pi_blocking_list_lock,
+                                 struct vmx_pi_blocking_vcpu, lock)->counter);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -234,6 +240,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
@@ -2360,7 +2367,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     struct arch_vmx_struct *vmx, *tmp;
     spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
     struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
@@ -2377,6 +2384,7 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter);
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..d122c70 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,7 @@
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_VT_D_PI_BLOCK    DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..a50c089 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -234,6 +234,7 @@
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_VT_D_PI_BLOCK    (TRC_HVM_HANDLER + 0x26)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
-- 
1.8.3.1



* [PATCH 2/4] VT-d PI: Randomly Distribute entries to all online pCPUs' pi blocking list
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-04-26  0:52 ` [PATCH 1/4] xentrace: add TRC_HVM_VT_D_PI_BLOCK Chao Gao
@ 2017-04-26  0:52 ` Chao Gao
  2017-04-26  0:52 ` [PATCH 3/4] VT-d PI: Add reference count to pi_desc Chao Gao
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-26  0:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

Currently, a blocked vCPU is put in its pCPU's PI blocking list. If
too many vCPUs are blocked on one pCPU, the list grows too long. A
simple analysis shows that with 32K domains and 128 vCPUs per domain,
about 4M vCPUs may be blocked in one pCPU's PI blocking list. When a
wakeup interrupt arrives, the list is traversed to find the specific
vCPUs to wake up, and in that case the traversal would consume much
time.

To mitigate this issue, this patch randomly distributes entries to all
online pCPUs' PI blocking lists, rather than always using the pCPU the
vCPU (the entry) is running on. With this method, the number of entries
in a PI blocking list can theoretically be reduced by a factor of N,
where N is the number of online pCPUs.

The change in vmx_pi_unblock_vcpu() handles the following case: the
vCPU is running -> it tries to block (this patch changes NDST to
another, randomly chosen pCPU) but the notification arrives in time, so
the vCPU goes back to the running state -> VM entry (we should set NDST
again, reverting the change made to NDST in vmx_vcpu_block()).

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a2dac56..4c9af59 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -103,13 +103,28 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
 static void vmx_vcpu_block(struct vcpu *v)
 {
     unsigned long flags;
-    unsigned int dest;
+    unsigned int dest, dest_cpu;
     spinlock_t *old_lock;
-    spinlock_t *pi_blocking_list_lock =
-		&per_cpu(vmx_pi_blocking, v->processor).lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    spinlock_t *pi_blocking_list_lock;
+
+    /*
+     * After pCPU goes down, the per-cpu PI blocking list is cleared.
+     * To make sure the parameter vCPU is added to the chosen pCPU's
+     * PI blocking list before the list is cleared, just retry when
+     * finding the pCPU has gone down.
+     */
+ retry:
+    dest_cpu = cpumask_any(&cpu_online_map);
+    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, dest_cpu).lock;
 
     spin_lock_irqsave(pi_blocking_list_lock, flags);
+    if ( unlikely(!cpu_online(dest_cpu)) )
+    {
+        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+        goto retry;
+    }
+
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
 
@@ -120,11 +135,11 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
-    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
-    HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, v->processor,
-                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
+    atomic_inc(&per_cpu(vmx_pi_blocking, dest_cpu).counter);
+    HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, dest_cpu,
+                atomic_read(&per_cpu(vmx_pi_blocking, dest_cpu).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
-                  &per_cpu(vmx_pi_blocking, v->processor).list);
+                  &per_cpu(vmx_pi_blocking, dest_cpu).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 
     ASSERT(!pi_test_sn(pi_desc));
@@ -134,7 +149,11 @@ static void vmx_vcpu_block(struct vcpu *v)
     ASSERT(pi_desc->ndst ==
            (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
 
+    dest = cpu_physical_id(dest_cpu);
+
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
+    write_atomic(&pi_desc->ndst,
+                 (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
 }
 
 static void vmx_pi_switch_from(struct vcpu *v)
@@ -163,6 +182,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned int dest = cpu_physical_id(v->processor);
 
     /*
      * Set 'NV' field back to posted_intr_vector, so the
@@ -170,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
      * it is running in non-root mode.
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
+    write_atomic(&pi_desc->ndst,
+                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
 
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
-- 
1.8.3.1



* [PATCH 3/4] VT-d PI: Add reference count to pi_desc
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-04-26  0:52 ` [PATCH 1/4] xentrace: add TRC_HVM_VT_D_PI_BLOCK Chao Gao
  2017-04-26  0:52 ` [PATCH 2/4] VT-d PI: Randomly Distribute entries to all online pCPUs' pi blocking list Chao Gao
@ 2017-04-26  0:52 ` Chao Gao
  2017-04-26  0:52 ` [PATCH 4/4] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-26  0:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

This patch introduces a 'refcnt' field in struct pi_blocking_vcpu to
track the reference count of the vCPU's 'pi_desc', and updates this
field every time an IRTE is re-programmed.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c             | 29 ++++++++++++++++++++++++
 xen/drivers/passthrough/io.c           |  2 +-
 xen/drivers/passthrough/vtd/intremap.c | 41 ++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/hvm/domain.h       |  6 +++++
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  3 +++
 xen/include/asm-x86/iommu.h            |  2 +-
 xen/include/asm-x86/msi.h              |  2 +-
 7 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 4c9af59..76cb421 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -100,6 +100,23 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+void vmx_pi_get_ref(struct vcpu *v)
+{
+    ASSERT(atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt) >= 0);
+    atomic_inc(&v->arch.hvm_vmx.pi_blocking.refcnt);
+}
+
+void vmx_pi_put_ref(struct vcpu *v)
+{
+    atomic_dec(&v->arch.hvm_vmx.pi_blocking.refcnt);
+    ASSERT(atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt) >= 0);
+}
+
+bool vmx_pi_in_use(struct vcpu *v)
+{
+    return !!atomic_read(&v->arch.hvm_vmx.pi_blocking.refcnt);
+}
+
 static void vmx_vcpu_block(struct vcpu *v)
 {
     unsigned long flags;
@@ -378,6 +395,9 @@ void vmx_pi_hooks_assign(struct domain *d)
 
     d->arch.hvm_domain.pi_ops.vcpu_block = vmx_vcpu_block;
     d->arch.hvm_domain.pi_ops.do_resume = vmx_pi_do_resume;
+    d->arch.hvm_domain.pi_ops.get_ref = vmx_pi_get_ref;
+    d->arch.hvm_domain.pi_ops.put_ref = vmx_pi_put_ref;
+    d->arch.hvm_domain.pi_ops.in_use = vmx_pi_in_use;
 }
 
 /* This function is called when pcidevs_lock is held */
@@ -416,6 +436,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
     d->arch.hvm_domain.pi_ops.vcpu_block = NULL;
     d->arch.hvm_domain.pi_ops.switch_from = NULL;
     d->arch.hvm_domain.pi_ops.do_resume = NULL;
+    d->arch.hvm_domain.pi_ops.get_ref = NULL;
+    d->arch.hvm_domain.pi_ops.put_ref = NULL;
+    d->arch.hvm_domain.pi_ops.in_use = NULL;
+    /*
+     * If the device is still in use by the guest but we forcibly deassign
+     * it, the 'refcnt' is not zero here. Clear it for re-assignment.
+     */
+    for_each_vcpu ( d, v )
+        atomic_set(&v->arch.hvm_vmx.pi_blocking.refcnt, 0);
 
     for_each_vcpu ( d, v )
         vmx_pi_unblock_vcpu(v);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index e5a43e5..979be77 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -319,7 +319,7 @@ int pt_irq_create_bind(
     {
         uint8_t dest, dest_mode, delivery_mode;
         int dest_vcpu_id;
-        const struct vcpu *vcpu;
+        struct vcpu *vcpu;
 
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         {
diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 1e0317c..99f1cce 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -596,6 +596,32 @@ static int remap_entry_to_msi_msg(
     return 0;
 }
 
+static void pi_get_ref(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.get_ref )
+        v->domain->arch.hvm_domain.pi_ops.get_ref(v);
+}
+
+static void pi_put_ref(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.put_ref )
+        v->domain->arch.hvm_domain.pi_ops.put_ref(v);
+}
+
 static int msi_msg_to_remap_entry(
     struct iommu *iommu, struct pci_dev *pdev,
     struct msi_desc *msi_desc, struct msi_msg *msg)
@@ -619,6 +645,7 @@ static int msi_msg_to_remap_entry(
         {
             free_remap_entry(iommu, msi_desc->remap_index + i);
             msi_desc[i].irte_initialized = false;
+            pi_put_ref(msi_desc[i].pi_desc);
         }
         spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
         return 0;
@@ -962,11 +989,12 @@ void iommu_disable_x2apic_IR(void)
  * This function is used to update the IRTE for posted-interrupt
  * when guest changes MSI/MSI-X information.
  */
-int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
+int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     const uint8_t gvec)
 {
     struct irq_desc *desc;
     struct msi_desc *msi_desc;
+    struct pi_desc *old_pi_desc;
     int rc;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -979,13 +1007,22 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
         rc = -ENODEV;
         goto unlock_out;
     }
+    old_pi_desc = msi_desc->pi_desc;
+
+    pi_get_ref(pi_desc);
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
 
     spin_unlock_irq(&desc->lock);
 
     ASSERT(pcidevs_locked());
-    return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
+    rc = iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
+    if ( !rc )
+        pi_put_ref(old_pi_desc);
+    else
+        ASSERT_UNREACHABLE();
+
+    return rc;
 
  unlock_out:
     spin_unlock_irq(&desc->lock);
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9..6fc97c4 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -98,6 +98,12 @@ struct hvm_pi_ops {
 
     /* Hook into the vmentry path. */
     void (*do_resume)(struct vcpu *v);
+
+    /* Get/put the refcount of this vCPU's pi_desc */
+    void (*get_ref)(struct vcpu *v);
+    void (*put_ref)(struct vcpu *v);
+    /* Is the vCPU's pi_desc referenced by any IRTE? */
+    bool (*in_use)(struct vcpu *v);
 };
 
 struct hvm_domain {
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 9507bd2..7cb1a92 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -85,6 +85,7 @@ struct pi_desc {
 struct pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           *lock;
+    atomic_t             refcnt;  /* How many IRTEs refer to this vCPU? */
 };
 
 struct arch_vmx_struct {
@@ -160,6 +161,8 @@ struct arch_vmx_struct {
     struct pi_blocking_vcpu pi_blocking;
 };
 
+#define pi_desc_to_vcpu(a) container_of(a, struct vcpu, arch.hvm_vmx.pi_desc)
+
 int vmx_create_vmcs(struct vcpu *v);
 void vmx_destroy_vmcs(struct vcpu *v);
 void vmx_vmcs_enter(struct vcpu *v);
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 0431233..cfa0058 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -92,7 +92,7 @@ bool_t iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
-int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
+int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
                    const uint8_t gvec);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index a5de6a1..fbf1793 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -105,7 +105,7 @@ struct msi_desc {
 
 	bool irte_initialized;
 	uint8_t gvec;			/* guest vector. valid when pi_desc isn't NULL */
-	const struct pi_desc *pi_desc;	/* pointer to posted descriptor */
+	struct pi_desc *pi_desc;	/* pointer to posted descriptor */
 
 	struct list_head list;
 
-- 
1.8.3.1



* [PATCH 4/4] VT-d PI: Don't add vCPU to PI blocking list for a case
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (2 preceding siblings ...)
  2017-04-26  0:52 ` [PATCH 3/4] VT-d PI: Add reference count to pi_desc Chao Gao
@ 2017-04-26  0:52 ` Chao Gao
  2017-04-26  8:19 ` [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Jan Beulich
  2017-04-26 16:39 ` George Dunlap
  5 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-26  0:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jan Beulich, Jun Nakajima, Chao Gao

Currently, a blocked vCPU is added to the PI blocking list if its
domain has assigned devices. However, some blocked vCPUs will never be
woken up by the wakeup interrupt generated by VT-d hardware; they may
be woken up by IPIs or other interrupts from emulated devices. With
this patch, such vCPUs are not added to the PI blocking list.

If a vCPU is blocked before it is bound to an IRTE, the vCPU needs to
be added to the blocking list when the binding happens. In that case,
arch_vcpu_block() may be called from another vCPU, which the current
implementation can't handle. This patch expands arch_vcpu_block(),
removing some restrictions expressed by assertions and handling the
target vCPU according to its state and its PI blocking list lock
(v->arch.hvm_vmx.pi_blocking.lock).

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c             | 21 ++++++++++++++-------
 xen/drivers/passthrough/vtd/intremap.c | 19 +++++++++++++++++++
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 76cb421..ae16f39 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -125,6 +125,14 @@ static void vmx_vcpu_block(struct vcpu *v)
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
     spinlock_t *pi_blocking_list_lock;
 
+    /* If no IRTE refers to 'pi_desc', no further action is needed. */
+    if ( v->domain->arch.hvm_domain.pi_ops.in_use &&
+         !v->domain->arch.hvm_domain.pi_ops.in_use(v) )
+        return;
+
+    if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+        return;
+
     /*
      * After pCPU goes down, the per-cpu PI blocking list is cleared.
      * To make sure the parameter vCPU is added to the chosen pCPU's
@@ -144,13 +152,8 @@ static void vmx_vcpu_block(struct vcpu *v)
 
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
-
-    /*
-     * 'v->arch.hvm_vmx.pi_blocking.lock' should be NULL before
-     * being assigned to a new value, since the vCPU is currently
-     * running and it cannot be on any blocking list.
-     */
-    ASSERT(old_lock == NULL);
+    if ( old_lock )
+        goto out;
 
     atomic_inc(&per_cpu(vmx_pi_blocking, dest_cpu).counter);
     HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, dest_cpu,
@@ -171,6 +174,10 @@ static void vmx_vcpu_block(struct vcpu *v)
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
     write_atomic(&pi_desc->ndst,
                  (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
+    return;
+
+ out:
+    spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 }
 
 static void vmx_pi_switch_from(struct vcpu *v)
diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 99f1cce..c43cfa2 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -30,6 +30,7 @@
 #include "extern.h"
 
 #include <asm/apic.h>
+#include <asm/hvm/hvm.h>
 #include <asm/io_apic.h>
 #define nr_ioapic_entries(i)  nr_ioapic_entries[i]
 
@@ -622,6 +623,20 @@ static void pi_put_ref(struct pi_desc *pi_desc)
         v->domain->arch.hvm_domain.pi_ops.put_ref(v);
 }
 
+static bool pi_in_use(struct pi_desc *pi_desc)
+{
+    struct vcpu *v;
+
+    if ( !pi_desc )
+        return 0;
+
+    v = pi_desc_to_vcpu(pi_desc);
+    ASSERT(is_hvm_domain(v->domain));
+    if ( v->domain->arch.hvm_domain.pi_ops.in_use )
+        return v->domain->arch.hvm_domain.pi_ops.in_use(v);
+    return 0;
+}
+
 static int msi_msg_to_remap_entry(
     struct iommu *iommu, struct pci_dev *pdev,
     struct msi_desc *msi_desc, struct msi_msg *msg)
@@ -996,6 +1011,7 @@ int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     struct msi_desc *msi_desc;
     struct pi_desc *old_pi_desc;
     int rc;
+    bool first_ref;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
     if ( !desc )
@@ -1009,7 +1025,10 @@ int pi_update_irte(struct pi_desc *pi_desc, const struct pirq *pirq,
     }
     old_pi_desc = msi_desc->pi_desc;
 
+    first_ref = !pi_in_use(pi_desc);
     pi_get_ref(pi_desc);
+    if ( pi_desc && first_ref )
+        arch_vcpu_block(pi_desc_to_vcpu(pi_desc));
     msi_desc->pi_desc = pi_desc;
     msi_desc->gvec = gvec;
 
-- 
1.8.3.1



* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26  8:19 ` [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Jan Beulich
@ 2017-04-26  3:30   ` Chao Gao
  2017-04-26 10:52     ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Gao @ 2017-04-26  3:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

On Wed, Apr 26, 2017 at 02:19:22AM -0600, Jan Beulich wrote:
>>>> On 26.04.17 at 02:52, <chao.gao@intel.com> wrote:
>> Patch 2/4 randomly distritbutes entries (vCPUs) among all oneline
>> pCPUs, which can theoretically decrease the maximum of #entry
>> in the list by N times. N is #pCPU.
>
>Why randomly? Shouldn't current list length determine which CPU(s)
>to prefer?

I thought random distribution was simple and could also meet our
needs. What you suggest also seems fine to me. How about just choosing
the CPU with the shortest list? I will use this policy in the next
version. Will you review this version when you have time, or should I
prepare and send the next version directly?

Regarding the test I did, are there any other complementary tests you
think are needed to prove that these patches really mitigate the
issue?

Thanks
Chao


* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (3 preceding siblings ...)
  2017-04-26  0:52 ` [PATCH 4/4] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
@ 2017-04-26  8:19 ` Jan Beulich
  2017-04-26  3:30   ` Chao Gao
  2017-04-26 16:39 ` George Dunlap
  5 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2017-04-26  8:19 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

>>> On 26.04.17 at 02:52, <chao.gao@intel.com> wrote:
> Patch 2/4 randomly distritbutes entries (vCPUs) among all oneline
> pCPUs, which can theoretically decrease the maximum of #entry
> in the list by N times. N is #pCPU.

Why randomly? Shouldn't current list length determine which CPU(s)
to prefer?

Jan



* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26  3:30   ` Chao Gao
@ 2017-04-26 10:52     ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2017-04-26 10:52 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

>>> On 26.04.17 at 05:30, <chao.gao@intel.com> wrote:
> On Wed, Apr 26, 2017 at 02:19:22AM -0600, Jan Beulich wrote:
>>>>> On 26.04.17 at 02:52, <chao.gao@intel.com> wrote:
>>> Patch 2/4 randomly distritbutes entries (vCPUs) among all oneline
>>> pCPUs, which can theoretically decrease the maximum of #entry
>>> in the list by N times. N is #pCPU.
>>
>>Why randomly? Shouldn't current list length determine which CPU(s)
>>to prefer?
> 
> I thought randomly distribution is simple and also can meet our
> demand.  What you said I think is also ok. How about just choose the
> CPU which has the shortest list length?

If that's doable without having to iterate over all CPUs every time.
There may need to be a hybrid approach, or one only coming close
to optimal distribution.
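
(A minimal illustration of one such approximation, added here for
reference and not taken from this thread: sample two random online
pCPUs and take the one with the shorter list, so no full scan of all
CPUs is needed; pi_blocking_list_len() is a hypothetical helper.)

/* Hypothetical "power of two choices" pCPU selection; sketch only. */
static unsigned int pick_pi_blocking_cpu(void)
{
    unsigned int a = cpumask_any(&cpu_online_map);
    unsigned int b = cpumask_any(&cpu_online_map);

    /* Prefer the pCPU whose PI blocking list is currently shorter. */
    return pi_blocking_list_len(a) <= pi_blocking_list_len(b) ? a : b;
}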

> I will use this policy in next
> version. Will you review this version when you have time?

Of course, but I can't predict when that will be.

> or I can
> prepare and send next version directly?

That's entirely up to you.

> Regarding to the test I do, any other complementary tests do you
> think are needed to prove that these patches really mitigate the
> issue?

I think the primary criteria is that you can abstractly prove the
problem to be gone. That's (as I'm sure you understand) an
argument against using any randomness as input here.

Jan



* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
                   ` (4 preceding siblings ...)
  2017-04-26  8:19 ` [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Jan Beulich
@ 2017-04-26 16:39 ` George Dunlap
  2017-04-27  0:43   ` Chao Gao
  2017-05-02  5:45   ` Chao Gao
  5 siblings, 2 replies; 21+ messages in thread
From: George Dunlap @ 2017-04-26 16:39 UTC (permalink / raw)
  To: Chao Gao, xen-devel
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Ian Jackson, Jan Beulich

On 26/04/17 01:52, Chao Gao wrote:
> VT-d PI introduces a per-pCPU blocking list to track the blocked vCPU
> running on the pCPU. Theoretically, there are 32K domain on single
> host, 128 vCPUs per domain. If all vCPUs are blocked on the same pCPU,
> 4M vCPUs are in the same list. Travelling this issue consumes too
> much time. We have discussed this issue in [1,2,3].
> 
> To mitigate this issue, we proposed the following two method [3]:
> 1. Evenly distributing all the blocked vCPUs among all pCPUs.

So you're not actually distributing the *vcpus* among the pcpus (which
would imply some interaction with the scheduler); you're distributing
the vcpu PI wake-up interrupt between pcpus.  Is that right?

Doesn't having a PI on a different pcpu than where the vcpu is running
mean at least one IPI to wake up that vcpu?  If so, aren't we imposing a
constant overhead on basically every single interrupt, as well as
increasing the IPI traffic, in order to avoid a highly unlikely
theoretical corner case?

A general maxim in OS development is "Make the common case fast, and the
uncommon case correct."  It seems like it would be better in the common
case to have the PI vectors on the pcpu on which the vcpu is running,
and only implement the balancing when the list starts to get too long.

What do you think?

> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
> interrupt into the per-pCPU list.
> 
> PATCH 1/4 tracks the event, adding entry to PI blocking list. With the
> patch, some data can be acquired to help to validate the following
> patches. 
> 
> Patch 2/4 randomly distritbutes entries (vCPUs) among all oneline
> pCPUs, which can theoretically decrease the maximum of #entry
> in the list by N times. N is #pCPU.
> 
> Patch 3/4 adds a refcount to vcpu's pi_desc. If the pi_desc is
> recorded in one IRTE, the refcount increase by 1 and If the pi_desc is
> cleared in one IRTE, the refcount decrease by 1.
> 
> In Patch 4/4, one vCPU is added to PI blocking list only if its
> pi_desc is referred by at least one IRTE.
> 
> I tested this series in the following scene:
> * One 128 vCPUs guest and assign a NIC to it
> * all 128 vCPUs are pinned to one pCPU.
> * use xentrace to collect events for 5 minutes
> 
> I compared the maximum of #entry in one list and #event (adding entry to
> PI blocking list) with and without the three latter patches. Here
> is the result:
> -------------------------------------------------------------
> |               |                      |                    |
> |    Items      |   Maximum of #entry  |      #event        |
> |               |                      |                    |
> -------------------------------------------------------------
> |               |                      |                    |
> |W/ the patches |         6            |       22740        |
> |               |                      |                    |
> -------------------------------------------------------------
> |               |                      |                    |
> |W/O the patches|        128           |       46481        |
> |               |                      |                    |
> -------------------------------------------------------------

Any chance you could trace how long the list traversal took?  It would
be good for future reference to have an idea what kinds of timescales
we're talking about.

 -George



* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26 16:39 ` George Dunlap
@ 2017-04-27  0:43   ` Chao Gao
  2017-04-27  9:44     ` George Dunlap
  2017-05-02  5:45   ` Chao Gao
  1 sibling, 1 reply; 21+ messages in thread
From: Chao Gao @ 2017-04-27  0:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, xen-devel, Jun Nakajima

On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>On 26/04/17 01:52, Chao Gao wrote:
>> VT-d PI introduces a per-pCPU blocking list to track the blocked vCPU
>> running on the pCPU. Theoretically, there are 32K domain on single
>> host, 128 vCPUs per domain. If all vCPUs are blocked on the same pCPU,
>> 4M vCPUs are in the same list. Travelling this issue consumes too
>> much time. We have discussed this issue in [1,2,3].
>> 
>> To mitigate this issue, we proposed the following two method [3]:
>> 1. Evenly distributing all the blocked vCPUs among all pCPUs.
>
>So you're not actually distributing the *vcpus* among the pcpus (which
>would imply some interaction with the scheduler); you're distributing
>the vcpu PI wake-up interrupt between pcpus.  Is that right?

Yes. I should describe things more clearly.

>
>Doesn't having a PI on a different pcpu than where the vcpu is running
>mean at least one IPI to wake up that vcpu?  If so, aren't we imposing a
>constant overhead on basically every single interrupt, as well as
>increasing the IPI traffic, in order to avoid a highly unlikely
>theoretical corner case?

If it will incur at least one more IPI, I can't agree more. I think it
depends on whether calling vcpu_unblock() to wake up a vCPU not running
on current pCPU will lead to a more IPI compared to the vCPU running
on the current pCPU. In my mind, different schedulers may differ on this point.

>
>A general maxim in OS development is "Make the common case fast, and the
>uncommon case correct."  It seems like it would be better in the common
>case to have the PI vectors on the pcpu on which the vcpu is running,
>and only implement the balancing when the list starts to get too long.

Agreed. Distributing the PI wakeup interrupt among the pCPUs will
increase spurious interrupts, I think. Anyhow, I should take your
advice. Kevin also gave similar advice in the discussion that happened
a year ago.

>
>Any chance you could trace how long the list traversal took?  It would
>be good for future reference to have an idea what kinds of timescales
>we're talking about.

Will do later.

Thanks
Chao


* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-27  9:44     ` George Dunlap
@ 2017-04-27  5:02       ` Chao Gao
  0 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-04-27  5:02 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, xen-devel, Jun Nakajima

On Thu, Apr 27, 2017 at 10:44:26AM +0100, George Dunlap wrote:
>On Thu, Apr 27, 2017 at 1:43 AM, Chao Gao <chao.gao@intel.com> wrote:
>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>On 26/04/17 01:52, Chao Gao wrote:
>>>> VT-d PI introduces a per-pCPU blocking list to track the blocked vCPU
>>>> running on the pCPU. Theoretically, there are 32K domain on single
>>>> host, 128 vCPUs per domain. If all vCPUs are blocked on the same pCPU,
>>>> 4M vCPUs are in the same list. Travelling this issue consumes too
>>>> much time. We have discussed this issue in [1,2,3].
>>>>
>>>> To mitigate this issue, we proposed the following two method [3]:
>>>> 1. Evenly distributing all the blocked vCPUs among all pCPUs.
>>>
>>>So you're not actually distributing the *vcpus* among the pcpus (which
>>>would imply some interaction with the scheduler); you're distributing
>>>the vcpu PI wake-up interrupt between pcpus.  Is that right?
>>
>> Yes. I should describe things more clearly.
>>
>>>
>>>Doesn't having a PI on a different pcpu than where the vcpu is running
>>>mean at least one IPI to wake up that vcpu?  If so, aren't we imposing a
>>>constant overhead on basically every single interrupt, as well as
>>>increasing the IPI traffic, in order to avoid a highly unlikely
>>>theoretical corner case?
>>
>> If it will incur at least one more IPI, I can't agree more. I think it
>> depends on whether calling vcpu_unblock() to wake up a vCPU not running
>> on current pCPU will lead to a more IPI compared to the vCPU running
>> on the current pCPU. In my mind, different schedulers may differ on this point.
>
>Well I'm not aware of any way to tell another processor to do
>something in a timely manner other than with an IPI; and in any case
>that's the method that both credit1 and credit2 use.  It's true that
>not all vcpu_wake() calls will end up with an IPI, but a fairly large
>number of them will.  Avoiding this overhead when it's not necessary
>for performance is pretty important.

OK, I agree and will avoid this overhead in the next version.
I really appreciate your comments.

Thanks
Chao


* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-27  0:43   ` Chao Gao
@ 2017-04-27  9:44     ` George Dunlap
  2017-04-27  5:02       ` Chao Gao
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2017-04-27  9:44 UTC (permalink / raw)
  To: George Dunlap, xen-devel, George Dunlap, Ian Jackson, Wei Liu,
	Jun Nakajima, Kevin Tian, Jan Beulich, Andrew Cooper

On Thu, Apr 27, 2017 at 1:43 AM, Chao Gao <chao.gao@intel.com> wrote:
> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>On 26/04/17 01:52, Chao Gao wrote:
>>> VT-d PI introduces a per-pCPU blocking list to track the blocked vCPU
>>> running on the pCPU. Theoretically, there are 32K domain on single
>>> host, 128 vCPUs per domain. If all vCPUs are blocked on the same pCPU,
>>> 4M vCPUs are in the same list. Travelling this issue consumes too
>>> much time. We have discussed this issue in [1,2,3].
>>>
>>> To mitigate this issue, we proposed the following two method [3]:
>>> 1. Evenly distributing all the blocked vCPUs among all pCPUs.
>>
>>So you're not actually distributing the *vcpus* among the pcpus (which
>>would imply some interaction with the scheduler); you're distributing
>>the vcpu PI wake-up interrupt between pcpus.  Is that right?
>
> Yes. I should describe things more clearly.
>
>>
>>Doesn't having a PI on a different pcpu than where the vcpu is running
>>mean at least one IPI to wake up that vcpu?  If so, aren't we imposing a
>>constant overhead on basically every single interrupt, as well as
>>increasing the IPI traffic, in order to avoid a highly unlikely
>>theoretical corner case?
>
> If it will incur at least one more IPI, I can't agree more. I think it
> depends on whether calling vcpu_unblock() to wake up a vCPU not running
> on current pCPU will lead to a more IPI compared to the vCPU running
> on the current pCPU. In my mind, different schedulers may differ on this point.

Well I'm not aware of any way to tell another processor to do
something in a timely manner other than with an IPI; and in any case
that's the method that both credit1 and credit2 use.  It's true that
not all vcpu_wake() calls will end up with an IPI, but a fairly large
number of them will.  Avoiding this overhead when it's not necessary
for performance is pretty important.

 -George


* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-04-26 16:39 ` George Dunlap
  2017-04-27  0:43   ` Chao Gao
@ 2017-05-02  5:45   ` Chao Gao
  2017-05-03 10:08     ` George Dunlap
  1 sibling, 1 reply; 21+ messages in thread
From: Chao Gao @ 2017-05-02  5:45 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich, Kevin Tian
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima

On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>On 26/04/17 01:52, Chao Gao wrote:
>> I compared the maximum of #entry in one list and #event (adding entry to
>> PI blocking list) with and without the three latter patches. Here
>> is the result:
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |    Items      |   Maximum of #entry  |      #event        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |W/ the patches |         6            |       22740        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |W/O the patches|        128           |       46481        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>
>Any chance you could trace how long the list traversal took?  It would
>be good for future reference to have an idea what kinds of timescales
>we're talking about.

Hi.

I made a simple test to measure the time consumed by the list
traversal. I applied the patch below and created one HVM guest with 128
vCPUs and a passthrough 40 NIC, with all guest vCPUs pinned to one
pCPU. Data were collected with 'xentrace -D -e 0x82000 -T 300
trace.bin' and decoded with xentrace_format. When the list length is
about 128, the traversal time is in the range of 1750 to 39330 cycles.
The physical CPU's frequency is 1795.788MHz, so the time consumed is in
the range of 1us to 22us. If 0.5ms is the upper bound the system can
tolerate, at most about 2900 vCPUs can be added to the list.
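
(For reference, a back-of-the-envelope check of that bound, as a
standalone sketch rather than part of any patch here; the constants are
the worst-case figures from the measurement above:)

/* Roughly how many blocked vCPUs fit in a 0.5ms traversal budget,
 * assuming ~39330 cycles for ~128 entries at 1795.788 MHz. */
#include <stdio.h>

int main(void)
{
    double cycles_per_entry = 39330.0 / 128;      /* ~307 cycles/entry */
    double budget_cycles = 0.0005 * 1795.788e6;   /* 0.5ms budget */

    printf("max entries ~= %.0f\n", budget_cycles / cycles_per_entry);
    return 0;
}

This prints roughly 2900, matching the estimate above.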

I hope there is no error in the test and analysis.

Thanks
Chao

---8<---
From 504fd32bc042670812daad41efcd982434b98cd4 Mon Sep 17 00:00:00 2001
From: Chao Gao <chao.gao@intel.com>
Date: Wed, 26 Apr 2017 03:39:06 +0800
Subject: [PATCH] xentrace: trace PI-related events.

This patch adds TRC_HVM_VT_D_PI_BLOCK, TRC_HVM_PI_WAKEUP_START and
TRC_HVM_PI_WAKEUP_END to track PI-related events. Specifically,
TRC_HVM_VT_D_PI_BLOCK tracks adding one entry to the per-pcpu
blocking list. TRC_HVM_PI_WAKEUP_{START, END} mark the start and end
of PI blocking list traversal. Also introduce a 'counter' to track the
number of entries in the list.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 tools/xentrace/formats          |  3 +++
 xen/arch/x86/hvm/vmx/vmx.c      | 13 ++++++++++++-
 xen/include/asm-x86/hvm/trace.h |  3 +++
 xen/include/public/trace.h      |  3 +++
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..34ed9e4 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,9 @@
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_BLOCK_LIST  [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
+0x00082027  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_WAKEUP_START [ list_len = 0x%(1)04x ]
+0x00082028  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_WAKEUP_END
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c8ef18a..3a6640b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    atomic_t             counter;
 };
 
 /*
@@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
+    HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, v->processor,
+                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -186,6 +190,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        atomic_dec(&container_of(pi_blocking_list_lock,
+                                 struct vmx_pi_blocking_vcpu, lock)->counter);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -234,6 +240,7 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
@@ -2360,13 +2367,15 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     struct arch_vmx_struct *vmx, *tmp;
     spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
     struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
 
     spin_lock(lock);
 
+    TRACE_1D(TRC_HVM_PI_WAKEUP_START,
+        atomic_read(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter));
     /*
      * XXX: The length of the list depends on how many vCPU is current
      * blocked on this specific pCPU. This may hurt the interrupt latency
@@ -2377,11 +2386,13 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter);
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
     }
+    TRACE_0D(TRC_HVM_PI_WAKEUP_END);
 
     spin_unlock(lock);
 }
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..afe8b75 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,9 @@
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_VT_D_PI_BLOCK    DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_WAKEUP_START  DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_WAKEUP_END    DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..c5b95ee 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -234,6 +234,9 @@
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_VT_D_PI_BLOCK    (TRC_HVM_HANDLER + 0x26)
+#define TRC_HVM_PI_WAKEUP_START  (TRC_HVM_HANDLER + 0x27)
+#define TRC_HVM_PI_WAKEUP_END    (TRC_HVM_HANDLER + 0x28)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
-- 
1.8.3.1



* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-02  5:45   ` Chao Gao
@ 2017-05-03 10:08     ` George Dunlap
  2017-05-03 10:21       ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2017-05-03 10:08 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian, xen-devel, George Dunlap, Ian Jackson,
	Wei Liu, Jun Nakajima, Andrew Cooper

On 02/05/17 06:45, Chao Gao wrote:
> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>> On 26/04/17 01:52, Chao Gao wrote:
>>> I compared the maximum of #entry in one list and #event (adding entry to
>>> PI blocking list) with and without the three latter patches. Here
>>> is the result:
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |    Items      |   Maximum of #entry  |      #event        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |W/ the patches |         6            |       22740        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |W/O the patches|        128           |       46481        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>
>> Any chance you could trace how long the list traversal took?  It would
>> be good for future reference to have an idea what kinds of timescales
>> we're talking about.
> 
> Hi.
> 
> I made a simple test to get the time consumed by the list traversal.
> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
> All guest vcpu are pinned to one pcpu. collect data by
> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
> xentrace_format. When the list length is about 128, the traversal time
> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
> 2900 vcpus can be added into the list.

Great, thanks Chao Gao, that's useful.  I'm not sure a fixed latency --
say 500us -- is the right thing to look at; if all 2900 vcpus arranged
to have interrupts staggered at 500us intervals it could easily lock up
the cpu for nearly a full second.  But I'm having trouble formulating a
good limit scenario.

In any case, 22us should be safe from a security standpoint*, and 128
should be pretty safe from a "make the common case fast" standpoint:
i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
traffic will be the least of your performance problems I should think.

 -George

* Waiting for Jan to contradict me on this one. :-)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-03 10:08     ` George Dunlap
@ 2017-05-03 10:21       ` Jan Beulich
  2017-05-08 16:15         ` Chao Gao
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2017-05-03 10:21 UTC (permalink / raw)
  To: George Dunlap, Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	xen-devel, JunNakajima

>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
> On 02/05/17 06:45, Chao Gao wrote:
>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>> On 26/04/17 01:52, Chao Gao wrote:
>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>> PI blocking list) with and without the three latter patches. Here
>>>> is the result:
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |W/ the patches |         6            |       22740        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |W/O the patches|        128           |       46481        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>
>>> Any chance you could trace how long the list traversal took?  It would
>>> be good for future reference to have an idea what kinds of timescales
>>> we're talking about.
>> 
>> Hi.
>> 
>> I made a simple test to get the time consumed by the list traversal.
>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>> All guest vcpu are pinned to one pcpu. collect data by
>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>> xentrace_format. When the list length is about 128, the traversal time
>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>> 2900 vcpus can be added into the list.
> 
> Great, thanks Chao Gao, that's useful.

Looks like Chao Gao has been dropped ...

>  I'm not sure a fixed latency --
> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
> to have interrupts staggered at 500us intervals it could easily lock up
> the cpu for nearly a full second.  But I'm having trouble formulating a
> good limit scenario.
> 
> In any case, 22us should be safe from a security standpoint*, and 128
> should be pretty safe from a "make the common case fast" standpoint:
> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
> traffic will be the least of your performance problems I should think.
> 
>  -George
> 
> * Waiting for Jan to contradict me on this one. :-)

22us would certainly be fine, if this were the worst-case scenario.
I'm not sure the value measured for 128 list entries can be easily
scaled to several thousand of them, due to cache and/or NUMA
effects. I continue to think that we primarily need theoretical
proof of an upper boundary on list length being enforced, rather
than any measurements or randomized balancing. And just to be
clear - if someone overloads their system, I do not see a need to
have a guaranteed maximum list traversal latency here. All I ask
for is that list traversal time scales with total vCPU count divided
by pCPU count.
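
For instance, on a host with (say) 2048 vCPUs and 64 pCPUs that would
mean no list should grow much beyond 2048 / 64 = 32 entries, well below
the 128-entry case measured above.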

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-08 16:15         ` Chao Gao
@ 2017-05-08  8:39           ` Jan Beulich
  2017-05-08 16:38             ` Chao Gao
  2017-05-08  9:13           ` George Dunlap
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2017-05-08  8:39 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, JunNakajima

>>> On 08.05.17 at 18:15, <chao.gao@intel.com> wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>> is the result:
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/ the patches |         6            |       22740        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/O the patches|        128           |       46481        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>> be good for future reference to have an idea what kinds of timescales
>>>>> we're talking about.
>>>> 
>>>> Hi.
>>>> 
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
> 40 NIC.
>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>> xentrace_format. When the list length is about 128, the traversal time
>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>> 2900 vcpus can be added into the list.
>>> 
>>> Great, thanks Chao Gao, that's useful.
>>
>>Looks like Chao Gao has been dropped ...
>>
>>>  I'm not sure a fixed latency --
>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>> to have interrupts staggered at 500us intervals it could easily lock up
>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>> good limit scenario.
>>> 
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems I should think.
>>> 
>>>  -George
>>> 
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>>22us would certainly be fine, if this was the worst case scenario.
>>I'm not sure the value measured for 128 list entries can be easily
>>scaled to several thousands of them, due cache and/or NUMA
>>effects. I continue to think that we primarily need theoretical
>>proof of an upper boundary on list length being enforced, rather
>>than any measurements or randomized balancing. And just to be
>>clear - if someone overloads their system, I do not see a need to
>>have a guaranteed maximum list traversal latency here. All I ask
>>for is that list traversal time scales with total vCPU count divided
>>by pCPU count.
> 
> Thanks, Jan & George.
> 
> I think it is more clear to me about what should I do next step.
> 
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the vCPU
> is on the pCPU) to make the common case fast.
> 2. With the list grows to a point where we think it may consumers too
> much time to traverse the list, also distribute wakeup interrupt to local
> pCPU, ignoring admin intentionally overloads their system.
> 3. When the list length reachs the theoretic average maximum (means
> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
> to another underutilized pCPU.
> 
> But, I am confused about that If we don't care someone overload their
> system, why we need the stage #3?  If not, I have no idea to meet Jan's
> request, the list traversal time scales with total vCPU count divided by
> pCPU count. Or we will reach stage #3 before stage #2?

The thing is that imo point 2 is too fuzzy to be of any use, i.e. 3 should
take effect immediately. We don't mean to ignore any admin decisions
here, it is just that if they overload their systems, the net effect of 3
may still not be good enough to provide smooth behavior. But that's
then a result of them overloading their systems in the first place. IOW,
you should try to evenly distribute vCPU-s as soon as their count on
a given pCPU exceeds the calculated average.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-08 16:15         ` Chao Gao
  2017-05-08  8:39           ` Jan Beulich
@ 2017-05-08  9:13           ` George Dunlap
  2017-05-08  9:24             ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: George Dunlap @ 2017-05-08  9:13 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, IanJackson,
	JunNakajima, Kevin Tian, xen-devel

On 08/05/17 17:15, Chao Gao wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>> is the result:
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/ the patches |         6            |       22740        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/O the patches|        128           |       46481        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>> be good for future reference to have an idea what kinds of timescales
>>>>> we're talking about.
>>>>
>>>> Hi.
>>>>
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>> xentrace_format. When the list length is about 128, the traversal time
>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>> 2900 vcpus can be added into the list.
>>>
>>> Great, thanks Chao Gao, that's useful.
>>
>> Looks like Chao Gao has been dropped ...
>>
>>>  I'm not sure a fixed latency --
>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>> to have interrupts staggered at 500us intervals it could easily lock up
>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>> good limit scenario.
>>>
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems I should think.
>>>
>>>  -George
>>>
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>> 22us would certainly be fine, if this was the worst case scenario.
>> I'm not sure the value measured for 128 list entries can be easily
>> scaled to several thousands of them, due cache and/or NUMA
>> effects. I continue to think that we primarily need theoretical
>> proof of an upper boundary on list length being enforced, rather
>> than any measurements or randomized balancing. And just to be
>> clear - if someone overloads their system, I do not see a need to
>> have a guaranteed maximum list traversal latency here. All I ask
>> for is that list traversal time scales with total vCPU count divided
>> by pCPU count.
> 
> Thanks, Jan & George.
> 
> I think it is more clear to me about what should I do next step.
> 
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the vCPU
> is on the pCPU) to make the common case fast.
> 2. With the list grows to a point where we think it may consumers too
> much time to traverse the list, also distribute wakeup interrupt to local
> pCPU, ignoring admin intentionally overloads their system.
> 3. When the list length reachs the theoretic average maximum (means
> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
> to another underutilized pCPU.

By "maximal vCPU count" do you mean, "total number of active vcpus on
the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
domains * 512 vcpus each or something)?

What about saying that the limit of vcpus for any given pcpu will be:
 (v_tot / p_tot) + K
where v_tot is the total number of vcpus on the system, p_tot is the
total number of pcpus in the system, and K is a fixed number (such as
128) such that 1) the additional time walking the list is minimal, and
2) in the common case we should never come close to reaching that number?

Then the algorithm for choosing which pcpu to have the interrupt
delivered to would be:
 1. Set p = current_pcpu
 2. if len(list(p)) < v_tot / p_tot + K, choose p
 3. Otherwise, choose another p and goto 2

The "choose another p" could be random / pseudorandom selection, or it
could be some other mechanism (rotate, look for pcpus nearby on the
topology, choose the lowest one, &c).  But as long as we check the
length before assigning it, it should satisfy Jan.
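
A minimal C sketch of that selection loop, just to make the shape
concrete (illustrative only: pi_blocking_list_len() and PI_LIST_SLACK
are assumed helpers/constants rather than existing Xen interfaces, and
the fallback here simply cycles through online pCPUs, though any of the
schemes above would do):

    /* Sketch of the selection policy described above -- not actual Xen code. */
    #define PI_LIST_SLACK 128                    /* the constant K */

    static unsigned int pi_choose_dest_cpu(unsigned int local_cpu,
                                           unsigned int v_tot,
                                           unsigned int p_tot)
    {
        unsigned int limit = v_tot / p_tot + PI_LIST_SLACK;
        unsigned int cpu = local_cpu;

        /*
         * Prefer the local pCPU so the common case stays fast; move on to
         * other online pCPUs only when the local list is already at the
         * limit.  The per-pCPU limits sum to v_tot + p_tot * K, which
         * exceeds the number of blocked vCPUs, so some pCPU is always
         * below its limit and the loop terminates.
         */
        while ( pi_blocking_list_len(cpu) >= limit )  /* assumed helper */
            cpu = cpumask_cycle(cpu, &cpu_online_map);

        return cpu;
    }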

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-08  9:13           ` George Dunlap
@ 2017-05-08  9:24             ` Jan Beulich
  2017-05-08 17:37               ` Chao Gao
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2017-05-08  9:24 UTC (permalink / raw)
  To: George Dunlap, Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	xen-devel, JunNakajima

(Chao Gao got lost from the recipients list again; re-adding)

>>> On 08.05.17 at 11:13, <george.dunlap@citrix.com> wrote:
> On 08/05/17 17:15, Chao Gao wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>> is the result:
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>>
>>>>> Hi.
>>>>>
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
> 40 NIC.
>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>> 2900 vcpus can be added into the list.
>>>>
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>> Looks like Chao Gao has been dropped ...
>>>
>>>>  I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>> good limit scenario.
>>>>
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>>
>>>>  -George
>>>>
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>> 22us would certainly be fine, if this was the worst case scenario.
>>> I'm not sure the value measured for 128 list entries can be easily
>>> scaled to several thousands of them, due cache and/or NUMA
>>> effects. I continue to think that we primarily need theoretical
>>> proof of an upper boundary on list length being enforced, rather
>>> than any measurements or randomized balancing. And just to be
>>> clear - if someone overloads their system, I do not see a need to
>>> have a guaranteed maximum list traversal latency here. All I ask
>>> for is that list traversal time scales with total vCPU count divided
>>> by pCPU count.
>> 
>> Thanks, Jan & George.
>> 
>> I think it is more clear to me about what should I do next step.
>> 
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>> is on the pCPU) to make the common case fast.
>> 2. With the list grows to a point where we think it may consumers too
>> much time to traverse the list, also distribute wakeup interrupt to local
>> pCPU, ignoring admin intentionally overloads their system.
>> 3. When the list length reachs the theoretic average maximum (means
>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>> to another underutilized pCPU.
> 
> By "maximal vCPU count" do you mean, "total number of active vcpus on
> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
> domans * 512 vcpus each or something)?

The former.

> What about saying that the limit of vcpus for any given pcpu will be:
>  (v_tot / p_tot) + K
> where v_tot is the total number of vcpus on the system, p_tot is the
> total number of pcpus in the system, and K is a fixed number (such as
> 128) such that 1) the additional time walking the list is minimal, and
> 2) in the common case we should never come close to reaching that number?
> 
> Then the algorithm for choosing which pcpu to have the interrupt
> delivered to would be:
>  1. Set p = current_pcpu
>  2. if len(list(p)) < v_tot / p_tot + k, choose p
>  3. Otherwise, choose another p and goto 2
> 
> The "choose another p" could be random / pseudorandom selection, or it
> could be some other mechanism (rotate, look for pcpus nearby on the
> topology, choose the lowest one, &c).  But as long as we check the
> length before assigning it, it should satisfy Jan.

Right.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-03 10:21       ` Jan Beulich
@ 2017-05-08 16:15         ` Chao Gao
  2017-05-08  8:39           ` Jan Beulich
  2017-05-08  9:13           ` George Dunlap
  0 siblings, 2 replies; 21+ messages in thread
From: Chao Gao @ 2017-05-08 16:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, JunNakajima

On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>> On 02/05/17 06:45, Chao Gao wrote:
>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>> PI blocking list) with and without the three latter patches. Here
>>>>> is the result:
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |W/ the patches |         6            |       22740        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |W/O the patches|        128           |       46481        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>
>>>> Any chance you could trace how long the list traversal took?  It would
>>>> be good for future reference to have an idea what kinds of timescales
>>>> we're talking about.
>>> 
>>> Hi.
>>> 
>>> I made a simple test to get the time consumed by the list traversal.
>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>>> All guest vcpu are pinned to one pcpu. collect data by
>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>> xentrace_format. When the list length is about 128, the traversal time
>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>> 2900 vcpus can be added into the list.
>> 
>> Great, thanks Chao Gao, that's useful.
>
>Looks like Chao Gao has been dropped ...
>
>>  I'm not sure a fixed latency --
>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>> to have interrupts staggered at 500us intervals it could easily lock up
>> the cpu for nearly a full second.  But I'm having trouble formulating a
>> good limit scenario.
>> 
>> In any case, 22us should be safe from a security standpoint*, and 128
>> should be pretty safe from a "make the common case fast" standpoint:
>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>> traffic will be the least of your performance problems I should think.
>> 
>>  -George
>> 
>> * Waiting for Jan to contradict me on this one. :-)
>
>22us would certainly be fine, if this was the worst case scenario.
>I'm not sure the value measured for 128 list entries can be easily
>scaled to several thousands of them, due cache and/or NUMA
>effects. I continue to think that we primarily need theoretical
>proof of an upper boundary on list length being enforced, rather
>than any measurements or randomized balancing. And just to be
>clear - if someone overloads their system, I do not see a need to
>have a guaranteed maximum list traversal latency here. All I ask
>for is that list traversal time scales with total vCPU count divided
>by pCPU count.

Thanks, Jan & George.

I think it is now clearer to me what I should do next.

In my understanding, we should distribute the wakeup interrupts like
this:
1. By default, deliver the interrupt to the local pCPU ('local' means
the vCPU is on that pCPU) to make the common case fast.
2. When the list grows to a point where we think traversing it may
consume too much time, still deliver the wakeup interrupt to the local
pCPU, ignoring that the admin has intentionally overloaded the system.
3. When the list length reaches the theoretical average maximum (i.e.
the maximal vCPU count divided by the pCPU count), deliver the wakeup
interrupt to another, underutilized pCPU.

But I am confused: if we don't care that someone overloads their
system, why do we need stage #3? And without it, I see no way to meet
Jan's request that the list traversal time scale with the total vCPU
count divided by the pCPU count. Or will we reach stage #3 before
stage #2?

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-08  8:39           ` Jan Beulich
@ 2017-05-08 16:38             ` Chao Gao
  0 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-05-08 16:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, JunNakajima

On Mon, May 08, 2017 at 02:39:25AM -0600, Jan Beulich wrote:
>>>> On 08.05.17 at 18:15, <chao.gao@intel.com> wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>> is the result:
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>> 
>>>>> Hi.
>>>>> 
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
>> 40 NIC.
>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>> 2900 vcpus can be added into the list.
>>>> 
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>>Looks like Chao Gao has been dropped ...
>>>
>>>>  I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>> good limit scenario.
>>>> 
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>> 
>>>>  -George
>>>> 
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>>22us would certainly be fine, if this was the worst case scenario.
>>>I'm not sure the value measured for 128 list entries can be easily
>>>scaled to several thousands of them, due cache and/or NUMA
>>>effects. I continue to think that we primarily need theoretical
>>>proof of an upper boundary on list length being enforced, rather
>>>than any measurements or randomized balancing. And just to be
>>>clear - if someone overloads their system, I do not see a need to
>>>have a guaranteed maximum list traversal latency here. All I ask
>>>for is that list traversal time scales with total vCPU count divided
>>>by pCPU count.
>> 
>> Thanks, Jan & George.
>> 
>> I think it is more clear to me about what should I do next step.
>> 
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>> is on the pCPU) to make the common case fast.
>> 2. With the list grows to a point where we think it may consumers too
>> much time to traverse the list, also distribute wakeup interrupt to local
>> pCPU, ignoring admin intentionally overloads their system.
>> 3. When the list length reachs the theoretic average maximum (means
>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>> to another underutilized pCPU.
>> 
>> But, I am confused about that If we don't care someone overload their
>> system, why we need the stage #3?  If not, I have no idea to meet Jan's
>> request, the list traversal time scales with total vCPU count divided by
>> pCPU count. Or we will reach stage #3 before stage #2?
>
>Things is that imo point 2 is too fuzzy to be of any use, i.e. 3 should
>take effect immediately. We don't mean to ignore any admin decisions
>here, it is just that if they overload their systems, the net effect of 3
>may still not be good enough to provide smooth behavior. But that's
>then a result of them overloading their systems in the first place. IOW,
>you should try to evenly distribute vCPU-s as soon as their count on
>a given pCPU exceeds the calculated average.

Very helpful and reasonable. Thank you, Jan.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
  2017-05-08  9:24             ` Jan Beulich
@ 2017-05-08 17:37               ` Chao Gao
  0 siblings, 0 replies; 21+ messages in thread
From: Chao Gao @ 2017-05-08 17:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, JunNakajima

On Mon, May 08, 2017 at 03:24:47AM -0600, Jan Beulich wrote:
>(Chao Gao got lost from the recipients list again; re-adding)
>
>>>> On 08.05.17 at 11:13, <george.dunlap@citrix.com> wrote:
>> On 08/05/17 17:15, Chao Gao wrote:
>>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>>> is the result:
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>>> we're talking about.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
>> 40 NIC.
>>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>>> 2900 vcpus can be added into the list.
>>>>>
>>>>> Great, thanks Chao Gao, that's useful.
>>>>
>>>> Looks like Chao Gao has been dropped ...
>>>>
>>>>>  I'm not sure a fixed latency --
>>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>>> good limit scenario.
>>>>>
>>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>>> traffic will be the least of your performance problems I should think.
>>>>>
>>>>>  -George
>>>>>
>>>>> * Waiting for Jan to contradict me on this one. :-)
>>>>
>>>> 22us would certainly be fine, if this was the worst case scenario.
>>>> I'm not sure the value measured for 128 list entries can be easily
>>>> scaled to several thousands of them, due cache and/or NUMA
>>>> effects. I continue to think that we primarily need theoretical
>>>> proof of an upper boundary on list length being enforced, rather
>>>> than any measurements or randomized balancing. And just to be
>>>> clear - if someone overloads their system, I do not see a need to
>>>> have a guaranteed maximum list traversal latency here. All I ask
>>>> for is that list traversal time scales with total vCPU count divided
>>>> by pCPU count.
>>> 
>>> Thanks, Jan & George.
>>> 
>>> I think it is more clear to me about what should I do next step.
>>> 
>>> In my understanding, we should distribute the wakeup interrupts like
>>> this:
>>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>>> is on the pCPU) to make the common case fast.
>>> 2. With the list grows to a point where we think it may consumers too
>>> much time to traverse the list, also distribute wakeup interrupt to local
>>> pCPU, ignoring admin intentionally overloads their system.
>>> 3. When the list length reachs the theoretic average maximum (means
>>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>>> to another underutilized pCPU.
>> 
>> By "maximal vCPU count" do you mean, "total number of active vcpus on
>> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
>> domans * 512 vcpus each or something)?
>
>The former.

Ok. Actually I meant the latter. But now, I realize I was wrong.

>
>> What about saying that the limit of vcpus for any given pcpu will be:
>>  (v_tot / p_tot) + K
>> where v_tot is the total number of vcpus on the system, p_tot is the
>> total number of pcpus in the system, and K is a fixed number (such as
>> 128) such that 1) the additional time walking the list is minimal, and
>> 2) in the common case we should never come close to reaching that number?
>> 
>> Then the algorithm for choosing which pcpu to have the interrupt
>> delivered to would be:
>>  1. Set p = current_pcpu
>>  2. if len(list(p)) < v_tot / p_tot + k, choose p
>>  3. Otherwise, choose another p and goto 2
>> 
>> The "choose another p" could be random / pseudorandom selection, or it
>> could be some other mechanism (rotate, look for pcpus nearby on the
>> topology, choose the lowest one, &c).  But as long as we check the
>> length before assigning it, it should satisfy Jan.

Very clear and helpful. Otherwise, I might have spent several months
reaching this solution. Thanks, George. :)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

end of thread, other threads:[~2017-05-08 17:37 UTC | newest]

Thread overview: 21+ messages
2017-04-26  0:52 [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Chao Gao
2017-04-26  0:52 ` [PATCH 1/4] xentrace: add TRC_HVM_VT_D_PI_BLOCK Chao Gao
2017-04-26  0:52 ` [PATCH 2/4] VT-d PI: Randomly Distribute entries to all online pCPUs' pi blocking list Chao Gao
2017-04-26  0:52 ` [PATCH 3/4] VT-d PI: Add reference count to pi_desc Chao Gao
2017-04-26  0:52 ` [PATCH 4/4] VT-d PI: Don't add vCPU to PI blocking list for a case Chao Gao
2017-04-26  8:19 ` [PATCH 0/4] mitigate the per-pCPU blocking list may be too long Jan Beulich
2017-04-26  3:30   ` Chao Gao
2017-04-26 10:52     ` Jan Beulich
2017-04-26 16:39 ` George Dunlap
2017-04-27  0:43   ` Chao Gao
2017-04-27  9:44     ` George Dunlap
2017-04-27  5:02       ` Chao Gao
2017-05-02  5:45   ` Chao Gao
2017-05-03 10:08     ` George Dunlap
2017-05-03 10:21       ` Jan Beulich
2017-05-08 16:15         ` Chao Gao
2017-05-08  8:39           ` Jan Beulich
2017-05-08 16:38             ` Chao Gao
2017-05-08  9:13           ` George Dunlap
2017-05-08  9:24             ` Jan Beulich
2017-05-08 17:37               ` Chao Gao
