* [PATCH v3 0/3] mitigate the per-pCPU blocking list may be too long
@ 2017-05-24  6:56 Chao Gao
  2017-05-24  6:56 ` [PATCH v3] VT-d PI: track the vcpu number in pi blocking list Chao Gao
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Chao Gao @ 2017-05-24  6:56 UTC
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

VT-d PI introduces a per-pCPU blocking list to track the vCPUs blocked
on a given pCPU. Theoretically, a single host can have up to 32K
domains with 128 vCPUs each. If all vCPUs are blocked on the same pCPU,
4M vCPUs end up on the same list, and traversing this list consumes too
much time. More discussion can be found in [1,2,3].

To mitigate this issue, this series puts vcpus on another pcpu's list
when the local pcpu's list length reaches an upper bound. The upper
bound is determined by the total vcpu count and total pcpu count in the
system.
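
As an illustration (the numbers here are hypothetical; the formula is
the one used in patch 3/3): with the bound set to
total vcpus / total pcpus + 128, a host running 2048 hvm vcpus on 64
pcpus caps each blocking list at 2048/64 + 128 = 160 entries, rather
than the theoretical worst case of all 2048 vcpus landing on one
pcpu's list.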

PATCH 1/3 adds a counter to track each per-pCPU blocking list's length
and, to aid analysis and obtain the list length, adds some relevant
events to xentrace. With this patch, data can be acquired to help
validate patch 3/3.

PATCH 2/3 uses a global variable to track how many hvm vcpus are on
the system. It is used to calculate the limit on the number of blocked
vcpus on a given pcpu.

In patch 3/3, a policy is used to restrict the vcpu count on a given
pcpu's pi blocking list so that the list cannot grow too long.

In this new version, the approach of only putting vcpus that are
referenced by at least one IRTE on the list has been dropped, because:
1. The implementation is intrusive.
2. So far, no evidence shows it is necessary.

[1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
[2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567
[3] https://lists.gt.net/xen/devel/472749?search_string=enable%20vt-d%20pi%20by%20default;#472749

Chao Gao (3):
  VT-d PI: track the vcpu number in pi blocking list
  vcpu: track hvm vcpu number on the system
  VT-d PI: restrict the vcpu number on a given pcpu

 tools/xentrace/formats          |   3 +
 tools/xentrace/xenalyze.c       | 159 +++++++++++++++++++++++++++++++++++++---
 xen/arch/x86/domain.c           |  10 +++
 xen/arch/x86/hvm/vmx/vmx.c      |  97 +++++++++++++++++++++---
 xen/include/asm-x86/domain.h    |   2 +
 xen/include/asm-x86/hvm/trace.h |   2 +
 xen/include/public/trace.h      |   6 ++
 7 files changed, 261 insertions(+), 18 deletions(-)

-- 
1.8.3.1



* [PATCH v3] VT-d PI: track the vcpu number in pi blocking list
  2017-05-24  6:56 [PATCH v3 0/3] mitigate the per-pCPU blocking list may be too long Chao Gao
@ 2017-05-24  6:56 ` Chao Gao
  2017-06-16 14:34   ` Jan Beulich
  2017-05-24  6:56 ` [PATCH v3 2/3] vcpu: track hvm vcpu number on the system Chao Gao
  2017-05-24  6:56 ` [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
  2 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-05-24  6:56 UTC
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Jun Nakajima, Chao Gao

This patch adds a field, counter, to struct vmx_pi_blocking_vcpu to track
how many entries are on the PI blocking list. In order to analyze list
operation frequency and obtain the list length, add some relevant events
to xentrace and some associated code to xenalyze. As for xenalyze: since
removal from the PI list can happen in various contexts (interrupt
context and non-interrupt context) and can be done by the vcpu itself or
by others, some of these contexts would cause toplevel_assert_check() to
fail. To bypass this check, this patch adds a new event type,
TRC_HVM_ASYNC, for which the context is not checked.
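
(For readers decoding the raw trace IDs: as the trace.h hunk below shows,
HVM events are 0x0008xxxx with the sub-class in bits 12-15: entry/exit
0x1000, handler 0x2000, emulation 0x4000, and the new async class 0x8000.
Hence PI_LIST_ADD, a handler event, is 0x00082026, and ASYNC_PI_LIST_DEL
is 0x00088001, matching the formats entries added below.)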

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 tools/xentrace/formats          |   3 +
 tools/xentrace/xenalyze.c       | 154 +++++++++++++++++++++++++++++++++++++---
 xen/arch/x86/hvm/vmx/vmx.c      |  33 +++++++--
 xen/include/asm-x86/hvm/trace.h |   1 +
 xen/include/public/trace.h      |   5 ++
 5 files changed, 182 insertions(+), 14 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..54e0b11 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,9 @@
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_LIST_ADD [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
+
+0x00088001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  ASYNC_PI_LIST_DEL [ domid = 0x%(1)04x vcpu = 0x%(2)04x ]
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index fa608ad..fbc2429 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -296,6 +296,7 @@ struct symbol_struct {
 };
 
 void error(enum error_level l, struct record_info *ri);
+struct vcpu_data * vcpu_find(int did, int vid);
 
 void parse_symbol_file(char *fn) {
     unsigned long long last_addr = 0;
@@ -944,6 +945,7 @@ enum {
     HVM_EVENT_TRAP,
     HVM_EVENT_TRAP_DEBUG,
     HVM_EVENT_VLAPIC,
+    HVM_EVENT_PI_LIST_ADD,
     HVM_EVENT_HANDLER_MAX
 };
 char * hvm_event_handler_name[HVM_EVENT_HANDLER_MAX] = {
@@ -979,13 +981,15 @@ char * hvm_event_handler_name[HVM_EVENT_HANDLER_MAX] = {
     "realmode_emulate",
     "trap",
     "trap_debug",
-    "vlapic"
+    "vlapic",
+    "pi_list_add",
 };
 
 enum {
     HVM_VOL_VMENTRY,
     HVM_VOL_VMEXIT,
     HVM_VOL_HANDLER,
+    HVM_VOL_ASYNC,
     HVM_VOL_MAX
 };
 
@@ -1012,6 +1016,7 @@ char *hvm_vol_name[HVM_VOL_MAX] = {
     [HVM_VOL_VMENTRY]="vmentry",
     [HVM_VOL_VMEXIT] ="vmexit",
     [HVM_VOL_HANDLER]="handler",
+    [HVM_VOL_ASYNC]="async",
 };
 
 enum {
@@ -1337,6 +1342,9 @@ struct hvm_data {
         struct {
             struct io_address *mmio, *pio;
         } io;
+        struct {
+            int pi_list_add, pi_list_del;
+        } pi;
     } summary;
 
     /* In-flight accumulation information */
@@ -1391,6 +1399,9 @@ struct hvm_data {
 
     /* Historical info */
     tsc_t last_rdtsc;
+
+    /* Destination pcpu of the posted-interrupt wakeup interrupt */
+    int pi_cpu;
 };
 
 enum {
@@ -1457,6 +1468,8 @@ void init_hvm_data(struct hvm_data *h, struct vcpu_data *v) {
     }
     for(i=0; i<GUEST_INTERRUPT_MAX+1; i++)
         h->summary.guest_interrupt[i].count=0;
+
+    h->pi_cpu = -1;
 }
 
 /* PV data */
@@ -1766,6 +1779,14 @@ char * toplevel_name[TOPLEVEL_MAX] = {
     [TOPLEVEL_HW]="hw",
 };
 
+enum {
+    SUBLEVEL_HVM_ENTRYEXIT=1,
+    SUBLEVEL_HVM_HANDLER,
+    SUBLEVEL_HVM_EMUL,
+    SUBLEVEL_HVM_ASYNC,
+    SUBLEVEL_HVM_MAX=SUBLEVEL_HVM_ASYNC+1,
+};
+
 struct trace_volume {
     unsigned long long toplevel[TOPLEVEL_MAX];
     unsigned long long sched_verbose;
@@ -1852,6 +1873,9 @@ struct pcpu_info {
         tsc_t tsc;
         struct cycle_summary idle, running, lost;
     } time;
+
+    /* Posted Interrupt List Length */
+    int pi_list_length;
 };
 
 void __fill_in_record_info(struct pcpu_info *p);
@@ -4726,6 +4750,71 @@ void hvm_generic_dump(struct record_info *ri, char * prefix)
     printf(" ]\n");
 }
 
+void hvm_pi_list_add_process(struct record_info *ri, struct hvm_data *h)
+{
+    struct {
+        int did;
+        int vid;
+        int pcpu;
+        int len;
+    } *data;
+    struct vcpu_data *v;
+
+    data = (typeof(data))ri->rec.u.tsc.data;
+    v = vcpu_find(data->did, data->vid);
+    if ( !v->hvm.init )
+        init_hvm_data(&v->hvm, v);
+
+    if ( opt.dump_all )
+        printf("d%uv%u is added to pi blocking list of pcpu%u. "
+               "The list length is now %d\n",
+               data->did, data->vid, data->pcpu, data->len);
+
+    v->hvm.pi_cpu = data->pcpu;
+    v->hvm.summary.pi.pi_list_add++;
+    if ( data->pcpu > P.max_active_pcpu || !P.pcpu[data->pcpu].active )
+        fprintf(stderr, "Strange! pcpu%u is inactive but a vcpu is added "
+                "to it\n", data->pcpu);
+    else if ( P.pcpu[data->pcpu].pi_list_length == -1 )
+        P.pcpu[data->pcpu].pi_list_length = data->len;
+    else if ( data->len != ++P.pcpu[data->pcpu].pi_list_length )
+        /*
+         * Correct pi list length. Removing one vcpu that is already on the
+         * list before tracing starts would not decrease the pi list length;
+         * the list length would be inaccurate.
+         */
+        P.pcpu[data->pcpu].pi_list_length = data->len;
+}
+
+void hvm_pi_list_del_process(struct record_info *ri)
+{
+    struct {
+        int did;
+        int vid;
+    } *data;
+    struct vcpu_data *v;
+
+    data = (typeof(data))ri->rec.u.tsc.data;
+    v = vcpu_find(data->did, data->vid);
+    if ( !v->hvm.init )
+        init_hvm_data(&v->hvm, v);
+
+    if ( opt.dump_all )
+    {
+        if ( v->hvm.pi_cpu != -1 )
+            printf("d%uv%u is removed from pi blocking list of pcpu%u\n",
+                   data->did, data->vid, v->hvm.pi_cpu);
+        else
+            printf("d%uv%u is removed from pi blocking list\n",
+                   data->did, data->vid);
+    }
+
+    if ( P.pcpu[v->hvm.pi_cpu].pi_list_length != -1 )
+        P.pcpu[v->hvm.pi_cpu].pi_list_length--;
+    v->hvm.pi_cpu = -1;
+    v->hvm.summary.pi.pi_list_del++;
+}
+
 void hvm_handler_process(struct record_info *ri, struct hvm_data *h) {
     /* Wait for first vmexit to initialize */
     if(!h->init)
@@ -4763,6 +4852,9 @@ void hvm_handler_process(struct record_info *ri, struct hvm_data *h) {
     case TRC_HVM_INTR_WINDOW:
         hvm_intr_window_process(ri, h);
         break;
+    case TRC_HVM_PI_LIST_ADD:
+        hvm_pi_list_add_process(ri, h);
+        break;
     case TRC_HVM_OP_DESTROY_PROC:
         if(h->v->cr3.data) {
             struct cr3_value_struct *cur = h->v->cr3.data;
@@ -4862,7 +4954,6 @@ needs_vmexit:
 void vcpu_next_update(struct pcpu_info *p, struct vcpu_data *next, tsc_t tsc);
 void vcpu_prev_update(struct pcpu_info *p, struct vcpu_data *prev,
                       tsc_t tsc, int new_runstate);
-struct vcpu_data * vcpu_find(int did, int vid);
 void lose_vcpu(struct vcpu_data *v, tsc_t tsc);
 
 int domain_runstate(struct domain_data *d) {
@@ -5267,6 +5358,21 @@ void hvm_process(struct pcpu_info *p)
     struct vcpu_data *v = p->current;
     struct hvm_data *h = &v->hvm;
 
+    if(ri->evt.sub == 8)
+    {
+        UPDATE_VOLUME(p, hvm[HVM_VOL_ASYNC], ri->size);
+
+        switch(ri->event) {
+        case TRC_HVM_ASYNC_PI_LIST_DEL:
+            hvm_pi_list_del_process(ri);
+            break;
+
+        default:
+            fprintf(warn, "Unknown hvm event: %x\n", ri->event);
+        }
+        return;
+    }
+
     assert(p->current);
 
     if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
@@ -5359,6 +5465,10 @@ void hvm_summary(struct hvm_data *h) {
                   i, h->summary.ipi_count[i]);
    hvm_io_address_summary(h->summary.io.pio, "IO address summary:");
    hvm_io_address_summary(h->summary.io.mmio, "MMIO address summary:");
+
+   printf("Posted Interrupt:\n");
+   printf(" List Add: %u\n", h->summary.pi.pi_list_add);
+   printf(" List Del: %u\n", h->summary.pi.pi_list_del);
 }
 
 /* ---- Shadow records ---- */
@@ -8962,6 +9072,7 @@ off_t scan_for_new_pcpu(off_t offset) {
 
         p->file_offset = offset;
         p->next_cpu_change_offset = offset;
+        p->pi_list_length = -1;
 
         record_order_insert(p);
 
@@ -9142,7 +9253,6 @@ int find_toplevel_event(struct record_info *ri)
     return toplevel;
 }
 
-
 void process_cpu_change(struct pcpu_info *p) {
     struct record_info *ri = &p->ri;
     struct cpu_change_data *r = (typeof(r))ri->d;
@@ -9190,6 +9300,7 @@ void process_cpu_change(struct pcpu_info *p) {
         p2->ri.d = p2->ri.rec.u.notsc.data;
         p2->file_offset = p->file_offset;
         p2->next_cpu_change_offset = p->file_offset;
+        p2->pi_list_length = -1;
 
         fprintf(warn, "%s: Activating pcpu %d at offset %lld\n",
                 __func__, r->cpu, (unsigned long long)p->file_offset);
@@ -9276,26 +9387,52 @@ void process_cpu_change(struct pcpu_info *p) {
     }
 }
 
-struct tl_assert_mask {
+struct assert_mask {
     unsigned p_current:1,
-        not_idle_domain:1;
+        not_idle_domain:1,
+        check_sublevel:1;
     int vcpu_data_mode;
+    int sub_max;
+    struct assert_mask *sub;
 };
-static struct tl_assert_mask tl_assert_checks[TOPLEVEL_MAX] = {
-    [TRC_HVM_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+
+static struct assert_mask sl_hvm_assert_mask[SUBLEVEL_HVM_MAX] = {
+    [SUBLEVEL_HVM_ENTRYEXIT] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+    [SUBLEVEL_HVM_HANDLER] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+    [SUBLEVEL_HVM_EMUL] = { .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
+};
+
+static struct assert_mask tl_assert_checks[TOPLEVEL_MAX] = {
+    [TRC_HVM_MAIN]={ .check_sublevel=1, .sub=sl_hvm_assert_mask, .sub_max=SUBLEVEL_HVM_MAX },
     [TRC_SHADOW_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_HVM },
     [TRC_PV_MAIN]={ .p_current=1, .not_idle_domain=1, .vcpu_data_mode=VCPU_DATA_PV },
 };
 
+/* Other sub types are reserved */
+static int sublevel_to_index[16] = {
+    [1] = 1,
+    [2] = 2,
+    [4] = 3,
+    [8] = 4,
+};
+
 /* There are a lot of common assumptions for the various processing
  * routines.  Check them all in one place, doing something else if
  * they don't pass. */
 int toplevel_assert_check(int toplevel, struct pcpu_info *p)
 {
-    struct tl_assert_mask mask;
+    struct assert_mask mask;
 
     mask = tl_assert_checks[toplevel];
 
+    if (mask.check_sublevel)
+    {
+        int sub = sublevel_to_index[p->ri.evt.sub];
+
+        assert(mask.sub && (sub < mask.sub_max));
+        mask = mask.sub[sub];
+    }
+
     if (mask.p_current && p->current == NULL)
     {
         fprintf(warn, "WARNING: p->current null!  Not processing\n");
@@ -9362,7 +9499,6 @@ void process_record(struct pcpu_info *p) {
     if(opt.dump_all)
         create_dump_header(ri, p);
 
-
     toplevel = find_toplevel_event(ri);
     if ( toplevel < 0 )
         return;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c8ef18a..abbf16b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    uint32_t             counter;
 };
 
 /*
@@ -119,6 +120,9 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    per_cpu(vmx_pi_blocking, v->processor).counter++;
+    TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id,
+             v->processor, per_cpu(vmx_pi_blocking, v->processor).counter);
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -186,6 +190,9 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        container_of(pi_blocking_list_lock,
+                     struct vmx_pi_blocking_vcpu, lock)->counter--;
+        TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id, v->vcpu_id);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -220,6 +227,8 @@ void vmx_pi_desc_fixup(unsigned int cpu)
 
     list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
     {
+        struct vcpu *v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+
         /*
          * Suppress notification or we may miss an interrupt when the
          * target cpu is dying.
@@ -234,8 +243,11 @@ void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
             vmx->pi_blocking.lock = NULL;
-            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+            vcpu_unblock(v);
         }
         else
         {
@@ -258,6 +270,12 @@ void vmx_pi_desc_fixup(unsigned int cpu)
 
             list_move(&vmx->pi_blocking.list,
                       &per_cpu(vmx_pi_blocking, new_cpu).list);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
+            per_cpu(vmx_pi_blocking, new_cpu).counter++;
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
+            TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id,
+                     new_cpu, per_cpu(vmx_pi_blocking, new_cpu).counter);
             vmx->pi_blocking.lock = new_lock;
 
             spin_unlock(new_lock);
@@ -2358,9 +2376,9 @@ static struct hvm_function_table __initdata vmx_function_table = {
 static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
 {
     struct arch_vmx_struct *vmx, *tmp;
-    spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
-    struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+    unsigned int cpu = smp_processor_id();
+    spinlock_t *lock = &per_cpu(vmx_pi_blocking, cpu).lock;
+    struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
@@ -2376,10 +2394,15 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     {
         if ( pi_test_on(&vmx->pi_desc) )
         {
+            struct vcpu *v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+
             list_del(&vmx->pi_blocking.list);
+            TRACE_2D(TRC_HVM_ASYNC_PI_LIST_DEL, v->domain->domain_id,
+                     v->vcpu_id);
+            per_cpu(vmx_pi_blocking, cpu).counter--;
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
-            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+            vcpu_unblock(v);
         }
     }
 
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..b74ffdd 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,7 @@
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_LIST_ADD      DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..15ea87c 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -53,6 +53,7 @@
 #define TRC_HVM_ENTRYEXIT   0x00081000   /* VMENTRY and #VMEXIT       */
 #define TRC_HVM_HANDLER     0x00082000   /* various HVM handlers      */
 #define TRC_HVM_EMUL        0x00084000   /* emulated devices */
+#define TRC_HVM_ASYNC       0x00088000   /* Asynchronous events */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
 #define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
@@ -234,6 +235,7 @@
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_PI_LIST_ADD      (TRC_HVM_HANDLER + 0x26)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
@@ -257,6 +259,9 @@
 #define TRC_HVM_EMUL_PIC_PEND_IRQ_CALL (TRC_HVM_EMUL + 0x10)
 #define TRC_HVM_EMUL_LAPIC_PIC_INTR    (TRC_HVM_EMUL + 0x11)
 
+/* Trace asynchronous events for HVM */
+#define TRC_HVM_ASYNC_PI_LIST_DEL      (TRC_HVM_ASYNC + 0x1)
+
 /* trace events for per class */
 #define TRC_PM_FREQ_CHANGE      (TRC_HW_PM + 0x01)
 #define TRC_PM_IDLE_ENTRY       (TRC_HW_PM + 0x02)
-- 
1.8.3.1



* [PATCH v3 2/3] vcpu: track hvm vcpu number on the system
  2017-05-24  6:56 [PATCH v3 0/3] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-05-24  6:56 ` [PATCH v3] VT-d PI: track the vcpu number in pi blocking list Chao Gao
@ 2017-05-24  6:56 ` Chao Gao
  2017-06-16 14:44   ` Jan Beulich
  2017-05-24  6:56 ` [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
  2 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-05-24  6:56 UTC
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich, Chao Gao

This number is used to calculate how many hvm vcpu on a pcpu on average.
This counting is x86 specific.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/domain.c        | 10 ++++++++++
 xen/include/asm-x86/domain.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 13cdc50..050fe0e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -66,6 +66,9 @@
 
 DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
 
+/* how many hvm vcpu on this system? */
+atomic_t num_hvm_vcpus;
+
 static void default_idle(void);
 void (*pm_idle) (void) __read_mostly = default_idle;
 void (*dead_idle) (void) __read_mostly = default_dead_idle;
@@ -467,7 +470,11 @@ int vcpu_initialise(struct vcpu *v)
             xfree(v->arch.pv_vcpu.trap_ctxt);
     }
     else if ( !is_idle_domain(v->domain) )
+    {
         vpmu_initialise(v);
+        if ( is_hvm_domain(v->domain) )
+            atomic_inc(&num_hvm_vcpus);
+    }
 
     return rc;
 }
@@ -489,7 +496,10 @@ void vcpu_destroy(struct vcpu *v)
         vpmu_destroy(v);
 
     if ( is_hvm_vcpu(v) )
+    {
         hvm_vcpu_destroy(v);
+        atomic_dec(&num_hvm_vcpus);
+    }
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
 }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 924caac..769cde2 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -31,6 +31,8 @@
 #define nmi_pending            nmi_state.pending
 #define mce_pending            mce_state.pending
 
+extern atomic_t num_hvm_vcpus;
+
 struct trap_bounce {
     uint32_t      error_code;
     uint8_t       flags; /* TBF_ */
-- 
1.8.3.1



* [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-24  6:56 [PATCH v3 0/3] mitigate the per-pCPU blocking list may be too long Chao Gao
  2017-05-24  6:56 ` [PATCH v3] VT-d PI: track the vcpu number in pi blocking list Chao Gao
  2017-05-24  6:56 ` [PATCH v3 2/3] vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-05-24  6:56 ` Chao Gao
  2017-06-16 15:09   ` Jan Beulich
  2 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-05-24  6:56 UTC
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, Jan Beulich, Chao Gao

Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
too many vCPUs are blocked on a given pCPU, it will incur that the list
grows too long. After a simple analysis, there are 32k domains and
128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
PI blocking list. When a wakeup interrupt arrives, the list is
traversed to find some specific vCPUs to wake them up. This traversal in
that case would consume much time.

To mitigate this issue, this patch limits the vcpu number on a given
pCPU, taking factors such as performance of the common case, current hvm vcpu
count and current pcpu count into consideration. With this method, for
the common case, it works fast and for some extreme cases, the list
length is under control.

The change in vmx_pi_unblock_vcpu() is for the following case:
vcpu is running -> try to block (this patch may change NDST to
another pCPU) but notification comes in time, thus the vcpu
goes back to running station -> VM-entry (we should set NDST again,
reverting the change we make to NDST in vmx_vcpu_block())

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 70 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index abbf16b..91ee65b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -100,16 +100,62 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+/*
+ * By default, the local pcpu (means the one the vcpu is currently running on)
+ * is chosen as the destination of wakeup interrupt. But if the vcpu number of
+ * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
+ * one.
+ *
+ * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
+ * v_tot is the total number of hvm vcpus on the system, p_tot is the total
+ * number of pcpus in the system, and K is a fixed number. Experiments show
+ * the maximum time to wakeup a vcpu from a 128-entry blocking list is about
+ * 22us, which is tolerable. So choose 128 as the fixed number K.
+ *
+ * This policy makes sure:
+ * 1) for common cases, the limit won't be reached and the local pcpu is used
+ * which is beneficial to performance (at least, avoid an IPI when unblocking
+ * vcpu).
+ * 2) for the worst case, the blocking list length scales with the vcpu count
+ * divided by the pcpu count.
+ */
+#define PI_LIST_FIXED_NUM 128
+#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
+                           PI_LIST_FIXED_NUM)
+
+static bool pi_over_limit(int count)
+{
+    /* Compare w/ constant first to save an atomic read in the common case */
+    return ((count > PI_LIST_FIXED_NUM) &&
+            (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
+                PI_LIST_FIXED_NUM));
+}
+
 static void vmx_vcpu_block(struct vcpu *v)
 {
     unsigned long flags;
-    unsigned int dest;
+    unsigned int dest, pi_cpu;
     spinlock_t *old_lock;
-    spinlock_t *pi_blocking_list_lock =
-		&per_cpu(vmx_pi_blocking, v->processor).lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    spinlock_t *pi_blocking_list_lock;
+
+    pi_cpu = v->processor;
+ retry:
+    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
 
     spin_lock_irqsave(pi_blocking_list_lock, flags);
+    /*
+     * Since pi_cpu may now be one other than the one v is currently
+     * running on, check to make sure that it's still up.
+     */
+    if ( unlikely((!cpu_online(pi_cpu)) ||
+             pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )
+    {
+        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
+        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+        goto retry;
+    }
+
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
 
@@ -120,11 +166,11 @@ static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
-    per_cpu(vmx_pi_blocking, v->processor).counter++;
-    TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id,
-             v->processor, per_cpu(vmx_pi_blocking, v->processor).counter);
+    per_cpu(vmx_pi_blocking, pi_cpu).counter++;
+    TRACE_4D(TRC_HVM_PI_LIST_ADD, v->domain->domain_id, v->vcpu_id, pi_cpu,
+             per_cpu(vmx_pi_blocking, pi_cpu).counter);
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
-                  &per_cpu(vmx_pi_blocking, v->processor).list);
+                  &per_cpu(vmx_pi_blocking, pi_cpu).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 
     ASSERT(!pi_test_sn(pi_desc));
@@ -134,6 +180,13 @@ static void vmx_vcpu_block(struct vcpu *v)
     ASSERT(pi_desc->ndst ==
            (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
 
+    if ( unlikely(pi_cpu != v->processor) )
+    {
+        dest = cpu_physical_id(pi_cpu);
+        write_atomic(&pi_desc->ndst,
+                 (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
+    }
+
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
 }
 
@@ -163,6 +216,7 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned int dest = cpu_physical_id(v->processor);
 
     /*
      * Set 'NV' field back to posted_intr_vector, so the
@@ -170,6 +224,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
      * it is running in non-root mode.
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
+    write_atomic(&pi_desc->ndst,
+                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
 
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
-- 
1.8.3.1



* Re: [PATCH v3] VT-d PI: track the vcpu number in pi blocking list
  2017-05-24  6:56 ` [PATCH v3] VT-d PI: track the vcpu number in pi blocking list Chao Gao
@ 2017-06-16 14:34   ` Jan Beulich
  2017-06-22  5:16     ` Chao Gao
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2017-06-16 14:34 UTC
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>  struct vmx_pi_blocking_vcpu {
>      struct list_head     list;
>      spinlock_t           lock;
> +    uint32_t             counter;

Is there any reason for this to be fixed width? Other than that the
non-tracing parts look fine, but I really wonder whether the
introduction of the counter and the tracing wouldn't better be split.

Jan



* Re: [PATCH v3 2/3] vcpu: track hvm vcpu number on the system
  2017-05-24  6:56 ` [PATCH v3 2/3] vcpu: track hvm vcpu number on the system Chao Gao
@ 2017-06-16 14:44   ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2017-06-16 14:44 UTC
  To: Chao Gao; +Cc: Andrew Cooper, xen-devel

>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
> This number is used to calculate how many hvm vcpu on a pcpu on average.

This doesn't read well. Perhaps "This number is used to calculate the
average vcpus per pcpu ratio"?

> This counting is x86 specific.

By titling the patch accordingly you wouldn't even need to say so
here.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -66,6 +66,9 @@
>  
>  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>  
> +/* how many hvm vcpu on this system? */

"Total number of HVM vCPU-s on this system" or some such.

> @@ -467,7 +470,11 @@ int vcpu_initialise(struct vcpu *v)
>              xfree(v->arch.pv_vcpu.trap_ctxt);
>      }
>      else if ( !is_idle_domain(v->domain) )
> +    {
>          vpmu_initialise(v);
> +        if ( is_hvm_domain(v->domain) )
> +            atomic_inc(&num_hvm_vcpus);
> +    }

Please instead put this in the code block calling
hvm_vcpu_initialise().
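
A minimal sketch of that suggestion (the code block calling
hvm_vcpu_initialise() is not visible in this hunk, so its exact shape
here is assumed):

    if ( is_hvm_domain(v->domain) )
    {
        rc = hvm_vcpu_initialise(v);
        if ( !rc )
            atomic_inc(&num_hvm_vcpus);
    }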

> @@ -489,7 +496,10 @@ void vcpu_destroy(struct vcpu *v)
>          vpmu_destroy(v);
>  
>      if ( is_hvm_vcpu(v) )
> +    {
>          hvm_vcpu_destroy(v);
> +        atomic_dec(&num_hvm_vcpus);
> +    }

To mirror initialization behavior I think you want to decrement the
counter before calling the function.
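
That is, mirroring the quoted hunk with the order swapped, roughly:

    if ( is_hvm_vcpu(v) )
    {
        atomic_dec(&num_hvm_vcpus);
        hvm_vcpu_destroy(v);
    }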

Jan



* Re: [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-05-24  6:56 ` [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu Chao Gao
@ 2017-06-16 15:09   ` Jan Beulich
  2017-06-23  4:22     ` Chao Gao
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2017-06-16 15:09 UTC
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
> too many vCPUs are blocked on a given pCPU, it will incur that the list
> grows too long. After a simple analysis, there are 32k domains and
> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
> PI blocking list. When a wakeup interrupt arrives, the list is
> traversed to find some specific vCPUs to wake them up. This traversal in
> that case would consume much time.
> 
> To mitigate this issue, this patch limits the vcpu number on a given
> pCPU,

This would be a bug, but I think it's the description which is wrong
(or at least imprecise): You don't limit the number of vCPU-s _run_
on any pCPU, but those tracked on any pCPU-s blocking list. Please
say so here to avoid confusion.

> taking factors such as performance of the common case, current hvm vcpu
> count and current pcpu count into consideration. With this method, for
> the common case, it works fast and for some extreme cases, the list
> length is under control.
> 
> The change in vmx_pi_unblock_vcpu() is for the following case:
> vcpu is running -> try to block (this patch may change NDST to
> another pCPU) but notification comes in time, thus the vcpu

What does "but notification comes in time" mean?

> goes back to running station -> VM-entry (we should set NDST again,

s/station/state/ ?

> reverting the change we make to NDST in vmx_vcpu_block())

Overall I'm not sure I really understand what you try to explain
here.

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -100,16 +100,62 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>  
> +/*
> + * By default, the local pcpu (means the one the vcpu is currently running on)
> + * is chosen as the destination of wakeup interrupt. But if the vcpu number of
> + * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
> + * one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. Experments shows
> + * the maximum time to wakeup a vcpu from a 128-entry blocking list is about
> + * 22us, which is tolerable. So choose 128 as the fixed number K.

Giving any kind of absolute time value requires also stating on what
hardware this was measured.

> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +
> +static bool pi_over_limit(int count)

Can a caller validly pass a negative argument? Otherwise unsigned int
please.

> +{
> +    /* Compare w/ constant first to save an atomic read in the common case */

As an atomic read is just a normal read on x86, does this really matter?

> +    return ((count > PI_LIST_FIXED_NUM) &&
> +            (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
> +                PI_LIST_FIXED_NUM));

Right above you've #define-d PI_LIST_LIMIT - why do you open code
it here? Also note that the outer pair of parentheses is pointless (and
hampering readability).
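
Folding in both remarks (and the unsigned int point above), the helper
could plausibly become:

    static bool pi_over_limit(unsigned int count)
    {
        /* Logically redundant, but short-circuits the atomic read. */
        return count > PI_LIST_FIXED_NUM && count > PI_LIST_LIMIT;
    }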

>  static void vmx_vcpu_block(struct vcpu *v)
>  {
>      unsigned long flags;
> -    unsigned int dest;
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -		&per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +
> +    pi_cpu = v->processor;
> + retry:
> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
>  
>      spin_lock_irqsave(pi_blocking_list_lock, flags);
> +    /*
> +     * Since pi_cpu may now be one other than the one v is currently
> +     * running on, check to make sure that it's still up.
> +     */
> +    if ( unlikely((!cpu_online(pi_cpu)) ||

But this check comes too late then: You've already used per-CPU
data of an offline CPU by the time you make it here. I'm also not
sure you really need the lock here. A read_atomic() of the counter
would suffice afaics (of course the writers then would need to
use write_atomic() or add_sized()).

> +             pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )

Indentation.

> +    {
> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);

With this, how could the CPU be offline by the time you make it
back to the check above.

> +        spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> +        goto retry;

Please try (hard) to avoid the goto here (we generally accept its
use only for keeping error handling reasonably simple).
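
Taking the last few review points together, a goto-free shape for the
pcpu selection might look roughly like this (a sketch only; it also uses
the lock-free counter probe suggested above):

    pi_cpu = v->processor;
    while ( !cpu_online(pi_cpu) ||
            pi_over_limit(read_atomic(&per_cpu(vmx_pi_blocking,
                                               pi_cpu).counter)) )
        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);

    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
    spin_lock_irqsave(pi_blocking_list_lock, flags);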

> @@ -134,6 +180,13 @@ static void vmx_vcpu_block(struct vcpu *v)
>      ASSERT(pi_desc->ndst ==
>             (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
>  
> +    if ( unlikely(pi_cpu != v->processor) )
> +    {
> +        dest = cpu_physical_id(pi_cpu);
> +        write_atomic(&pi_desc->ndst,
> +                 (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
> +    }

This kind of contradicts the ASSERT() visible in patch context above.
I _think_ it is nevertheless correct, but it would be good to attach a
comment here explaining this.

Also there's no point in parenthesizing the argument of a function
call.

> @@ -170,6 +224,8 @@ static void vmx_pi_unblock_vcpu(struct vcpu *v)
>       * it is running in non-root mode.
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> +    write_atomic(&pi_desc->ndst,
> +                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));

The explanation of why this is needed should also be put here
rather than (only) in the commit message.

Jan


* Re: [PATCH v3] VT-d PI: track the vcpu number in pi blocking list
  2017-06-16 14:34   ` Jan Beulich
@ 2017-06-22  5:16     ` Chao Gao
  2017-06-22  6:51       ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-06-22  5:16 UTC
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

On Fri, Jun 16, 2017 at 08:34:20AM -0600, Jan Beulich wrote:
>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>>  struct vmx_pi_blocking_vcpu {
>>      struct list_head     list;
>>      spinlock_t           lock;
>> +    uint32_t             counter;
>
>Is there any reason for this to be fixed width? Other than that the

I will use 'int' instead of 'uint32_t'.

>non-tracing parts look fine, but I really wonder whether the
>introduction of the counter and the tracing wouldn't better be split.

Will split them apart.

Thanks
chao


* Re: [PATCH v3] VT-d PI: track the vcpu number in pi blocking list
  2017-06-22  5:16     ` Chao Gao
@ 2017-06-22  6:51       ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2017-06-22  6:51 UTC
  To: Chao Gao
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima

>>> On 22.06.17 at 07:16, <chao.gao@intel.com> wrote:
> On Fri, Jun 16, 2017 at 08:34:20AM -0600, Jan Beulich wrote:
>>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -82,6 +82,7 @@ static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
>>>  struct vmx_pi_blocking_vcpu {
>>>      struct list_head     list;
>>>      spinlock_t           lock;
>>> +    uint32_t             counter;
>>
>>Is there any reason for this to be fixed width? Other than that the
> 
> I will use 'int' instead of 'uint32_t'.

Can the counter go negative? I don't think so, in which case you
really want to use "unsigned int".

Jan



* Re: [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-06-16 15:09   ` Jan Beulich
@ 2017-06-23  4:22     ` Chao Gao
  2017-06-23  7:58       ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-06-23  4:22 UTC
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Fri, Jun 16, 2017 at 09:09:13AM -0600, Jan Beulich wrote:
>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
>> too many vCPUs are blocked on a given pCPU, it will incur that the list
>> grows too long. After a simple analysis, there are 32k domains and
>> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
>> PI blocking list. When a wakeup interrupt arrives, the list is
>> traversed to find some specific vCPUs to wake them up. This traversal in
>> that case would consume much time.
>> 
>> To mitigate this issue, this patch limits the vcpu number on a given
>> pCPU,
>
>This would be a bug, but I think it's the description which is wrong
>(or at least imprecise): You don't limit the number of vCPU-s _run_
>on any pCPU, but those tracked on any pCPU-s blocking list. Please
>say so here to avoid confusion.

Agree.

>
>> taking factors such as performance of the common case, current hvm vcpu
>> count and current pcpu count into consideration. With this method, for
>> the common case, it works fast and for some extreme cases, the list
>> length is under control.
>> 
>> The change in vmx_pi_unblock_vcpu() is for the following case:
>> vcpu is running -> try to block (this patch may change NDST to
>> another pCPU) but notification comes in time, thus the vcpu
>
>What does "but notification comes in time" mean?
>

I mean when local_events_need_delivery() in vcpu_block() returns true.

>> goes back to running station -> VM-entry (we should set NDST again,
>
>s/station/state/ ?
>
>> reverting the change we make to NDST in vmx_vcpu_block())
>
>Overall I'm not sure I really understand what you try to explain
>here.

Will put it above the related change.

I wanted to explain why we need this change if a vcpu can be added
to a remote pcpu (meaning the vcpu isn't running on that pcpu).

A vcpu may go through two different paths from calling vcpu_block()
to VM-entry:
Path1: vcpu_block()->vmx_vcpu_block()->local_events_need_delivery(return
true) -> vmx_pi_unblock_vcpu (during VM-entry)
Path2: vcpu_block()->vmx_vcpu_block()->local_events_need_delivery(return
false) -> vmx_pi_switch_from() -> vmx_pi_switch_to()
->vmx_pi_unblock_vcpu (during VM-entry)

Since migrating a vcpu to another pcpu would lead to an incorrect
pi_desc->ndst, vmx_pi_switch_to() re-assigns pi_desc->ndst.
That was enough for Path1 (nothing changed the pi_desc->ndst field or
the binding between pcpu and vcpu) and Path2. But now,
vmx_vcpu_block() may change pi_desc->ndst to another pcpu to receive the
wakeup interrupt. If local_events_need_delivery() returns true, we
should correct pi_desc->ndst back to the current pcpu in vmx_pi_unblock_vcpu().
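
A sketch of the in-code comment being promised here, placed next to the
new write in vmx_pi_unblock_vcpu() (the wording is illustrative, not
from the patch):

    /*
     * vmx_vcpu_block() may have pointed NDST at a remote pcpu as the
     * wakeup interrupt destination. Restore NDST to the pcpu this vcpu
     * runs on before VM entry.
     */
    write_atomic(&pi_desc->ndst,
                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
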

>
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -100,16 +100,62 @@ void vmx_pi_per_cpu_init(unsigned int cpu)
>>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>>  }
>>  
>> +/*
>> + * By default, the local pcpu (means the one the vcpu is currently running on)
>> + * is chosen as the destination of wakeup interrupt. But if the vcpu number of
>> + * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
>> + * one.
>> + *
>> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
>> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
>> + * number of pcpus in the system, and K is a fixed number. Experments shows
>> + * the maximum time to wakeup a vcpu from a 128-entry blocking list is about
>> + * 22us, which is tolerable. So choose 128 as the fixed number K.
>
>Giving any kind of absolute time value requires also stating on what
>hardware this was measured.
>
>> + * This policy makes sure:
>> + * 1) for common cases, the limit won't be reached and the local pcpu is used
>> + * which is beneficial to performance (at least, avoid an IPI when unblocking
>> + * vcpu).
>> + * 2) for the worst case, the blocking list length scales with the vcpu count
>> + * divided by the pcpu count.
>> + */
>> +#define PI_LIST_FIXED_NUM 128
>> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
>> +                           PI_LIST_FIXED_NUM)
>> +
>> +static bool pi_over_limit(int count)
>
>Can a caller validly pass a negative argument? Otherwise unsigned int
>please.
>
>> +{
>> +    /* Compare w/ constant first to save an atomic read in the common case */
>
>As an atomic read is just a normal read on x86, does this really matter?

agree.

>
>> +    return ((count > PI_LIST_FIXED_NUM) &&
>> +            (count > (atomic_read(&num_hvm_vcpus) / num_online_cpus()) +
>> +                PI_LIST_FIXED_NUM));
>
>Right above you've #define-d PI_LIST_LIMIT - why do you open code
>it here? Also note that the outer pair of parentheses is pointless (and
>hampering readability).
>
>>  static void vmx_vcpu_block(struct vcpu *v)
>>  {
>>      unsigned long flags;
>> -    unsigned int dest;
>> +    unsigned int dest, pi_cpu;
>>      spinlock_t *old_lock;
>> -    spinlock_t *pi_blocking_list_lock =
>> -		&per_cpu(vmx_pi_blocking, v->processor).lock;
>>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> +    spinlock_t *pi_blocking_list_lock;
>> +
>> +    pi_cpu = v->processor;
>> + retry:
>> +    pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
>>  
>>      spin_lock_irqsave(pi_blocking_list_lock, flags);
>> +    /*
>> +     * Since pi_cpu may now be one other than the one v is currently
>> +     * running on, check to make sure that it's still up.
>> +     */
>> +    if ( unlikely((!cpu_online(pi_cpu)) ||
>
>But this check comes too late then: You've already used per-CPU
>data of an offline CPU by the time you make it here. I'm also not
>sure you really need the lock here. A read_atomic() of the counter
>would suffice afaics (of course the writers then would need to
>use write_atomic() or add_sized()).
>
>> +             pi_over_limit(per_cpu(vmx_pi_blocking, pi_cpu).counter)) )
>
>Indentation.
>
>> +    {
>> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
>
>With this, how could the CPU be offline by the time you make it
>back to the check above.

Thanks to point it out. It would incur a bug.

I think we should do things like this:

IF pi_blocking_list of current pcpu isn't over the limit:
	add the vcpu to current pcpu.
ELSE
	add the vcpu to another pcpu.

To add the vcpu to another pcpu, we should avoid concurrency with
vmx_pi_desc_fixup(). Thus, a lock (e.g. remote_pi_list_lock)
can solve this potential concurrency. Using this lock like below:

in vmx_vcpu_block():

IF pi_blocking_list of current pcpu isn't over the limit:
	add the vcpu to current pcpu
ELSE
	acquire remote_pi_list_lock
	choose another online pcpu	(don't worry that this pcpu would go
					 offline, for we hold the
					 remote_pi_list_lock, which blocks
					 calling vmx_pi_desc_fixup(),
					 thus at least we can add this
					 vcpu to the pi_blocking_list
					 before cleanup)
	add the vcpu to the chosen pcpu
	release remote_pi_list_lock

in vmx_pi_desc_fixup():
acquire remote_pi_list_lock
...
release remote_pi_list_lock

Thanks
Chao


* Re: [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-06-23  4:22     ` Chao Gao
@ 2017-06-23  7:58       ` Jan Beulich
  2017-06-23  8:33         ` Chao Gao
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2017-06-23  7:58 UTC
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 23.06.17 at 06:22, <chao.gao@intel.com> wrote:
> On Fri, Jun 16, 2017 at 09:09:13AM -0600, Jan Beulich wrote:
>>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>>> +    {
>>> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
>>
>>With this, how could the CPU be offline by the time you make it
>>back to the check above.
> 
> Thanks to point it out. It would incur a bug.

I don't understand what you're trying to tell me here.

> I think we should do things like this:
> 
> IF pi_blocking_list of current pcpu isn't over the limit:
> 	add the vcpu to current pcpu.
> ELSE
> 	add the vcpu to another pcpu.

But that's what supposedly the patch already tries to do?

> To add the vcpu to another pcpu, we should avoid concurrency with
> vmx_pi_desc_fixup(). Thus, a lock (e.g. remote_pi_list_lock)

Please use names which actually exist in source code, or make
clear what exactly you're referring to. Talking of
remote_pi_list_lock, which I can't find any instance of, does not
help the discussion, as you leave me guessing whose lock you
mean to acquire.

> can solve this potential concurrency. Using this lock like below:
> 
> in vmx_vcpu_block():
> 
> IF pi_blocking_list of current pcpu isn't over the limit:
> 	add the vcpu to current pcpu
> ELSE
> 	acquire remote_pi_list_lock
> 	choose another online pcpu	(don't worry that this pcpu would go
> 					 offline, for we hold the
> 					 remote_pi_list_lock, which blocks
> 					 calling vmx_pi_desc_fixup(),
> 					 thus at least we can add this
> 					 vcpu to the pi_blocking_list
> 					 before cleanup)

I can't see why you need to hold a lock to make sure a pCPU doesn't
go offline - pCPU offlining happens in stop_machine context anyway.

Jan

> 	add the vcpu to the chosen pcpu
> 	release remote_pi_list_lock
> 
> in vmx_pi_desc_fixup():
> acquire remote_pi_list_lock
> ...
> release remote_pi_list_lock
> 
> Thanks
> Chao





* Re: [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-06-23  7:58       ` Jan Beulich
@ 2017-06-23  8:33         ` Chao Gao
  2017-06-23  9:05           ` Jan Beulich
  0 siblings, 1 reply; 13+ messages in thread
From: Chao Gao @ 2017-06-23  8:33 UTC
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Fri, Jun 23, 2017 at 01:58:52AM -0600, Jan Beulich wrote:
>>>> On 23.06.17 at 06:22, <chao.gao@intel.com> wrote:
>> On Fri, Jun 16, 2017 at 09:09:13AM -0600, Jan Beulich wrote:
>>>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>>>> +    {
>>>> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
>>>
>>>With this, how could the CPU be offline by the time you make it
>>>back to the check above.
>> 
>> Thanks to point it out. It would incur a bug.
>
>I don't understand what you're trying to tell me here.
>

I agree with you that we might use an offline CPU's structure.
and I treat it as a bug.

>> I think we should do things like this:
>> 
>> IF pi_blocking_list of current pcpu isn't over the limit:
>> 	add the vcpu to current pcpu.
>> ELSE
>> 	add the vcpu to another pcpu.
>
>But that's what supposedly the patch already tries to do?

Yes, it is. But I want to highlight a key problem here: we may add a
vcpu to an offline pcpu's pi blocking list. To avoid this, we can
introduce a _new_ lock to avoid racing with vmx_pi_desc_fixup(). This
version tried to use an existing lock, i.e.
per_cpu(vmx_pi_blocking, pi_cpu).lock, to solve the key problem.
But as you pointed out, it may use an offline CPU's structure.

>
>> To add the vcpu to another pcpu, we should avoid concurrency with
>> vmx_pi_desc_fixup(). Thus, a lock (e.g. remote_pi_list_lock)
>
>Please use names which actually exist in source code, or make
>clear what exactly you're referring to. Talking of
>remote_pi_list_lock, which I can't find any instance of, does not
>help the discussion, as you leave me guessing whose lock you
>mean to acquire.

It is a new lock; I was trying to envision how it would take effect.

>
>> can solve this potential concurrency. Using this lock like below:
>> 
>> in vmx_vcpu_block():
>> 
>> IF pi_blocking_list of current pcpu isn't over the limit:
>> 	add the vcpu to current pcpu
>> ELSE
>> 	acquire remote_pi_list_lock
>> 	choose another online pcpu	(don't worry that this pcpu would go
>> 					 offline, for we hold the
>> 					 remote_pi_list_lock, which blocks
>> 					 calling vmx_pi_desc_fixup(),
>> 					 thus at least we can add this
>> 					 vcpu to the pi_blocking_list
>> 					 before cleanup)
>
>I can't see why you need to hold a lock to make sure a pCPU doesn't
>go offline - pCPU offlining happens in stop_machine context anyway.

In cpu_down(), cpu_online_map is cleared by this line:
	if ( (err = stop_machine_run(take_cpu_down, NULL, cpu)) < 0 )

And vmx_pi_desc_fixup() is called later by
	notifier_call_chain(&cpu_chain, CPU_DEAD, hcpu, NULL);

So I think if we acquire the new lock, namely remote_pi_list_lock, before
choosing another online pcpu, we can make sure the vcpu is added to
the chosen pcpu's pi blocking list before any cleanup of the same pi
blocking list.

Thanks
Chao


* Re: [PATCH v3 3/3] VT-d PI: restrict the vcpu number on a given pcpu
  2017-06-23  8:33         ` Chao Gao
@ 2017-06-23  9:05           ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2017-06-23  9:05 UTC
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

>>> On 23.06.17 at 10:33, <chao.gao@intel.com> wrote:
> On Fri, Jun 23, 2017 at 01:58:52AM -0600, Jan Beulich wrote:
>>>>> On 23.06.17 at 06:22, <chao.gao@intel.com> wrote:
>>> On Fri, Jun 16, 2017 at 09:09:13AM -0600, Jan Beulich wrote:
>>>>>>> On 24.05.17 at 08:56, <chao.gao@intel.com> wrote:
>>>>> +    {
>>>>> +        pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
>>>>
>>>>With this, how could the CPU be offline by the time you make it
>>>>back to the check above.
>>> 
>>> Thanks to point it out. It would incur a bug.
>>
>>I don't understand what you're trying to tell me here.
> 
> I agree with you that we might use an offline CPU's structure.
> and I treat it as a bug.

Ah, but I didn't say so. Instead I questioned how you would see
the pCPU going offline between the code above and the use of
pi_cpu further down. Most or all of what you say later in your
reply seems inapplicable because of this apparent
misunderstanding, so I'm not going to further comment on that.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread
