* [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity
@ 2015-03-26  9:48 Justin T. Weaver
  2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Justin T. Weaver @ 2015-03-26  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, henric

Hello,

The credit2 vcpu scheduler currently ignores per-vcpu hard and soft affinity
masks.

The first patch updates the scheduler to ensure that vcpus only run
on pcpus on which they are allowed to run (hard affinity). I tested it using
xl vcpu-pin and xl vcpu-list: scripted calls to vcpu-pin changed the affinity
in different ways, and vcpu-list confirmed that each vcpu ran only where it
was allowed to run.

Patch two factors out code from the credit scheduler (sched_credit.c) related
to soft affinity load balancing and places it in a common header (sched-if.h).
This allows credit2 to reuse the functions and defines in the soft affinity
patch (patch four of this series).

The third patch only indents some code in sched_credit2.c in order to make
reviewing patch four easier.

The fourth patch updates the scheduler to make more informed run queue and
pcpu decisions by considering which pcpus the vcpus prefer to run on (soft
affinity).
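
To make the overall approach concrete, here is a rough sketch of the
two-step decision pattern the series uses, built on the helpers factored
out in patch two. It is illustrative only, not code from the patches; the
helper names match the series, but the surrounding function is made up.

/* Illustrative: pick a pcpu for vc from online, preferring soft affinity. */
static int pick_cpu_two_step(const struct vcpu *vc, const cpumask_t *online,
                             cpumask_t *scratch)
{
    int step;

    for_each_sched_balance_step( step )
    {
        /* Skip the soft step when soft affinity adds no real information. */
        if ( step == SCHED_BALANCE_SOFT_AFFINITY
             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
            continue;

        /* soft && hard affinity on the first pass, hard only on the second. */
        sched_balance_cpumask(vc, step, scratch);
        cpumask_and(scratch, scratch, online);
        if ( !cpumask_empty(scratch) )
            return cpumask_any(scratch);
    }

    /* A valid vcpu always has hard affinity with at least one online pcpu. */
    return vc->processor;
}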

I tested this series on a NUMA machine with Dario Faggioli's "fix
per-socket runqueue setup" patch series applied. Without it, the credit2
scheduler only creates one run queue, regardless of the type of machine.

Here are the results I gathered from testing. Each guest had 2 vcpus and 1GB
of memory. The hardware consisted of two quad-core Intel Xeon X5570 processors
and 8GB of RAM per node. The sysbench memory test was run with the num-threads
option set to four, simultaneously on two, then six, then ten VMs.
Each result below is an average of three runs.

-------------------------------------------------------
| Sysbench memory, throughput MB/s (higher is better) |
-------------------------------------------------------
| #VMs |  No affinity  |   Pinning  | NUMA scheduling |
|   2  |    417.01     |    406.16  |     428.83      |
|   6  |    389.31     |    407.07  |     402.90      |
|  10  |    317.91     |    320.53  |     321.98      |
-------------------------------------------------------

Despite the added overhead, NUMA scheduling performed best in both the two-
and ten-VM tests.

Thank you for all the review comments!

Justin Weaver
Master's candidate
University of Hawaiʻi at Mānoa

---
[1/4] sched: credit2: respect per-vcpu hard affinity
[2/4] sched: factor out per-vcpu affinity related code to common header file
[3/4] sched: credit2: indent code sections to make review of patch 4/4 easier
[4/4] sched: credit2: consider per-vcpu soft affinity

xen/common/sched_credit.c  |   87 +------
xen/common/sched_credit2.c |  551 ++++++++++++++++++++++++++++++++++++--------
xen/include/xen/sched-if.h |   65 ++++++
3 files changed, 536 insertions(+), 167 deletions(-)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
@ 2015-03-26  9:48 ` Justin T. Weaver
  2015-03-31 14:37   ` George Dunlap
  2015-05-06 12:39   ` Dario Faggioli
  2015-03-26  9:48 ` [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file Justin T. Weaver
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Justin T. Weaver @ 2015-03-26  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

by making sure that vcpus only run on the pcpu(s) they are allowed to
run on based on their hard affinity cpu masks.
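
Concretely, every site that used to hand out an arbitrary pcpu from a run
queue now intersects the candidate mask with the vcpu's hard affinity
first. A minimal sketch of that invariant (illustrative only, not the
literal patch code; it reuses the per-cpu scratch mask, csched2_cpumask,
that this patch introduces):

/* Illustrative: pick a pcpu from rqd that svc is allowed to run on. */
static int pick_allowed_cpu(struct csched2_vcpu *svc,
                            struct csched2_runqueue_data *rqd)
{
    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity, &rqd->active);
    if ( !cpumask_empty(csched2_cpumask) )
        return cpumask_any(csched2_cpumask);

    /* No allowed pcpu in this run queue; fall back (see get_fallback_cpu). */
    return get_fallback_cpu(svc);
}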

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
Changes in v3:
(all v3 changes are based on v2 review comments)
 * Renamed cpumask to scratch_mask
 * Renamed function get_safe_pcpu to get_fallback_cpu
 * Improved comment for function get_fallback_cpu
 * Replaced cpupool_online_cpumask with VCPU2ONLINE in function
   get_fallback_cpu to shorten the line
 * Added #define for VCPU2ONLINE (probably should be factored out of
   schedule.c and here, and put into a common header)
 * Modified code in function get_fallback_cpu: moved check for current
   processor to the top; added an ASSERT because the mask should not be empty
 * Modified code and comment in function choose_cpu in migrate request section
 * Added comment to function choose_cpu explaining why the vcpu passed to the
   function might not have hard affinity with any of the pcpus in its assigned
   run queue
 * Modified code in function choose_cpu to make it more readable
 * Moved/changed "We didn't find ..." comment in function choose_cpu
 * Combined migration flag check and hard affinity check into valid migration
   check helper function; replaced code in three places in function
   balance_load with call to the helper function
 * Changed a BUG_ON to an ASSERT in function csched2_vcpu_migrate   
 * Moved vc->processor assignment in function csched2_vcpu_migrate to an else
   block to execute only if current and destination run queues are the same;
   Note: without the processor assignment here the vcpu might be assigned to a
   processor it no longer is allowed to run on. In that case, function
   runq_candidate may only get called for the vcpu's old processor, and
   runq_candidate will no longer let a vcpu run on a processor that it's not
   allowed to run on (because of the hard affinity check first introduced in
   v1 of this patch).
 * csched2_init: changed xzalloc_bytes to xmalloc_array for allocation of
   scratch_mask
 * csched2_deinit: removed scratch_mask freeing loop; it was not needed
Changes in v2:
 * Added dynamically allocated cpu masks to avoid putting them on the stack;
   replaced temp masks from v1 throughout
 * Added helper function for code suggested in v1 review and called it in two
   locations in function choose_cpu
 * Removed v1 change to comment in the beginning of choose_cpu
 * Replaced two instances of cpumask_and/cpumask_empty with cpumask_intersects
 * Removed v1 re-work of code in function migrate; only change in migrate in
   v2 is the assignment of a valid pcpu from the destination run queue to
   vc->processor
 * In function csched2_vcpu_migrate: removed change from v1 that called
   function migrate even if cur and dest run queues were the same in order
   to get a runq_tickle call; added processor assignment to new_cpu to fix
   the real underlying issue which was the vcpu not getting a call to
   sched_move_irqs
 * Removed the looping added in v1 in function balance_load; may be added back
   later because it would help to have balance_load be more aware of hard
   affinity, but adding it does not affect credit2's current inability to
   respect hard affinity.
 * Removed coding style fix in function balance_load
 * Improved comment in function runq_candidate
---
 xen/common/sched_credit2.c |  139 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 113 insertions(+), 26 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 7581731..af716e4 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -176,6 +176,7 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
 /* CPU to runqueue struct macro */
 #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
+#define VCPU2ONLINE(_v)     cpupool_online_cpumask((_v)->domain->cpupool)
 
 /*
  * Shifts for load average.
@@ -194,6 +195,12 @@ int opt_overload_balance_tolerance=-3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Use this to avoid having too many cpumask_t structs on the stack
+ */
+static cpumask_t **scratch_mask = NULL;
+#define csched2_cpumask scratch_mask[smp_processor_id()]
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -268,6 +275,32 @@ struct csched2_dom {
     uint16_t nr_vcpus;
 };
 
+/*
+ * When a hard affinity change occurs, we may not be able to check some or
+ * all of the other run queues for a valid new processor for the given vcpu
+ * because (in function choose_cpu) either the trylock on the private data
+ * failed or the trylock on each run queue with valid processor(s) for svc
+ * failed. In these cases, this function is used to pick a pcpu that svc can
+ * run on. It returns svc's current pcpu if valid, or a different pcpu from
+ * the run queue svc is currently assigned to, or if none of those are valid,
+ * it returns a pcpu from the intersection of svc's hard affinity and the
+ * domain's online cpumask.
+ */
+static int get_fallback_cpu(struct csched2_vcpu *svc)
+{
+    if ( likely(cpumask_test_cpu(svc->vcpu->processor,
+        svc->vcpu->cpu_hard_affinity)) )
+        return svc->vcpu->processor;
+
+    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+        &svc->rqd->active);
+    if ( cpumask_empty(csched2_cpumask) )
+        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+            VCPU2ONLINE(svc->vcpu));
+    ASSERT( !cpumask_empty(csched2_cpumask) );
+
+    return cpumask_any(csched2_cpumask);
+}
 
 /*
  * Time-to-credit, credit-to-time.
@@ -501,8 +534,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
-    /* Get a mask of idle, but not tickled */
+    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
     
     /* If it's not empty, choose one */
     i = cpumask_cycle(cpu, &mask);
@@ -513,9 +547,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
-     * skipping cpus which have been tickled but not scheduled yet */
+     * skipping cpus which have been tickled but not scheduled yet,
+     * that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
 
     for_each_cpu(i, &mask)
     {
@@ -1078,9 +1114,8 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             d2printk("%pv -\n", svc->vcpu);
             clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
-        /* Leave it where it is for now.  When we actually pay attention
-         * to affinity we'll have to figure something out... */
-        return vc->processor;
+
+        return get_fallback_cpu(svc);
     }
 
     /* First check to see if we're here because someone else suggested a place
@@ -1091,41 +1126,53 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         {
             printk("%s: Runqueue migrate aborted because target runqueue disappeared!\n",
                    __func__);
-            /* Fall-through to normal cpu pick */
         }
         else
         {
-            d2printk("%pv +\n", svc->vcpu);
-            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
-            goto out_up;
+            cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
+                &svc->migrate_rqd->active);
+            new_cpu = cpumask_any(csched2_cpumask);
+            if ( new_cpu < nr_cpu_ids )
+            {
+                d2printk("%pv +\n", svc->vcpu);
+                goto out_up;
+            }
         }
+        /* Fall-through to normal cpu pick */
     }
 
-    /* FIXME: Pay attention to cpu affinity */                                                                                      
-
     min_avgload = MAX_LOAD;
 
     /* Find the runqueue with the lowest instantaneous load */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
-        s_time_t rqd_avgload;
+        s_time_t rqd_avgload = MAX_LOAD;
 
         rqd = prv->rqd + i;
 
         /* If checking a different runqueue, grab the lock,
-         * read the avg, and then release the lock.
+         * check hard affinity, read the avg, and then release the lock.
          *
          * If on our own runqueue, don't grab or release the lock;
          * but subtract our own load from the runqueue load to simulate
-         * impartiality */
+         * impartiality.
+         *
+         * svc's hard affinity may have changed; this function is the
+         * credit 2 scheduler's first opportunity to react to the change,
+         * so it is possible here that svc does not have hard affinity
+         * with any of the pcpus of svc's currently assigned run queue.
+         */
         if ( rqd == svc->rqd )
         {
-            rqd_avgload = rqd->b_avgload - svc->avgload;
+            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                rqd_avgload = rqd->b_avgload - svc->avgload;
         }
         else if ( spin_trylock(&rqd->lock) )
         {
-            rqd_avgload = rqd->b_avgload;
+            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                rqd_avgload = rqd->b_avgload;
+
             spin_unlock(&rqd->lock);
         }
         else
@@ -1138,12 +1185,14 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         }
     }
 
-    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
+    /* We didn't find anyone (most likely because of spinlock contention). */
     if ( min_rqi == -1 )
-        new_cpu = vc->processor;
+        new_cpu = get_fallback_cpu(svc);
     else
     {
-        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
+        cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
+            &prv->rqd[min_rqi].active);
+        new_cpu = cpumask_any(csched2_cpumask);
         BUG_ON(new_cpu >= nr_cpu_ids);
     }
 
@@ -1223,7 +1272,12 @@ static void migrate(const struct scheduler *ops,
             on_runq=1;
         }
         __runq_deassign(svc);
-        svc->vcpu->processor = cpumask_any(&trqd->active);
+
+        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+            &trqd->active);
+        svc->vcpu->processor = cpumask_any(csched2_cpumask);
+        BUG_ON(svc->vcpu->processor >= nr_cpu_ids);
+
         __runq_assign(svc, trqd);
         if ( on_runq )
         {
@@ -1237,6 +1291,20 @@ static void migrate(const struct scheduler *ops,
     }
 }
 
+/*
+ * Migration of vcpu svc to run queue rqd is a valid option if svc is not
+ * already flagged to migrate and if svc is allowed to run on at least one of
+ * the pcpus assigned to rqd based on svc's hard affinity mask.
+ */
+static bool_t valid_vcpu_migration(struct csched2_vcpu *svc,
+                                   struct csched2_runqueue_data *rqd)
+{
+    if ( test_bit(__CSFLAG_runq_migrate_request, &svc->flags)
+        || !cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active) )
+        return 0;
+    else
+        return 1;
+}
 
 static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
 {
@@ -1345,8 +1413,7 @@ retry:
 
         __update_svc_load(ops, push_svc, 0, now);
 
-        /* Skip this one if it's already been flagged to migrate */
-        if ( test_bit(__CSFLAG_runq_migrate_request, &push_svc->flags) )
+        if ( !valid_vcpu_migration(push_svc, st.orqd) )
             continue;
 
         list_for_each( pull_iter, &st.orqd->svc )
@@ -1358,8 +1425,7 @@ retry:
                 __update_svc_load(ops, pull_svc, 0, now);
             }
         
-            /* Skip this one if it's already been flagged to migrate */
-            if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
+            if ( !valid_vcpu_migration(pull_svc, st.lrqd) )
                 continue;
 
             consider(&st, push_svc, pull_svc);
@@ -1375,8 +1441,7 @@ retry:
     {
         struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
         
-        /* Skip this one if it's already been flagged to migrate */
-        if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
+        if ( !valid_vcpu_migration(pull_svc, st.lrqd) )
             continue;
 
         /* Consider pull only */
@@ -1415,11 +1480,14 @@ csched2_vcpu_migrate(
 
     /* Check if new_cpu is valid */
     BUG_ON(!cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
+    ASSERT(cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));
 
     trqd = RQD(ops, new_cpu);
 
     if ( trqd != svc->rqd )
         migrate(ops, svc, trqd, NOW());
+    else
+        vc->processor = new_cpu;
 }
 
 static int
@@ -1638,6 +1706,10 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
 
+        /* Only consider vcpus that are allowed to run on this processor. */
+        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+            continue;
+
         /* If this is on a different processor, don't pull it unless
          * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
         if ( svc->vcpu->processor != cpu
@@ -2024,6 +2096,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
         printk("%s: cpu %d not online yet, deferring initializatgion\n",
                __func__, cpu);
 
+    /*
+     * For each new pcpu, allocate a cpumask_t for use throughout the
+     * scheduler to avoid putting any cpumask_t structs on the stack.
+     */
+    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )
+        return NULL;
+
     return (void *)1;
 }
 
@@ -2072,6 +2151,8 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_unlock_irqrestore(&prv->lock, flags);
 
+    free_cpumask_var(scratch_mask[cpu]);
+
     return;
 }
 
@@ -2159,6 +2240,10 @@ csched2_init(struct scheduler *ops)
 
     prv->load_window_shift = opt_load_window_shift;
 
+    scratch_mask = xmalloc_array(cpumask_t *, nr_cpu_ids);
+    if ( scratch_mask == NULL )
+        return -ENOMEM;
+
     return 0;
 }
 
@@ -2169,6 +2254,8 @@ csched2_deinit(const struct scheduler *ops)
 
     prv = CSCHED2_PRIV(ops);
     xfree(prv);
+
+    xfree(scratch_mask);
 }
 
 
-- 
1.7.10.4

* [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file
  2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
  2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
@ 2015-03-26  9:48 ` Justin T. Weaver
  2015-04-23 15:22   ` Dario Faggioli
  2015-03-26  9:48 ` [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier Justin T. Weaver
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Justin T. Weaver @ 2015-03-26  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

Move affinity balancing related functions and defines from sched_credit.c to
sched-if.h so other schedulers can use them. Change name prefixes from csched
to sched since they are no longer specific to the credit scheduler.

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
Changes in v3: First introduced in patch series version 3
---
 xen/common/sched_credit.c  |   87 ++++++--------------------------------------
 xen/include/xen/sched-if.h |   65 +++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+), 76 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index bec67ff..3eb9440 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -112,26 +112,6 @@
 
 
 /*
- * Hard and soft affinity load balancing.
- *
- * Idea is each vcpu has some pcpus that it prefers, some that it does not
- * prefer but is OK with, and some that it cannot run on at all. The first
- * set of pcpus are the ones that are both in the soft affinity *and* in the
- * hard affinity; the second set of pcpus are the ones that are in the hard
- * affinity but *not* in the soft affinity; the third set of pcpus are the
- * ones that are not in the hard affinity.
- *
- * We implement a two step balancing logic. Basically, every time there is
- * the need to decide where to run a vcpu, we first check the soft affinity
- * (well, actually, the && between soft and hard affinity), to see if we can
- * send it where it prefers to (and can) run on. However, if the first step
- * does not find any suitable and free pcpu, we fall back checking the hard
- * affinity.
- */
-#define CSCHED_BALANCE_SOFT_AFFINITY    0
-#define CSCHED_BALANCE_HARD_AFFINITY    1
-
-/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -273,51 +253,6 @@ __runq_remove(struct csched_vcpu *svc)
 }
 
 
-#define for_each_csched_balance_step(step) \
-    for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ )
-
-
-/*
- * Hard affinity balancing is always necessary and must never be skipped.
- * But soft affinity need only be considered when it has a functionally
- * different effect than other constraints (such as hard affinity, cpus
- * online, or cpupools).
- *
- * Soft affinity only needs to be considered if:
- * * The cpus in the cpupool are not a subset of soft affinity
- * * The hard affinity is not a subset of soft affinity
- * * There is an overlap between the soft affinity and the mask which is
- *   currently being considered.
- */
-static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
-                                           const cpumask_t *mask)
-{
-    return !cpumask_subset(cpupool_online_cpumask(vc->domain->cpupool),
-                           vc->cpu_soft_affinity) &&
-           !cpumask_subset(vc->cpu_hard_affinity, vc->cpu_soft_affinity) &&
-           cpumask_intersects(vc->cpu_soft_affinity, mask);
-}
-
-/*
- * Each csched-balance step uses its own cpumask. This function determines
- * which one (given the step) and copies it in mask. For the soft affinity
- * balancing step, the pcpus that are not part of vc's hard affinity are
- * filtered out from the result, to avoid running a vcpu where it would
- * like, but is not allowed to!
- */
-static void
-csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
-{
-    if ( step == CSCHED_BALANCE_SOFT_AFFINITY )
-    {
-        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
-
-        if ( unlikely(cpumask_empty(mask)) )
-            cpumask_copy(mask, vc->cpu_hard_affinity);
-    }
-    else /* step == CSCHED_BALANCE_HARD_AFFINITY */
-        cpumask_copy(mask, vc->cpu_hard_affinity);
-}
 
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
@@ -379,18 +314,18 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
          * Soft and hard affinity balancing loop. For vcpus without
          * a useful soft affinity, consider hard affinity only.
          */
-        for_each_csched_balance_step( balance_step )
+        for_each_sched_balance_step( balance_step )
         {
             int new_idlers_empty;
 
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+            if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
                  && !__vcpu_has_soft_affinity(new->vcpu,
                                               new->vcpu->cpu_hard_affinity) )
                 continue;
 
             /* Are there idlers suitable for new (for this balance step)? */
-            csched_balance_cpumask(new->vcpu, balance_step,
-                                   csched_balance_mask);
+            sched_balance_cpumask(new->vcpu, balance_step,
+                                  csched_balance_mask);
             cpumask_and(&idle_mask, prv->idlers, csched_balance_mask);
             new_idlers_empty = cpumask_empty(&idle_mask);
 
@@ -400,7 +335,7 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
              * hard affinity as well, before taking final decisions.
              */
             if ( new_idlers_empty
-                 && balance_step == CSCHED_BALANCE_SOFT_AFFINITY )
+                 && balance_step == SCHED_BALANCE_SOFT_AFFINITY )
                 continue;
 
             /*
@@ -622,7 +557,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
     cpumask_and(&cpus, vc->cpu_hard_affinity, online);
 
-    for_each_csched_balance_step( balance_step )
+    for_each_sched_balance_step( balance_step )
     {
         /*
          * We want to pick up a pcpu among the ones that are online and
@@ -642,12 +577,12 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
          * cpus and, if the result is empty, we just skip the soft affinity
          * balancing step all together.
          */
-        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
              && !__vcpu_has_soft_affinity(vc, &cpus) )
             continue;
 
         /* Pick an online CPU from the proper affinity mask */
-        csched_balance_cpumask(vc, balance_step, &cpus);
+        sched_balance_cpumask(vc, balance_step, &cpus);
         cpumask_and(&cpus, &cpus, online);
 
         /* If present, prefer vc's current processor */
@@ -1465,11 +1400,11 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
              * vCPUs with useful soft affinities in some sort of bitmap
              * or counter.
              */
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+            if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
                  && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
                 continue;
 
-            csched_balance_cpumask(vc, balance_step, csched_balance_mask);
+            sched_balance_cpumask(vc, balance_step, csched_balance_mask);
             if ( __csched_vcpu_is_migrateable(vc, cpu, csched_balance_mask) )
             {
                 /* We got a candidate. Grab it! */
@@ -1520,7 +1455,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
      *  1. any "soft-affine work" to steal first,
      *  2. if not finding anything, any "hard-affine work" to steal.
      */
-    for_each_csched_balance_step( bstep )
+    for_each_sched_balance_step( bstep )
     {
         /*
          * We peek at the non-idling CPUs in a node-wise fashion. In fact,
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 7cc25c6..3a118da 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -188,4 +188,69 @@ struct cpupool
 #define cpupool_online_cpumask(_pool) \
     (((_pool) == NULL) ? &cpu_online_map : (_pool)->cpu_valid)
 
+/*
+ * Hard and soft affinity load balancing.
+ *
+ * Idea is each vcpu has some pcpus that it prefers, some that it does not
+ * prefer but is OK with, and some that it cannot run on at all. The first
+ * set of pcpus are the ones that are both in the soft affinity *and* in the
+ * hard affinity; the second set of pcpus are the ones that are in the hard
+ * affinity but *not* in the soft affinity; the third set of pcpus are the
+ * ones that are not in the hard affinity.
+ *
+ * We implement a two step balancing logic. Basically, every time there is
+ * the need to decide where to run a vcpu, we first check the soft affinity
+ * (well, actually, the && between soft and hard affinity), to see if we can
+ * send it where it prefers to (and can) run on. However, if the first step
+ * does not find any suitable and free pcpu, we fall back checking the hard
+ * affinity.
+ */
+#define SCHED_BALANCE_SOFT_AFFINITY    0
+#define SCHED_BALANCE_HARD_AFFINITY    1
+
+#define for_each_sched_balance_step(step) \
+    for ( (step) = 0; (step) <= SCHED_BALANCE_HARD_AFFINITY; (step)++ )
+
+/*
+ * Hard affinity balancing is always necessary and must never be skipped.
+ * But soft affinity need only be considered when it has a functionally
+ * different effect than other constraints (such as hard affinity, cpus
+ * online, or cpupools).
+ *
+ * Soft affinity only needs to be considered if:
+ * * The cpus in the cpupool are not a subset of soft affinity
+ * * The hard affinity is not a subset of soft affinity
+ * * There is an overlap between the soft affinity and the mask which is
+ *   currently being considered.
+ */
+static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
+                                           const cpumask_t *mask)
+{
+    return !cpumask_subset(cpupool_online_cpumask(vc->domain->cpupool),
+                           vc->cpu_soft_affinity) &&
+           !cpumask_subset(vc->cpu_hard_affinity, vc->cpu_soft_affinity) &&
+           cpumask_intersects(vc->cpu_soft_affinity, mask);
+}
+
+/*
+ * Each sched-balance step uses its own cpumask. This function determines
+ * which one (given the step) and copies it in mask. For the soft affinity
+ * balancing step, the pcpus that are not part of vc's hard affinity are
+ * filtered out from the result, to avoid running a vcpu where it would
+ * like, but is not allowed to!
+ */
+static inline void
+sched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == SCHED_BALANCE_SOFT_AFFINITY )
+    {
+        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
+
+        if ( unlikely(cpumask_empty(mask)) )
+            cpumask_copy(mask, vc->cpu_hard_affinity);
+    }
+    else /* step == SCHED_BALANCE_HARD_AFFINITY */
+        cpumask_copy(mask, vc->cpu_hard_affinity);
+}
+
 #endif /* __XEN_SCHED_IF_H__ */
-- 
1.7.10.4

* [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier
  2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
  2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
  2015-03-26  9:48 ` [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file Justin T. Weaver
@ 2015-03-26  9:48 ` Justin T. Weaver
  2015-04-23 15:35   ` Dario Faggioli
  2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
  2015-09-17 14:27 ` [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and " Dario Faggioli
  4 siblings, 1 reply; 17+ messages in thread
From: Justin T. Weaver @ 2015-03-26  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

Functions runq_tickle and choose_cpu both have code sections that get turned
into loops in patch 4 of v3 (soft affinity). Do the indenting here to make the
patch 4 diff easier to review. This patch makes no changes other than adding
one four-space indent per affected line.

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
Changes in v3: First introduced in patch series version 3
---
 xen/common/sched_credit2.c |  152 ++++++++++++++++++++++----------------------
 1 file changed, 76 insertions(+), 76 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index af716e4..bbcfbf2 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -534,58 +534,58 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
-    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
-    cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
-    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+        /* Get a mask of idle, but not tickled, that new is allowed to run on. */
+        cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
+        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
     
-    /* If it's not empty, choose one */
-    i = cpumask_cycle(cpu, &mask);
-    if ( i < nr_cpu_ids )
-    {
-        ipid = i;
-        goto tickle;
-    }
+        /* If it's not empty, choose one */
+        i = cpumask_cycle(cpu, &mask);
+        if ( i < nr_cpu_ids )
+        {
+            ipid = i;
+            goto tickle;
+        }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
      * skipping cpus which have been tickled but not scheduled yet,
      * that new is allowed to run on. */
-    cpumask_andnot(&mask, &rqd->active, &rqd->idle);
-    cpumask_andnot(&mask, &mask, &rqd->tickled);
-    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+        cpumask_andnot(&mask, &rqd->active, &rqd->idle);
+        cpumask_andnot(&mask, &mask, &rqd->tickled);
+        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
 
-    for_each_cpu(i, &mask)
-    {
-        struct csched2_vcpu * cur;
+        for_each_cpu(i, &mask)
+        {
+            struct csched2_vcpu * cur;
 
-        /* Already looked at this one above */
-        if ( i == cpu )
-            continue;
+            /* Already looked at this one above */
+            if ( i == cpu )
+                continue;
 
-        cur = CSCHED2_VCPU(curr_on_cpu(i));
+            cur = CSCHED2_VCPU(curr_on_cpu(i));
 
-        BUG_ON(is_idle_vcpu(cur->vcpu));
+            BUG_ON(is_idle_vcpu(cur->vcpu));
 
-        /* Update credits for current to see if we want to preempt */
-        burn_credits(rqd, cur, now);
+            /* Update credits for current to see if we want to preempt */
+            burn_credits(rqd, cur, now);
 
-        if ( cur->credit < lowest )
-        {
-            ipid = i;
-            lowest = cur->credit;
-        }
+            if ( cur->credit < lowest )
+            {
+                ipid = i;
+                lowest = cur->credit;
+            }
 
-        /* TRACE */ {
-            struct {
-                unsigned dom:16,vcpu:16;
-                unsigned credit;
-            } d;
-            d.dom = cur->vcpu->domain->domain_id;
-            d.vcpu = cur->vcpu->vcpu_id;
-            d.credit = cur->credit;
-            trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
-        }
+            /* TRACE */ {
+                struct {
+                    unsigned dom:16,vcpu:16;
+                    unsigned credit;
+                } d;
+                d.dom = cur->vcpu->domain->domain_id;
+                d.vcpu = cur->vcpu->vcpu_id;
+                d.credit = cur->credit;
+                trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                          sizeof(d),
+                          (unsigned char *)&d);
+            }
     }
 
     /* Only switch to another processor if the credit difference is greater
@@ -1144,45 +1144,45 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
     min_avgload = MAX_LOAD;
 
     /* Find the runqueue with the lowest instantaneous load */
-    for_each_cpu(i, &prv->active_queues)
-    {
-        struct csched2_runqueue_data *rqd;
-        s_time_t rqd_avgload = MAX_LOAD;
-
-        rqd = prv->rqd + i;
-
-        /* If checking a different runqueue, grab the lock,
-         * check hard affinity, read the avg, and then release the lock.
-         *
-         * If on our own runqueue, don't grab or release the lock;
-         * but subtract our own load from the runqueue load to simulate
-         * impartiality.
-         *
-         * svc's hard affinity may have changed; this function is the
-         * credit 2 scheduler's first opportunity to react to the change,
-         * so it is possible here that svc does not have hard affinity
-         * with any of the pcpus of svc's currently assigned run queue.
-         */
-        if ( rqd == svc->rqd )
+        for_each_cpu(i, &prv->active_queues)
         {
-            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
-                rqd_avgload = rqd->b_avgload - svc->avgload;
-        }
-        else if ( spin_trylock(&rqd->lock) )
-        {
-            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
-                rqd_avgload = rqd->b_avgload;
+            struct csched2_runqueue_data *rqd;
+            s_time_t rqd_avgload = MAX_LOAD;
+
+            rqd = prv->rqd + i;
+
+            /* If checking a different runqueue, grab the lock,
+             * check hard affinity, read the avg, and then release the lock.
+             *
+             * If on our own runqueue, don't grab or release the lock;
+             * but subtract our own load from the runqueue load to simulate
+             * impartiality.
+             *
+             * svc's hard affinity may have changed; this function is the
+             * credit 2 scheduler's first opportunity to react to the change,
+             * so it is possible here that svc does not have hard affinity
+             * with any of the pcpus of svc's currently assigned run queue.
+             */
+            if ( rqd == svc->rqd )
+            {
+                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                    rqd_avgload = rqd->b_avgload - svc->avgload;
+            }
+            else if ( spin_trylock(&rqd->lock) )
+            {
+                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                    rqd_avgload = rqd->b_avgload;
 
-            spin_unlock(&rqd->lock);
-        }
-        else
-            continue;
+                spin_unlock(&rqd->lock);
+            }
+            else
+                continue;
 
-        if ( rqd_avgload < min_avgload )
-        {
-            min_avgload = rqd_avgload;
-            min_rqi=i;
-        }
+            if ( rqd_avgload < min_avgload )
+            {
+                min_avgload = rqd_avgload;
+                min_rqi=i;
+            }
     }
 
     /* We didn't find anyone (most likely because of spinlock contention). */
-- 
1.7.10.4

* [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity
  2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
                   ` (2 preceding siblings ...)
  2015-03-26  9:48 ` [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier Justin T. Weaver
@ 2015-03-26  9:48 ` Justin T. Weaver
  2015-03-31 17:38   ` George Dunlap
                     ` (2 more replies)
  2015-09-17 14:27 ` [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and " Dario Faggioli
  4 siblings, 3 replies; 17+ messages in thread
From: Justin T. Weaver @ 2015-03-26  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

when making decisions for vcpus (run queue assignment, run queue migration,
cpu assignment, and cpu tickling).

Added soft affinity balancing loops to...
 * get_fallback_cpu
 * runq_tickle (one for idle, but not tickled, cpus; one for non-idle,
   not tickled, cpus)
 * choose_cpu

choose_cpu now tries to find the run queue with the most cpus in the given
vcpu's soft affinity. It uses minimum run queue load as a tie breaker.
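
In other words, a candidate run queue wins if it covers more of the vcpu's
soft affinity, and instantaneous load only decides between otherwise equal
candidates; roughly (an illustrative predicate, not part of the patch):

/* Illustrative: does the candidate run queue beat the best one so far? */
static bool_t runq_is_better(int cand_soft_cpus, s_time_t cand_avgload,
                             int best_soft_cpus, s_time_t best_avgload)
{
    return cand_soft_cpus > best_soft_cpus
           || ( cand_soft_cpus == best_soft_cpus
                && cand_avgload < best_avgload );
}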

Added a function to determine the number of soft cpus gained (or lost) by a
given vcpu if it is migrated from a given source run queue to a given
destination run queue.
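
A simplified sketch of that computation follows; the patch itself avoids
cpumask_t variables on the stack by reusing the per-cpu scratch mask, so
the real get_soft_affinity_gain is structured differently, but the result
is the same:

/* Illustrative: soft-affinity pcpus gained by moving vc from src to dst. */
static int soft_affinity_gain_sketch(const struct vcpu *vc,
                                     const cpumask_t *src_active,
                                     const cpumask_t *dst_active)
{
    cpumask_t soft, tmp;
    int src_cpus, dst_cpus;

    /* A vcpu without a useful soft affinity neither gains nor loses. */
    if ( !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
        return 0;

    sched_balance_cpumask(vc, SCHED_BALANCE_SOFT_AFFINITY, &soft);

    cpumask_and(&tmp, &soft, dst_active);
    dst_cpus = cpumask_weight(&tmp);

    cpumask_and(&tmp, &soft, src_active);
    src_cpus = cpumask_weight(&tmp);

    return dst_cpus - src_cpus;
}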

Modified the algorithm in balance_load and consider()...
 * if the load on lrqd and/or orqd is less than the number of their active
   cpus, balance_load will look for vcpus that would have their soft affinity
   improved by being pushed and/or pulled. Load does not need to be considered
   since a run queue receiving a pushed or pulled vcpu is not being fully
   utilized. This returns vcpus that may have been migrated away from their
   soft affinity due to load balancing back to their soft affinity.
 * in consider, vcpus that might be picked for migration because pushing or
   pulling them decreases the load delta are not picked if their current run
   queue's load is less than its active cpu count and if that migration would
   harm their soft affinity. There's no need to push/pull if the load is under
   capacity, and the vcpu would lose access to some or all of its soft cpus
   (see the sketch after this list).
 * in consider, if a push/pull/swap migration decreases the load delta by a
   similar amount to another push/pull/swap migration, then use soft cpu gain
   as a tie breaker. This allows load to continue to balance across run queues,
   but favors soft affinity gains if the load deltas are close.
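
For the under-capacity rule in the second bullet, the check boils down to
something like this (an illustrative helper; in the patch the test is
folded directly into consider() and balance_load()):

/*
 * Illustrative: skip a push that would cost svc soft pcpus while its
 * source run queue still has spare capacity (load below its active
 * pcpu count).
 */
static bool_t skip_push_for_soft_affinity(struct csched2_vcpu *svc,
                                          struct csched2_runqueue_data *src,
                                          struct csched2_runqueue_data *dst)
{
    return src->load < cpumask_weight(&src->active)
           && get_soft_affinity_gain(svc->vcpu, src, dst) < 0;
}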

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
Changes in v3:
 * get_fallback_cpu: added balance loop to try to find a soft affinity cpu 
 * runq_tickle: replaced use of local var mask with csched2_cpumask
 * runq_tickle: added two balance loops, one for finding idle, but not
   tickled, and other for finding non-idle with lowest credit
 * choose_cpu: added balance loop to find cpu for given vcpu that has most
   soft cpus (with run queue load being a tie breaker), or if none were found,
   or not considering soft affinity, pick cpu from runq with least load
 * balance_load / consider: removed code that ignored a migration if it meant
   moving a vcpu away from its soft affinity; added migration of vcpus to
   improve their soft affinity if the destination run queue was under load;
   added check in consider, if current run queue is under load and migration
   would hurt the vcpu's soft affinity, do not consider the migration; added
   soft affinity tie breaker in consider if current load delta and consider
   load delta are close
 * added helper functions for soft affinity related changes to balance_load
Changes in v2:
 * Not submitted in version 2; focus was on the hard affinity patch
---
 xen/common/sched_credit2.c |  344 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 313 insertions(+), 31 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index bbcfbf2..47d0bad 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -127,6 +127,14 @@
 #define CSCHED2_CREDIT_RESET         0
 /* Max timer: Maximum time a guest can be run for. */
 #define CSCHED2_MAX_TIMER            MILLISECS(2)
+/* Used in balance_load to specify migration direction. */
+#define CSCHED2_PULL                 0
+#define CSCHED2_PUSH                 1
+/*
+ * Used in balance_load to decide if deltas are close enough to use soft
+ * affinity as a tie breaker.
+ */
+#define CSCHED2_DIVIDE_BY_16         4
 
 
 #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
@@ -288,15 +296,33 @@ struct csched2_dom {
  */
 static int get_fallback_cpu(struct csched2_vcpu *svc)
 {
+    int balance_step;
+
     if ( likely(cpumask_test_cpu(svc->vcpu->processor,
         svc->vcpu->cpu_hard_affinity)) )
         return svc->vcpu->processor;
 
-    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
-        &svc->rqd->active);
-    if ( cpumask_empty(csched2_cpumask) )
-        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
-            VCPU2ONLINE(svc->vcpu));
+    for_each_sched_balance_step( balance_step )
+    {
+        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
+            && !__vcpu_has_soft_affinity(svc->vcpu,
+                svc->vcpu->cpu_hard_affinity) )
+            continue;
+
+        sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
+        cpumask_and(csched2_cpumask, csched2_cpumask, &svc->rqd->active);
+        if ( !cpumask_empty(csched2_cpumask) )
+            break;
+        else
+        {
+            sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
+            cpumask_and(csched2_cpumask, csched2_cpumask,
+                VCPU2ONLINE(svc->vcpu));
+            if ( !cpumask_empty(csched2_cpumask) )
+                break;
+        }
+    }
+
     ASSERT( !cpumask_empty(csched2_cpumask) );
 
     return cpumask_any(csched2_cpumask);
@@ -516,8 +542,8 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     int i, ipid=-1;
     s_time_t lowest=(1<<30);
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
-    cpumask_t mask;
     struct csched2_vcpu * cur;
+    int balance_step;
 
     d2printk("rqt %pv curr %pv\n", new->vcpu, current);
 
@@ -534,26 +560,43 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
+    for_each_sched_balance_step ( balance_step )
+    {
+        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
+            && !__vcpu_has_soft_affinity(new->vcpu,
+                new->vcpu->cpu_hard_affinity) )
+            continue;
+
         /* Get a mask of idle, but not tickled, that new is allowed to run on. */
-        cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
-        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
-    
+        sched_balance_cpumask(new->vcpu, balance_step, csched2_cpumask);
+        cpumask_and(csched2_cpumask, csched2_cpumask, &rqd->idle);
+        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->tickled);
+
         /* If it's not empty, choose one */
-        i = cpumask_cycle(cpu, &mask);
+        i = cpumask_cycle(cpu, csched2_cpumask);
         if ( i < nr_cpu_ids )
         {
             ipid = i;
             goto tickle;
         }
+    }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
      * skipping cpus which have been tickled but not scheduled yet,
      * that new is allowed to run on. */
-        cpumask_andnot(&mask, &rqd->active, &rqd->idle);
-        cpumask_andnot(&mask, &mask, &rqd->tickled);
-        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    for_each_sched_balance_step ( balance_step )
+    {
+        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
+            && !__vcpu_has_soft_affinity(new->vcpu,
+                new->vcpu->cpu_hard_affinity) )
+            continue;
+
+        sched_balance_cpumask(new->vcpu, balance_step, csched2_cpumask);
+        cpumask_and(csched2_cpumask, csched2_cpumask, &rqd->active);
+        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->idle);
+        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->tickled);
 
-        for_each_cpu(i, &mask)
+        for_each_cpu(i, csched2_cpumask)
         {
             struct csched2_vcpu * cur;
 
@@ -586,6 +629,7 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
                           sizeof(d),
                           (unsigned char *)&d);
             }
+        }
     }
 
     /* Only switch to another processor if the credit difference is greater
@@ -1086,7 +1130,7 @@ static int
 choose_cpu(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    int i, min_rqi = -1, new_cpu;
+    int i, rqi = -1, new_cpu, max_soft_cpus = 0, balance_step;
     struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
     s_time_t min_avgload;
 
@@ -1143,9 +1187,28 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /*
+     * Find the run queue with the most cpus in vc's soft affinity. If there
+     * is more than one queue with the highest soft affinity cpu count, then
+     * pick the one with the lowest instantaneous run queue load. If the
+     * vcpu does not have soft affinity, then only try to find the run queue
+     * with the lowest instantaneous load.
+     */
+    for_each_sched_balance_step( balance_step )
+    {
+        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
+            && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+            continue;
+
+        if ( balance_step == SCHED_BALANCE_HARD_AFFINITY && rqi > -1 )
+        {
+            balance_step = SCHED_BALANCE_SOFT_AFFINITY;
+            break;
+        }
+
         for_each_cpu(i, &prv->active_queues)
         {
+            int rqd_soft_cpus = 0;
             struct csched2_runqueue_data *rqd;
             s_time_t rqd_avgload = MAX_LOAD;
 
@@ -1163,35 +1226,61 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
              * so it is possible here that svc does not have hard affinity
              * with any of the pcpus of svc's currently assigned run queue.
              */
+            sched_balance_cpumask(vc, balance_step, csched2_cpumask);
             if ( rqd == svc->rqd )
             {
-                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                if ( cpumask_intersects(csched2_cpumask, &rqd->active) )
                     rqd_avgload = rqd->b_avgload - svc->avgload;
+                if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY )
+                {
+                    cpumask_and(csched2_cpumask, csched2_cpumask,
+                        &rqd->active);
+                    rqd_soft_cpus = cpumask_weight(csched2_cpumask);
+                }
             }
             else if ( spin_trylock(&rqd->lock) )
             {
-                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                if ( cpumask_intersects(csched2_cpumask, &rqd->active) )
                     rqd_avgload = rqd->b_avgload;
+                if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY )
+                {
+                    cpumask_and(csched2_cpumask, csched2_cpumask,
+                        &rqd->active);
+                    rqd_soft_cpus = cpumask_weight(csched2_cpumask);
+                }
 
                 spin_unlock(&rqd->lock);
             }
             else
                 continue;
 
-            if ( rqd_avgload < min_avgload )
+            if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
+                && rqd_soft_cpus > 0
+                && ( rqd_soft_cpus > max_soft_cpus
+                    ||
+                   ( rqd_soft_cpus == max_soft_cpus
+                    && rqd_avgload < min_avgload )) )
+            {
+                max_soft_cpus = rqd_soft_cpus;
+                rqi = i;
+                min_avgload = rqd_avgload;
+            }
+            else if ( balance_step == SCHED_BALANCE_HARD_AFFINITY
+                     && rqd_avgload < min_avgload )
             {
+                rqi = i;
                 min_avgload = rqd_avgload;
-                min_rqi=i;
             }
+        }
     }
 
     /* We didn't find anyone (most likely because of spinlock contention). */
-    if ( min_rqi == -1 )
+    if ( rqi == -1 )
         new_cpu = get_fallback_cpu(svc);
     else
     {
-        cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
-            &prv->rqd[min_rqi].active);
+        sched_balance_cpumask(vc, balance_step, csched2_cpumask);
+        cpumask_and(csched2_cpumask, csched2_cpumask, &prv->rqd[rqi].active);
         new_cpu = cpumask_any(csched2_cpumask);
         BUG_ON(new_cpu >= nr_cpu_ids);
     }
@@ -1207,15 +1296,75 @@ typedef struct {
     /* NB: Modified by consider() */
     s_time_t load_delta;
     struct csched2_vcpu * best_push_svc, *best_pull_svc;
+    int soft_affinity_boost;
+    bool_t valid_sa_boost;
     /* NB: Read by consider() */
     struct csched2_runqueue_data *lrqd;
     struct csched2_runqueue_data *orqd;                  
 } balance_state_t;
 
+/*
+ * Return the number of pcpus gained in vc's soft affinity mask that vc can
+ * run on if vc is migrated from run queue src_rqd to run queue dst_rqd.
+ */
+static int get_soft_affinity_gain(const struct vcpu *vc,
+                                  const struct csched2_runqueue_data *src_rqd,
+                                  const struct csched2_runqueue_data *dst_rqd)
+{
+    /*
+     * Locks must already be held for src_rqd and dst_rqd.
+     * Function assumes vc has at least hard affinity with one or more
+     * pcpus in both the source and destination run queues.
+     */
+
+    /* Does vcpu not have soft affinity? */
+    if ( !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        return 0;
+
+    /* Does vcpu have soft affinity with pcpu(s) in the destination runq? */
+    sched_balance_cpumask(vc, SCHED_BALANCE_SOFT_AFFINITY, csched2_cpumask);
+    if ( cpumask_intersects(csched2_cpumask, &dst_rqd->active) )
+    {
+        int soft_cpus_dst;
+        cpumask_and(csched2_cpumask, csched2_cpumask, &dst_rqd->active);
+        soft_cpus_dst = cpumask_weight(csched2_cpumask);
+
+        /* ... and soft affinity with the source run queue? */
+        sched_balance_cpumask(vc, SCHED_BALANCE_SOFT_AFFINITY,
+            csched2_cpumask);
+        if ( cpumask_intersects(csched2_cpumask, &src_rqd->active) )
+        {
+            int soft_cpus_src;
+            cpumask_and(csched2_cpumask, csched2_cpumask, &src_rqd->active);
+            soft_cpus_src = cpumask_weight(csched2_cpumask);
+
+            /* Soft affinity to soft affinity migration. */
+            return soft_cpus_dst - soft_cpus_src;
+        }
+        else
+            /* Hard affinity to soft affinity migration. */
+            return soft_cpus_dst;
+    }
+    else
+    {
+        int soft_cpus_src = 0;
+        cpumask_and(csched2_cpumask, csched2_cpumask, &src_rqd->active);
+        soft_cpus_src = cpumask_weight(csched2_cpumask);
+
+        /*
+         * Hard affinity to hard affinity migration or soft affinity to hard
+         * affinity migration.
+         */
+        return -soft_cpus_src;
+    }
+}
+
 static void consider(balance_state_t *st, 
                      struct csched2_vcpu *push_svc,
-                     struct csched2_vcpu *pull_svc)
+                     struct csched2_vcpu *pull_svc,
+                     int load_window_shift)
 {
+    int delta_diff;
     s_time_t l_load, o_load, delta;
 
     l_load = st->lrqd->b_avgload;
@@ -1237,12 +1386,88 @@ static void consider(balance_state_t *st,
     if ( delta < 0 )
         delta = -delta;
 
-    if ( delta < st->load_delta )
+    /*
+     * Use soft affinity gain as a tie breaker if at least one migration has
+     * already been picked and stored in the balance state, and the absolute
+     * value of the difference between the delta in st and the new delta being
+     * considered here is less than 1/16th of the load_window_shift.
+     */
+    delta_diff = delta - st->load_delta;
+    if ( delta_diff < 0 )
+        delta_diff = -delta_diff;
+    if ( (st->best_push_svc != NULL || st->best_pull_svc != NULL)
+        && delta_diff < 1<<(load_window_shift - CSCHED2_DIVIDE_BY_16) )
     {
-        st->load_delta = delta;
-        st->best_push_svc=push_svc;
-        st->best_pull_svc=pull_svc;
+        int st_soft_gain = 0, consider_soft_gain = 0;
+
+        /* Find the soft affinity gain for the migration in st. */
+        if ( !st->valid_sa_boost )
+            if ( st->best_push_svc )
+                st_soft_gain += get_soft_affinity_gain(
+                    st->best_push_svc->vcpu, st->lrqd, st->orqd);
+            if ( st->best_pull_svc )
+                st_soft_gain += get_soft_affinity_gain(
+                    st->best_pull_svc->vcpu, st->orqd, st->lrqd);
+        else
+            st_soft_gain = st->soft_affinity_boost;
+
+        /* Find the soft affinity gain for the migration being considered. */
+        if ( push_svc )
+        {
+            int push_soft_gain = get_soft_affinity_gain(
+                push_svc->vcpu, st->lrqd, st->orqd);
+            if ( push_soft_gain < 0
+                && st->lrqd->load < cpumask_weight(&st->lrqd->active) )
+                return;
+            consider_soft_gain = push_soft_gain;
+        }
+        if ( pull_svc )
+        {
+            int pull_soft_gain = get_soft_affinity_gain(
+                pull_svc->vcpu, st->orqd, st->lrqd);
+            if ( pull_soft_gain < 0
+                && st->orqd->load < cpumask_weight(&st->orqd->active) )
+                return;
+            consider_soft_gain += pull_soft_gain;
+        }
+
+        /* Store the higher gain in the balance state. */
+        st->soft_affinity_boost = consider_soft_gain > st_soft_gain ?
+            consider_soft_gain : st_soft_gain;
+        st->valid_sa_boost = 1;
+
+        if ( consider_soft_gain > st_soft_gain )
+            goto choose_migration;
+        else if ( st_soft_gain > consider_soft_gain )
+            return;
+
+        /* Soft affinity gain is the same; fall through. */
     }
+
+    /* Only consider load delta. */
+    if ( delta < st->load_delta )
+        st->valid_sa_boost = 0;
+
+        /*
+         * If the migration results in a loss of some or all soft cpus and the
+         * vcpu's current run queue has less load than physical processors, do
+         * not use the migration.
+         */
+        if ( push_svc &&
+            (st->lrqd->load < cpumask_weight(&st->lrqd->active) &&
+             get_soft_affinity_gain(push_svc->vcpu, st->lrqd, st->orqd) < 0) )
+            return;
+        if ( pull_svc &&
+            (st->orqd->load < cpumask_weight(&st->orqd->active) &&
+             get_soft_affinity_gain(pull_svc->vcpu, st->orqd, st->lrqd) < 0) )
+            return;
+    else
+        return;
+
+choose_migration:
+    st->load_delta = delta;
+    st->best_push_svc=push_svc;
+    st->best_pull_svc=pull_svc;
 }
 
 
@@ -1291,6 +1516,31 @@ static void migrate(const struct scheduler *ops,
     }
 }
 
+/* Returns true if migration should be picked regardless of run queue load. */
+static bool_t consider_soft_affinity(balance_state_t *st,
+                                     struct csched2_vcpu *svc,
+                                     int direction)
+{
+    if ( direction == CSCHED2_PUSH )
+    {
+        if ( get_soft_affinity_gain(svc->vcpu, st->lrqd, st->orqd) > 0 )
+        {
+            st->best_push_svc = svc;
+            return 1;
+        }
+    }
+    else if ( direction == CSCHED2_PULL )
+    {
+        if ( get_soft_affinity_gain(svc->vcpu, st->orqd, st->lrqd) > 0 )
+        {
+            st->best_pull_svc = svc;
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Migration of vcpu svc to run queue rqd is a valid option if svc is not
  * already flagged to migrate and if svc is allowed to run on at least one of
@@ -1333,6 +1583,7 @@ retry:
         return;
 
     st.load_delta = 0;
+    st.soft_affinity_boost = 0;
 
     for_each_cpu(i, &prv->active_queues)
     {
@@ -1356,6 +1607,36 @@ retry:
             max_delta_rqi = i;
         }
 
+        /*
+         * If run queue load on lrqd and/or orqd is less than their active
+         * cpu counts, then look for any vcpus that can improve their
+         * soft affinity with a push and/or pull migration. Load does not
+         * need to be considered here.
+         */
+        if ( st.orqd->load < cpumask_weight(&st.orqd->active) )
+            list_for_each( push_iter, &st.lrqd->svc )
+            {
+                struct csched2_vcpu * push_svc = list_entry(
+                    push_iter, struct csched2_vcpu, rqd_elem);
+
+                if ( consider_soft_affinity(&st, push_svc, CSCHED2_PUSH) )
+                    break;
+            }
+        if ( st.lrqd->load < cpumask_weight(&st.lrqd->active) )
+            list_for_each( pull_iter, &st.orqd->svc )
+            {
+                struct csched2_vcpu * pull_svc = list_entry(
+                    pull_iter, struct csched2_vcpu, rqd_elem);
+
+                if ( consider_soft_affinity(&st, pull_svc, CSCHED2_PULL) )
+                    break;
+            }
+        if ( st.best_pull_svc != NULL || st.best_push_svc != NULL )
+        {
+            spin_unlock(&prv->lock);
+            goto migrate;
+        }
+
         spin_unlock(&st.orqd->lock);
     }
 
@@ -1428,13 +1709,13 @@ retry:
             if ( !valid_vcpu_migration(pull_svc, st.lrqd) )
                 continue;
 
-            consider(&st, push_svc, pull_svc);
+            consider(&st, push_svc, pull_svc, prv->load_window_shift);
         }
 
         inner_load_updated = 1;
 
         /* Consider push only */
-        consider(&st, push_svc, NULL);
+        consider(&st, push_svc, NULL, prv->load_window_shift);
     }
 
     list_for_each( pull_iter, &st.orqd->svc )
@@ -1445,9 +1726,10 @@ retry:
             continue;
 
         /* Consider pull only */
-        consider(&st, NULL, pull_svc);
+        consider(&st, NULL, pull_svc, prv->load_window_shift);
     }
 
+migrate:
     /* OK, now we have some candidates; do the moving */
     if ( st.best_push_svc )
         migrate(ops, st.best_push_svc, st.orqd, now);
-- 
1.7.10.4


* Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
@ 2015-03-31 14:37   ` George Dunlap
  2015-03-31 17:14     ` Dario Faggioli
  2015-04-23 16:00     ` Dario Faggioli
  2015-05-06 12:39   ` Dario Faggioli
  1 sibling, 2 replies; 17+ messages in thread
From: George Dunlap @ 2015-03-31 14:37 UTC (permalink / raw)
  To: Justin T. Weaver, xen-devel; +Cc: dario.faggioli, henric

On 03/26/2015 09:48 AM, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>

Hey Justin!  Getting close.  A couple of comments:

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 7581731..af716e4 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -176,6 +176,7 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define VCPU2ONLINE(_v)     cpupool_online_cpumask((_v)->domain->cpupool)
>  
>  /*
>   * Shifts for load average.
> @@ -194,6 +195,12 @@ int opt_overload_balance_tolerance=-3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Use this to avoid having too many cpumask_t structs on the stack
> + */
> +static cpumask_t **scratch_mask = NULL;
> +#define csched2_cpumask scratch_mask[smp_processor_id()]

Since I'm asking for changes below:  It's not clear to me when I'm
scanning the code that csched2_cpumask is a scratch variable.  What
about calling the actual variable _scratch_cpumask and having the
#define be scratch_cpumask?
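
Something like this, perhaps (just sketching the naming, untested):

    /* Scratch space, one mask per pcpu, to avoid cpumask_t on the stack. */
    static cpumask_t **_scratch_cpumask = NULL;
    #define scratch_cpumask _scratch_cpumask[smp_processor_id()]

That way it's obvious at every use site that it's per-cpu scratch space.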

>      /* Find the runqueue with the lowest instantaneous load */
>      for_each_cpu(i, &prv->active_queues)
>      {
>          struct csched2_runqueue_data *rqd;
> -        s_time_t rqd_avgload;
> +        s_time_t rqd_avgload = MAX_LOAD;
>  
>          rqd = prv->rqd + i;
>  
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
> -         * impartiality */
> +         * impartiality.
> +         *
> +         * svc's hard affinity may have changed; this function is the
> +         * credit 2 scheduler's first opportunity to react to the change,
> +         * so it is possible here that svc does not have hard affinity
> +         * with any of the pcpus of svc's currently assigned run queue.
> +         */
>          if ( rqd == svc->rqd )
>          {
> -            rqd_avgload = rqd->b_avgload - svc->avgload;
> +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> -            rqd_avgload = rqd->b_avgload;
> +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                rqd_avgload = rqd->b_avgload;
> +
>              spin_unlock(&rqd->lock);
>          }
>          else

Since we're already potentially falling through and doing the comparison
below with an unchanged rqd_avgload (which has now been set to MAX_LOAD
above), I wonder if it makes more sense to remove this "else continue"
here, just to avoid confusing people.
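
I.e., something like this (untested sketch):

    if ( rqd == svc->rqd )
    {
        if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
            rqd_avgload = rqd->b_avgload - svc->avgload;
    }
    else if ( spin_trylock(&rqd->lock) )
    {
        if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
            rqd_avgload = rqd->b_avgload;

        spin_unlock(&rqd->lock);
    }
    /* No 'else continue': rqd_avgload simply stays at MAX_LOAD and
     * loses the comparison against min_avgload below. */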

> @@ -2024,6 +2096,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
>          printk("%s: cpu %d not online yet, deferring initializatgion\n",
>                 __func__, cpu);
>  
> +    /*
> +     * For each new pcpu, allocate a cpumask_t for use throughout the
> +     * scheduler to avoid putting any cpumask_t structs on the stack.
> +     */
> +    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )

Any reason not to use "scratch_mask + cpu" here rather than
"&scratch_mask[cpu]"?

It might not be a bad idea to add ASSERT(scratch_mask[cpu] == NULL)
before this, just to be paranoid...
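
E.g. (sketch):

    /* Nothing should have been allocated for this cpu yet. */
    ASSERT(scratch_mask[cpu] == NULL);
    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )
        return NULL;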

> +        return NULL;
> +
>      return (void *)1;
>  }
>  
> @@ -2072,6 +2151,8 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  
>      spin_unlock_irqrestore(&prv->lock, flags);
>  
> +    free_cpumask_var(scratch_mask[cpu]);

We should definitely set this to NULL after freeing it (see below)
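
I.e. (sketch):

    free_cpumask_var(scratch_mask[cpu]);
    /* Keep the invariant: non-NULL means valid, NULL means not allocated. */
    scratch_mask[cpu] = NULL;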

> +
>      return;
>  }
>  
> @@ -2159,6 +2240,10 @@ csched2_init(struct scheduler *ops)
>  
>      prv->load_window_shift = opt_load_window_shift;
>  
> +    scratch_mask = xmalloc_array(cpumask_t *, nr_cpu_ids);

I realize Dario recommended using xmalloc_array() instead of
xzalloc_array(), but I don't understand why he thinks that's OK.  His
mail says "(see below about why I actually don't
think we need)", but I don't actually see that addressed in that e-mail.

I think it's just dangerous to leave uninitialized pointers around.  The
invariant should be that if the array entry is invalid it's NULL, and if
it's non-null then it's valid.

(* marc.info/?i=<1426266674.32500.25.camel@citrix.com>)

Also -- I think this allocation wants to happen in global_init(), right?
 Otherwise if you make a second cpupool with the credit2 scheduler this
will be clobbered.  (I think nr_cpu_ids should be defined at that point.)
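
Roughly, i.e., something like this in the global init path (sketch; the
hook is whatever credit2 ends up wiring to .global_init, and the error
handling depends on its signature):

    /* Runs once, before any cpupool can instantiate credit2. */
    scratch_mask = xzalloc_array(cpumask_t *, nr_cpu_ids);
    if ( scratch_mask == NULL )
        return -ENOMEM;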

 -George


* Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-31 14:37   ` George Dunlap
@ 2015-03-31 17:14     ` Dario Faggioli
  2015-03-31 17:32       ` George Dunlap
  2015-04-23 16:00     ` Dario Faggioli
  1 sibling, 1 reply; 17+ messages in thread
From: Dario Faggioli @ 2015-03-31 17:14 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, jtweaver, henric



On Tue, 2015-03-31 at 15:37 +0100, George Dunlap wrote:
> On 03/26/2015 09:48 AM, Justin T. Weaver wrote:
> > by making sure that vcpus only run on the pcpu(s) they are allowed to
> > run on based on their hard affinity cpu masks.
> > 
> > Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
> 
> Hey Justin!  Getting close.  A couple of comments:
> 
Hi from here too...

I'll also provide my comments on this series shortly, just the time to
finish a thing I'm busy with. :-/

For now, just a few replies to direct questions...

> > diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> > index 7581731..af716e4 100644
> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c

> > @@ -2024,6 +2096,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
> >          printk("%s: cpu %d not online yet, deferring initializatgion\n",
> >                 __func__, cpu);
> >  
> > +    /*
> > +     * For each new pcpu, allocate a cpumask_t for use throughout the
> > +     * scheduler to avoid putting any cpumask_t structs on the stack.
> > +     */
> > +    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )
> 
> Any reason not to use "scratch_mask + cpu" here rather than
> "&scratch_mask[cpu]"?
> 
With the renaming you suggested, that would be "_scratch_mask + cpu",
wouldn't it? I mean, it has to be the actual variable, not the #define,
since this is (IIRC) called from another CPU, and hence the macro, which
does smp_processor_id(), would give us the wrong element of the per-cpu
array.

That being said, I personally find the array syntax easier to read and
more consistent, especially if we add this:

> It might not be a bad idea to add ASSERT(scratch_mask[cpu] == NULL)
> before this, just to be paranoid...
> 
But, no big deal, I'm fine with the '+'.

> > @@ -2159,6 +2240,10 @@ csched2_init(struct scheduler *ops)
> >  
> >      prv->load_window_shift = opt_load_window_shift;
> >  
> > +    scratch_mask = xmalloc_array(cpumask_t *, nr_cpu_ids);
> 
> I realize Dario recommended using xmalloc_array() instead of
> xzalloc_array(), but I don't understand why he thinks that's OK.  
>
Well, I didn't go as far as recommending it, but yes, I'd do it that
way, and I think it is both safe and fine.

> His
> mail says "(see below about why I actually don't
> think we need)", but I don't actually see that addressed in that e-mail.
> 
Right. In that email (message-id:
<CA+o8iRVZzU++oPxeugNaSEZ8u-pyh7wk7cvaw7OWYkfB-pxNUw@mail.gmail.com> )

I was focusing on why the call to free() in a loop was not necessary,
and that we should instead free what has been previously allocated, rather
than always freeing everything and relying on the fact that, at "worst",
the pointer will be NULL anyway.

IOW, if we always free what we allocate, there is no need for the
pointers to be NULL, and this is how I addressed the matter in the
message. I agree it probably doesn't look super clear; this other email,
describing a similar scenario, may contain a better explanation of my
take on this:

<1426601529.32500.94.camel@citrix.com>

> I think it's just dangerous to leave uninitialized pointers around.  The
> invariant should be that if the array entry is invalid it's NULL, and if
> it's non-null then it's valid.
> 
I see. I guess this makes things more "defensive programming"-oriented,
which is a good thing.

I don't know how well this is enforced around the scheduler code (or in
general), but I'm certainly ok with making a step in that direction. This
is not a hot path, so zeroing the memory is no big deal... Not zeroing
forces one to think harder about what is being allocated and freed, which
is why I like it, but I'm, of course, more than ok with zalloc_*, so go for
it. :-)

> Also -- I think this allocation wants to happen in global_init(), right?
>  Otherwise if you make a second cpupool with the credit2 scheduler this
> will be clobbered.  (I think nr_cpu_ids should be defined at that point.)
> 
Good point, actually. This just made me realize I've made the same
mistake somewhere else... Thanks!! :-P

Regards,
Dario


* Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-31 17:14     ` Dario Faggioli
@ 2015-03-31 17:32       ` George Dunlap
  0 siblings, 0 replies; 17+ messages in thread
From: George Dunlap @ 2015-03-31 17:32 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap; +Cc: xen-devel, jtweaver, henric

On 03/31/2015 06:14 PM, Dario Faggioli wrote:
>>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>>> index 7581731..af716e4 100644
>>> --- a/xen/common/sched_credit2.c
>>> +++ b/xen/common/sched_credit2.c
> 
>>> @@ -2024,6 +2096,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
>>>          printk("%s: cpu %d not online yet, deferring initializatgion\n",
>>>                 __func__, cpu);
>>>  
>>> +    /*
>>> +     * For each new pcpu, allocate a cpumask_t for use throughout the
>>> +     * scheduler to avoid putting any cpumask_t structs on the stack.
>>> +     */
>>> +    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )
>>
>> Any reason not to use "scratch_mask + cpu" here rather than
>> "&scratch_mask[cpu]"?
>>
> With the renaming you suggested, that would be "_scratch_mask + cpu",
> wouldn't it? I mean, it has to be the actual variable, not the #define,
> since this is (IIRC) called from another CPU, and hence the macro, which
> does smp_processor_id(), would give us the wrong element of the per-cpu
> array.
> 
> That being said, I personally find the array syntax easier to read and
> more consistent, especially if we add this:

Yes, _scratch_mask.  I think I probably wrote this before suggesting the
rename. :-)

I actually find the other syntax easier to read in general; but it's not
too big a deal, and if we add the ASSERT, it certainly makes sense to
keep it an array for consistency.

> IOW, if we always free what we allocate, there is no need for the
> pointers to be NULL, and this is how I addressed the matter in the
> message. I agree it probably doesn't look super clear, this other email,
> describing a similar scenario, may contain a better explanation of my
> take on this:
> 
> <1426601529.32500.94.camel@citrix.com>
> 
>> I think it's just dangerous to leave uninitialized pointers around.  The
>> invariant should be that if the array entry is invalid it's NULL, and if
>> it's non-null then it's valid.
>>
> I see. I guess this makes things more "defensive programming"-ish
> oriented, which is a good thing.
> 
> I don't know how well this is enforced around the scheduler code (or in
> general), but I'm certainly ok making a step in that direction. This is
> no hot-path, so no big deal zeroing the memory... Not zeroing forces one
> to think harder at what is being allocated and freed, which is why I
> like it, but I'm, of course more than ok with zalloc_*, so go for
> it. :-)

I'm not sure how it forces you to think harder.  Having a null pointer
dereference is bad enough that I already try hard to think carefully about
what's being allocated and freed.  Having a double free or
use-after-free bug is certainly worse than a null pointer dereference,
but at some point worse consequences don't actually lead to better behavior.

It's like those politicians who say they want to double the punishment
for some crime; say, assaulting hospital staff.  Well, assaulting
hospital staff is already a crime that can lead to jail time; if the
original punishment didn't deter the guy, doubling it isn't really going
to do much.  :-)

>> Also -- I think this allocation wants to happen in global_init(), right?
>>  Otherwise if you make a second cpupool with the credit2 scheduler this
>> will be clobbered.  (I think nr_cpu_ids should be defined at that point.)
>>
> Good point, actually. This just made me realize I've done the same
> mistake somewhere else... Thanks!! :-P

That one slipped under my radar too.

...and it's also a good example of why "just think harder about it"
doesn't really work.  I should have made you add an ASSERT, when setting
that global variable, that it actually was NULL. :-D

 -George


* Re: [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity
  2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
@ 2015-03-31 17:38   ` George Dunlap
  2015-04-20 15:38   ` George Dunlap
  2015-04-22 16:16   ` George Dunlap
  2 siblings, 0 replies; 17+ messages in thread
From: George Dunlap @ 2015-03-31 17:38 UTC (permalink / raw)
  To: Justin T. Weaver, xen-devel; +Cc: dario.faggioli, henric

On 03/26/2015 09:48 AM, Justin T. Weaver wrote:
> when making decisions for vcpus (run queue assignment, run queue migration,
> cpu assignment, and cpu tickling).
> 
> Added soft affinity balancing loops to...
>  * get_fallback_cpu
>  * runq_tickle (one for idle, but not tickled; one for non-idle, and not
>    tickled)
>  * choose_cpu
> 
> choose_cpu now tries to find the run queue with the most cpus in the given
> vcpu's soft affinity. It uses minimum run queue load as a tie breaker.
> 
> Added a function to determine the number of soft cpus gained (or lost) by a
> given vcpu if it is migrated from a given source run queue to a given
> destination run queue.
> 
> Modified algorithm in balance_load and consider...
>  * if the load on lrqd and/or orqd is less than the number of their active
>    cpus, balance_load will look for vcpus that would have their soft affinity
>    improved by being pushed and/or pulled. Load does not need to be considered
>    since a run queue receiving a pushed or pulled vcpu is not being fully
>    utilized. This returns vcpus that may have been migrated away from their
>    soft affinity due to load balancing back to their soft affinity.
>  * in consider, vcpus that might be picked for migration because pushing or
>    pulling them decreases the load delta are not picked if their current run
>    queue's load is less than its active cpu count and if that migration would
>    harm their soft affinity. There's no need to push/pull if the load is under
>    capacity, and the vcpu would lose access to some or all of its soft cpus.
>  * in consider, if a push/pull/swap migration decreases the load delta by a
>    similar amount to another push/pull/swap migration, then use soft cpu gain
>    as a tie breaker. This allows load to continue to balance across run queues,
>    but favors soft affinity gains if the load deltas are close.
> 
> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>

First of all, thank you for doing the careful work to try to make this
happen.  The logic here is *very* complicated!

One minor thing re the changelog: You should probably mention somewhere
that you're also removing on-stack cpumask_t to use the per-cpu scratch
variable instead.

> ---
> Changes in v3:
>  * get_fallback_cpu: added balance loop to try to find a soft affinity cpu 
>  * runq_tickle: replaced use of local var mask with csched2_cpumask
>  * runq_tickle: added two balance loops, one for finding idle, but not
>    tickled, and other for finding non-idle with lowest credit
>  * choose_cpu: added balance loop to find cpu for given vcpu that has most
>    soft cpus (with run queue load being a tie breaker), or if none were found,
>    or not considering soft affinity, pick cpu from runq with least load
>  * balance_load / consider: removed code that ignored a migration if it meant
>    moving a vcpu away from its soft affinity; added migration of vcpus to
>    improve their soft affinity if the destination run queue was under load;
>    added check in consider, if current run queue is under load and migration
>    would hurt the vcpu's soft affinity, do not consider the migration; added
>    soft affinity tie breaker in consider if current load delta and consider
>    load delta are close
>  * added helper functions for soft affinity related changes to balance_load
> Changes in v2:
>  * Not submitted in version 2; focus was on the hard affinity patch
> ---
>  xen/common/sched_credit2.c |  344 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 313 insertions(+), 31 deletions(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index bbcfbf2..47d0bad 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -127,6 +127,14 @@
>  #define CSCHED2_CREDIT_RESET         0
>  /* Max timer: Maximum time a guest can be run for. */
>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
> +/* Used in balance_load to specify migration direction. */
> +#define CSCHED2_PULL                 0
> +#define CSCHED2_PUSH                 1
> +/*
> + * Used in balance_load to decide if deltas are close enough to use soft
> + * affinity as a tie breaker.
> + */
> +#define CSCHED2_DIVIDE_BY_16         4
>  
>  
>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
> @@ -288,15 +296,33 @@ struct csched2_dom {
>   */
>  static int get_fallback_cpu(struct csched2_vcpu *svc)

Don't forget to update the comment. :-)

>  {
> +    int balance_step;
> +
>      if ( likely(cpumask_test_cpu(svc->vcpu->processor,
>          svc->vcpu->cpu_hard_affinity)) )
>          return svc->vcpu->processor;
>  
> -    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> -        &svc->rqd->active);
> -    if ( cpumask_empty(csched2_cpumask) )
> -        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> -            VCPU2ONLINE(svc->vcpu));
> +    for_each_sched_balance_step( balance_step )
> +    {
> +        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
> +            && !__vcpu_has_soft_affinity(svc->vcpu,
> +                svc->vcpu->cpu_hard_affinity) )
> +            continue;
> +
> +        sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &svc->rqd->active);
> +        if ( !cpumask_empty(csched2_cpumask) )
> +            break;
> +        else
> +        {
> +            sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
> +            cpumask_and(csched2_cpumask, csched2_cpumask,
> +                VCPU2ONLINE(svc->vcpu));
> +            if ( !cpumask_empty(csched2_cpumask) )
> +                break;
> +        }

I think we don't need the second half here to be in an "else" clause --
if the statement in the if() is true, then it will break out of the loop
and not execute any further.
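
I.e., just (untested sketch):

        sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
        cpumask_and(csched2_cpumask, csched2_cpumask, &svc->rqd->active);
        if ( !cpumask_empty(csched2_cpumask) )
            break;

        sched_balance_cpumask(svc->vcpu, balance_step, csched2_cpumask);
        cpumask_and(csched2_cpumask, csched2_cpumask, VCPU2ONLINE(svc->vcpu));
        if ( !cpumask_empty(csched2_cpumask) )
            break;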

> +    }
> +
>      ASSERT( !cpumask_empty(csched2_cpumask) );
>  
>      return cpumask_any(csched2_cpumask);
> @@ -516,8 +542,8 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>      int i, ipid=-1;
>      s_time_t lowest=(1<<30);
>      struct csched2_runqueue_data *rqd = RQD(ops, cpu);
> -    cpumask_t mask;
>      struct csched2_vcpu * cur;
> +    int balance_step;
>  
>      d2printk("rqt %pv curr %pv\n", new->vcpu, current);
>  
> @@ -534,26 +560,43 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>      
> +    for_each_sched_balance_step ( balance_step )
> +    {
> +        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
> +            && !__vcpu_has_soft_affinity(new->vcpu,
> +                new->vcpu->cpu_hard_affinity) )
> +            continue;
> +
>          /* Get a mask of idle, but not tickled, that new is allowed to run on. */

I think we want this comment moved above the for_each_ loop, and
probably have it say something like

/* Look for an idle, untickled cpu in the vcpu's soft affinity, then its
hard affinity. */

(Formatted properly, obviously)

> -        cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
> -        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> -    
> +        sched_balance_cpumask(new->vcpu, balance_step, csched2_cpumask);
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &rqd->idle);
> +        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->tickled);
> +
>          /* If it's not empty, choose one */
> -        i = cpumask_cycle(cpu, &mask);
> +        i = cpumask_cycle(cpu, csched2_cpumask);
>          if ( i < nr_cpu_ids )
>          {
>              ipid = i;
>              goto tickle;
>          }
> +    }
>  
>      /* Otherwise, look for the non-idle cpu with the lowest credit,
>       * skipping cpus which have been tickled but not scheduled yet,
>       * that new is allowed to run on. */
> -        cpumask_andnot(&mask, &rqd->active, &rqd->idle);
> -        cpumask_andnot(&mask, &mask, &rqd->tickled);
> -        cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> +    for_each_sched_balance_step ( balance_step )
> +    {
> +        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
> +            && !__vcpu_has_soft_affinity(new->vcpu,
> +                new->vcpu->cpu_hard_affinity) )
> +            continue;
> +
> +        sched_balance_cpumask(new->vcpu, balance_step, csched2_cpumask);
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &rqd->active);
> +        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->idle);
> +        cpumask_andnot(csched2_cpumask, csched2_cpumask, &rqd->tickled);
>  
> -        for_each_cpu(i, &mask)
> +        for_each_cpu(i, csched2_cpumask)

So here you go through this loop looking for the lowest credit once for
soft affinity and once for hard affinity; but you don't break out of the
loop when you're doing soft affinity, so unless I'm missing something,
the effect is the same as just doing the hard affinity (since soft is
guaranteed to be a subset of hard).

Up till this point, we are favoring (in order):
 1. idle, non-tickled vcpus in soft affinity
 2. idle, non-tickled vcpus in hard affinity

By putting this down here, we're already favoring idle vcpus in the hard
affinity over non-idle vcpus in the soft affinity.  I think that makes
sense.

We could at this point just go through looking for the lowest credit in
the hard affinity, getting rid of the loop entirely.  That gives us:
 3: cpu whose currently running vcpu has the lowest credit

But it might be worth doing the loop for the "soft" step first, and then
checking whether the cpu with the lowest credit we've found yet is lower
than new's credit.  If so, it would allow the vcpu to run immediately,
while staying within its soft affinity.  Only if the lowest credit we've
seen is higher than the vcpu we're looking to place should we then go
find the lowest one in the whole hard affinity.

That would give us instead:
 3: cpu in the soft affinity whose currently running vcpu has the lowest
credit lower than the vcpu we're looking to place
 4: cpu in the hard affinity whose currently running vcpu has the lowest
credit

Whatever we choose, we probably want to make sure there's a comment
explaining the high-level view.
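
FWIW, a rough sketch of steps 3 and 4 (untested; reusing the cur / lowest /
ipid logic from the existing loop):

    /* Step 3: lowest credit within the soft affinity mask, but only
     * tickle if new would actually get to run there right away. */
    for_each_cpu(i, csched2_cpumask)  /* soft, active, non-idle, non-tickled */
    {
        cur = CSCHED2_VCPU(curr_on_cpu(i));
        if ( cur->credit < lowest )
        {
            ipid = i;
            lowest = cur->credit;
        }
    }
    if ( ipid != -1 && lowest < new->credit )
        goto tickle;

    /* Step 4: otherwise, take the lowest credit across the whole hard
     * affinity mask, as the code does now. */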

---

OK, that's enough for one day; I'll get to the rest either tomorrow or
next week.

You know, you make 4 changes here; in get_fallback_cpu, runq_tickle,
choose_cpu, and balance_load.  Each of them are actually quite
independent, and many of them quite complicated.

I think it might be a good idea, when you resend them, to break it down
into four patches.  It's a bit unorthodox to send a patch which only
partially implements a feature like soft affinity; but in this case I
think it makes sense.  That way we can discuss, review, and sign-off on
each one individually.  And there will also be space in each changelog
to describe in more detail what's being done and why.

What do you (both Justin and Dario) think?

 -George


* Re: [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity
  2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
  2015-03-31 17:38   ` George Dunlap
@ 2015-04-20 15:38   ` George Dunlap
  2015-04-22 16:16   ` George Dunlap
  2 siblings, 0 replies; 17+ messages in thread
From: George Dunlap @ 2015-04-20 15:38 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: Dario Faggioli, henric, xen-devel

On Thu, Mar 26, 2015 at 9:48 AM, Justin T. Weaver <jtweaver@hawaii.edu> wrote:
>  * choose_cpu
>
> choose_cpu now tries to find the run queue with the most cpus in the given
> vcpu's soft affinity. It uses minimum run queue load as a tie breaker.
[snip]
>  * choose_cpu: added balance loop to find cpu for given vcpu that has most
>    soft cpus (with run queue load being a tie breaker), or if none were found,
>    or not considering soft affinity, pick cpu from runq with least load
[snip]
> @@ -1086,7 +1130,7 @@ static int
>  choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
> -    int i, min_rqi = -1, new_cpu;
> +    int i, rqi = -1, new_cpu, max_soft_cpus = 0, balance_step;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
>

Hey Justin -- sorry for taking so long to get back to this one.

Before getting into the changes to choose_cpu(): it looks like on the
__CSFLAG_runq_migrate_request path (starting with "First check to see
if we're here because someone else suggested a place for us to move"),
we only consider the hard affinity, not the soft affinity.  Is that
intentional?

> @@ -1143,9 +1187,28 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>
>      min_avgload = MAX_LOAD;
>
> -    /* Find the runqueue with the lowest instantaneous load */
> +    /*
> +     * Find the run queue with the most cpus in vc's soft affinity. If there
> +     * is more than one queue with the highest soft affinity cpu count, then
> +     * pick the one with the lowest instantaneous run queue load. If the
> +     * vcpu does not have soft affinity, then only try to find the run queue
> +     * with the lowest instantaneous load.
> +     */
> +    for_each_sched_balance_step( balance_step )
> +    {
> +        if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
> +            && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +            continue;
> +
> +        if ( balance_step == SCHED_BALANCE_HARD_AFFINITY && rqi > -1 )
> +        {
> +            balance_step = SCHED_BALANCE_SOFT_AFFINITY;
> +            break;
> +        }
> +
>          for_each_cpu(i, &prv->active_queues)
>          {
> +            int rqd_soft_cpus = 0;
>              struct csched2_runqueue_data *rqd;
>              s_time_t rqd_avgload = MAX_LOAD;
>
> @@ -1163,35 +1226,61 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>               * so it is possible here that svc does not have hard affinity
>               * with any of the pcpus of svc's currently assigned run queue.
>               */
> +            sched_balance_cpumask(vc, balance_step, csched2_cpumask);
>              if ( rqd == svc->rqd )
>              {
> -                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                if ( cpumask_intersects(csched2_cpumask, &rqd->active) )
>                      rqd_avgload = rqd->b_avgload - svc->avgload;
> +                if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY )
> +                {
> +                    cpumask_and(csched2_cpumask, csched2_cpumask,
> +                        &rqd->active);
> +                    rqd_soft_cpus = cpumask_weight(csched2_cpumask);
> +                }
>              }
>              else if ( spin_trylock(&rqd->lock) )
>              {
> -                if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                if ( cpumask_intersects(csched2_cpumask, &rqd->active) )
>                      rqd_avgload = rqd->b_avgload;
> +                if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY )
> +                {
> +                    cpumask_and(csched2_cpumask, csched2_cpumask,
> +                        &rqd->active);
> +                    rqd_soft_cpus = cpumask_weight(csched2_cpumask);
> +                }
>
>                  spin_unlock(&rqd->lock);
>              }
>              else
>                  continue;
>
> -            if ( rqd_avgload < min_avgload )
> +            if ( balance_step == SCHED_BALANCE_SOFT_AFFINITY
> +                && rqd_soft_cpus > 0
> +                && ( rqd_soft_cpus > max_soft_cpus
> +                    ||
> +                   ( rqd_soft_cpus == max_soft_cpus
> +                    && rqd_avgload < min_avgload )) )
> +            {
> +                max_soft_cpus = rqd_soft_cpus;
> +                rqi = i;
> +                min_avgload = rqd_avgload;
> +            }
> +            else if ( balance_step == SCHED_BALANCE_HARD_AFFINITY
> +                     && rqd_avgload < min_avgload )
>              {
> +                rqi = i;
>                  min_avgload = rqd_avgload;
> -                min_rqi=i;
>              }
> +        }
>      }
>
>      /* We didn't find anyone (most likely because of spinlock contention). */
> -    if ( min_rqi == -1 )
> +    if ( rqi == -1 )
>          new_cpu = get_fallback_cpu(svc);
>      else
>      {
> -        cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
> -            &prv->rqd[min_rqi].active);
> +        sched_balance_cpumask(vc, balance_step, csched2_cpumask);
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &prv->rqd[rqi].active);
>          new_cpu = cpumask_any(csched2_cpumask);
>          BUG_ON(new_cpu >= nr_cpu_ids);
>      }

So the general plan here looks right; but is there really a need to go
through the whole thing twice?  Couldn't we keep track of "rqi with
highest # cpus in soft affinity / lowest avgload" and "rqi with lowest
global avgload" in one pass, and then choose whichever one looks the
best at the end?
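
Something like this, perhaps (rough sketch; soft_rqi / soft_min_avgload are
just illustrative new locals):

    /* Inside the single for_each_cpu(i, &prv->active_queues) pass,
     * after computing rqd_avgload and rqd_soft_cpus as the patch does: */
    if ( rqd_soft_cpus > 0
         && (rqd_soft_cpus > max_soft_cpus
             || (rqd_soft_cpus == max_soft_cpus
                 && rqd_avgload < soft_min_avgload)) )
    {
        max_soft_cpus = rqd_soft_cpus;
        soft_min_avgload = rqd_avgload;
        soft_rqi = i;
    }
    if ( rqd_avgload < min_avgload )
    {
        min_avgload = rqd_avgload;
        rqi = i;
    }

    /* ...and after the loop, prefer the soft affinity candidate if any: */
    if ( soft_rqi != -1 )
        rqi = soft_rqi;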

I think for closure's sake I'm going to send this e-mail, and review the
load balancing step in another mail (which will come later this
evening).

 -George


* Re: [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity
  2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
  2015-03-31 17:38   ` George Dunlap
  2015-04-20 15:38   ` George Dunlap
@ 2015-04-22 16:16   ` George Dunlap
  2 siblings, 0 replies; 17+ messages in thread
From: George Dunlap @ 2015-04-22 16:16 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: Dario Faggioli, henric, xen-devel

On Thu, Mar 26, 2015 at 9:48 AM, Justin T. Weaver <jtweaver@hawaii.edu> wrote:
[snip]
> Added a function to determine the number of soft cpus gained (or lost) by a
> given vcpu if it is migrated from a given source run queue to a given
> destination run queue.
>
> Modified algorithm in balance_load and consider...
>  * if the load on lrqd and/or orqd is less than the number of their active
>    cpus, balance_load will look for vcpus that would have their soft affinity
>    improved by being pushed and/or pulled. Load does not need to be considered
>    since a run queue receiving a pushed or pulled vcpu is not being fully
>    utilized. This returns vcpus that may have been migrated away from their
>    soft affinity due to load balancing back to their soft affinity.
>  * in consider, vcpus that might be picked for migration because pushing or
>    pulling them decreases the load delta are not picked if their current run
>    queue's load is less than its active cpu count and if that migration would
>    harm their soft affinity. There's no need to push/pull if the load is under
>    capacity, and the vcpu would lose access to some or all of its soft cpus.
>  * in consider, if a push/pull/swap migration decreases the load delta by a
>    similar amount to another push/pull/swap migration, then use soft cpu gain
>    as a tie breaker. This allows load to continue to balance across run queues,
>    but favors soft affinity gains if the load deltas are close.

OK, I've done a lot of thinking about this over the last few days, and
here's what I've come up with.

As I said before, there are basically two different activities in this
patch.  The first is to try to honor soft affinity  *given* where the
vcpu is placed; the other is to try to take soft affinity into account
when *placing* a vcpu.

The first bit is basically correct, and I've already given feedback on it.

The second half certainly represents a lot of hard work and I think
it's not a *bad* solution; but overall I'm not satisfied with it and I
think we can do better.

At a basic level, I don't like adding in the "if the load is less than
the number of pcpus then ignore load and only pull based on
softaffinity" shortcut to balance_load.  I think that it's likely that
the two halves of balance_load will have very different ideas about
what "good" looks like, and that as a result vcpus will end up getting
bounced back and forth depending on which particular algorithm ends up
being activated.

Apart from that, the whole thing is rather ad-hoc and makes it much
more complicated to understand what is going on, or predict what would
happen.

My original plan wrt load balancing was to calculate some sort of
"badness" (or perhaps "expected overhead") to the current placement
situation, and try to minimize that.  The idea was that you could then
try to boil all the factors down to a single number to optimize.  For
example, you could add in "badness" for spinning up extra cores or
sockets (or "goodness" for having them entirely empty), and then
automatically balance performance vs power consumption; or, in this
case, to balance the cost of running with remote memory accesses (as
running outside your soft affinity would normally mean) vs the cost of
waiting for the cpu.

Of course, if you're only considering load, then no matter what you
do, minimizing "badness" means minimizing load, so there's no reason
to complicate things, which is why I wrote the load balancing code the
way I did.  But I've been thinking for the last few days and I think I
can probably structure it differently to make it easier to get a
consistent evaluation of the goodness / badness of a given setup.

So what about this:

1. You re-submit this series, with the comments I've given, and the
first half of this patch (i.e., leaving out the changes to choose_cpu
and balance_load)

2. I'll work on re-working the load balancing code based on my ideas;
i.e., calculating "badness" of the load and deciding  based on that.
I'll probably try to do the soft-affinity factoring myself as a proof
of concept.

What do you think?  (Feel free to weigh in here too, Dario.)

---

Also, a few comments on the code (just for feedback):

> @@ -1207,15 +1296,75 @@ typedef struct {
>      /* NB: Modified by consider() */
>      s_time_t load_delta;
>      struct csched2_vcpu * best_push_svc, *best_pull_svc;
> +    int soft_affinity_boost;
> +    bool_t valid_sa_boost;
>      /* NB: Read by consider() */
>      struct csched2_runqueue_data *lrqd;
>      struct csched2_runqueue_data *orqd;
>  } balance_state_t;
>
> +/*
> + * Return the number of pcpus gained in vc's soft affinity mask that vc can
> + * run on if vc is migrated from run queue src_rqd to run queue dst_rqd.
> + */
> +static int get_soft_affinity_gain(const struct vcpu *vc,
> +                                  const struct csched2_runqueue_data *src_rqd,
> +                                  const struct csched2_runqueue_data *dst_rqd)
> +{
> +    /*
> +     * Locks must already be held for src_rqd and dst_rqd.
> +     * Function assumes vc has at least hard affinity with one or more
> +     * pcpus in both the source and destination run queues.
> +     */
> +
> +    /* Does vcpu not have soft affinity? */
> +    if ( !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +        return 0;
> +
> +    /* Does vcpu have soft affinity with pcpu(s) in the destination runq? */
> +    sched_balance_cpumask(vc, SCHED_BALANCE_SOFT_AFFINITY, csched2_cpumask);
> +    if ( cpumask_intersects(csched2_cpumask, &dst_rqd->active) )
> +    {
> +        int soft_cpus_dst;
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &dst_rqd->active);
> +        soft_cpus_dst = cpumask_weight(csched2_cpumask);
> +
> +        /* ... and soft affinity with the source run queue? */
> +        sched_balance_cpumask(vc, SCHED_BALANCE_SOFT_AFFINITY,
> +            csched2_cpumask);
> +        if ( cpumask_intersects(csched2_cpumask, &src_rqd->active) )
> +        {
> +            int soft_cpus_src;
> +            cpumask_and(csched2_cpumask, csched2_cpumask, &src_rqd->active);
> +            soft_cpus_src = cpumask_weight(csched2_cpumask);
> +
> +            /* Soft affinity to soft affinity migration. */
> +            return soft_cpus_dst - soft_cpus_src;
> +        }
> +        else
> +            /* Hard affinity to soft affinity migration. */
> +            return soft_cpus_dst;
> +    }
> +    else
> +    {
> +        int soft_cpus_src = 0;
> +        cpumask_and(csched2_cpumask, csched2_cpumask, &src_rqd->active);
> +        soft_cpus_src = cpumask_weight(csched2_cpumask);
> +
> +        /*
> +         * Hard affinity to hard affinity migration or soft affinity to hard
> +         * affinity migration.
> +         */
> +        return -soft_cpus_src;
> +    }

Is there a reason you can't just do the following?
---
int soft_cpus_dst=0, soft_cpus_src=0;

sched_balance_cpumask(...)

if(cpumask_intersects(cpumask, dst)) {
  cpumask_and(...)
  soft_cpus_dst = cpumask_weight(...)

  /* Restore soft affinity mask for next check */
  sched_balance_cpumask(...)
}

if(cpumask_intersects(cpumask, src)) {
  cpumask_and(...)
  soft_cpus_src = cpumask_weight(...)
}

return soft_cpus_dst - soft_cpus_src;
---

It seems like that would make it a lot easier to figure out what's going on.

> @@ -1237,12 +1386,88 @@ static void consider(balance_state_t *st,
>      if ( delta < 0 )
>          delta = -delta;
>
> -    if ( delta < st->load_delta )
> +    /*
> +     * Use soft affinity gain as a tie breaker if at least one migration has
> +     * already been picked and stored in the balance state, and the absolute
> +     * value of the difference between the delta in st and the new delta being
> +     * considered here is less than 1/16th of the load_window_shift.
> +     */
> +    delta_diff = delta - st->load_delta;
> +    if ( delta_diff < 0 )
> +        delta_diff = -delta_diff;
> +    if ( (st->best_push_svc != NULL || st->best_pull_svc != NULL)
> +        && delta_diff < 1<<(load_window_shift - CSCHED2_DIVIDE_BY_16) )
>      {
> -        st->load_delta = delta;
> -        st->best_push_svc=push_svc;
> -        st->best_pull_svc=pull_svc;
> +        int st_soft_gain = 0, consider_soft_gain = 0;
> +
> +        /* Find the soft affinity gain for the migration in st. */
> +        if ( !st->valid_sa_boost )
> +            if ( st->best_push_svc )
> +                st_soft_gain += get_soft_affinity_gain(
> +                    st->best_push_svc->vcpu, st->lrqd, st->orqd);
> +            if ( st->best_pull_svc )
> +                st_soft_gain += get_soft_affinity_gain(
> +                    st->best_pull_svc->vcpu, st->orqd, st->lrqd);
> +        else
> +            st_soft_gain = st->soft_affinity_boost;

Looks like you're missing some braces around those two indented if's.
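
I.e., presumably the intent was (sketch):

        if ( !st->valid_sa_boost )
        {
            if ( st->best_push_svc )
                st_soft_gain += get_soft_affinity_gain(
                    st->best_push_svc->vcpu, st->lrqd, st->orqd);
            if ( st->best_pull_svc )
                st_soft_gain += get_soft_affinity_gain(
                    st->best_pull_svc->vcpu, st->orqd, st->lrqd);
        }
        else
            st_soft_gain = st->soft_affinity_boost;

As it stands, only the first if() is guarded by !st->valid_sa_boost, and the
else ends up pairing with the second if().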

> +    /* Only consider load delta. */
> +    if ( delta < st->load_delta )
> +        st->valid_sa_boost = 0;
> +
> +        /*
> +         * If the migration results in a loss of some or all soft cpus and the
> +         * vcpu's current run queue has less load than physical processors, do
> +         * not use the migration.
> +         */
> +        if ( push_svc &&
> +            (st->lrqd->load < cpumask_weight(&st->lrqd->active) &&
> +             get_soft_affinity_gain(push_svc->vcpu, st->lrqd, st->orqd) < 0) )
> +            return;
> +        if ( pull_svc &&
> +            (st->orqd->load < cpumask_weight(&st->orqd->active) &&
> +             get_soft_affinity_gain(pull_svc->vcpu, st->orqd, st->lrqd) < 0) )
> +            return;
> +    else
> +        return;

Missing braces again.

 -George


* Re: [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file
  2015-03-26  9:48 ` [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file Justin T. Weaver
@ 2015-04-23 15:22   ` Dario Faggioli
  0 siblings, 0 replies; 17+ messages in thread
From: Dario Faggioli @ 2015-04-23 15:22 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel



On Wed, 2015-03-25 at 23:48 -1000, Justin T. Weaver wrote:
> Move affinity balancing related functions and defines from sched_credit.c to
> sched-if.h so other schedulers can use them. Change name prefixes from csched
> to sched since they are no longer specific to the credit scheduler.
> 
It's probably already evident from the wording above, but it is always
good to state explicitly that you do not intend or expect any
functional changes.

Just one more line saying exactly that "No functional changes intended"
is all that's necessary.

> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
>
With the above,

Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

Regards,
Dario


* Re: [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier
  2015-03-26  9:48 ` [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier Justin T. Weaver
@ 2015-04-23 15:35   ` Dario Faggioli
  0 siblings, 0 replies; 17+ messages in thread
From: Dario Faggioli @ 2015-04-23 15:35 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel



On Wed, 2015-03-25 at 23:48 -1000, Justin T. Weaver wrote:
> Functions runq_tickle and choose_cpu both have code sections that get turned
> into loops in patch 4 v3, soft affinity. Do the indenting here to make the
> patch 4 diff section easier to read. 
>
Yeah, I know what you mean, and thanks for trying to make our job
easier! :-)

Still, it's a rather uncommon thing to do (it's probably the first time
I've seen it), and I can't call myself a fan of it. So, I'm not sure what
others think, but I'd say just drop this patch and do all the
functional and non-functional changes at once.

Hopefully, splitting patch 4 the way George suggests will make the result
easier to review anyway.

Thanks and Regards,
Dario


* Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-31 14:37   ` George Dunlap
  2015-03-31 17:14     ` Dario Faggioli
@ 2015-04-23 16:00     ` Dario Faggioli
  1 sibling, 0 replies; 17+ messages in thread
From: Dario Faggioli @ 2015-04-23 16:00 UTC (permalink / raw)
  To: George Dunlap; +Cc: henric, Justin T. Weaver, xen-devel



On Tue, 2015-03-31 at 15:37 +0100, George Dunlap wrote:
> On 03/26/2015 09:48 AM, Justin T. Weaver wrote:

> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c

> > @@ -194,6 +195,12 @@ int opt_overload_balance_tolerance=-3;
> >  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
> >  
> >  /*
> > + * Use this to avoid having too many cpumask_t structs on the stack
> > + */
> > +static cpumask_t **scratch_mask = NULL;
> > +#define csched2_cpumask scratch_mask[smp_processor_id()]
> 
> Since I'm asking for changes below:  It's not clear to me when I'm
> scanning the code that csched2_cpumask is a scratch variable.  What
> about calling the actual variable _scratch_cpumask and havind the
> #define be scratch_cpumask?
> 
+1

> >      /* Find the runqueue with the lowest instantaneous load */
> >      for_each_cpu(i, &prv->active_queues)
> >      {
> >          struct csched2_runqueue_data *rqd;
> > -        s_time_t rqd_avgload;
> > +        s_time_t rqd_avgload = MAX_LOAD;
> >  
> >          rqd = prv->rqd + i;
> >  
> >          /* If checking a different runqueue, grab the lock,
> > -         * read the avg, and then release the lock.
> > +         * check hard affinity, read the avg, and then release the lock.
> >           *
> >           * If on our own runqueue, don't grab or release the lock;
> >           * but subtract our own load from the runqueue load to simulate
> > -         * impartiality */
> > +         * impartiality.
> > +         *
> > +         * svc's hard affinity may have changed; this function is the
> > +         * credit 2 scheduler's first opportunity to react to the change,
> > +         * so it is possible here that svc does not have hard affinity
> > +         * with any of the pcpus of svc's currently assigned run queue.
> > +         */
> >          if ( rqd == svc->rqd )
> >          {
> > -            rqd_avgload = rqd->b_avgload - svc->avgload;
> > +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> > +                rqd_avgload = rqd->b_avgload - svc->avgload;
> >          }
> >          else if ( spin_trylock(&rqd->lock) )
> >          {
> > -            rqd_avgload = rqd->b_avgload;
> > +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> > +                rqd_avgload = rqd->b_avgload;
> > +
> >              spin_unlock(&rqd->lock);
> >          }
> >          else
> 
> Since we're already potentially falling through and doing the comparison
> below with an unchanged rqd_avgload (which has now been set to MAX_LOAD
> above), I wonder if it makes more sense to remove this "else continue"
> here, just to avoid confusing people.
>
Agreed! It was me who suggested to Justin how to reorganize this block of
code during v2's review... my bad for not spotting that the 'else / continue' was
useless then! :-P

Regards,
Dario


* Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity
  2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
  2015-03-31 14:37   ` George Dunlap
@ 2015-05-06 12:39   ` Dario Faggioli
  1 sibling, 0 replies; 17+ messages in thread
From: Dario Faggioli @ 2015-05-06 12:39 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel



On Wed, 2015-03-25 at 23:48 -1000, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
And here I am again... reviewing this is really taking _ages_...
Sorry! :-(

> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
> ---
> Changes in v3:

>  * Added #define for VCPU2ONLINE (probably should be factored out of
>    schedule.c and here, and put into a common header)
>
+1 for moving to a header. I think xen/include/xen/sched.h is the best
fit.
  
>  * Moved vc->processor assignment in function csched2_vcpu_migrate to an else
>    block to execute only if current and destination run queues are the same;
>    Note: without the processor assignment here the vcpu might be assigned to a
>    processor it no longer is allowed to run on. In that case, function
>    runq_candidate may only get called for the vcpu's old processor, and
>    runq_candidate will no longer let a vcpu run on a processor that it's not
>    allowed to run on (because of the hard affinity check first introduced in
>    v1 of this patch).
>
As I think I said already, this deserves to be somewhere where it won't
disappear, which is either the changelog or a comment in the code. I'd
go for the latter.

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 7581731..af716e4 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -176,6 +176,7 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define VCPU2ONLINE(_v)     cpupool_online_cpumask((_v)->domain->cpupool)
>
As said above, move this in a .h

>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Use this to avoid having too many cpumask_t structs on the stack
> + */
> +static cpumask_t **scratch_mask = NULL;
> +#define csched2_cpumask scratch_mask[smp_processor_id()]
> +
We covered this already.

> +/*
>   * Per-runqueue data
>   */
>  struct csched2_runqueue_data {
> @@ -268,6 +275,32 @@ struct csched2_dom {
>      uint16_t nr_vcpus;
>  };
>  
> +/*
> + * When a hard affinity change occurs, we may not be able to check some or
> + * all of the other run queues for a valid new processor for the given vcpu
> + * because (in function choose_cpu) either the trylock on the private data
> + * failed or the trylock on each run queue with valid processor(s) for svc
> + * failed. In these cases, this function is used to pick a pcpu that svc can
> + * run on. It returns svc's current pcpu if valid, or a different pcpu from
> + * the run queue svc is currently assigned to, or if none of those are valid,
> + * it returns a pcpu from the intersection of svc's hard affinity and the
> + * domain's online cpumask.
>
I'd make the last part a bulleted list:

"It returns, in order of preference:
 * svc's current pcpu, if still valid;
 * another (valid) pcpu from the svc's current runq, if any;
 * just a valid pcpu, i.e., a pcpu that is online, in svc's domain's 
   cpupool, and in svc's hard affinity."

> + */
> +static int get_fallback_cpu(struct csched2_vcpu *svc)
> +{
> +    if ( likely(cpumask_test_cpu(svc->vcpu->processor,
> +        svc->vcpu->cpu_hard_affinity)) )
> +        return svc->vcpu->processor;
> +
> +    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> +        &svc->rqd->active);
> +    if ( cpumask_empty(csched2_cpumask) )
> +        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> +            VCPU2ONLINE(svc->vcpu));
> +    ASSERT( !cpumask_empty(csched2_cpumask) );
> +
> +    return cpumask_any(csched2_cpumask);
>
I like the idea of cpumask_any(), but I'm not sure paying the price of
making this random is worth it here... after all, it's a fallback, so I
think just cpumask_first() is enough.

> +}

Also, coding style: align with the relevant opening parenthesis:

    if ( likely(cpumask_test_cpu(svc->vcpu->processor,
                                 svc->vcpu->cpu_hard_affinity)) )
    ...
    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
                &svc->rqd->active);
    ...

And the same everywhere else, of course.

All this being said, I think the variant here below is both easier to
understand (as it maps almost 1:1 with the bulleted list in the comment)
and a bit more efficient. Do you like it?

static int get_fallback_cpu(struct csched2_vcpu *svc)
{
    int cpu;

    if ( likely(cpumask_test_cpu(svc->vcpu->processor,
                svc->vcpu->cpu_hard_affinity)) )
        return svc->vcpu->processor;

    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
                &svc->rqd->active);
    cpu = cpumask_first(csched2_cpumask);
    if ( likely(cpu < nr_cpu_ids) )
        return cpu;

    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
                VCPU2ONLINE(svc->vcpu));
    ASSERT( !cpumask_empty(csched2_cpumask) );
    return cpumask_first(csched2_cpumask);
}

> @@ -1223,7 +1272,12 @@ static void migrate(const struct scheduler *ops,
>              on_runq=1;
>          }
>          __runq_deassign(svc);
> -        svc->vcpu->processor = cpumask_any(&trqd->active);
> +
> +        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> +            &trqd->active);
> +        svc->vcpu->processor = cpumask_any(csched2_cpumask);
> +        BUG_ON(svc->vcpu->processor >= nr_cpu_ids);
> +
>          __runq_assign(svc, trqd);
>          if ( on_runq )
>          {
> @@ -1237,6 +1291,20 @@ static void migrate(const struct scheduler *ops,
>      }
>  }
>  
> +/*
> + * Migration of vcpu svc to run queue rqd is a valid option if svc is not
>
"Migration of svc" is fine.

Also, "runqueue" seems the dominant variant (over "run queue"), so I'd
stick to that.

> + * already flagged to migrate and if svc is allowed to run on at least one of
> + * the pcpus assigned to rqd based on svc's hard affinity mask.
> + */
> +static bool_t valid_vcpu_migration(struct csched2_vcpu *svc,
> +                                   struct csched2_runqueue_data *rqd)
>
The name, I don't like it much. In credit1 we have
__csched_vcpu_is_migrateable(), which does pretty much the same. I don't
care about the "__csched" part, but I like vcpu_is_migrateable() better
than valid_vcpu_migration().

> +{
> +    if ( test_bit(__CSFLAG_runq_migrate_request, &svc->flags)
> +        || !cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active) )
> +        return 0;
> +    else
> +        return 1;
> +}
>
I'd also make it "positive", i.e., I'd structure the condition so that,
if it's true, the function returns true (and, BTW, you should use 'true'
and 'false'), as I find that easier to read.

Finally, the else is pointless. Do either:

static bool_t vcpu_is_migrateable(...)
{
    return <condition>;
}

or:

static bool_t vcpu_is_migrateable(...)
{
    if ( <condition> )
        return true;
    return false;
}
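
Putting the rename and the "positive" structure together, just as a
sketch (using the same fields and helpers as in your patch):

static bool_t vcpu_is_migrateable(struct csched2_vcpu *svc,
                                  struct csched2_runqueue_data *rqd)
{
    /*
     * svc can be migrated to rqd iff no migration request is already
     * pending for it and its hard affinity has at least one pcpu in rqd.
     */
    return !test_bit(__CSFLAG_runq_migrate_request, &svc->flags)
           && cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active);
}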

Right... As I'm sure you've noticed, there still are some changes needed
(at least IMO), but they're rather minor adjustments... So great work on
this so far, we're really getting there.

Thanks and sorry again for the latency.

Regards,
Dario


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity
  2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
                   ` (3 preceding siblings ...)
  2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
@ 2015-09-17 14:27 ` Dario Faggioli
  2015-09-17 15:15   ` Dario Faggioli
  4 siblings, 1 reply; 17+ messages in thread
From: Dario Faggioli @ 2015-09-17 14:27 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel


On Wed, 2015-03-25 at 23:48 -1000, Justin T. Weaver wrote:

> Here are the results I gathered from testing. Each guest had 2 vcpus and 1GB
> of memory. 
>
Hey, thanks for doing the benchmarking as well! :-)

> The hardware consisted of two quad core Intel Xeon X5570 processors
> and 8GB of RAM per node. The sysbench memory test was run with the num-threads
> option set to four, and was run simultaneously on two, then six, then ten VMs.
> Each result below is an average of three runs.
> 
> -------------------------------------------------------
> | Sysbench memory, throughput MB/s (higher is better) |
> -------------------------------------------------------
> | #VMs |  No affinity  |   Pinning  | NUMA scheduling |
> |   2  |    417.01     |    406.16  |     428.83      |
> |   6  |    389.31     |    407.07  |     402.90      |
> |  10  |    317.91     |    320.53  |     321.98      |
> -------------------------------------------------------
> 
> Despite the overhead added, NUMA scheduling performed best in both the two and
> ten VM tests.
> 
Nice. Just to be sure, is my understanding of the column labels
accurate?
 - 'No affinity'     == neither hard nor soft affinity for any VM
 - 'Pinning'         == hard affinity used to pin VMs to NUMA nodes
                        (evenly, I guess?); soft affinity untouched
 - 'NUMA scheduling' == soft affinity used to associate VMs to NUMA
                        nodes (evenly, I guess?); hard affinity
                        untouched

Also, can you confirm that all the hard and soft affinity settings were
done at VM creation time, i.e., that they were effectively influencing
where the memory of the VMs was being allocated? (It looks like so, from
the numbers, but I wanted to be sure...)
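
I.e., I'm assuming something along these lines in the guests' config
files (cpu numbers purely as an example, and assuming the cpus= /
cpus_soft= xl.cfg options):

  # "Pinning" case: hard affinity to the pcpus of one node, at creation
  cpus = "0-3"

  # "NUMA scheduling" case: soft affinity only, hard affinity untouched
  cpus_soft = "0-3"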

Thanks again and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity
  2015-09-17 14:27 ` [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and " Dario Faggioli
@ 2015-09-17 15:15   ` Dario Faggioli
  0 siblings, 0 replies; 17+ messages in thread
From: Dario Faggioli @ 2015-09-17 15:15 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, xen-devel, henric


On Thu, 2015-09-17 at 16:27 +0200, Dario Faggioli wrote:

> Nice. Just to be sure, is my understanding of the column labels
> accurate?
>  - 'No affinity'     == neither hard nor soft affinity for any VM
>  - 'Pinning'         == hard affinity used to pin VMs to NUMA nodes
>                         (evenly, I guess?); soft affinity untouched
>  - 'NUMA scheduling' == soft affinity used to associate VMs to NUMA
>                         nodes (evenly, I guess?); hard affinity
>                         untouched
> 
> Also, can you confirm that all the hard and soft affinity settings were
> done at VM creation time, i.e., that they were effectively influencing
> where the memory of the VMs was being allocated? (It looks like so, from
> the numbers, but I wanted to be sure...)
> 
BTW, just to be clear, I'm actually reviewing v4 of this series... I'm
not re-reviewing v3. :-D

However, in the process of doing so, I was looking back at previous
submissions as well; I found this and decided it was worthwhile to ask.

I know that these measurements were done on v3 and are not valid for v4
(because that version intentionally leaves out some of the soft affinity
bits). Still, I think it is important to keep these numbers in
mind, as they provide (at least part of) the justification for doing the
whole hard and soft affinity work, and hence I asked for the
clarifications.

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-09-17 15:15 UTC | newest]

Thread overview: 17+ messages
2015-03-26  9:48 [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
2015-03-26  9:48 ` [PATCH v3 1/4] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
2015-03-31 14:37   ` George Dunlap
2015-03-31 17:14     ` Dario Faggioli
2015-03-31 17:32       ` George Dunlap
2015-04-23 16:00     ` Dario Faggioli
2015-05-06 12:39   ` Dario Faggioli
2015-03-26  9:48 ` [PATCH v3 2/4] sched: factor out per-vcpu affinity related code to common header file Justin T. Weaver
2015-04-23 15:22   ` Dario Faggioli
2015-03-26  9:48 ` [PATCH v3 3/4] sched: credit2: indent code sections to make review of patch 4/4 easier Justin T. Weaver
2015-04-23 15:35   ` Dario Faggioli
2015-03-26  9:48 ` [PATCH v3 4/4] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
2015-03-31 17:38   ` George Dunlap
2015-04-20 15:38   ` George Dunlap
2015-04-22 16:16   ` George Dunlap
2015-09-17 14:27 ` [PATCH v3 0/4] sched: credit2: introduce per-vcpu hard and " Dario Faggioli
2015-09-17 15:15   ` Dario Faggioli
