xen-devel.lists.xenproject.org archive mirror
* [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2
@ 2016-04-06 17:22 Dario Faggioli
  2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:22 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Justin Weaver, George Dunlap, Andrew Cooper,
	Tianyang Chen, Robert VanVossen, Uma Sharma, Josh Whitehead,
	Meng Xu, Jan Beulich

Hi,

Here's v2 of this series:
 http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg02544.html

It took a bit, but the changes to the tricky logic of switching schedulers on
pCPUs needed thorough re-testing. :-/

In any case, the series is smaller, and a few patches are reviewed and acked
already:
 - it's smaller because I followed George's suggestion and folded a few patches
   into one;
 - the patches that already have acks from the proper maintainers are (they're
   also marked with a 'k' in the git-provided summary, at the very bottom):

     xen: sched: implement .init_pdata in Credit, Credit2 and RTDS
     xen: sched: move pCPU initialization in an helper
     xen: sched: improve credit2 bootparams' scope, placement and signedness
     xen: sched: on Credit2, don't reprogram the timer if idle
     xen: sched: fix per-socket runqueue creation in credit2
     xen: sched: per-core runqueues as default in credit2
     xen: sched: implement vcpu hard affinity in Credit2

So, it's really "only" these 4 that still need people's (mostly,
George's) attention:

     xen: sched: make implementing .alloc_pdata optional
     xen: sched: close potential races when switching scheduler to CPUs
     xen: sched: allow for choosing credit2 runqueues configuration at boot
     xen: sched: privde some scratch space for not putting cpumasks on stack

Among these, "xen: sched: close potential races when switching scheduler to
CPUs" is the trickiest, and the one that has undergone the most changes
(following review comments). The others should be pieces of cake. :-P

There's a git branch for the series here:

 git://xenbits.xen.org/people/dariof/xen.git  rel/sched/credit2/fix-runq-and-haff-v2
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runq-and-haff-v2

Thanks in advance and Regards,
Dario
---
Dario Faggioli (9):
      xen: sched: make implementing .alloc_pdata optional
    k xen: sched: implement .init_pdata in Credit, Credit2 and RTDS
    k xen: sched: move pCPU initialization in an helper
      xen: sched: close potential races when switching scheduler to CPUs
    k xen: sched: on Credit2, don't reprogram the timer if idle
    k xen: sched: fix per-socket runqueue creation in credit2
      xen: sched: allow for choosing credit2 runqueues configuration at boot
    k xen: sched: per-core runqueues as default in credit2
      xen: sched: privde some scratch space for not putting cpumasks on stack

Justin Weaver (1):
    k xen: sched: implement vcpu hard affinity in Credit2

Uma Sharma (1):
    k xen: sched: improve credit2 bootparams' scope, placement and signedness

 docs/misc/xen-command-line.markdown |   19 ++
 xen/common/sched_arinc653.c         |   31 --
 xen/common/sched_credit.c           |  103 ++++++--
 xen/common/sched_credit2.c          |  436 ++++++++++++++++++++++++-----------
 xen/common/sched_rt.c               |  117 ++++-----
 xen/common/schedule.c               |   76 +++++-
 xen/include/xen/sched-if.h          |    7 +
 7 files changed, 496 insertions(+), 293 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
@ 2016-04-06 17:22 ` Dario Faggioli
  2016-04-07  4:56   ` Juergen Gross
  2016-04-07 11:24   ` George Dunlap
  2016-04-06 17:22 ` [PATCH v2 02/11] xen: sched: implement .init_pdata in Credit, Credit2 and RTDS Dario Faggioli
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:22 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, George Dunlap, Robert VanVossen, Josh Whitehead,
	Meng Xu, Jan Beulich

The .alloc_pdata scheduler hook must, before this change,
be implemented by all schedulers -- even those that
don't need to allocate anything.

Make it possible to just use the SCHED_OP(), like for
the other hooks, by using ERR_PTR() and IS_ERR() for
error reporting. This:
 - makes NULL a variant of success;
 - allows for errors other than ENOMEM to be properly
   communicated (if ever necessary).

This, in turn, means that schedulers not needing to
allocate any per-pCPU data can avoid implementing the
hook. In fact, the artificial implementation of
.alloc_pdata in the ARINC653 scheduler is removed (and,
while there, nuke .free_pdata too, as it is equally useless).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Robert VanVossen <robert.vanvossen@dornerworks.com>
Cc: Josh Whitehead <josh.whitehead@dornerworks.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <jgross@suse.com>
---
Changes from v1:
 * only update sd->sched_priv if alloc_pdata does not return
   an error (i.e., something for which IS_ERR() is true), so
   that xfree() can always be safely called on sd->sched_priv
   itself, as requested during review;
 * xen/err.h included in the .c files that actually need it,
   instead of in sched-if.h.
---
 xen/common/sched_arinc653.c |   31 -------------------------------
 xen/common/sched_credit.c   |    5 +++--
 xen/common/sched_credit2.c  |    2 +-
 xen/common/sched_rt.c       |    8 ++++----
 xen/common/schedule.c       |   27 +++++++++++++++++----------
 5 files changed, 25 insertions(+), 48 deletions(-)
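
To illustrate the convention this patch establishes (a minimal sketch, not
part of the patch; the "foo" scheduler and its types are hypothetical), an
.alloc_pdata implementation and its caller end up looking like this, with
NULL meaning "nothing to allocate, success" and real errors travelling via
ERR_PTR()/IS_ERR():

    /* Hypothetical scheduler: reports allocation failure via ERR_PTR(). */
    static void *
    foo_alloc_pdata(const struct scheduler *ops, int cpu)
    {
        struct foo_pcpu *spc = xzalloc(struct foo_pcpu);

        if ( spc == NULL )
            return ERR_PTR(-ENOMEM);  /* a real error, distinct from NULL */

        return spc;                   /* NULL would also count as success */
    }

    /* Caller side, as schedule.c does after this patch (abridged): */
    sched_priv = SCHED_OP(&ops, alloc_pdata, cpu);
    if ( IS_ERR(sched_priv) )
        return PTR_ERR(sched_priv);   /* -ENOMEM, or any other error code */
    sd->sched_priv = sched_priv;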

diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index 8a11a2f..b79fcdf 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -456,34 +456,6 @@ a653sched_free_vdata(const struct scheduler *ops, void *priv)
 }
 
 /**
- * This function allocates scheduler-specific data for a physical CPU
- *
- * We do not actually make use of any per-CPU data but the hypervisor expects
- * a non-NULL return value
- *
- * @param ops       Pointer to this instance of the scheduler structure
- *
- * @return          Pointer to the allocated data
- */
-static void *
-a653sched_alloc_pdata(const struct scheduler *ops, int cpu)
-{
-    /* return a non-NULL value to keep schedule.c happy */
-    return SCHED_PRIV(ops);
-}
-
-/**
- * This function frees scheduler-specific data for a physical CPU
- *
- * @param ops       Pointer to this instance of the scheduler structure
- */
-static void
-a653sched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
-{
-    /* nop */
-}
-
-/**
  * This function allocates scheduler-specific data for a domain
  *
  * We do not actually make use of any per-domain data but the hypervisor
@@ -737,9 +709,6 @@ static const struct scheduler sched_arinc653_def = {
     .free_vdata     = a653sched_free_vdata,
     .alloc_vdata    = a653sched_alloc_vdata,
 
-    .free_pdata     = a653sched_free_pdata,
-    .alloc_pdata    = a653sched_alloc_pdata,
-
     .free_domdata   = a653sched_free_domdata,
     .alloc_domdata  = a653sched_alloc_domdata,
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4c4927f..63a4a63 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -23,6 +23,7 @@
 #include <xen/errno.h>
 #include <xen/keyhandler.h>
 #include <xen/trace.h>
+#include <xen/err.h>
 
 
 /*
@@ -532,12 +533,12 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
     /* Allocate per-PCPU info */
     spc = xzalloc(struct csched_pcpu);
     if ( spc == NULL )
-        return NULL;
+        return ERR_PTR(-ENOMEM);
 
     if ( !alloc_cpumask_var(&spc->balance_mask) )
     {
         xfree(spc);
-        return NULL;
+        return ERR_PTR(-ENOMEM);
     }
 
     spin_lock_irqsave(&prv->lock, flags);
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b8c8e40..e97d8be 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -2047,7 +2047,7 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
         printk("%s: cpu %d not online yet, deferring initializatgion\n",
                __func__, cpu);
 
-    return (void *)1;
+    return NULL;
 }
 
 static void
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 321b0a5..aece318 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -29,6 +29,7 @@
 #include <xen/cpu.h>
 #include <xen/keyhandler.h>
 #include <xen/trace.h>
+#include <xen/err.h>
 #include <xen/guest_access.h>
 
 /*
@@ -681,7 +682,7 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
     spin_unlock_irqrestore(old_lock, flags);
 
     if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
-        return NULL;
+        return ERR_PTR(-ENOMEM);
 
     if ( prv->repl_timer == NULL )
     {
@@ -689,13 +690,12 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
         prv->repl_timer = xzalloc(struct timer);
 
         if ( prv->repl_timer == NULL )
-            return NULL;
+            return ERR_PTR(-ENOMEM);
 
         init_timer(prv->repl_timer, repl_timer_handler, (void *)ops, cpu);
     }
 
-    /* 1 indicates alloc. succeed in schedule.c */
-    return (void *)1;
+    return NULL;
 }
 
 static void
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b7dee16..1941613 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -37,6 +37,7 @@
 #include <xen/event.h>
 #include <public/sched.h>
 #include <xsm/xsm.h>
+#include <xen/err.h>
 
 /* opt_sched: scheduler - default to configured value */
 static char __initdata opt_sched[10] = CONFIG_SCHED_DEFAULT;
@@ -1462,6 +1463,7 @@ static void poll_timer_fn(void *data)
 static int cpu_schedule_up(unsigned int cpu)
 {
     struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+    void *sched_priv;
 
     per_cpu(scheduler, cpu) = &ops;
     spin_lock_init(&sd->_lock);
@@ -1500,9 +1502,16 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( idle_vcpu[cpu] == NULL )
         return -ENOMEM;
 
-    if ( (ops.alloc_pdata != NULL) &&
-         ((sd->sched_priv = ops.alloc_pdata(&ops, cpu)) == NULL) )
-        return -ENOMEM;
+    /*
+     * We don't want to risk calling xfree() on an sd->sched_priv
+     * (e.g., inside free_pdata, from cpu_schedule_down() called
+     * during CPU_UP_CANCELLED) that contains an IS_ERR value.
+     */
+    sched_priv = SCHED_OP(&ops, alloc_pdata, cpu);
+    if ( IS_ERR(sched_priv) )
+        return PTR_ERR(sched_priv);
+
+    sd->sched_priv = sched_priv;
 
     return 0;
 }
@@ -1512,8 +1521,7 @@ static void cpu_schedule_down(unsigned int cpu)
     struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     struct scheduler *sched = per_cpu(scheduler, cpu);
 
-    if ( sd->sched_priv != NULL )
-        SCHED_OP(sched, free_pdata, sd->sched_priv, cpu);
+    SCHED_OP(sched, free_pdata, sd->sched_priv, cpu);
     SCHED_OP(sched, free_vdata, idle_vcpu[cpu]->sched_priv);
 
     idle_vcpu[cpu]->sched_priv = NULL;
@@ -1608,9 +1616,8 @@ void __init scheduler_init(void)
     idle_domain->max_vcpus = nr_cpu_ids;
     if ( alloc_vcpu(idle_domain, 0, 0) == NULL )
         BUG();
-    if ( ops.alloc_pdata &&
-         !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
-        BUG();
+    this_cpu(schedule_data).sched_priv = SCHED_OP(&ops, alloc_pdata, 0);
+    BUG_ON(IS_ERR(this_cpu(schedule_data).sched_priv));
     SCHED_OP(&ops, init_pdata, this_cpu(schedule_data).sched_priv, 0);
 }
 
@@ -1653,8 +1660,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
-    if ( ppriv == NULL )
-        return -ENOMEM;
+    if ( IS_ERR(ppriv) )
+        return PTR_ERR(ppriv);
     SCHED_OP(new_ops, init_pdata, ppriv, cpu);
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )



* [PATCH v2 02/11] xen: sched: implement .init_pdata in Credit, Credit2 and RTDS
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
  2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
@ 2016-04-06 17:22 ` Dario Faggioli
  2016-04-06 17:23 ` [PATCH v2 03/11] xen: sched: move pCPU initialization in an helper Dario Faggioli
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:22 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Juergen Gross, Meng Xu

In fact, if a scheduler needs per-pCPU information,
that needs to be initialized appropriately. So, we take
the code that performs such initialization out of (what
currently is) .alloc_pdata, and use it for .init_pdata,
leaving only actual allocations in the former, if any
(which is the case in RTDS and Credit1).

On the other hand, in Credit2, since we don't really
need any per-pCPU data allocation, everything that was
being done in .alloc_pdata is now done in .init_pdata.
And the fact that .alloc_pdata can now be left undefined
allows us to just get rid of it.

Still for Credit2, the fact that .init_pdata is called
during CPU_STARTING (rather than CPU_UP_PREPARE) kills
the need for the scheduler to set up a similar callback
itself, simplifying the code.

And thanks to such simplification, it is now also ok to
turn some of the logic meant to double-check that a
cpu was (or was not) initialized into ASSERT()s (rather
than an if() and a BUG_ON).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
---
Changes from v1:
* make the ASSERT() in credit more linear, as suggested
  during review;
* minor adjustments to the changelog, as suggested during
  review.
---
 xen/common/sched_credit.c  |   20 +++++++++---
 xen/common/sched_credit2.c |   72 +++-----------------------------------------
 xen/common/sched_rt.c      |   11 ++++++-
 3 files changed, 28 insertions(+), 75 deletions(-)
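
The split this patch introduces can be summarized with a sketch (hypothetical
"foo" scheduler and types, not the actual code in the diffs below):
.alloc_pdata only allocates, while .init_pdata receives the allocated blob
later (at CPU_STARTING time) and does all the wiring:

    static void *
    foo_alloc_pdata(const struct scheduler *ops, int cpu)
    {
        /* Pure allocation: no locking, no linking into scheduler state. */
        struct foo_pcpu *spc = xzalloc(struct foo_pcpu);

        if ( spc == NULL )
            return ERR_PTR(-ENOMEM);

        return spc;
    }

    static void
    foo_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
    {
        /* All actual initialization happens here, topology now available. */
        struct foo_pcpu *spc = pdata;

        INIT_LIST_HEAD(&spc->runq);
        cpumask_set_cpu(cpu, FOO_PRIV(ops)->idlers);
    }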

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a4a63..f503e73 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -527,8 +527,6 @@ static void *
 csched_alloc_pdata(const struct scheduler *ops, int cpu)
 {
     struct csched_pcpu *spc;
-    struct csched_private *prv = CSCHED_PRIV(ops);
-    unsigned long flags;
 
     /* Allocate per-PCPU info */
     spc = xzalloc(struct csched_pcpu);
@@ -541,6 +539,19 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
         return ERR_PTR(-ENOMEM);
     }
 
+    return spc;
+}
+
+static void
+csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
+{
+    struct csched_private *prv = CSCHED_PRIV(ops);
+    struct csched_pcpu * const spc = pdata;
+    unsigned long flags;
+
+    /* cpu data needs to be allocated, but STILL uninitialized */
+    ASSERT(spc && spc->runq.next == NULL && spc->runq.prev == NULL);
+
     spin_lock_irqsave(&prv->lock, flags);
 
     /* Initialize/update system-wide config */
@@ -561,16 +572,12 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
     INIT_LIST_HEAD(&spc->runq);
     spc->runq_sort_last = prv->runq_sort;
     spc->idle_bias = nr_cpu_ids - 1;
-    if ( per_cpu(schedule_data, cpu).sched_priv == NULL )
-        per_cpu(schedule_data, cpu).sched_priv = spc;
 
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
 
     spin_unlock_irqrestore(&prv->lock, flags);
-
-    return spc;
 }
 
 #ifndef NDEBUG
@@ -2054,6 +2061,7 @@ static const struct scheduler sched_credit_def = {
     .alloc_vdata    = csched_alloc_vdata,
     .free_vdata     = csched_free_vdata,
     .alloc_pdata    = csched_alloc_pdata,
+    .init_pdata     = csched_init_pdata,
     .free_pdata     = csched_free_pdata,
     .alloc_domdata  = csched_alloc_domdata,
     .free_domdata   = csched_free_domdata,
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index e97d8be..8a56953 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1971,7 +1971,8 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
-static void init_pcpu(const struct scheduler *ops, int cpu)
+static void
+csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     unsigned rqi;
     unsigned long flags;
@@ -1981,12 +1982,7 @@ static void init_pcpu(const struct scheduler *ops, int cpu)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-    if ( cpumask_test_cpu(cpu, &prv->initialized) )
-    {
-        printk("%s: Strange, cpu %d already initialized!\n", __func__, cpu);
-        spin_unlock_irqrestore(&prv->lock, flags);
-        return;
-    }
+    ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
 
     /* Figure out which runqueue to put it in */
     rqi = 0;
@@ -2036,20 +2032,6 @@ static void init_pcpu(const struct scheduler *ops, int cpu)
     return;
 }
 
-static void *
-csched2_alloc_pdata(const struct scheduler *ops, int cpu)
-{
-    /* Check to see if the cpu is online yet */
-    /* Note: cpu 0 doesn't get a STARTING callback */
-    if ( cpu == 0 || cpu_to_socket(cpu) != XEN_INVALID_SOCKET_ID )
-        init_pcpu(ops, cpu);
-    else
-        printk("%s: cpu %d not online yet, deferring initializatgion\n",
-               __func__, cpu);
-
-    return NULL;
-}
-
 static void
 csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
@@ -2061,7 +2043,7 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-    BUG_ON(!cpumask_test_cpu(cpu, &prv->initialized));
+    ASSERT(cpumask_test_cpu(cpu, &prv->initialized));
     
     /* Find the old runqueue and remove this cpu from it */
     rqi = prv->runq_map[cpu];
@@ -2099,49 +2081,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 }
 
 static int
-csched2_cpu_starting(int cpu)
-{
-    struct scheduler *ops;
-
-    /* Hope this is safe from cpupools switching things around. :-) */
-    ops = per_cpu(scheduler, cpu);
-
-    if ( ops->alloc_pdata == csched2_alloc_pdata )
-        init_pcpu(ops, cpu);
-
-    return NOTIFY_DONE;
-}
-
-static int cpu_credit2_callback(
-    struct notifier_block *nfb, unsigned long action, void *hcpu)
-{
-    unsigned int cpu = (unsigned long)hcpu;
-    int rc = 0;
-
-    switch ( action )
-    {
-    case CPU_STARTING:
-        csched2_cpu_starting(cpu);
-        break;
-    default:
-        break;
-    }
-
-    return !rc ? NOTIFY_DONE : notifier_from_errno(rc);
-}
-
-static struct notifier_block cpu_credit2_nfb = {
-    .notifier_call = cpu_credit2_callback
-};
-
-static int
-csched2_global_init(void)
-{
-    register_cpu_notifier(&cpu_credit2_nfb);
-    return 0;
-}
-
-static int
 csched2_init(struct scheduler *ops)
 {
     int i;
@@ -2219,12 +2158,11 @@ static const struct scheduler sched_credit2_def = {
 
     .dump_cpu_state = csched2_dump_pcpu,
     .dump_settings  = csched2_dump,
-    .global_init    = csched2_global_init,
     .init           = csched2_init,
     .deinit         = csched2_deinit,
     .alloc_vdata    = csched2_alloc_vdata,
     .free_vdata     = csched2_free_vdata,
-    .alloc_pdata    = csched2_alloc_pdata,
+    .init_pdata     = csched2_init_pdata,
     .free_pdata     = csched2_free_pdata,
     .alloc_domdata  = csched2_alloc_domdata,
     .free_domdata   = csched2_free_domdata,
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index aece318..b96bd93 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -666,8 +666,8 @@ rt_deinit(struct scheduler *ops)
  * Point per_cpu spinlock to the global system lock;
  * All cpu have same global system lock
  */
-static void *
-rt_alloc_pdata(const struct scheduler *ops, int cpu)
+static void
+rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     struct rt_private *prv = rt_priv(ops);
     spinlock_t *old_lock;
@@ -680,6 +680,12 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
 
     /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
     spin_unlock_irqrestore(old_lock, flags);
+}
+
+static void *
+rt_alloc_pdata(const struct scheduler *ops, int cpu)
+{
+    struct rt_private *prv = rt_priv(ops);
 
     if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
         return ERR_PTR(-ENOMEM);
@@ -1461,6 +1467,7 @@ static const struct scheduler sched_rtds_def = {
     .deinit         = rt_deinit,
     .alloc_pdata    = rt_alloc_pdata,
     .free_pdata     = rt_free_pdata,
+    .init_pdata     = rt_init_pdata,
     .alloc_domdata  = rt_alloc_domdata,
     .free_domdata   = rt_free_domdata,
     .init_domain    = rt_dom_init,



* [PATCH v2 03/11] xen: sched: move pCPU initialization in an helper
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
  2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
  2016-04-06 17:22 ` [PATCH v2 02/11] xen: sched: implement .init_pdata in Credit, Credit2 and RTDS Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-06 17:23 ` [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs Dario Faggioli
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

That will turn out to be useful in following patches, where such
code will need to be called more than just once. Create a
helper now, and move the code there, to avoid mixing code
motion and functional changes later.

In Credit2, some style cleanup is also done.

No functional change intended.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c  |   20 ++++++++++++--------
 xen/common/sched_credit2.c |   26 ++++++++++++++++----------
 2 files changed, 28 insertions(+), 18 deletions(-)
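
The pattern applied in both schedulers is the same (a sketch mirroring the
diffs below, with a hypothetical "foo" scheduler): the hook becomes a thin
locking wrapper, while the new helper assumes the lock is already held, so
later patches can call it from paths that take the lock themselves:

    static void
    init_pdata(struct foo_private *prv, void *pdata, int cpu)
    {
        /* Caller must hold prv->lock; all the real initialization is here. */
        ASSERT(spin_is_locked(&prv->lock));
        /* ... set up the per-pCPU data ... */
    }

    static void
    foo_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
    {
        struct foo_private *prv = FOO_PRIV(ops);
        unsigned long flags;

        spin_lock_irqsave(&prv->lock, flags);
        init_pdata(prv, pdata, cpu);
        spin_unlock_irqrestore(&prv->lock, flags);
    }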

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index f503e73..96a245d 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -543,17 +543,12 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
 }
 
 static void
-csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
+init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
 {
-    struct csched_private *prv = CSCHED_PRIV(ops);
-    struct csched_pcpu * const spc = pdata;
-    unsigned long flags;
-
-    /* cpu data needs to be allocated, but STILL uninitialized */
+    ASSERT(spin_is_locked(&prv->lock));
+    /* cpu data needs to be allocated, but STILL uninitialized. */
     ASSERT(spc && spc->runq.next == NULL && spc->runq.prev == NULL);
 
-    spin_lock_irqsave(&prv->lock, flags);
-
     /* Initialize/update system-wide config */
     prv->credit += prv->credits_per_tslice;
     prv->ncpus++;
@@ -576,7 +571,16 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+}
 
+static void
+csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
+{
+    unsigned long flags;
+    struct csched_private *prv = CSCHED_PRIV(ops);
+
+    spin_lock_irqsave(&prv->lock, flags);
+    init_pdata(prv, pdata, cpu);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8a56953..8989eea 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1972,16 +1972,13 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
 }
 
 static void
-csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
+init_pdata(struct csched2_private *prv, unsigned int cpu)
 {
     unsigned rqi;
-    unsigned long flags;
-    struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_runqueue_data *rqd;
     spinlock_t *old_lock;
 
-    spin_lock_irqsave(&prv->lock, flags);
-
+    ASSERT(spin_is_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
 
     /* Figure out which runqueue to put it in */
@@ -2001,7 +1998,7 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
         BUG();
     }
 
-    rqd=prv->rqd + rqi;
+    rqd = prv->rqd + rqi;
 
     printk("Adding cpu %d to runqueue %d\n", cpu, rqi);
     if ( ! cpumask_test_cpu(rqi, &prv->active_queues) )
@@ -2011,13 +2008,13 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
     }
     
     /* IRQs already disabled */
-    old_lock=pcpu_schedule_lock(cpu);
+    old_lock = pcpu_schedule_lock(cpu);
 
     /* Move spinlock to new runq lock.  */
     per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
 
     /* Set the runqueue map */
-    prv->runq_map[cpu]=rqi;
+    prv->runq_map[cpu] = rqi;
     
     cpumask_set_cpu(cpu, &rqd->idle);
     cpumask_set_cpu(cpu, &rqd->active);
@@ -2027,12 +2024,21 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 
     cpumask_set_cpu(cpu, &prv->initialized);
 
-    spin_unlock_irqrestore(&prv->lock, flags);
-
     return;
 }
 
 static void
+csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
+{
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
+    unsigned long flags;
+
+    spin_lock_irqsave(&prv->lock, flags);
+    init_pdata(prv, cpu);
+    spin_unlock_irqrestore(&prv->lock, flags);
+}
+
+static void
 csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     unsigned long flags;



* [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 03/11] xen: sched: move pCPU initialization in an helper Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-07 14:54   ` George Dunlap
  2016-04-06 17:23 ` [PATCH v2 05/11] xen: sched: improve credit2 bootparams' scope, placement and signedness Dario Faggioli
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Tianyang Chen, Meng Xu

In short, the point is making sure that the actual switch
of scheduler and the remapping of the scheduler's runqueue
lock occur in the same critical section, protected by the
"old" scheduler's lock (and not, e.g., in the free_pdata
hook, as it is now for Credit2 and RTDS).

Not doing so is (at least) racy. In fact, for instance,
if we switch cpu X from Credit2 to Credit, we do:

 schedule_cpu_switch(x, csched2 --> csched):
   //scheduler[x] is csched2
   //schedule_lock[x] is csched2_lock
   csched_alloc_pdata(x)
   csched_init_pdata(x)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   [1]
   csched2_free_pdata(x)
     pcpu_schedule_lock(x) --> takes csched2_lock
     schedule_lock[x] = csched_lock
     spin_unlock(csched2_lock)

While, if we switch cpu X from Credit to Credit2, we do:

 schedule_cpu_switch(X, csched --> csched2):
   //scheduler[x] is csched
   //schedule_lock[x] is csched_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes csched_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(csched_lock)
   [2]
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   csched_free_pdata(x)

And if we switch cpu X from RTDS to Credit2, we do:

 schedule_cpu_switch(X, RTDS --> csched2):
   //scheduler[x] is rtds
   //schedule_lock[x] is rtds_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes rtds_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(rtds_lock)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[x] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   rtds_free_pdata(x)
     spin_lock(rtds_lock)
     ASSERT(schedule_lock[x] == rtds_lock) [3]
     schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
     spin_unlock(rtds_lock)

So, the first problem is that, if anything related to
scheduling, and involving the CPU, happens at [1] or [2], we:
 - take csched2_lock,
 - operate on Credit1 functions and data structures,
which is no good!

The second problem is that the ASSERT at [3] triggers, and
the third is that, at [4], we screw up the lock remapping we've
done for ourselves in csched2_init_pdata()!

The first problem arises because there is a window during
which the lock is already the new one, but the scheduler is
still the old one. The other two arise because we let schedulers
mess with the lock (re)mapping done by others.

This patch, therefore, introduces a new hook in the scheduler
interface, called switch_sched, meant to be used when
switching the scheduler on a CPU, and implements it for the
schedulers that need it (i.e., all except ARINC653),
so that things are done in the proper order and under the
protection of the best suited (set of) lock(s). It is
necessary to add the hook (rather than keeping things in
generic code), because different schedulers may have
different locking schemes.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
Cc: Tianyang Chen <tiche@seas.upenn.edu>
---
Changes from v1:

new patch, basically, coming from squashing what were
4 patches in v1. In any case, with respect to those 4
patches:
 - runqueue lock is back being taken in schedule_cpu_switch(),
   as suggested during review;
 - add barriers to make sure all initialization is done
   when the new lock is assigned, as suggested during review;
 - add comments and ASSERT-s about how and why the adopted
   locking scheme is safe, as suggested during review.
---
 xen/common/sched_credit.c  |   44 ++++++++++++++++++++++++
 xen/common/sched_credit2.c |   81 +++++++++++++++++++++++++++++++++-----------
 xen/common/sched_rt.c      |   45 +++++++++++++++++-------
 xen/common/schedule.c      |   41 +++++++++++++++++-----
 xen/include/xen/sched-if.h |    3 ++
 5 files changed, 172 insertions(+), 42 deletions(-)
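
Putting the commit message and the diffs below together, the resulting flow
is roughly the following (a simplified sketch in the same trace style as
above, not literal code; error handling omitted):

 schedule_cpu_switch(cpu, new pool):
   ppriv = SCHED_OP(new_ops, alloc_pdata, cpu)       // allocated, NOT initialized
   vpriv = SCHED_OP(new_ops, alloc_vdata, idle, ...)

   old_lock = pcpu_schedule_lock(cpu)                // the OLD scheduler's lock
   SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv)
     // e.g., inside csched2_switch_sched():
     spin_lock(&prv->lock)                           // new scheduler's own lock
     idle_vcpu[cpu]->sched_priv = vpriv
     rqi = init_pdata(prv, cpu)
     per_cpu(scheduler, cpu) = new_ops
     smp_mb()                                        // publish the above first...
     schedule_lock[cpu] = &prv->rqd[rqi].lock        // ...then reroute the lock
     spin_unlock(&prv->lock)
   spin_unlock_irq(old_lock)   // _not_ pcpu_schedule_unlock(): the lock changed

   SCHED_OP(old_ops, free_vdata, vpriv_old)
   SCHED_OP(old_ops, free_pdata, ppriv_old, cpu)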

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 96a245d..540d515 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -578,12 +578,55 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     unsigned long flags;
     struct csched_private *prv = CSCHED_PRIV(ops);
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+
+    /*
+     * This is called either during boot, resume or hotplug, in
+     * case Credit1 is the scheduler chosen at boot. In such cases, the
+     * scheduler lock for cpu is already pointing to the default per-cpu
+     * spinlock, as Credit1 needs it, so there is no remapping to be done.
+     */
+    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
 
     spin_lock_irqsave(&prv->lock, flags);
     init_pdata(prv, pdata, cpu);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
+/* Change the scheduler of cpu to us (Credit). */
+static void
+csched_switch_sched(struct scheduler *ops, unsigned int cpu,
+                    void *pdata, void *vdata)
+{
+    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
+    struct csched_private *prv = CSCHED_PRIV(ops);
+    struct csched_vcpu *svc = vdata;
+
+    ASSERT(svc && is_idle_vcpu(svc->vcpu));
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+
+    /*
+     * We are holding the runqueue lock already (it's been taken in
+     * schedule_cpu_switch()). It actually may or may not be the 'right'
+     * one for this cpu, but that is ok for preventing races.
+     */
+    spin_lock(&prv->lock);
+    init_pdata(prv, pdata, cpu);
+    spin_unlock(&prv->lock);
+
+    per_cpu(scheduler, cpu) = ops;
+    per_cpu(schedule_data, cpu).sched_priv = pdata;
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, finds all the initializations we've done above in place.
+     */
+    smp_mb();
+    sd->schedule_lock = &sd->_lock;
+}
+
 #ifndef NDEBUG
 static inline void
 __csched_vcpu_check(struct vcpu *vc)
@@ -2067,6 +2110,7 @@ static const struct scheduler sched_credit_def = {
     .alloc_pdata    = csched_alloc_pdata,
     .init_pdata     = csched_init_pdata,
     .free_pdata     = csched_free_pdata,
+    .switch_sched   = csched_switch_sched,
     .alloc_domdata  = csched_alloc_domdata,
     .free_domdata   = csched_free_domdata,
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8989eea..60c6f5b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1971,12 +1971,12 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
-static void
+/* Returns the ID of the runqueue the cpu is assigned to. */
+static unsigned
 init_pdata(struct csched2_private *prv, unsigned int cpu)
 {
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
-    spinlock_t *old_lock;
 
     ASSERT(spin_is_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
@@ -2007,44 +2007,89 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
         activate_runqueue(prv, rqi);
     }
     
-    /* IRQs already disabled */
-    old_lock = pcpu_schedule_lock(cpu);
-
-    /* Move spinlock to new runq lock.  */
-    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
-
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
     
     cpumask_set_cpu(cpu, &rqd->idle);
     cpumask_set_cpu(cpu, &rqd->active);
-
-    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
-    spin_unlock(old_lock);
-
     cpumask_set_cpu(cpu, &prv->initialized);
 
-    return;
+    return rqi;
 }
 
 static void
 csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
+    spinlock_t *old_lock;
     unsigned long flags;
+    unsigned rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
-    init_pdata(prv, cpu);
+    old_lock = pcpu_schedule_lock(cpu);
+
+    rqi = init_pdata(prv, cpu);
+    /* Move the scheduler lock to the new runq lock. */
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock(old_lock);
     spin_unlock_irqrestore(&prv->lock, flags);
 }
 
+/* Change the scheduler of cpu to us (Credit2). */
+static void
+csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                     void *pdata, void *vdata)
+{
+    struct csched2_private *prv = CSCHED2_PRIV(new_ops);
+    struct csched2_vcpu *svc = vdata;
+    unsigned rqi;
+
+    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
+
+    /*
+     * We own one runqueue lock already (from schedule_cpu_switch()). This
+     * looks like it violates this scheduler's locking rules, but it does
+     * not, as what we own is the lock of another scheduler, that hence has
+     * no particular (ordering) relationship with our private global lock.
+     * And owning exactly that one (the lock of the old scheduler of this
+     * cpu) is what is necessary to prevent races.
+     */
+    spin_lock_irq(&prv->lock);
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+
+    rqi = init_pdata(prv, cpu);
+
+    /*
+     * Now that we know what runqueue we'll go in, double check what's said
+     * above: the lock we already hold is not the one of this runqueue of
+     * this scheduler, and so it's safe to have taken it /before/ our
+     * private global lock.
+     */
+    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
+
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, find all the initializations we've done above in place.
+     */
+    smp_mb();
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
+
+    spin_unlock_irq(&prv->lock);
+}
+
 static void
 csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     unsigned long flags;
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_runqueue_data *rqd;
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
     int rqi;
 
     spin_lock_irqsave(&prv->lock, flags);
@@ -2072,11 +2117,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         deactivate_runqueue(prv, rqi);
     }
 
-    /* Move spinlock to the original lock.  */
-    ASSERT(sd->schedule_lock == &rqd->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
     spin_unlock(&rqd->lock);
 
     cpumask_clear_cpu(cpu, &prv->initialized);
@@ -2170,6 +2210,7 @@ static const struct scheduler sched_credit2_def = {
     .free_vdata     = csched2_free_vdata,
     .init_pdata     = csched2_init_pdata,
     .free_pdata     = csched2_free_pdata,
+    .switch_sched   = csched2_switch_sched,
     .alloc_domdata  = csched2_alloc_domdata,
     .free_domdata   = csched2_free_domdata,
 };
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index b96bd93..3bb8c71 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -682,6 +682,37 @@ rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
     spin_unlock_irqrestore(old_lock, flags);
 }
 
+/* Change the scheduler of cpu to us (RTDS). */
+static void
+rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
+                void *pdata, void *vdata)
+{
+    struct rt_private *prv = rt_priv(new_ops);
+    struct rt_vcpu *svc = vdata;
+
+    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
+
+    /*
+     * We are holding the runqueue lock already (it's been taken in
+     * schedule_cpu_switch()). It's actually the runqueue lock of
+     * another scheduler, but that is how things need to be, for
+     * preventing races.
+     */
+    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->lock);
+
+    idle_vcpu[cpu]->sched_priv = vdata;
+    per_cpu(scheduler, cpu) = new_ops;
+    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
+
+    /*
+     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, find all the initializations we've done above in place.
+     */
+    smp_mb();
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+}
+
 static void *
 rt_alloc_pdata(const struct scheduler *ops, int cpu)
 {
@@ -707,19 +738,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
 static void
 rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
-    struct rt_private *prv = rt_priv(ops);
-    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
-    unsigned long flags;
-
-    spin_lock_irqsave(&prv->lock, flags);
-
-    /* Move spinlock back to the default lock */
-    ASSERT(sd->schedule_lock == &prv->lock);
-    ASSERT(!spin_is_locked(&sd->_lock));
-    sd->schedule_lock = &sd->_lock;
-
-    spin_unlock_irqrestore(&prv->lock, flags);
-
     free_cpumask_var(_cpumask_scratch[cpu]);
 }
 
@@ -1468,6 +1486,7 @@ static const struct scheduler sched_rtds_def = {
     .alloc_pdata    = rt_alloc_pdata,
     .free_pdata     = rt_free_pdata,
     .init_pdata     = rt_init_pdata,
+    .switch_sched   = rt_switch_sched,
     .alloc_domdata  = rt_alloc_domdata,
     .free_domdata   = rt_free_domdata,
     .init_domain    = rt_dom_init,
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 1941613..5559aa1 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1635,11 +1635,11 @@ void __init scheduler_init(void)
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    spinlock_t *lock;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
     struct cpupool *old_pool = per_cpu(cpupool, cpu);
+    spinlock_t * old_lock;
 
     /*
      * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
@@ -1658,11 +1658,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     if ( old_ops == new_ops )
         goto out;
 
+    /*
+     * To setup the cpu for the new scheduler we need:
+     *  - a valid instance of per-CPU scheduler specific data, as it is
+     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
+     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
+     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
+     *    in proper ordering and with locking.
+     *  - a valid instance of per-vCPU scheduler specific data, for the idle
+     *    vCPU of cpu. That is what the target scheduler will use for the
+     *    sched_priv field of the per-vCPU info of the idle domain.
+     */
     idle = idle_vcpu[cpu];
     ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
     if ( IS_ERR(ppriv) )
         return PTR_ERR(ppriv);
-    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
@@ -1670,17 +1680,30 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
-
     SCHED_OP(old_ops, tick_suspend, cpu);
+
+    /*
+     * The actual switch, including (if necessary) the rerouting of the
+     * scheduler lock to whatever new_ops prefers,  needs to happen in one
+     * critical section, protected by old_ops' lock, or races are possible.
+     * It is, in fact, the lock of another scheduler that we are taking (the
+     * scheduler of the cpupool that cpu still belongs to). But that is ok
+     * as anyone trying to schedule on this cpu will spin until we
+     * release that lock (bottom of this function). When it gets the lock
+     * --thanks to the loop inside *_schedule_lock() functions-- it will notice
+     * that the lock itself changed, and retry acquiring the new one (which
+     * will be the correct, remapped one, at that point).
+     */
+    old_lock = pcpu_schedule_lock(cpu);
+
     vpriv_old = idle->sched_priv;
-    idle->sched_priv = vpriv;
-    per_cpu(scheduler, cpu) = new_ops;
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
-    per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
+    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock_irq(old_lock);
+
+    SCHED_OP(new_ops, tick_resume, cpu);
 
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 70c08c6..9cebe41 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -137,6 +137,9 @@ struct scheduler {
     void         (*free_domdata)   (const struct scheduler *, void *);
     void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
 
+    void         (*switch_sched)   (struct scheduler *, unsigned int,
+                                    void *, void *);
+
     int          (*init_domain)    (const struct scheduler *, struct domain *);
     void         (*destroy_domain) (const struct scheduler *, struct domain *);
 



* [PATCH v2 05/11] xen: sched: improve credit2 bootparams' scope, placement and signedness
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-06 17:23 ` [PATCH v2 06/11] xen: sched: on Credit2, don't reprogram the timer if idle Dario Faggioli
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Jan Beulich, Uma Sharma

From: Uma Sharma <uma.sharma523@gmail.com>

and, while we are adjusting the signedness of opt_load_window_shift,
also make prv->load_window_shift unsigned, as appropriate.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Jan Beulich <JBeulich@suse.com>
---
 xen/common/sched_credit2.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 60c6f5b..7286e50 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -162,7 +162,7 @@
 #define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request)
 
 
-int opt_migrate_resist=500;
+static unsigned int __read_mostly opt_migrate_resist = 500;
 integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 
 /*
@@ -185,12 +185,12 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
  *   to a load of 1.
  */
 #define LOADAVG_GRANULARITY_SHIFT (10)
-int opt_load_window_shift=18;
+static unsigned int __read_mostly opt_load_window_shift = 18;
 #define  LOADAVG_WINDOW_SHIFT_MIN 4
 integer_param("credit2_load_window_shift", opt_load_window_shift);
-int opt_underload_balance_tolerance=0;
+static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
-int opt_overload_balance_tolerance=-3;
+static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
@@ -227,7 +227,7 @@ struct csched2_private {
     cpumask_t active_queues; /* Queues which may have active cpus */
     struct csched2_runqueue_data rqd[NR_CPUS];
 
-    int load_window_shift;
+    unsigned int load_window_shift;
 };
 
 /*



* [PATCH v2 06/11] xen: sched: on Credit2, don't reprogram the timer if idle
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 05/11] xen: sched: improve credit2 bootparams' scope, placement and signedness Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-06 17:23 ` [PATCH v2 07/11] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Tianyang Chen, George Dunlap

as other schedulers already do: if the idle vcpu
is picked and scheduled, there is no need to reprogram the
scheduler timer to fire and invoke csched2_schedule()
again in the future.

Tickling or external events will serve as pokes, when
necessary, but until then, we should just stay idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reported-by: Tianyang Chen <tiche@seas.upenn.edu>
Suggested-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Tianyang Chen <tiche@seas.upenn.edu>
---
Changes from v1:
 * use George's work address (sorry again!).
---
 xen/common/sched_credit2.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
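
For context, the reason returning a negative value works (a paraphrase of the
generic scheduler code, not part of this patch): the value ends up in the
time field of the task_slice returned by csched2_schedule(), and schedule.c
only (re)arms the per-CPU timer for non-negative slices, treating a negative
one as "no limit". Roughly:

    /* Paraphrased sketch of the consumer in schedule.c (not this patch). */
    next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);
    if ( next_slice.time >= 0 )      /* negative means: no timer, no limit */
        set_timer(&sd->s_timer, now + next_slice.time);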

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 7286e50..b207d84 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1543,8 +1543,12 @@ csched2_runtime(const struct scheduler *ops, int cpu, struct csched2_vcpu *snext
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     struct list_head *runq = &rqd->runq;
 
+    /*
+     * If we're idle, just stay so. Others (or external events)
+     * will poke us when necessary.
+     */
     if ( is_idle_vcpu(snext->vcpu) )
-        return CSCHED2_MAX_TIMER;
+        return -1;
 
     /* General algorithm:
      * 1) Run until snext's credit will be 0



* [PATCH v2 07/11] xen: sched: fix per-socket runqueue creation in credit2
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 06/11] xen: sched: on Credit2, don't reprogram the timer if idle Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-06 17:23 ` [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Justin Weaver, George Dunlap

The credit2 scheduler tries to set up runqueues in such
a way that there is one of them per socket. However,
that does not work. The issue is described in bug #36,
"credit2 only uses one runqueue instead of one runq per
socket" (http://bugs.xenproject.org/xen/bug/36), and a
solution has been attempted by an old patch series:

 http://lists.xen.org/archives/html/xen-devel/2014-08/msg02168.html

Here, we take advantage of the fact that initialization now
happens (for all schedulers) during CPU_STARTING, so we
have all the topology information available when necessary.

This is true for all the pCPUs _except_ the boot CPU. That
is not an issue, though. In fact, no runqueue exists yet
when the boot CPU is initialized, so we can just create
one and put the boot CPU in there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Justin Weaver <jtweaver@hawaii.edu>
---
Changes from v1:
 * fixed a typo in a comment.
---
 xen/common/sched_credit2.c |   59 ++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b207d84..a61a45a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -53,7 +53,6 @@
  *  http://wiki.xen.org/wiki/Credit2_Scheduler_Development
  * TODO:
  * + Multiple sockets
- *  - Detect cpu layout and make runqueue map, one per L2 (make_runq_map())
  *  - Simple load balancer / runqueue assignment
  *  - Runqueue load measurement
  *  - Load-based load balancer
@@ -1975,6 +1974,48 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static unsigned int
+cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
+{
+    struct csched2_runqueue_data *rqd;
+    unsigned int rqi;
+
+    for ( rqi = 0; rqi < nr_cpu_ids; rqi++ )
+    {
+        unsigned int peer_cpu;
+
+        /*
+         * As soon as we come across an uninitialized runqueue, use it.
+         * In fact, either:
+         *  - we are initializing the first cpu, and we assign it to
+         *    runqueue 0. This is handy, especially if we are dealing
+         *    with the boot cpu (if credit2 is the default scheduler),
+         *    as we would not be able to use cpu_to_socket() and similar
+         *    helpers anyway (the result of which is not reliable yet);
+         *  - we have gone through all the active runqueues, and have not
+         *    found anyone whose cpus' topology matches the one we are
+         *    dealing with, so activating a new runqueue is what we want.
+         */
+        if ( prv->rqd[rqi].id == -1 )
+            break;
+
+        rqd = prv->rqd + rqi;
+        BUG_ON(cpumask_empty(&rqd->active));
+
+        peer_cpu = cpumask_first(&rqd->active);
+        BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
+               cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
+
+        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+            break;
+    }
+
+    /* We really expect to be able to assign each cpu to a runqueue. */
+    BUG_ON(rqi >= nr_cpu_ids);
+
+    return rqi;
+}
+
 /* Returns the ID of the runqueue the cpu is assigned to. */
 static unsigned
 init_pdata(struct csched2_private *prv, unsigned int cpu)
@@ -1986,21 +2027,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
 
     /* Figure out which runqueue to put it in */
-    rqi = 0;
-
-    /* Figure out which runqueue to put it in */
-    /* NB: cpu 0 doesn't get a STARTING callback, so we hard-code it to runqueue 0. */
-    if ( cpu == 0 )
-        rqi = 0;
-    else
-        rqi = cpu_to_socket(cpu);
-
-    if ( rqi == XEN_INVALID_SOCKET_ID )
-    {
-        printk("%s: cpu_to_socket(%d) returned %d!\n",
-               __func__, cpu, rqi);
-        BUG();
-    }
+    rqi = cpu_to_runqueue(prv, cpu);
 
     rqd = prv->rqd + rqi;
 



* [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (6 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 07/11] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-07  5:04   ` Juergen Gross
  2016-04-06 17:23 ` [PATCH v2 09/11] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Juergen Gross, Uma Sharma

In fact, credit2 uses CPU topology to decide how to arrange
its internal runqueues. Before this change, only 'one runqueue
per socket' was allowed. However, experiments have shown that,
for instance, having one runqueue per physical core improves
performance, especially when hyperthreading is available.

In general, it makes sense to allow users to pick one runqueue
arrangement at boot time, so that:
 - more experiments can be easily performed to even better
   assess and improve performance;
 - one can select the best configuration for his specific
   use case and/or hardware.

This patch enables the above.

Note that, for correctly arranging runqueues to be per-core,
just checking cpu_to_core() on the host CPUs is not enough.
In fact, cores (and hyperthreads) on different sockets, can
have the same core (and thread) IDs! We, therefore, need to
check whether the full topology of two CPUs matches, for
them to be put in the same runqueue.
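
Just to illustrate the point (this is only a sketch, mirroring the
same_socket()/same_core() helpers that the patch below introduces):

    /*
     * Comparing core IDs alone is not enough: core 3 of socket 0 and
     * core 3 of socket 1 would (wrongly) look the same. The socket
     * needs to be compared too:
     */
    static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
    {
        return cpu_to_socket(cpua) == cpu_to_socket(cpub) &&
               cpu_to_core(cpua) == cpu_to_core(cpub);
    }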

Note also that the default (although not functional) for
credit2 has, up to now, been per-socket runqueue. This patch
leaves things that way, to avoid mixing policy and technical
changes.

Finally, it would be a nice feature to be able to select
a particular runqueue arrangement, even when creating a
Credit2 cpupool. This is left as future work.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Uma Sharma <uma.sharma523@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
---
Changes from v1:
 * fix bug in parameter parsing, and start using strcmp()
   for that, as requested during review.
---
 docs/misc/xen-command-line.markdown |   19 +++++++++
 xen/common/sched_credit2.c          |   76 +++++++++++++++++++++++++++++++++--
 2 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index ca77e3b..0047f94 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+### credit2\_runqueue
+> `= core | socket | node | all`
+
+> Default: `socket`
+
+Specify how host CPUs are arranged in runqueues. Runqueues are kept
+balanced with respect to the load generated by the vCPUs running on
+them. Smaller runqueues (as with `core`) mean more accurate load
+balancing (for instance, it deals better with hyperthreading), but
+also more overhead.
+
+Available alternatives, with their meaning, are:
+* `core`: one runqueue per physical core of the host;
+* `socket`: one runqueue per physical socket (which often,
+            but not always, matches a NUMA node) of the host;
+* `node`: one runqueue per NUMA node of the host;
+* `all`: just one runqueue shared by all the logical pCPUs of
+         the host.
+
 ### dbgp
 > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a61a45a..20f8d35 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -81,10 +81,6 @@
  * Credits are "reset" when the next vcpu in the runqueue is less than
  * or equal to zero.  At that point, everyone's credits are "clipped"
  * to a small value, and a fixed credit is added to everyone.
- *
- * The plan is for all cores that share an L2 will share the same
- * runqueue.  At the moment, there is one global runqueue for all
- * cores.
  */
 
 /*
@@ -193,6 +189,55 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Runqueue organization.
+ *
+ * Each cpu is to be assigned to a runqueue, and we want that to happen
+ * based on topology. At the moment, it is possible to choose to arrange
+ * runqueues to be:
+ *
+ * - per-core: meaning that there will be one runqueue per physical
+ *             core of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'core';
+ *
+ * - per-node: meaning that there will be one runqueue per NUMA node
+ *             of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'node';
+ *
+ * - per-socket: meaning that there will be one runqueue per physical
+ *               socket (AKA package, which often, but not always, also
+ *               matches a NUMA node) of the host. This will happen if
+ *               the opt_runqueue parameter is set to 'socket';
+ *
+ * - global: meaning that there will be only one runqueue to which all
+ *           the (logical) processors of the host belong. This will
+ *           happen if the opt_runqueue parameter is set to 'all'.
+ *
+ * Depending on the value of opt_runqueue, therefore, cpus that are part
+ * of the same physical core, the same physical socket, or the same NUMA
+ * node will be put together to form runqueues.
+ */
+#define OPT_RUNQUEUE_CORE   1
+#define OPT_RUNQUEUE_SOCKET 2
+#define OPT_RUNQUEUE_NODE   3
+#define OPT_RUNQUEUE_ALL    4
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+
+static void parse_credit2_runqueue(const char *s)
+{
+    if ( !strcmp(s, "core") )
+        opt_runqueue = OPT_RUNQUEUE_CORE;
+    else if ( !strcmp(s, "socket") )
+        opt_runqueue = OPT_RUNQUEUE_SOCKET;
+    else if ( !strcmp(s, "node") )
+        opt_runqueue = OPT_RUNQUEUE_NODE;
+    else if ( !strcmp(s, "all") )
+        opt_runqueue = OPT_RUNQUEUE_ALL;
+    else
+        printk("WARNING, unrecognized value of credit2_runqueue option!\n");
+}
+custom_param("credit2_runqueue", parse_credit2_runqueue);
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -1974,6 +2019,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_node(cpua) == cpu_to_node(cpub);
+}
+
+static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
+}
+
+static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
+{
+    return same_socket(cpua, cpub) &&
+           cpu_to_core(cpua) == cpu_to_core(cpub);
+}
+
 static unsigned int
 cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -2006,7 +2067,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
 
-        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
+             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
             break;
     }
 
@@ -2170,6 +2234,8 @@ csched2_init(struct scheduler *ops)
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
+    printk(" runqueues arrangement: per-%s\n",
+           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");
 
     if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
     {



* [PATCH v2 09/11] xen: sched: per-core runqueues as default in credit2
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (7 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
@ 2016-04-06 17:23 ` Dario Faggioli
  2016-04-06 17:24 ` [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack Dario Faggioli
  2016-04-06 17:24 ` [PATCH v2 11/11] xen: sched: implement vcpu hard affinity in Credit2 Dario Faggioli
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Uma Sharma

Experiments have shown that arranging the scheduling
runqueues on a per-core basis yields better results,
in most cases.

Such evaluation has been done, for the first time,
by Uma Sharma, during her participation in OPW. Some
of the results she got are summarized here:

 http://lists.xen.org/archives/html/xen-devel/2015-03/msg01499.html

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
 docs/misc/xen-command-line.markdown |    2 +-
 xen/common/sched_credit2.c          |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 0047f94..5d801b8 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -472,7 +472,7 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_runqueue
 > `= core | socket | node | all`
 
-> Default: `socket`
+> Default: `core`
 
 Specify how host CPUs are arranged in runqueues. Runqueues are kept
 balanced with respect to the load generated by the vCPUs running on
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 20f8d35..3b45816 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -220,7 +220,7 @@ integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 #define OPT_RUNQUEUE_SOCKET 2
 #define OPT_RUNQUEUE_NODE   3
 #define OPT_RUNQUEUE_ALL    4
-static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_CORE;
 
 static void parse_credit2_runqueue(const char *s)
 {



* [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (8 preceding siblings ...)
  2016-04-06 17:23 ` [PATCH v2 09/11] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
@ 2016-04-06 17:24 ` Dario Faggioli
  2016-04-07 15:12   ` George Dunlap
  2016-04-06 17:24 ` [PATCH v2 11/11] xen: sched: implement vcpu hard affinity in Credit2 Dario Faggioli
  10 siblings, 1 reply; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:24 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper

directly, from schedule.c, so that any scheduler that
needs it can use it.

In fact, Credit1 and RTDS need this already. Credit2 is
also going to need it, for supporting hard affinity
(which, typically, is what requires a lot of cpumask
manipulation inside various functions).

Therefore, let's define the scratch space at a broader
scope, to limit code duplication in handling it.
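
For reference, this is (roughly) how a scheduler is expected to use
it. Only a sketch: example_pick_cpu() is a made up caller, assumed to
run with the runqueue lock of cpu held:

    static unsigned int example_pick_cpu(unsigned int cpu, struct vcpu *v)
    {
        cpumask_t *mask = cpumask_scratch_cpu(cpu);

        /* No cpumask_t on the stack, and no per-scheduler allocation. */
        cpumask_and(mask, v->cpu_hard_affinity,
                    cpupool_domain_cpumask(v->domain));

        return cpumask_first(mask);
    }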

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes from v1:
* scratch space for cpumasks is no longer "global", and is
  defined in schedule.c, as suggested during review.
---
 xen/common/sched_credit.c  |   34 ++++++-------------------
 xen/common/sched_credit2.c |    1 +
 xen/common/sched_rt.c      |   59 +++-----------------------------------------
 xen/common/schedule.c      |    8 ++++++
 xen/include/xen/sched-if.h |    4 +++
 5 files changed, 25 insertions(+), 81 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 540d515..eac3f5e 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -171,20 +171,9 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
-    /* Store this here to avoid having too many cpumask_var_t-s on stack */
-    cpumask_var_t balance_mask;
 };
 
 /*
- * Convenience macro for accessing the per-PCPU cpumask we need for
- * implementing the two steps (soft and hard affinity) balancing logic.
- * It is stored in csched_pcpu so that serialization is not an issue,
- * as there is a csched_pcpu for each PCPU, and we always hold the
- * runqueue lock for the proper PCPU when using this.
- */
-#define csched_balance_mask(c) (CSCHED_PCPU(c)->balance_mask)
-
-/*
  * Virtual CPU
  */
 struct csched_vcpu {
@@ -416,10 +405,10 @@ static inline void __runq_tickle(struct csched_vcpu *new)
 
             /* Are there idlers suitable for new (for this balance step)? */
             csched_balance_cpumask(new->vcpu, balance_step,
-                                   csched_balance_mask(cpu));
-            cpumask_and(csched_balance_mask(cpu),
-                        csched_balance_mask(cpu), &idle_mask);
-            new_idlers_empty = cpumask_empty(csched_balance_mask(cpu));
+                                   cpumask_scratch_cpu(cpu));
+            cpumask_and(cpumask_scratch_cpu(cpu),
+                        cpumask_scratch_cpu(cpu), &idle_mask);
+            new_idlers_empty = cpumask_empty(cpumask_scratch_cpu(cpu));
 
             /*
              * Let's not be too harsh! If there aren't idlers suitable
@@ -445,8 +434,8 @@ static inline void __runq_tickle(struct csched_vcpu *new)
             if ( new_idlers_empty && new->pri > cur->pri )
             {
                 csched_balance_cpumask(cur->vcpu, balance_step,
-                                       csched_balance_mask(cpu));
-                if ( cpumask_intersects(csched_balance_mask(cpu),
+                                       cpumask_scratch_cpu(cpu));
+                if ( cpumask_intersects(cpumask_scratch_cpu(cpu),
                                         &idle_mask) )
                 {
                     SCHED_VCPU_STAT_CRANK(cur, kicked_away);
@@ -519,7 +508,6 @@ csched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_unlock_irqrestore(&prv->lock, flags);
 
-    free_cpumask_var(spc->balance_mask);
     xfree(spc);
 }
 
@@ -533,12 +521,6 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
     if ( spc == NULL )
         return ERR_PTR(-ENOMEM);
 
-    if ( !alloc_cpumask_var(&spc->balance_mask) )
-    {
-        xfree(spc);
-        return ERR_PTR(-ENOMEM);
-    }
-
     return spc;
 }
 
@@ -1592,9 +1574,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
                  && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
                 continue;
 
-            csched_balance_cpumask(vc, balance_step, csched_balance_mask(cpu));
+            csched_balance_cpumask(vc, balance_step, cpumask_scratch_cpu(cpu));
             if ( __csched_vcpu_is_migrateable(vc, cpu,
-                                              csched_balance_mask(cpu)) )
+                                              cpumask_scratch_cpu(cpu)) )
             {
                 /* We got a candidate. Grab it! */
                 TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 3b45816..084963a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -2252,6 +2252,7 @@ csched2_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
     ops->sched_data = prv;
+
     spin_lock_init(&prv->lock);
     INIT_LIST_HEAD(&prv->sdom);
 
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 3bb8c71..673fc92 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -155,24 +155,6 @@
 #define TRC_RTDS_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RTDS, 4)
 #define TRC_RTDS_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RTDS, 5)
 
- /*
-  * Useful to avoid too many cpumask_var_t on the stack.
-  */
-static cpumask_var_t *_cpumask_scratch;
-#define cpumask_scratch _cpumask_scratch[smp_processor_id()]
-
-/*
- * We want to only allocate the _cpumask_scratch array the first time an
- * instance of this scheduler is used, and avoid reallocating and leaking
- * the old one when more instance are activated inside new cpupools. We
- * also want to get rid of it when the last instance is de-inited.
- *
- * So we (sort of) reference count the number of initialized instances. This
- * does not need to happen via atomic_t refcounters, as it only happens either
- * during boot, or under the protection of the cpupool_lock spinlock.
- */
-static unsigned int nr_rt_ops;
-
 static void repl_timer_handler(void *data);
 
 /*
@@ -301,12 +283,11 @@ rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
     /*
      * We can't just use 'cpumask_scratch' because the dumping can
      * happen from a pCPU outside of this scheduler's cpupool, and
-     * hence it's not right to use the pCPU's scratch mask (which
-     * may even not exist!). On the other hand, it is safe to use
-     * svc->vcpu->processor's own scratch space, since we hold the
-     * runqueue lock.
+     * hence it's not right to use its pCPU's scratch mask.
+     * On the other hand, it is safe to use svc->vcpu->processor's
+     * own scratch space, since we hold the runqueue lock.
      */
-    mask = _cpumask_scratch[svc->vcpu->processor];
+    mask = cpumask_scratch_cpu(svc->vcpu->processor);
 
     cpupool_mask = cpupool_domain_cpumask(svc->vcpu->domain);
     cpumask_and(mask, cpupool_mask, svc->vcpu->cpu_hard_affinity);
@@ -609,16 +590,6 @@ rt_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
 
-    ASSERT( _cpumask_scratch == NULL || nr_rt_ops > 0 );
-
-    if ( !_cpumask_scratch )
-    {
-        _cpumask_scratch = xmalloc_array(cpumask_var_t, nr_cpu_ids);
-        if ( !_cpumask_scratch )
-            goto no_mem;
-    }
-    nr_rt_ops++;
-
     spin_lock_init(&prv->lock);
     INIT_LIST_HEAD(&prv->sdom);
     INIT_LIST_HEAD(&prv->runq);
@@ -636,10 +607,6 @@ rt_init(struct scheduler *ops)
     prv->repl_timer = NULL;
 
     return 0;
-
- no_mem:
-    xfree(prv);
-    return -ENOMEM;
 }
 
 static void
@@ -647,14 +614,6 @@ rt_deinit(struct scheduler *ops)
 {
     struct rt_private *prv = rt_priv(ops);
 
-    ASSERT( _cpumask_scratch && nr_rt_ops > 0 );
-
-    if ( (--nr_rt_ops) == 0 )
-    {
-        xfree(_cpumask_scratch);
-        _cpumask_scratch = NULL;
-    }
-
     kill_timer(prv->repl_timer);
     xfree(prv->repl_timer);
 
@@ -718,9 +677,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
 {
     struct rt_private *prv = rt_priv(ops);
 
-    if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
-        return ERR_PTR(-ENOMEM);
-
     if ( prv->repl_timer == NULL )
     {
         /* Allocate the timer on the first cpu of this pool. */
@@ -735,12 +691,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
     return NULL;
 }
 
-static void
-rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
-{
-    free_cpumask_var(_cpumask_scratch[cpu]);
-}
-
 static void *
 rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
@@ -1484,7 +1434,6 @@ static const struct scheduler sched_rtds_def = {
     .init           = rt_init,
     .deinit         = rt_deinit,
     .alloc_pdata    = rt_alloc_pdata,
-    .free_pdata     = rt_free_pdata,
     .init_pdata     = rt_init_pdata,
     .switch_sched   = rt_switch_sched,
     .alloc_domdata  = rt_alloc_domdata,
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 5559aa1..922b035 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -65,6 +65,14 @@ static void poll_timer_fn(void *data);
 DEFINE_PER_CPU(struct schedule_data, schedule_data);
 DEFINE_PER_CPU(struct scheduler *, scheduler);
 
+/*
+ * Scratch space, for avoiding having too many cpumask_var_t on the stack.
+ * Properly serializing access, if necessary, is responsibility of each
+ * scheduler (typically, one can expect this to be protected by the per pCPU
+ * or per runqueue lock).
+ */
+DEFINE_PER_CPU(cpumask_t, cpumask_scratch);
+
 extern const struct scheduler *__start_schedulers_array[], *__end_schedulers_array[];
 #define NUM_SCHEDULERS (__end_schedulers_array - __start_schedulers_array)
 #define schedulers __start_schedulers_array
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 9cebe41..1db7c8d 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -47,6 +47,10 @@ DECLARE_PER_CPU(struct schedule_data, schedule_data);
 DECLARE_PER_CPU(struct scheduler *, scheduler);
 DECLARE_PER_CPU(struct cpupool *, cpupool);
 
+DECLARE_PER_CPU(cpumask_t, cpumask_scratch);
+#define cpumask_scratch        (&this_cpu(cpumask_scratch))
+#define cpumask_scratch_cpu(c) (&per_cpu(cpumask_scratch, c))
+
 #define sched_lock(kind, param, cpu, irq, arg...) \
 static inline spinlock_t *kind##_schedule_lock##irq(param EXTRA_TYPE(arg)) \
 { \



* [PATCH v2 11/11] xen: sched: implement vcpu hard affinity in Credit2
  2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
                   ` (9 preceding siblings ...)
  2016-04-06 17:24 ` [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack Dario Faggioli
@ 2016-04-06 17:24 ` Dario Faggioli
  10 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-06 17:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Justin Weaver, George Dunlap

From: Justin Weaver <jtweaver@hawaii.edu>

as it was still missing.

Note that this patch "only" implements hard affinity,
i.e., the possibility of specifying on what pCPUs a
certain vCPU can run. Soft affinity (which expresses a
preference for vCPUs to run on certain pCPUs) is still
not supported by Credit2, even after this patch.
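
The general pattern, visible in the runq_tickle() and choose_cpu()
hunks below, is that any mask of candidate pcpus is further restricted
to the vcpu's hard affinity before a cpu is picked. Roughly (just a
sketch, reusing the names from the diff):

    /* Idle, not tickled, and allowed by new's hard affinity. */
    cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
    i = cpumask_cycle(cpu, &mask);
    if ( i < nr_cpu_ids )
        /* i is a valid candidate for being tickled */;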

Signed-off-by: Justin Weaver <jtweaver@hawaii.edu>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched_credit2.c |  131 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 084963a..03cd10c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -310,6 +310,36 @@ struct csched2_dom {
     uint16_t nr_vcpus;
 };
 
+/*
+ * When a hard affinity change occurs, we may not be able to check some
+ * (any!) of the other runqueues, when looking for the best new processor
+ * for svc (as trylock-s in choose_cpu() can fail). If that happens, we
+ * pick, in order of decreasing preference:
+ *  - svc's current pcpu;
+ *  - another pcpu from svc's current runq;
+ *  - any cpu.
+ */
+static int get_fallback_cpu(struct csched2_vcpu *svc)
+{
+    int cpu;
+
+    if ( likely(cpumask_test_cpu(svc->vcpu->processor,
+                                 svc->vcpu->cpu_hard_affinity)) )
+        return svc->vcpu->processor;
+
+    cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
+                &svc->rqd->active);
+    cpu = cpumask_first(cpumask_scratch);
+    if ( likely(cpu < nr_cpu_ids) )
+        return cpu;
+
+    cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
+                cpupool_domain_cpumask(svc->vcpu->domain));
+
+    ASSERT(!cpumask_empty(cpumask_scratch));
+
+    return cpumask_first(cpumask_scratch);
+}
 
 /*
  * Time-to-credit, credit-to-time.
@@ -543,8 +573,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
-    /* Get a mask of idle, but not tickled */
+    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
     
     /* If it's not empty, choose one */
     i = cpumask_cycle(cpu, &mask);
@@ -555,9 +586,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
-     * skipping cpus which have been tickled but not scheduled yet */
+     * skipping cpus which have been tickled but not scheduled yet,
+     * that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
 
     for_each_cpu(i, &mask)
     {
@@ -1107,9 +1140,8 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             d2printk("%pv -\n", svc->vcpu);
             clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
-        /* Leave it where it is for now.  When we actually pay attention
-         * to affinity we'll have to figure something out... */
-        return vc->processor;
+
+        return get_fallback_cpu(svc);
     }
 
     /* First check to see if we're here because someone else suggested a place
@@ -1120,45 +1152,56 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         {
             printk("%s: Runqueue migrate aborted because target runqueue disappeared!\n",
                    __func__);
-            /* Fall-through to normal cpu pick */
         }
         else
         {
-            d2printk("%pv +\n", svc->vcpu);
-            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
-            goto out_up;
+            cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                        &svc->migrate_rqd->active);
+            new_cpu = cpumask_any(cpumask_scratch);
+            if ( new_cpu < nr_cpu_ids )
+            {
+                d2printk("%pv +\n", svc->vcpu);
+                goto out_up;
+            }
         }
+        /* Fall-through to normal cpu pick */
     }
 
-    /* FIXME: Pay attention to cpu affinity */                                                                                      
-
     min_avgload = MAX_LOAD;
 
     /* Find the runqueue with the lowest instantaneous load */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
-        s_time_t rqd_avgload;
+        s_time_t rqd_avgload = MAX_LOAD;
 
         rqd = prv->rqd + i;
 
-        /* If checking a different runqueue, grab the lock,
-         * read the avg, and then release the lock.
+        /*
+         * If checking a different runqueue, grab the lock, check hard
+         * affinity, read the avg, and then release the lock.
          *
          * If on our own runqueue, don't grab or release the lock;
          * but subtract our own load from the runqueue load to simulate
-         * impartiality */
+         * impartiality.
+         *
+         * Note that, if svc's hard affinity has changed, this is the
+         * first time when we see such change, so it is indeed possible
+         * that none of the cpus in svc's current runqueue is in our
+         * (new) hard affinity!
+         */
         if ( rqd == svc->rqd )
         {
-            rqd_avgload = rqd->b_avgload - svc->avgload;
+            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                rqd_avgload = rqd->b_avgload - svc->avgload;
         }
         else if ( spin_trylock(&rqd->lock) )
         {
-            rqd_avgload = rqd->b_avgload;
+            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                rqd_avgload = rqd->b_avgload;
+
             spin_unlock(&rqd->lock);
         }
-        else
-            continue;
 
         if ( rqd_avgload < min_avgload )
         {
@@ -1167,12 +1210,14 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         }
     }
 
-    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
+    /* We didn't find anyone (most likely because of spinlock contention). */
     if ( min_rqi == -1 )
-        new_cpu = vc->processor;
+        new_cpu = get_fallback_cpu(svc);
     else
     {
-        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
+        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                    &prv->rqd[min_rqi].active);
+        new_cpu = cpumask_any(cpumask_scratch);
         BUG_ON(new_cpu >= nr_cpu_ids);
     }
 
@@ -1252,7 +1297,12 @@ static void migrate(const struct scheduler *ops,
             on_runq=1;
         }
         __runq_deassign(svc);
-        svc->vcpu->processor = cpumask_any(&trqd->active);
+
+        cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
+                    &trqd->active);
+        svc->vcpu->processor = cpumask_any(cpumask_scratch);
+        BUG_ON(svc->vcpu->processor >= nr_cpu_ids);
+
         __runq_assign(svc, trqd);
         if ( on_runq )
         {
@@ -1266,6 +1316,17 @@ static void migrate(const struct scheduler *ops,
     }
 }
 
+/*
+ * It makes sense to consider migrating svc to rqd if:
+ *  - svc is not already flagged to migrate,
+ *  - svc is allowed to run on at least one of the pcpus of rqd.
+ */
+static bool_t vcpu_is_migrateable(struct csched2_vcpu *svc,
+                                  struct csched2_runqueue_data *rqd)
+{
+    return !(svc->flags & CSFLAG_runq_migrate_request) &&
+           cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active);
+}
 
 static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
 {
@@ -1374,8 +1435,7 @@ retry:
 
         __update_svc_load(ops, push_svc, 0, now);
 
-        /* Skip this one if it's already been flagged to migrate */
-        if ( push_svc->flags & CSFLAG_runq_migrate_request )
+        if ( !vcpu_is_migrateable(push_svc, st.orqd) )
             continue;
 
         list_for_each( pull_iter, &st.orqd->svc )
@@ -1387,8 +1447,7 @@ retry:
                 __update_svc_load(ops, pull_svc, 0, now);
             }
         
-            /* Skip this one if it's already been flagged to migrate */
-            if ( pull_svc->flags & CSFLAG_runq_migrate_request )
+            if ( !vcpu_is_migrateable(pull_svc, st.lrqd) )
                 continue;
 
             consider(&st, push_svc, pull_svc);
@@ -1404,8 +1463,7 @@ retry:
     {
         struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
         
-        /* Skip this one if it's already been flagged to migrate */
-        if ( pull_svc->flags & CSFLAG_runq_migrate_request )
+        if ( !vcpu_is_migrateable(pull_svc, st.lrqd) )
             continue;
 
         /* Consider pull only */
@@ -1444,11 +1502,22 @@ csched2_vcpu_migrate(
 
     /* Check if new_cpu is valid */
     BUG_ON(!cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
+    ASSERT(cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));
 
     trqd = RQD(ops, new_cpu);
 
+    /*
+     * Do the actual movement toward new_cpu, and update vc->processor.
+     * If we are changing runqueue, migrate() takes care of everything.
+     * If we are not changing runqueue, we need to update vc->processor
+     * here. In fact, if, for instance, we are here because the vcpu's
+     * hard affinity changed, we don't want to risk leaving vc->processor
+     * pointing to a pcpu where we can't run any longer.
+     */
     if ( trqd != svc->rqd )
         migrate(ops, svc, trqd, NOW());
+    else
+        vc->processor = new_cpu;
 }
 
 static int
@@ -1671,6 +1740,10 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
 
+        /* Only consider vcpus that are allowed to run on this processor. */
+        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+            continue;
+
         /* If this is on a different processor, don't pull it unless
          * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
         if ( svc->vcpu->processor != cpu



* Re: [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional
  2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
@ 2016-04-07  4:56   ` Juergen Gross
  2016-04-07 11:24   ` George Dunlap
  1 sibling, 0 replies; 20+ messages in thread
From: Juergen Gross @ 2016-04-07  4:56 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: George Dunlap, Jan Beulich, Josh Whitehead, Meng Xu, Robert VanVossen

On 06/04/16 19:22, Dario Faggioli wrote:
> The .alloc_pdata scheduler hook must, before this change,
> be implemented by all schedulers --even those ones that
> don't need to allocate anything.
> 
> Make it possible to just use the SCHED_OP(), like for
> the other hooks, by using ERR_PTR() and IS_ERR() for
> error reporting. This:
>  - makes NULL a variant of success;
>  - allows for errors other than ENOMEM to be properly
>    communicated (if ever necessary).
> 
> This, in turn, means that schedulers not needing to
> allocate any per-pCPU data, can avoid implementing the
> hook. In fact, the artificial implementation of
> .alloc_pdata in the ARINC653 is removed (and, while there,
> nuke .free_pdata too, as it is equally useless).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen


* Re: [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2016-04-06 17:23 ` [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
@ 2016-04-07  5:04   ` Juergen Gross
  2016-04-07 15:04     ` George Dunlap
  0 siblings, 1 reply; 20+ messages in thread
From: Juergen Gross @ 2016-04-07  5:04 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 06/04/16 19:23, Dario Faggioli wrote:
> In fact, credit2 uses CPU topology to decide how to arrange
> its internal runqueues. Before this change, only 'one runqueue
> per socket' was allowed. However, experiments have shown that,
> for instance, having one runqueue per physical core improves
> performance, especially in case hyperthreading is available.
> 
> In general, it makes sense to allow users to pick one runqueue
> arrangement at boot time, so that:
>  - more experiments can be easily performed to even better
>    assess and improve performance;
>  - one can select the best configuration for his specific
>    use case and/or hardware.
> 
> This patch enables the above.
> 
> Note that, for correctly arranging runqueues to be per-core,
> just checking cpu_to_core() on the host CPUs is not enough.
> In fact, cores (and hyperthreads) on different sockets, can
> have the same core (and thread) IDs! We, therefore, need to
> check whether the full topology of two CPUs matches, for
> them to be put in the same runqueue.
> 
> Note also that the default (although not functional) for
> credit2 has, up to now, been per-socket runqueue. This patch
> leaves things that way, to avoid mixing policy and technical
> changes.
> 
> Finally, it would be a nice feature to be able to select
> a particular runqueue arrangement, even when creating a
> Credit2 cpupool. This is left as future work.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>

With the one comment below addressed:

Reviewed-by: Juergen Gross <jgross@suse.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Changes from v1:
>  * fix bug in parameter parsing, and start using strcmp()
>    for that, as requested during review.
> ---
>  docs/misc/xen-command-line.markdown |   19 +++++++++
>  xen/common/sched_credit2.c          |   76 +++++++++++++++++++++++++++++++++--
>  2 files changed, 90 insertions(+), 5 deletions(-)
> 

...

> @@ -2006,7 +2067,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>  
> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>              break;
>      }
>  
> @@ -2170,6 +2234,8 @@ csched2_init(struct scheduler *ops)
>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> +    printk(" runqueues arrangement: per-%s\n",
> +           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");

I asked this before: shouldn't the options "node" and "all" be
respected here, too?


Juergen


* Re: [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional
  2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
  2016-04-07  4:56   ` Juergen Gross
@ 2016-04-07 11:24   ` George Dunlap
  1 sibling, 0 replies; 20+ messages in thread
From: George Dunlap @ 2016-04-07 11:24 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Juergen Gross, George Dunlap, Robert VanVossen, Josh Whitehead,
	Meng Xu, Jan Beulich

On 06/04/16 18:22, Dario Faggioli wrote:
> The .alloc_pdata scheduler hook must, before this change,
> be implemented by all schedulers --even those ones that
> don't need to allocate anything.
> 
> Make it possible to just use the SCHED_OP(), like for
> the other hooks, by using ERR_PTR() and IS_ERR() for
> error reporting. This:
>  - makes NULL a variant of success;
>  - allows for errors other than ENOMEM to be properly
>    communicated (if ever necessary).
> 
> This, in turn, means that schedulers not needing to
> allocate any per-pCPU data, can avoid implementing the
> hook. In fact, the artificial implementation of
> .alloc_pdata in the ARINC653 is removed (and, while there,
> nuke .free_pdata too, as it is equally useless).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Robert VanVossen <robert.vanvossen@dornerworks.com>
> Cc: Josh Whitehead <josh.whitehead@dornerworks.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Changes from v1:
>  * only update sd->sched_priv if alloc_pdata does not return
>    IS_ERR, so that xfree() can always be safely called on
>    sd->sched_priv itself, as requested during review;
>  * xen/err.h included in .c files that actually need it,
>    instead than in sched-if.h.
> ---
>  xen/common/sched_arinc653.c |   31 -------------------------------
>  xen/common/sched_credit.c   |    5 +++--
>  xen/common/sched_credit2.c  |    2 +-
>  xen/common/sched_rt.c       |    8 ++++----
>  xen/common/schedule.c       |   27 +++++++++++++++++----------
>  5 files changed, 25 insertions(+), 48 deletions(-)
> 
> diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
> index 8a11a2f..b79fcdf 100644
> --- a/xen/common/sched_arinc653.c
> +++ b/xen/common/sched_arinc653.c
> @@ -456,34 +456,6 @@ a653sched_free_vdata(const struct scheduler *ops, void *priv)
>  }
>  
>  /**
> - * This function allocates scheduler-specific data for a physical CPU
> - *
> - * We do not actually make use of any per-CPU data but the hypervisor expects
> - * a non-NULL return value
> - *
> - * @param ops       Pointer to this instance of the scheduler structure
> - *
> - * @return          Pointer to the allocated data
> - */
> -static void *
> -a653sched_alloc_pdata(const struct scheduler *ops, int cpu)
> -{
> -    /* return a non-NULL value to keep schedule.c happy */
> -    return SCHED_PRIV(ops);
> -}
> -
> -/**
> - * This function frees scheduler-specific data for a physical CPU
> - *
> - * @param ops       Pointer to this instance of the scheduler structure
> - */
> -static void
> -a653sched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> -{
> -    /* nop */
> -}
> -
> -/**
>   * This function allocates scheduler-specific data for a domain
>   *
>   * We do not actually make use of any per-domain data but the hypervisor
> @@ -737,9 +709,6 @@ static const struct scheduler sched_arinc653_def = {
>      .free_vdata     = a653sched_free_vdata,
>      .alloc_vdata    = a653sched_alloc_vdata,
>  
> -    .free_pdata     = a653sched_free_pdata,
> -    .alloc_pdata    = a653sched_alloc_pdata,
> -
>      .free_domdata   = a653sched_free_domdata,
>      .alloc_domdata  = a653sched_alloc_domdata,
>  
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 4c4927f..63a4a63 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -23,6 +23,7 @@
>  #include <xen/errno.h>
>  #include <xen/keyhandler.h>
>  #include <xen/trace.h>
> +#include <xen/err.h>
>  
>  
>  /*
> @@ -532,12 +533,12 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
>      /* Allocate per-PCPU info */
>      spc = xzalloc(struct csched_pcpu);
>      if ( spc == NULL )
> -        return NULL;
> +        return ERR_PTR(-ENOMEM);
>  
>      if ( !alloc_cpumask_var(&spc->balance_mask) )
>      {
>          xfree(spc);
> -        return NULL;
> +        return ERR_PTR(-ENOMEM);
>      }
>  
>      spin_lock_irqsave(&prv->lock, flags);
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index b8c8e40..e97d8be 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -2047,7 +2047,7 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
>          printk("%s: cpu %d not online yet, deferring initializatgion\n",
>                 __func__, cpu);
>  
> -    return (void *)1;
> +    return NULL;
>  }
>  
>  static void
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 321b0a5..aece318 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -29,6 +29,7 @@
>  #include <xen/cpu.h>
>  #include <xen/keyhandler.h>
>  #include <xen/trace.h>
> +#include <xen/err.h>
>  #include <xen/guest_access.h>
>  
>  /*
> @@ -681,7 +682,7 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>      spin_unlock_irqrestore(old_lock, flags);
>  
>      if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
> -        return NULL;
> +        return ERR_PTR(-ENOMEM);
>  
>      if ( prv->repl_timer == NULL )
>      {
> @@ -689,13 +690,12 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>          prv->repl_timer = xzalloc(struct timer);
>  
>          if ( prv->repl_timer == NULL )
> -            return NULL;
> +            return ERR_PTR(-ENOMEM);
>  
>          init_timer(prv->repl_timer, repl_timer_handler, (void *)ops, cpu);
>      }
>  
> -    /* 1 indicates alloc. succeed in schedule.c */
> -    return (void *)1;
> +    return NULL;
>  }
>  
>  static void
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index b7dee16..1941613 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -37,6 +37,7 @@
>  #include <xen/event.h>
>  #include <public/sched.h>
>  #include <xsm/xsm.h>
> +#include <xen/err.h>
>  
>  /* opt_sched: scheduler - default to configured value */
>  static char __initdata opt_sched[10] = CONFIG_SCHED_DEFAULT;
> @@ -1462,6 +1463,7 @@ static void poll_timer_fn(void *data)
>  static int cpu_schedule_up(unsigned int cpu)
>  {
>      struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +    void *sched_priv;
>  
>      per_cpu(scheduler, cpu) = &ops;
>      spin_lock_init(&sd->_lock);
> @@ -1500,9 +1502,16 @@ static int cpu_schedule_up(unsigned int cpu)
>      if ( idle_vcpu[cpu] == NULL )
>          return -ENOMEM;
>  
> -    if ( (ops.alloc_pdata != NULL) &&
> -         ((sd->sched_priv = ops.alloc_pdata(&ops, cpu)) == NULL) )
> -        return -ENOMEM;
> +    /*
> +     * We don't want to risk calling xfree() on an sd->sched_priv
> +     * (e.g., inside free_pdata, from cpu_schedule_down() called
> +     * during CPU_UP_CANCELLED) that contains an IS_ERR value.
> +     */
> +    sched_priv = SCHED_OP(&ops, alloc_pdata, cpu);
> +    if ( IS_ERR(sched_priv) )
> +        return PTR_ERR(sched_priv);
> +
> +    sd->sched_priv = sched_priv;
>  
>      return 0;
>  }
> @@ -1512,8 +1521,7 @@ static void cpu_schedule_down(unsigned int cpu)
>      struct schedule_data *sd = &per_cpu(schedule_data, cpu);
>      struct scheduler *sched = per_cpu(scheduler, cpu);
>  
> -    if ( sd->sched_priv != NULL )
> -        SCHED_OP(sched, free_pdata, sd->sched_priv, cpu);
> +    SCHED_OP(sched, free_pdata, sd->sched_priv, cpu);
>      SCHED_OP(sched, free_vdata, idle_vcpu[cpu]->sched_priv);
>  
>      idle_vcpu[cpu]->sched_priv = NULL;
> @@ -1608,9 +1616,8 @@ void __init scheduler_init(void)
>      idle_domain->max_vcpus = nr_cpu_ids;
>      if ( alloc_vcpu(idle_domain, 0, 0) == NULL )
>          BUG();
> -    if ( ops.alloc_pdata &&
> -         !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
> -        BUG();
> +    this_cpu(schedule_data).sched_priv = SCHED_OP(&ops, alloc_pdata, 0);
> +    BUG_ON(IS_ERR(this_cpu(schedule_data).sched_priv));
>      SCHED_OP(&ops, init_pdata, this_cpu(schedule_data).sched_priv, 0);
>  }
>  
> @@ -1653,8 +1660,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> -    if ( ppriv == NULL )
> -        return -ENOMEM;
> +    if ( IS_ERR(ppriv) )
> +        return PTR_ERR(ppriv);
>      SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
> 



* Re: [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs
  2016-04-06 17:23 ` [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs Dario Faggioli
@ 2016-04-07 14:54   ` George Dunlap
  2016-04-07 23:48     ` Dario Faggioli
  0 siblings, 1 reply; 20+ messages in thread
From: George Dunlap @ 2016-04-07 14:54 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Tianyang Chen, Meng Xu

On 06/04/16 18:23, Dario Faggioli wrote:
> In short, the point is making sure that the actual switch
> of scheduler and the remapping of the scheduler's runqueue
> lock occur in the same critical section, protected by the
> "old" scheduler's lock (and not, e.g., in the free_pdata
> hook, as it is now for Credit2 and RTDS).
> 
> Not doing so is (at least) racy. In fact, for instance,
> if we switch cpu X from, Credit2 to Credit, we do:
> 
>  schedule_cpu_switch(x, csched2 --> csched):
>    //scheduler[x] is csched2
>    //schedule_lock[x] is csched2_lock
>    csched_alloc_pdata(x)
>    csched_init_pdata(x)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    [1]
>    csched2_free_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched2_lock
>      schedule_lock[x] = csched_lock
>      spin_unlock(csched2_lock)
> 
> While, if we switch cpu X from, Credit to Credit2, we do:
> 
>  schedule_cpu_switch(X, csched --> csched2):
>    //scheduler[x] is csched
>    //schedule_lock[x] is csched_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes csched_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(csched_lock)
>    [2]
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[X] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    csched_free_pdata(x)
> 
> And if we switch cpu X from RTDS to Credit2, we do:
> 
>  schedule_cpu_switch(X, RTDS --> csched2):
>    //scheduler[x] is rtds
>    //schedule_lock[x] is rtds_lock
>    csched2_alloc_pdata(x)
>    csched2_init_pdata(x)
>      pcpu_schedule_lock(x) --> takes rtds_lock
>      schedule_lock[x] = csched2_lock
>      spin_unlock(rtds_lock)
>    pcpu_schedule_lock(x) ----> takes csched2_lock
>    scheduler[x] = csched2
>    pcpu_schedule_unlock(x) --> unlocks csched2_lock
>    rtds_free_pdata(x)
>      spin_lock(rtds_lock)
>      ASSERT(schedule_lock[x] == rtds_lock) [3]
>      schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
>      spin_unlock(rtds_lock)
> 
> So, the first problem is that, if anything related to
> scheduling, and involving CPU, happens at [1] or [2], we:
>  - take csched2_lock,
>  - operate on Credit1 functions and data structures,
> which is no good!
> 
> The second problem is that the ASSERT at [3] triggers, and
> the third that at [4], we screw up the lock remapping we've
> done for ourself in csched2_init_pdata()!
> 
> The first problem arises because there is a window during
> which the lock is already the new one, but the scheduler is
> still the old one. The other two, because we let schedulers
> mess with the lock (re)mapping done by others.
> 
> This patch, therefore, introduces a new hook in the scheduler
> interface, called switch_sched, meant at being used when
> switching scheduler on a CPU, and implements it for the
> various schedulers (that needs it: i.e., all except ARINC653),
> so that things are done in the proper order and under the
> protection of the best suited (set of) lock(s). It is
> necessary to add the hook (as compared to keep doing things
> in generic code), because different schedulers may have
>  different locking schemes.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Hey Dario! Everything here looks good, except for one thing: the
scheduler lock for arinc653 scheduler. :-)  What happens now if you
assign a cpu to credit2, and then assign it to arinc653?  Since arinc
doesn't implement the switch_sched() functionality, the per-cpu
scheduler lock will still point to the credit2 lock, won't it?

Which will *work*, although it will add unnecessary contention to the
credit2 lock;  until the lock goes away, at which point
vcpu_schedule_lock*() will essentially be using a wild pointer.

 -George

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> Cc: Tianyang Chen <tiche@seas.upenn.edu>
> ---
> Changes from v1:
> 
> new patch, basically, coming from squashing what were
> 4 patches in v1. In any case, with respect to those 4
> patches:
>  - runqueue lock is back being taken in schedule_cpu_switch(),
>    as suggested during review;
>  - add barriers for making sure all initialization is done
>    when the new lock is assigned, as sugested during review;
>  - add comments and ASSERT-s about how and why the adopted
>    locking scheme is safe, as suggested during review.
> ---
>  xen/common/sched_credit.c  |   44 ++++++++++++++++++++++++
>  xen/common/sched_credit2.c |   81 +++++++++++++++++++++++++++++++++-----------
>  xen/common/sched_rt.c      |   45 +++++++++++++++++-------
>  xen/common/schedule.c      |   41 +++++++++++++++++-----
>  xen/include/xen/sched-if.h |    3 ++
>  5 files changed, 172 insertions(+), 42 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 96a245d..540d515 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -578,12 +578,55 @@ csched_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      unsigned long flags;
>      struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +
> +    /*
> +     * This is called either during boot, resume or hotplug, in
> +     * case Credit1 is the scheduler chosen at boot. In such cases, the
> +     * scheduler lock for cpu is already pointing to the default per-cpu
> +     * spinlock, as Credit1 needs it, so there is no remapping to be done.
> +     */
> +    ASSERT(sd->schedule_lock == &sd->_lock && !spin_is_locked(&sd->_lock));
>  
>      spin_lock_irqsave(&prv->lock, flags);
>      init_pdata(prv, pdata, cpu);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit). */
> +static void
> +csched_switch_sched(struct scheduler *ops, unsigned int cpu,
> +                    void *pdata, void *vdata)
> +{
> +    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> +    struct csched_private *prv = CSCHED_PRIV(ops);
> +    struct csched_vcpu *svc = vdata;
> +
> +    ASSERT(svc && is_idle_vcpu(svc->vcpu));
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It actually may or may not be the 'right'
> +     * one for this cpu, but that is ok for preventing races.
> +     */
> +    spin_lock(&prv->lock);
> +    init_pdata(prv, pdata, cpu);
> +    spin_unlock(&prv->lock);
> +
> +    per_cpu(scheduler, cpu) = ops;
> +    per_cpu(schedule_data, cpu).sched_priv = pdata;
> +
> +    /*
> +     * (Re?)route the lock to the per pCPU lock as /last/ thing. In fact,
> +     * if it is free (and it can be) we want that anyone that manages
> +     * taking it, finds all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    sd->schedule_lock = &sd->_lock;
> +}
> +
>  #ifndef NDEBUG
>  static inline void
>  __csched_vcpu_check(struct vcpu *vc)
> @@ -2067,6 +2110,7 @@ static const struct scheduler sched_credit_def = {
>      .alloc_pdata    = csched_alloc_pdata,
>      .init_pdata     = csched_init_pdata,
>      .free_pdata     = csched_free_pdata,
> +    .switch_sched   = csched_switch_sched,
>      .alloc_domdata  = csched_alloc_domdata,
>      .free_domdata   = csched_free_domdata,
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 8989eea..60c6f5b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1971,12 +1971,12 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> -static void
> +/* Returns the ID of the runqueue the cpu is assigned to. */
> +static unsigned
>  init_pdata(struct csched2_private *prv, unsigned int cpu)
>  {
>      unsigned rqi;
>      struct csched2_runqueue_data *rqd;
> -    spinlock_t *old_lock;
>  
>      ASSERT(spin_is_locked(&prv->lock));
>      ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
> @@ -2007,44 +2007,89 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
>          activate_runqueue(prv, rqi);
>      }
>      
> -    /* IRQs already disabled */
> -    old_lock = pcpu_schedule_lock(cpu);
> -
> -    /* Move spinlock to new runq lock.  */
> -    per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock;
> -
>      /* Set the runqueue map */
>      prv->runq_map[cpu] = rqi;
>      
>      cpumask_set_cpu(cpu, &rqd->idle);
>      cpumask_set_cpu(cpu, &rqd->active);
> -
> -    /* _Not_ pcpu_schedule_unlock(): per_cpu().schedule_lock changed! */
> -    spin_unlock(old_lock);
> -
>      cpumask_set_cpu(cpu, &prv->initialized);
>  
> -    return;
> +    return rqi;
>  }
>  
>  static void
>  csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
> +    spinlock_t *old_lock;
>      unsigned long flags;
> +    unsigned rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> -    init_pdata(prv, cpu);
> +    old_lock = pcpu_schedule_lock(cpu);
> +
> +    rqi = init_pdata(prv, cpu);
> +    /* Move the scheduler lock to the new runq lock. */
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock(old_lock);
>      spin_unlock_irqrestore(&prv->lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (Credit2). */
> +static void
> +csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                     void *pdata, void *vdata)
> +{
> +    struct csched2_private *prv = CSCHED2_PRIV(new_ops);
> +    struct csched2_vcpu *svc = vdata;
> +    unsigned rqi;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We own one runqueue lock already (from schedule_cpu_switch()). This
> +     * looks like it violates this scheduler's locking rules, but it does
> +     * not, as what we own is the lock of another scheduler, which hence has
> +     * no particular (ordering) relationship with our private global lock.
> +     * And owning exactly that one (the lock of the old scheduler of this
> +     * cpu) is what is necessary to prevent races.
> +     */
> +    spin_lock_irq(&prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +
> +    rqi = init_pdata(prv, cpu);
> +
> +    /*
> +     * Now that we know which runqueue we'll end up in, double check what's
> +     * said above: the lock we already hold is not this scheduler's lock for
> +     * that runqueue, and so it was safe to have taken it /before/ our
> +     * private global lock.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->rqd[rqi].lock);
> +
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to the runqueue lock as the /last/ thing. In
> +     * fact, if it is free (and it can be), we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
> +
> +    spin_unlock_irq(&prv->lock);
> +}
> +
>  static void
>  csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
>      unsigned long flags;
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      struct csched2_runqueue_data *rqd;
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
>      int rqi;
>  
>      spin_lock_irqsave(&prv->lock, flags);
> @@ -2072,11 +2117,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>          deactivate_runqueue(prv, rqi);
>      }
>  
> -    /* Move spinlock to the original lock.  */
> -    ASSERT(sd->schedule_lock == &rqd->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
>      spin_unlock(&rqd->lock);
>  
>      cpumask_clear_cpu(cpu, &prv->initialized);
> @@ -2170,6 +2210,7 @@ static const struct scheduler sched_credit2_def = {
>      .free_vdata     = csched2_free_vdata,
>      .init_pdata     = csched2_init_pdata,
>      .free_pdata     = csched2_free_pdata,
> +    .switch_sched   = csched2_switch_sched,
>      .alloc_domdata  = csched2_alloc_domdata,
>      .free_domdata   = csched2_free_domdata,
>  };
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index b96bd93..3bb8c71 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -682,6 +682,37 @@ rt_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
>      spin_unlock_irqrestore(old_lock, flags);
>  }
>  
> +/* Change the scheduler of cpu to us (RTDS). */
> +static void
> +rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
> +                void *pdata, void *vdata)
> +{
> +    struct rt_private *prv = rt_priv(new_ops);
> +    struct rt_vcpu *svc = vdata;
> +
> +    ASSERT(!pdata && svc && is_idle_vcpu(svc->vcpu));
> +
> +    /*
> +     * We are holding the runqueue lock already (it's been taken in
> +     * schedule_cpu_switch()). It's actually the runqueue lock of
> +     * another scheduler, but that is how things need to be, for
> +     * preventing races.
> +     */
> +    ASSERT(per_cpu(schedule_data, cpu).schedule_lock != &prv->lock);
> +
> +    idle_vcpu[cpu]->sched_priv = vdata;
> +    per_cpu(scheduler, cpu) = new_ops;
> +    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */
> +
> +    /*
> +     * (Re?)route the lock to our global lock as the /last/ thing. In
> +     * fact, if it is free (and it can be), we want anyone who manages to
> +     * take it to find all the initializations we've done above in place.
> +     */
> +    smp_mb();
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +}
> +
>  static void *
>  rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  {
> @@ -707,19 +738,6 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>  static void
>  rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  {
> -    struct rt_private *prv = rt_priv(ops);
> -    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
> -    unsigned long flags;
> -
> -    spin_lock_irqsave(&prv->lock, flags);
> -
> -    /* Move spinlock back to the default lock */
> -    ASSERT(sd->schedule_lock == &prv->lock);
> -    ASSERT(!spin_is_locked(&sd->_lock));
> -    sd->schedule_lock = &sd->_lock;
> -
> -    spin_unlock_irqrestore(&prv->lock, flags);
> -
>      free_cpumask_var(_cpumask_scratch[cpu]);
>  }
>  
> @@ -1468,6 +1486,7 @@ static const struct scheduler sched_rtds_def = {
>      .alloc_pdata    = rt_alloc_pdata,
>      .free_pdata     = rt_free_pdata,
>      .init_pdata     = rt_init_pdata,
> +    .switch_sched   = rt_switch_sched,
>      .alloc_domdata  = rt_alloc_domdata,
>      .free_domdata   = rt_free_domdata,
>      .init_domain    = rt_dom_init,
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 1941613..5559aa1 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1635,11 +1635,11 @@ void __init scheduler_init(void)
>  int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>  {
>      struct vcpu *idle;
> -    spinlock_t *lock;
>      void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
>      struct scheduler *old_ops = per_cpu(scheduler, cpu);
>      struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
>      struct cpupool *old_pool = per_cpu(cpupool, cpu);
> +    spinlock_t *old_lock;
>  
>      /*
>       * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
> @@ -1658,11 +1658,21 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>      if ( old_ops == new_ops )
>          goto out;
>  
> +    /*
> +     * To set up the cpu for the new scheduler we need:
> +     *  - a valid instance of per-CPU scheduler specific data, as it is
> +     *    allocated by SCHED_OP(alloc_pdata). Note that we do not want to
> +     *    initialize it yet (i.e., we are not calling SCHED_OP(init_pdata)).
> +     *    That will be done by the target scheduler, in SCHED_OP(switch_sched),
> +     *    in proper ordering and with locking.
> +     *  - a valid instance of per-vCPU scheduler specific data, for the idle
> +     *    vCPU of cpu. That is what the target scheduler will use for the
> +     *    sched_priv field of the per-vCPU info of the idle domain.
> +     */
>      idle = idle_vcpu[cpu];
>      ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
>      if ( IS_ERR(ppriv) )
>          return PTR_ERR(ppriv);
> -    SCHED_OP(new_ops, init_pdata, ppriv, cpu);
>      vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>      if ( vpriv == NULL )
>      {
> @@ -1670,17 +1680,30 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return -ENOMEM;
>      }
>  
> -    lock = pcpu_schedule_lock_irq(cpu);
> -
>      SCHED_OP(old_ops, tick_suspend, cpu);
> +
> +    /*
> +     * The actual switch, including (if necessary) the rerouting of the
> +     * scheduler lock to whatever new_ops prefers, needs to happen in one
> +     * critical section, protected by old_ops' lock, or races are possible.
> +     * It is, in fact, the lock of another scheduler that we are taking (the
> +     * scheduler of the cpupool that cpu still belongs to). But that is ok,
> +     * as anyone trying to schedule on this cpu will spin until we release
> +     * that lock (at the bottom of this function). When it gets the lock
> +     * --thanks to the loop inside the *_schedule_lock() functions-- it will
> +     * notice that the lock itself has changed, and retry acquiring the new
> +     * one (which will be the correct, remapped one, at that point).
> +     */
> +    old_lock = pcpu_schedule_lock(cpu);
> +
>      vpriv_old = idle->sched_priv;
> -    idle->sched_priv = vpriv;
> -    per_cpu(scheduler, cpu) = new_ops;
>      ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
> -    per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, switch_sched, cpu, ppriv, vpriv);
>  
> -    pcpu_schedule_unlock_irq(lock, cpu);
> +    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
> +    spin_unlock_irq(old_lock);
> +
> +    SCHED_OP(new_ops, tick_resume, cpu);
>  
>      SCHED_OP(old_ops, free_vdata, vpriv_old);
>      SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index 70c08c6..9cebe41 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -137,6 +137,9 @@ struct scheduler {
>      void         (*free_domdata)   (const struct scheduler *, void *);
>      void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
>  
> +    void         (*switch_sched)   (struct scheduler *, unsigned int,
> +                                    void *, void *);
> +
>      int          (*init_domain)    (const struct scheduler *, struct domain *);
>      void         (*destroy_domain) (const struct scheduler *, struct domain *);
>  
> 



* Re: [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2016-04-07  5:04   ` Juergen Gross
@ 2016-04-07 15:04     ` George Dunlap
  2016-04-07 22:45       ` Dario Faggioli
  0 siblings, 1 reply; 20+ messages in thread
From: George Dunlap @ 2016-04-07 15:04 UTC (permalink / raw)
  To: Juergen Gross, Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 07/04/16 06:04, Juergen Gross wrote:
> On 06/04/16 19:23, Dario Faggioli wrote:
>> In fact, credit2 uses CPU topology to decide how to arrange
>> its internal runqueues. Before this change, only 'one runqueue
>> per socket' was allowed. However, experiments have shown that,
>> for instance, having one runqueue per physical core improves
>> performance, especially in case hyperthreading is available.
>>
>> In general, it makes sense to allow users to pick one runqueue
>> arrangement at boot time, so that:
>>  - more experiments can be easily performed to even better
>>    assess and improve performance;
>>  - one can select the best configuration for his specific
>>    use case and/or hardware.
>>
>> This patch enables the above.
>>
>> Note that, for correctly arranging runqueues to be per-core,
>> just checking cpu_to_core() on the host CPUs is not enough.
>> In fact, cores (and hyperthreads) on different sockets, can
>> have the same core (and thread) IDs! We, therefore, need to
>> check whether the full topology of two CPUs matches, for
>> them to be put in the same runqueue.
>>
>> Note also that the default for credit2 has, until now, been
>> per-socket runqueues (although that was not actually functional).
>> This patch leaves things that way, to avoid mixing policy and
>> technical changes.
>>
>> Finally, it would be a nice feature to be able to select
>> a particular runqueue arrangement, even when creating a
>> Credit2 cpupool. This is left as future work.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> 
> With the one comment below addressed:
> 
> Reviewed-by: Juergen Gross <jgross@suse.com>
> 
>> ---
>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>> Cc: Uma Sharma <uma.sharma523@gmail.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> ---
>> Changes from v1:
>>  * fix bug in parameter parsing, and start using strcmp()
>>    for that, as requested during review.
>> ---
>>  docs/misc/xen-command-line.markdown |   19 +++++++++
>>  xen/common/sched_credit2.c          |   76 +++++++++++++++++++++++++++++++++--
>>  2 files changed, 90 insertions(+), 5 deletions(-)
>>
> 
> ...
> 
>> @@ -2006,7 +2067,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>>  
>> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
>> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
>> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
>> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
>> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>>              break;
>>      }
>>  
>> @@ -2170,6 +2234,8 @@ csched2_init(struct scheduler *ops)
>>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
>> +    printk(" runqueues arrangement: per-%s\n",
>> +           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");
> 
> I asked this before: shouldn't the options "node" and "all" be
> respected here, too?

Dario, would it make sense to put the string names ("core", "socket",
&c) in an array, then have parse_credit2_runqueue() iterate over the
array to find the appropriate numeric value, and have this printk use
the array to convert from the numeric value back to a string?
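
Something along these lines, maybe (a rough sketch only, to illustrate
the idea, not the actual patch; it assumes the OPT_RUNQUEUE_* values
are small consecutive integers starting at 0, so they can index the
array, and the array name itself is made up):

static const char *const opt_runqueue_str[] = {
    [OPT_RUNQUEUE_CORE]   = "core",
    [OPT_RUNQUEUE_SOCKET] = "socket",
    [OPT_RUNQUEUE_NODE]   = "node",
    [OPT_RUNQUEUE_ALL]    = "all",
};

static void parse_credit2_runqueue(const char *s)
{
    unsigned int i;

    for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ )
    {
        /* Pick the numeric value whose name matches the parameter. */
        if ( !strcmp(s, opt_runqueue_str[i]) )
        {
            opt_runqueue = i;
            return;
        }
    }

    printk("WARNING, unrecognized value of credit2_runqueue option!\n");
}

and then, in csched2_init():

    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);

That way, the parser and the boot-time printout can't go out of sync
when a new arrangement is added.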

 -George


* Re: [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack
  2016-04-06 17:24 ` [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack Dario Faggioli
@ 2016-04-07 15:12   ` George Dunlap
  0 siblings, 0 replies; 20+ messages in thread
From: George Dunlap @ 2016-04-07 15:12 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper

On 06/04/16 18:24, Dario Faggioli wrote:
> directly, from schedule.c, for any scheduler that needs
> it to use it.
> 
> In fact, Credit1 and RTDS need this already. Credit2 is
> also going to need it, for supporting hard affinity
> (which is, typically, what requires a lot of cpumask
> manipulations, inside various functions).
> 
> Therefore, let's define the scratch space at a broader
> scope, to limit code duplication in handling it.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Love seeing all the lines starting with '-' in this diff. :-)

Reviewed-by: George Dunlap <george.dunlap@citrix.com>
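
(For reference, the idea boils down to something along the lines of the
sketch below. This is only an illustration: the actual variable name,
and whether the patch uses a static per-cpu cpumask or a dynamically
allocated array, may well differ.

/* In schedule.c: one scratch cpumask per pCPU, for any scheduler to use. */
DEFINE_PER_CPU(cpumask_t, cpumask_scratch);

/* In sched-if.h: declaration plus a handy accessor. */
DECLARE_PER_CPU(cpumask_t, cpumask_scratch);
#define cpumask_scratch_cpu(c) (&per_cpu(cpumask_scratch, c))

With that, a scheduler can do its cpumask manipulations in
cpumask_scratch_cpu(cpu), instead of keeping a private copy of the
scratch space, or putting a cpumask_t on the stack.)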



* Re: [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2016-04-07 15:04     ` George Dunlap
@ 2016-04-07 22:45       ` Dario Faggioli
  0 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-07 22:45 UTC (permalink / raw)
  To: George Dunlap, Juergen Gross, xen-devel; +Cc: George Dunlap, Uma Sharma



On Thu, 2016-04-07 at 16:04 +0100, George Dunlap wrote:
> On 07/04/16 06:04, Juergen Gross wrote:
> > On 06/04/16 19:23, Dario Faggioli wrote:
> > > @@ -2170,6 +2234,8 @@ csched2_init(struct scheduler *ops)
> > >      printk(" load_window_shift: %d\n", opt_load_window_shift);
> > >      printk(" underload_balance_tolerance: %d\n",
> > > opt_underload_balance_tolerance);
> > >      printk(" overload_balance_tolerance: %d\n",
> > > opt_overload_balance_tolerance);
> > > +    printk(" runqueues arrangement: per-%s\n",
> > > +           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" :
> > > "socket");
> > I asked this before: shouldn't the optiones "node" and "all" be
> > respected here, too?
> Dario, would it make sense to put the string names ("core", "socket",
> &c) in an array, then have both parse_credit2_runqueue() iterate over
> the array to find the appropriate numeric value, and have this use
> the
> array to convert from the numeric value to a string?
> 
Ok, I'll do that.

Even if I do, though, I can't get rid of the OPT_RUNQUEUE_CORE, etc.,
symbols, as I need to figure out what the numeric value found during
parsing actually means in cpu_to_runqueue().

I know you're not mentioning this, but I felt I'd better make this
clear, in case one would expect those to go away too.

In any case, you'll see this in the patch.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs
  2016-04-07 14:54   ` George Dunlap
@ 2016-04-07 23:48     ` Dario Faggioli
  0 siblings, 0 replies; 20+ messages in thread
From: Dario Faggioli @ 2016-04-07 23:48 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Tianyang Chen, Meng Xu



On Thu, 2016-04-07 at 15:54 +0100, George Dunlap wrote:
> On 06/04/16 18:23, Dario Faggioli wrote:
> > 
> > This patch, therefore, introduces a new hook in the scheduler
> > interface, called switch_sched, meant to be used when
> > switching scheduler on a CPU, and implements it for the
> > various schedulers that need it (i.e., all except ARINC653),
> > so that things are done in the proper order and under the
> > protection of the best suited (set of) lock(s). It is
> > necessary to add the hook (as compared to keeping things
> > in generic code), because different schedulers may have
> > different locking schemes.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Hey Dario! Everything here looks good, except for one thing: the
scheduler lock for the arinc653 scheduler. :-)
>
Bah! :-/

> What happens now if you
> assign a cpu to credit2, and then assign it to arinc653?  Since arinc
> doesn't implement the switch_sched() functionality, the per-cpu
> scheduler lock will still point to the credit2 lock, won't it?
> 
So, to assign a cpu to a pool you have to first remove it from the pool
it's currently in. And removing a cpu from its pool means re-assigning
it to the default pool (pool0), which uses whatever scheduler we chose
at boot. (Then, of course, vcpus of domains in the default pool still
won't run there, because the cpu's bit in the pool0->cpu_valid mask
is 0.)

So, if you move a cpu from pool0 to an ARINC pool, the lock would
indeed keep pointing to pool0's runqueue lock (which will be one of
Credit2's runqueue locks, if the default scheduler is Credit2).

And that is the case even if you move a cpu from some non-default pool
to an ARINC pool, because you always have to remove the cpu from where
it is first, which automatically puts it back in the default pool.

This is relevant because...

> Which will *work*, although it will add unnecessary contention to the
> credit2 lock;  until the lock goes away, at which point
> vcpu_schedule_lock*() will essentially be using a wild pointer.
> 
...since we're talking about the lock of the default scheduler, it'll
never go away.

But indeed I agree that this just adds unnecessary contention which, in
case of Credit2 or RTDS, where the lock is shared between pcpus, is
something we certainly don't want (good catch!).

I'll add a very simple implementation of switch_sched to ARINC, which
basically points sd->schedule_lock back at sd->_lock, and also does the
basic operations for actually switching scheduler which, before this
patch, were done in schedule_cpu_switch() itself and hence are also
needed there.
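
Roughly something along these lines (just a sketch, to show the shape
of it; the vcpu wrapper type and its member names in sched_arinc653.c
are written from memory here, so they may need adjusting):

static void
a653_switch_sched(struct scheduler *new_ops, unsigned int cpu,
                  void *pdata, void *vdata)
{
    struct schedule_data *sd = &per_cpu(schedule_data, cpu);
    arinc653_vcpu_t *svc = vdata;

    ASSERT(!pdata && svc && is_idle_vcpu(svc->vc));

    idle_vcpu[cpu]->sched_priv = vdata;

    per_cpu(scheduler, cpu) = new_ops;
    per_cpu(schedule_data, cpu).sched_priv = NULL; /* no pdata */

    /*
     * ARINC653 takes its own private lock in its hooks, so the per-cpu
     * scheduler lock can just go back to the default per-cpu one. As in
     * the other schedulers, reroute it as the very last thing, and only
     * after a barrier.
     */
    smp_mb();
    sd->schedule_lock = &sd->_lock;
}

Plus, of course, wiring .switch_sched up in sched_arinc653_def.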

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



end of thread, other threads:[~2016-04-07 23:49 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-06 17:22 [PATCH v2 00/11] Fixes and improvement (including hard affinity!) for Credit2 Dario Faggioli
2016-04-06 17:22 ` [PATCH v2 01/11] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
2016-04-07  4:56   ` Juergen Gross
2016-04-07 11:24   ` George Dunlap
2016-04-06 17:22 ` [PATCH v2 02/11] xen: sched: implement .init_pdata in Credit, Credit2 and RTDS Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 03/11] xen: sched: move pCPU initialization in an helper Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 04/11] xen: sched: close potential races when switching scheduler to CPUs Dario Faggioli
2016-04-07 14:54   ` George Dunlap
2016-04-07 23:48     ` Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 05/11] xen: sched: improve credit2 bootparams' scope, placement and signedness Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 06/11] xen: sched: on Credit2, don't reprogram the timer if idle Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 07/11] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 08/11] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
2016-04-07  5:04   ` Juergen Gross
2016-04-07 15:04     ` George Dunlap
2016-04-07 22:45       ` Dario Faggioli
2016-04-06 17:23 ` [PATCH v2 09/11] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
2016-04-06 17:24 ` [PATCH v2 10/11] xen: sched: privde some scratch space for not putting cpumasks on stack Dario Faggioli
2016-04-07 15:12   ` George Dunlap
2016-04-06 17:24 ` [PATCH v2 11/11] xen: sched: implement vcpu hard affinity in Credit2 Dario Faggioli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).