* [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling
@ 2015-09-29 16:55 Dario Faggioli
  2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Justin Weaver, George Dunlap, Robert VanVossen, Uma Sharma,
	Josh Whitehead, Meng Xu, Juergen Gross

Hi everyone,

Code is already there, in the Credit2 scheduler, for setting up one runqueue
for each socket found in the system. However, that code is not functional:
only one runqueue is created, with all the pCPUs in it.

Some more details are available here:

  http://lists.xen.org/archives/html/xen-devel/2014-08/msg02168.html

As well as in our bug #36 "credit2 only uses one runqueue instead of one runq
per socket" (http://bugs.xenproject.org/xen/bug/36).

Also, during a previous OPW round, Uma Sharma did some investigation, together
with me and George, on how we could introduce hyperthreading support in
Credit2. It turned out that arranging runqueues to be 'per-core' (rather than
per-socket) was the most obvious first step. She performed some experiments,
and what came out was that aggregating pCPUs per-core actually provides
better performance --in some cases, even better than Credit1. There are still
scenarios where more investigation and refinement are necessary, but the
per-core approach looks more promising.

Some of the measurements she performed are available here:

  http://lists.xen.org/archives/html/xen-devel/2015-03/msg01499.html

I made some more myself, for my presentation at the last XenDevSummit, which is
available here:

  http://events.linuxfoundation.org/sites/events/files/slides/Faggioli_XenSummit.pdf

Since there was some preparatory work to be done (i.e., what's in the first
half of this series), Uma couldn't really upstream her patches at the time, and
we agreed that I would follow up and re-submit her patches after the
preliminary adjustments were done. Therefore, this series encompasses Uma's
work, sometimes in the exact same form in which she originally submitted it,
sometimes tweaked to fit on top of the updated Credit2 codebase.


So, in summary, this series does the following:

 1) first of all, it splits allocation and initialization phases of scheduling
    related per-pCPU data. Thanks to this, all allocations can be performed
    during the CPU_UP_PREPARE phase of pCPU bringup, while proper
    initialization can happen later, during CPU_STARTING. We really want
    --especially for Credit2-- to do the initialization in CPU_STARTING, as
    before that, we won't have the necessary topology information available.

    However, we can't move the alloc_pdata hook altogether (and keep doing
    both allocation and initialization in there), as we can't really allocate
    memory while in CPU_STARTING (due to IRQs being disabled). Therefore, the
    split looks like the best solution, and it also makes a lot of sense
    architecturally. In fact, right now, we have schedulers that do not need
    to allocate anything, but have to use alloc_pdata for initialization! :-/

    This was already discussed in some detail, some time ago, in this
    subthread:

      http://lists.xen.org/archives/html/xen-devel/2015-03/msg02309.html

 2) At this point, we can easily fix up the Credit2 runqueue creation phase,
    as we now have all the necessary topology information available when we
    need it. We therefore create one runqueue per socket and, this time, it
    works. :-)

    While there, we make it possible for the user to decide how pCPUs should
    be aggregated in runqueues, via a Xen boot time parameter ('socket' and
    'core' are what is supported for now).

 3) Finally, as the per-core runqueue approach looks promising, we set
    per-core runqueue to be the default.


The series is structured as follows:
 * patches 1 and 2 are: one Credit2 debug dump output fix (patch 1), and some
   Credit2 boot parameter variables scope and placement improvements (patch 2);
 * patch 3 fixes a locking inconsistency in the scheduling interface;
 * patches 4, 5 and 6 arrange for the introduction of the new initialization
   hook (init_pdata). When implementing it, for the schedulers that need it,
   code is borrowed from the allocation hook (i.e., step 1 above);
 * patches 7 and 8 fix Credit2 runqueue creation, and introduce a boot time
   parameter for influencing the runqueue aggregation strategy (i.e., step
   2 above);
 * patch 9 makes per-core the default runqueue aggregation strategy. This
   is done in a separate patch to avoid mixing technical and (default) policy
   changes.
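For the record, assuming the boot time parameter added by patch 8 is called
credit2_runqueue (the authoritative name and accepted values are in the
xen-command-line.markdown hunk of that patch), selecting the aggregation
strategy at boot would look something like:

```
# Hypothetical GRUB entry; "credit2_runqueue" is an assumed parameter
# name -- check patch 8's command line documentation for the real one.
multiboot /boot/xen.gz sched=credit2 credit2_runqueue=core
```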


This series is available as a git branch, on top of staging, here:

  git://xenbits.xen.org/people/dariof/xen.git  rel/sched/credit2/fix-runqueues
  http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runqueues

There is also a branch containing this series, rebased on top of the following
(already sent) three patches:

  xen: sched: get rid of cpupool_scheduler_cpumask()
  xen: sched: adjustments to some performance counters
  xen: credit1: fix tickling when it happens from a remote pCPU

just in case those go in before this series (which I'd call likely), and one
wants to pull the series on top of that:

  git://xenbits.xen.org/people/dariof/xen.git  rel/sched/credit2/fix-runqueues_rebase
  http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/fix-runqueues_rebase

Regards,
Dario

---
Dario Faggioli (8):
      xen: sched: fix an 'off by one \t' in credit2 debug dump
      xen: sched: make locking for {insert,remove}_vcpu consistent
      xen: sched: add .init_pdata hook to the scheduler interface
      xen: sched: make implementing .alloc_pdata optional
      xen: sched: implement .init_pdata in all schedulers
      xen: sched: fix per-socket runqueue creation in credit2
      xen: sched: allow for choosing credit2 runqueues configuration at boot
      xen: sched: per-core runqueues as default in credit2

Uma Sharma (1):
      xen: sched: improve scope and placement of credit2 boot parameters

 docs/misc/xen-command-line.markdown |   11 ++
 xen/common/sched_arinc653.c         |   31 ------
 xen/common/sched_credit.c           |   47 +++++++--
 xen/common/sched_credit2.c          |  186 +++++++++++++++++++----------------
 xen/common/sched_rt.c               |   19 +++-
 xen/common/schedule.c               |   28 +++--
 xen/include/xen/sched-if.h          |    1 
 7 files changed, 182 insertions(+), 141 deletions(-)

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
@ 2015-09-29 16:55 ` Dario Faggioli
  2015-10-01  5:22   ` Juergen Gross
  2015-10-08 14:09   ` George Dunlap
  2015-09-29 16:55 ` [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters Dario Faggioli
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit2.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 75e0321..46645f8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1931,7 +1931,7 @@ csched2_dump(const struct scheduler *ops)
         struct csched2_dom *sdom;
         sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem);
 
-        printk("\tDomain: %d w %d v %d\n\t",
+        printk("\tDomain: %d w %d v %d\n",
                sdom->dom->domain_id,
                sdom->weight,
                sdom->nr_vcpus);


* [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
  2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
@ 2015-09-29 16:55 ` Dario Faggioli
  2015-10-01  5:23   ` Juergen Gross
  2015-10-01  7:51   ` Jan Beulich
  2015-09-29 16:55 ` [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent Dario Faggioli
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Uma Sharma

From: Uma Sharma <uma.sharma523@gmail.com>

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit2.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 46645f8..a28edd8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -162,7 +162,7 @@
 #define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request)
 
 
-int opt_migrate_resist=500;
+static int __read_mostly opt_migrate_resist = 500;
 integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 
 /*
@@ -185,12 +185,12 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
  *   to a load of 1.
  */
 #define LOADAVG_GRANULARITY_SHIFT (10)
-int opt_load_window_shift=18;
+static int __read_mostly opt_load_window_shift = 18;
 #define  LOADAVG_WINDOW_SHIFT_MIN 4
 integer_param("credit2_load_window_shift", opt_load_window_shift);
-int opt_underload_balance_tolerance=0;
+static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
-int opt_overload_balance_tolerance=-3;
+static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*


* [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
  2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
  2015-09-29 16:55 ` [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters Dario Faggioli
@ 2015-09-29 16:55 ` Dario Faggioli
  2015-09-29 17:31   ` Andrew Cooper
  2015-10-01  8:03   ` Jan Beulich
  2015-09-29 16:55 ` [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface Dario Faggioli
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Meng Xu

The insert_vcpu() scheduler hook is called with an
inconsistent locking strategy. In fact, it is sometimes
invoked while holding the runqueue lock and sometimes
when that is not the case.

In other words, some call sites seem to imply that
locking should be handled in the callers, in schedule.c
--e.g., in schedule_cpu_switch(), which acquires the
runqueue lock before calling the hook; others imply that
the specific schedulers should be responsible for the
locking themselves --e.g., in sched_move_domain(), which
does not acquire any lock before calling the hook.

The right thing to do seems to be to always defer locking
to the specific schedulers, as they are the ones that know
what, how and when it is best to lock (as in: runqueue
locks, vs. private scheduler locks, vs. both, etc.)

This patch, therefore:
 - removes any locking around insert_vcpu() from
   generic code (schedule.c);
 - adds the _proper_ locking in the hook implementations,
   depending on the scheduler (for instance, credit2
   does that already, credit1 and RTDS need to grab
   the runqueue lock while manipulating runqueues).

In the case of Credit1, remove_vcpu() handling needs
some fixing too, i.e.:
 - it manipulates runqueues, so the runqueue lock must
   be acquired;
 - *_lock_irq() is enough, there is no need to do
   _irqsave()

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
---
 xen/common/sched_credit.c |   24 +++++++++++++++++++-----
 xen/common/sched_rt.c     |   10 +++++++++-
 xen/common/schedule.c     |    6 ------
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a1945ac..557efaa 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -905,8 +905,19 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched_vcpu *svc = vc->sched_priv;
 
-    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
-        __runq_insert(vc->processor, svc);
+    /*
+     * For the idle domain, this is called, during boot,
+     * before alloc_pdata() has been called for the pCPU.
+     */
+    if ( !is_idle_vcpu(vc) )
+    {
+        spinlock_t *lock = vcpu_schedule_lock_irq(vc);
+
+        if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
+            __runq_insert(vc->processor, svc);
+
+        vcpu_schedule_unlock_irq(lock, vc);
+    }
 }
 
 static void
@@ -925,7 +936,7 @@ csched_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu * const svc = CSCHED_VCPU(vc);
     struct csched_dom * const sdom = svc->sdom;
-    unsigned long flags;
+    spinlock_t *lock;
 
     SCHED_STAT_CRANK(vcpu_destroy);
 
@@ -935,15 +946,18 @@ csched_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
         vcpu_unpause(svc->vcpu);
     }
 
+    lock = vcpu_schedule_lock_irq(vc);
+
     if ( __vcpu_on_runq(svc) )
         __runq_remove(svc);
 
-    spin_lock_irqsave(&(prv->lock), flags);
+    spin_lock(&(prv->lock));
 
     if ( !list_empty(&svc->active_vcpu_elem) )
         __csched_vcpu_acct_stop_locked(prv, svc);
 
-    spin_unlock_irqrestore(&(prv->lock), flags);
+    spin_unlock(&(prv->lock));
+    vcpu_schedule_unlock_irq(lock, vc);
 
     BUG_ON( sdom == NULL );
     BUG_ON( !list_empty(&svc->runq_elem) );
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 4372486..89f29b2 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -621,18 +621,26 @@ static void
 rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
 {
     struct rt_vcpu *svc = rt_vcpu(vc);
+    spinlock_t *lock;
     s_time_t now = NOW();
 
-    /* not addlocate idle vcpu to dom vcpu list */
+    /*
+     * For the idle domain, this is called, during boot,
+     * before alloc_pdata() has been called for the pCPU.
+     */
     if ( is_idle_vcpu(vc) )
         return;
 
+    lock = vcpu_schedule_lock_irq(vc);
+
     if ( now >= svc->cur_deadline )
         rt_update_deadline(now, svc);
 
     if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) && !vc->is_running )
         __runq_insert(ops, svc);
 
+    vcpu_schedule_unlock_irq(lock, vc);
+
     /* add rt_vcpu svc to scheduler-specific vcpu list of the dom */
     list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
 }
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 3eefed7..4a89222 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1488,9 +1488,7 @@ void __init scheduler_init(void)
 
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
-    unsigned long flags;
     struct vcpu *idle;
-    spinlock_t *lock;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
@@ -1509,8 +1507,6 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    lock = pcpu_schedule_lock_irqsave(cpu, &flags);
-
     SCHED_OP(old_ops, tick_suspend, cpu);
     vpriv_old = idle->sched_priv;
     idle->sched_priv = vpriv;
@@ -1520,8 +1516,6 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     SCHED_OP(new_ops, tick_resume, cpu);
     SCHED_OP(new_ops, insert_vcpu, idle);
 
-    pcpu_schedule_unlock_irqrestore(lock, flags, cpu);
-
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);


* [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (2 preceding siblings ...)
  2015-09-29 16:55 ` [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent Dario Faggioli
@ 2015-09-29 16:55 ` Dario Faggioli
  2015-10-01  5:21   ` Juergen Gross
  2015-10-01  8:17   ` Jan Beulich
  2015-09-29 16:56 ` [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:55 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Juergen Gross

with the purpose of decoupling the allocation phase and
the initialization one, for per-pCPU data of the schedulers.

This makes it possible to perform the initialization later
in the pCPU bringup/assignment process, when more information
(for instance, the host CPU topology) is available. This,
for now, is important only for Credit2, but it may well be
useful to other schedulers.

This also fixes a latent bug. In fact, when moving a pCPU
from a Credit2 cpupool to another, whose scheduler also
remaps runqueue spin locks (e.g., RTDS), we experience the
following Oops:

(XEN) Initializing RTDS scheduler
(XEN) WARNING: This is experimental software in development.
(XEN) Use at your own risk.
(XEN) Removing cpu 6 from runqueue 0
(XEN) Assertion 'sd->schedule_lock == &rqd->lock' failed at sched_credit2.c:2102
(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Not tainted ]----
... ... ...
(XEN) Xen call trace:
(XEN)    [<ffff82d0801256fa>] csched2_free_pdata+0x135/0x180
(XEN)    [<ffff82d08012ccad>] schedule_cpu_switch+0x1ee/0x217
(XEN)    [<ffff82d080101c34>] cpupool_assign_cpu_locked+0x4d/0x152
(XEN)    [<ffff82d0801024ad>] cpupool_do_sysctl+0x20b/0x69d
(XEN)    [<ffff82d08012f693>] do_sysctl+0x665/0x10f8
(XEN)    [<ffff82d08023cb86>] lstar_enter+0x106/0x160

This happens because, right now, the scheduler of the
target pool remaps the runqueue lock during (rt_)alloc_pdata,
which is called at the very beginning of schedule_cpu_switch().
Credit2's csched2_free_pdata(), however, expects to find the
spin lock set up the way csched2_alloc_pdata() left it, and
explodes when that is not the case.

This patch fixes that too, as spin lock remapping now
happens in the init_pdata hook of the target scheduler for
the pCPU, which we can easily make sure is invoked *after*
the free_pdata hook of the old scheduler (which is exactly
what is done in this patch), without needing to restructure
schedule_cpu_switch().

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Juergen Gross <jgross@suse.com>
---
 xen/common/schedule.c      |   16 +++++++++++++---
 xen/include/xen/sched-if.h |    1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 4a89222..83244d7 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1407,6 +1407,9 @@ static int cpu_schedule_callback(
 
     switch ( action )
     {
+    case CPU_STARTING:
+        SCHED_OP(&ops, init_pdata, cpu);
+        break;
     case CPU_UP_PREPARE:
         rc = cpu_schedule_up(cpu);
         break;
@@ -1484,6 +1487,7 @@ void __init scheduler_init(void)
     if ( ops.alloc_pdata &&
          !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
         BUG();
+    SCHED_OP(&ops, init_pdata, 0);
 }
 
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
@@ -1513,12 +1517,18 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     per_cpu(scheduler, cpu) = new_ops;
     ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
     per_cpu(schedule_data, cpu).sched_priv = ppriv;
-    SCHED_OP(new_ops, tick_resume, cpu);
-    SCHED_OP(new_ops, insert_vcpu, idle);
-
     SCHED_OP(old_ops, free_vdata, vpriv_old);
     SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
 
+    /*
+     * Now that the new pcpu data have been allocated and all pointers
+     * correctly redirected to them, we can perform all the necessary
+     * initializations.
+     */
+    SCHED_OP(new_ops, init_pdata, cpu);
+    SCHED_OP(new_ops, tick_resume, cpu);
+    SCHED_OP(new_ops, insert_vcpu, idle);
+
     return 0;
 }
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index dbe7cab..14392f3 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -133,6 +133,7 @@ struct scheduler {
                                     void *);
     void         (*free_pdata)     (const struct scheduler *, void *, int);
     void *       (*alloc_pdata)    (const struct scheduler *, int);
+    void         (*init_pdata)     (const struct scheduler *, int);
     void         (*free_domdata)   (const struct scheduler *, void *);
     void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);


* [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (3 preceding siblings ...)
  2015-09-29 16:55 ` [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface Dario Faggioli
@ 2015-09-29 16:56 ` Dario Faggioli
  2015-10-01  5:28   ` Juergen Gross
  2015-10-01  7:49   ` Jan Beulich
  2015-09-29 16:56 ` [PATCH 6/9] xen: sched: implement .init_pdata in all schedulers Dario Faggioli
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Josh Whitehead, Robert VanVossen

The .alloc_pdata hook of the scheduler interface is
called, in schedule_cpu_switch(), unconditionally,
for all schedulers.

This forces even schedulers that do not require any
actual allocation of per-pCPU data to:
 - contain an implementation of the hook;
 - return some artificial non-NULL value to make
   such a caller happy.

This change allows schedulers that do not need per-pCPU
allocations to avoid implementing the hook. It also
kills the artificial implementation in the ARINC653
scheduler (and, while there, it nukes .free_pdata from
there too, which is equally useless).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Robert VanVossen <robert.vanvossen@dornerworks.com>
Cc: Josh Whitehead <josh.whitehead@dornerworks.com>
---
 xen/common/sched_arinc653.c |   31 -------------------------------
 xen/common/schedule.c       |    6 +++---
 2 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index cff5da9..ef192c3 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -455,34 +455,6 @@ a653sched_free_vdata(const struct scheduler *ops, void *priv)
 }
 
 /**
- * This function allocates scheduler-specific data for a physical CPU
- *
- * We do not actually make use of any per-CPU data but the hypervisor expects
- * a non-NULL return value
- *
- * @param ops       Pointer to this instance of the scheduler structure
- *
- * @return          Pointer to the allocated data
- */
-static void *
-a653sched_alloc_pdata(const struct scheduler *ops, int cpu)
-{
-    /* return a non-NULL value to keep schedule.c happy */
-    return SCHED_PRIV(ops);
-}
-
-/**
- * This function frees scheduler-specific data for a physical CPU
- *
- * @param ops       Pointer to this instance of the scheduler structure
- */
-static void
-a653sched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
-{
-    /* nop */
-}
-
-/**
  * This function allocates scheduler-specific data for a domain
  *
  * We do not actually make use of any per-domain data but the hypervisor
@@ -736,9 +708,6 @@ const struct scheduler sched_arinc653_def = {
     .free_vdata     = a653sched_free_vdata,
     .alloc_vdata    = a653sched_alloc_vdata,
 
-    .free_pdata     = a653sched_free_pdata,
-    .alloc_pdata    = a653sched_alloc_pdata,
-
     .free_domdata   = a653sched_free_domdata,
     .alloc_domdata  = a653sched_alloc_domdata,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 83244d7..0e02af2 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1493,7 +1493,7 @@ void __init scheduler_init(void)
 int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
+    void *ppriv = NULL, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = per_cpu(scheduler, cpu);
     struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
 
@@ -1501,8 +1501,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return 0;
 
     idle = idle_vcpu[cpu];
-    ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
-    if ( ppriv == NULL )
+    if ( (new_ops->alloc_pdata != NULL) &&
+         ((ppriv = new_ops->alloc_pdata(new_ops, cpu)) == NULL) )
         return -ENOMEM;
     vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
     if ( vpriv == NULL )


* [PATCH 6/9] xen: sched: implement .init_pdata in all schedulers
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (4 preceding siblings ...)
  2015-09-29 16:56 ` [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
@ 2015-09-29 16:56 ` Dario Faggioli
  2015-09-29 16:56 ` [PATCH 7/9] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Meng Xu

by borrowing some of the code of .alloc_pdata, i.e.,
the bits that perform initializations, leaving only
the actual allocations in there, when there are any,
which is the case for Credit1 and RTDS.

On the other hand, in Credit2, since we don't really
need any per-pCPU data allocation, everything that was
being done in .alloc_pdata is now done in .init_pdata.
And since .alloc_pdata can now be left undefined, we
can just get rid of it.

Still for Credit2, the fact that .init_pdata is called
during CPU_STARTING (rather than CPU_UP_PREPARE) kills
the need for the scheduler to set up a similar callback
itself, which makes things a lot simpler.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
---
 xen/common/sched_credit.c  |   23 ++++++++++++----
 xen/common/sched_credit2.c |   63 ++------------------------------------------
 xen/common/sched_rt.c      |    9 +++++-
 3 files changed, 27 insertions(+), 68 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 557efaa..80c0b1e 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -514,8 +514,6 @@ static void *
 csched_alloc_pdata(const struct scheduler *ops, int cpu)
 {
     struct csched_pcpu *spc;
-    struct csched_private *prv = CSCHED_PRIV(ops);
-    unsigned long flags;
 
     /* Allocate per-PCPU info */
     spc = xzalloc(struct csched_pcpu);
@@ -528,6 +526,22 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
         return NULL;
     }
 
+    if ( per_cpu(schedule_data, cpu).sched_priv == NULL )
+        per_cpu(schedule_data, cpu).sched_priv = spc;
+
+    return spc;
+}
+
+static void
+csched_init_pdata(const struct scheduler *ops, int cpu)
+{
+    struct csched_private *prv = CSCHED_PRIV(ops);
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
+    unsigned long flags;
+
+    /* cpu data needs to be allocated, but STILL uninitialized */
+    ASSERT(spc && spc->runq.next == spc->runq.prev && spc->runq.next == NULL);
+
     spin_lock_irqsave(&prv->lock, flags);
 
     /* Initialize/update system-wide config */
@@ -548,16 +562,12 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
     INIT_LIST_HEAD(&spc->runq);
     spc->runq_sort_last = prv->runq_sort;
     spc->idle_bias = nr_cpu_ids - 1;
-    if ( per_cpu(schedule_data, cpu).sched_priv == NULL )
-        per_cpu(schedule_data, cpu).sched_priv = spc;
 
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
 
     spin_unlock_irqrestore(&prv->lock, flags);
-
-    return spc;
 }
 
 #ifndef NDEBUG
@@ -2021,6 +2031,7 @@ const struct scheduler sched_credit_def = {
     .alloc_vdata    = csched_alloc_vdata,
     .free_vdata     = csched_free_vdata,
     .alloc_pdata    = csched_alloc_pdata,
+    .init_pdata     = csched_init_pdata,
     .free_pdata     = csched_free_pdata,
     .alloc_domdata  = csched_alloc_domdata,
     .free_domdata   = csched_free_domdata,
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a28edd8..1d27090 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1985,7 +1985,8 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
-static void init_pcpu(const struct scheduler *ops, int cpu)
+static void
+csched2_init_pdata(const struct scheduler *ops, int cpu)
 {
     unsigned rqi;
     unsigned long flags;
@@ -2050,20 +2051,6 @@ static void init_pcpu(const struct scheduler *ops, int cpu)
     return;
 }
 
-static void *
-csched2_alloc_pdata(const struct scheduler *ops, int cpu)
-{
-    /* Check to see if the cpu is online yet */
-    /* Note: cpu 0 doesn't get a STARTING callback */
-    if ( cpu == 0 || cpu_to_socket(cpu) != XEN_INVALID_SOCKET_ID )
-        init_pcpu(ops, cpu);
-    else
-        printk("%s: cpu %d not online yet, deferring initializatgion\n",
-               __func__, cpu);
-
-    return (void *)1;
-}
-
 static void
 csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
@@ -2113,49 +2100,6 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 }
 
 static int
-csched2_cpu_starting(int cpu)
-{
-    struct scheduler *ops;
-
-    /* Hope this is safe from cpupools switching things around. :-) */
-    ops = per_cpu(scheduler, cpu);
-
-    if ( ops->alloc_pdata == csched2_alloc_pdata )
-        init_pcpu(ops, cpu);
-
-    return NOTIFY_DONE;
-}
-
-static int cpu_credit2_callback(
-    struct notifier_block *nfb, unsigned long action, void *hcpu)
-{
-    unsigned int cpu = (unsigned long)hcpu;
-    int rc = 0;
-
-    switch ( action )
-    {
-    case CPU_STARTING:
-        csched2_cpu_starting(cpu);
-        break;
-    default:
-        break;
-    }
-
-    return !rc ? NOTIFY_DONE : notifier_from_errno(rc);
-}
-
-static struct notifier_block cpu_credit2_nfb = {
-    .notifier_call = cpu_credit2_callback
-};
-
-static int
-csched2_global_init(void)
-{
-    register_cpu_notifier(&cpu_credit2_nfb);
-    return 0;
-}
-
-static int
 csched2_init(struct scheduler *ops)
 {
     int i;
@@ -2235,12 +2179,11 @@ const struct scheduler sched_credit2_def = {
 
     .dump_cpu_state = csched2_dump_pcpu,
     .dump_settings  = csched2_dump,
-    .global_init    = csched2_global_init,
     .init           = csched2_init,
     .deinit         = csched2_deinit,
     .alloc_vdata    = csched2_alloc_vdata,
     .free_vdata     = csched2_free_vdata,
-    .alloc_pdata    = csched2_alloc_pdata,
+    .init_pdata     = csched2_init_pdata,
     .free_pdata     = csched2_free_pdata,
     .alloc_domdata  = csched2_alloc_domdata,
     .free_domdata   = csched2_free_domdata,
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 89f29b2..b33c4cd 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -480,8 +480,8 @@ rt_deinit(const struct scheduler *ops)
  * Point per_cpu spinlock to the global system lock;
  * All cpu have same global system lock
  */
-static void *
-rt_alloc_pdata(const struct scheduler *ops, int cpu)
+static void
+rt_init_pdata(const struct scheduler *ops, int cpu)
 {
     struct rt_private *prv = rt_priv(ops);
     unsigned long flags;
@@ -489,7 +489,11 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
     spin_lock_irqsave(&prv->lock, flags);
     per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
     spin_unlock_irqrestore(&prv->lock, flags);
+}
 
+static void *
+rt_alloc_pdata(const struct scheduler *ops, int cpu)
+{
     if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
         return NULL;
 
@@ -1189,6 +1193,7 @@ const struct scheduler sched_rtds_def = {
     .deinit         = rt_deinit,
     .alloc_pdata    = rt_alloc_pdata,
     .free_pdata     = rt_free_pdata,
+    .init_pdata     = rt_init_pdata,
     .alloc_domdata  = rt_alloc_domdata,
     .free_domdata   = rt_free_domdata,
     .init_domain    = rt_dom_init,

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 7/9] xen: sched: fix per-socket runqueue creation in credit2
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (5 preceding siblings ...)
  2015-09-29 16:56 ` [PATCH 6/9] xen: sched: implement .init_pdata in all schedulers Dario Faggioli
@ 2015-09-29 16:56 ` Dario Faggioli
  2015-09-29 16:56 ` [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
  2015-09-29 16:56 ` [PATCH 9/9] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
  8 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Justin Weaver, Uma Sharma

The credit2 scheduler tries to set up runqueues in such
a way that there is one of them per socket. However,
that does not work. The issue is described in bug #36
"credit2 only uses one runqueue instead of one runq per
socket" (http://bugs.xenproject.org/xen/bug/36), and a
solution was attempted in an old patch series:

 http://lists.xen.org/archives/html/xen-devel/2014-08/msg02168.html

Here, we take advantage of the fact that initialization
now happens (for all schedulers) during CPU_STARTING, so
we have all the topology information available when necessary.

This is true for all the pCPUs _except_ the boot CPU. That
is not an issue, though. In fact, no runqueue exists yet
when the boot CPU is initialized, so we can just create
one and put the boot CPU in there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Justin Weaver <jtweaver@hawaii.edu>
Cc: Uma Sharma <uma.sharma523@gmail.com>
---
 xen/common/sched_credit2.c |   58 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 1d27090..38f382e 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1985,6 +1985,48 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static unsigned int
+cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
+{
+    struct csched2_runqueue_data *rqd;
+    unsigned int rqi;
+
+    for ( rqi = 0; rqi < nr_cpu_ids; rqi++ )
+    {
+        unsigned int peer_cpu;
+
+        /*
+         * As soon as we come across an uninitialized runqueue, use it.
+         * In fact, either:
+         *  - we are initializing the first cpu, and we assign it to
+         *    runqueue 0. This is handy, especially if we are dealing
+         *    with the boot cpu (if credit2 is the default scheduler),
+         *    as we would not be able to use cpu_to_socket() and similar
+         *    helpers anyway (the result of which is not reliable yet);
+         *  - we have gone through all the active runqueues, and have not
+         *    found any whose cpus' topology matches the one we are
+         *    dealing with, so activating a new runqueue is what we want.
+         */
+        if ( prv->rqd[rqi].id == -1 )
+            break;
+
+        rqd = prv->rqd + rqi;
+        BUG_ON(cpumask_empty(&rqd->active));
+
+        peer_cpu = cpumask_first(&rqd->active);
+        BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
+               cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
+
+        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+            break;
+    }
+
+    /* We really expect to be able to assign each cpu to a runqueue. */
+    BUG_ON(rqi >= nr_cpu_ids);
+
+    return rqi;
+}
+
 static void
 csched2_init_pdata(const struct scheduler *ops, int cpu)
 {
@@ -2004,21 +2046,7 @@ csched2_init_pdata(const struct scheduler *ops, int cpu)
     }
 
     /* Figure out which runqueue to put it in */
-    rqi = 0;
-
-    /* Figure out which runqueue to put it in */
-    /* NB: cpu 0 doesn't get a STARTING callback, so we hard-code it to runqueue 0. */
-    if ( cpu == 0 )
-        rqi = 0;
-    else
-        rqi = cpu_to_socket(cpu);
-
-    if ( rqi == XEN_INVALID_SOCKET_ID )
-    {
-        printk("%s: cpu_to_socket(%d) returned %d!\n",
-               __func__, cpu, rqi);
-        BUG();
-    }
+    rqi = cpu_to_runqueue(prv, cpu);
 
     rqd=prv->rqd + rqi;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (6 preceding siblings ...)
  2015-09-29 16:56 ` [PATCH 7/9] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
@ 2015-09-29 16:56 ` Dario Faggioli
  2015-10-01  5:48   ` Juergen Gross
  2015-09-29 16:56 ` [PATCH 9/9] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
  8 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Uma Sharma

In fact, credit2 uses CPU topology to decide how to arrange
its internal runqueues. Before this change, only 'one runqueue
per socket' was allowed. However, experiments have shown that,
for instance, having one runqueue per physical core improves
performance, especially when hyperthreading is available.

In general, it makes sense to allow users to pick one runqueue
arrangement at boot time, so that:
 - more experiments can be easily performed to even better
   assess and improve performance;
 - one can select the best configuration for their specific
   use case and/or hardware.

This patch enables the above.

Note that, for correctly arranging runqueues to be per-core,
just checking cpu_to_core() on the host CPUs is not enough.
In fact, cores (and hyperthreads) on different sockets can
have the same core (and thread) IDs! We, therefore, need to
check whether the full topology of two CPUs matches, for
them to be put in the same runqueue.

Note also that the default for credit2 (although not
functional) has, until now, been per-socket runqueues.
This patch leaves things that way, to avoid mixing
policy and technical changes.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Uma Sharma <uma.sharma523@gmail.com>
---
 docs/misc/xen-command-line.markdown |   11 +++++++
 xen/common/sched_credit2.c          |   57 ++++++++++++++++++++++++++++++++---
 2 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a2e427c..71315b8 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -467,6 +467,17 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+### credit2\_runqueue
+> `= socket | core`
+
+> Default: `socket`
+
+Specify how host CPUs are arranged in runqueues. Runqueues are kept
+balanced with respect to the load generated by the vCPUs running on
+them. Smaller runqueues (as with `core`) mean more accurate load
+balancing (for instance, they will deal better with hyperthreading),
+but also more overhead.
+
 ### dbgp
 > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 38f382e..025626f 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -82,10 +82,6 @@
  * Credits are "reset" when the next vcpu in the runqueue is less than
  * or equal to zero.  At that point, everyone's credits are "clipped"
  * to a small value, and a fixed credit is added to everyone.
- *
- * The plan is for all cores that share an L2 will share the same
- * runqueue.  At the moment, there is one global runqueue for all
- * cores.
  */
 
 /*
@@ -194,6 +190,41 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Runqueue organization.
+ *
+ * The various cpus are each to be assigned to a runqueue, and we
+ * want that to happen based on topology. At the moment, it is possible
+ * to choose to arrange runqueues to be:
+ *
+ * - per-core: meaning that there will be one runqueue per physical
+ *             core of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'core';
+ *
+ * - per-socket: meaning that there will be one runqueue per physical
+ *               socket (AKA package, which often, but not always, also
+ *               matches a NUMA node) of the host. This will happen if
+ *               the opt_runqueue parameter is set to 'socket';
+ *
+ * Depending on the value of opt_runqueue, therefore, cpus that are part of
+ * either the same physical core, or of the same physical socket, will be
+ * put together to form runqueues.
+ */
+#define OPT_RUNQUEUE_CORE   1
+#define OPT_RUNQUEUE_SOCKET 2
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+
+static void parse_credit2_runqueue(const char *s)
+{
+    if ( !strncmp(s, "core", 4) && !s[4] )
+        opt_runqueue = OPT_RUNQUEUE_CORE;
+    else if ( !strncmp(s, "socket", 6) && !s[6] )
+        opt_runqueue = OPT_RUNQUEUE_SOCKET;
+    else
+        printk("WARNING: unrecognized value of credit2_runqueue option!\n");
+}
+custom_param("credit2_runqueue", parse_credit2_runqueue);
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -1985,6 +2016,17 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
+static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
+}
+
+static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
+{
+    return same_socket(cpua, cpub) &&
+           cpu_to_core(cpua) == cpu_to_core(cpub);
+}
+
 static unsigned int
 cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -2017,7 +2059,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
 
-        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+        if ( (opt_runqueue == OPT_RUNQUEUE_CORE &&
+              same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_SOCKET &&
+              same_socket(peer_cpu, cpu)) )
             break;
     }
 
@@ -2140,6 +2185,8 @@ csched2_init(struct scheduler *ops)
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
+    printk(" runqueues arrangement: per-%s\n",
+           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");
 
     if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
     {

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 9/9] xen: sched: per-core runqueues as default in credit2
  2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
                   ` (7 preceding siblings ...)
  2015-09-29 16:56 ` [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
@ 2015-09-29 16:56 ` Dario Faggioli
  2015-10-01  5:48   ` Juergen Gross
  8 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Uma Sharma

Experiments have shown that arranging the scheduling
runqueues on a per-core basis yields better results,
in most cases.

Such evaluation has been done, for the first time,
by Uma Sharma, during her participation in OPW. Some
of the results she got are summarized here:

 http://lists.xen.org/archives/html/xen-devel/2015-03/msg01499.html

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Uma Sharma <uma.sharma523@gmail.com>
---
 docs/misc/xen-command-line.markdown |    2 +-
 xen/common/sched_credit2.c          |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 71315b8..8fc9cd7 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -470,7 +470,7 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_runqueue
 > `= socket | core`
 
-> Default: `socket`
+> Default: `core`
 
 Specify how host CPUs are arranged in runqueues. Runqueues are kept
 balanced with respect to the load generated by the vCPUs running on
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 025626f..0e9e722 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -211,7 +211,7 @@ integer_param("credit2_balance_over", opt_overload_balance_tolerance);
  */
 #define OPT_RUNQUEUE_CORE   1
 #define OPT_RUNQUEUE_SOCKET 2
-static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_CORE;
 
 static void parse_credit2_runqueue(const char *s)
 {
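
For anyone wanting the previous arrangement back after this change, the
`credit2_runqueue` boot option introduced in patch 8/9 is enough. An
illustrative GRUB2 entry (paths, kernel version and the other options
are placeholders, not taken from any real setup):

```
menuentry 'Xen (per-socket credit2 runqueues)' {
    multiboot2 /boot/xen.gz sched=credit2 credit2_runqueue=socket
    module2 /boot/vmlinuz-4.2 root=/dev/sda1 ro
}
```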

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 16:55 ` [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent Dario Faggioli
@ 2015-09-29 17:31   ` Andrew Cooper
  2015-09-29 21:40     ` Dario Faggioli
  2015-10-08 14:58     ` George Dunlap
  2015-10-01  8:03   ` Jan Beulich
  1 sibling, 2 replies; 48+ messages in thread
From: Andrew Cooper @ 2015-09-29 17:31 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 29/09/15 17:55, Dario Faggioli wrote:
> The insert_vcpu() scheduler hook is called with an
> inconsistent locking strategy. In fact, it is sometimes
> invoked while holding the runqueue lock and sometimes
> when that is not the case.
>
> In other words, some call sites seems to imply that
> locking should be handled in the callers, in schedule.c
> --e.g., in schedule_cpu_switch(), which acquires the
> runqueue lock before calling the hook; others that
> specific schedulers should be responsible for locking
> themselves --e.g., in sched_move_domain(), which does
> not acquire any lock for calling the hook.
>
> The right thing to do seems to always defer locking to
> the specific schedulers, as it's them that know what, how
> and when it is best to lock (as in: runqueue locks, vs.
> private scheduler locks, vs. both, etc.)
>
> This patch, therefore:
>  - removes any locking around insert_vcpu() from
>    generic code (schedule.c);
>  - add the _proper_ locking in the hook implementations,
>    depending on the scheduler (for instance, credit2
>    does that already, credit1 and RTDS need to grab
>    the runqueue lock while manipulating runqueues).
>
> In case of credit1, remove_vcpu() handling needs some
> fixing remove_vcpu() too, i.e.:
>  - it manipulates runqueues, so the runqueue lock must
>    be acquired;
>  - *_lock_irq() is enough, there is no need to do
>    _irqsave()

Nothing in any of generic scheduling code should need interrupts
disabled at all.

One of the problem-areas identified by Jenny during the ticketlock
performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
time with interrupts disabled.  (The other large one being the time
calibration rendezvous, but that is a wildly different can of worms to fix.)

Is the use of _lock_irq() here to cover another issue expecting
interrupts to be disabled, or could it be replaced with a plain spin_lock()?

Also, a style nit...

>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> ---
>  xen/common/sched_credit.c |   24 +++++++++++++++++++-----
>  xen/common/sched_rt.c     |   10 +++++++++-
>  xen/common/schedule.c     |    6 ------
>  3 files changed, 28 insertions(+), 12 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index a1945ac..557efaa 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -905,8 +905,19 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct csched_vcpu *svc = vc->sched_priv;
>  
> -    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> -        __runq_insert(vc->processor, svc);
> +    /*
> +     * For the idle domain, this is called, during boot, before
> +     * than alloc_pdata() has been called for the pCPU.
> +     */
> +    if ( !is_idle_vcpu(vc) )
> +    {
> +        spinlock_t *lock = vcpu_schedule_lock_irq(vc);
> +
> +        if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> +            __runq_insert(vc->processor, svc);
> +
> +        vcpu_schedule_unlock_irq(lock, vc);
> +    }
>  }
>  
>  static void
> @@ -925,7 +936,7 @@ csched_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
>      struct csched_private *prv = CSCHED_PRIV(ops);
>      struct csched_vcpu * const svc = CSCHED_VCPU(vc);
>      struct csched_dom * const sdom = svc->sdom;
> -    unsigned long flags;
> +    spinlock_t *lock;
>  
>      SCHED_STAT_CRANK(vcpu_destroy);
>  
> @@ -935,15 +946,18 @@ csched_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
>          vcpu_unpause(svc->vcpu);
>      }
>  
> +    lock = vcpu_schedule_lock_irq(vc);
> +
>      if ( __vcpu_on_runq(svc) )
>          __runq_remove(svc);
>  
> -    spin_lock_irqsave(&(prv->lock), flags);
> +    spin_lock(&(prv->lock));

Please drop the superfluous brackets as you are already changing the line.

~Andrew

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 17:31   ` Andrew Cooper
@ 2015-09-29 21:40     ` Dario Faggioli
  2015-09-29 21:56       ` Dario Faggioli
  2015-09-30  9:00       ` Andrew Cooper
  2015-10-08 14:58     ` George Dunlap
  1 sibling, 2 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 21:40 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap, Meng Xu



On Tue, 2015-09-29 at 18:31 +0100, Andrew Cooper wrote:
> On 29/09/15 17:55, Dario Faggioli wrote:
> > The insert_vcpu() scheduler hook is called with an
> > inconsistent locking strategy. In fact, it is sometimes
> > invoked while holding the runqueue lock and sometimes
> > when that is not the case.
> > 
> > In other words, some call sites seems to imply that
> > locking should be handled in the callers, in schedule.c
> > --e.g., in schedule_cpu_switch(), which acquires the
> > runqueue lock before calling the hook; others that
> > specific schedulers should be responsible for locking
> > themselves --e.g., in sched_move_domain(), which does
> > not acquire any lock for calling the hook.
> > 
> > The right thing to do seems to always defer locking to
> > the specific schedulers, as it's them that know what, how
> > and when it is best to lock (as in: runqueue locks, vs.
> > private scheduler locks, vs. both, etc.)
> > 
> > This patch, therefore:
> >  - removes any locking around insert_vcpu() from
> >    generic code (schedule.c);
> >  - add the _proper_ locking in the hook implementations,
> >    depending on the scheduler (for instance, credit2
> >    does that already, credit1 and RTDS need to grab
> >    the runqueue lock while manipulating runqueues).
> > 
> > In case of credit1, remove_vcpu() handling needs some
> > fixing remove_vcpu() too, i.e.:
> >  - it manipulates runqueues, so the runqueue lock must
> >    be acquired;
> >  - *_lock_irq() is enough, there is no need to do
> >    _irqsave()
> 
> Nothing in any of generic scheduling code should need interrupts
> disabled at all.
> 
That's a really, really, really interesting point.

I think I see what you mean. However, currently, pretty much **all**
scheduling related locks are acquired via _irq or _irqsave primitives,
and that is true for schedule.c and for all the sched_*.c files.

> One of the problem-areas identified by Jenny during the ticketlock
> performance work was that the SCHEDULE_SOFTIRQ was a large consumer
> of
> time with interrupts disabled.
>
Right, and I am very much up for investigating whether this can
improve. However, this seems to me the topic for a different series.

> Is the use of _lock_irq() here to cover another issue expecting
> interrupts to be disabled, or could it be replaced with a plain
> spin_lock()?
> 
As said, it is probably the case that spin_lock() would be ok, here as
well as elsewhere. This is being done like this in this patch for
consistency, as that is what happens **everywhere** else in scheduling
code. In fact, I haven't tried, but it may well be the case that,
converting only one (or a subset) of locks to non _irq* variants, we'd
make check_lock() complain.

So, can we just allow this patch to follow suit, and then overhaul and
change/fix (if it proves feasible) all locking at once, in a dedicated
series? This seems the best approach to me...
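
To make that invariant concrete, here is a toy model of the check. This
is simplified and hypothetical code (toy_lock, toy_check_lock and
friends are made-up names); the real check_lock() in
xen/common/spinlock.c BUG()s instead of returning a value. The idea is:
the first acquisition of a lock records the IRQ state, and every later
acquisition must match it, because a lock ever taken from IRQ context
must always be taken with IRQs disabled, or it can deadlock against
itself.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock {
    int irq_safe; /* -1 = unknown, 1 = taken with IRQs off, 0 = IRQs on */
};

static bool irqs_enabled = true; /* models local_irq_is_enabled() */

/* Returns true iff this acquisition matches the lock's recorded state. */
static bool toy_check_lock(struct toy_lock *l)
{
    int this_time = irqs_enabled ? 0 : 1;

    if (l->irq_safe == -1)
        l->irq_safe = this_time; /* first use: just record the state */

    return l->irq_safe == this_time;
}

static bool toy_spin_lock(struct toy_lock *l)
{
    return toy_check_lock(l);
}

static bool toy_spin_lock_irq(struct toy_lock *l)
{
    irqs_enabled = false; /* models local_irq_disable() */
    return toy_check_lock(l);
}

static void toy_spin_unlock_irq(struct toy_lock *l)
{
    (void)l;
    irqs_enabled = true; /* models local_irq_enable() */
}
```

Converting a single call site to plain spin_lock() while every other
path still takes the same lock with an _irq() variant is exactly the
mixed pattern that trips the check, which is why the conversion has to
be done per-lock, wholesale.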

> Also, a style nit...
>

> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > index a1945ac..557efaa 100644

> > @@ -935,15 +946,18 @@ csched_vcpu_remove(const struct scheduler
> > *ops, struct vcpu *vc)
> >          vcpu_unpause(svc->vcpu);
> >      }
> >  
> > +    lock = vcpu_schedule_lock_irq(vc);
> > +
> >      if ( __vcpu_on_runq(svc) )
> >          __runq_remove(svc);
> >  
> > -    spin_lock_irqsave(&(prv->lock), flags);
> > +    spin_lock(&(prv->lock));
> 
> Please drop the superfluous brackets as you are already changing the
> line.
> 
This, I can do for sure. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 21:40     ` Dario Faggioli
@ 2015-09-29 21:56       ` Dario Faggioli
  2015-09-30  9:00       ` Andrew Cooper
  1 sibling, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-09-29 21:56 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap, Meng Xu



On Tue, 2015-09-29 at 23:40 +0200, Dario Faggioli wrote:
> On Tue, 2015-09-29 at 18:31 +0100, Andrew Cooper wrote:
> > On 29/09/15 17:55, Dario Faggioli wrote:

> > Is the use of _lock_irq() here to cover another issue expecting
> > interrupts to be disabled, or could it be replaced with a plain
> > spin_lock()?
> > 
> As said, it is probably the case that spin_lock() would be ok, here
> as
> well as elsewhere. This is being done like this in this patch for
> consistency, as that is what happens **everywhere** else in
> scheduling
> code. In fact, I haven't tried, but it may well be the case that,
> converting only one (or a subset) of locks to non _irq* variants,
> we'd
> make check_lock() complain.
> 
And now I have just checked, and that indeed looks to be the case.

I.e., I changed the spin_lock_irq() being touched in this patch to
spin_lock(), and here's what I see (very early, during Xen's boot):

(XEN) Xen BUG at spinlock.c:48
(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d08012df6e>] check_lock+0x3d/0x41
(XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff82d08033cd68   rcx: 0000000000000001
(XEN) rdx: 0000000000000000   rsi: 0000000000000001   rdi: ffff82d08033cd70
(XEN) rbp: ffff8300dbaf7e40   rsp: ffff8300dbaf7e40   r8:  0000000000000001
(XEN) r9:  0000000000000001   r10: 0000000000000001   r11: 0000000000000004
(XEN) r12: ffff82d0803272a0   r13: ffff83031fa29000   r14: ffff82d08033cd60
(XEN) r15: ffff82d08033cd68   cr0: 000000008005003b   cr4: 00000000000026e0
(XEN) cr3: 00000000dbaa3000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300dbaf7e40:
(XEN)    ffff8300dbaf7e58 ffff82d08012df9d 0000000000000282 ffff8300dbaf7e70
(XEN)    ffff82d08012e000 ffff8300db8f4000 ffff8300dbaf7ec0 ffff82d08012b6e3
(XEN)    ffff82d080189be6 ffff8300dbaf7eb8 ffff82d08018a008 ffff8300db8f4000
(XEN)    ffff8300dbaf7f18 ffff83031fa29000 0000000000000001 ffff82d080291d20
(XEN)    ffff8300dbaf7ee0 ffff82d08010642e 00000000000002f8 ffff8300dbaf0000
(XEN)    ffff8300dbaf7ef0 ffff82d080106579 ffff8300dbaf7f10 ffff82d0801895b2
(XEN)    ffff8300dbaf0000 ffff8300dbaf7f18 ffff8300dbaf7fb8 0080026200000002
(XEN)    0000000000000000 00000000000dbaf6 ffff8300dbaf6000 ffff8300dbaf0000
(XEN)    ffff8300dbaf0000 ffff8300dbaf7f18 ffff83031fa29000 0000000000000001
(XEN)    ffff82d080291d20 ffff8300dbaf7f78 ffff82d080184130 ffff8300dbaf7f88
(XEN)    ffff82d08018546a ffff8300dbaf7f98 ffff82d080185491 ffff8300dbaf7fb8
(XEN)    ffff82d0802c763f ffff82d0802e66c0 ffff83031fabb610 ffff82d0802f7f08
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff8300dbb3f000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d08012df6e>] check_lock+0x3d/0x41
(XEN)    [<ffff82d08012df9d>] _spin_lock+0x11/0x52
(XEN)    [<ffff82d08012e000>] _spin_lock_irqsave+0xd/0x13
(XEN)    [<ffff82d08012b6e3>] vcpu_wake+0x36/0x3ce
(XEN)    [<ffff82d08010642e>] domain_unpause+0x38/0x48
(XEN)    [<ffff82d080106579>] domain_unpause_by_systemcontroller+0x2c/0x3b
(XEN)    [<ffff82d0801895b2>] init_done+0x1d/0x144
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at spinlock.c:48
(XEN) ****************************************

So, really, (trying to) do the spinlock conversion in a dedicated
series is the way to go, IMO.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 21:40     ` Dario Faggioli
  2015-09-29 21:56       ` Dario Faggioli
@ 2015-09-30  9:00       ` Andrew Cooper
  1 sibling, 0 replies; 48+ messages in thread
From: Andrew Cooper @ 2015-09-30  9:00 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 29/09/15 22:40, Dario Faggioli wrote:
> On Tue, 2015-09-29 at 18:31 +0100, Andrew Cooper wrote:
>> On 29/09/15 17:55, Dario Faggioli wrote:
>>> The insert_vcpu() scheduler hook is called with an
>>> inconsistent locking strategy. In fact, it is sometimes
>>> invoked while holding the runqueue lock and sometimes
>>> when that is not the case.
>>>
>>> In other words, some call sites seems to imply that
>>> locking should be handled in the callers, in schedule.c
>>> --e.g., in schedule_cpu_switch(), which acquires the
>>> runqueue lock before calling the hook; others that
>>> specific schedulers should be responsible for locking
>>> themselves --e.g., in sched_move_domain(), which does
>>> not acquire any lock for calling the hook.
>>>
>>> The right thing to do seems to always defer locking to
>>> the specific schedulers, as it's them that know what, how
>>> and when it is best to lock (as in: runqueue locks, vs.
>>> private scheduler locks, vs. both, etc.)
>>>
>>> This patch, therefore:
>>>  - removes any locking around insert_vcpu() from
>>>    generic code (schedule.c);
>>>  - add the _proper_ locking in the hook implementations,
>>>    depending on the scheduler (for instance, credit2
>>>    does that already, credit1 and RTDS need to grab
>>>    the runqueue lock while manipulating runqueues).
>>>
>>> In case of credit1, remove_vcpu() handling needs some
>>> fixing remove_vcpu() too, i.e.:
>>>  - it manipulates runqueues, so the runqueue lock must
>>>    be acquired;
>>>  - *_lock_irq() is enough, there is no need to do
>>>    _irqsave()
>> Nothing in any of generic scheduling code should need interrupts
>> disabled at all.
>>
> That's a really, really, really interesting point.
>
> I think I see what you mean. However, currently, pretty much **all**
> scheduling related locks are acquired via _irq or _irqsave primitives,
> and that is true for schedule.c and for all the sched_*.c files.
>
>> One of the problem-areas identified by Jenny during the ticketlock
>> performance work was that the SCHEDULE_SOFTIRQ was a large consumer
>> of
>> time with interrupts disabled.
>>
> Right, and I am very much up for investigating whether this can
> improve. However, this seems to me the topic for a different series.
>
>> Is the use of _lock_irq() here to cover another issue expecting
>> interrupts to be disabled, or could it be replaced with a plain
>> spin_lock()?
>>
> As said, it is probably the case that spin_lock() would be ok, here as
> well as elsewhere. This is being done like this in this patch for
> consistency, as that is what happens **everywhere** else in scheduling
> code. In fact, I haven't tried, but it may well be the case that,
> converting only one (or a subset) of locks to non _irq* variants, we'd
> make check_lock() complain.
>
> So, can we just allow this patch to follow suit, and then overhaul and
> change/fix (if it reveals feasible) all locking at once, in a dedicated
> series? This seems the best approach to me...

This seems reasonable to me.  I just wanted to check that you were not
using the _irq() variants "just because".  I did suspect that the _irq()
variants were in use because everything else uses _irq()/_irqsave().

This change doesn't make the matter worse, and fixing that can of worms
is probably going to be a whole series in itself, therefore no further
objections from me.

~Andrew

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-09-29 16:55 ` [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface Dario Faggioli
@ 2015-10-01  5:21   ` Juergen Gross
  2015-10-01  6:33     ` Dario Faggioli
  2015-10-01  8:17   ` Jan Beulich
  1 sibling, 1 reply; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:21 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap

On 09/29/2015 06:55 PM, Dario Faggioli wrote:
> with the purpose of decoupling the allocation phase and
> the initialization one, for per-pCPU data of the schedulers.
>
> This makes it possible to perform the initialization later
> in the pCPU bringup/assignment process, when more information
> (for instance, the host CPU topology) is available. This,
> for now, is important only for Credit2, but it can well be
> useful to other schedulers.
>
> This also fixes a latent bug. In fact, when moving a pCPU
> from a Credit2 cpupool to another, whose scheduler also
> remaps runqueue spin locks (e.g., RTDS), we experience the
> following Oops:
>
> (XEN) Initializing RTDS scheduler
> (XEN) WARNING: This is experimental software in development.
> (XEN) Use at your own risk.
> (XEN) Removing cpu 6 from runqueue 0
> (XEN) Assertion 'sd->schedule_lock == &rqd->lock' failed at sched_credit2.c:2102
> (XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Not tainted ]----
> ... ... ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0801256fa>] csched2_free_pdata+0x135/0x180
> (XEN)    [<ffff82d08012ccad>] schedule_cpu_switch+0x1ee/0x217
> (XEN)    [<ffff82d080101c34>] cpupool_assign_cpu_locked+0x4d/0x152
> (XEN)    [<ffff82d0801024ad>] cpupool_do_sysctl+0x20b/0x69d
> (XEN)    [<ffff82d08012f693>] do_sysctl+0x665/0x10f8
> (XEN)    [<ffff82d08023cb86>] lstar_enter+0x106/0x160
>
> This happens because, right now, the scheduler of the
> target pool remaps the runqueue lock during (rt_)alloc_pdata,
> which is called at the very beginning of schedule_cpu_switch().
> Credit2's csched2_free_pdata(), however, expects to find the spin
> lock exactly as csched2_alloc_pdata() left it, and explodes
> when that is not the case.
>
> This patch also fixes this as, now, spin lock remapping
> happens in the init_pdata hook of the target scheduler for
> the pCPU, which we can easily make sure is invoked *after*
> the free_pdata hook of the old scheduler (which is exactly
> what is done in this patch), without needing to restructure
> schedule_cpu_switch().
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
>   xen/common/schedule.c      |   16 +++++++++++++---
>   xen/include/xen/sched-if.h |    1 +
>   2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 4a89222..83244d7 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1407,6 +1407,9 @@ static int cpu_schedule_callback(
>
>       switch ( action )
>       {
> +    case CPU_STARTING:
> +        SCHED_OP(&ops, init_pdata, cpu);
> +        break;
>       case CPU_UP_PREPARE:
>           rc = cpu_schedule_up(cpu);
>           break;
> @@ -1484,6 +1487,7 @@ void __init scheduler_init(void)
>       if ( ops.alloc_pdata &&
>            !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
>           BUG();
> +    SCHED_OP(&ops, init_pdata, 0);

You can't call this without having it set in all schedulers.

I guess using the init_pdata hook has to be introduced after or in
patch 5.


Juergen

>   }
>
>   int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
> @@ -1513,12 +1517,18 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>       per_cpu(scheduler, cpu) = new_ops;
>       ppriv_old = per_cpu(schedule_data, cpu).sched_priv;
>       per_cpu(schedule_data, cpu).sched_priv = ppriv;
> -    SCHED_OP(new_ops, tick_resume, cpu);
> -    SCHED_OP(new_ops, insert_vcpu, idle);
> -
>       SCHED_OP(old_ops, free_vdata, vpriv_old);
>       SCHED_OP(old_ops, free_pdata, ppriv_old, cpu);
>
> +    /*
> +     * Now that the new pcpu data have been allocated and all pointers
> +     * correctly redirected to them, we can perform all the necessary
> +     * initializations.
> +     */
> +    SCHED_OP(new_ops, init_pdata, cpu);
> +    SCHED_OP(new_ops, tick_resume, cpu);
> +    SCHED_OP(new_ops, insert_vcpu, idle);
> +
>       return 0;
>   }
>
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index dbe7cab..14392f3 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -133,6 +133,7 @@ struct scheduler {
>                                       void *);
>       void         (*free_pdata)     (const struct scheduler *, void *, int);
>       void *       (*alloc_pdata)    (const struct scheduler *, int);
> +    void         (*init_pdata)     (const struct scheduler *, int);
>       void         (*free_domdata)   (const struct scheduler *, void *);
>       void *       (*alloc_domdata)  (const struct scheduler *, struct domain *);
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

* Re: [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump
  2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
@ 2015-10-01  5:22   ` Juergen Gross
  2015-10-08 14:09   ` George Dunlap
  1 sibling, 0 replies; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:22 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap

On 09/29/2015 06:55 PM, Dario Faggioli wrote:
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit2.c |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 75e0321..46645f8 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1931,7 +1931,7 @@ csched2_dump(const struct scheduler *ops)
>           struct csched2_dom *sdom;
>           sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem);
>
> -        printk("\tDomain: %d w %d v %d\n\t",
> +        printk("\tDomain: %d w %d v %d\n",
>                  sdom->dom->domain_id,
>                  sdom->weight,
>                  sdom->nr_vcpus);
>
>

* Re: [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters
  2015-09-29 16:55 ` [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters Dario Faggioli
@ 2015-10-01  5:23   ` Juergen Gross
  2015-10-01  7:51   ` Jan Beulich
  1 sibling, 0 replies; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:23 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 09/29/2015 06:55 PM, Dario Faggioli wrote:
> From: Uma Sharma <uma.sharma523@gmail.com>
>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit2.c |    8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 46645f8..a28edd8 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -162,7 +162,7 @@
>   #define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request)
>
>
> -int opt_migrate_resist=500;
> +static int __read_mostly opt_migrate_resist = 500;
>   integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>
>   /*
> @@ -185,12 +185,12 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>    *   to a load of 1.
>    */
>   #define LOADAVG_GRANULARITY_SHIFT (10)
> -int opt_load_window_shift=18;
> +static int __read_mostly opt_load_window_shift = 18;
>   #define  LOADAVG_WINDOW_SHIFT_MIN 4
>   integer_param("credit2_load_window_shift", opt_load_window_shift);
> -int opt_underload_balance_tolerance=0;
> +static int __read_mostly opt_underload_balance_tolerance = 0;
>   integer_param("credit2_balance_under", opt_underload_balance_tolerance);
> -int opt_overload_balance_tolerance=-3;
> +static int __read_mostly opt_overload_balance_tolerance = -3;
>   integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>
>   /*
>
>

* Re: [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional
  2015-09-29 16:56 ` [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
@ 2015-10-01  5:28   ` Juergen Gross
  2015-10-01  6:35     ` Dario Faggioli
  2015-10-01  7:49   ` Jan Beulich
  1 sibling, 1 reply; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:28 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Josh Whitehead, Robert VanVossen

On 09/29/2015 06:56 PM, Dario Faggioli wrote:
> The .alloc_pdata hook of the scheduler interface is
> called, in schedule_cpu_switch(), unconditionally,
> for all schedulers.
>
> This forces even schedulers that do not require any
> actual allocation of per-pCPU data to:
>   - contain an implementation of the hook;
>   - return some artificial non-NULL value to make
>     such a caller happy.
>
> This change allows schedulers that do not need per-pCPU
> allocations to avoid implementing the hook. It also
> kills such artificial implementation from the ARINC653
> scheduler (and, while there, it nukes .free_pdata from
> there too, which is equally useless).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Robert VanVossen <robert.vanvossen@dornerworks.com>
> Cc: Josh Whitehead <josh.whitehead@dornerworks.com>
> ---
>   xen/common/sched_arinc653.c |   31 -------------------------------
>   xen/common/schedule.c       |    6 +++---
>   2 files changed, 3 insertions(+), 34 deletions(-)
>
> diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
> index cff5da9..ef192c3 100644
> --- a/xen/common/sched_arinc653.c
> +++ b/xen/common/sched_arinc653.c
> @@ -455,34 +455,6 @@ a653sched_free_vdata(const struct scheduler *ops, void *priv)
>   }
>
>   /**
> - * This function allocates scheduler-specific data for a physical CPU
> - *
> - * We do not actually make use of any per-CPU data but the hypervisor expects
> - * a non-NULL return value
> - *
> - * @param ops       Pointer to this instance of the scheduler structure
> - *
> - * @return          Pointer to the allocated data
> - */
> -static void *
> -a653sched_alloc_pdata(const struct scheduler *ops, int cpu)
> -{
> -    /* return a non-NULL value to keep schedule.c happy */
> -    return SCHED_PRIV(ops);
> -}
> -
> -/**
> - * This function frees scheduler-specific data for a physical CPU
> - *
> - * @param ops       Pointer to this instance of the scheduler structure
> - */
> -static void
> -a653sched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> -{
> -    /* nop */
> -}
> -
> -/**
>    * This function allocates scheduler-specific data for a domain
>    *
>    * We do not actually make use of any per-domain data but the hypervisor
> @@ -736,9 +708,6 @@ const struct scheduler sched_arinc653_def = {
>       .free_vdata     = a653sched_free_vdata,
>       .alloc_vdata    = a653sched_alloc_vdata,
>
> -    .free_pdata     = a653sched_free_pdata,
> -    .alloc_pdata    = a653sched_alloc_pdata,
> -
>       .free_domdata   = a653sched_free_domdata,
>       .alloc_domdata  = a653sched_alloc_domdata,
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 83244d7..0e02af2 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1493,7 +1493,7 @@ void __init scheduler_init(void)
>   int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>   {
>       struct vcpu *idle;
> -    void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
> +    void *ppriv = NULL, *ppriv_old, *vpriv, *vpriv_old;
>       struct scheduler *old_ops = per_cpu(scheduler, cpu);
>       struct scheduler *new_ops = (c == NULL) ? &ops : c->sched;
>
> @@ -1501,8 +1501,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>           return 0;
>
>       idle = idle_vcpu[cpu];
> -    ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> -    if ( ppriv == NULL )
> +    if ( (new_ops->alloc_pdata != NULL) &&
> +         ((ppriv = new_ops->alloc_pdata(new_ops, cpu)) == NULL) )
>           return -ENOMEM;
>       vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv);
>       if ( vpriv == NULL )

Just below this there are 2 SCHED_OP calls to free_pdata. You'll have to
check whether the hook is present as you just have nuked it above.


Juergen

* Re: [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2015-09-29 16:56 ` [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
@ 2015-10-01  5:48   ` Juergen Gross
  2015-10-01  7:23     ` Dario Faggioli
  0 siblings, 1 reply; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 09/29/2015 06:56 PM, Dario Faggioli wrote:
> In fact, credit2 uses CPU topology to decide how to arrange
> its internal runqueues. Before this change, only 'one runqueue
> per socket' was allowed. However, experiments have shown that,
> for instance, having one runqueue per physical core improves
> performance, especially in case hyperthreading is available.
>
> In general, it makes sense to allow users to pick one runqueue
> arrangement at boot time, so that:
>   - more experiments can be easily performed to even better
>     assess and improve performance;
>   - one can select the best configuration for his specific
>     use case and/or hardware.
>
> This patch enables the above.
>
> Note that, for correctly arranging runqueues to be per-core,
> just checking cpu_to_core() on the host CPUs is not enough.
> In fact, cores (and hyperthreads) on different sockets, can
> have the same core (and thread) IDs! We, therefore, need to
> check whether the full topology of two CPUs matches, for
> them to be put in the same runqueue.
>
> Note also that the default (although not functional) for
> credit2, until now, has been per-socket runqueues. This patch
> leaves things that way, to avoid mixing policy and technical
> changes.
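The "full topology must match" point in the commit message above can be illustrated with a small standalone sketch. The field names are hypothetical placeholders, not the actual cpu_to_core()/cpu_to_socket() interfaces:

```c
#include <assert.h>
#include <stdio.h>

/* Core (and thread) IDs repeat across sockets, so two CPUs may share
 * a per-core runqueue only if their whole (socket, core) tuple matches,
 * not merely the core ID. */

struct cpu_topo {
    int socket;
    int core;
};

/* Per-core runqueue membership test: compare the full topology. */
static int same_core_runqueue(struct cpu_topo a, struct cpu_topo b)
{
    return a.socket == b.socket && a.core == b.core;
}

int main(void)
{
    struct cpu_topo s0c0    = { 0, 0 };  /* socket 0, core 0 */
    struct cpu_topo s1c0    = { 1, 0 };  /* socket 1, core 0: same core ID! */
    struct cpu_topo s0c0_ht = { 0, 0 };  /* hyperthread sibling of s0c0 */

    /* Same core ID but different sockets: must NOT share a runqueue. */
    assert(!same_core_runqueue(s0c0, s1c0));
    /* Siblings with identical full topology: same runqueue. */
    assert(same_core_runqueue(s0c0, s0c0_ht));

    printf("ok\n");
    return 0;
}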

I think you should think about a way to make this parameter a per
cpupool one instead of a system-global one. As this will require some
extra work regarding the tools interface I'd be absolutely fine with
adding this at a later time, but you should have that in mind when
setting this up now.

>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> ---
>   docs/misc/xen-command-line.markdown |   11 +++++++
>   xen/common/sched_credit2.c          |   57 ++++++++++++++++++++++++++++++++---
>   2 files changed, 63 insertions(+), 5 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index a2e427c..71315b8 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -467,6 +467,17 @@ combination with the `low_crashinfo` command line option.
>   ### credit2\_load\_window\_shift
>   > `= <integer>`
>
> +### credit2\_runqueue
> +> `= socket | core`
> +
> +> Default: `socket`
> +
> +Specify how host CPUs are arranged in runqueues. Runqueues are kept
> +balanced with respect to the load generated by the vCPUs running on
> > +them. Smaller runqueues (as in with `core`) mean more accurate load
> +balancing (for instance, it will deal better with hyperthreading),
> +but also more overhead.
> +
>   ### dbgp
>   > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 38f382e..025626f 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -82,10 +82,6 @@
>    * Credits are "reset" when the next vcpu in the runqueue is less than
>    * or equal to zero.  At that point, everyone's credits are "clipped"
>    * to a small value, and a fixed credit is added to everyone.
> - *
> - * The plan is for all cores that share an L2 will share the same
> - * runqueue.  At the moment, there is one global runqueue for all
> - * cores.
>    */
>
>   /*
> @@ -194,6 +190,41 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>   integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>
>   /*
> + * Runqueue organization.
> + *
> > + * Each of the various cpus is to be assigned to a runqueue, and we
> > + * want that to happen based on topology. At the moment, it is possible
> + * to choose to arrange runqueues to be:
> + *
> + * - per-core: meaning that there will be one runqueue per each physical
> + *             core of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'core';
> + *
> + * - per-socket: meaning that there will be one runqueue per each physical
> + *               socket (AKA package, which often, but not always, also
> + *               matches a NUMA node) of the host; This will happen if
> + *               the opt_runqueue parameter is set to 'socket';

Wouldn't it be a nice idea to add "per-numa-node" as well?

This would make a difference for systems with:

- multiple sockets per numa-node
- multiple numa-nodes per socket

It might even be a good idea to be able to have only one runqueue in
small cpupools (again, this will apply only in case you have a per
cpupool setting instead of a global one).


Juergen

* Re: [PATCH 9/9] xen: sched: per-core runqueues as default in credit2
  2015-09-29 16:56 ` [PATCH 9/9] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
@ 2015-10-01  5:48   ` Juergen Gross
  0 siblings, 0 replies; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  5:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 09/29/2015 06:56 PM, Dario Faggioli wrote:
> Experiments have shown that arranging the scheduling
> runqueues on a per-core basis yields better results,
> in most cases.
>
> Such evaluation has been done, for the first time,
> by Uma Sharma, during her participation in OPW. Some
> of the results she got are summarized here:
>
>   http://lists.xen.org/archives/html/xen-devel/2015-03/msg01499.html
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> ---
>   docs/misc/xen-command-line.markdown |    2 +-
>   xen/common/sched_credit2.c          |    2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 71315b8..8fc9cd7 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -470,7 +470,7 @@ combination with the `low_crashinfo` command line option.
>   ### credit2\_runqueue
>   > `= socket | core`
>
> -> Default: `socket`
> +> Default: `core`
>
>   Specify how host CPUs are arranged in runqueues. Runqueues are kept
>   balanced with respect to the load generated by the vCPUs running on
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 025626f..0e9e722 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -211,7 +211,7 @@ integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>    */
>   #define OPT_RUNQUEUE_CORE   1
>   #define OPT_RUNQUEUE_SOCKET 2
> -static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_CORE;
>
>   static void parse_credit2_runqueue(const char *s)
>   {
>
>

* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  5:21   ` Juergen Gross
@ 2015-10-01  6:33     ` Dario Faggioli
  2015-10-01  7:43       ` Juergen Gross
  0 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  6:33 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: George Dunlap


On Thu, 2015-10-01 at 07:21 +0200, Juergen Gross wrote:
> On 09/29/2015 06:55 PM, Dario Faggioli wrote:

> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -1407,6 +1407,9 @@ static int cpu_schedule_callback(
> > 
> >       switch ( action )
> >       {
> > +    case CPU_STARTING:
> > +        SCHED_OP(&ops, init_pdata, cpu);
> > +        break;
> >       case CPU_UP_PREPARE:
> >           rc = cpu_schedule_up(cpu);
> >           break;
> > @@ -1484,6 +1487,7 @@ void __init scheduler_init(void)
> >       if ( ops.alloc_pdata &&
> >            !(this_cpu(schedule_data).sched_priv =
> > ops.alloc_pdata(&ops, 0)) )
> >           BUG();
> > +    SCHED_OP(&ops, init_pdata, 0);
> 
> You can't call this without having it set in all schedulers.
> 
> I guess using the init_pdata hook has to be introduced after or in
> patch 5.
> 
But, if it is not set, it is NULL, in which case, SCHED_OP does the
trick for me, doesn't it?

#define SCHED_OP(opsptr, fn, ...)                                          \
         (( (opsptr)->fn != NULL ) ? (opsptr)->fn(opsptr, ##__VA_ARGS__ )  \
          : (typeof((opsptr)->fn(opsptr, ##__VA_ARGS__)))0 )

AFAICT, the cases where we want to be absolutely sure whether the hook
"exists" are those where we care about its return value, especially if it may
return NULL to report an error (like, for instance, alloc_pdata()), as
that would make it impossible to distinguish between "hook not there"
and "hook failed".

All other cases should be fine when handled like this.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional
  2015-10-01  5:28   ` Juergen Gross
@ 2015-10-01  6:35     ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  6:35 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Josh Whitehead, Robert VanVossen


On Thu, 2015-10-01 at 07:28 +0200, Juergen Gross wrote:
> On 09/29/2015 06:56 PM, Dario Faggioli wrote:

> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -1501,8 +1501,8 @@ int schedule_cpu_switch(unsigned int cpu,
> > struct cpupool *c)
> >           return 0;
> > 
> >       idle = idle_vcpu[cpu];
> > -    ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> > -    if ( ppriv == NULL )
> > +    if ( (new_ops->alloc_pdata != NULL) &&
> > +         ((ppriv = new_ops->alloc_pdata(new_ops, cpu)) == NULL) )
> >           return -ENOMEM;
> >       vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain
> > ->sched_priv);
> >       if ( vpriv == NULL )
> 
> Just below this there are 2 SCHED_OP calls to free_pdata. You'll have
> to
> check whether the hook is present as you just have nuked it above.
> 
I think this is fine, for the same reasons explained when replying to
your comment to patch 4.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2015-10-01  5:48   ` Juergen Gross
@ 2015-10-01  7:23     ` Dario Faggioli
  2015-10-01  7:46       ` Juergen Gross
  0 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  7:23 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Uma Sharma


On Thu, 2015-10-01 at 07:48 +0200, Juergen Gross wrote:
> On 09/29/2015 06:56 PM, Dario Faggioli wrote:
> > In fact, credit2 uses CPU topology to decide how to arrange
> > its internal runqueues. Before this change, only 'one runqueue
> > per socket' was allowed. However, experiments have shown that,
> > for instance, having one runqueue per physical core improves
> > performance, especially in case hyperthreading is available.
> > 
> > In general, it makes sense to allow users to pick one runqueue
> > arrangement at boot time, so that:
> >   - more experiments can be easily performed to even better
> >     assess and improve performance;
> >   - one can select the best configuration for his specific
> >     use case and/or hardware.
> > 
> > This patch enables the above.
> > 
> > Note that, for correctly arranging runqueues to be per-core,
> > just checking cpu_to_core() on the host CPUs is not enough.
> > In fact, cores (and hyperthreads) on different sockets, can
> > have the same core (and thread) IDs! We, therefore, need to
> > check whether the full topology of two CPUs matches, for
> > them to be put in the same runqueue.
> > 
> > Note also that the default (although not functional) for
> > credit2, since now, has been per-socket runqueue. This patch
> > leaves things that way, to avoid mixing policy and technical
> > changes.
> 
> I think you should think about a way to make this parameter a per
> cpupool one instead a system global one. 
>
Believe it or not, I thought about this already, and yes, it is in my
plans to make this per-cpupool. However...

> As this will require some
> extra work regarding the tools interface I'd be absolutely fine with
> adding this at a later time, but you should have that in mind when
> setting this up now.
> 
...yes, that was phase II in my mind as well.

So (sorry, but just to make sure I understand), since you said you're
fine with it coming later, are you also fine with this patch, or do you
think some adjustments are necessary, right here, right now, because of
that future plan?

> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c
> > @@ -82,10 +82,6 @@

> > @@ -194,6 +190,41 @@ static int __read_mostly
> > opt_overload_balance_tolerance = -3;
> >   integer_param("credit2_balance_over",
> > opt_overload_balance_tolerance);
> > 
> >   /*
> > + * Runqueue organization.
> > + *
> > + * The various cpus are to be assigned each one to a runqueue, and
> > we
> > + * want that to happen basing on topology. At the moment, it is
> > possible
> > + * to choose to arrange runqueues to be:
> > + *
> > + * - per-core: meaning that there will be one runqueue per each
> > physical
> > + *             core of the host. This will happen if the
> > opt_runqueue
> > + *             parameter is set to 'core';
> > + *
> > + * - per-socket: meaning that there will be one runqueue per each
> > physical
> > + *               socket (AKA package, which often, but not always,
> > also
> > + *               matches a NUMA node) of the host; This will
> > happen if
> > + *               the opt_runqueue parameter is set to 'socket';
> 
> Wouldn't it be a nice idea to add "per-numa-node" as well?
> 
I think it is.

> This would make a difference for systems with:
> 
> - multiple sockets per numa-node
> - multiple numa-nodes per socket
> 
Yep.

> It might even be a good idea to be able to have only one runqueue in
> small cpupools (again, this will apply only in case you have a per
> cpupool setting instead a global one).
> 
And I agree on this too.

TBH, I had considered these too, and I was thinking of making them happen
in phase II as well. However, they're simple enough to be implemented
now (as in, in v2 of this series), so I think I'll do that.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  6:33     ` Dario Faggioli
@ 2015-10-01  7:43       ` Juergen Gross
  2015-10-01  9:32         ` Andrew Cooper
  0 siblings, 1 reply; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  7:43 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap

On 10/01/2015 08:33 AM, Dario Faggioli wrote:
> On Thu, 2015-10-01 at 07:21 +0200, Juergen Gross wrote:
>> On 09/29/2015 06:55 PM, Dario Faggioli wrote:
>
>>> --- a/xen/common/schedule.c
>>> +++ b/xen/common/schedule.c
>>> @@ -1407,6 +1407,9 @@ static int cpu_schedule_callback(
>>>
>>>        switch ( action )
>>>        {
>>> +    case CPU_STARTING:
>>> +        SCHED_OP(&ops, init_pdata, cpu);
>>> +        break;
>>>        case CPU_UP_PREPARE:
>>>            rc = cpu_schedule_up(cpu);
>>>            break;
>>> @@ -1484,6 +1487,7 @@ void __init scheduler_init(void)
>>>        if ( ops.alloc_pdata &&
>>>             !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
>>>            BUG();
>>> +    SCHED_OP(&ops, init_pdata, 0);
>>
>> You can't call this without having it set in all schedulers.
>>
>> I guess using the init_pdata hook has to be introduced after or in
>> patch 5.
>>
> But, if it is not set, it is NULL, in which case, SCHED_OP does the
> trick for me, doesn't it?
>
> #define SCHED_OP(opsptr, fn, ...)                                          \
>           (( (opsptr)->fn != NULL ) ? (opsptr)->fn(opsptr, ##__VA_ARGS__ )  \
>            : (typeof((opsptr)->fn(opsptr, ##__VA_ARGS__)))0 )

Aah, yes, of course.

Sorry for the noise.


Juergen


* Re: [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot
  2015-10-01  7:23     ` Dario Faggioli
@ 2015-10-01  7:46       ` Juergen Gross
  0 siblings, 0 replies; 48+ messages in thread
From: Juergen Gross @ 2015-10-01  7:46 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Uma Sharma

On 10/01/2015 09:23 AM, Dario Faggioli wrote:
> On Thu, 2015-10-01 at 07:48 +0200, Juergen Gross wrote:
>> On 09/29/2015 06:56 PM, Dario Faggioli wrote:
>>> In fact, credit2 uses CPU topology to decide how to arrange
>>> its internal runqueues. Before this change, only 'one runqueue
>>> per socket' was allowed. However, experiments have shown that,
>>> for instance, having one runqueue per physical core improves
>>> performance, especially in case hyperthreading is available.
>>>
>>> In general, it makes sense to allow users to pick one runqueue
>>> arrangement at boot time, so that:
>>>    - more experiments can be easily performed to even better
>>>      assess and improve performance;
>>>    - one can select the best configuration for his specific
>>>      use case and/or hardware.
>>>
>>> This patch enables the above.
>>>
>>> Note that, for correctly arranging runqueues to be per-core,
>>> just checking cpu_to_core() on the host CPUs is not enough.
>>> In fact, cores (and hyperthreads) on different sockets can
>>> have the same core (and thread) IDs! We therefore need to
>>> check whether the full topology of two CPUs matches for
>>> them to be put in the same runqueue.
>>>
>>> Note also that the default (although not functional) for
>>> credit2 has, up to now, been per-socket runqueues. This patch
>>> leaves things that way, to avoid mixing policy and technical
>>> changes.
>>
>> I think you should think about a way to make this parameter a per
>> cpupool one instead of a system global one.
>>
> Believe it or not, I thought about this already, and yes, it is in my
> plans to make this per-cpupool. However...
>
>> As this will require some
>> extra work regarding the tools interface I'd be absolutely fine with
>> adding this at a later time, but you should have that in mind when
>> setting this up now.
>>
> ...yes, that was phase II in my mind as well.
>
> So (sorry, but just to make sure I understand), since you said you're
> fine with it coming later, are you also fine with this patch, or do you
> think some adjustments are necessary, right here, right now, because of
> that future plan?

No, I'm fine.

>
>>> --- a/xen/common/sched_credit2.c
>>> +++ b/xen/common/sched_credit2.c
>>> @@ -82,10 +82,6 @@
>
>>> @@ -194,6 +190,41 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>>>    integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>>>
>>>    /*
>>> + * Runqueue organization.
>>> + *
>>> + * The various cpus are to be assigned each one to a runqueue, and we
>>> + * want that to happen basing on topology. At the moment, it is possible
>>> + * to choose to arrange runqueues to be:
>>> + *
>>> + * - per-core: meaning that there will be one runqueue per each physical
>>> + *             core of the host. This will happen if the opt_runqueue
>>> + *             parameter is set to 'core';
>>> + *
>>> + * - per-socket: meaning that there will be one runqueue per each physical
>>> + *               socket (AKA package, which often, but not always, also
>>> + *               matches a NUMA node) of the host; This will happen if
>>> + *               the opt_runqueue parameter is set to 'socket';
>>
>> Wouldn't it be a nice idea to add "per-numa-node" as well?
>>
> I think it is.
>
>> This would make a difference for systems with:
>>
>> - multiple sockets per numa-node
>> - multiple numa-nodes per socket
>>
> Yep.
>
>> It might even be a good idea to be able to have only one runqueue in
>> small cpupools (again, this will apply only in case you have a per
>> cpupool setting instead of a global one).
>>
> And I agree on this too.
>
> TBH, I had considered these too, and I was planning to make them happen
> in phase II as well. However, they're simple enough to implement
> now (as in, in v2 of this series), so I think I'll do that.

Thanks.


Juergen


* Re: [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional
  2015-09-29 16:56 ` [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
  2015-10-01  5:28   ` Juergen Gross
@ 2015-10-01  7:49   ` Jan Beulich
  2015-10-01  8:13     ` Dario Faggioli
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2015-10-01  7:49 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Josh Whitehead, Robert VanVossen

>>> On 29.09.15 at 18:56, <dario.faggioli@citrix.com> wrote:
> @@ -1501,8 +1501,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
>          return 0;
>  
>      idle = idle_vcpu[cpu];
> -    ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> -    if ( ppriv == NULL )
> +    if ( (new_ops->alloc_pdata != NULL) &&
> +         ((ppriv = new_ops->alloc_pdata(new_ops, cpu)) == NULL) )
>          return -ENOMEM;

I think this should be solved without open coding SCHED_OP():
Make use of the helpers in xen/err.h and have NULL become a
variant of success. Then you at once allow errors other than
out-of-memory to be properly communicated.

Jan


* Re: [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters
  2015-09-29 16:55 ` [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters Dario Faggioli
  2015-10-01  5:23   ` Juergen Gross
@ 2015-10-01  7:51   ` Jan Beulich
  2015-10-01  8:17     ` Dario Faggioli
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2015-10-01  7:51 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Uma Sharma

>>> On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -162,7 +162,7 @@
>  #define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request)
>  
>  
> -int opt_migrate_resist=500;
> +static int __read_mostly opt_migrate_resist = 500;
>  integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>  
>  /*
> @@ -185,12 +185,12 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>   *   to a load of 1.
>   */
>  #define LOADAVG_GRANULARITY_SHIFT (10)
> -int opt_load_window_shift=18;
> +static int __read_mostly opt_load_window_shift = 18;
>  #define  LOADAVG_WINDOW_SHIFT_MIN 4
>  integer_param("credit2_load_window_shift", opt_load_window_shift);
> -int opt_underload_balance_tolerance=0;
> +static int __read_mostly opt_underload_balance_tolerance = 0;
>  integer_param("credit2_balance_under", opt_underload_balance_tolerance);
> -int opt_overload_balance_tolerance=-3;
> +static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);

Are all of these validly _signed_ int? If not, I think adjusting their
types here would be quite appropriate.

Jan


* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 16:55 ` [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent Dario Faggioli
  2015-09-29 17:31   ` Andrew Cooper
@ 2015-10-01  8:03   ` Jan Beulich
  2015-10-01 11:59     ` Dario Faggioli
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2015-10-01  8:03 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Meng Xu

>>> On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> In case of credit1, remove_vcpu() handling needs some
> fixing too, i.e.:
>  - it manipulates runqueues, so the runqueue lock must
>    be acquired;

Is this really a fix needed only as a result of the insert side
changes? If not, since the insert side looks more like an
improvement than a bug fix, perhaps splitting the two (and
requesting backport of the remove side change) would be
better?

> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -905,8 +905,19 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct csched_vcpu *svc = vc->sched_priv;
>  
> -    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> -        __runq_insert(vc->processor, svc);
> +    /*
> +     * For the idle domain, this is called, during boot, before
> +     * than alloc_pdata() has been called for the pCPU.
> +     */
> +    if ( !is_idle_vcpu(vc) )

Looking through the patch I can't spot why this would all of the
sudden need special casing. Can you explain that please? In
particular (in case it's just the acquiring of the lock that now
becomes an issue) - where would the runqueue insertion be
happening now? Or if one of the latter two conditions makes it
so that insertion turns out not to be needed, perhaps they could
be moved up to replace the is-idle check?

Also, not least since the comment gets repeated below, if
it is to stay then I think the "than" should be dropped (or
re-worded more extensively).

Jan


* Re: [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional
  2015-10-01  7:49   ` Jan Beulich
@ 2015-10-01  8:13     ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  8:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Josh Whitehead, Robert VanVossen


[-- Attachment #1.1: Type: text/plain, Size: 1290 bytes --]

On Thu, 2015-10-01 at 01:49 -0600, Jan Beulich wrote:
> > > > On 29.09.15 at 18:56, <dario.faggioli@citrix.com> wrote:
> > @@ -1501,8 +1501,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
> >          return 0;
> >  
> >      idle = idle_vcpu[cpu];
> > -    ppriv = SCHED_OP(new_ops, alloc_pdata, cpu);
> > -    if ( ppriv == NULL )
> > +    if ( (new_ops->alloc_pdata != NULL) &&
> > +         ((ppriv = new_ops->alloc_pdata(new_ops, cpu)) == NULL) )
> >          return -ENOMEM;
> 
> I think this should be solved without open coding SCHED_OP():
> Make use of the helpers in xen/err.h and have NULL become a
> variant of success.
>
I see. There are other occurrences of the open coded variant, so I just
went for it for the sake of this patch.

But yes, I guess I can have a look at what you suggest, add a
preparatory patch to the series which does that, and convert all
existing cases (and then, of course, use the same approach when I get
to this one).

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-09-29 16:55 ` [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface Dario Faggioli
  2015-10-01  5:21   ` Juergen Gross
@ 2015-10-01  8:17   ` Jan Beulich
  2015-10-01  9:26     ` Dario Faggioli
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2015-10-01  8:17 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Juergen Gross

>>> On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> This happens because, right now, the scheduler of the
> target pool remaps the runqueue lock during (rt_)alloc_pdata,
> which is called at the very beginning of schedule_cpu_switch().
> Credit2's csched2_free_pdata(), however, wants to find the spin
> lock the same way it was put by csched2_alloc_pdata(), and
> explodes when that is not the case.
> 
> This patch also fixes this as, now, spin lock remapping
> happens in the init_pdata hook of the target scheduler for
> the pCPU, which we can easily make sure is invoked *after*
> the free_pdata hook of the old scheduler (which is exactly
> what is done in this patch), without needing to restructure
> schedule_cpu_switch().

Hmm, while I can see this to be okay for the
cpupool_unassign_cpu_helper() case (due to the call to
cpu_disable_scheduler() up front) as well as for the
cpupool_cpu_add() one (I take it that the CPU can't be
scheduled on in that case), I don't see how this would be
safe for the XEN_SYSCTL_CPUPOOL_OP_ADDCPU case: How is
scheduling activity prevented from happening after the new
ppriv got installed, but before it gets initialized via the init_pdata
hook? Or how would scheduling activity be safe with an allocated
but not yet initialized ppriv?

Jan


* Re: [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters
  2015-10-01  7:51   ` Jan Beulich
@ 2015-10-01  8:17     ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  8:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Uma Sharma



On Thu, 2015-10-01 at 01:51 -0600, Jan Beulich wrote:
> > > > On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c
> > @@ -162,7 +162,7 @@
> >  #define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request)
> >  
> >  
> > -int opt_migrate_resist=500;
> > +static int __read_mostly opt_migrate_resist = 500;
> >  integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
> >  
> >  /*
> > @@ -185,12 +185,12 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
> >   *   to a load of 1.
> >   */
> >  #define LOADAVG_GRANULARITY_SHIFT (10)
> > -int opt_load_window_shift=18;
> > +static int __read_mostly opt_load_window_shift = 18;
> >  #define  LOADAVG_WINDOW_SHIFT_MIN 4
> >  integer_param("credit2_load_window_shift", opt_load_window_shift);
> > -int opt_underload_balance_tolerance=0;
> > +static int __read_mostly opt_underload_balance_tolerance = 0;
> >  integer_param("credit2_balance_under", opt_underload_balance_tolerance);
> > -int opt_overload_balance_tolerance=-3;
> > +static int __read_mostly opt_overload_balance_tolerance = -3;
> >  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
> 
> Are all of these validly _signed_ int? If not, I think adjusting
> their
> types here would be quite appropriate.
> 
Good point. Since I'm touching them, I'll convert some of them to
'unsigned int'.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  8:17   ` Jan Beulich
@ 2015-10-01  9:26     ` Dario Faggioli
  2015-10-01 10:12       ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  9:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Juergen Gross



On Thu, 2015-10-01 at 02:17 -0600, Jan Beulich wrote:
> > > > On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> > This happens because, right now, the scheduler of the
> > target pool remaps the runqueue lock during (rt_)alloc_pdata,
> > which is called at the very beginning of schedule_cpu_switch().
> > Credit2's csched2_free_pdata(), however, wants to find the spin
> > lock the same way it was put by csched2_alloc_pdata(), and
> > explodes when that is not the case.
> > 
> > This patch also fixes this as, now, spin lock remapping
> > happens in the init_pdata hook of the target scheduler for
>> > the pCPU, which we can easily make sure is invoked *after*
> > the free_pdata hook of the old scheduler (which is exactly
> > what is done in this patch), without needing to restructure
> > schedule_cpu_switch().
> 
> Hmm, while I can see this to be okay for the
> cpupool_unassign_cpu_helper() case (due to the call to
> cpu_disable_scheduler() up front) as well as for the
> cpupool_cpu_add() one (I take it that the CPU can't be
> scheduled on in that case), I don't see how this would be
> safe for the XEN_SYSCTL_CPUPOOL_OP_ADDCPU case: How is
> scheduling activity prevented from happening after the new
> ppriv got installed, but before it gets initialized via the
> init_pdata
> hook? Or how would scheduling activity be safe with an allocated
> but not yet initialized ppriv?
> 
Not sure I'm getting it, but let me try.

XEN_SYSCTL_CPUPOOL_OP_ADDCPU is what adds a CPU to a pool. If we are
adding a CPU to a pool, it means the CPU is currently free, i.e.,
outside of any pool. This, in turn, means no pool has the CPU's bit set
in its cpupool->cpu_valid mask, which is what we check, within the
scheduler, to see where we can schedule the vcpus subject to a
particular scheduler.

If no pool has that bit set in its mask, nothing is being scheduled on
that CPU. (Actually, making sure that that is really the case, has been
the subject of some patches I sent back in July.)

Ok, now XEN_SYSCTL_CPUPOOL_OP_ADDCPU calls cpupool_assign_cpu_locked(),
which calls schedule_cpu_switch(cpu --> c) and, *only* *after* *that*,
does the following:

    cpumask_clear_cpu(cpu, &cpupool_free_cpus);
    ...
    cpumask_set_cpu(cpu, c->cpu_valid);

So, still assuming that I got your question right, the answer is:
scheduling activity was stopped already, and is restarted only after
both allocation and initialization have happened.

Let me know what you think.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  7:43       ` Juergen Gross
@ 2015-10-01  9:32         ` Andrew Cooper
  2015-10-01  9:40           ` Dario Faggioli
  0 siblings, 1 reply; 48+ messages in thread
From: Andrew Cooper @ 2015-10-01  9:32 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap

On 01/10/15 08:43, Juergen Gross wrote:
> On 10/01/2015 08:33 AM, Dario Faggioli wrote:
>> On Thu, 2015-10-01 at 07:21 +0200, Juergen Gross wrote:
>>> On 09/29/2015 06:55 PM, Dario Faggioli wrote:
>>
>>>> --- a/xen/common/schedule.c
>>>> +++ b/xen/common/schedule.c
>>>> @@ -1407,6 +1407,9 @@ static int cpu_schedule_callback(
>>>>
>>>>        switch ( action )
>>>>        {
>>>> +    case CPU_STARTING:
>>>> +        SCHED_OP(&ops, init_pdata, cpu);
>>>> +        break;
>>>>        case CPU_UP_PREPARE:
>>>>            rc = cpu_schedule_up(cpu);
>>>>            break;
>>>> @@ -1484,6 +1487,7 @@ void __init scheduler_init(void)
>>>>        if ( ops.alloc_pdata &&
>>>>             !(this_cpu(schedule_data).sched_priv = ops.alloc_pdata(&ops, 0)) )
>>>>            BUG();
>>>> +    SCHED_OP(&ops, init_pdata, 0);
>>>
>>> You can't call this without having it set in all schedulers.
>>>
>>> I guess using the init_pdata hook has to be introduced after or in
>>> patch 5.
>>>
>> But, if it is not set, it is NULL, in which case, SCHED_OP does the
>> trick for me, doesn't it?
>>
>> #define SCHED_OP(opsptr, fn, ...)                                          \
>>           (( (opsptr)->fn != NULL ) ? (opsptr)->fn(opsptr, ##__VA_ARGS__ )  \
>>            : (typeof((opsptr)->fn(opsptr, ##__VA_ARGS__)))0 )
>
> Aah, yes, of course.
>
> Sorry for the noise.

Another area ripe for cleanup is these macros.  As far as I can tell,
they serve no purpose other than to obscure the code, stop cscope/ctags
from following the calls, and give the preprocessor a headache.

A system like we use for the hvm_func pointers and the static inline
wrappers would be far clearer, and also make it obvious that it is safe
to call with a NULL function pointer.

~Andrew


* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  9:32         ` Andrew Cooper
@ 2015-10-01  9:40           ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01  9:40 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Juergen Gross, xen-devel, George Dunlap



On Thu, 2015-10-01 at 10:32 +0100, Andrew Cooper wrote:
> On 01/10/15 08:43, Juergen Gross wrote:
> > On 10/01/2015 08:33 AM, Dario Faggioli wrote:
> > > On Thu, 2015-10-01 at 07:21 +0200, Juergen Gross wrote:

> > > #define SCHED_OP(opsptr, fn, ...)                                          \
> > >           (( (opsptr)->fn != NULL ) ? (opsptr)->fn(opsptr, ##__VA_ARGS__ )  \
> > >            : (typeof((opsptr)->fn(opsptr, ##__VA_ARGS__)))0 )
> > 
> > Aah, yes, of course.
> > 
> > Sorry for the noise.
> 
> Another area ripe for cleanup is these macros.  As far as I can tell,
> they serve no purpose other than to obscure the code, stop
> cscope/ctags
> from following the calls, and give the preprocessor a headache.
> 
> A system like we use for the hvm_func pointers and the static inline
> wrappers would be far clearer, and also make it obvious that it is
> safe
> to call with a NULL function pointer.
> 
Happy to add this as well to my TODO list. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01  9:26     ` Dario Faggioli
@ 2015-10-01 10:12       ` Jan Beulich
  2015-10-01 10:35         ` Dario Faggioli
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2015-10-01 10:12 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Juergen Gross

>>> On 01.10.15 at 11:26, <dario.faggioli@citrix.com> wrote:
> On Thu, 2015-10-01 at 02:17 -0600, Jan Beulich wrote:
>> > > > On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
>> > This happens because, right now, the scheduler of the
>> > target pool remaps the runqueue lock during (rt_)alloc_pdata,
>> > which is called at the very beginning of schedule_cpu_switch().
>> > Credit2's csched2_free_pdata(), however, wants to find the spin
>> > lock the same way it was put by csched2_alloc_pdata(), and
>> > explodes when that is not the case.
>> > 
>> > This patch also fixes this as, now, spin lock remapping
>> > happens in the init_pdata hook of the target scheduler for
>> > the pCPU, which we can easily make sure is invoked *after*
>> > the free_pdata hook of the old scheduler (which is exactly
>> > what is done in this patch), without needing to restructure
>> > schedule_cpu_switch().
>> 
>> Hmm, while I can see this to be okay for the
>> cpupool_unassign_cpu_helper() case (due to the call to
>> cpu_disable_scheduler() up front) as well as for the
>> cpupool_cpu_add() one (I take it that the CPU can't be
>> scheduled on in that case), I don't see how this would be
>> safe for the XEN_SYSCTL_CPUPOOL_OP_ADDCPU case: How is
>> scheduling activity prevented from happening after the new
>> ppriv got installed, but before it gets initialized via the
>> init_pdata
>> hook? Or how would scheduling activity be safe with an allocated
>> but not yet initialized ppriv?
>> 
> Not sure I'm getting it, but let me try.
> 
> XEN_SYSCTL_CPUPOOL_OP_ADDCPU is what adds a CPU to a pool. If we are
> adding a CPU to a pool, it means the CPU is currently free, i.e.,
> outside of any pool. This, in turn, means no pool has the CPU's bit set
> in its cpupool->cpu_valid mask, which is what we check, within the
> scheduler, to see where we can schedule the vcpus subject to a
> particular scheduler.
> 
> If no pool has that bit set in its mask, nothing is being scheduled on
> that CPU. (Actually, making sure that that is really the case, has been
> the subject of some patches I sent back in July.)
> 
> Ok, now XEN_SYSCTL_CPUPOOL_OP_ADDCPU calls cpupool_assign_cpu_locked(),
> which calls schedule_cpu_switch(cpu --> c) and, *only* *after* *that*,
> does the following:
> 
>     cpumask_clear_cpu(cpu, &cpupool_free_cpus);
>     ...
>     cpumask_set_cpu(cpu, c->cpu_valid);
> 
> So, still assuming that I got your question right, the answer is:
> scheduling activity was stopped already, and is restarted only after
> both allocation and initialization have happened.

Okay. The thing is that looking at schedule_cpu_switch() alone
(and also considering its name) it is not clear that all callers either
move the CPU into unusable state (from scheduling pov) or out
of it, but never between CPUs from usable to usable. No
assertion, no comment, nothing. IOW even if not an active bug,
at least a latent one with your changes.

Jan


* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01 10:12       ` Jan Beulich
@ 2015-10-01 10:35         ` Dario Faggioli
  2015-10-01 10:47           ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01 10:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Juergen Gross



On Thu, 2015-10-01 at 04:12 -0600, Jan Beulich wrote:

> Okay. The thing is that looking at schedule_cpu_switch() alone
> (and also considering its name) it is not clear that all callers
> either
> move the CPU into unusable state (from scheduling pov) or out
> of it, but never between CPUs from usable to usable. 
>
I agree.

> No assertion, no comment, nothing. IOW even if not an active bug,
> at least a latent one with your changes.
>
Well, for sure the patch does fix an actual bug, as detailed in the
changelog. Two of them, actually, considering that it is this
alloc<-->init split that makes it possible to fix the Credit2 runqueue bug.

But sure I don't want to introduce a new bug --no matter whether actual
or latent-- even if I'm fixing two! :-D

I certainly can add both, a comment and an ASSERT(). Is that ok?

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface
  2015-10-01 10:35         ` Dario Faggioli
@ 2015-10-01 10:47           ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2015-10-01 10:47 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Juergen Gross

>>> On 01.10.15 at 12:35, <dario.faggioli@citrix.com> wrote:
> On Thu, 2015-10-01 at 04:12 -0600, Jan Beulich wrote:
>> Okay. The thing is that looking at schedule_cpu_switch() alone
>> (and also considering its name) it is not clear that all callers
>> either
>> move the CPU into unusable state (from scheduling pov) or out
>> of it, but never between CPUs from usable to usable. 
>>
> I agree.
> 
>> No assertion, no comment, nothing. IOW even if not an active bug,
>> at least a latent one with your changes.
>>
> Well, for sure the patch does fix an actual bug, as detailed in the
> changelog. Two of them, actually, considering that it is this
> alloc<-->init split that allows to fix the Credit2 runqueue bug.
> 
> But sure I don't want to introduce a new bug --no matter whether actual
> or latent-- even if I'm fixing two! :-D
> 
> I certainly can add both, a comment and an ASSERT(). Is that ok?

Sure. Personally I'd be fine with just an ASSERT(), but talk to George.

Jan


* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-01  8:03   ` Jan Beulich
@ 2015-10-01 11:59     ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-01 11:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Meng Xu



On Thu, 2015-10-01 at 02:03 -0600, Jan Beulich wrote:
> > > > On 29.09.15 at 18:55, <dario.faggioli@citrix.com> wrote:
> > In the case of credit1, remove_vcpu() handling needs some
> > fixing too, i.e.:
> >  - it manipulates runqueues, so the runqueue lock must
> >    be acquired;
> 
> Is this really a fix needed only as a result of the insert side
> changes? If not, since the insert side looks more like an
> improvement than a bug fix, perhaps splitting the two (and
> requesting backport of the remove side change) would be
> better?
> 
Yes, I like the idea, I'll do this.

> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -905,8 +905,19 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> >  {
> >      struct csched_vcpu *svc = vc->sched_priv;
> >  
> > -    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> > -        __runq_insert(vc->processor, svc);
> > +    /*
> > +     * For the idle domain, this is called, during boot, before
> > +     * than alloc_pdata() has been called for the pCPU.
> > +     */
> > +    if ( !is_idle_vcpu(vc) )
> 
> Looking through the patch I can't spot why this would all of the
> sudden need special casing. Can you explain that please? In
> particular (in case it's just the acquiring of the lock that now
> becomes an issue) - where would the runqueue insertion be
> happening now? 
>
So, this is an issue in Credit2 (from where the comment was taken) and
RTDS. I've checked further and, for Credit, it is an issue only without
patch 4 of this series, and even in that case, it can be cured in
another (better) way, I think.

So, I'll not only split, but also rework this patch a bit, and
resubmit.

> Or if one of the latter two conditions makes it
> so that insertion turns out not to be needed, perhaps they could
> be moved up to replace the is-idle check?
> 
Yes, it was the !vc->is_running part which already prevents the
insertion of idle vcpus in runqueues. I wouldn't like having it guard
the whole thing, for various reasons, but, as I said, I think I can
rework the patch to make things clearer.

> Also, not the least since the comment gets repeated below, if
> it is to stay then I think the "than" should be dropped (or
> re-worded more extensively).
> 
Yep, as a part of the rework, I'll reword the comment(s).

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump
  2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
  2015-10-01  5:22   ` Juergen Gross
@ 2015-10-08 14:09   ` George Dunlap
  1 sibling, 0 replies; 48+ messages in thread
From: George Dunlap @ 2015-10-08 14:09 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap

On 29/09/15 17:55, Dario Faggioli wrote:
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
>  xen/common/sched_credit2.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 75e0321..46645f8 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1931,7 +1931,7 @@ csched2_dump(const struct scheduler *ops)
>          struct csched2_dom *sdom;
>          sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem);
>  
> -        printk("\tDomain: %d w %d v %d\n\t",
> +        printk("\tDomain: %d w %d v %d\n",
>                 sdom->dom->domain_id,
>                 sdom->weight,
>                 sdom->nr_vcpus);
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-09-29 17:31   ` Andrew Cooper
  2015-09-29 21:40     ` Dario Faggioli
@ 2015-10-08 14:58     ` George Dunlap
  2015-10-08 15:20       ` Andrew Cooper
  1 sibling, 1 reply; 48+ messages in thread
From: George Dunlap @ 2015-10-08 14:58 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 29/09/15 18:31, Andrew Cooper wrote:
> On 29/09/15 17:55, Dario Faggioli wrote:
>> The insert_vcpu() scheduler hook is called with an
>> inconsistent locking strategy. In fact, it is sometimes
>> invoked while holding the runqueue lock and sometimes
>> when that is not the case.
>>
>> In other words, some call sites seem to imply that
>> locking should be handled in the callers, in schedule.c
>> --e.g., in schedule_cpu_switch(), which acquires the
>> runqueue lock before calling the hook; others that
>> specific schedulers should be responsible for locking
>> themselves --e.g., in sched_move_domain(), which does
>> not acquire any lock for calling the hook.
>>
>> The right thing to do seems to always defer locking to
>> the specific schedulers, as it's them that know what, how
>> and when it is best to lock (as in: runqueue locks, vs.
>> private scheduler locks, vs. both, etc.)
>>
>> This patch, therefore:
>>  - removes any locking around insert_vcpu() from
>>    generic code (schedule.c);
>>  - adds the _proper_ locking in the hook implementations,
>>    depending on the scheduler (for instance, credit2
>>    does that already, credit1 and RTDS need to grab
>>    the runqueue lock while manipulating runqueues).
>>
>> In the case of credit1, remove_vcpu() handling needs some
>> fixing too, i.e.:
>>  - it manipulates runqueues, so the runqueue lock must
>>    be acquired;
>>  - *_lock_irq() is enough, there is no need to do
>>    _irqsave()
> 
> Nothing in any of generic scheduling code should need interrupts
> disabled at all.
> 
> One of the problem-areas identified by Jenny during the ticketlock
> performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
> time with interrupts disabled.  (The other large one being the time
> calibration rendezvous, but that is a wildly different can of worms to fix.)

Generic scheduling code is called from interrupt contexts -- namely,
vcpu_wake(), which for the credit scheduler wants to put things on the
runqueue.  Lock taken in interrupt context => interrupts must be
disabled whenever taking the lock, yes?

There may be a way we can revise the whole scheduling system to separate
locks taken in an interrupt context from other locks (for instance,
having a special "wake" queue which is drained in schedule() or
something) but it's not as simple as just switching all the locks over
to non-irq.

 -George

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 14:58     ` George Dunlap
@ 2015-10-08 15:20       ` Andrew Cooper
  2015-10-08 16:46         ` George Dunlap
  2015-10-08 20:39         ` Dario Faggioli
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Cooper @ 2015-10-08 15:20 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 08/10/15 15:58, George Dunlap wrote:
> On 29/09/15 18:31, Andrew Cooper wrote:
>> On 29/09/15 17:55, Dario Faggioli wrote:
>>> The insert_vcpu() scheduler hook is called with an
>>> inconsistent locking strategy. In fact, it is sometimes
>>> invoked while holding the runqueue lock and sometimes
>>> when that is not the case.
>>>
>>> In other words, some call sites seem to imply that
>>> locking should be handled in the callers, in schedule.c
>>> --e.g., in schedule_cpu_switch(), which acquires the
>>> runqueue lock before calling the hook; others that
>>> specific schedulers should be responsible for locking
>>> themselves --e.g., in sched_move_domain(), which does
>>> not acquire any lock for calling the hook.
>>>
>>> The right thing to do seems to always defer locking to
>>> the specific schedulers, as it's them that know what, how
>>> and when it is best to lock (as in: runqueue locks, vs.
>>> private scheduler locks, vs. both, etc.)
>>>
>>> This patch, therefore:
>>>  - removes any locking around insert_vcpu() from
>>>    generic code (schedule.c);
>>>  - adds the _proper_ locking in the hook implementations,
>>>    depending on the scheduler (for instance, credit2
>>>    does that already, credit1 and RTDS need to grab
>>>    the runqueue lock while manipulating runqueues).
>>>
>>> In the case of credit1, remove_vcpu() handling needs some
>>> fixing too, i.e.:
>>>  - it manipulates runqueues, so the runqueue lock must
>>>    be acquired;
>>>  - *_lock_irq() is enough, there is no need to do
>>>    _irqsave()
>> Nothing in any of generic scheduling code should need interrupts
>> disabled at all.
>>
>> One of the problem-areas identified by Jenny during the ticketlock
>> performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
>> time with interrupts disabled.  (The other large one being the time
>> calibration rendezvous, but that is a wildly different can of worms to fix.)
> Generic scheduling code is called from interrupt contexts -- namely,
> vcpu_wake()

There are a lot of codepaths, but I can't see one which is definitely
called with interrupts disabled.  (OTOH, I can see several where
interrupts are definitely enabled).

> , which for the credit scheduler wants to put things on the
> runqueue.  Lock taken in interrupt context => interrupts must be
> disabled whenever taking the lock, yes?

Correct, which is the purpose of the ASSERT()s in the _irq() and
_irqsave() variants.

> There may be a way we can revise the whole scheduling system to separate
> locks taken in an interrupt context from other locks (for instance,
> having a special "wake" queue which is drained in schedule() or
> something) but it's not as simple as just switching all the locks over
> to non-irq.

I did not wish to imply that this is trivial to achieve.  I am entirely
willing to believe that it might involve changes of logic.

~Andrew

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 15:20       ` Andrew Cooper
@ 2015-10-08 16:46         ` George Dunlap
  2015-10-08 17:23           ` Andrew Cooper
  2015-10-08 20:39         ` Dario Faggioli
  1 sibling, 1 reply; 48+ messages in thread
From: George Dunlap @ 2015-10-08 16:46 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 08/10/15 16:20, Andrew Cooper wrote:
> On 08/10/15 15:58, George Dunlap wrote:
>> On 29/09/15 18:31, Andrew Cooper wrote:
>>> On 29/09/15 17:55, Dario Faggioli wrote:
>>>> The insert_vcpu() scheduler hook is called with an
>>>> inconsistent locking strategy. In fact, it is sometimes
>>>> invoked while holding the runqueue lock and sometimes
>>>> when that is not the case.
>>>>
>>>> In other words, some call sites seem to imply that
>>>> locking should be handled in the callers, in schedule.c
>>>> --e.g., in schedule_cpu_switch(), which acquires the
>>>> runqueue lock before calling the hook; others that
>>>> specific schedulers should be responsible for locking
>>>> themselves --e.g., in sched_move_domain(), which does
>>>> not acquire any lock for calling the hook.
>>>>
>>>> The right thing to do seems to always defer locking to
>>>> the specific schedulers, as it's them that know what, how
>>>> and when it is best to lock (as in: runqueue locks, vs.
>>>> private scheduler locks, vs. both, etc.)
>>>>
>>>> This patch, therefore:
>>>>  - removes any locking around insert_vcpu() from
>>>>    generic code (schedule.c);
>>>>  - adds the _proper_ locking in the hook implementations,
>>>>    depending on the scheduler (for instance, credit2
>>>>    does that already, credit1 and RTDS need to grab
>>>>    the runqueue lock while manipulating runqueues).
>>>>
>>>> In the case of credit1, remove_vcpu() handling needs some
>>>> fixing too, i.e.:
>>>>  - it manipulates runqueues, so the runqueue lock must
>>>>    be acquired;
>>>>  - *_lock_irq() is enough, there is no need to do
>>>>    _irqsave()
>>> Nothing in any of generic scheduling code should need interrupts
>>> disabled at all.
>>>
>>> One of the problem-areas identified by Jenny during the ticketlock
>>> performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
>>> time with interrupts disabled.  (The other large one being the time
>>> calibration rendezvous, but that is a wildly different can of worms to fix.)
>> Generic scheduling code is called from interrupt contexts -- namely,
>> vcpu_wake()
> 
> There are a lot of codepaths, but I can't see one which is definitely
> called with interrupts disabled.  (OTOH, I can see several where
> interrupts are definitely enabled).

Oh, I think I misunderstood you.  You meant, "No codepaths *calling
into* generic scheduling code should need interrupts disabled at all".
I can certainly believe that to be true in most cases; there's no sense
in saving the flags if we don't need to.

 -George

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 16:46         ` George Dunlap
@ 2015-10-08 17:23           ` Andrew Cooper
  2015-10-08 20:44             ` Dario Faggioli
  2015-10-12  9:44             ` George Dunlap
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Cooper @ 2015-10-08 17:23 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 08/10/15 17:46, George Dunlap wrote:
> On 08/10/15 16:20, Andrew Cooper wrote:
>> On 08/10/15 15:58, George Dunlap wrote:
>>> On 29/09/15 18:31, Andrew Cooper wrote:
>>>> On 29/09/15 17:55, Dario Faggioli wrote:
>>>>> The insert_vcpu() scheduler hook is called with an
>>>>> inconsistent locking strategy. In fact, it is sometimes
>>>>> invoked while holding the runqueue lock and sometimes
>>>>> when that is not the case.
>>>>>
>>>>> In other words, some call sites seem to imply that
>>>>> locking should be handled in the callers, in schedule.c
>>>>> --e.g., in schedule_cpu_switch(), which acquires the
>>>>> runqueue lock before calling the hook; others that
>>>>> specific schedulers should be responsible for locking
>>>>> themselves --e.g., in sched_move_domain(), which does
>>>>> not acquire any lock for calling the hook.
>>>>>
>>>>> The right thing to do seems to always defer locking to
>>>>> the specific schedulers, as it's them that know what, how
>>>>> and when it is best to lock (as in: runqueue locks, vs.
>>>>> private scheduler locks, vs. both, etc.)
>>>>>
>>>>> This patch, therefore:
>>>>>  - removes any locking around insert_vcpu() from
>>>>>    generic code (schedule.c);
>>>>>  - adds the _proper_ locking in the hook implementations,
>>>>>    depending on the scheduler (for instance, credit2
>>>>>    does that already, credit1 and RTDS need to grab
>>>>>    the runqueue lock while manipulating runqueues).
>>>>>
>>>>> In the case of credit1, remove_vcpu() handling needs some
>>>>> fixing too, i.e.:
>>>>>  - it manipulates runqueues, so the runqueue lock must
>>>>>    be acquired;
>>>>>  - *_lock_irq() is enough, there is no need to do
>>>>>    _irqsave()
>>>> Nothing in any of generic scheduling code should need interrupts
>>>> disabled at all.
>>>>
>>>> One of the problem-areas identified by Jenny during the ticketlock
>>>> performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
>>>> time with interrupts disabled.  (The other large one being the time
>>>> calibration rendezvous, but that is a wildly different can of worms to fix.)
>>> Generic scheduling code is called from interrupt contexts -- namely,
>>> vcpu_wake()
>> There are a lot of codepaths, but I can't see one which is definitely
>> called with interrupts disabled.  (OTOH, I can see several where
>> interrupts are definitely enabled).
> Oh, I think I misunderstood you.  You meant, "No codepaths *calling
> into* generic scheduling code should need interrupts disabled at all".
> I can certainly believe that to be true in most cases; there's no sense
> in saving the flags if we don't need to.

My original statement came from the observation that schedule() runs
with interrupts disabled, and takes between 2.2 and 4 microseconds to
run (as measured during the ticketlock performance analysis).

It is the biggest consumer of time with interrupts disabled, next being
the time calibration rendezvous.

I am going to go out on a limb and say that the majority of that time
does not need to be spent with interrupts disabled.  I might easily be
wrong, but I suspect I am not.

~Andrew

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 15:20       ` Andrew Cooper
  2015-10-08 16:46         ` George Dunlap
@ 2015-10-08 20:39         ` Dario Faggioli
  2015-10-09 13:05           ` Andrew Cooper
  1 sibling, 1 reply; 48+ messages in thread
From: Dario Faggioli @ 2015-10-08 20:39 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap, xen-devel; +Cc: George Dunlap, Meng Xu


[-- Attachment #1.1: Type: text/plain, Size: 1904 bytes --]

On Thu, 2015-10-08 at 16:20 +0100, Andrew Cooper wrote:
> On 08/10/15 15:58, George Dunlap wrote:

> > Generic scheduling code is called from interrupt contexts --
> > namely,
> > vcpu_wake()
> 
> There are a lot of codepaths, but I can't see one which is definitely
> called with interrupts disabled.  (OTOH, I can see several where
> interrupts are definitely enabled).
> 
Sorry, it's certainly me, but I don't think I understand this.

AFAICT, you are saying that you cannot find, in the code, situations
where the scheduler code is invoked with interrupts already disabled;
is this correct? In particular, "definitely called with interrupts
disabled" is what confuses me... :-/

Also (assuming what I just said is what you meant), are you referring
"just" to schedule(), or even to other scheduler's code, which also
disables interrupt when taking the lock(s) it needs, like, e.g.,
vcpu_wake() as George said?

> > , which for the credit scheduler wants to put things on the
> > runqueue.  Lock taken in interrupt context => interrupts must be
> > disabled whenever taking the lock, yes?
> 
> Correct, which is the purpose of the ASSERT()s in the _irq() and
> _irqsave() variants.
> 
What ASSERT()? I don't find any assert in _spin_lock_irqsave() (which
actually makes sense to me):

unsigned long _spin_lock_irqsave(spinlock_t *lock)
{
    unsigned long flags;

    local_irq_save(flags);
    _spin_lock(lock);
    return flags;
}

Note that, again, I'm not just being nitpicking, for the sake of it...
I'm trying to understand, as I'm interesting in trying making things
better. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 17:23           ` Andrew Cooper
@ 2015-10-08 20:44             ` Dario Faggioli
  2015-10-12  9:44             ` George Dunlap
  1 sibling, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-08 20:44 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap, xen-devel; +Cc: George Dunlap, Meng Xu


[-- Attachment #1.1: Type: text/plain, Size: 1675 bytes --]

On Thu, 2015-10-08 at 18:23 +0100, Andrew Cooper wrote:
> On 08/10/15 17:46, George Dunlap wrote:
> > On 08/10/15 16:20, Andrew Cooper wrote:

> > > There are a lot of codepaths, but I can't see one which is
> > > definitely
> > > called with interrupts disabled.  (OTOH, I can see several where
> > > interrupts are definitely enabled).
> > Oh, I think I misunderstood you.  You meant, "No codepaths *calling
> > into* generic scheduling code should need interrupts disabled at
> > all".
> > I can certainly believe that to be true in most cases; there's no
> > sense
> > in saving the flags if we don't need to.
> 
> My original statement came from the observation that schedule() runs
> with interrupts disabled, and takes between 2.2 and 4 microseconds to
> run (as measured during the ticketlock performance analysis).
> 
Ok, this part, I got, I think.

BTW, do you still have the scripts/harness/whatever to perform such
analysis? I'm asking because I'm about to give it a try at reducing the
amount of time during which the scheduler keeps interrupts disabled.

If I manage, it'd be nice to retry it and get updated numbers.

> I am going to go out on a limb and say that the majority of that time
> does not need to be spent with interrupts disabled.  I might easily
> be
> wrong, but I suspect I am not.
> 
I'll have a look and let you know what I find.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 20:39         ` Dario Faggioli
@ 2015-10-09 13:05           ` Andrew Cooper
  2015-10-09 16:56             ` Dario Faggioli
  0 siblings, 1 reply; 48+ messages in thread
From: Andrew Cooper @ 2015-10-09 13:05 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap, xen-devel; +Cc: George Dunlap, Meng Xu

On 08/10/15 21:39, Dario Faggioli wrote:
> On Thu, 2015-10-08 at 16:20 +0100, Andrew Cooper wrote:
>> On 08/10/15 15:58, George Dunlap wrote:
>>> Generic scheduling code is called from interrupt contexts --
>>> namely,
>>> vcpu_wake()
>> There are a lot of codepaths, but I can't see one which is definitely
>> called with interrupts disabled.  (OTOH, I can see several where
>> interrupts are definitely enabled).
>>
> Sorry, it's certainly me, but I don't think I understand this.
>
> AFAICT, you are saying that you cannot find, in the code, situations
> where the scheduler code is invoked with interrupts already disabled;
> is this correct? In particular, "definitely called with interrupts
> disabled" is what confuses me... :-/

Given a brief look at the code, I can't spot a codepath where it is
obvious that interrupts are disabled.

>
> Also (assuming what I just said is what you meant), are you referring
> "just" to schedule(), or even to other scheduler's code, which also
> disables interrupt when taking the lock(s) it needs, like, e.g.,
> vcpu_wake() as George said?

I was looking on callchains involving vcpu_wake().

>
>>> , which for the credit scheduler wants to put things on the
>>> runqueue.  Lock taken in interrupt context => interrupts must be
>>> disabled whenever taking the lock, yes?
>> Correct, which is the purpose of the ASSERT()s in the _irq() and
>> _irqsave() variants.
>>
> What ASSERT()? I don't find any assert in _spin_lock_irqsave() (which
> actually makes sense to me):

Ah yes - _irqsave() wouldn't want an assertion.

~Andrew

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-09 13:05           ` Andrew Cooper
@ 2015-10-09 16:56             ` Dario Faggioli
  0 siblings, 0 replies; 48+ messages in thread
From: Dario Faggioli @ 2015-10-09 16:56 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap, xen-devel; +Cc: George Dunlap, Meng Xu


[-- Attachment #1.1: Type: text/plain, Size: 2133 bytes --]

On Fri, 2015-10-09 at 14:05 +0100, Andrew Cooper wrote:
> On 08/10/15 21:39, Dario Faggioli wrote:
> > On Thu, 2015-10-08 at 16:20 +0100, Andrew Cooper wrote:

> > > On 08/10/15 15:58, George Dunlap wrote:
> > > > Generic scheduling code is called from interrupt contexts --
> > > > namely,
> > > > vcpu_wake()
> > > There are a lot of codepaths, but I can't see one which is
> > > definitely
> > > called with interrupts disabled.  (OTOH, I can see several where
> > > interrupts are definitely enabled).
> > > 
> > Sorry, it's certainly me, but I don't think I understand this.
> > 
> > AFAICT, you are saying that you cannot find, in the code,
> > situations where the scheduler code is invoked with interrupts
> > already disabled; is this correct? In particular, "definitely
> > called with interrupts disabled" is what confuses me... :-/
> 
> Given a brief look at the code, I can't spot a codepath where it is
> obvious that interrupts are disabled.
> 
Ok, thanks for rephrasing it (and for bearing with me :-).

> > Also (assuming what I just said is what you meant), are you
> > referring
> > "just" to schedule(), or even to other scheduler's code, which also
> > disables interrupt when taking the lock(s) it needs, like, e.g.,
> > vcpu_wake() as George said?
> 
> I was looking on callchains involving vcpu_wake().
> 
Ok, thanks for clarifying this too.

AFAIUI for now, the issue is more whether we are in interrupt context,
rather than whether or not interrupts are disabled already. That is,
whether or not, by not disabling interrupts when taking the
{v,p}cpu_schedule_lock(), we can deadlock a pCPU (and, in case that can
happen, how to avoid it).

Anyway, I think this is interesting (it is for me, at least :-)) and
worthwhile enough to investigate a bit more.

I'll dig and report.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent
  2015-10-08 17:23           ` Andrew Cooper
  2015-10-08 20:44             ` Dario Faggioli
@ 2015-10-12  9:44             ` George Dunlap
  1 sibling, 0 replies; 48+ messages in thread
From: George Dunlap @ 2015-10-12  9:44 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli, xen-devel; +Cc: George Dunlap, Meng Xu

On 08/10/15 18:23, Andrew Cooper wrote:
> On 08/10/15 17:46, George Dunlap wrote:
>> On 08/10/15 16:20, Andrew Cooper wrote:
>>> On 08/10/15 15:58, George Dunlap wrote:
>>>> On 29/09/15 18:31, Andrew Cooper wrote:
>>>>> On 29/09/15 17:55, Dario Faggioli wrote:
>>>>>> The insert_vcpu() scheduler hook is called with an
>>>>>> inconsistent locking strategy. In fact, it is sometimes
>>>>>> invoked while holding the runqueue lock and sometimes
>>>>>> when that is not the case.
>>>>>>
>>>>>> In other words, some call sites seem to imply that
>>>>>> locking should be handled in the callers, in schedule.c
>>>>>> --e.g., in schedule_cpu_switch(), which acquires the
>>>>>> runqueue lock before calling the hook; others that
>>>>>> specific schedulers should be responsible for locking
>>>>>> themselves --e.g., in sched_move_domain(), which does
>>>>>> not acquire any lock for calling the hook.
>>>>>>
>>>>>> The right thing to do seems to always defer locking to
>>>>>> the specific schedulers, as it's them that know what, how
>>>>>> and when it is best to lock (as in: runqueue locks, vs.
>>>>>> private scheduler locks, vs. both, etc.)
>>>>>>
>>>>>> This patch, therefore:
>>>>>>  - removes any locking around insert_vcpu() from
>>>>>>    generic code (schedule.c);
>>>>>>  - adds the _proper_ locking in the hook implementations,
>>>>>>    depending on the scheduler (for instance, credit2
>>>>>>    does that already, credit1 and RTDS need to grab
>>>>>>    the runqueue lock while manipulating runqueues).
>>>>>>
>>>>>> In the case of credit1, remove_vcpu() handling needs some
>>>>>> fixing too, i.e.:
>>>>>>  - it manipulates runqueues, so the runqueue lock must
>>>>>>    be acquired;
>>>>>>  - *_lock_irq() is enough, there is no need to do
>>>>>>    _irqsave()
>>>>> Nothing in any of generic scheduling code should need interrupts
>>>>> disabled at all.
>>>>>
>>>>> One of the problem-areas identified by Jenny during the ticketlock
>>>>> performance work was that the SCHEDULE_SOFTIRQ was a large consumer of
>>>>> time with interrupts disabled.  (The other large one being the time
>>>>> calibration rendezvous, but that is a wildly different can of worms to fix.)
>>>> Generic scheduling code is called from interrupt contexts -- namely,
>>>> vcpu_wake()
>>> There are a lot of codepaths, but I can't see one which is definitely
>>> called with interrupts disabled.  (OTOH, I can see several where
>>> interrupts are definitely enabled).
>> Oh, I think I misunderstood you.  You meant, "No codepaths *calling
>> into* generic scheduling code should need interrupts disabled at all".
>> I can certainly believe that to be true in most cases; there's no sense
>> in saving the flags if we don't need to.
> 
> My original statement came from the observation that schedule() runs
> with interrupts disabled, and takes between 2.2 and 4 microseconds to
> run (as measured during the ticketlock performance analysis).
> 
> It is the biggest consumer of time with interrupts disabled, next being
> the time calibration rendezvous.
> 
> I am going to go out on a limb and say that the majority of that time
> does not need to be spent with interrupts disabled.  I might easily be
> wrong, but I suspect I am not.

It's certainly worth taking a look at -- in particular, as (if I recall
correctly) we grab the schedule lock, then release it briefly, then grab
it again for the context switch.

Two things relate to irqs and the schedule / context-switch path.  One
we've already covered: calling vcpu_wake from within an interrupt
context.  The second is what might be called the "idle race": we need
interrupts disabled from the time we last check for softirqs until we
actually return to user mode.  But that's only a few dozen instructions
in most cases.

It might be possible to break things down into two locks -- one for
general schedule data structures, which would not be allowed to be
called from within an interrupt context, and one specifically to be used
for vcpu_wake (i.e., protecting manipulations to the actual runqueue)
which would have to be called with interrupts off.  But the generic
scheduling framework might make that a bit more tricky to get right.

 -George


end of thread, other threads:[~2015-10-12  9:44 UTC | newest]

Thread overview: 48+ messages
2015-09-29 16:55 [PATCH 0/9] xen: sched: improve (a lot! :-D) Credit2 runqueue handling Dario Faggioli
2015-09-29 16:55 ` [PATCH 1/9] xen: sched: fix an 'off by one \t' in credit2 debug dump Dario Faggioli
2015-10-01  5:22   ` Juergen Gross
2015-10-08 14:09   ` George Dunlap
2015-09-29 16:55 ` [PATCH 2/9] xen: sched: improve scope and placement of credit2 boot parameters Dario Faggioli
2015-10-01  5:23   ` Juergen Gross
2015-10-01  7:51   ` Jan Beulich
2015-10-01  8:17     ` Dario Faggioli
2015-09-29 16:55 ` [PATCH 3/9] xen: sched: make locking for {insert, remove}_vcpu consistent Dario Faggioli
2015-09-29 17:31   ` Andrew Cooper
2015-09-29 21:40     ` Dario Faggioli
2015-09-29 21:56       ` Dario Faggioli
2015-09-30  9:00       ` Andrew Cooper
2015-10-08 14:58     ` George Dunlap
2015-10-08 15:20       ` Andrew Cooper
2015-10-08 16:46         ` George Dunlap
2015-10-08 17:23           ` Andrew Cooper
2015-10-08 20:44             ` Dario Faggioli
2015-10-12  9:44             ` George Dunlap
2015-10-08 20:39         ` Dario Faggioli
2015-10-09 13:05           ` Andrew Cooper
2015-10-09 16:56             ` Dario Faggioli
2015-10-01  8:03   ` Jan Beulich
2015-10-01 11:59     ` Dario Faggioli
2015-09-29 16:55 ` [PATCH 4/9] xen: sched: add .init_pdata hook to the scheduler interface Dario Faggioli
2015-10-01  5:21   ` Juergen Gross
2015-10-01  6:33     ` Dario Faggioli
2015-10-01  7:43       ` Juergen Gross
2015-10-01  9:32         ` Andrew Cooper
2015-10-01  9:40           ` Dario Faggioli
2015-10-01  8:17   ` Jan Beulich
2015-10-01  9:26     ` Dario Faggioli
2015-10-01 10:12       ` Jan Beulich
2015-10-01 10:35         ` Dario Faggioli
2015-10-01 10:47           ` Jan Beulich
2015-09-29 16:56 ` [PATCH 5/9] xen: sched: make implementing .alloc_pdata optional Dario Faggioli
2015-10-01  5:28   ` Juergen Gross
2015-10-01  6:35     ` Dario Faggioli
2015-10-01  7:49   ` Jan Beulich
2015-10-01  8:13     ` Dario Faggioli
2015-09-29 16:56 ` [PATCH 6/9] xen: sched: implement .init_pdata in all schedulers Dario Faggioli
2015-09-29 16:56 ` [PATCH 7/9] xen: sched: fix per-socket runqueue creation in credit2 Dario Faggioli
2015-09-29 16:56 ` [PATCH 8/9] xen: sched: allow for choosing credit2 runqueues configuration at boot Dario Faggioli
2015-10-01  5:48   ` Juergen Gross
2015-10-01  7:23     ` Dario Faggioli
2015-10-01  7:46       ` Juergen Gross
2015-09-29 16:56 ` [PATCH 9/9] xen: sched: per-core runqueues as default in credit2 Dario Faggioli
2015-10-01  5:48   ` Juergen Gross
