* [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support
@ 2019-09-30  5:21 Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
                   ` (18 more replies)
  0 siblings, 19 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Robert VanVossen, Tim Deegan, Julien Grall, Josh Whitehead,
	Meng Xu, Jan Beulich, Dario Faggioli, Volodymyr Babchuk,
	Roger Pau Monné

Add support for core- and socket-scheduling in the Xen hypervisor.

Via the boot parameter sched-gran=core (or sched-gran=socket) it is
possible to change the scheduling granularity from cpu (the default)
to whole cores or even sockets.
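
For example, enabling core granularity boils down to adding the
parameter to Xen's boot line. A minimal GRUB2 sketch (kernel paths and
the elided options are illustrative only):

  multiboot2 /boot/xen.gz ... sched-gran=core
  module2 /boot/vmlinuz ...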

All logical cpus (threads) of the core or socket are always scheduled
together. This means that at any point in time only vcpus of the same
domain will be active on a core, and those vcpus will always be
scheduled at the same time.

This is achieved by switching the scheduler to no longer see vcpus as
the primary object to schedule, but "schedule units". Each schedule
unit consists of as many vcpus as each core has threads on the current
system. The vcpu->unit relation is fixed.
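
As a minimal sketch of that fixed relation (vcpu_to_unit_index() is a
hypothetical helper for illustration, not part of the series): with a
granularity of 2, i.e. core scheduling on 2-thread cores, vcpus 0 and
1 form the first unit, vcpus 2 and 3 the second, and so on.

    /* Hypothetical illustration of the fixed vcpu -> unit mapping. */
    static inline unsigned int vcpu_to_unit_index(unsigned int vcpu_id,
                                                  unsigned int granularity)
    {
        return vcpu_id / granularity;
    }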

I have done some very basic performance testing: on a 4 cpu system
(2 cores with 2 threads each) I did a "make -j 4" build of the Xen
hypervisor. This test has been run in dom0, once with no other guest
active and once with another guest with 4 vcpus running the same
test. The results are (always elapsed time, system time, user time):

sched-gran=cpu,    no other guest: 116.10 177.65 207.84
sched-gran=core,   no other guest: 114.04 175.47 207.45
sched-gran=cpu,    other guest:    202.30 334.21 384.63
sched-gran=core,   other guest:    207.24 293.04 371.37

The performance tests have been performed with credit2; the other
schedulers have been tested only briefly, verifying that a domain can
be created in a cpupool using each of them.

Cpupools have been moderately tested (cpu add/remove, create, destroy,
move domain).

Cpu on-/offlining has been moderately tested, too.

The series is based on:
"xen/sched: rework and rename vcpu_force_reschedule()"
which has been split off from V2, and:
"xen/sched: fix locking in a653sched_free_vdata()"
which fixes a problem detected during review of V3.

The complete patch series (plus prerequisite patches and some
debugging additions in the form of extra patches) is available under:

  git://github.com/jgross1/xen/ sched-v5

Changes in V5:
- dropped patches 1-27 as they already went in
- added comments in 2 patches

Changes in V4:
- comments addressed
- former patch 36 merged into patch 32

Changes in V3:
- comments addressed
- former patch 26 carved out and sent separately
- some minor bugs fixed

Changes in V2:
- comments addressed
- some patches merged into one
- idle scheduler related patches split off to own series
- some patches are already applied
- some bugs fixed (e.g. crashes when powering off)

Changes in V1:
- cpupools are working now
- cpu on-/offlining working now
- all schedulers working now
- renamed "items" to "units"
- introduction of "idle scheduler"
- several new patches (see individual patches, mostly splits of
  former patches or cpupool and cpu on-/offlining support)
- all review comments addressed
- some minor changes (see individual patches)

Changes in RFC V2:
- ARM is building now
- HVM domains are working now
- idling will always be done with idle_vcpu active
- other small changes see individual patches

Juergen Gross (19):
  xen/sched: add code to sync scheduling of all vcpus of a sched unit
  xen/sched: introduce unit_runnable_state()
  xen/sched: add support for multiple vcpus per sched unit where missing
  xen/sched: modify cpupool_domain_cpumask() to be an unit mask
  xen/sched: support allocating multiple vcpus into one sched unit
  xen/sched: add a percpu resource index
  xen/sched: add fall back to idle vcpu when scheduling unit
  xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware
  xen/sched: move per-cpu variable scheduler to struct sched_resource
  xen/sched: move per-cpu variable cpupool to struct sched_resource
  xen/sched: reject switching smt on/off with core scheduling active
  xen/sched: prepare per-cpupool scheduling granularity
  xen/sched: split schedule_cpu_switch()
  xen/sched: protect scheduling resource via rcu
  xen/sched: support multiple cpus per scheduling resource
  xen/sched: support differing granularity in schedule_cpu_[add/rm]()
  xen/sched: support core scheduling for moving cpus to/from cpupools
  xen/sched: disable scheduling when entering ACPI deep sleep states
  xen/sched: add scheduling granularity enum

 xen/arch/arm/domain.c         |    2 +-
 xen/arch/x86/Kconfig          |    1 +
 xen/arch/x86/acpi/power.c     |    4 +
 xen/arch/x86/domain.c         |   26 +-
 xen/arch/x86/sysctl.c         |    5 +
 xen/common/Kconfig            |    3 +
 xen/common/cpupool.c          |  232 ++++++--
 xen/common/domain.c           |    8 +-
 xen/common/domctl.c           |    2 +-
 xen/common/sched_arinc653.c   |    4 +-
 xen/common/sched_credit.c     |   73 +--
 xen/common/sched_credit2.c    |   32 +-
 xen/common/sched_null.c       |   11 +-
 xen/common/sched_rt.c         |   18 +-
 xen/common/schedule.c         | 1300 +++++++++++++++++++++++++++++++++--------
 xen/common/softirq.c          |    6 +-
 xen/include/asm-arm/current.h |    1 +
 xen/include/asm-x86/current.h |   19 +-
 xen/include/asm-x86/smp.h     |    7 +
 xen/include/xen/sched-if.h    |   86 ++-
 xen/include/xen/sched.h       |   26 +-
 xen/include/xen/softirq.h     |    1 +
 22 files changed, 1504 insertions(+), 363 deletions(-)

-- 
2.16.4



* [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30 10:36   ` Jan Beulich
  2019-09-30 10:54   ` Jan Beulich
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state() Juergen Gross
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli,
	Volodymyr Babchuk, Roger Pau Monné

When switching sched units synchronize all vcpus of the new unit to be
scheduled at the same time.

A variable sched_granularity is added which holds the number of vcpus
per schedule unit.

As tasklets require the idle unit to be scheduled, the
tasklet_work_scheduled parameter of do_schedule() must be set to true
if any cpu covered by the current schedule() call has pending tasklet
work.

For joining the other vcpus of a schedule unit a new softirq
SCHED_SLAVE_SOFTIRQ is added in order to have a way to initiate a
context switch without calling the generic schedule() function to
select the vcpu to switch to, as we already know which vcpu we want
to run. This has the additional advantage of not losing any
concurrent SCHEDULE_SOFTIRQ events.
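
Below is a minimal standalone sketch of the two rendezvous counters
used by this patch (illustrative only; pick_next_unit() and
context_saved_prev() are hypothetical placeholders, and unlike this
sketch the real schedule.c code drops and re-takes the schedule lock
while spinning):

    #include <stdatomic.h>

    void pick_next_unit(void);       /* hypothetical placeholder */
    void context_saved_prev(void);   /* hypothetical placeholder */

    static unsigned int rendezvous_in;  /* protected by the schedule lock */
    static atomic_uint rendezvous_out;  /* lock-free on the way out */

    static void rendezvous_enter(unsigned int ncpus)
    {
        if ( --rendezvous_in == 0 )     /* last cpu entering ... */
        {
            pick_next_unit();           /* ... takes the decision */
            atomic_store(&rendezvous_out, ncpus + 1);
        }
        else
            while ( rendezvous_in )     /* others wait for the decision */
                ;
    }

    static void rendezvous_leave(void)
    {
        /* A previous value of 2 means we just decremented it to 1: last. */
        if ( atomic_fetch_sub(&rendezvous_out, 1) == 2 )
        {
            context_saved_prev();       /* final bookkeeping */
            atomic_store(&rendezvous_out, 0); /* release the waiters */
        }
        else
            while ( atomic_load(&rendezvous_out) )
                ;
    }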

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
RFC V2:
- move syncing after context_switch() to schedule.c
V2:
- don't run tasklets directly from sched_wait_rendezvous_in()
V3:
- adapt array size in sched_move_domain() (Jan Beulich)
- int -> unsigned int (Jan Beulich)
V4:
- renamed sd to sr in several places (Jan Beulich)
- swap stop_timer() and NOW() calls (Jan Beulich)
- context_switch() on ARM returns - handle that (Jan Beulich)
---
 xen/arch/arm/domain.c      |   2 +-
 xen/arch/x86/domain.c      |   3 +-
 xen/common/schedule.c      | 353 +++++++++++++++++++++++++++++++++++----------
 xen/common/softirq.c       |   6 +-
 xen/include/xen/sched-if.h |   1 +
 xen/include/xen/sched.h    |  16 +-
 xen/include/xen/softirq.h  |   1 +
 7 files changed, 294 insertions(+), 88 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index f0ee5a2140..460e968e97 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -318,7 +318,7 @@ static void schedule_tail(struct vcpu *prev)
 
     local_irq_enable();
 
-    context_saved(prev);
+    sched_context_switched(prev, current);
 
     update_runstate_area(current);
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index c7fa224c89..27f99d3bcc 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1784,7 +1784,6 @@ static void __context_switch(void)
     per_cpu(curr_vcpu, cpu) = n;
 }
 
-
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
@@ -1860,7 +1859,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         }
     }
 
-    context_saved(prev);
+    sched_context_switched(prev, next);
 
     _update_runstate_area(next);
     /* Must be done with interrupts enabled */
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 4711ece1ef..ff67fb3633 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -61,6 +61,9 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
+/* Number of vcpus per struct sched_unit. */
+static unsigned int __read_mostly sched_granularity = 1;
+
 /* Common lock for free cpus. */
 static DEFINE_SPINLOCK(sched_free_cpu_lock);
 
@@ -532,8 +535,8 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
     if ( IS_ERR(domdata) )
         return PTR_ERR(domdata);
 
-    /* TODO: fix array size with multiple vcpus per unit. */
-    unit_priv = xzalloc_array(void *, d->max_vcpus);
+    unit_priv = xzalloc_array(void *,
+                              DIV_ROUND_UP(d->max_vcpus, sched_granularity));
     if ( unit_priv == NULL )
     {
         sched_free_domdata(c->sched, domdata);
@@ -1714,133 +1717,325 @@ void vcpu_set_periodic_timer(struct vcpu *v, s_time_t value)
     spin_unlock(&v->periodic_timer_lock);
 }
 
-/*
- * The main function
- * - deschedule the current domain (scheduler independent).
- * - pick a new domain (scheduler dependent).
- */
-static void schedule(void)
+static void sched_switch_units(struct sched_resource *sr,
+                               struct sched_unit *next, struct sched_unit *prev,
+                               s_time_t now)
 {
-    struct sched_unit    *prev = current->sched_unit, *next = NULL;
-    s_time_t              now;
-    struct scheduler     *sched;
-    unsigned long        *tasklet_work = &this_cpu(tasklet_work_to_do);
-    bool                  tasklet_work_scheduled = false;
-    struct sched_resource *sd;
-    spinlock_t           *lock;
-    int cpu = smp_processor_id();
+    sr->curr = next;
 
-    ASSERT_NOT_IN_ATOMIC();
+    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
+             now - prev->state_entry_time);
+    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
+             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+             (now - next->state_entry_time) : 0, prev->next_time);
 
-    SCHED_STAT_CRANK(sched_run);
+    ASSERT(prev->vcpu_list->runstate.state == RUNSTATE_running);
+
+    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+             next->domain->domain_id, next->unit_id);
+
+    sched_unit_runstate_change(prev, false, now);
+
+    ASSERT(next->vcpu_list->runstate.state != RUNSTATE_running);
+    sched_unit_runstate_change(next, true, now);
 
-    sd = get_sched_res(cpu);
+    /*
+     * NB. Don't add any trace records from here until the actual context
+     * switch, else lost_records resume will not work properly.
+     */
+
+    ASSERT(!next->is_running);
+    next->vcpu_list->is_running = 1;
+    next->is_running = true;
+    next->state_entry_time = now;
+}
+
+static bool sched_tasklet_check_cpu(unsigned int cpu)
+{
+    unsigned long *tasklet_work = &per_cpu(tasklet_work_to_do, cpu);
 
-    /* Update tasklet scheduling status. */
     switch ( *tasklet_work )
     {
     case TASKLET_enqueued:
         set_bit(_TASKLET_scheduled, tasklet_work);
         /* fallthrough */
     case TASKLET_enqueued|TASKLET_scheduled:
-        tasklet_work_scheduled = true;
+        return true;
         break;
     case TASKLET_scheduled:
         clear_bit(_TASKLET_scheduled, tasklet_work);
+        /* fallthrough */
     case 0:
-        /*tasklet_work_scheduled = false;*/
+        /* return false; */
         break;
     default:
         BUG();
     }
 
-    lock = pcpu_schedule_lock_irq(cpu);
+    return false;
+}
 
-    now = NOW();
+static bool sched_tasklet_check(unsigned int cpu)
+{
+    bool tasklet_work_scheduled = false;
+    const cpumask_t *mask = get_sched_res(cpu)->cpus;
+    unsigned int cpu_iter;
+
+    for_each_cpu ( cpu_iter, mask )
+        if ( sched_tasklet_check_cpu(cpu_iter) )
+            tasklet_work_scheduled = true;
 
-    stop_timer(&sd->s_timer);
+    return tasklet_work_scheduled;
+}
+
+static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
+                                      unsigned int cpu)
+{
+    struct scheduler *sched = per_cpu(scheduler, cpu);
+    struct sched_resource *sr = get_sched_res(cpu);
+    struct sched_unit *next;
 
     /* get policy-specific decision on scheduling... */
-    sched = this_cpu(scheduler);
-    sched->do_schedule(sched, prev, now, tasklet_work_scheduled);
+    sched->do_schedule(sched, prev, now, sched_tasklet_check(cpu));
 
     next = prev->next_task;
 
-    sd->curr = next;
-
     if ( prev->next_time >= 0 ) /* -ve means no limit */
-        set_timer(&sd->s_timer, now + prev->next_time);
+        set_timer(&sr->s_timer, now + prev->next_time);
+
+    if ( likely(prev != next) )
+        sched_switch_units(sr, next, prev, now);
+
+    return next;
+}
+
+static void context_saved(struct vcpu *prev)
+{
+    struct sched_unit *unit = prev->sched_unit;
+
+    /* Clear running flag /after/ writing context to memory. */
+    smp_wmb();
+
+    prev->is_running = 0;
+    unit->is_running = false;
+    unit->state_entry_time = NOW();
+
+    /* Check for migration request /after/ clearing running flag. */
+    smp_mb();
+
+    sched_context_saved(vcpu_scheduler(prev), unit);
 
-    if ( unlikely(prev == next) )
+    sched_unit_migrate_finish(unit);
+}
+
+/*
+ * Rendezvous on end of context switch.
+ * As no lock is protecting this rendezvous function we need to use atomic
+ * access functions on the counter.
+ * The counter will be 0 in case no rendezvous is needed. For the rendezvous
+ * case it is initialised to the number of cpus to rendezvous plus 1. Each
+ * member entering decrements the counter. The last one will decrement it to
+ * 1 and perform the final needed action (calling context_saved() if the
+ * vcpu was switched), and then set the counter to zero. The other members
+ * will wait until the counter becomes zero before they proceed.
+ */
+void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
+{
+    struct sched_unit *next = vnext->sched_unit;
+
+    if ( atomic_read(&next->rendezvous_out_cnt) )
+    {
+        int cnt = atomic_dec_return(&next->rendezvous_out_cnt);
+
+        /* Call context_saved() before releasing other waiters. */
+        if ( cnt == 1 )
+        {
+            if ( vprev != vnext )
+                context_saved(vprev);
+            atomic_set(&next->rendezvous_out_cnt, 0);
+        }
+        else
+            while ( atomic_read(&next->rendezvous_out_cnt) )
+                cpu_relax();
+    }
+    else if ( vprev != vnext )
+        context_saved(vprev);
+}
+
+static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
+                                 s_time_t now)
+{
+    if ( unlikely(vprev == vnext) )
     {
-        pcpu_schedule_unlock_irq(lock, cpu);
         TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
-                 next->domain->domain_id, next->unit_id,
-                 now - prev->state_entry_time,
-                 prev->next_time);
-        trace_continue_running(next->vcpu_list);
-        return continue_running(prev->vcpu_list);
+                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
+                 now - vprev->runstate.state_entry_time,
+                 vprev->sched_unit->next_time);
+        sched_context_switched(vprev, vnext);
+        trace_continue_running(vnext);
+        return continue_running(vprev);
     }
 
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV,
-             prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT,
-             next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0,
-             prev->next_time);
+    SCHED_STAT_CRANK(sched_ctx);
 
-    ASSERT(prev->vcpu_list->runstate.state == RUNSTATE_running);
+    stop_timer(&vprev->periodic_timer);
 
-    TRACE_4D(TRC_SCHED_SWITCH,
-             prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( vnext->sched_unit->migrated )
+        vcpu_move_irqs(vnext);
 
-    sched_unit_runstate_change(prev, false, now);
+    vcpu_periodic_timer_work(vnext);
 
-    ASSERT(next->vcpu_list->runstate.state != RUNSTATE_running);
-    sched_unit_runstate_change(next, true, now);
+    context_switch(vprev, vnext);
+}
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+/*
+ * Rendezvous before taking a scheduling decision.
+ * Called with schedule lock held, so all accesses to the rendezvous counter
+ * can be normal ones (no atomic accesses needed).
+ * The counter is initialized to the number of cpus to rendezvous.
+ * Each cpu entering will decrement the counter. When the counter becomes
+ * zero do_schedule() is called and the rendezvous counter for leaving
+ * context_switch() is set. All other members will wait until the counter
+ * becomes zero, dropping the schedule lock in between.
+ */
+static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
+                                                   spinlock_t **lock, int cpu,
+                                                   s_time_t now)
+{
+    struct sched_unit *next;
 
-    ASSERT(!next->is_running);
-    next->vcpu_list->is_running = 1;
-    next->is_running = true;
-    next->state_entry_time = now;
+    if ( !--prev->rendezvous_in_cnt )
+    {
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        return next;
+    }
 
-    pcpu_schedule_unlock_irq(lock, cpu);
+    while ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * Coming from idle might need to do tasklet work.
+         * In order to avoid deadlocks we can't do that here, but have to
+         * continue the idle loop.
+         * Undo the rendezvous_in_cnt decrement and schedule another call of
+         * sched_slave().
+         */
+        if ( is_idle_unit(prev) && sched_tasklet_check_cpu(cpu) )
+        {
+            struct vcpu *vprev = current;
 
-    SCHED_STAT_CRANK(sched_ctx);
+            prev->rendezvous_in_cnt++;
+            atomic_set(&prev->rendezvous_out_cnt, 0);
+
+            pcpu_schedule_unlock_irq(*lock, cpu);
+
+            raise_softirq(SCHED_SLAVE_SOFTIRQ);
+            sched_context_switch(vprev, vprev, now);
+
+            return NULL;         /* ARM only. */
+        }
 
-    stop_timer(&prev->vcpu_list->periodic_timer);
+        pcpu_schedule_unlock_irq(*lock, cpu);
 
-    if ( next->migrated )
-        vcpu_move_irqs(next->vcpu_list);
+        cpu_relax();
 
-    vcpu_periodic_timer_work(next->vcpu_list);
+        *lock = pcpu_schedule_lock_irq(cpu);
+    }
 
-    context_switch(prev->vcpu_list, next->vcpu_list);
+    return prev->next_task;
 }
 
-void context_saved(struct vcpu *prev)
+static void sched_slave(void)
 {
-    /* Clear running flag /after/ writing context to memory. */
-    smp_wmb();
+    struct vcpu          *vprev = current;
+    struct sched_unit    *prev = vprev->sched_unit, *next;
+    s_time_t              now;
+    spinlock_t           *lock;
+    unsigned int          cpu = smp_processor_id();
 
-    prev->is_running = 0;
-    prev->sched_unit->is_running = false;
-    prev->sched_unit->state_entry_time = NOW();
+    ASSERT_NOT_IN_ATOMIC();
 
-    /* Check for migration request /after/ clearing running flag. */
-    smp_mb();
+    lock = pcpu_schedule_lock_irq(cpu);
 
-    sched_context_saved(vcpu_scheduler(prev), prev->sched_unit);
+    now = NOW();
+
+    if ( !prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
+        return;
+    }
+
+    stop_timer(&get_sched_res(cpu)->s_timer);
+
+    next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
+    if ( !next )
+        return;
+
+    pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_unit_migrate_finish(prev->sched_unit);
+    sched_context_switch(vprev, next->vcpu_list, now);
+}
+
+/*
+ * The main function
+ * - deschedule the current domain (scheduler independent).
+ * - pick a new domain (scheduler dependent).
+ */
+static void schedule(void)
+{
+    struct vcpu          *vnext, *vprev = current;
+    struct sched_unit    *prev = vprev->sched_unit, *next = NULL;
+    s_time_t              now;
+    struct sched_resource *sr;
+    spinlock_t           *lock;
+    int cpu = smp_processor_id();
+
+    ASSERT_NOT_IN_ATOMIC();
+
+    SCHED_STAT_CRANK(sched_run);
+
+    sr = get_sched_res(cpu);
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    if ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * We have a race: sched_slave() should be called, so raise a softirq
+         * in order to re-enter schedule() later and call sched_slave() now.
+         */
+        pcpu_schedule_unlock_irq(lock, cpu);
+
+        raise_softirq(SCHEDULE_SOFTIRQ);
+        return sched_slave();
+    }
+
+    stop_timer(&sr->s_timer);
+
+    now = NOW();
+
+    if ( sched_granularity > 1 )
+    {
+        cpumask_t mask;
+
+        prev->rendezvous_in_cnt = sched_granularity;
+        cpumask_andnot(&mask, sr->cpus, cpumask_of(cpu));
+        cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
+        next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
+        if ( !next )
+            return;
+    }
+    else
+    {
+        prev->rendezvous_in_cnt = 0;
+        next = do_schedule(prev, now, cpu);
+        atomic_set(&next->rendezvous_out_cnt, 0);
+    }
+
+    pcpu_schedule_unlock_irq(lock, cpu);
+
+    vnext = next->vcpu_list;
+    sched_context_switch(vprev, vnext, now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1881,6 +2076,7 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( sr == NULL )
         return -ENOMEM;
     sr->master_cpu = cpu;
+    sr->cpus = cpumask_of(cpu);
     set_sched_res(cpu, sr);
 
     per_cpu(scheduler, cpu) = &sched_idle_ops;
@@ -1901,6 +2097,8 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( idle_vcpu[cpu] == NULL )
         return -ENOMEM;
 
+    idle_vcpu[cpu]->sched_unit->rendezvous_in_cnt = 0;
+
     /*
      * No need to allocate any scheduler data, as cpus coming online are
      * free initially and the idle scheduler doesn't need any data areas
@@ -2001,6 +2199,7 @@ void __init scheduler_init(void)
     int i;
 
     open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 83c3c09bd5..2d66193203 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -33,8 +33,8 @@ static void __do_softirq(unsigned long ignore_mask)
     for ( ; ; )
     {
         /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
+         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ or
+         * SCHED_SLAVE_SOFTIRQ may move us to another processor.
          */
         cpu = smp_processor_id();
 
@@ -55,7 +55,7 @@ void process_pending_softirqs(void)
 {
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq(1ul<<SCHEDULE_SOFTIRQ);
+    __do_softirq((1ul << SCHEDULE_SOFTIRQ) | (1ul << SCHED_SLAVE_SOFTIRQ));
 }
 
 void do_softirq(void)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 0423be987d..c65dfa943b 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -42,6 +42,7 @@ struct sched_resource {
 
     /* Cpu with lowest id in scheduling resource. */
     unsigned int        master_cpu;
+    const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
 DECLARE_PER_CPU(struct scheduler *, scheduler);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ebf723a866..c770ab4aa0 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -292,6 +292,12 @@ struct sched_unit {
     /* Next unit to run. */
     struct sched_unit      *next_task;
     s_time_t                next_time;
+
+    /* Number of vcpus not yet joined for context switch. */
+    unsigned int            rendezvous_in_cnt;
+
+    /* Number of vcpus not yet finished with context switch. */
+    atomic_t                rendezvous_out_cnt;
 };
 
 #define for_each_sched_unit(d, u)                                         \
@@ -696,10 +702,10 @@ void sync_local_execstate(void);
 
 /*
  * Called by the scheduler to switch to another VCPU. This function must
- * call context_saved(@prev) when the local CPU is no longer running in
- * @prev's context, and that context is saved to memory. Alternatively, if
- * implementing lazy context switching, it suffices to ensure that invoking
- * sync_vcpu_execstate() will switch and commit @prev's state.
+ * call sched_context_switched(@prev, @next) when the local CPU is no longer
+ * running in @prev's context, and that context is saved to memory.
+ * Alternatively, if implementing lazy context switching, it suffices to ensure
+ * that invoking sync_vcpu_execstate() will switch and commit @prev's state.
  */
 void context_switch(
     struct vcpu *prev,
@@ -711,7 +717,7 @@ void context_switch(
  * saved to memory. Alternatively, if implementing lazy context switching,
  * ensure that invoking sync_vcpu_execstate() will switch and commit @prev.
  */
-void context_saved(struct vcpu *prev);
+void sched_context_switched(struct vcpu *prev, struct vcpu *vnext);
 
 /* Called by the scheduler to continue running the current VCPU. */
 void continue_running(
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c327c9b6cd..d7273b389b 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -4,6 +4,7 @@
 /* Low-latency softirqs come first in the following list. */
 enum {
     TIMER_SOFTIRQ = 0,
+    SCHED_SLAVE_SOFTIRQ,
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
-- 
2.16.4



* [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state()
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  7:22   ` Dario Faggioli
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing Juergen Gross
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Robert VanVossen, Tim Deegan, Julien Grall, Josh Whitehead,
	Meng Xu, Jan Beulich, Dario Faggioli

Today the vcpu runstate of a newly scheduled vcpu is always set to
"running" even if at that time vcpu_runnable() is already returning
false due to a race (e.g. with pausing the vcpu).

With core scheduling this can no longer work, as not all vcpus of a
schedule unit will necessarily be "running" when the unit is being
scheduled. So the vcpu's new runstate has to be selected at the same
time as the runnability of the related schedule unit is probed.

For this purpose introduce a new helper unit_runnable_state() which
will save the new runstate of all tested vcpus in a new field of the
vcpu struct.
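
A minimal sketch of the intended call pattern in a scheduler's
do_schedule() handler (pick_candidate() is a hypothetical placeholder
for the scheduler-specific selection logic):

    /* Retry until a candidate is found whose unit stays runnable;
     * unit_runnable_state() latches v->new_state for all of its vcpus
     * while the schedule lock is still held. */
    do {
        snext = pick_candidate(runq);
    } while ( !unit_runnable_state(snext->unit) );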

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2:
- new patch
V3:
- add vcpu loop to unit_runnable_state() right now instead of doing
  so in next patch (Jan Beulich, Dario Faggioli)
- make new_state unsigned int (Jan Beulich)
V4:
- add comment explaining unit_runnable_state() (Jan Beulich)
---
 xen/common/domain.c         |  1 +
 xen/common/sched_arinc653.c |  2 +-
 xen/common/sched_credit.c   | 49 ++++++++++++++++++++++++---------------------
 xen/common/sched_credit2.c  |  7 ++++---
 xen/common/sched_null.c     |  3 ++-
 xen/common/sched_rt.c       |  8 +++++++-
 xen/common/schedule.c       |  2 +-
 xen/include/xen/sched-if.h  | 30 +++++++++++++++++++++++++++
 xen/include/xen/sched.h     |  1 +
 9 files changed, 73 insertions(+), 30 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 601da28c9c..a9882509ed 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -157,6 +157,7 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
     if ( is_idle_domain(d) )
     {
         v->runstate.state = RUNSTATE_running;
+        v->new_state = RUNSTATE_running;
     }
     else
     {
diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index fcf81db19a..dd5876eacd 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -563,7 +563,7 @@ a653sched_do_schedule(
     if ( !((new_task != NULL)
            && (AUNIT(new_task) != NULL)
            && AUNIT(new_task)->awake
-           && unit_runnable(new_task)) )
+           && unit_runnable_state(new_task)) )
         new_task = IDLETASK(cpu);
     BUG_ON(new_task == NULL);
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 299eff21ac..00beac3ea4 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1894,7 +1894,7 @@ static void csched_schedule(
     if ( !test_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags)
          && !tasklet_work_scheduled
          && prv->ratelimit
-         && unit_runnable(unit)
+         && unit_runnable_state(unit)
          && !is_idle_unit(unit)
          && runtime < prv->ratelimit )
     {
@@ -1939,33 +1939,36 @@ static void csched_schedule(
         dec_nr_runnable(sched_cpu);
     }
 
-    snext = __runq_elem(runq->next);
-
-    /* Tasklet work (which runs in idle UNIT context) overrides all else. */
-    if ( tasklet_work_scheduled )
-    {
-        TRACE_0D(TRC_CSCHED_SCHED_TASKLET);
-        snext = CSCHED_UNIT(sched_idle_unit(sched_cpu));
-        snext->pri = CSCHED_PRI_TS_BOOST;
-    }
-
     /*
      * Clear YIELD flag before scheduling out
      */
     clear_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags);
 
-    /*
-     * SMP Load balance:
-     *
-     * If the next highest priority local runnable UNIT has already eaten
-     * through its credits, look on other PCPUs to see if we have more
-     * urgent work... If not, csched_load_balance() will return snext, but
-     * already removed from the runq.
-     */
-    if ( snext->pri > CSCHED_PRI_TS_OVER )
-        __runq_remove(snext);
-    else
-        snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+    do {
+        snext = __runq_elem(runq->next);
+
+        /* Tasklet work (which runs in idle UNIT context) overrides all else. */
+        if ( tasklet_work_scheduled )
+        {
+            TRACE_0D(TRC_CSCHED_SCHED_TASKLET);
+            snext = CSCHED_UNIT(sched_idle_unit(sched_cpu));
+            snext->pri = CSCHED_PRI_TS_BOOST;
+        }
+
+        /*
+         * SMP Load balance:
+         *
+         * If the next highest priority local runnable UNIT has already eaten
+         * through its credits, look on other PCPUs to see if we have more
+         * urgent work... If not, csched_load_balance() will return snext, but
+         * already removed from the runq.
+         */
+        if ( snext->pri > CSCHED_PRI_TS_OVER )
+            __runq_remove(snext);
+        else
+            snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+
+    } while ( !unit_runnable_state(snext->unit) );
 
     /*
      * Update idlers mask if necessary. When we're idling, other CPUs
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 87d142bbe4..0e29e56d5a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3291,7 +3291,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
      * In fact, it may be the case that scurr is about to spin, and there's
      * no point forcing it to do so until rate limiting expires.
      */
-    if ( !yield && prv->ratelimit_us && unit_runnable(scurr->unit) &&
+    if ( !yield && prv->ratelimit_us && unit_runnable_state(scurr->unit) &&
          (now - scurr->unit->state_entry_time) < MICROSECS(prv->ratelimit_us) )
     {
         if ( unlikely(tb_init_done) )
@@ -3345,7 +3345,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
      *
      * Of course, we also default to idle also if scurr is not runnable.
      */
-    if ( unit_runnable(scurr->unit) && !soft_aff_preempt )
+    if ( unit_runnable_state(scurr->unit) && !soft_aff_preempt )
         snext = scurr;
     else
         snext = csched2_unit(sched_idle_unit(cpu));
@@ -3405,7 +3405,8 @@ runq_candidate(struct csched2_runqueue_data *rqd,
          * some budget, then choose it.
          */
         if ( (yield || svc->credit > snext->credit) &&
-             (!has_cap(svc) || unit_grab_budget(svc)) )
+             (!has_cap(svc) || unit_grab_budget(svc)) &&
+             unit_runnable_state(svc->unit) )
             snext = svc;
 
         /* In any case, if we got this far, break. */
diff --git a/xen/common/sched_null.c b/xen/common/sched_null.c
index 80a7d45935..3dde1dcd00 100644
--- a/xen/common/sched_null.c
+++ b/xen/common/sched_null.c
@@ -864,7 +864,8 @@ static void null_schedule(const struct scheduler *ops, struct sched_unit *prev,
             cpumask_set_cpu(sched_cpu, &prv->cpus_free);
     }
 
-    if ( unlikely(prev->next_task == NULL || !unit_runnable(prev->next_task)) )
+    if ( unlikely(prev->next_task == NULL ||
+                  !unit_runnable_state(prev->next_task)) )
         prev->next_task = sched_idle_unit(sched_cpu);
 
     NULL_UNIT_CHECK(prev->next_task);
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index cfd7d334fa..fd882f2ca4 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -1092,12 +1092,18 @@ rt_schedule(const struct scheduler *ops, struct sched_unit *currunit,
     else
     {
         snext = runq_pick(ops, cpumask_of(sched_cpu));
+
         if ( snext == NULL )
             snext = rt_unit(sched_idle_unit(sched_cpu));
+        else if ( !unit_runnable_state(snext->unit) )
+        {
+            q_remove(snext);
+            snext = rt_unit(sched_idle_unit(sched_cpu));
+        }
 
         /* if scurr has higher priority and budget, still pick scurr */
         if ( !is_idle_unit(currunit) &&
-             unit_runnable(currunit) &&
+             unit_runnable_state(currunit) &&
              scurr->cur_budget > 0 &&
              ( is_idle_unit(snext->unit) ||
                compare_unit_priority(scurr, snext) > 0 ) )
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index ff67fb3633..9c1b044b49 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -280,7 +280,7 @@ static inline void sched_unit_runstate_change(struct sched_unit *unit,
     for_each_sched_unit_vcpu ( unit, v )
     {
         if ( running )
-            vcpu_runstate_change(v, RUNSTATE_running, new_entry_time);
+            vcpu_runstate_change(v, v->new_state, new_entry_time);
         else
             vcpu_runstate_change(v,
                 ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index c65dfa943b..7e568a9d9f 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -93,6 +93,36 @@ static inline bool unit_runnable(const struct sched_unit *unit)
     return false;
 }
 
+/*
+ * Returns whether a sched_unit is runnable and sets new_state for each of its
+ * vcpus. It is mandatory to determine the new runstate for all vcpus of a unit
+ * without dropping the schedule lock (which happens when synchronizing the
+ * context switch of the vcpus of a unit) in order to avoid races with e.g.
+ * vcpu_sleep().
+ */
+static inline bool unit_runnable_state(const struct sched_unit *unit)
+{
+    struct vcpu *v;
+    bool runnable, ret = false;
+
+    if ( is_idle_unit(unit) )
+        return true;
+
+    for_each_sched_unit_vcpu ( unit, v )
+    {
+        runnable = vcpu_runnable(v);
+
+        v->new_state = runnable ? RUNSTATE_running
+                                : (v->pause_flags & VPF_blocked)
+                                  ? RUNSTATE_blocked : RUNSTATE_offline;
+
+        if ( runnable )
+            ret = true;
+    }
+
+    return ret;
+}
+
 static inline void sched_set_res(struct sched_unit *unit,
                                  struct sched_resource *res)
 {
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index c770ab4aa0..12f00cd78d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -174,6 +174,7 @@ struct vcpu
         XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
     } runstate_guest; /* guest address */
 #endif
+    unsigned int     new_state;
 
     /* Has the FPU been initialised? */
     bool             fpu_initialised;
-- 
2.16.4



* [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state() Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30 10:41   ` Jan Beulich
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 04/19] xen/sched: modify cpupool_domain_cpumask() to be an unit mask Juergen Gross
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli

Several places are still missing support for multiple vcpus per sched
unit. Add that missing support (with the exception of initial
allocation) and the missing helpers for it.
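
One of those changes, sched_set_res(), assigns the i-th vcpu of a unit
to the i-th cpu of its scheduling resource. A toy standalone model of
that distribution (plain bitmasks stand in for cpumask_t; the real
code in the diff below walks res->cpus with cpumask_first() and
cpumask_next()):

    #include <stdio.h>

    int main(void)
    {
        unsigned int res_cpus = 0x6;  /* resource covers cpus 1 and 2 */
        unsigned int nr_vcpus = 2;    /* vcpus of one sched unit */
        unsigned int cpu = 0;

        for ( unsigned int v = 0; v < nr_vcpus; v++, cpu++ )
        {
            while ( !(res_cpus & (1u << cpu)) )
                cpu++;                /* advance to next cpu of the resource */
            printf("vcpu %u -> cpu %u\n", v, cpu);
        }

        return 0;
    }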

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
RFC V2:
- fix vcpu_runstate_helper()
V1:
- add special handling for idle unit in unit_runnable() and
  unit_runnable_state()
V2:
- handle affinity_broken correctly (Jan Beulich)
V3:
- type for cpu ->unsigned int (Jan Beulich)
---
 xen/common/domain.c        |  5 ++++-
 xen/common/schedule.c      |  9 +++++----
 xen/include/xen/sched-if.h | 16 +++++++++++++++-
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a9882509ed..93aa856bcb 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1273,7 +1273,10 @@ int vcpu_reset(struct vcpu *v)
     v->async_exception_mask = 0;
     memset(v->async_exception_state, 0, sizeof(v->async_exception_state));
 #endif
-    v->affinity_broken = 0;
+    if ( v->affinity_broken & VCPU_AFFINITY_OVERRIDE )
+        vcpu_temporary_affinity(v, NR_CPUS, VCPU_AFFINITY_OVERRIDE);
+    if ( v->affinity_broken & VCPU_AFFINITY_WAIT )
+        vcpu_temporary_affinity(v, NR_CPUS, VCPU_AFFINITY_WAIT);
     clear_bit(_VPF_blocked, &v->pause_flags);
     clear_bit(_VPF_in_reset, &v->pause_flags);
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 9c1b044b49..3094ff6838 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -252,8 +252,9 @@ static inline void vcpu_runstate_change(
     s_time_t delta;
     struct sched_unit *unit = v->sched_unit;
 
-    ASSERT(v->runstate.state != new_state);
     ASSERT(spin_is_locked(get_sched_res(v->processor)->schedule_lock));
+    if ( v->runstate.state == new_state )
+        return;
 
     vcpu_urgent_count_update(v);
 
@@ -1729,14 +1730,14 @@ static void sched_switch_units(struct sched_resource *sr,
              (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
              (now - next->state_entry_time) : 0, prev->next_time);
 
-    ASSERT(prev->vcpu_list->runstate.state == RUNSTATE_running);
+    ASSERT(unit_running(prev));
 
     TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
              next->domain->domain_id, next->unit_id);
 
     sched_unit_runstate_change(prev, false, now);
 
-    ASSERT(next->vcpu_list->runstate.state != RUNSTATE_running);
+    ASSERT(!unit_running(next));
     sched_unit_runstate_change(next, true, now);
 
     /*
@@ -1858,7 +1859,7 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext )
+    else if ( vprev != vnext && sched_granularity == 1 )
         context_saved(vprev);
 }
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 7e568a9d9f..983f2ece83 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -81,6 +81,11 @@ static inline bool is_unit_online(const struct sched_unit *unit)
     return false;
 }
 
+static inline unsigned int unit_running(const struct sched_unit *unit)
+{
+    return unit->runstate_cnt[RUNSTATE_running];
+}
+
 /* Returns true if at least one vcpu of the unit is runnable. */
 static inline bool unit_runnable(const struct sched_unit *unit)
 {
@@ -126,7 +131,16 @@ static inline bool unit_runnable_state(const struct sched_unit *unit)
 static inline void sched_set_res(struct sched_unit *unit,
                                  struct sched_resource *res)
 {
-    unit->vcpu_list->processor = res->master_cpu;
+    unsigned int cpu = cpumask_first(res->cpus);
+    struct vcpu *v;
+
+    for_each_sched_unit_vcpu ( unit, v )
+    {
+        ASSERT(cpu < nr_cpu_ids);
+        v->processor = cpu;
+        cpu = cpumask_next(cpu, res->cpus);
+    }
+
     unit->res = res;
 }
 
-- 
2.16.4



* [Xen-devel] [PATCH v5 04/19] xen/sched: modify cpupool_domain_cpumask() to be an unit mask
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (2 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 05/19] xen/sched: support allocating multiple vcpus into one sched unit Juergen Gross
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Robert VanVossen, Dario Faggioli, Julien Grall, Josh Whitehead,
	Meng Xu, Jan Beulich

cpupool_domain_cpumask() is used by scheduling to select cpus or to
iterate over cpus. In order to support scheduling units spanning
multiple cpus rename cpupool_domain_cpumask() to
cpupool_domain_master_cpumask() and let it return a cpumask with only
one bit set per scheduling resource.
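
A toy standalone example of the resulting masks (plain bitmasks stand
in for cpumask_t; the values assume a 4-cpu box with 2 threads per
core and sched-gran=core):

    #include <stdio.h>

    int main(void)
    {
        unsigned int cpu_valid = 0xf;      /* cpus 0-3 in the pool */
        unsigned int sched_res_mask = 0x5; /* master cpus 0 and 2 */

        /* Mirrors cpumask_and(c->res_valid, c->cpu_valid, sched_res_mask). */
        printf("res_valid = %#x\n", cpu_valid & sched_res_mask); /* 0x5 */

        return 0;
    }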

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V4:
- rename to cpupool_domain_master_cpumask() (Jan Beulich)
- check return value of zalloc_cpumask_var() (Jan Beulich)
---
 xen/common/cpupool.c        | 27 ++++++++++++++++++---------
 xen/common/domain.c         |  2 +-
 xen/common/domctl.c         |  2 +-
 xen/common/sched_arinc653.c |  2 +-
 xen/common/sched_credit.c   |  4 ++--
 xen/common/sched_credit2.c  | 22 +++++++++++-----------
 xen/common/sched_null.c     |  8 ++++----
 xen/common/sched_rt.c       |  8 ++++----
 xen/common/schedule.c       | 13 +++++++------
 xen/include/xen/sched-if.h  |  9 ++++++---
 10 files changed, 55 insertions(+), 42 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index fd30040922..441a26f16c 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -36,26 +36,33 @@ static DEFINE_SPINLOCK(cpupool_lock);
 
 DEFINE_PER_CPU(struct cpupool *, cpupool);
 
+static void free_cpupool_struct(struct cpupool *c)
+{
+    if ( c )
+    {
+        free_cpumask_var(c->res_valid);
+        free_cpumask_var(c->cpu_valid);
+    }
+    xfree(c);
+}
+
 static struct cpupool *alloc_cpupool_struct(void)
 {
     struct cpupool *c = xzalloc(struct cpupool);
 
-    if ( !c || !zalloc_cpumask_var(&c->cpu_valid) )
+    if ( !c )
+        return NULL;
+
+    if ( !zalloc_cpumask_var(&c->cpu_valid) ||
+         !zalloc_cpumask_var(&c->res_valid) )
     {
-        xfree(c);
+        free_cpupool_struct(c);
         c = NULL;
     }
 
     return c;
 }
 
-static void free_cpupool_struct(struct cpupool *c)
-{
-    if ( c )
-        free_cpumask_var(c->cpu_valid);
-    xfree(c);
-}
-
 /*
  * find a cpupool by it's id. to be called with cpupool lock held
  * if exact is not specified, the first cpupool with an id larger or equal to
@@ -269,6 +276,7 @@ static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
         cpupool_cpu_moving = NULL;
     }
     cpumask_set_cpu(cpu, c->cpu_valid);
+    cpumask_and(c->res_valid, c->cpu_valid, sched_res_mask);
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain_in_cpupool(d, c)
@@ -361,6 +369,7 @@ static int cpupool_unassign_cpu_start(struct cpupool *c, unsigned int cpu)
     atomic_inc(&c->refcnt);
     cpupool_cpu_moving = c;
     cpumask_clear_cpu(cpu, c->cpu_valid);
+    cpumask_and(c->res_valid, c->cpu_valid, sched_res_mask);
 
 out:
     spin_unlock(&cpupool_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 93aa856bcb..9c7360ed2a 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -584,7 +584,7 @@ void domain_update_node_affinity(struct domain *d)
         return;
     }
 
-    online = cpupool_domain_cpumask(d);
+    online = cpupool_domain_master_cpumask(d);
 
     spin_lock(&d->node_affinity_lock);
 
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 8a694e0d37..d597a09f98 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -619,7 +619,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
         {
             cpumask_var_t new_affinity, old_affinity;
-            cpumask_t *online = cpupool_domain_cpumask(v->domain);
+            cpumask_t *online = cpupool_domain_master_cpumask(v->domain);
 
             /*
              * We want to be able to restore hard affinity if we are trying
diff --git a/xen/common/sched_arinc653.c b/xen/common/sched_arinc653.c
index dd5876eacd..45c05c6cd9 100644
--- a/xen/common/sched_arinc653.c
+++ b/xen/common/sched_arinc653.c
@@ -614,7 +614,7 @@ a653sched_pick_resource(const struct scheduler *ops,
      * If present, prefer unit's current processor, else
      * just find the first valid unit.
      */
-    online = cpupool_domain_cpumask(unit->domain);
+    online = cpupool_domain_master_cpumask(unit->domain);
 
     cpu = cpumask_first(online);
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 00beac3ea4..a6dff8ec62 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -361,7 +361,7 @@ static inline void __runq_tickle(struct csched_unit *new)
     ASSERT(cur);
     cpumask_clear(&mask);
 
-    online = cpupool_domain_cpumask(new->sdom->dom);
+    online = cpupool_domain_master_cpumask(new->sdom->dom);
     cpumask_and(&idle_mask, prv->idlers, online);
     idlers_empty = cpumask_empty(&idle_mask);
 
@@ -724,7 +724,7 @@ _csched_cpu_pick(const struct scheduler *ops, const struct sched_unit *unit,
     /* We must always use cpu's scratch space */
     cpumask_t *cpus = cpumask_scratch_cpu(cpu);
     cpumask_t idlers;
-    cpumask_t *online = cpupool_domain_cpumask(unit->domain);
+    cpumask_t *online = cpupool_domain_master_cpumask(unit->domain);
     struct csched_pcpu *spc = NULL;
     int balance_step;
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 0e29e56d5a..d51df05887 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -705,7 +705,7 @@ static int get_fallback_cpu(struct csched2_unit *svc)
 
         affinity_balance_cpumask(unit, bs, cpumask_scratch_cpu(cpu));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
-                    cpupool_domain_cpumask(unit->domain));
+                    cpupool_domain_master_cpumask(unit->domain));
 
         /*
          * This is cases 1 or 3 (depending on bs): if processor is (still)
@@ -1440,7 +1440,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_unit *new, s_time_t now)
     struct sched_unit *unit = new->unit;
     unsigned int bs, cpu = sched_unit_master(unit);
     struct csched2_runqueue_data *rqd = c2rqd(ops, cpu);
-    cpumask_t *online = cpupool_domain_cpumask(unit->domain);
+    cpumask_t *online = cpupool_domain_master_cpumask(unit->domain);
     cpumask_t mask;
 
     ASSERT(new->rqd == rqd);
@@ -2243,7 +2243,7 @@ csched2_res_pick(const struct scheduler *ops, const struct sched_unit *unit)
     }
 
     cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                cpupool_domain_cpumask(unit->domain));
+                cpupool_domain_master_cpumask(unit->domain));
 
     /*
      * First check to see if we're here because someone else suggested a place
@@ -2358,8 +2358,8 @@ csched2_res_pick(const struct scheduler *ops, const struct sched_unit *unit)
          * ok because:
          * - we know that unit->cpu_hard_affinity and ->cpu_soft_affinity have
          *   a non-empty intersection (because has_soft is true);
-         * - we have unit->cpu_hard_affinity & cpupool_domain_cpumask() already
-         *   in cpumask_scratch, we do save a lot doing like this.
+         * - we have unit->cpu_hard_affinity & cpupool_domain_master_cpumask()
+         *   already in cpumask_scratch, we do save a lot doing like this.
          *
          * It's kind of like open coding affinity_balance_cpumask() but, in
          * this specific case, calling that would mean a lot of (unnecessary)
@@ -2378,7 +2378,7 @@ csched2_res_pick(const struct scheduler *ops, const struct sched_unit *unit)
          * affinity, so go for it.
          *
          * cpumask_scratch already has unit->cpu_hard_affinity &
-         * cpupool_domain_cpumask() in it, so it's enough that we filter
+         * cpupool_domain_master_cpumask() in it, so it's enough that we filter
          * with the cpus of the runq.
          */
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
@@ -2513,7 +2513,7 @@ static void migrate(const struct scheduler *ops,
         _runq_deassign(svc);
 
         cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                    cpupool_domain_cpumask(unit->domain));
+                    cpupool_domain_master_cpumask(unit->domain));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
         sched_set_res(unit,
@@ -2547,7 +2547,7 @@ static bool unit_is_migrateable(struct csched2_unit *svc,
     int cpu = sched_unit_master(unit);
 
     cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                cpupool_domain_cpumask(unit->domain));
+                cpupool_domain_master_cpumask(unit->domain));
 
     return !(svc->flags & CSFLAG_runq_migrate_request) &&
            cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active);
@@ -2763,7 +2763,7 @@ csched2_unit_migrate(
      * v->processor will be chosen, and during actual domain unpause that
      * the unit will be assigned to and added to the proper runqueue.
      */
-    if ( unlikely(!cpumask_test_cpu(new_cpu, cpupool_domain_cpumask(d))) )
+    if ( unlikely(!cpumask_test_cpu(new_cpu, cpupool_domain_master_cpumask(d))) )
     {
         ASSERT(system_state == SYS_STATE_suspend);
         if ( unit_on_runq(svc) )
@@ -3069,7 +3069,7 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
     sdom->nr_units = 0;
 
     init_timer(&sdom->repl_timer, replenish_domain_budget, sdom,
-               cpumask_any(cpupool_domain_cpumask(dom)));
+               cpumask_any(cpupool_domain_master_cpumask(dom)));
     spin_lock_init(&sdom->budget_lock);
     INIT_LIST_HEAD(&sdom->parked_units);
 
@@ -3317,7 +3317,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
                                  cpumask_scratch);
         if ( unlikely(!cpumask_test_cpu(cpu, cpumask_scratch)) )
         {
-            cpumask_t *online = cpupool_domain_cpumask(scurr->unit->domain);
+            cpumask_t *online = cpupool_domain_master_cpumask(scurr->unit->domain);
 
             /* Ok, is any of the pcpus in scurr soft-affinity idle? */
             cpumask_and(cpumask_scratch, cpumask_scratch, &rqd->idle);
diff --git a/xen/common/sched_null.c b/xen/common/sched_null.c
index 3dde1dcd00..2525464a7c 100644
--- a/xen/common/sched_null.c
+++ b/xen/common/sched_null.c
@@ -125,7 +125,7 @@ static inline bool unit_check_affinity(struct sched_unit *unit,
 {
     affinity_balance_cpumask(unit, balance_step, cpumask_scratch_cpu(cpu));
     cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
-                cpupool_domain_cpumask(unit->domain));
+                cpupool_domain_master_cpumask(unit->domain));
 
     return cpumask_test_cpu(cpu, cpumask_scratch_cpu(cpu));
 }
@@ -266,7 +266,7 @@ pick_res(struct null_private *prv, const struct sched_unit *unit)
 {
     unsigned int bs;
     unsigned int cpu = sched_unit_master(unit), new_cpu;
-    cpumask_t *cpus = cpupool_domain_cpumask(unit->domain);
+    cpumask_t *cpus = cpupool_domain_master_cpumask(unit->domain);
 
     ASSERT(spin_is_locked(get_sched_res(cpu)->schedule_lock));
 
@@ -467,7 +467,7 @@ static void null_unit_insert(const struct scheduler *ops,
     lock = unit_schedule_lock(unit);
 
     cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                cpupool_domain_cpumask(unit->domain));
+                cpupool_domain_master_cpumask(unit->domain));
 
     /* If the pCPU is free, we assign unit to it */
     if ( likely(per_cpu(npc, cpu).unit == NULL) )
@@ -579,7 +579,7 @@ static void null_unit_wake(const struct scheduler *ops,
         spin_unlock(&prv->waitq_lock);
 
         cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                    cpupool_domain_cpumask(unit->domain));
+                    cpupool_domain_master_cpumask(unit->domain));
 
         if ( !cpumask_intersects(&prv->cpus_free, cpumask_scratch_cpu(cpu)) )
         {
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index fd882f2ca4..d21c416cae 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -326,7 +326,7 @@ rt_dump_unit(const struct scheduler *ops, const struct rt_unit *svc)
      */
     mask = cpumask_scratch_cpu(sched_unit_master(svc->unit));
 
-    cpupool_mask = cpupool_domain_cpumask(svc->unit->domain);
+    cpupool_mask = cpupool_domain_master_cpumask(svc->unit->domain);
     cpumask_and(mask, cpupool_mask, svc->unit->cpu_hard_affinity);
     printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
            " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime"\n"
@@ -642,7 +642,7 @@ rt_res_pick(const struct scheduler *ops, const struct sched_unit *unit)
     cpumask_t *online;
     int cpu;
 
-    online = cpupool_domain_cpumask(unit->domain);
+    online = cpupool_domain_master_cpumask(unit->domain);
     cpumask_and(&cpus, online, unit->cpu_hard_affinity);
 
     cpu = cpumask_test_cpu(sched_unit_master(unit), &cpus)
@@ -1016,7 +1016,7 @@ runq_pick(const struct scheduler *ops, const cpumask_t *mask)
         iter_svc = q_elem(iter);
 
         /* mask cpu_hard_affinity & cpupool & mask */
-        online = cpupool_domain_cpumask(iter_svc->unit->domain);
+        online = cpupool_domain_master_cpumask(iter_svc->unit->domain);
         cpumask_and(&cpu_common, online, iter_svc->unit->cpu_hard_affinity);
         cpumask_and(&cpu_common, mask, &cpu_common);
         if ( cpumask_empty(&cpu_common) )
@@ -1191,7 +1191,7 @@ runq_tickle(const struct scheduler *ops, struct rt_unit *new)
     if ( new == NULL || is_idle_unit(new->unit) )
         return;
 
-    online = cpupool_domain_cpumask(new->unit->domain);
+    online = cpupool_domain_master_cpumask(new->unit->domain);
     cpumask_and(&not_tickled, online, new->unit->cpu_hard_affinity);
     cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 3094ff6838..36b1d3df6e 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -63,6 +63,7 @@ integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Number of vcpus per struct sched_unit. */
 static unsigned int __read_mostly sched_granularity = 1;
+const cpumask_t *sched_res_mask = &cpumask_all;
 
 /* Common lock for free cpus. */
 static DEFINE_SPINLOCK(sched_free_cpu_lock);
@@ -188,7 +189,7 @@ static inline struct scheduler *vcpu_scheduler(const struct vcpu *v)
 {
     return unit_scheduler(v->sched_unit);
 }
-#define VCPU2ONLINE(_v) cpupool_domain_cpumask((_v)->domain)
+#define VCPU2ONLINE(_v) cpupool_domain_master_cpumask((_v)->domain)
 
 static inline void trace_runstate_change(struct vcpu *v, int new_state)
 {
@@ -425,9 +426,9 @@ static unsigned int sched_select_initial_cpu(const struct vcpu *v)
     cpumask_clear(cpus);
     for_each_node_mask ( node, d->node_affinity )
         cpumask_or(cpus, cpus, &node_to_cpumask(node));
-    cpumask_and(cpus, cpus, cpupool_domain_cpumask(d));
+    cpumask_and(cpus, cpus, d->cpupool->cpu_valid);
     if ( cpumask_empty(cpus) )
-        cpumask_copy(cpus, cpupool_domain_cpumask(d));
+        cpumask_copy(cpus, d->cpupool->cpu_valid);
 
     if ( v->vcpu_id == 0 )
         cpu_ret = cpumask_first(cpus);
@@ -973,7 +974,7 @@ void restore_vcpu_affinity(struct domain *d)
         lock = unit_schedule_lock_irq(unit);
 
         cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                    cpupool_domain_cpumask(d));
+                    cpupool_domain_master_cpumask(d));
         if ( cpumask_empty(cpumask_scratch_cpu(cpu)) )
         {
             if ( sched_check_affinity_broken(unit) )
@@ -981,7 +982,7 @@ void restore_vcpu_affinity(struct domain *d)
                 sched_set_affinity(unit, unit->cpu_hard_affinity_saved, NULL);
                 sched_reset_affinity_broken(unit);
                 cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                            cpupool_domain_cpumask(d));
+                            cpupool_domain_master_cpumask(d));
             }
 
             if ( cpumask_empty(cpumask_scratch_cpu(cpu)) )
@@ -991,7 +992,7 @@ void restore_vcpu_affinity(struct domain *d)
                        unit->vcpu_list);
                 sched_set_affinity(unit, &cpumask_all, NULL);
                 cpumask_and(cpumask_scratch_cpu(cpu), unit->cpu_hard_affinity,
-                            cpupool_domain_cpumask(d));
+                            cpupool_domain_master_cpumask(d));
             }
         }
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 983f2ece83..1b296b150f 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -22,6 +22,8 @@ extern cpumask_t cpupool_free_cpus;
 #define SCHED_DEFAULT_RATELIMIT_US 1000
 extern int sched_ratelimit_us;
 
+/* Scheduling resource mask. */
+extern const cpumask_t *sched_res_mask;
 
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
@@ -535,6 +537,7 @@ struct cpupool
     int              cpupool_id;
     unsigned int     n_dom;
     cpumask_var_t    cpu_valid;      /* all cpus assigned to pool */
+    cpumask_var_t    res_valid;      /* all scheduling resources of pool */
     struct cpupool   *next;
     struct scheduler *sched;
     atomic_t         refcnt;
@@ -543,14 +546,14 @@ struct cpupool
 #define cpupool_online_cpumask(_pool) \
     (((_pool) == NULL) ? &cpu_online_map : (_pool)->cpu_valid)
 
-static inline cpumask_t *cpupool_domain_cpumask(const struct domain *d)
+static inline cpumask_t *cpupool_domain_master_cpumask(const struct domain *d)
 {
     /*
      * d->cpupool is NULL only for the idle domain, and no one should
      * be interested in calling this for the idle domain.
      */
     ASSERT(d->cpupool != NULL);
-    return d->cpupool->cpu_valid;
+    return d->cpupool->res_valid;
 }
 
 /*
@@ -590,7 +593,7 @@ static inline cpumask_t *cpupool_domain_cpumask(const struct domain *d)
 static inline int has_soft_affinity(const struct sched_unit *unit)
 {
     return unit->soft_aff_effective &&
-           !cpumask_subset(cpupool_domain_cpumask(unit->domain),
+           !cpumask_subset(cpupool_domain_master_cpumask(unit->domain),
                            unit->cpu_soft_affinity);
 }
 
-- 
2.16.4



* [Xen-devel] [PATCH v5 05/19] xen/sched: support allocating multiple vcpus into one sched unit
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (3 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 04/19] xen/sched: modify cpupool_domain_cpumask() to be an unit mask Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 06/19] xen/sched: add a percpu resource index Juergen Gross
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Dario Faggioli

With a scheduling granularity greater than 1, multiple vcpus share the
same struct sched_unit. Support that.

Setting the initial processor must be done carefully: we can't use
sched_set_res() as that relies on for_each_sched_unit_vcpu(), which in
turn requires the vcpu to already be a member of the domain's vcpu
linked list, and at this point it isn't yet.
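
In code terms the assignment boils down to the following predicate (a
condensed sketch mirroring the sched_alloc_unit() hunk below; the
helper name is illustrative only):

    /* A vcpu joins the unit covering its vcpu_id group. */
    static bool vcpu_belongs_to_unit(const struct sched_unit *unit,
                                     const struct vcpu *v,
                                     unsigned int granularity)
    {
        /* unit_id is always the lowest vcpu_id of the unit. */
        return unit->unit_id / granularity == v->vcpu_id / granularity;
    }

With granularity 2, vcpus 0 and 1 share the unit with unit_id 0,
vcpus 2 and 3 the unit with unit_id 2, and so on.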

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V4:
- merge patch 36 of V3 into this one (Jan Beulich)
- add some comments (Jan Beulich)
- use unit_id instead of vcpu_list->vcpu_id (Jan Beulich)
---
 xen/common/schedule.c | 97 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 76 insertions(+), 21 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 36b1d3df6e..37002b4c0e 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -349,7 +349,7 @@ static void sched_spin_unlock_double(spinlock_t *lock1, spinlock_t *lock2,
     spin_unlock_irqrestore(lock1, flags);
 }
 
-static void sched_free_unit(struct sched_unit *unit)
+static void sched_free_unit_mem(struct sched_unit *unit)
 {
     struct sched_unit *prev_unit;
     struct domain *d = unit->domain;
@@ -368,8 +368,6 @@ static void sched_free_unit(struct sched_unit *unit)
         }
     }
 
-    unit->vcpu_list->sched_unit = NULL;
-
     free_cpumask_var(unit->cpu_hard_affinity);
     free_cpumask_var(unit->cpu_hard_affinity_saved);
     free_cpumask_var(unit->cpu_soft_affinity);
@@ -377,18 +375,65 @@ static void sched_free_unit(struct sched_unit *unit)
     xfree(unit);
 }
 
+static void sched_free_unit(struct sched_unit *unit, struct vcpu *v)
+{
+    struct vcpu *vunit;
+    unsigned int cnt = 0;
+
+    /* Don't count the vcpu to be released; it may not be in the list yet. */
+    for_each_sched_unit_vcpu ( unit, vunit )
+        if ( vunit != v )
+            cnt++;
+
+    v->sched_unit = NULL;
+    unit->runstate_cnt[v->runstate.state]--;
+
+    if ( unit->vcpu_list == v )
+        unit->vcpu_list = v->next_in_list;
+
+    if ( !cnt )
+        sched_free_unit_mem(unit);
+}
+
+static void sched_unit_add_vcpu(struct sched_unit *unit, struct vcpu *v)
+{
+    v->sched_unit = unit;
+
+    /* All but idle vcpus are allocated with sequential vcpu_id. */
+    if ( !unit->vcpu_list || unit->vcpu_list->vcpu_id > v->vcpu_id )
+    {
+        unit->vcpu_list = v;
+        /*
+         * unit_id is always the same as the lowest vcpu_id of the unit.
+         * This is used to stop the for_each_sched_unit_vcpu() loop and to
+         * support cpupools with different granularities.
+         */
+        unit->unit_id = v->vcpu_id;
+    }
+    unit->runstate_cnt[v->runstate.state]++;
+}
+
 static struct sched_unit *sched_alloc_unit(struct vcpu *v)
 {
     struct sched_unit *unit, **prev_unit;
     struct domain *d = v->domain;
 
+    for_each_sched_unit ( d, unit )
+        if ( unit->unit_id / sched_granularity ==
+             v->vcpu_id / sched_granularity )
+            break;
+
+    if ( unit )
+    {
+        sched_unit_add_vcpu(unit, v);
+        return unit;
+    }
+
     if ( (unit = xzalloc(struct sched_unit)) == NULL )
         return NULL;
 
-    unit->vcpu_list = v;
-    unit->unit_id = v->vcpu_id;
     unit->domain = d;
-    unit->runstate_cnt[v->runstate.state]++;
+    sched_unit_add_vcpu(unit, v);
 
     for ( prev_unit = &d->sched_unit_list; *prev_unit;
           prev_unit = &(*prev_unit)->next_in_list )
@@ -404,12 +449,10 @@ static struct sched_unit *sched_alloc_unit(struct vcpu *v)
          !zalloc_cpumask_var(&unit->cpu_soft_affinity) )
         goto fail;
 
-    v->sched_unit = unit;
-
     return unit;
 
  fail:
-    sched_free_unit(unit);
+    sched_free_unit(unit, v);
     return NULL;
 }
 
@@ -459,21 +502,26 @@ int sched_init_vcpu(struct vcpu *v)
     else
         processor = sched_select_initial_cpu(v);
 
-    sched_set_res(unit, get_sched_res(processor));
-
     /* Initialise the per-vcpu timers. */
     spin_lock_init(&v->periodic_timer_lock);
-    init_timer(&v->periodic_timer, vcpu_periodic_timer_fn,
-               v, v->processor);
-    init_timer(&v->singleshot_timer, vcpu_singleshot_timer_fn,
-               v, v->processor);
-    init_timer(&v->poll_timer, poll_timer_fn,
-               v, v->processor);
+    init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, processor);
+    init_timer(&v->singleshot_timer, vcpu_singleshot_timer_fn, v, processor);
+    init_timer(&v->poll_timer, poll_timer_fn, v, processor);
+
+    /* If this is not the first vcpu of the unit, we are done. */
+    if ( unit->priv != NULL )
+    {
+        v->processor = processor;
+        return 0;
+    }
+
+    /* The first vcpu of a unit can be set via sched_set_res(). */
+    sched_set_res(unit, get_sched_res(processor));
 
     unit->priv = sched_alloc_udata(dom_scheduler(d), unit, d->sched_priv);
     if ( unit->priv == NULL )
     {
-        sched_free_unit(unit);
+        sched_free_unit(unit, v);
         return 1;
     }
 
@@ -633,9 +681,16 @@ void sched_destroy_vcpu(struct vcpu *v)
     kill_timer(&v->poll_timer);
     if ( test_and_clear_bool(v->is_urgent) )
         atomic_dec(&per_cpu(sched_urgent_count, v->processor));
-    sched_remove_unit(vcpu_scheduler(v), unit);
-    sched_free_udata(vcpu_scheduler(v), unit->priv);
-    sched_free_unit(unit);
+    /*
+     * Vcpus are being destroyed top-down. So being the first vcpu of a unit
+     * is the same as being the only one.
+     */
+    if ( unit->vcpu_list == v )
+    {
+        sched_remove_unit(vcpu_scheduler(v), unit);
+        sched_free_udata(vcpu_scheduler(v), unit->priv);
+        sched_free_unit(unit, v);
+    }
 }
 
 int sched_init_domain(struct domain *d, int poolid)
-- 
2.16.4



* [Xen-devel] [PATCH v5 06/19] xen/sched: add a percpu resource index
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (4 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 05/19] xen/sched: support allocating multiple vcpus into one sched unit Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit Juergen Gross
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Dario Faggioli

Add a percpu variable holding the index of the cpu in the current
sched_resource structure. This index is used to get the correct vcpu
of a sched_unit on a specific cpu.

For now this index will be zero for all cpus, but with core scheduling
it will be possible to have higher values, too.
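
A condensed sketch of the lookup introduced below (assuming a
granularity of 2): the unit with unit_id 2 spans vcpus 2 and 3; the
cpu with sched_res_idx 0 resolves to vcpu 2, its sibling with
sched_res_idx 1 to vcpu 3:

    /* Sketch: select the vcpu of a unit running on a given cpu. */
    v = unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];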

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
RFC V2: new patch (carved out from RFC V1 patch 49)
V4:
- make function parameter const (Jan Beulich)
---
 xen/common/schedule.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 37002b4c0e..c8e2999407 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -77,6 +77,7 @@ static void poll_timer_fn(void *data);
 /* This is global for now so that private implementations can reach it */
 DEFINE_PER_CPU(struct scheduler *, scheduler);
 DEFINE_PER_CPU_READ_MOSTLY(struct sched_resource *, sched_res);
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, sched_res_idx);
 
 /* Scratch space for cpumasks. */
 DEFINE_PER_CPU(cpumask_t, cpumask_scratch);
@@ -144,6 +145,12 @@ static struct scheduler sched_idle_ops = {
     .switch_sched   = sched_idle_switch_sched,
 };
 
+static inline struct vcpu *sched_unit2vcpu_cpu(const struct sched_unit *unit,
+                                               unsigned int cpu)
+{
+    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
+}
+
 static inline struct scheduler *dom_scheduler(const struct domain *d)
 {
     if ( likely(d->cpupool != NULL) )
@@ -2030,7 +2037,7 @@ static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, next->vcpu_list, now);
+    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
 }
 
 /*
@@ -2091,7 +2098,7 @@ static void schedule(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    vnext = next->vcpu_list;
+    vnext = sched_unit2vcpu_cpu(next, cpu);
     sched_context_switch(vprev, vnext, now);
 }
 
-- 
2.16.4



* [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (5 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 06/19] xen/sched: add a percpu resource index Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  7:28   ` Dario Faggioli
  2019-09-30 10:45   ` Jan Beulich
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware Juergen Gross
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli,
	Volodymyr Babchuk, Roger Pau Monné

When scheduling a unit with multiple vcpus there is no guarantee that
all vcpus are available (e.g. above maxvcpus, or the vcpu being
offline). Fall back to the idle vcpu of the current cpu in that case.
This requires storing the correct sched_unit pointer in the idle vcpu
for as long as it is used as a fallback vcpu.
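
A condensed sketch of the fallback, mirroring the unit2vcpu_cpu() and
sched_unit2vcpu_cpu() helpers added below (with d standing for
unit->domain):

    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
    struct vcpu *v = (idx < d->max_vcpus) ? d->vcpu[idx] : NULL;

    if ( !v || v->new_state != RUNSTATE_running )
        v = idle_vcpu[cpu];     /* fall back to this cpu's idle vcpu */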

In order to modify the runstates of the correct vcpus when switching
schedule units, merge sched_unit_runstate_change() into
sched_switch_units() and loop over the affected physical cpus instead
of the unit's vcpus. This in turn requires an access function for the
current variable of other cpus.

Today context_saved() is called in case previous and next vcpus differ
when doing a context switch. With an idle vcpu being capable of
substituting for an offline vcpu this is problematic when switching to
an idle scheduling unit: an idle previous vcpu leaves us in doubt which
schedule unit was active previously. So save the previous unit pointer
in the per-scheduling-resource area; if it is NULL, the unit has not
changed and we don't have to mark the previous unit as not running.

When running an idle vcpu in a non-idle scheduling unit, use a
specific guest idle loop which performs neither non-softirq tasklets
nor livepatching, in order to avoid populating the cpu caches with
memory used by other domains (as far as possible). Softirqs are
considered to be safe.

In order to avoid livepatching when going to guest idle, another
variant of reset_stack_and_jump() not calling check_for_livepatch_work
is needed.
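
With the macro split in the asm-x86/current.h hunk below, the two idle
flavours are entered like this (sketch of the continue_idle_domain()
usage):

    /* Plain idle unit: livepatch work may run. */
    reset_stack_and_jump(idle_loop);

    /* Idle vcpu acting as fallback in a non-idle unit: no livepatching. */
    reset_stack_and_jump_nolp(guest_idle_loop);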

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
RFC V2:
- new patch (Andrew Cooper)

V1:
- use urgent_count to select correct idle routine (Jan Beulich)

V2:
- set vcpu->is_running in context_saved()
- introduce reset_stack_and_jump_nolp() (Jan Beulich)
- readd scrubbing (Jan Beulich, Andrew Cooper)
- get_cpu_current() _NOT_ moved to include/asm-x86/current.h as the
  needed reference of stack_base[] results in a #include hell

V3:
- split context_saved() into unit_context_saved() and vcpu_context_saved()

V4:
- rename sd -> sr (Jan Beulich)
- use unsigned int for cpu (Jan Beulich)
- add comment in sched_context_switch() (Jan Beulich)
- add comment before definition of get_cpu_current() (Jan Beulich)

V5:
- add comment (Dario Faggioli)
---
 xen/arch/x86/domain.c         |  23 +++++
 xen/common/schedule.c         | 195 +++++++++++++++++++++++++++++-------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |  19 +++-
 xen/include/asm-x86/smp.h     |   7 ++
 xen/include/xen/sched-if.h    |   4 +-
 xen/include/xen/sched.h       |   1 +
 7 files changed, 187 insertions(+), 63 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 27f99d3bcc..c8d7f491ea 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -159,6 +159,25 @@ static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings in active schedule units.
+ * We don't do any standard idle work like tasklets or livepatching.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        ASSERT(!cpu_is_offline(cpu));
+
+        if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+             !softirq_pending(cpu) )
+            sched_guest_idle(pm_idle, cpu);
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    /* Idle vcpus might be attached to non-idle units! */
+    if ( !is_idle_domain(v->sched_unit->domain) )
+        reset_stack_and_jump_nolp(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index c8e2999407..b4c4b04ebe 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -145,10 +145,21 @@ static struct scheduler sched_idle_ops = {
     .switch_sched   = sched_idle_switch_sched,
 };
 
+static inline struct vcpu *unit2vcpu_cpu(const struct sched_unit *unit,
+                                         unsigned int cpu)
+{
+    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = unit->domain;
+
+    return (idx < d->max_vcpus) ? d->vcpu[idx] : NULL;
+}
+
 static inline struct vcpu *sched_unit2vcpu_cpu(const struct sched_unit *unit,
                                                unsigned int cpu)
 {
-    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
+    struct vcpu *v = unit2vcpu_cpu(unit, cpu);
+
+    return (v && v->new_state == RUNSTATE_running) ? v : idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -268,8 +279,11 @@ static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    unit->runstate_cnt[v->runstate.state]--;
-    unit->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        unit->runstate_cnt[v->runstate.state]--;
+        unit->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -281,21 +295,18 @@ static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_unit_runstate_change(struct sched_unit *unit,
-    bool running, s_time_t new_entry_time)
+void sched_guest_idle(void (*idle) (void), unsigned int cpu)
 {
-    struct vcpu *v;
-
-    for_each_sched_unit_vcpu ( unit, v )
-    {
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
-    }
+    /*
+     * Another vcpu of the unit is active in guest context while this one is
+     * idle. In case of a scheduling event we don't want to have high latencies
+     * due to a cpu needing to wake up from deep C state for joining the
+     * rendezvous, so avoid those deep C states by incrementing the urgent
+     * count of the cpu.
+     */
+    atomic_inc(&per_cpu(sched_urgent_count, cpu));
+    idle();
+    atomic_dec(&per_cpu(sched_urgent_count, cpu));
 }
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
@@ -545,6 +556,7 @@ int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         get_sched_res(v->processor)->curr = unit;
+        get_sched_res(v->processor)->sched_unit_idle = unit;
         v->is_running = 1;
         unit->is_running = true;
         unit->state_entry_time = NOW();
@@ -877,7 +889,7 @@ static void sched_unit_move_locked(struct sched_unit *unit,
  *
  * sched_unit_migrate_finish() will do the work now if it can, or simply
  * return if it can't (because unit is still running); in that case
- * sched_unit_migrate_finish() will be called by context_saved().
+ * sched_unit_migrate_finish() will be called by unit_context_saved().
  */
 static void sched_unit_migrate_start(struct sched_unit *unit)
 {
@@ -900,7 +912,7 @@ static void sched_unit_migrate_finish(struct sched_unit *unit)
 
     /*
      * If the unit is currently running, this will be handled by
-     * context_saved(); and in any case, if the bit is cleared, then
+     * unit_context_saved(); and in any case, if the bit is cleared, then
      * someone else has already done the work so we don't need to.
      */
     if ( unit->is_running )
@@ -1785,33 +1797,66 @@ static void sched_switch_units(struct sched_resource *sr,
                                struct sched_unit *next, struct sched_unit *prev,
                                s_time_t now)
 {
-    sr->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    unsigned int cpu;
 
     ASSERT(unit_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( prev != next )
+    {
+        sr->curr = next;
+        sr->prev = prev;
 
-    sched_unit_runstate_change(prev, false, now);
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->unit_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->unit_id,
+                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+                 next->domain->domain_id, next->unit_id);
 
-    ASSERT(!unit_running(next));
-    sched_unit_runstate_change(next, true, now);
+        ASSERT(!unit_running(next));
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = true;
+        next->state_entry_time = now;
+
+        if ( is_idle_unit(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_unit(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu ( cpu, sr->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_unit2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
 
-    ASSERT(!next->is_running);
-    next->vcpu_list->is_running = 1;
-    next->is_running = true;
-    next->state_entry_time = now;
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_unit = next;
+    }
 }
 
 static bool sched_tasklet_check_cpu(unsigned int cpu)
@@ -1867,29 +1912,39 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sr->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_units(sr, next, prev, now);
+    sched_switch_units(sr, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
 {
-    struct sched_unit *unit = prev->sched_unit;
-
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    prev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
+}
+
+static void unit_context_saved(struct sched_resource *sr)
+{
+    struct sched_unit *unit = sr->prev;
+
+    if ( !unit )
+        return;
+
     unit->is_running = false;
     unit->state_entry_time = NOW();
+    sr->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), unit);
+    sched_context_saved(unit_scheduler(unit), unit);
 
-    sched_unit_migrate_finish(unit);
+    /* Idle never migrates and idle vcpus might belong to other units. */
+    if ( !is_idle_unit(unit) )
+        sched_unit_migrate_finish(unit);
 }
 
 /*
@@ -1899,35 +1954,44 @@ static void context_saved(struct vcpu *prev)
  * The counter will be 0 in case no rendezvous is needed. For the rendezvous
  * case it is initialised to the number of cpus to rendezvous plus 1. Each
  * member entering decrements the counter. The last one will decrement it to
- * 1 and perform the final needed action in that case (call of context_saved()
- * if vcpu was switched), and then set the counter to zero. The other members
+ * 1 and perform the final needed action in that case (call of
+ * unit_context_saved()), and then set the counter to zero. The other members
  * will wait until the counter becomes zero before they proceed.
  */
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_unit *next = vnext->sched_unit;
+    struct sched_resource *sr = get_sched_res(smp_processor_id());
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
         int cnt = atomic_dec_return(&next->rendezvous_out_cnt);
 
-        /* Call context_saved() before releasing other waiters. */
+        vcpu_context_saved(vprev, vnext);
+
+        /* Call unit_context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            unit_context_saved(sr);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else
+    {
+        vcpu_context_saved(vprev, vnext);
+        if ( sched_granularity == 1 )
+            unit_context_saved(sr);
+    }
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_unit = sr->sched_unit_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_unit, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1936,6 +2000,17 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_unit->next_time);
         sched_context_switched(vprev, vnext);
+
+        /*
+         * We are switching from a non-idle to an idle unit.
+         * A vcpu of the idle unit might have been running before due to
+         * the guest vcpu being blocked. We must adjust the unit of the idle
+         * vcpu which might have been set to the guest's one.
+         */
+        if ( reset_idle_unit )
+            vnext->sched_unit =
+                get_sched_res(smp_processor_id())->sched_unit_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1994,7 +2069,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
             pcpu_schedule_unlock_irq(*lock, cpu);
 
             raise_softirq(SCHED_SLAVE_SOFTIRQ);
-            sched_context_switch(vprev, vprev, now);
+            sched_context_switch(vprev, vprev, false, now);
 
             return NULL;         /* ARM only. */
         }
@@ -2037,7 +2112,8 @@ static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
+                         is_idle_unit(next) && !is_idle_unit(prev), now);
 }
 
 /*
@@ -2099,7 +2175,8 @@ static void schedule(void)
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_unit2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_unit(prev) && is_idle_unit(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -2170,6 +2247,7 @@ static int cpu_schedule_up(unsigned int cpu)
      */
 
     sr->curr = idle_vcpu[cpu]->sched_unit;
+    sr->sched_unit_idle = idle_vcpu[cpu]->sched_unit;
 
     sr->sched_priv = NULL;
 
@@ -2339,6 +2417,7 @@ void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     get_sched_res(0)->curr = idle_vcpu[0]->sched_unit;
+    get_sched_res(0)->sched_unit_idle = idle_vcpu[0]->sched_unit;
 }
 
 /*
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index 1653e89d30..88beb4645a 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current            (this_cpu(curr_vcpu))
 #define set_current(vcpu)  do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f3508c3c08..0b47485337 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -77,6 +77,11 @@ struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -87,7 +92,7 @@ static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current()         (get_cpu_info()->current_vcpu)
@@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define CHECK_FOR_LIVEPATCH_WORK ""
 #endif
 
-#define reset_stack_and_jump(__fn)                                      \
+#define switch_stack_and_jump(fn, instr)                                \
     ({                                                                  \
         __asm__ __volatile__ (                                          \
             "mov %0,%%"__OP"sp;"                                        \
-            CHECK_FOR_LIVEPATCH_WORK                                      \
+            instr                                                       \
              "jmp %c1"                                                  \
-            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
+            : : "r" (guest_cpu_user_regs()), "i" (fn) : "memory" );     \
         unreachable();                                                  \
     })
 
+#define reset_stack_and_jump(fn)                                        \
+    switch_stack_and_jump(fn, CHECK_FOR_LIVEPATCH_WORK)
+
+#define reset_stack_and_jump_nolp(fn)                                   \
+    switch_stack_and_jump(fn, "")
+
 /*
  * Which VCPU's state is currently running on each CPU?
 * This is not necessarily the same as 'current' as a CPU may be
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 61446d0efd..dbeed2fd41 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -77,6 +77,13 @@ void set_nr_sockets(void);
 /* Representing HT and core siblings in each socket. */
 extern cpumask_t **socket_cpumask;
 
+/*
+ * To be used only while no context switch can occur on the cpu, i.e.
+ * by certain scheduling code only.
+ */
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 1b296b150f..41a1083a08 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -39,6 +39,8 @@ struct sched_resource {
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_unit  *curr;
+    struct sched_unit  *sched_unit_idle;
+    struct sched_unit  *prev;
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
 
@@ -194,7 +196,7 @@ static inline void sched_clear_pause_flags_atomic(struct sched_unit *unit,
 
 static inline struct sched_unit *sched_idle_unit(unsigned int cpu)
 {
-    return idle_vcpu[cpu]->sched_unit;
+    return get_sched_res(cpu)->sched_unit_idle;
 }
 
 static inline unsigned int sched_get_resource_cpu(unsigned int cpu)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 12f00cd78d..ce4329db72 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -929,6 +929,7 @@ void restore_vcpu_affinity(struct domain *d);
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
 uint64_t get_cpu_idle_time(unsigned int cpu);
+void sched_guest_idle(void (*idle) (void), unsigned int cpu);
 
 /*
  * Used by idle loop to decide whether there is work to do:
-- 
2.16.4



* [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (6 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  7:24   ` Dario Faggioli
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 09/19] xen/sched: move per-cpu variable scheduler to struct sched_resource Juergen Gross
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

vcpu_wake() and vcpu_sleep() need to be made core scheduling aware:
they might need to switch a single vcpu of an already scheduled unit
between running and not running.

Especially when vcpu_sleep() for a vcpu is being called by a vcpu of
the same scheduling unit, special care must be taken in order to avoid
a deadlock: the vcpu to be put to sleep must be forced through a
context switch without doing so for the calling vcpu. For this
purpose add a vcpu flag, handled in sched_slave() and in
sched_wait_rendezvous_in(), allowing a vcpu of the currently running
unit to switch state at a higher priority than a normal schedule
event.

Use the same mechanism when waking up a vcpu of a currently active
unit.

While at it, make vcpu_sleep_nosync_locked() static, as it is used in
schedule.c only.
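
The wake side of the mechanism reduces to the following pattern (a
condensed sketch of the vcpu_wake() hunk below):

    if ( unit->is_running && !v->is_running && !v->force_context_switch )
    {
        /* The unit already runs, but this vcpu doesn't: force it in. */
        v->force_context_switch = true;
        cpu_raise_softirq(v->processor, SCHED_SLAVE_SOFTIRQ);
    }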

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2: add vcpu_sleep() handling and force_context_switch flag
V2: fix runstate change in sched_force_context_switch()
V4:
- use unit_scheduler() where appropriate (Jan Beulich)
- make cpu parameter unsigned int (Jan Beulich)
- comments (Jan Beulich)
- use true instead 1 for setting bool (Jan Beulich)
- const parameter (Jan Beulich)
V5:
- add comments (Dario Faggioli)
---
 xen/common/schedule.c      | 134 +++++++++++++++++++++++++++++++++++++++++++--
 xen/include/xen/sched-if.h |   9 ++-
 xen/include/xen/sched.h    |   2 +
 3 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b4c4b04ebe..9442be1c83 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -751,8 +751,10 @@ void sched_destroy_domain(struct domain *d)
     }
 }
 
-void vcpu_sleep_nosync_locked(struct vcpu *v)
+static void vcpu_sleep_nosync_locked(struct vcpu *v)
 {
+    struct sched_unit *unit = v->sched_unit;
+
     ASSERT(spin_is_locked(get_sched_res(v->processor)->schedule_lock));
 
     if ( likely(!vcpu_runnable(v)) )
@@ -760,7 +762,15 @@ void vcpu_sleep_nosync_locked(struct vcpu *v)
         if ( v->runstate.state == RUNSTATE_runnable )
             vcpu_runstate_change(v, RUNSTATE_offline, NOW());
 
-        sched_sleep(vcpu_scheduler(v), v->sched_unit);
+        /* Only put the unit to sleep if none of its vcpus is runnable. */
+        if ( likely(!unit_runnable(unit)) )
+            sched_sleep(unit_scheduler(unit), unit);
+        else if ( unit_running(unit) > 1 && v->is_running &&
+                  !v->force_context_switch )
+        {
+            v->force_context_switch = true;
+            cpu_raise_softirq(v->processor, SCHED_SLAVE_SOFTIRQ);
+        }
     }
 }
 
@@ -792,16 +802,27 @@ void vcpu_wake(struct vcpu *v)
 {
     unsigned long flags;
     spinlock_t *lock;
+    struct sched_unit *unit = v->sched_unit;
 
     TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);
 
-    lock = unit_schedule_lock_irqsave(v->sched_unit, &flags);
+    lock = unit_schedule_lock_irqsave(unit, &flags);
 
     if ( likely(vcpu_runnable(v)) )
     {
         if ( v->runstate.state >= RUNSTATE_blocked )
             vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
-        sched_wake(vcpu_scheduler(v), v->sched_unit);
+        /*
+         * Call sched_wake() unconditionally, even if unit is running already.
+         * We might not have been de-scheduled after vcpu_sleep_nosync_locked()
+         * and are now to be woken up again.
+         */
+        sched_wake(unit_scheduler(unit), unit);
+        if ( unit->is_running && !v->is_running && !v->force_context_switch )
+        {
+            v->force_context_switch = true;
+            cpu_raise_softirq(v->processor, SCHED_SLAVE_SOFTIRQ);
+        }
     }
     else if ( !(v->pause_flags & VPF_blocked) )
     {
@@ -809,7 +830,7 @@ void vcpu_wake(struct vcpu *v)
             vcpu_runstate_change(v, RUNSTATE_offline, NOW());
     }
 
-    unit_schedule_unlock_irqrestore(lock, flags, v->sched_unit);
+    unit_schedule_unlock_irqrestore(lock, flags, unit);
 }
 
 void vcpu_unblock(struct vcpu *v)
@@ -2027,6 +2048,65 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
     context_switch(vprev, vnext);
 }
 
+/*
+ * Force a context switch of a single vcpu of a unit.
+ * Might be called either if a vcpu of an already running unit is woken up
+ * or if a vcpu of a running unit is put to sleep with other vcpus of the
+ * same unit still running.
+ * Returns either NULL if v is already in the correct state or the vcpu to
+ * run next.
+ */
+static struct vcpu *sched_force_context_switch(struct vcpu *vprev,
+                                               struct vcpu *v,
+                                               unsigned int cpu, s_time_t now)
+{
+    v->force_context_switch = false;
+
+    if ( vcpu_runnable(v) == v->is_running )
+        return NULL;
+
+    if ( vcpu_runnable(v) )
+    {
+        if ( is_idle_vcpu(vprev) )
+        {
+            vcpu_runstate_change(vprev, RUNSTATE_runnable, now);
+            vprev->sched_unit = get_sched_res(cpu)->sched_unit_idle;
+        }
+        vcpu_runstate_change(v, RUNSTATE_running, now);
+    }
+    else
+    {
+        /* Make sure not to switch the last vcpu of a unit away. */
+        if ( unit_running(v->sched_unit) == 1 )
+            return NULL;
+
+        v->new_state = vcpu_runstate_blocked(v);
+        vcpu_runstate_change(v, v->new_state, now);
+        v = sched_unit2vcpu_cpu(vprev->sched_unit, cpu);
+        if ( v != vprev )
+        {
+            if ( is_idle_vcpu(vprev) )
+            {
+                vcpu_runstate_change(vprev, RUNSTATE_runnable, now);
+                vprev->sched_unit = get_sched_res(cpu)->sched_unit_idle;
+            }
+            else
+            {
+                v->sched_unit = vprev->sched_unit;
+                vcpu_runstate_change(v, RUNSTATE_running, now);
+            }
+        }
+    }
+
+    /* This vcpu will be switched to. */
+    v->is_running = true;
+
+    /* Make sure not to lose another slave call. */
+    raise_softirq(SCHED_SLAVE_SOFTIRQ);
+
+    return v;
+}
+
 /*
  * Rendezvous before taking a scheduling decision.
  * Called with schedule lock held, so all accesses to the rendezvous counter
@@ -2042,6 +2122,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
                                                    s_time_t now)
 {
     struct sched_unit *next;
+    struct vcpu *v;
 
     if ( !--prev->rendezvous_in_cnt )
     {
@@ -2050,8 +2131,28 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
         return next;
     }
 
+    v = unit2vcpu_cpu(prev, cpu);
     while ( prev->rendezvous_in_cnt )
     {
+        if ( v && v->force_context_switch )
+        {
+            struct vcpu *vprev = current;
+
+            v = sched_force_context_switch(vprev, v, cpu, now);
+
+            if ( v )
+            {
+                /* We'll come back another time, so adjust rendezvous_in_cnt. */
+                prev->rendezvous_in_cnt++;
+                atomic_set(&prev->rendezvous_out_cnt, 0);
+
+                pcpu_schedule_unlock_irq(*lock, cpu);
+
+                sched_context_switch(vprev, v, false, now);
+            }
+
+            v = unit2vcpu_cpu(prev, cpu);
+        }
         /*
          * Coming from idle might need to do tasklet work.
          * In order to avoid deadlocks we can't do that here, but have to
@@ -2086,10 +2187,11 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
 
 static void sched_slave(void)
 {
-    struct vcpu          *vprev = current;
+    struct vcpu          *v, *vprev = current;
     struct sched_unit    *prev = vprev->sched_unit, *next;
     s_time_t              now;
     spinlock_t           *lock;
+    bool                  do_softirq = false;
     unsigned int          cpu = smp_processor_id();
 
     ASSERT_NOT_IN_ATOMIC();
@@ -2098,9 +2200,29 @@ static void sched_slave(void)
 
     now = NOW();
 
+    v = unit2vcpu_cpu(prev, cpu);
+    if ( v && v->force_context_switch )
+    {
+        v = sched_force_context_switch(vprev, v, cpu, now);
+
+        if ( v )
+        {
+            pcpu_schedule_unlock_irq(lock, cpu);
+
+            sched_context_switch(vprev, v, false, now);
+        }
+
+        do_softirq = true;
+    }
+
     if ( !prev->rendezvous_in_cnt )
     {
         pcpu_schedule_unlock_irq(lock, cpu);
+
+        /* Check for failed forced context switch. */
+        if ( do_softirq )
+            raise_softirq(SCHEDULE_SOFTIRQ);
+
         return;
     }
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 41a1083a08..021c1d7c2c 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -102,6 +102,11 @@ static inline bool unit_runnable(const struct sched_unit *unit)
     return false;
 }
 
+static inline int vcpu_runstate_blocked(const struct vcpu *v)
+{
+    return (v->pause_flags & VPF_blocked) ? RUNSTATE_blocked : RUNSTATE_offline;
+}
+
 /*
  * Returns whether a sched_unit is runnable and sets new_state for each of its
  * vcpus. It is mandatory to determine the new runstate for all vcpus of a unit
@@ -121,9 +126,7 @@ static inline bool unit_runnable_state(const struct sched_unit *unit)
     {
         runnable = vcpu_runnable(v);
 
-        v->new_state = runnable ? RUNSTATE_running
-                                : (v->pause_flags & VPF_blocked)
-                                  ? RUNSTATE_blocked : RUNSTATE_offline;
+        v->new_state = runnable ? RUNSTATE_running : vcpu_runstate_blocked(v);
 
         if ( runnable )
             ret = true;
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ce4329db72..f97303668a 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -186,6 +186,8 @@ struct vcpu
     bool             is_running;
     /* VCPU should wake fast (do not deep sleep the CPU). */
     bool             is_urgent;
+    /* VCPU must context_switch without scheduling unit. */
+    bool             force_context_switch;
 
 #ifdef VCPU_TRAP_LAST
 #define VCPU_TRAP_NONE    0
-- 
2.16.4



* [Xen-devel] [PATCH v5 09/19] xen/sched: move per-cpu variable scheduler to struct sched_resource
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (7 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 10/19] xen/sched: move per-cpu variable cpupool " Juergen Gross
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

Having a pointer to struct scheduler in struct sched_resource instead
of a per-cpu variable is enough.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
V4:
- several renames sd -> sr (Jan Beulich)
- use ops instead or sr->scheduler (Jan Beulich)
---
 xen/common/sched_credit.c  | 18 +++++++++++-------
 xen/common/sched_credit2.c |  3 ++-
 xen/common/schedule.c      | 15 +++++++--------
 xen/include/xen/sched-if.h |  2 +-
 4 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a6dff8ec62..86603adcb6 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -352,9 +352,10 @@ DEFINE_PER_CPU(unsigned int, last_tickle_cpu);
 static inline void __runq_tickle(struct csched_unit *new)
 {
     unsigned int cpu = sched_unit_master(new->unit);
+    struct sched_resource *sr = get_sched_res(cpu);
     struct sched_unit *unit = new->unit;
     struct csched_unit * const cur = CSCHED_UNIT(curr_on_cpu(cpu));
-    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_private *prv = CSCHED_PRIV(sr->scheduler);
     cpumask_t mask, idle_mask, *online;
     int balance_step, idlers_empty;
 
@@ -931,7 +932,8 @@ csched_unit_acct(struct csched_private *prv, unsigned int cpu)
 {
     struct sched_unit *currunit = current->sched_unit;
     struct csched_unit * const svc = CSCHED_UNIT(currunit);
-    const struct scheduler *ops = per_cpu(scheduler, cpu);
+    struct sched_resource *sr = get_sched_res(cpu);
+    const struct scheduler *ops = sr->scheduler;
 
     ASSERT( sched_unit_master(currunit) == cpu );
     ASSERT( svc->sdom != NULL );
@@ -987,8 +989,7 @@ csched_unit_acct(struct csched_private *prv, unsigned int cpu)
              * idlers. But, if we are here, it means there is someone running
              * on it, and hence the bit must be zero already.
              */
-            ASSERT(!cpumask_test_cpu(cpu,
-                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+            ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(ops)->idlers));
             cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
         }
     }
@@ -1083,6 +1084,7 @@ csched_unit_sleep(const struct scheduler *ops, struct sched_unit *unit)
 {
     struct csched_unit * const svc = CSCHED_UNIT(unit);
     unsigned int cpu = sched_unit_master(unit);
+    struct sched_resource *sr = get_sched_res(cpu);
 
     SCHED_STAT_CRANK(unit_sleep);
 
@@ -1095,7 +1097,7 @@ csched_unit_sleep(const struct scheduler *ops, struct sched_unit *unit)
          * But, we are here because unit is going to sleep while running on cpu,
          * so the bit must be zero already.
          */
-        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(sr->scheduler)->idlers));
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __unit_on_runq(svc) )
@@ -1575,8 +1577,9 @@ static void
 csched_tick(void *_cpu)
 {
     unsigned int cpu = (unsigned long)_cpu;
+    struct sched_resource *sr = get_sched_res(cpu);
     struct csched_pcpu *spc = CSCHED_PCPU(cpu);
-    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_private *prv = CSCHED_PRIV(sr->scheduler);
 
     spc->tick++;
 
@@ -1601,7 +1604,8 @@ csched_tick(void *_cpu)
 static struct csched_unit *
 csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
-    const struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct sched_resource *sr = get_sched_res(cpu);
+    const struct csched_private * const prv = CSCHED_PRIV(sr->scheduler);
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
     struct csched_unit *speer;
     struct list_head *iter;
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index d51df05887..af58ee161d 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3268,8 +3268,9 @@ runq_candidate(struct csched2_runqueue_data *rqd,
                unsigned int *skipped)
 {
     struct list_head *iter, *temp;
+    struct sched_resource *sr = get_sched_res(cpu);
     struct csched2_unit *snext = NULL;
-    struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
+    struct csched2_private *prv = csched2_priv(sr->scheduler);
     bool yield = false, soft_aff_preempt = false;
 
     *skipped = 0;
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 9442be1c83..5e9cee1f82 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -75,7 +75,6 @@ static void vcpu_singleshot_timer_fn(void *data);
 static void poll_timer_fn(void *data);
 
 /* This is global for now so that private implementations can reach it */
-DEFINE_PER_CPU(struct scheduler *, scheduler);
 DEFINE_PER_CPU_READ_MOSTLY(struct sched_resource *, sched_res);
 static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, sched_res_idx);
 
@@ -200,7 +199,7 @@ static inline struct scheduler *unit_scheduler(const struct sched_unit *unit)
      */
 
     ASSERT(is_idle_domain(d));
-    return per_cpu(scheduler, unit->res->master_cpu);
+    return unit->res->scheduler;
 }
 
 static inline struct scheduler *vcpu_scheduler(const struct vcpu *v)
@@ -1921,8 +1920,8 @@ static bool sched_tasklet_check(unsigned int cpu)
 static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
                                       unsigned int cpu)
 {
-    struct scheduler *sched = per_cpu(scheduler, cpu);
     struct sched_resource *sr = get_sched_res(cpu);
+    struct scheduler *sched = sr->scheduler;
     struct sched_unit *next;
 
     /* get policy-specific decision on scheduling... */
@@ -2342,7 +2341,7 @@ static int cpu_schedule_up(unsigned int cpu)
     sr->cpus = cpumask_of(cpu);
     set_sched_res(cpu, sr);
 
-    per_cpu(scheduler, cpu) = &sched_idle_ops;
+    sr->scheduler = &sched_idle_ops;
     spin_lock_init(&sr->_lock);
     sr->schedule_lock = &sched_free_cpu_lock;
     init_timer(&sr->s_timer, s_timer_fn, NULL, cpu);
@@ -2553,7 +2552,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
-    struct scheduler *old_ops = per_cpu(scheduler, cpu);
+    struct scheduler *old_ops = get_sched_res(cpu)->scheduler;
     struct scheduler *new_ops = (c == NULL) ? &sched_idle_ops : c->sched;
     struct cpupool *old_pool = per_cpu(cpupool, cpu);
     struct sched_resource *sd = get_sched_res(cpu);
@@ -2617,7 +2616,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     ppriv_old = sd->sched_priv;
     new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
 
-    per_cpu(scheduler, cpu) = new_ops;
+    sd->scheduler = new_ops;
     sd->sched_priv = ppriv;
 
     /*
@@ -2717,7 +2716,7 @@ void sched_tick_suspend(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
-    sched = per_cpu(scheduler, cpu);
+    sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_suspend(sched, cpu);
     rcu_idle_enter(cpu);
     rcu_idle_timer_start();
@@ -2730,7 +2729,7 @@ void sched_tick_resume(void)
 
     rcu_idle_timer_stop();
     rcu_idle_exit(cpu);
-    sched = per_cpu(scheduler, cpu);
+    sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_resume(sched, cpu);
 }
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 021c1d7c2c..01821b3e5b 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -36,6 +36,7 @@ extern const cpumask_t *sched_res_mask;
  * as the rest of the struct.  Just have the scheduler point to the
  * one it wants (This may be the one right in front of it).*/
 struct sched_resource {
+    struct scheduler   *scheduler;
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_unit  *curr;
@@ -49,7 +50,6 @@ struct sched_resource {
     const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
-DECLARE_PER_CPU(struct scheduler *, scheduler);
 DECLARE_PER_CPU(struct cpupool *, cpupool);
 DECLARE_PER_CPU(struct sched_resource *, sched_res);
 
-- 
2.16.4



* [Xen-devel] [PATCH v5 10/19] xen/sched: move per-cpu variable cpupool to struct sched_resource
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (8 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 09/19] xen/sched: move per-cpu variable scheduler to struct sched_resource Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 11/19] xen/sched: reject switching smt on/off with core scheduling active Juergen Gross
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Meng Xu, Jan Beulich

Having a pointer to struct cpupool in struct sched_resource instead
of a per-cpu variable is enough.
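
In practice every former per_cpu(cpupool, cpu) lookup becomes a field
access through the cpu's scheduling resource; a minimal before/after
sketch (both forms taken from the hunks below):

    /* Before: dedicated per-cpu variable. */
    struct cpupool *c = per_cpu(cpupool, cpu);

    /* After: field of the per-cpu scheduling resource. */
    struct cpupool *c = get_sched_res(cpu)->cpupool;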

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
---
 xen/common/cpupool.c       | 4 +---
 xen/common/sched_credit.c  | 2 +-
 xen/common/sched_rt.c      | 2 +-
 xen/common/schedule.c      | 8 ++++----
 xen/include/xen/sched-if.h | 2 +-
 5 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 441a26f16c..60a85f50e1 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -34,8 +34,6 @@ static cpumask_t cpupool_locked_cpus;
 
 static DEFINE_SPINLOCK(cpupool_lock);
 
-DEFINE_PER_CPU(struct cpupool *, cpupool);
-
 static void free_cpupool_struct(struct cpupool *c)
 {
     if ( c )
@@ -504,7 +502,7 @@ static int cpupool_cpu_add(unsigned int cpu)
      * (or unplugging would have failed) and that is the default behavior
      * anyway.
      */
-    per_cpu(cpupool, cpu) = NULL;
+    get_sched_res(cpu)->cpupool = NULL;
     ret = cpupool_assign_cpu_locked(cpupool0, cpu);
 
     spin_unlock(&cpupool_lock);
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 86603adcb6..31fdcd6a2f 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1681,7 +1681,7 @@ static struct csched_unit *
 csched_load_balance(struct csched_private *prv, int cpu,
     struct csched_unit *snext, bool *stolen)
 {
-    struct cpupool *c = per_cpu(cpupool, cpu);
+    struct cpupool *c = get_sched_res(cpu)->cpupool;
     struct csched_unit *speer;
     cpumask_t workers;
     cpumask_t *online;
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index d21c416cae..6e93e50acb 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -774,7 +774,7 @@ rt_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     if ( prv->repl_timer.cpu == cpu )
     {
-        struct cpupool *c = per_cpu(cpupool, cpu);
+        struct cpupool *c = get_sched_res(cpu)->cpupool;
         unsigned int new_cpu = cpumask_cycle(cpu, cpupool_online_cpumask(c));
 
         /*
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 5e9cee1f82..249ff8a882 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1120,7 +1120,7 @@ int cpu_disable_scheduler(unsigned int cpu)
     cpumask_t online_affinity;
     int ret = 0;
 
-    c = per_cpu(cpupool, cpu);
+    c = get_sched_res(cpu)->cpupool;
     if ( c == NULL )
         return ret;
 
@@ -1189,7 +1189,7 @@ static int cpu_disable_scheduler_check(unsigned int cpu)
     struct vcpu *v;
     struct cpupool *c;
 
-    c = per_cpu(cpupool, cpu);
+    c = get_sched_res(cpu)->cpupool;
     if ( c == NULL )
         return 0;
 
@@ -2554,8 +2554,8 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
     struct scheduler *old_ops = get_sched_res(cpu)->scheduler;
     struct scheduler *new_ops = (c == NULL) ? &sched_idle_ops : c->sched;
-    struct cpupool *old_pool = per_cpu(cpupool, cpu);
     struct sched_resource *sd = get_sched_res(cpu);
+    struct cpupool *old_pool = sd->cpupool;
     spinlock_t *old_lock, *new_lock;
     unsigned long flags;
 
@@ -2637,7 +2637,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     sched_free_udata(old_ops, vpriv_old);
     sched_free_pdata(old_ops, ppriv_old, cpu);
 
-    per_cpu(cpupool, cpu) = c;
+    get_sched_res(cpu)->cpupool = c;
     /* When a cpu is added to a pool, trigger it to go pick up some work */
     if ( c != NULL )
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 01821b3e5b..e675061290 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -37,6 +37,7 @@ extern const cpumask_t *sched_res_mask;
  * one it wants (This may be the one right in front of it).*/
 struct sched_resource {
     struct scheduler   *scheduler;
+    struct cpupool     *cpupool;
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_unit  *curr;
@@ -50,7 +51,6 @@ struct sched_resource {
     const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
-DECLARE_PER_CPU(struct cpupool *, cpupool);
 DECLARE_PER_CPU(struct sched_resource *, sched_res);
 
 static inline struct sched_resource *get_sched_res(unsigned int cpu)
-- 
2.16.4



* [Xen-devel] [PATCH v5 11/19] xen/sched: reject switching smt on/off with core scheduling active
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (9 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 10/19] xen/sched: move per-cpu variable cpupool " Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 12/19] xen/sched: prepare per-cpupool scheduling granularity Juergen Gross
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli,
	Roger Pau Monné

When core or socket scheduling is active, enabling or disabling SMT is
not possible, as that would require a major host reconfiguration.

Add a bool sched_disable_smt_switching which will be set when core or
socket scheduling is active.
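
The flag is only declared and checked here; setting it happens later
in the series. A hypothetical sketch of that assignment, using the
enum a later patch introduces (placement illustrative only):

    /* Hypothetical: where the sched-gran= option is evaluated. */
    if ( opt_sched_granularity != SCHED_GRAN_cpu )
        sched_disable_smt_switching = true;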

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Dario Faggioli <dfaggioli@suse.com>
---
V1:
- new patch
V2:
- EBUSY as return code (Jan Beulich, Dario Faggioli)
- __read_mostly for sched_disable_smt_switching (Jan Beulich)
---
 xen/arch/x86/sysctl.c   | 5 +++++
 xen/common/schedule.c   | 1 +
 xen/include/xen/sched.h | 1 +
 3 files changed, 7 insertions(+)

diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index 3742ede61b..4a76f0f47f 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -209,6 +209,11 @@ long arch_do_sysctl(
                 ret = -EOPNOTSUPP;
                 break;
             }
+            if ( sched_disable_smt_switching )
+            {
+                ret = -EBUSY;
+                break;
+            }
             plug = op == XEN_SYSCTL_CPU_HOTPLUG_SMT_ENABLE;
             fn = smt_up_down_helper;
             hcpu = _p(plug);
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 249ff8a882..0dcf004d78 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -63,6 +63,7 @@ integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Number of vcpus per struct sched_unit. */
 static unsigned int __read_mostly sched_granularity = 1;
+bool __read_mostly sched_disable_smt_switching;
 const cpumask_t *sched_res_mask = &cpumask_all;
 
 /* Common lock for free cpus. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index f97303668a..aa8257edc9 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1037,6 +1037,7 @@ static inline bool is_iommu_enabled(const struct domain *d)
 }
 
 extern bool sched_smt_power_savings;
+extern bool sched_disable_smt_switching;
 
 extern enum cpufreq_controller {
     FREQCTL_none, FREQCTL_dom0_kernel, FREQCTL_xen
-- 
2.16.4



* [Xen-devel] [PATCH v5 12/19] xen/sched: prepare per-cpupool scheduling granularity
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (10 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 11/19] xen/sched: reject switching smt on/off with core scheduling active Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 13/19] xen/sched: split schedule_cpu_switch() Juergen Gross
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

On- and offlining cpus with core scheduling active is rather
complicated, as the cpus are taken on- or offline one by one, while the
scheduler wants to handle them per core.

As the future plan is to be able to select the scheduling granularity
per cpupool, prepare for that by storing the granularity in struct
sched_resource (we need it there for free cpus, which are not
associated with any cpupool). Free cpus will always use granularity 1.

Store the selected granularity option (cpu, core or socket) in the
cpupool, as we will need it to select the appropriate cpu mask when
populating the cpupool with cpus.

This will make on- and offlining of cpus much easier and avoid writing
code which would need to be thrown away later.

Move the granularity related variables to cpupool.c, as they are now
used from there only.
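
The visible effect of the granularity is how vcpus are grouped into
schedule units; a standalone sketch of the mapping sched_alloc_unit()
relies on (illustrative only, each unit taking the id of its first
vcpu):

    #include <stdio.h>

    int main(void)
    {
        unsigned int gran = 2;  /* e.g. core granularity, 2 threads/core */
        unsigned int vcpu;

        for ( vcpu = 0; vcpu < 8; vcpu++ )
            printf("vcpu %u -> unit %u\n", vcpu, (vcpu / gran) * gran);

        return 0;
    }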

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
V4:
- move opt_sched_granularity and sched_granularity to cpupool.c
  (Jan Beulich)
- rename c->opt_sched_granularity, drop c->granularity (Jan Beulich)
---
 xen/common/cpupool.c       |  9 +++++++++
 xen/common/schedule.c      | 27 ++++++++++++++++-----------
 xen/include/xen/sched-if.h | 11 +++++++++++
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 60a85f50e1..51f0ff0d88 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -34,6 +34,14 @@ static cpumask_t cpupool_locked_cpus;
 
 static DEFINE_SPINLOCK(cpupool_lock);
 
+static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
+static unsigned int __read_mostly sched_granularity = 1;
+
+unsigned int cpupool_get_granularity(const struct cpupool *c)
+{
+    return c ? sched_granularity : 1;
+}
+
 static void free_cpupool_struct(struct cpupool *c)
 {
     if ( c )
@@ -173,6 +181,7 @@ static struct cpupool *cpupool_create(
             return NULL;
         }
     }
+    c->gran = opt_sched_granularity;
 
     *q = c;
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 0dcf004d78..5257225050 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -62,7 +62,6 @@ int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Number of vcpus per struct sched_unit. */
-static unsigned int __read_mostly sched_granularity = 1;
 bool __read_mostly sched_disable_smt_switching;
 const cpumask_t *sched_res_mask = &cpumask_all;
 
@@ -435,10 +434,10 @@ static struct sched_unit *sched_alloc_unit(struct vcpu *v)
 {
     struct sched_unit *unit, **prev_unit;
     struct domain *d = v->domain;
+    unsigned int gran = cpupool_get_granularity(d->cpupool);
 
     for_each_sched_unit ( d, unit )
-        if ( unit->unit_id / sched_granularity ==
-             v->vcpu_id / sched_granularity )
+        if ( unit->unit_id / gran == v->vcpu_id / gran )
             break;
 
     if ( unit )
@@ -593,6 +592,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
     void *unitdata;
     struct scheduler *old_ops;
     void *old_domdata;
+    unsigned int gran = cpupool_get_granularity(c);
 
     for_each_vcpu ( d, v )
     {
@@ -604,8 +604,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
     if ( IS_ERR(domdata) )
         return PTR_ERR(domdata);
 
-    unit_priv = xzalloc_array(void *,
-                              DIV_ROUND_UP(d->max_vcpus, sched_granularity));
+    unit_priv = xzalloc_array(void *, DIV_ROUND_UP(d->max_vcpus, gran));
     if ( unit_priv == NULL )
     {
         sched_free_domdata(c->sched, domdata);
@@ -1850,11 +1849,11 @@ static void sched_switch_units(struct sched_resource *sr,
         if ( is_idle_unit(prev) )
         {
             prev->runstate_cnt[RUNSTATE_running] = 0;
-            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+            prev->runstate_cnt[RUNSTATE_runnable] = sr->granularity;
         }
         if ( is_idle_unit(next) )
         {
-            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_running] = sr->granularity;
             next->runstate_cnt[RUNSTATE_runnable] = 0;
         }
     }
@@ -2003,7 +2002,7 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
     else
     {
         vcpu_context_saved(vprev, vnext);
-        if ( sched_granularity == 1 )
+        if ( sr->granularity == 1 )
             unit_context_saved(sr);
     }
 
@@ -2123,11 +2122,12 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
 {
     struct sched_unit *next;
     struct vcpu *v;
+    unsigned int gran = get_sched_res(cpu)->granularity;
 
     if ( !--prev->rendezvous_in_cnt )
     {
         next = do_schedule(prev, now, cpu);
-        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        atomic_set(&next->rendezvous_out_cnt, gran + 1);
         return next;
     }
 
@@ -2251,6 +2251,7 @@ static void schedule(void)
     struct sched_resource *sr;
     spinlock_t           *lock;
     int cpu = smp_processor_id();
+    unsigned int          gran = get_sched_res(cpu)->granularity;
 
     ASSERT_NOT_IN_ATOMIC();
 
@@ -2276,11 +2277,11 @@ static void schedule(void)
 
     now = NOW();
 
-    if ( sched_granularity > 1 )
+    if ( gran > 1 )
     {
         cpumask_t mask;
 
-        prev->rendezvous_in_cnt = sched_granularity;
+        prev->rendezvous_in_cnt = gran;
         cpumask_andnot(&mask, sr->cpus, cpumask_of(cpu));
         cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
         next = sched_wait_rendezvous_in(prev, &lock, cpu, now);
@@ -2348,6 +2349,9 @@ static int cpu_schedule_up(unsigned int cpu)
     init_timer(&sr->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&per_cpu(sched_urgent_count, cpu), 0);
 
+    /* We start with cpu granularity. */
+    sr->granularity = 1;
+
     /* Boot CPU is dealt with later in scheduler_init(). */
     if ( cpu == 0 )
         return 0;
@@ -2638,6 +2642,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
     sched_free_udata(old_ops, vpriv_old);
     sched_free_pdata(old_ops, ppriv_old, cpu);
 
+    get_sched_res(cpu)->granularity = cpupool_get_granularity(c);
     get_sched_res(cpu)->cpupool = c;
     /* When a cpu is added to a pool, trigger it to go pick up some work */
     if ( c != NULL )
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index e675061290..f8f0f484cb 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -25,6 +25,13 @@ extern int sched_ratelimit_us;
 /* Scheduling resource mask. */
 extern const cpumask_t *sched_res_mask;
 
+/* Number of vcpus per struct sched_unit. */
+enum sched_gran {
+    SCHED_GRAN_cpu,
+    SCHED_GRAN_core,
+    SCHED_GRAN_socket
+};
+
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
  * we have a per-cpu pointer, along with a pre-allocated set of
@@ -48,6 +55,7 @@ struct sched_resource {
 
     /* Cpu with lowest id in scheduling resource. */
     unsigned int        master_cpu;
+    unsigned int        granularity;
     const cpumask_t    *cpus;           /* cpus covered by this struct     */
 };
 
@@ -546,6 +554,7 @@ struct cpupool
     struct cpupool   *next;
     struct scheduler *sched;
     atomic_t         refcnt;
+    enum sched_gran  gran;
 };
 
 #define cpupool_online_cpumask(_pool) \
@@ -561,6 +570,8 @@ static inline cpumask_t *cpupool_domain_master_cpumask(const struct domain *d)
     return d->cpupool->res_valid;
 }
 
+unsigned int cpupool_get_granularity(const struct cpupool *c);
+
 /*
  * Hard and soft affinity load balancing.
  *
-- 
2.16.4



* [Xen-devel] [PATCH v5 13/19] xen/sched: split schedule_cpu_switch()
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (11 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 12/19] xen/sched: prepare per-cpupool scheduling granularity Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 14/19] xen/sched: protect scheduling resource via rcu Juergen Gross
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

Instead of letting schedule_cpu_switch() handle moving cpus from and
to cpupools, split it into schedule_cpu_add() and schedule_cpu_rm().

This will allow us to drop allocating/freeing scheduler data for free
cpus, as the idle scheduler doesn't need such data.
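
From the cpupool side the two halves pair up naturally; a minimal
sketch of the callers (matching the cpupool.c hunks below, error
handling omitted):

    /* Assign a free cpu to pool c: */
    ret = schedule_cpu_add(cpu, c);

    /* Take it out again, back to the idle scheduler: */
    ret = schedule_cpu_rm(cpu);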

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
V4:
- rename sd -> sr (Jan Beulich)
---
 xen/common/cpupool.c    |   4 +-
 xen/common/schedule.c   | 133 +++++++++++++++++++++++++++---------------------
 xen/include/xen/sched.h |   3 +-
 3 files changed, 78 insertions(+), 62 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 51f0ff0d88..02825e779d 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -271,7 +271,7 @@ static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
 
     if ( (cpupool_moving_cpu == cpu) && (c != cpupool_cpu_moving) )
         return -EADDRNOTAVAIL;
-    ret = schedule_cpu_switch(cpu, c);
+    ret = schedule_cpu_add(cpu, c);
     if ( ret )
         return ret;
 
@@ -321,7 +321,7 @@ static int cpupool_unassign_cpu_finish(struct cpupool *c)
      */
     if ( !ret )
     {
-        ret = schedule_cpu_switch(cpu, NULL);
+        ret = schedule_cpu_rm(cpu);
         if ( ret )
             cpumask_clear_cpu(cpu, &cpupool_free_cpus);
         else
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 5257225050..a96fc82282 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -93,15 +93,6 @@ static struct scheduler __read_mostly ops;
 static void sched_set_affinity(
     struct sched_unit *unit, const cpumask_t *hard, const cpumask_t *soft);
 
-static spinlock_t *
-sched_idle_switch_sched(struct scheduler *new_ops, unsigned int cpu,
-                        void *pdata, void *vdata)
-{
-    sched_idle_unit(cpu)->priv = NULL;
-
-    return &sched_free_cpu_lock;
-}
-
 static struct sched_resource *
 sched_idle_res_pick(const struct scheduler *ops, const struct sched_unit *unit)
 {
@@ -141,7 +132,6 @@ static struct scheduler sched_idle_ops = {
 
     .alloc_udata    = sched_idle_alloc_udata,
     .free_udata     = sched_idle_free_udata,
-    .switch_sched   = sched_idle_switch_sched,
 };
 
 static inline struct vcpu *unit2vcpu_cpu(const struct sched_unit *unit,
@@ -2547,36 +2537,22 @@ void __init scheduler_init(void)
 }
 
 /*
- * Move a pCPU outside of the influence of the scheduler of its current
- * cpupool, or subject it to the scheduler of a new cpupool.
- *
- * For the pCPUs that are removed from their cpupool, their scheduler becomes
- * &sched_idle_ops (the idle scheduler).
+ * Move a pCPU from free cpus (running the idle scheduler) to a cpupool
+ * using any "real" scheduler.
+ * The cpu is still marked as "free" and not yet valid for its cpupool.
  */
-int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
+int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
 {
     struct vcpu *idle;
-    void *ppriv, *ppriv_old, *vpriv, *vpriv_old;
-    struct scheduler *old_ops = get_sched_res(cpu)->scheduler;
-    struct scheduler *new_ops = (c == NULL) ? &sched_idle_ops : c->sched;
-    struct sched_resource *sd = get_sched_res(cpu);
-    struct cpupool *old_pool = sd->cpupool;
+    void *ppriv, *vpriv;
+    struct scheduler *new_ops = c->sched;
+    struct sched_resource *sr = get_sched_res(cpu);
     spinlock_t *old_lock, *new_lock;
     unsigned long flags;
 
-    /*
-     * pCPUs only move from a valid cpupool to free (i.e., out of any pool),
-     * or from free to a valid cpupool. In the former case (which happens when
-     * c is NULL), we want the CPU to have been marked as free already, as
-     * well as to not be valid for the source pool any longer, when we get to
-     * here. In the latter case (which happens when c is a valid cpupool), we
-     * want the CPU to still be marked as free, as well as to not yet be valid
-     * for the destination pool.
-     */
-    ASSERT(c != old_pool && (c != NULL || old_pool != NULL));
     ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
-    ASSERT((c == NULL && !cpumask_test_cpu(cpu, old_pool->cpu_valid)) ||
-           (c != NULL && !cpumask_test_cpu(cpu, c->cpu_valid)));
+    ASSERT(!cpumask_test_cpu(cpu, c->cpu_valid));
+    ASSERT(get_sched_res(cpu)->cpupool == NULL);
 
     /*
      * To setup the cpu for the new scheduler we need:
@@ -2601,52 +2577,91 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
         return -ENOMEM;
     }
 
-    sched_do_tick_suspend(old_ops, cpu);
-
     /*
-     * The actual switch, including (if necessary) the rerouting of the
-     * scheduler lock to whatever new_ops prefers,  needs to happen in one
-     * critical section, protected by old_ops' lock, or races are possible.
-     * It is, in fact, the lock of another scheduler that we are taking (the
-     * scheduler of the cpupool that cpu still belongs to). But that is ok
-     * as, anyone trying to schedule on this cpu will spin until when we
-     * release that lock (bottom of this function). When he'll get the lock
-     * --thanks to the loop inside *_schedule_lock() functions-- he'll notice
-     * that the lock itself changed, and retry acquiring the new one (which
-     * will be the correct, remapped one, at that point).
+     * The actual switch, including the rerouting of the scheduler lock to
+     * whatever new_ops prefers, needs to happen in one critical section,
+     * protected by old_ops' lock, or races are possible.
+     * It is, in fact, the lock of the idle scheduler that we are taking.
+     * But that is ok as anyone trying to schedule on this cpu will spin until
+     * when we release that lock (bottom of this function). When he'll get the
+     * lock --thanks to the loop inside *_schedule_lock() functions-- he'll
+     * notice that the lock itself changed, and retry acquiring the new one
+     * (which will be the correct, remapped one, at that point).
      */
     old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
 
-    vpriv_old = idle->sched_unit->priv;
-    ppriv_old = sd->sched_priv;
     new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
 
-    sd->scheduler = new_ops;
-    sd->sched_priv = ppriv;
+    sr->scheduler = new_ops;
+    sr->sched_priv = ppriv;
 
     /*
-     * The data above is protected under new_lock, which may be unlocked.
-     * Another CPU can take new_lock as soon as sd->schedule_lock is visible,
-     * and must observe all prior initialisation.
+     * Reroute the lock to the per pCPU lock as /last/ thing. In fact,
+     * if it is free (and it can be) we want that anyone that manages
+     * taking it, finds all the initializations we've done above in place.
      */
     smp_wmb();
-    sd->schedule_lock = new_lock;
+    sr->schedule_lock = new_lock;
 
-    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock has changed! */
     spin_unlock_irqrestore(old_lock, flags);
 
     sched_do_tick_resume(new_ops, cpu);
 
+    sr->granularity = cpupool_get_granularity(c);
+    sr->cpupool = c;
+    /* The  cpu is added to a pool, trigger it to go pick up some work */
+    cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+
+    return 0;
+}
+
+/*
+ * Remove a pCPU from its cpupool. Its scheduler becomes &sched_idle_ops
+ * (the idle scheduler).
+ * The cpu is already marked as "free" and not valid any longer for its
+ * cpupool.
+ */
+int schedule_cpu_rm(unsigned int cpu)
+{
+    struct vcpu *idle;
+    void *ppriv_old, *vpriv_old;
+    struct sched_resource *sr = get_sched_res(cpu);
+    struct scheduler *old_ops = sr->scheduler;
+    spinlock_t *old_lock;
+    unsigned long flags;
+
+    ASSERT(sr->cpupool != NULL);
+    ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
+    ASSERT(!cpumask_test_cpu(cpu, sr->cpupool->cpu_valid));
+
+    idle = idle_vcpu[cpu];
+
+    sched_do_tick_suspend(old_ops, cpu);
+
+    /* See comment in schedule_cpu_add() regarding lock switching. */
+    old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
+
+    vpriv_old = idle->sched_unit->priv;
+    ppriv_old = sr->sched_priv;
+
+    idle->sched_unit->priv = NULL;
+    sr->scheduler = &sched_idle_ops;
+    sr->sched_priv = NULL;
+
+    smp_mb();
+    sr->schedule_lock = &sched_free_cpu_lock;
+
+    /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
+    spin_unlock_irqrestore(old_lock, flags);
+
     sched_deinit_pdata(old_ops, ppriv_old, cpu);
 
     sched_free_udata(old_ops, vpriv_old);
     sched_free_pdata(old_ops, ppriv_old, cpu);
 
-    get_sched_res(cpu)->granularity = cpupool_get_granularity(c);
-    get_sched_res(cpu)->cpupool = c;
-    /* When a cpu is added to a pool, trigger it to go pick up some work */
-    if ( c != NULL )
-        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+    sr->granularity = 1;
+    sr->cpupool = NULL;
 
     return 0;
 }
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index aa8257edc9..a40bd5fb56 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -920,7 +920,8 @@ struct scheduler;
 struct scheduler *scheduler_get_default(void);
 struct scheduler *scheduler_alloc(unsigned int sched_id, int *perr);
 void scheduler_free(struct scheduler *sched);
-int schedule_cpu_switch(unsigned int cpu, struct cpupool *c);
+int schedule_cpu_add(unsigned int cpu, struct cpupool *c);
+int schedule_cpu_rm(unsigned int cpu);
 void vcpu_set_periodic_timer(struct vcpu *v, s_time_t value);
 int cpu_disable_scheduler(unsigned int cpu);
 void sched_setup_dom0_vcpus(struct domain *d);
-- 
2.16.4



* [Xen-devel] [PATCH v5 14/19] xen/sched: protect scheduling resource via rcu
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (12 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 13/19] xen/sched: split schedule_cpu_switch() Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 15/19] xen/sched: support multiple cpus per scheduling resource Juergen Gross
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

In order to be able to move cpus to cpupools with core scheduling
active, it is mandatory to merge multiple cpus into one scheduling
resource, or to split a scheduling resource with multiple cpus in it
into multiple scheduling resources. This in turn requires modifying
the cpu <-> scheduling resource relation. In order to be able to free
unused resources, protect struct sched_resource via RCU. This ensures
there are no users left when freeing such a resource.
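
Readers of the per-cpu sched_res pointer then follow the standard RCU
pattern; a minimal sketch (as used throughout the hunks below):

    rcu_read_lock(&sched_res_rculock);

    sr = get_sched_res(cpu);   /* now an rcu_dereference() internally */
    /* ... use sr: it can't be freed while the read side is held ... */

    rcu_read_unlock(&sched_res_rculock);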

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
---
 xen/common/cpupool.c       |   4 +
 xen/common/schedule.c      | 187 ++++++++++++++++++++++++++++++++++++++++-----
 xen/include/xen/sched-if.h |   7 +-
 3 files changed, 178 insertions(+), 20 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 02825e779d..7228ca84b4 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -511,8 +511,10 @@ static int cpupool_cpu_add(unsigned int cpu)
      * (or unplugging would have failed) and that is the default behavior
      * anyway.
      */
+    rcu_read_lock(&sched_res_rculock);
     get_sched_res(cpu)->cpupool = NULL;
     ret = cpupool_assign_cpu_locked(cpupool0, cpu);
+    rcu_read_unlock(&sched_res_rculock);
 
     spin_unlock(&cpupool_lock);
 
@@ -597,7 +599,9 @@ static void cpupool_cpu_remove_forced(unsigned int cpu)
         }
     }
 
+    rcu_read_lock(&sched_res_rculock);
     sched_rm_cpu(cpu);
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 /*
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index a96fc82282..1f23bf0e83 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -77,6 +77,7 @@ static void poll_timer_fn(void *data);
 /* This is global for now so that private implementations can reach it */
 DEFINE_PER_CPU_READ_MOSTLY(struct sched_resource *, sched_res);
 static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, sched_res_idx);
+DEFINE_RCU_READ_LOCK(sched_res_rculock);
 
 /* Scratch space for cpumasks. */
 DEFINE_PER_CPU(cpumask_t, cpumask_scratch);
@@ -300,10 +301,12 @@ void sched_guest_idle(void (*idle) (void), unsigned int cpu)
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 {
-    spinlock_t *lock = likely(v == current)
-                       ? NULL : unit_schedule_lock_irq(v->sched_unit);
+    spinlock_t *lock;
     s_time_t delta;
 
+    rcu_read_lock(&sched_res_rculock);
+
+    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(v->sched_unit);
     memcpy(runstate, &v->runstate, sizeof(*runstate));
     delta = NOW() - runstate->state_entry_time;
     if ( delta > 0 )
@@ -311,6 +314,8 @@ void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 
     if ( unlikely(lock != NULL) )
         unit_schedule_unlock_irq(lock, v->sched_unit);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 uint64_t get_cpu_idle_time(unsigned int cpu)
@@ -522,6 +527,8 @@ int sched_init_vcpu(struct vcpu *v)
         return 0;
     }
 
+    rcu_read_lock(&sched_res_rculock);
+
     /* The first vcpu of an unit can be set via sched_set_res(). */
     sched_set_res(unit, get_sched_res(processor));
 
@@ -529,6 +536,7 @@ int sched_init_vcpu(struct vcpu *v)
     if ( unit->priv == NULL )
     {
         sched_free_unit(unit, v);
+        rcu_read_unlock(&sched_res_rculock);
         return 1;
     }
 
@@ -555,6 +563,8 @@ int sched_init_vcpu(struct vcpu *v)
         sched_insert_unit(dom_scheduler(d), unit);
     }
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return 0;
 }
 
@@ -583,6 +593,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
     struct scheduler *old_ops;
     void *old_domdata;
     unsigned int gran = cpupool_get_granularity(c);
+    int ret = 0;
 
     for_each_vcpu ( d, v )
     {
@@ -590,15 +601,21 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
             return -EBUSY;
     }
 
+    rcu_read_lock(&sched_res_rculock);
+
     domdata = sched_alloc_domdata(c->sched, d);
     if ( IS_ERR(domdata) )
-        return PTR_ERR(domdata);
+    {
+        ret = PTR_ERR(domdata);
+        goto out;
+    }
 
     unit_priv = xzalloc_array(void *, DIV_ROUND_UP(d->max_vcpus, gran));
     if ( unit_priv == NULL )
     {
         sched_free_domdata(c->sched, domdata);
-        return -ENOMEM;
+        ret = -ENOMEM;
+        goto out;
     }
 
     unit_idx = 0;
@@ -611,7 +628,8 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
                 sched_free_udata(c->sched, unit_priv[unit_idx]);
             xfree(unit_priv);
             sched_free_domdata(c->sched, domdata);
-            return -ENOMEM;
+            ret = -ENOMEM;
+            goto out;
         }
         unit_idx++;
     }
@@ -677,7 +695,10 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
 
     xfree(unit_priv);
 
-    return 0;
+out:
+    rcu_read_unlock(&sched_res_rculock);
+
+    return ret;
 }
 
 void sched_destroy_vcpu(struct vcpu *v)
@@ -695,9 +716,13 @@ void sched_destroy_vcpu(struct vcpu *v)
      */
     if ( unit->vcpu_list == v )
     {
+        rcu_read_lock(&sched_res_rculock);
+
         sched_remove_unit(vcpu_scheduler(v), unit);
         sched_free_udata(vcpu_scheduler(v), unit->priv);
         sched_free_unit(unit, v);
+
+        rcu_read_unlock(&sched_res_rculock);
     }
 }
 
@@ -715,7 +740,12 @@ int sched_init_domain(struct domain *d, int poolid)
     SCHED_STAT_CRANK(dom_init);
     TRACE_1D(TRC_SCHED_DOM_ADD, d->domain_id);
 
+    rcu_read_lock(&sched_res_rculock);
+
     sdom = sched_alloc_domdata(dom_scheduler(d), d);
+
+    rcu_read_unlock(&sched_res_rculock);
+
     if ( IS_ERR(sdom) )
         return PTR_ERR(sdom);
 
@@ -733,9 +763,13 @@ void sched_destroy_domain(struct domain *d)
         SCHED_STAT_CRANK(dom_destroy);
         TRACE_1D(TRC_SCHED_DOM_REM, d->domain_id);
 
+        rcu_read_lock(&sched_res_rculock);
+
         sched_free_domdata(dom_scheduler(d), d->sched_priv);
         d->sched_priv = NULL;
 
+        rcu_read_unlock(&sched_res_rculock);
+
         cpupool_rm_domain(d);
     }
 }
@@ -770,11 +804,15 @@ void vcpu_sleep_nosync(struct vcpu *v)
 
     TRACE_2D(TRC_SCHED_SLEEP, v->domain->domain_id, v->vcpu_id);
 
+    rcu_read_lock(&sched_res_rculock);
+
     lock = unit_schedule_lock_irqsave(v->sched_unit, &flags);
 
     vcpu_sleep_nosync_locked(v);
 
     unit_schedule_unlock_irqrestore(lock, flags, v->sched_unit);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void vcpu_sleep_sync(struct vcpu *v)
@@ -795,6 +833,8 @@ void vcpu_wake(struct vcpu *v)
 
     TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);
 
+    rcu_read_lock(&sched_res_rculock);
+
     lock = unit_schedule_lock_irqsave(unit, &flags);
 
     if ( likely(vcpu_runnable(v)) )
@@ -820,6 +860,8 @@ void vcpu_wake(struct vcpu *v)
     }
 
     unit_schedule_unlock_irqrestore(lock, flags, unit);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void vcpu_unblock(struct vcpu *v)
@@ -853,6 +895,8 @@ static void sched_unit_move_locked(struct sched_unit *unit,
     unsigned int old_cpu = unit->res->master_cpu;
     struct vcpu *v;
 
+    rcu_read_lock(&sched_res_rculock);
+
     /*
      * Transfer urgency status to new CPU before switching CPUs, as
      * once the switch occurs, v->is_urgent is no longer protected by
@@ -872,6 +916,8 @@ static void sched_unit_move_locked(struct sched_unit *unit,
      * pointer can't change while the current lock is held.
      */
     sched_migrate(unit_scheduler(unit), unit, new_cpu);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 /*
@@ -1039,6 +1085,8 @@ void restore_vcpu_affinity(struct domain *d)
 
     ASSERT(system_state == SYS_STATE_resume);
 
+    rcu_read_lock(&sched_res_rculock);
+
     for_each_sched_unit ( d, unit )
     {
         spinlock_t *lock;
@@ -1095,6 +1143,8 @@ void restore_vcpu_affinity(struct domain *d)
             sched_move_irqs(unit);
     }
 
+    rcu_read_unlock(&sched_res_rculock);
+
     domain_update_node_affinity(d);
 }
 
@@ -1110,9 +1160,11 @@ int cpu_disable_scheduler(unsigned int cpu)
     cpumask_t online_affinity;
     int ret = 0;
 
+    rcu_read_lock(&sched_res_rculock);
+
     c = get_sched_res(cpu)->cpupool;
     if ( c == NULL )
-        return ret;
+        goto out;
 
     for_each_domain_in_cpupool ( d, c )
     {
@@ -1170,6 +1222,9 @@ int cpu_disable_scheduler(unsigned int cpu)
         }
     }
 
+out:
+    rcu_read_unlock(&sched_res_rculock);
+
     return ret;
 }
 
@@ -1201,7 +1256,9 @@ static int cpu_disable_scheduler_check(unsigned int cpu)
 static void sched_set_affinity(
     struct sched_unit *unit, const cpumask_t *hard, const cpumask_t *soft)
 {
+    rcu_read_lock(&sched_res_rculock);
     sched_adjust_affinity(dom_scheduler(unit->domain), unit, hard, soft);
+    rcu_read_unlock(&sched_res_rculock);
 
     if ( hard )
         cpumask_copy(unit->cpu_hard_affinity, hard);
@@ -1221,6 +1278,8 @@ static int vcpu_set_affinity(
     spinlock_t *lock;
     int ret = 0;
 
+    rcu_read_lock(&sched_res_rculock);
+
     lock = unit_schedule_lock_irq(unit);
 
     if ( v->affinity_broken )
@@ -1249,6 +1308,8 @@ static int vcpu_set_affinity(
 
     sched_unit_migrate_finish(unit);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return ret;
 }
 
@@ -1375,11 +1436,16 @@ static long do_poll(struct sched_poll *sched_poll)
 long vcpu_yield(void)
 {
     struct vcpu * v=current;
-    spinlock_t *lock = unit_schedule_lock_irq(v->sched_unit);
+    spinlock_t *lock;
+
+    rcu_read_lock(&sched_res_rculock);
 
+    lock = unit_schedule_lock_irq(v->sched_unit);
     sched_yield(vcpu_scheduler(v), v->sched_unit);
     unit_schedule_unlock_irq(lock, v->sched_unit);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     SCHED_STAT_CRANK(vcpu_yield);
 
     TRACE_2D(TRC_SCHED_YIELD, current->domain->domain_id, current->vcpu_id);
@@ -1476,6 +1542,8 @@ int vcpu_temporary_affinity(struct vcpu *v, unsigned int cpu, uint8_t reason)
     int ret = -EINVAL;
     bool migrate;
 
+    rcu_read_lock(&sched_res_rculock);
+
     lock = unit_schedule_lock_irq(unit);
 
     if ( cpu == NR_CPUS )
@@ -1515,6 +1583,8 @@ int vcpu_temporary_affinity(struct vcpu *v, unsigned int cpu, uint8_t reason)
     if ( migrate )
         sched_unit_migrate_finish(unit);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return ret;
 }
 
@@ -1726,9 +1796,13 @@ long sched_adjust(struct domain *d, struct xen_domctl_scheduler_op *op)
 
     /* NB: the pluggable scheduler code needs to take care
      * of locking by itself. */
+    rcu_read_lock(&sched_res_rculock);
+
     if ( (ret = sched_adjust_dom(dom_scheduler(d), d, op)) == 0 )
         TRACE_1D(TRC_SCHED_ADJDOM, d->domain_id);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return ret;
 }
 
@@ -1749,9 +1823,13 @@ long sched_adjust_global(struct xen_sysctl_scheduler_op *op)
     if ( pool == NULL )
         return -ESRCH;
 
+    rcu_read_lock(&sched_res_rculock);
+
     rc = ((op->sched_id == pool->sched->sched_id)
           ? sched_adjust_cpupool(pool->sched, op) : -EINVAL);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     cpupool_put(pool);
 
     return rc;
@@ -1971,7 +2049,11 @@ static void unit_context_saved(struct sched_resource *sr)
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_unit *next = vnext->sched_unit;
-    struct sched_resource *sr = get_sched_res(smp_processor_id());
+    struct sched_resource *sr;
+
+    rcu_read_lock(&sched_res_rculock);
+
+    sr = get_sched_res(smp_processor_id());
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
@@ -1998,6 +2080,8 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 
     if ( is_idle_vcpu(vprev) && vprev != vnext )
         vprev->sched_unit = sr->sched_unit_idle;
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
@@ -2021,6 +2105,8 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
             vnext->sched_unit =
                 get_sched_res(smp_processor_id())->sched_unit_idle;
 
+        rcu_read_unlock(&sched_res_rculock);
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -2034,6 +2120,8 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
 
     vcpu_periodic_timer_work(vnext);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     context_switch(vprev, vnext);
 }
 
@@ -2186,6 +2274,8 @@ static void sched_slave(void)
 
     ASSERT_NOT_IN_ATOMIC();
 
+    rcu_read_lock(&sched_res_rculock);
+
     lock = pcpu_schedule_lock_irq(cpu);
 
     now = NOW();
@@ -2209,6 +2299,8 @@ static void sched_slave(void)
     {
         pcpu_schedule_unlock_irq(lock, cpu);
 
+        rcu_read_unlock(&sched_res_rculock);
+
         /* Check for failed forced context switch. */
         if ( do_softirq )
             raise_softirq(SCHEDULE_SOFTIRQ);
@@ -2241,13 +2333,16 @@ static void schedule(void)
     struct sched_resource *sr;
     spinlock_t           *lock;
     int cpu = smp_processor_id();
-    unsigned int          gran = get_sched_res(cpu)->granularity;
+    unsigned int          gran;
 
     ASSERT_NOT_IN_ATOMIC();
 
     SCHED_STAT_CRANK(sched_run);
 
+    rcu_read_lock(&sched_res_rculock);
+
     sr = get_sched_res(cpu);
+    gran = sr->granularity;
 
     lock = pcpu_schedule_lock_irq(cpu);
 
@@ -2259,6 +2354,8 @@ static void schedule(void)
          */
         pcpu_schedule_unlock_irq(lock, cpu);
 
+        rcu_read_unlock(&sched_res_rculock);
+
         raise_softirq(SCHEDULE_SOFTIRQ);
         return sched_slave();
     }
@@ -2370,14 +2467,27 @@ static int cpu_schedule_up(unsigned int cpu)
     return 0;
 }
 
+static void sched_res_free(struct rcu_head *head)
+{
+    struct sched_resource *sr = container_of(head, struct sched_resource, rcu);
+
+    xfree(sr);
+}
+
 static void cpu_schedule_down(unsigned int cpu)
 {
-    struct sched_resource *sr = get_sched_res(cpu);
+    struct sched_resource *sr;
+
+    rcu_read_lock(&sched_res_rculock);
+
+    sr = get_sched_res(cpu);
 
     kill_timer(&sr->s_timer);
 
     set_sched_res(cpu, NULL);
-    xfree(sr);
+    call_rcu(&sr->rcu, sched_res_free);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void sched_rm_cpu(unsigned int cpu)
@@ -2397,6 +2507,8 @@ static int cpu_schedule_callback(
     unsigned int cpu = (unsigned long)hcpu;
     int rc = 0;
 
+    rcu_read_lock(&sched_res_rculock);
+
     /*
      * From the scheduler perspective, bringing up a pCPU requires
      * allocating and initializing the per-pCPU scheduler specific data,
@@ -2443,6 +2555,8 @@ static int cpu_schedule_callback(
         break;
     }
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return !rc ? NOTIFY_DONE : notifier_from_errno(rc);
 }
 
@@ -2532,8 +2646,13 @@ void __init scheduler_init(void)
     idle_domain->max_vcpus = nr_cpu_ids;
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
+
+    rcu_read_lock(&sched_res_rculock);
+
     get_sched_res(0)->curr = idle_vcpu[0]->sched_unit;
     get_sched_res(0)->sched_unit_idle = idle_vcpu[0]->sched_unit;
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 /*
@@ -2546,9 +2665,14 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
     struct vcpu *idle;
     void *ppriv, *vpriv;
     struct scheduler *new_ops = c->sched;
-    struct sched_resource *sr = get_sched_res(cpu);
+    struct sched_resource *sr;
     spinlock_t *old_lock, *new_lock;
     unsigned long flags;
+    int ret = 0;
+
+    rcu_read_lock(&sched_res_rculock);
+
+    sr = get_sched_res(cpu);
 
     ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
     ASSERT(!cpumask_test_cpu(cpu, c->cpu_valid));
@@ -2568,13 +2692,18 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
     idle = idle_vcpu[cpu];
     ppriv = sched_alloc_pdata(new_ops, cpu);
     if ( IS_ERR(ppriv) )
-        return PTR_ERR(ppriv);
+    {
+        ret = PTR_ERR(ppriv);
+        goto out;
+    }
+
     vpriv = sched_alloc_udata(new_ops, idle->sched_unit,
                               idle->domain->sched_priv);
     if ( vpriv == NULL )
     {
         sched_free_pdata(new_ops, ppriv, cpu);
-        return -ENOMEM;
+        ret = -ENOMEM;
+        goto out;
     }
 
     /*
@@ -2613,7 +2742,10 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
     /* The  cpu is added to a pool, trigger it to go pick up some work */
     cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
 
-    return 0;
+out:
+    rcu_read_unlock(&sched_res_rculock);
+
+    return ret;
 }
 
 /*
@@ -2626,11 +2758,16 @@ int schedule_cpu_rm(unsigned int cpu)
 {
     struct vcpu *idle;
     void *ppriv_old, *vpriv_old;
-    struct sched_resource *sr = get_sched_res(cpu);
-    struct scheduler *old_ops = sr->scheduler;
+    struct sched_resource *sr;
+    struct scheduler *old_ops;
     spinlock_t *old_lock;
     unsigned long flags;
 
+    rcu_read_lock(&sched_res_rculock);
+
+    sr = get_sched_res(cpu);
+    old_ops = sr->scheduler;
+
     ASSERT(sr->cpupool != NULL);
     ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
     ASSERT(!cpumask_test_cpu(cpu, sr->cpupool->cpu_valid));
@@ -2663,6 +2800,8 @@ int schedule_cpu_rm(unsigned int cpu)
     sr->granularity = 1;
     sr->cpupool = NULL;
 
+    rcu_read_unlock(&sched_res_rculock);
+
     return 0;
 }
 
@@ -2711,6 +2850,8 @@ void schedule_dump(struct cpupool *c)
 
     /* Locking, if necessary, must be handled withing each scheduler */
 
+    rcu_read_lock(&sched_res_rculock);
+
     if ( c != NULL )
     {
         sched = c->sched;
@@ -2730,6 +2871,8 @@ void schedule_dump(struct cpupool *c)
         for_each_cpu (i, cpus)
             sched_dump_cpu_state(sched, i);
     }
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void sched_tick_suspend(void)
@@ -2737,10 +2880,14 @@ void sched_tick_suspend(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
+    rcu_read_lock(&sched_res_rculock);
+
     sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_suspend(sched, cpu);
     rcu_idle_enter(cpu);
     rcu_idle_timer_start();
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void sched_tick_resume(void)
@@ -2748,10 +2895,14 @@ void sched_tick_resume(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
+    rcu_read_lock(&sched_res_rculock);
+
     rcu_idle_timer_stop();
     rcu_idle_exit(cpu);
     sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_resume(sched, cpu);
+
+    rcu_read_unlock(&sched_res_rculock);
 }
 
 void wait(void)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index f8f0f484cb..3988985ee6 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -10,6 +10,7 @@
 
 #include <xen/percpu.h>
 #include <xen/err.h>
+#include <xen/rcupdate.h>
 
 /* A global pointer to the initial cpupool (POOL0). */
 extern struct cpupool *cpupool0;
@@ -57,18 +58,20 @@ struct sched_resource {
     unsigned int        master_cpu;
     unsigned int        granularity;
     const cpumask_t    *cpus;           /* cpus covered by this struct     */
+    struct rcu_head     rcu;
 };
 
 DECLARE_PER_CPU(struct sched_resource *, sched_res);
+extern rcu_read_lock_t sched_res_rculock;
 
 static inline struct sched_resource *get_sched_res(unsigned int cpu)
 {
-    return per_cpu(sched_res, cpu);
+    return rcu_dereference(per_cpu(sched_res, cpu));
 }
 
 static inline void set_sched_res(unsigned int cpu, struct sched_resource *res)
 {
-    per_cpu(sched_res, cpu) = res;
+    rcu_assign_pointer(per_cpu(sched_res, cpu), res);
 }
 
 static inline struct sched_unit *curr_on_cpu(unsigned int cpu)
-- 
2.16.4



* [Xen-devel] [PATCH v5 15/19] xen/sched: support multiple cpus per scheduling resource
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (13 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 14/19] xen/sched: protect scheduling resource via rcu Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 16/19] xen/sched: support differing granularity in schedule_cpu_[add/rm]() Juergen Gross
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

Prepare for supporting multiple cpus per scheduling resource by
allocating the cpumask per resource dynamically.

Modify sched_res_mask to have only one bit set per scheduling resource.
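
With the mask now dynamic, allocation and teardown have to pair up; a
minimal sketch of the lifecycle (condensed from the cpu_schedule_up()
and sched_res_free() hunks below):

    if ( !zalloc_cpumask_var(&sr->cpus) )
    {
        xfree(sr);
        return -ENOMEM;
    }
    cpumask_copy(sr->cpus, cpumask_of(cpu));

    /* ... and on teardown, in the RCU callback: */
    free_cpumask_var(sr->cpus);
    xfree(sr);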

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch (carved out from other patch)
V4:
- use cpumask_t for sched_res_mask (Jan Beulich)
- clear cpu in sched_res_mask when taking cpu away (Jan Beulich)
---
 xen/common/cpupool.c       |  4 ++--
 xen/common/schedule.c      | 15 +++++++++++++--
 xen/include/xen/sched-if.h |  4 ++--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 7228ca84b4..13dffaadcf 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -283,7 +283,7 @@ static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
         cpupool_cpu_moving = NULL;
     }
     cpumask_set_cpu(cpu, c->cpu_valid);
-    cpumask_and(c->res_valid, c->cpu_valid, sched_res_mask);
+    cpumask_and(c->res_valid, c->cpu_valid, &sched_res_mask);
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain_in_cpupool(d, c)
@@ -376,7 +376,7 @@ static int cpupool_unassign_cpu_start(struct cpupool *c, unsigned int cpu)
     atomic_inc(&c->refcnt);
     cpupool_cpu_moving = c;
     cpumask_clear_cpu(cpu, c->cpu_valid);
-    cpumask_and(c->res_valid, c->cpu_valid, sched_res_mask);
+    cpumask_and(c->res_valid, c->cpu_valid, &sched_res_mask);
 
 out:
     spin_unlock(&cpupool_lock);
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 1f23bf0e83..efe077b01f 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -63,7 +63,7 @@ integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Number of vcpus per struct sched_unit. */
 bool __read_mostly sched_disable_smt_switching;
-const cpumask_t *sched_res_mask = &cpumask_all;
+cpumask_t sched_res_mask;
 
 /* Common lock for free cpus. */
 static DEFINE_SPINLOCK(sched_free_cpu_lock);
@@ -2426,8 +2426,14 @@ static int cpu_schedule_up(unsigned int cpu)
     sr = xzalloc(struct sched_resource);
     if ( sr == NULL )
         return -ENOMEM;
+    if ( !zalloc_cpumask_var(&sr->cpus) )
+    {
+        xfree(sr);
+        return -ENOMEM;
+    }
+
     sr->master_cpu = cpu;
-    sr->cpus = cpumask_of(cpu);
+    cpumask_copy(sr->cpus, cpumask_of(cpu));
     set_sched_res(cpu, sr);
 
     sr->scheduler = &sched_idle_ops;
@@ -2439,6 +2445,8 @@ static int cpu_schedule_up(unsigned int cpu)
     /* We start with cpu granularity. */
     sr->granularity = 1;
 
+    cpumask_set_cpu(cpu, &sched_res_mask);
+
     /* Boot CPU is dealt with later in scheduler_init(). */
     if ( cpu == 0 )
         return 0;
@@ -2471,6 +2479,7 @@ static void sched_res_free(struct rcu_head *head)
 {
     struct sched_resource *sr = container_of(head, struct sched_resource, rcu);
 
+    free_cpumask_var(sr->cpus);
     xfree(sr);
 }
 
@@ -2484,7 +2493,9 @@ static void cpu_schedule_down(unsigned int cpu)
 
     kill_timer(&sr->s_timer);
 
+    cpumask_clear_cpu(cpu, &sched_res_mask);
     set_sched_res(cpu, NULL);
+
     call_rcu(&sr->rcu, sched_res_free);
 
     rcu_read_unlock(&sched_res_rculock);
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 3988985ee6..780735dda3 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -24,7 +24,7 @@ extern cpumask_t cpupool_free_cpus;
 extern int sched_ratelimit_us;
 
 /* Scheduling resource mask. */
-extern const cpumask_t *sched_res_mask;
+extern cpumask_t sched_res_mask;
 
 /* Number of vcpus per struct sched_unit. */
 enum sched_gran {
@@ -57,7 +57,7 @@ struct sched_resource {
     /* Cpu with lowest id in scheduling resource. */
     unsigned int        master_cpu;
     unsigned int        granularity;
-    const cpumask_t    *cpus;           /* cpus covered by this struct     */
+    cpumask_var_t       cpus;           /* cpus covered by this struct     */
     struct rcu_head     rcu;
 };
 
-- 
2.16.4



* [Xen-devel] [PATCH v5 16/19] xen/sched: support differing granularity in schedule_cpu_[add/rm]()
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (14 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 15/19] xen/sched: support multiple cpus per scheduling resource Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 17/19] xen/sched: support core scheduling for moving cpus to/from cpupools Juergen Gross
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, George Dunlap, Dario Faggioli

With core scheduling active schedule_cpu_[add/rm]() has to cope with
differing scheduling granularities: a cpu not in any cpupool is
subject to granularity 1 (cpu scheduling), while a cpu in a cpupool
might belong to a scheduling resource spanning more than one cpu.

Handle that by keeping arrays of old/new pdata and vdata and looping
over those where appropriate.

Additionally, the scheduling resource(s) must be either merged (when
a cpu joins a resource of larger granularity) or split (when a cpu
leaves one).
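
To illustrate the merge direction, here is a small standalone model
(plain C, deliberately not Xen code: the bitmask and "struct res" are
simplified stand-ins for cpumask_t and struct sched_resource, and a
direct free() stands in for the call_rcu() deferral used below):

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_CPUS 4

    struct res {
        unsigned int  master_cpu;
        unsigned long cpus;              /* bitmask of covered cpus */
    };

    static struct res *per_cpu_res[NR_CPUS];

    /* Let all cpus in mask share the resource of the master cpu. */
    static void merge(unsigned int master, unsigned long mask)
    {
        struct res *sr = per_cpu_res[master];
        unsigned int cpu;

        for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        {
            if ( !(mask & (1UL << cpu)) || cpu == master )
                continue;
            free(per_cpu_res[cpu]);      /* real code defers via RCU */
            per_cpu_res[cpu] = sr;
            sr->cpus |= 1UL << cpu;
        }
    }

    int main(void)
    {
        unsigned int cpu;

        for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        {
            per_cpu_res[cpu] = calloc(1, sizeof(struct res));
            if ( !per_cpu_res[cpu] )
                return 1;
            per_cpu_res[cpu]->master_cpu = cpu;
            per_cpu_res[cpu]->cpus = 1UL << cpu;
        }

        merge(0, 0x3);                   /* cpus 0 and 1 form a core */
        printf("cpu1: master=%u cpus=%#lx\n",
               per_cpu_res[1]->master_cpu, per_cpu_res[1]->cpus);
        return 0;
    }

The split direction works the other way round: allocate the needed
fresh resources up front and hand each departing cpu its own, as
schedule_cpu_rm() does below.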

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
 xen/common/cpupool.c  |  18 ++--
 xen/common/schedule.c | 226 +++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 204 insertions(+), 40 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 13dffaadcf..04c3b3c04b 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -536,6 +536,7 @@ static void cpupool_cpu_remove(unsigned int cpu)
         ret = cpupool_unassign_cpu_finish(cpupool0);
         BUG_ON(ret);
     }
+    cpumask_clear_cpu(cpu, &cpupool_free_cpus);
 }
 
 /*
@@ -585,20 +586,19 @@ static void cpupool_cpu_remove_forced(unsigned int cpu)
     struct cpupool **c;
     int ret;
 
-    if ( cpumask_test_cpu(cpu, &cpupool_free_cpus) )
-        cpumask_clear_cpu(cpu, &cpupool_free_cpus);
-    else
+    for_each_cpupool ( c )
     {
-        for_each_cpupool(c)
+        if ( cpumask_test_cpu(cpu, (*c)->cpu_valid) )
         {
-            if ( cpumask_test_cpu(cpu, (*c)->cpu_valid) )
-            {
-                ret = cpupool_unassign_cpu(*c, cpu);
-                BUG_ON(ret);
-            }
+            ret = cpupool_unassign_cpu_start(*c, cpu);
+            BUG_ON(ret);
+            ret = cpupool_unassign_cpu_finish(*c);
+            BUG_ON(ret);
         }
     }
 
+    cpumask_clear_cpu(cpu, &cpupool_free_cpus);
+
     rcu_read_lock(&sched_res_rculock);
     sched_rm_cpu(cpu);
     rcu_read_unlock(&sched_res_rculock);
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index efe077b01f..e411b6d03e 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -425,27 +425,30 @@ static void sched_unit_add_vcpu(struct sched_unit *unit, struct vcpu *v)
     unit->runstate_cnt[v->runstate.state]++;
 }
 
-static struct sched_unit *sched_alloc_unit(struct vcpu *v)
+static struct sched_unit *sched_alloc_unit_mem(void)
 {
-    struct sched_unit *unit, **prev_unit;
-    struct domain *d = v->domain;
-    unsigned int gran = cpupool_get_granularity(d->cpupool);
+    struct sched_unit *unit;
 
-    for_each_sched_unit ( d, unit )
-        if ( unit->unit_id / gran == v->vcpu_id / gran )
-            break;
+    unit = xzalloc(struct sched_unit);
+    if ( !unit )
+        return NULL;
 
-    if ( unit )
+    if ( !zalloc_cpumask_var(&unit->cpu_hard_affinity) ||
+         !zalloc_cpumask_var(&unit->cpu_hard_affinity_saved) ||
+         !zalloc_cpumask_var(&unit->cpu_soft_affinity) )
     {
-        sched_unit_add_vcpu(unit, v);
-        return unit;
+        sched_free_unit_mem(unit);
+        unit = NULL;
     }
 
-    if ( (unit = xzalloc(struct sched_unit)) == NULL )
-        return NULL;
+    return unit;
+}
+
+static void sched_domain_insert_unit(struct sched_unit *unit, struct domain *d)
+{
+    struct sched_unit **prev_unit;
 
     unit->domain = d;
-    sched_unit_add_vcpu(unit, v);
 
     for ( prev_unit = &d->sched_unit_list; *prev_unit;
           prev_unit = &(*prev_unit)->next_in_list )
@@ -455,17 +458,31 @@ static struct sched_unit *sched_alloc_unit(struct vcpu *v)
 
     unit->next_in_list = *prev_unit;
     *prev_unit = unit;
+}
 
-    if ( !zalloc_cpumask_var(&unit->cpu_hard_affinity) ||
-         !zalloc_cpumask_var(&unit->cpu_hard_affinity_saved) ||
-         !zalloc_cpumask_var(&unit->cpu_soft_affinity) )
-        goto fail;
+static struct sched_unit *sched_alloc_unit(struct vcpu *v)
+{
+    struct sched_unit *unit;
+    struct domain *d = v->domain;
+    unsigned int gran = cpupool_get_granularity(d->cpupool);
 
-    return unit;
+    for_each_sched_unit ( d, unit )
+        if ( unit->unit_id / gran == v->vcpu_id / gran )
+            break;
 
- fail:
-    sched_free_unit(unit, v);
-    return NULL;
+    if ( unit )
+    {
+        sched_unit_add_vcpu(unit, v);
+        return unit;
+    }
+
+    if ( (unit = sched_alloc_unit_mem()) == NULL )
+        return NULL;
+
+    sched_unit_add_vcpu(unit, v);
+    sched_domain_insert_unit(unit, d);
+
+    return unit;
 }
 
 static unsigned int sched_select_initial_cpu(const struct vcpu *v)
@@ -2419,18 +2436,28 @@ static void poll_timer_fn(void *data)
         vcpu_unblock(v);
 }
 
-static int cpu_schedule_up(unsigned int cpu)
+static struct sched_resource *sched_alloc_res(void)
 {
     struct sched_resource *sr;
 
     sr = xzalloc(struct sched_resource);
     if ( sr == NULL )
-        return -ENOMEM;
+        return NULL;
     if ( !zalloc_cpumask_var(&sr->cpus) )
     {
         xfree(sr);
-        return -ENOMEM;
+        return NULL;
     }
+    return sr;
+}
+
+static int cpu_schedule_up(unsigned int cpu)
+{
+    struct sched_resource *sr;
+
+    sr = sched_alloc_res();
+    if ( sr == NULL )
+        return -ENOMEM;
 
     sr->master_cpu = cpu;
     cpumask_copy(sr->cpus, cpumask_of(cpu));
@@ -2480,6 +2507,8 @@ static void sched_res_free(struct rcu_head *head)
     struct sched_resource *sr = container_of(head, struct sched_resource, rcu);
 
     free_cpumask_var(sr->cpus);
+    if ( sr->sched_unit_idle )
+        sched_free_unit_mem(sr->sched_unit_idle);
     xfree(sr);
 }
 
@@ -2496,6 +2525,8 @@ static void cpu_schedule_down(unsigned int cpu)
     cpumask_clear_cpu(cpu, &sched_res_mask);
     set_sched_res(cpu, NULL);
 
+    /* Keep idle unit. */
+    sr->sched_unit_idle = NULL;
     call_rcu(&sr->rcu, sched_res_free);
 
     rcu_read_unlock(&sched_res_rculock);
@@ -2575,6 +2606,30 @@ static struct notifier_block cpu_schedule_nfb = {
     .notifier_call = cpu_schedule_callback
 };
 
+static const cpumask_t *sched_get_opt_cpumask(enum sched_gran opt,
+                                              unsigned int cpu)
+{
+    const cpumask_t *mask;
+
+    switch ( opt )
+    {
+    case SCHED_GRAN_cpu:
+        mask = cpumask_of(cpu);
+        break;
+    case SCHED_GRAN_core:
+        mask = per_cpu(cpu_sibling_mask, cpu);
+        break;
+    case SCHED_GRAN_socket:
+        mask = per_cpu(cpu_core_mask, cpu);
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        return NULL;
+    }
+
+    return mask;
+}
+
 /* Initialise the data structures. */
 void __init scheduler_init(void)
 {
@@ -2730,6 +2785,46 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
      */
     old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
 
+    if ( cpupool_get_granularity(c) > 1 )
+    {
+        const cpumask_t *mask;
+        unsigned int cpu_iter, idx = 0;
+        struct sched_unit *old_unit, *master_unit;
+        struct sched_resource *sr_old;
+
+        /*
+         * We need to merge multiple idle_vcpu units and sched_resource structs
+         * into one. As the free cpus all share the same lock we are fine doing
+         * that now. The worst which could happen would be someone waiting for
+         * the lock, thus dereferencing sched_res->schedule_lock. This is the
+         * reason we are freeing struct sched_res via call_rcu() to avoid the
+         * lock pointer suddenly disappearing.
+         */
+        mask = sched_get_opt_cpumask(c->gran, cpu);
+        master_unit = idle_vcpu[cpu]->sched_unit;
+
+        for_each_cpu ( cpu_iter, mask )
+        {
+            if ( idx )
+                cpumask_clear_cpu(cpu_iter, &sched_res_mask);
+
+            per_cpu(sched_res_idx, cpu_iter) = idx++;
+
+            if ( cpu == cpu_iter )
+                continue;
+
+            old_unit = idle_vcpu[cpu_iter]->sched_unit;
+            sr_old = get_sched_res(cpu_iter);
+            kill_timer(&sr_old->s_timer);
+            idle_vcpu[cpu_iter]->sched_unit = master_unit;
+            master_unit->runstate_cnt[RUNSTATE_running]++;
+            set_sched_res(cpu_iter, sr);
+            cpumask_set_cpu(cpu_iter, sr->cpus);
+
+            call_rcu(&sr_old->rcu, sched_res_free);
+        }
+    }
+
     new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);
 
     sr->scheduler = new_ops;
@@ -2767,33 +2862,100 @@ out:
  */
 int schedule_cpu_rm(unsigned int cpu)
 {
-    struct vcpu *idle;
     void *ppriv_old, *vpriv_old;
-    struct sched_resource *sr;
+    struct sched_resource *sr, **sr_new = NULL;
+    struct sched_unit *unit;
     struct scheduler *old_ops;
     spinlock_t *old_lock;
     unsigned long flags;
+    int idx, ret = -ENOMEM;
+    unsigned int cpu_iter;
 
     rcu_read_lock(&sched_res_rculock);
 
     sr = get_sched_res(cpu);
     old_ops = sr->scheduler;
 
+    if ( sr->granularity > 1 )
+    {
+        sr_new = xmalloc_array(struct sched_resource *, sr->granularity - 1);
+        if ( !sr_new )
+            goto out;
+        for ( idx = 0; idx < sr->granularity - 1; idx++ )
+        {
+            sr_new[idx] = sched_alloc_res();
+            if ( sr_new[idx] )
+            {
+                sr_new[idx]->sched_unit_idle = sched_alloc_unit_mem();
+                if ( !sr_new[idx]->sched_unit_idle )
+                {
+                    sched_res_free(&sr_new[idx]->rcu);
+                    sr_new[idx] = NULL;
+                }
+            }
+            if ( !sr_new[idx] )
+            {
+                for ( idx--; idx >= 0; idx-- )
+                    sched_res_free(&sr_new[idx]->rcu);
+                goto out;
+            }
+            sr_new[idx]->curr = sr_new[idx]->sched_unit_idle;
+            sr_new[idx]->scheduler = &sched_idle_ops;
+            sr_new[idx]->granularity = 1;
+
+            /* We want the lock not to change when replacing the resource. */
+            sr_new[idx]->schedule_lock = sr->schedule_lock;
+        }
+    }
+
+    ret = 0;
     ASSERT(sr->cpupool != NULL);
     ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
     ASSERT(!cpumask_test_cpu(cpu, sr->cpupool->cpu_valid));
 
-    idle = idle_vcpu[cpu];
-
     sched_do_tick_suspend(old_ops, cpu);
 
     /* See comment in schedule_cpu_add() regarding lock switching. */
     old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
 
-    vpriv_old = idle->sched_unit->priv;
+    vpriv_old = idle_vcpu[cpu]->sched_unit->priv;
     ppriv_old = sr->sched_priv;
 
-    idle->sched_unit->priv = NULL;
+    idx = 0;
+    for_each_cpu ( cpu_iter, sr->cpus )
+    {
+        per_cpu(sched_res_idx, cpu_iter) = 0;
+        if ( cpu_iter == cpu )
+        {
+            idle_vcpu[cpu_iter]->sched_unit->priv = NULL;
+        }
+        else
+        {
+            /* Initialize unit. */
+            unit = sr_new[idx]->sched_unit_idle;
+            unit->res = sr_new[idx];
+            unit->is_running = true;
+            sched_unit_add_vcpu(unit, idle_vcpu[cpu_iter]);
+            sched_domain_insert_unit(unit, idle_vcpu[cpu_iter]->domain);
+
+            /* Adjust cpu masks of resources (old and new). */
+            cpumask_clear_cpu(cpu_iter, sr->cpus);
+            cpumask_set_cpu(cpu_iter, sr_new[idx]->cpus);
+
+            /* Init timer. */
+            init_timer(&sr_new[idx]->s_timer, s_timer_fn, NULL, cpu_iter);
+
+            /* Last resource initializations and insert resource pointer. */
+            sr_new[idx]->master_cpu = cpu_iter;
+            set_sched_res(cpu_iter, sr_new[idx]);
+
+            /* Last action: set the new lock pointer. */
+            smp_mb();
+            sr_new[idx]->schedule_lock = &sched_free_cpu_lock;
+
+            idx++;
+        }
+    }
     sr->scheduler = &sched_idle_ops;
     sr->sched_priv = NULL;
 
@@ -2811,9 +2973,11 @@ int schedule_cpu_rm(unsigned int cpu)
     sr->granularity = 1;
     sr->cpupool = NULL;
 
+out:
     rcu_read_unlock(&sched_res_rculock);
+    xfree(sr_new);
 
-    return 0;
+    return ret;
 }
 
 struct scheduler *scheduler_get_default(void)
-- 
2.16.4



* [Xen-devel] [PATCH v5 17/19] xen/sched: support core scheduling for moving cpus to/from cpupools
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (15 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 16/19] xen/sched: support differing granularity in schedule_cpu_[add/rm]() Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 18/19] xen/sched: disable scheduling when entering ACPI deep sleep states Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum Juergen Gross
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Tim Deegan, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

With core scheduling active it is necessary to move multiple cpus to
or from a cpupool at the same time, as otherwise a scheduling resource
could end up being split across different cpupools.
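
A minimal standalone sketch of the resulting all-or-nothing rule
(plain C, not Xen code; the fixed two-threads-per-core sibling mask
and the plain bitmasks are stand-ins for sched_get_opt_cpumask() and
the cpumask operations used below):

    #include <stdio.h>

    /* Assume 2 threads per core: siblings of cpu n are n & ~1, n | 1. */
    static unsigned long sibling_mask(unsigned int cpu)
    {
        return 3UL << (cpu & ~1U);
    }

    int main(void)
    {
        unsigned long free_cpus = 0xfUL;   /* cpus 0-3 unassigned */
        unsigned long pool_valid = 0;      /* cpus in the pool */
        unsigned int cpu = 2;              /* request: add cpu 2 */
        unsigned long cpus = sibling_mask(cpu);

        if ( (cpus & free_cpus) == cpus )  /* whole core still free? */
        {
            free_cpus &= ~cpus;            /* move both threads */
            pool_valid |= cpus;
        }
        printf("pool=%#lx free=%#lx\n", pool_valid, free_cpus);
        return 0;
    }

Removing a cpu from a pool clears the whole sibling mask again, so a
scheduling resource is never only partially inside a cpupool.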

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V1: new patch
---
 xen/common/cpupool.c       | 100 +++++++++++++++++++++++++++++++++------------
 xen/common/schedule.c      |   3 +-
 xen/include/xen/sched-if.h |   1 +
 3 files changed, 76 insertions(+), 28 deletions(-)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 04c3b3c04b..f7a13c7a4c 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -268,23 +268,30 @@ static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
 {
     int ret;
     struct domain *d;
+    const cpumask_t *cpus;
+
+    cpus = sched_get_opt_cpumask(c->gran, cpu);
 
     if ( (cpupool_moving_cpu == cpu) && (c != cpupool_cpu_moving) )
         return -EADDRNOTAVAIL;
-    ret = schedule_cpu_add(cpu, c);
+    ret = schedule_cpu_add(cpumask_first(cpus), c);
     if ( ret )
         return ret;
 
-    cpumask_clear_cpu(cpu, &cpupool_free_cpus);
+    rcu_read_lock(&sched_res_rculock);
+
+    cpumask_andnot(&cpupool_free_cpus, &cpupool_free_cpus, cpus);
     if (cpupool_moving_cpu == cpu)
     {
         cpupool_moving_cpu = -1;
         cpupool_put(cpupool_cpu_moving);
         cpupool_cpu_moving = NULL;
     }
-    cpumask_set_cpu(cpu, c->cpu_valid);
+    cpumask_or(c->cpu_valid, c->cpu_valid, cpus);
     cpumask_and(c->res_valid, c->cpu_valid, &sched_res_mask);
 
+    rcu_read_unlock(&sched_res_rculock);
+
     rcu_read_lock(&domlist_read_lock);
     for_each_domain_in_cpupool(d, c)
     {
@@ -298,6 +305,7 @@ static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
 static int cpupool_unassign_cpu_finish(struct cpupool *c)
 {
     int cpu = cpupool_moving_cpu;
+    const cpumask_t *cpus;
     struct domain *d;
     int ret;
 
@@ -310,7 +318,10 @@ static int cpupool_unassign_cpu_finish(struct cpupool *c)
      */
     rcu_read_lock(&domlist_read_lock);
     ret = cpu_disable_scheduler(cpu);
-    cpumask_set_cpu(cpu, &cpupool_free_cpus);
+
+    rcu_read_lock(&sched_res_rculock);
+    cpus = get_sched_res(cpu)->cpus;
+    cpumask_or(&cpupool_free_cpus, &cpupool_free_cpus, cpus);
 
     /*
      * cpu_disable_scheduler() returning an error doesn't require resetting
@@ -323,7 +334,7 @@ static int cpupool_unassign_cpu_finish(struct cpupool *c)
     {
         ret = schedule_cpu_rm(cpu);
         if ( ret )
-            cpumask_clear_cpu(cpu, &cpupool_free_cpus);
+            cpumask_andnot(&cpupool_free_cpus, &cpupool_free_cpus, cpus);
         else
         {
             cpupool_moving_cpu = -1;
@@ -331,6 +342,7 @@ static int cpupool_unassign_cpu_finish(struct cpupool *c)
             cpupool_cpu_moving = NULL;
         }
     }
+    rcu_read_unlock(&sched_res_rculock);
 
     for_each_domain_in_cpupool(d, c)
     {
@@ -345,6 +357,7 @@ static int cpupool_unassign_cpu_start(struct cpupool *c, unsigned int cpu)
 {
     int ret;
     struct domain *d;
+    const cpumask_t *cpus;
 
     spin_lock(&cpupool_lock);
     ret = -EADDRNOTAVAIL;
@@ -353,7 +366,11 @@ static int cpupool_unassign_cpu_start(struct cpupool *c, unsigned int cpu)
         goto out;
 
     ret = 0;
-    if ( (c->n_dom > 0) && (cpumask_weight(c->cpu_valid) == 1) &&
+    rcu_read_lock(&sched_res_rculock);
+    cpus = get_sched_res(cpu)->cpus;
+
+    if ( (c->n_dom > 0) &&
+         (cpumask_weight(c->cpu_valid) == cpumask_weight(cpus)) &&
          (cpu != cpupool_moving_cpu) )
     {
         rcu_read_lock(&domlist_read_lock);
@@ -375,9 +392,10 @@ static int cpupool_unassign_cpu_start(struct cpupool *c, unsigned int cpu)
     cpupool_moving_cpu = cpu;
     atomic_inc(&c->refcnt);
     cpupool_cpu_moving = c;
-    cpumask_clear_cpu(cpu, c->cpu_valid);
+    cpumask_andnot(c->cpu_valid, c->cpu_valid, cpus);
     cpumask_and(c->res_valid, c->cpu_valid, &sched_res_mask);
 
+    rcu_read_unlock(&sched_res_rculock);
 out:
     spin_unlock(&cpupool_lock);
 
@@ -417,11 +435,13 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
 {
     int work_cpu;
     int ret;
+    unsigned int master_cpu;
 
     debugtrace_printk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
                       c->cpupool_id, cpu);
 
-    ret = cpupool_unassign_cpu_start(c, cpu);
+    master_cpu = sched_get_resource_cpu(cpu);
+    ret = cpupool_unassign_cpu_start(c, master_cpu);
     if ( ret )
     {
         debugtrace_printk("cpupool_unassign_cpu(pool=%d,cpu=%d) ret %d\n",
@@ -429,12 +449,12 @@ static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
         return ret;
     }
 
-    work_cpu = smp_processor_id();
-    if ( work_cpu == cpu )
+    work_cpu = sched_get_resource_cpu(smp_processor_id());
+    if ( work_cpu == master_cpu )
     {
         work_cpu = cpumask_first(cpupool0->cpu_valid);
-        if ( work_cpu == cpu )
-            work_cpu = cpumask_next(cpu, cpupool0->cpu_valid);
+        if ( work_cpu == master_cpu )
+            work_cpu = cpumask_last(cpupool0->cpu_valid);
     }
     return continue_hypercall_on_cpu(work_cpu, cpupool_unassign_cpu_helper, c);
 }
@@ -500,6 +520,7 @@ void cpupool_rm_domain(struct domain *d)
 static int cpupool_cpu_add(unsigned int cpu)
 {
     int ret = 0;
+    const cpumask_t *cpus;
 
     spin_lock(&cpupool_lock);
     cpumask_clear_cpu(cpu, &cpupool_locked_cpus);
@@ -513,7 +534,11 @@ static int cpupool_cpu_add(unsigned int cpu)
      */
     rcu_read_lock(&sched_res_rculock);
     get_sched_res(cpu)->cpupool = NULL;
-    ret = cpupool_assign_cpu_locked(cpupool0, cpu);
+
+    cpus = sched_get_opt_cpumask(cpupool0->gran, cpu);
+    if ( cpumask_subset(cpus, &cpupool_free_cpus) )
+        ret = cpupool_assign_cpu_locked(cpupool0, cpu);
+
     rcu_read_unlock(&sched_res_rculock);
 
     spin_unlock(&cpupool_lock);
@@ -548,27 +573,33 @@ static void cpupool_cpu_remove(unsigned int cpu)
 static int cpupool_cpu_remove_prologue(unsigned int cpu)
 {
     int ret = 0;
+    cpumask_t *cpus;
+    unsigned int master_cpu;
 
     spin_lock(&cpupool_lock);
 
-    if ( cpumask_test_cpu(cpu, &cpupool_locked_cpus) )
+    rcu_read_lock(&sched_res_rculock);
+    cpus = get_sched_res(cpu)->cpus;
+    master_cpu = sched_get_resource_cpu(cpu);
+    if ( cpumask_intersects(cpus, &cpupool_locked_cpus) )
         ret = -EBUSY;
     else
         cpumask_set_cpu(cpu, &cpupool_locked_cpus);
+    rcu_read_unlock(&sched_res_rculock);
 
     spin_unlock(&cpupool_lock);
 
     if ( ret )
         return  ret;
 
-    if ( cpumask_test_cpu(cpu, cpupool0->cpu_valid) )
+    if ( cpumask_test_cpu(master_cpu, cpupool0->cpu_valid) )
     {
         /* Cpupool0 is populated only after all cpus are up. */
         ASSERT(system_state == SYS_STATE_active);
 
-        ret = cpupool_unassign_cpu_start(cpupool0, cpu);
+        ret = cpupool_unassign_cpu_start(cpupool0, master_cpu);
     }
-    else if ( !cpumask_test_cpu(cpu, &cpupool_free_cpus) )
+    else if ( !cpumask_test_cpu(master_cpu, &cpupool_free_cpus) )
         ret = -ENODEV;
 
     return ret;
@@ -585,12 +616,13 @@ static void cpupool_cpu_remove_forced(unsigned int cpu)
 {
     struct cpupool **c;
     int ret;
+    unsigned int master_cpu = sched_get_resource_cpu(cpu);
 
     for_each_cpupool ( c )
     {
-        if ( cpumask_test_cpu(cpu, (*c)->cpu_valid) )
+        if ( cpumask_test_cpu(master_cpu, (*c)->cpu_valid) )
         {
-            ret = cpupool_unassign_cpu_start(*c, cpu);
+            ret = cpupool_unassign_cpu_start(*c, master_cpu);
             BUG_ON(ret);
             ret = cpupool_unassign_cpu_finish(*c);
             BUG_ON(ret);
@@ -658,29 +690,45 @@ int cpupool_do_sysctl(struct xen_sysctl_cpupool_op *op)
     case XEN_SYSCTL_CPUPOOL_OP_ADDCPU:
     {
         unsigned cpu;
+        const cpumask_t *cpus;
 
         cpu = op->cpu;
         debugtrace_printk("cpupool_assign_cpu(pool=%d,cpu=%d)\n",
                           op->cpupool_id, cpu);
+
         spin_lock(&cpupool_lock);
+
+        c = cpupool_find_by_id(op->cpupool_id);
+        ret = -ENOENT;
+        if ( c == NULL )
+            goto addcpu_out;
         if ( cpu == XEN_SYSCTL_CPUPOOL_PAR_ANY )
-            cpu = cpumask_first(&cpupool_free_cpus);
+        {
+            for_each_cpu ( cpu, &cpupool_free_cpus )
+            {
+                cpus = sched_get_opt_cpumask(c->gran, cpu);
+                if ( cpumask_subset(cpus, &cpupool_free_cpus) )
+                    break;
+            }
+            ret = -ENODEV;
+            if ( cpu >= nr_cpu_ids )
+                goto addcpu_out;
+        }
         ret = -EINVAL;
         if ( cpu >= nr_cpu_ids )
             goto addcpu_out;
         ret = -ENODEV;
-        if ( !cpumask_test_cpu(cpu, &cpupool_free_cpus) ||
-             cpumask_test_cpu(cpu, &cpupool_locked_cpus) )
-            goto addcpu_out;
-        c = cpupool_find_by_id(op->cpupool_id);
-        ret = -ENOENT;
-        if ( c == NULL )
+        cpus = sched_get_opt_cpumask(c->gran, cpu);
+        if ( !cpumask_subset(cpus, &cpupool_free_cpus) ||
+             cpumask_intersects(cpus, &cpupool_locked_cpus) )
             goto addcpu_out;
         ret = cpupool_assign_cpu_locked(c, cpu);
+
     addcpu_out:
         spin_unlock(&cpupool_lock);
         debugtrace_printk("cpupool_assign_cpu(pool=%d,cpu=%d) ret %d\n",
                           op->cpupool_id, cpu, ret);
+
     }
     break;
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e411b6d03e..48ddbdfd7e 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -2606,8 +2606,7 @@ static struct notifier_block cpu_schedule_nfb = {
     .notifier_call = cpu_schedule_callback
 };
 
-static const cpumask_t *sched_get_opt_cpumask(enum sched_gran opt,
-                                              unsigned int cpu)
+const cpumask_t *sched_get_opt_cpumask(enum sched_gran opt, unsigned int cpu)
 {
     const cpumask_t *mask;
 
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 780735dda3..cd731d7172 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -638,5 +638,6 @@ affinity_balance_cpumask(const struct sched_unit *unit, int step,
 }
 
 void sched_rm_cpu(unsigned int cpu);
+const cpumask_t *sched_get_opt_cpumask(enum sched_gran opt, unsigned int cpu);
 
 #endif /* __XEN_SCHED_IF_H__ */
-- 
2.16.4



* [Xen-devel] [PATCH v5 18/19] xen/sched: disable scheduling when entering ACPI deep sleep states
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (16 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 17/19] xen/sched: support core scheduling for moving cpus to/from cpupools Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum Juergen Gross
  18 siblings, 0 replies; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli,
	Roger Pau Monné

When entering deep sleep states all domains are paused, resulting in
all cpus running only idle vcpus. This enables us to stop scheduling
completely in order to avoid synchronization problems with core
scheduling when individual cpus are offlined.

Disabling the scheduler is done by replacing the scheduling softirq
handlers with a dummy routine which only enables tasklets to run.
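
A standalone sketch of the handler swap (plain C, not Xen code; the
function pointer models what open_softirq() does for the two
scheduling softirqs):

    #include <stdbool.h>
    #include <stdio.h>

    static bool scheduler_active = true;
    static void (*schedule_handler)(void);

    static void schedule_real(void)  { puts("full schedule()"); }
    static void schedule_dummy(void) { puts("tasklet check only"); }

    static void scheduler_disable(void)
    {
        scheduler_active = false;
        schedule_handler = schedule_dummy;
    }

    static void scheduler_enable(void)
    {
        schedule_handler = schedule_real;
        scheduler_active = true;
    }

    int main(void)
    {
        scheduler_enable();
        schedule_handler();    /* normal operation */
        scheduler_disable();   /* as called from freeze_domains() */
        schedule_handler();    /* only tasklet work is considered */
        printf("scheduler active: %d\n", scheduler_active);
        return 0;
    }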

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
V2: new patch
---
 xen/arch/x86/acpi/power.c |  4 ++++
 xen/common/schedule.c     | 31 +++++++++++++++++++++++++++++--
 xen/include/xen/sched.h   |  2 ++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index 01e6aec4e8..8078352312 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -145,12 +145,16 @@ static void freeze_domains(void)
     for_each_domain ( d )
         domain_pause(d);
     rcu_read_unlock(&domlist_read_lock);
+
+    scheduler_disable();
 }
 
 static void thaw_domains(void)
 {
     struct domain *d;
 
+    scheduler_enable();
+
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
     {
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 48ddbdfd7e..dbffec8cf2 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -91,6 +91,8 @@ extern const struct scheduler *__start_schedulers_array[], *__end_schedulers_arr
 
 static struct scheduler __read_mostly ops;
 
+static bool scheduler_active;
+
 static void sched_set_affinity(
     struct sched_unit *unit, const cpumask_t *hard, const cpumask_t *soft);
 
@@ -2275,6 +2277,13 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
         cpu_relax();
 
         *lock = pcpu_schedule_lock_irq(cpu);
+
+        if ( unlikely(!scheduler_active) )
+        {
+            ASSERT(is_idle_unit(prev));
+            atomic_set(&prev->next_task->rendezvous_out_cnt, 0);
+            prev->rendezvous_in_cnt = 0;
+        }
     }
 
     return prev->next_task;
@@ -2629,14 +2638,32 @@ const cpumask_t *sched_get_opt_cpumask(enum sched_gran opt, unsigned int cpu)
     return mask;
 }
 
+static void schedule_dummy(void)
+{
+    sched_tasklet_check_cpu(smp_processor_id());
+}
+
+void scheduler_disable(void)
+{
+    scheduler_active = false;
+    open_softirq(SCHEDULE_SOFTIRQ, schedule_dummy);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, schedule_dummy);
+}
+
+void scheduler_enable(void)
+{
+    open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
+    scheduler_active = true;
+}
+
 /* Initialise the data structures. */
 void __init scheduler_init(void)
 {
     struct domain *idle_domain;
     int i;
 
-    open_softirq(SCHEDULE_SOFTIRQ, schedule);
-    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
+    scheduler_enable();
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index a40bd5fb56..629a4c52e0 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -933,6 +933,8 @@ void restore_vcpu_affinity(struct domain *d);
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
 uint64_t get_cpu_idle_time(unsigned int cpu);
 void sched_guest_idle(void (*idle) (void), unsigned int cpu);
+void scheduler_enable(void);
+void scheduler_disable(void);
 
 /*
  * Used by idle loop to decide whether there is work to do:
-- 
2.16.4



* [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum
  2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
                   ` (17 preceding siblings ...)
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 18/19] xen/sched: disable scheduling when entering ACPI deep sleep states Juergen Gross
@ 2019-09-30  5:21 ` Juergen Gross
  2019-09-30  9:37   ` Andrew Cooper
  18 siblings, 1 reply; 33+ messages in thread
From: Juergen Gross @ 2019-09-30  5:21 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli,
	Roger Pau Monné

Add a scheduling granularity enum ("cpu", "core", "socket") for
specification of the scheduling granularity. Initially it is set to
"cpu"; this can be modified via the new boot parameter (x86 only)
"sched-gran".

According to the selected granularity sched_granularity is set after
all cpus are online.

A check is added verifying that all sched resources hold the same
number of cpus; if that isn't the case, fall back to core or cpu
scheduling.
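
For illustration, selecting core granularity then amounts to booting
the hypervisor with e.g. (the other options are placeholders):

    xen.gz ... sched-gran=core

On a system where the sched resources would not all hold the same
number of cpus this warns and falls back to sched-gran=cpu, as
implemented in cpupool_gran_init() below.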

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
---
RFC V2:
- fixed freeing of sched_res when merging cpus
- rename parameter to "sched-gran" (Jan Beulich)
- rename parameter option from "thread" to "cpu" (Jan Beulich)

V1:
- rename scheduler_smp_init() to scheduler_gran_init(), let it be called
  by cpupool_init()
- avoid using literal cpu number 0 in scheduler_percpu_init() (Jan Beulich)
- style correction (Jan Beulich)
- fallback to smaller granularity instead of panic in case of
  unbalanced cpu configuration

V2:
- style changes (Jan Beulich)
- introduce CONFIG_HAS_SCHED_GRANULARITY (Jan Beulich)

V4:
- move code to cpupool.c
---
 xen/arch/x86/Kconfig |  1 +
 xen/common/Kconfig   |  3 ++
 xen/common/cpupool.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 288dc6c042..3f88adae97 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -22,6 +22,7 @@ config X86
 	select HAS_PASSTHROUGH
 	select HAS_PCI
 	select HAS_PDX
+	select HAS_SCHED_GRANULARITY
 	select HAS_UBSAN
 	select HAS_VPCI if !PV_SHIM_EXCLUSIVE && HVM
 	select NEEDS_LIBELF
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 16829f6274..e9247871a8 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -63,6 +63,9 @@ config HAS_GDBSX
 config HAS_IOPORTS
 	bool
 
+config HAS_SCHED_GRANULARITY
+	bool
+
 config NEEDS_LIBELF
 	bool
 
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index f7a13c7a4c..4d3adbdd8d 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -17,6 +17,7 @@
 #include <xen/percpu.h>
 #include <xen/sched.h>
 #include <xen/sched-if.h>
+#include <xen/warning.h>
 #include <xen/keyhandler.h>
 #include <xen/cpu.h>
 
@@ -37,6 +38,83 @@ static DEFINE_SPINLOCK(cpupool_lock);
 static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
 static unsigned int __read_mostly sched_granularity = 1;
 
+#ifdef CONFIG_HAS_SCHED_GRANULARITY
+static int __init sched_select_granularity(const char *str)
+{
+    if ( strcmp("cpu", str) == 0 )
+        opt_sched_granularity = SCHED_GRAN_cpu;
+    else if ( strcmp("core", str) == 0 )
+        opt_sched_granularity = SCHED_GRAN_core;
+    else if ( strcmp("socket", str) == 0 )
+        opt_sched_granularity = SCHED_GRAN_socket;
+    else
+        return -EINVAL;
+
+    return 0;
+}
+custom_param("sched-gran", sched_select_granularity);
+#endif
+
+static unsigned int __init cpupool_check_granularity(void)
+{
+    unsigned int cpu;
+    unsigned int siblings, gran = 0;
+
+    if ( opt_sched_granularity == SCHED_GRAN_cpu )
+        return 1;
+
+    for_each_online_cpu ( cpu )
+    {
+        siblings = cpumask_weight(sched_get_opt_cpumask(opt_sched_granularity,
+                                                        cpu));
+        if ( gran == 0 )
+            gran = siblings;
+        else if ( gran != siblings )
+            return 0;
+    }
+
+    sched_disable_smt_switching = true;
+
+    return gran;
+}
+
+/* Setup data for selected scheduler granularity. */
+static void __init cpupool_gran_init(void)
+{
+    unsigned int gran = 0;
+    const char *fallback = NULL;
+
+    while ( gran == 0 )
+    {
+        gran = cpupool_check_granularity();
+
+        if ( gran == 0 )
+        {
+            switch ( opt_sched_granularity )
+            {
+            case SCHED_GRAN_core:
+                opt_sched_granularity = SCHED_GRAN_cpu;
+                fallback = "Asymmetric cpu configuration.\n"
+                           "Falling back to sched-gran=cpu.\n";
+                break;
+            case SCHED_GRAN_socket:
+                opt_sched_granularity = SCHED_GRAN_core;
+                fallback = "Asymmetric cpu configuration.\n"
+                           "Falling back to sched-gran=core.\n";
+                break;
+            default:
+                ASSERT_UNREACHABLE();
+                break;
+            }
+        }
+    }
+
+    if ( fallback )
+        warning_add(fallback);
+
+    sched_granularity = gran;
+}
+
 unsigned int cpupool_get_granularity(const struct cpupool *c)
 {
     return c ? sched_granularity : 1;
@@ -871,6 +949,8 @@ static int __init cpupool_init(void)
     unsigned int cpu;
     int err;
 
+    cpupool_gran_init();
+
     cpupool0 = cpupool_create(0, 0, &err);
     BUG_ON(cpupool0 == NULL);
     cpupool_put(cpupool0);
-- 
2.16.4



* Re: [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state()
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state() Juergen Gross
@ 2019-09-30  7:22   ` Dario Faggioli
  0 siblings, 0 replies; 33+ messages in thread
From: Dario Faggioli @ 2019-09-30  7:22 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: sstabellini, wl, konrad.wilk, George.Dunlap, tim, ian.jackson,
	robert.vanvossen, julien.grall, josh.whitehead, mengxu,
	Jan Beulich, andrew.cooper3


On Mon, 2019-09-30 at 07:21 +0200, Juergen Gross wrote:
> Today the vcpu runstate of a new scheduled vcpu is always set to
> "running" even if at that time vcpu_runnable() is already returning
> false due to a race (e.g. with pausing the vcpu).
> 
> With core scheduling this can no longer work as not all vcpus of a
> schedule unit have to be "running" when being scheduled. So the
> vcpu's
> new runstate has to be selected at the same time as the runnability
> of
> the related schedule unit is probed.
> 
> For this purpose introduce a new helper unit_runnable_state() which
> will save the new runstate of all tested vcpus in a new field of the
> vcpu struct.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> RFC V2:
> - new patch
> V3:
> - add vcpu loop to unit_runnable_state() right now instead of doing
>   so in next patch (Jan Beulich, Dario Faggioli)
> - make new_state unsigned int (Jan Beulich)
> V4:
> - add comment explaining unit_runnable_state() (Jan Beulich)
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* Re: [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware Juergen Gross
@ 2019-09-30  7:24   ` Dario Faggioli
  0 siblings, 0 replies; 33+ messages in thread
From: Dario Faggioli @ 2019-09-30  7:24 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: sstabellini, wl, konrad.wilk, george.dunlap, ian.jackson, tim,
	julien.grall, Jan Beulich, andrew.cooper3


On Mon, 2019-09-30 at 07:21 +0200, Juergen Gross wrote:
> vcpu_wake() and vcpu_sleep() need to be made core scheduling aware:
> they might need to switch a single vcpu of an already scheduled unit
> between running and not running.
> 
> Especially when vcpu_sleep() for a vcpu is being called by a vcpu of
> the same scheduling unit special care must be taken in order to avoid
> a deadlock: the vcpu to be put asleep must be forced through a
> context switch without doing so for the calling vcpu. For this
> purpose add a vcpu flag handled in sched_slave() and in
> sched_wait_rendezvous_in() allowing a vcpu of the currently running
> unit to switch state at a higher priority than a normal schedule
> event.
> 
> Use the same mechanism when waking up a vcpu of a currently active
> unit.
> 
> While at it make vcpu_sleep_nosync_locked() static as it is used in
> schedule.c only.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> RFC V2: add vcpu_sleep() handling and force_context_switch flag
> V2: fix runstate change in sched_force_context_switch()
> V4:
> - use unit_scheduler() where appropriate (Jan Beulich)
> - make cpu parameter unsigned int (Jan Beulich)
> - comments (Jan Beulich)
> - use true instead 1 for setting bool (Jan Beulich)
> - const parameter (Jan Beulich)
> V5:
> - add comments (Dario Faggioli)
>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* Re: [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit Juergen Gross
@ 2019-09-30  7:28   ` Dario Faggioli
  2019-09-30 10:45   ` Jan Beulich
  1 sibling, 0 replies; 33+ messages in thread
From: Dario Faggioli @ 2019-09-30  7:28 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: sstabellini, wl, konrad.wilk, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, julien.grall, Jan Beulich, Volodymyr_Babchuk,
	roger.pau


On Mon, 2019-09-30 at 07:21 +0200, Juergen Gross wrote:
> When scheduling an unit with multiple vcpus there is no guarantee all
> vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back
> to
> idle vcpu of the current cpu in that case. This requires to store the
> correct schedule_unit pointer in the idle vcpu as long as it used as
> fallback vcpu.
> 
> In order to modify the runstates of the correct vcpus when switching
> schedule units merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an access function to the
> current variable of other cpus.
> 
> Today context_saved() is called in case previous and next vcpus
> differ
> when doing a context switch. With an idle vcpu being capable to be a
> substitute for an offline vcpu this is problematic when switching to
> an idle scheduling unit. An idle previous vcpu leaves us in doubt
> which
> schedule unit was active previously, so save the previous unit
> pointer
> in the per-schedule resource area. If it is NULL the unit has not
> changed and we don't have to set the previous unit to be not running.
> 
> When running an idle vcpu in a non-idle scheduling unit use a
> specific
> guest idle loop not performing any non-softirq tasklets and
> livepatching in order to avoid populating the cpu caches with memory
> used by other domains (as far as possible). Softirqs are considered
> to be safe.
> 
> In order to avoid livepatching when going to guest idle another
> variant of reset_stack_and_jump() not calling
> check_for_livepatch_work
> is needed.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Acked-by: Julien Grall <julien.grall@arm.com>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> RFC V2:
> - new patch (Andrew Cooper)
> 
> V1:
> - use urgent_count to select correct idle routine (Jan Beulich)
> 
> V2:
> - set vcpu->is_running in context_saved()
> - introduce reset_stack_and_jump_nolp() (Jan Beulich)
> - readd scrubbing (Jan Beulich, Andrew Cooper)
> - get_cpu_current() _NOT_ moved to include/asm-x86/current.h as the
>   needed reference of stack_base[] results in a #include hell
> 
> V3:
> - split context_saved() into unit_context_saved() and
> vcpu_context_saved()
> 
> V4:
> - rename sd -> sr (Jan Beulich)
> - use unsigned int for cpu (Jan Beulich)
> - add comment in sched_context_switch() (Jan Beulich)
> - add comment before definition of get_cpu_current() (Jan Beulich)
> 
> V5:
> - add comment (Dario Faggioli)
>
Saw it, and it's great.

Thanks for doing this!

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* Re: [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum Juergen Gross
@ 2019-09-30  9:37   ` Andrew Cooper
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Cooper @ 2019-09-30  9:37 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Tim Deegan, Ian Jackson, Dario Faggioli,
	Julien Grall, Jan Beulich, Roger Pau Monné

On 30/09/2019 06:21, Juergen Gross wrote:
> Add a scheduling granularity enum ("cpu", "core", "socket") for
> specification of the scheduling granularity. Initially it is set to
> "cpu", this can be modified by the new boot parameter (x86 only)
> "sched-gran".
>
> ---
>  xen/arch/x86/Kconfig |  1 +
>  xen/common/Kconfig   |  3 ++
>  xen/common/cpupool.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 84 insertions(+)

Missing a patch to xen-command-line.pandoc.

~Andrew


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
@ 2019-09-30 10:36   ` Jan Beulich
  2019-09-30 10:38     ` Andrew Cooper
  2019-09-30 10:54   ` Jan Beulich
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:36 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel, Dario Faggioli, Volodymyr Babchuk,
	Roger Pau Monné

On 30.09.2019 07:21, Juergen Gross wrote:
> When switching sched units synchronize all vcpus of the new unit to be
> scheduled at the same time.
> 
> A variable sched_granularity is added which holds the number of vcpus
> per schedule unit.
> 
> As tasklets require to schedule the idle unit it is required to set the
> tasklet_work_scheduled parameter of do_schedule() to true if any cpu
> covered by the current schedule() call has any pending tasklet work.
> 
> For joining other vcpus of the schedule unit we need to add a new
> softirq SCHED_SLAVE_SOFTIRQ in order to have a way to initiate a
> context switch without calling the generic schedule() function
> selecting the vcpu to switch to, as we already know which vcpu we
> want to run. This has the other advantage not to lose any other
> concurrent SCHEDULE_SOFTIRQ events.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

x86 and applicable common code parts
Acked-by: Jan Beulich <jbeulich@suse.com>

However, ...

> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
> +                                 s_time_t now)
> +{
> +    if ( unlikely(vprev == vnext) )
>      {
> -        pcpu_schedule_unlock_irq(lock, cpu);
>          TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
> -                 next->domain->domain_id, next->unit_id,
> -                 now - prev->state_entry_time,
> -                 prev->next_time);
> -        trace_continue_running(next->vcpu_list);
> -        return continue_running(prev->vcpu_list);
> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
> +                 now - vprev->runstate.state_entry_time,
> +                 vprev->sched_unit->next_time);
> +        sched_context_switched(vprev, vnext);
> +        trace_continue_running(vnext);
> +        return continue_running(vprev);
>      }

... I don't recall whether there were compiler (clang?) versions
disallowing (or at least warning about) use of this extension. I'm
having difficulty thinking of a way to find an example use elsewhere
in our code which would prove this isn't the first instance. Hence I
wonder whether it wouldn't be better to avoid use of the extension
here.
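
To illustrate, the construct at issue boils down to this minimal
standalone example (not code from the series):

    void f(void);

    void g(void)
    {
        return f();   /* "return" with an expression of void type */
    }

ISO C forbids a return statement with an expression inside a function
whose return type is void; gcc merely treats this as an extension
(diagnosed only with -pedantic), so other compilers could legitimately
object.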

Jan


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30 10:36   ` Jan Beulich
@ 2019-09-30 10:38     ` Andrew Cooper
  2019-09-30 10:39       ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Cooper @ 2019-09-30 10:38 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Tim Deegan, Ian Jackson, Dario Faggioli,
	Julien Grall, xen-devel, Volodymyr Babchuk, Roger Pau Monné

On 30/09/2019 11:36, Jan Beulich wrote:
> On 30.09.2019 07:21, Juergen Gross wrote:
>> When switching sched units synchronize all vcpus of the new unit to be
>> scheduled at the same time.
>>
>> A variable sched_granularity is added which holds the number of vcpus
>> per schedule unit.
>>
>> As tasklets require to schedule the idle unit it is required to set the
>> tasklet_work_scheduled parameter of do_schedule() to true if any cpu
>> covered by the current schedule() call has any pending tasklet work.
>>
>> For joining other vcpus of the schedule unit we need to add a new
>> softirq SCHED_SLAVE_SOFTIRQ in order to have a way to initiate a
>> context switch without calling the generic schedule() function
>> selecting the vcpu to switch to, as we already know which vcpu we
>> want to run. This has the other advantage not to lose any other
>> concurrent SCHEDULE_SOFTIRQ events.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> x86 and applicable common code parts
> Acked-by: Jan Beulich <jbeulich@suse.com>
>
> However, ...
>
>> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
>> +                                 s_time_t now)
>> +{
>> +    if ( unlikely(vprev == vnext) )
>>      {
>> -        pcpu_schedule_unlock_irq(lock, cpu);
>>          TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
>> -                 next->domain->domain_id, next->unit_id,
>> -                 now - prev->state_entry_time,
>> -                 prev->next_time);
>> -        trace_continue_running(next->vcpu_list);
>> -        return continue_running(prev->vcpu_list);
>> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
>> +                 now - vprev->runstate.state_entry_time,
>> +                 vprev->sched_unit->next_time);
>> +        sched_context_switched(vprev, vnext);
>> +        trace_continue_running(vnext);
>> +        return continue_running(vprev);
>>      }
> ... I don't recall whether there were compiler (clang?) versions
> disallowing (or at least warning about) use of this extension.

Which extension?

> I'm
> having difficulty thinking of a way to find a possible example use
> elsewhere in our code, proving that this isn't the first instance.
> Hence I wonder whether it wouldn't be better to avoid use of the
> extension here.

Gitlab can give us the answer easily.

~Andrew


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30 10:38     ` Andrew Cooper
@ 2019-09-30 10:39       ` Jan Beulich
  2019-09-30 10:42         ` Jürgen Groß
  2019-09-30 10:43         ` George Dunlap
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:39 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Tim Deegan, Ian Jackson,
	Dario Faggioli, Julien Grall, xen-devel, Volodymyr Babchuk,
	Roger Pau Monné

On 30.09.2019 12:38, Andrew Cooper wrote:
> On 30/09/2019 11:36, Jan Beulich wrote:
>> On 30.09.2019 07:21, Juergen Gross wrote:
>>> When switching sched units synchronize all vcpus of the new unit to be
>>> scheduled at the same time.
>>>
>>> A variable sched_granularity is added which holds the number of vcpus
>>> per schedule unit.
>>>
>>> As tasklets require to schedule the idle unit it is required to set the
>>> tasklet_work_scheduled parameter of do_schedule() to true if any cpu
>>> covered by the current schedule() call has any pending tasklet work.
>>>
>>> For joining other vcpus of the schedule unit we need to add a new
>>> softirq SCHED_SLAVE_SOFTIRQ in order to have a way to initiate a
>>> context switch without calling the generic schedule() function
>>> selecting the vcpu to switch to, as we already know which vcpu we
>>> want to run. This has the other advantage not to lose any other
>>> concurrent SCHEDULE_SOFTIRQ events.
>>>
>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
>> x86 and applicable common code parts
>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>
>> However, ...
>>
>>> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
>>> +                                 s_time_t now)
>>> +{
>>> +    if ( unlikely(vprev == vnext) )
>>>      {
>>> -        pcpu_schedule_unlock_irq(lock, cpu);
>>>          TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
>>> -                 next->domain->domain_id, next->unit_id,
>>> -                 now - prev->state_entry_time,
>>> -                 prev->next_time);
>>> -        trace_continue_running(next->vcpu_list);
>>> -        return continue_running(prev->vcpu_list);
>>> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
>>> +                 now - vprev->runstate.state_entry_time,
>>> +                 vprev->sched_unit->next_time);
>>> +        sched_context_switched(vprev, vnext);
>>> +        trace_continue_running(vnext);
>>> +        return continue_running(vprev);
>>>      }
>> ... I don't recall whether there were compiler (clang?) versions
>> disallowing (or at least warning about) use of this extension.
> 
> Which extension?

"return" with an expression of "void" type.

Jan


* Re: [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing Juergen Gross
@ 2019-09-30 10:41   ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:41 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel, Dario Faggioli

On 30.09.2019 07:21, Juergen Gross wrote:
> In several places there is support for multiple vcpus per sched unit
> missing. Add that missing support (with the exception of initial
> allocation) and missing helpers for that.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> RFC V2:
> - fix vcpu_runstate_helper()
> V1:
> - add special handling for idle unit in unit_runnable() and
>   unit_runnable_state()
> V2:
> - handle affinity_broken correctly (Jan Beulich)
> V3:
> - type for cpu ->unsigned int (Jan Beulich)
> ---
>  xen/common/domain.c        |  5 ++++-

Acked-by: Jan Beulich <jbeulich@suse.com>


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30 10:39       ` Jan Beulich
@ 2019-09-30 10:42         ` Jürgen Groß
  2019-09-30 10:56           ` Jan Beulich
  2019-09-30 10:43         ` George Dunlap
  1 sibling, 1 reply; 33+ messages in thread
From: Jürgen Groß @ 2019-09-30 10:42 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Tim Deegan, Ian Jackson, Dario Faggioli,
	Julien Grall, xen-devel, Volodymyr Babchuk, Roger Pau Monné

On 30.09.19 12:39, Jan Beulich wrote:
> On 30.09.2019 12:38, Andrew Cooper wrote:
>> On 30/09/2019 11:36, Jan Beulich wrote:
>>> On 30.09.2019 07:21, Juergen Gross wrote:
>>>> When switching sched units synchronize all vcpus of the new unit to be
>>>> scheduled at the same time.
>>>>
>>>> A variable sched_granularity is added which holds the number of vcpus
>>>> per schedule unit.
>>>>
>>>> As tasklets require to schedule the idle unit it is required to set the
>>>> tasklet_work_scheduled parameter of do_schedule() to true if any cpu
>>>> covered by the current schedule() call has any pending tasklet work.
>>>>
>>>> For joining other vcpus of the schedule unit we need to add a new
>>>> softirq SCHED_SLAVE_SOFTIRQ in order to have a way to initiate a
>>>> context switch without calling the generic schedule() function
>>>> selecting the vcpu to switch to, as we already know which vcpu we
>>>> want to run. This has the other advantage not to lose any other
>>>> concurrent SCHEDULE_SOFTIRQ events.
>>>>
>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
>>> x86 and applicable common code parts
>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> However, ...
>>>
>>>> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
>>>> +                                 s_time_t now)
>>>> +{
>>>> +    if ( unlikely(vprev == vnext) )
>>>>       {
>>>> -        pcpu_schedule_unlock_irq(lock, cpu);
>>>>           TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
>>>> -                 next->domain->domain_id, next->unit_id,
>>>> -                 now - prev->state_entry_time,
>>>> -                 prev->next_time);
>>>> -        trace_continue_running(next->vcpu_list);
>>>> -        return continue_running(prev->vcpu_list);
>>>> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
>>>> +                 now - vprev->runstate.state_entry_time,
>>>> +                 vprev->sched_unit->next_time);
>>>> +        sched_context_switched(vprev, vnext);
>>>> +        trace_continue_running(vnext);
>>>> +        return continue_running(vprev);
>>>>       }
>>> ... I don't recall whether there were compiler (clang?) versions not
>>> allowing (or at least warning about) use of this extension.
>>
>> Which extension?
> 
> "return" with an expression of "void" type.

It was there in the original code, too:

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/schedule.c;h=fd587622f4c3ee13d57334f90b1eab4b17031c0b;hb=refs/heads/staging-4.12#l1536


Juergen



* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30 10:39       ` Jan Beulich
  2019-09-30 10:42         ` Jürgen Groß
@ 2019-09-30 10:43         ` George Dunlap
  1 sibling, 0 replies; 33+ messages in thread
From: George Dunlap @ 2019-09-30 10:43 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Tim Deegan, Ian Jackson,
	Dario Faggioli, Julien Grall, xen-devel, Volodymyr Babchuk,
	Roger Pau Monné

On 9/30/19 11:39 AM, Jan Beulich wrote:
> On 30.09.2019 12:38, Andrew Cooper wrote:
>> On 30/09/2019 11:36, Jan Beulich wrote:
>>> On 30.09.2019 07:21, Juergen Gross wrote:
>>>> When switching sched units synchronize all vcpus of the new unit to be
>>>> scheduled at the same time.
>>>>
>>>> A variable sched_granularity is added which holds the number of vcpus
>>>> per schedule unit.
>>>>
>>>> As tasklets require the idle unit to be scheduled, the
>>>> tasklet_work_scheduled parameter of do_schedule() must be set to true
>>>> if any cpu covered by the current schedule() call has pending tasklet
>>>> work.
>>>>
>>>> For joining the other vcpus of a schedule unit we need to add a new
>>>> softirq, SCHED_SLAVE_SOFTIRQ, in order to have a way to initiate a
>>>> context switch without calling the generic schedule() function to
>>>> select the vcpu to switch to, as we already know which vcpu we want
>>>> to run. This has the additional advantage of not losing any
>>>> concurrent SCHEDULE_SOFTIRQ events.
>>>>
>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
>>> x86 and applicable common code parts
>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> However, ...
>>>
>>>> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
>>>> +                                 s_time_t now)
>>>> +{
>>>> +    if ( unlikely(vprev == vnext) )
>>>>      {
>>>> -        pcpu_schedule_unlock_irq(lock, cpu);
>>>>          TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
>>>> -                 next->domain->domain_id, next->unit_id,
>>>> -                 now - prev->state_entry_time,
>>>> -                 prev->next_time);
>>>> -        trace_continue_running(next->vcpu_list);
>>>> -        return continue_running(prev->vcpu_list);
>>>> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
>>>> +                 now - vprev->runstate.state_entry_time,
>>>> +                 vprev->sched_unit->next_time);
>>>> +        sched_context_switched(vprev, vnext);
>>>> +        trace_continue_running(vnext);
>>>> +        return continue_running(vprev);
>>>>      }
>>> ... I don't recall whether there were compiler (clang?) versions not
>>> allowing (or at least warning about) use of this extension.
>>
>> Which extension?
> 
> "return" with an expression of "void" type.

I think that must be a mistake.  In this instance there isn't really
even a "syntactic sugar"* reason to use it.

 -George

* Syntactic sugar being to do something like:

if ( blah )
   return foo();

rather than

if ( blah ) {
    foo();
    return;
}


* Re: [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit Juergen Gross
  2019-09-30  7:28   ` Dario Faggioli
@ 2019-09-30 10:45   ` Jan Beulich
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:45 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel, Dario Faggioli, Volodymyr Babchuk,
	Roger Pau Monné

On 30.09.2019 07:21, Juergen Gross wrote:
> When scheduling a unit with multiple vcpus there is no guarantee that
> all vcpus are available (e.g. above maxvcpus or vcpu offline). Fall
> back to the idle vcpu of the current cpu in that case. This requires
> storing the correct schedule_unit pointer in the idle vcpu as long as
> it is used as a fallback vcpu.
> 
> In order to modify the runstates of the correct vcpus when switching
> schedule units merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an access function to the
> current variable of other cpus.
> 
> Today context_saved() is called in case previous and next vcpus differ
> when doing a context switch. With an idle vcpu capable of substituting
> for an offline vcpu this is problematic when switching to an idle
> scheduling unit. An idle previous vcpu leaves us in doubt which
> schedule unit was active previously, so save the previous unit pointer
> in the per-schedule resource area. If it is NULL the unit has not
> changed and we don't have to set the previous unit to be not running.
> 
> When running an idle vcpu in a non-idle scheduling unit, use a specific
> guest idle loop not performing any non-softirq tasklets or livepatching
> in order to avoid populating the cpu caches with memory used by other
> domains (as far as possible). Softirqs are considered to be safe.
> 
> In order to avoid livepatching when going to guest idle another
> variant of reset_stack_and_jump() not calling check_for_livepatch_work
> is needed.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Acked-by: Julien Grall <julien.grall@arm.com>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

x86:
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
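
To make the fallback concrete, a standalone sketch (simplified,
hypothetical names; the real code works on Xen's vcpu and sched_unit
structures): every slot of a unit that has no online vcpu is filled
with the idle vcpu of the cpu in question.

#include <stdio.h>

struct vcpu { int online; const char *name; };

static struct vcpu idle_vcpu = { 1, "idle" };

/* Pick the vcpu to run in the given slot of a unit, falling back to
 * the idle vcpu if the slot's vcpu is missing or offline. */
static struct vcpu *slot_vcpu(struct vcpu *unit[], unsigned int nr,
                              unsigned int slot)
{
    return ( slot < nr && unit[slot]->online ) ? unit[slot] : &idle_vcpu;
}

int main(void)
{
    struct vcpu v0 = { 1, "d1v0" }, v1 = { 0, "d1v1" };  /* v1 offline */
    struct vcpu *unit[] = { &v0, &v1 };

    for ( unsigned int slot = 0; slot < 2; slot++ )
        printf("slot %u runs %s\n", slot, slot_vcpu(unit, 2, slot)->name);

    return 0;
}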


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
  2019-09-30 10:36   ` Jan Beulich
@ 2019-09-30 10:54   ` Jan Beulich
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:54 UTC (permalink / raw)
  To: Juergen Gross, Stefano Stabellini, Julien Grall
  Cc: Wei Liu, Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, Dario Faggioli,
	Volodymyr Babchuk, Roger Pau Monné

On 30.09.2019 07:21, Juergen Gross wrote:
> When switching sched units synchronize all vcpus of the new unit to be
> scheduled at the same time.
> 
> A variable sched_granularity is added which holds the number of vcpus
> per schedule unit.
> 
> As tasklets require the idle unit to be scheduled, the
> tasklet_work_scheduled parameter of do_schedule() must be set to true
> if any cpu covered by the current schedule() call has pending tasklet
> work.
> 
> For joining the other vcpus of a schedule unit we need to add a new
> softirq, SCHED_SLAVE_SOFTIRQ, in order to have a way to initiate a
> context switch without calling the generic schedule() function to
> select the vcpu to switch to, as we already know which vcpu we want
> to run. This has the additional advantage of not losing any
> concurrent SCHEDULE_SOFTIRQ events.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> RFC V2:
> - move syncing after context_switch() to schedule.c
> V2:
> - don't run tasklets directly from sched_wait_rendezvous_in()
> V3:
> - adapt array size in sched_move_domain() (Jan Beulich)
> - int -> unsigned int (Jan Beulich)
> V4:
> - renamed sd to sr in several places (Jan Beulich)
> - swap stop_timer() and NOW() calls (Jan Beulich)
> - context_switch() on ARM returns - handle that (Jan Beulich)

Especially because of this (previously overlooked) aspect I think
I'd prefer an Arm maintainer ack here before committing no matter
that ...

> ---
>  xen/arch/arm/domain.c      |   2 +-

... this is a rather minimal change.

Jan
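
As background for the rendezvous described in the quoted commit
message, a standalone sketch of the core idea in plain C11 (threads
stand in for sibling cpus; the names and the yield-based wait are
simplifying assumptions - while waiting, the actual patch processes
pending softirqs such as SCHED_SLAVE_SOFTIRQ instead):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <threads.h>

#define GRANULARITY 2                 /* "cpus" per schedule unit */

static atomic_uint rendezvous_in;     /* cpus still to enter schedule() */

static void unit_rendezvous(unsigned int cpu)
{
    /* The last cpu to arrive releases the whole unit. */
    if ( atomic_fetch_sub(&rendezvous_in, 1) == 1 )
        printf("cpu%u: last one in, unit may switch now\n", cpu);
    else
        while ( atomic_load(&rendezvous_in) > 0 )
            thrd_yield();

    printf("cpu%u: starts running the new unit\n", cpu);
}

static int cpu_thread(void *arg)
{
    unit_rendezvous((unsigned int)(uintptr_t)arg);
    return 0;
}

int main(void)
{
    thrd_t t[GRANULARITY];

    atomic_store(&rendezvous_in, GRANULARITY);

    for ( unsigned int i = 0; i < GRANULARITY; i++ )
        thrd_create(&t[i], cpu_thread, (void *)(uintptr_t)i);
    for ( unsigned int i = 0; i < GRANULARITY; i++ )
        thrd_join(t[i], NULL);

    return 0;
}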


* Re: [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit
  2019-09-30 10:42         ` Jürgen Groß
@ 2019-09-30 10:56           ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2019-09-30 10:56 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Tim Deegan, Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Dario Faggioli,
	Julien Grall, xen-devel, Volodymyr Babchuk, Roger Pau Monné

On 30.09.2019 12:42, Jürgen Groß wrote:
> On 30.09.19 12:39, Jan Beulich wrote:
>> On 30.09.2019 12:38, Andrew Cooper wrote:
>>> On 30/09/2019 11:36, Jan Beulich wrote:
>>>> On 30.09.2019 07:21, Juergen Gross wrote:
>>>>> When switching sched units synchronize all vcpus of the new unit to be
>>>>> scheduled at the same time.
>>>>>
>>>>> A variable sched_granularity is added which holds the number of vcpus
>>>>> per schedule unit.
>>>>>
>>>>> As tasklets require the idle unit to be scheduled, the
>>>>> tasklet_work_scheduled parameter of do_schedule() must be set to true
>>>>> if any cpu covered by the current schedule() call has pending tasklet
>>>>> work.
>>>>>
>>>>> For joining the other vcpus of a schedule unit we need to add a new
>>>>> softirq, SCHED_SLAVE_SOFTIRQ, in order to have a way to initiate a
>>>>> context switch without calling the generic schedule() function to
>>>>> select the vcpu to switch to, as we already know which vcpu we want
>>>>> to run. This has the additional advantage of not losing any
>>>>> concurrent SCHEDULE_SOFTIRQ events.
>>>>>
>>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>>> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
>>>> x86 and applicable common code parts
>>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> However, ...
>>>>
>>>>> +static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
>>>>> +                                 s_time_t now)
>>>>> +{
>>>>> +    if ( unlikely(vprev == vnext) )
>>>>>       {
>>>>> -        pcpu_schedule_unlock_irq(lock, cpu);
>>>>>           TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
>>>>> -                 next->domain->domain_id, next->unit_id,
>>>>> -                 now - prev->state_entry_time,
>>>>> -                 prev->next_time);
>>>>> -        trace_continue_running(next->vcpu_list);
>>>>> -        return continue_running(prev->vcpu_list);
>>>>> +                 vnext->domain->domain_id, vnext->sched_unit->unit_id,
>>>>> +                 now - vprev->runstate.state_entry_time,
>>>>> +                 vprev->sched_unit->next_time);
>>>>> +        sched_context_switched(vprev, vnext);
>>>>> +        trace_continue_running(vnext);
>>>>> +        return continue_running(vprev);
>>>>>       }
>>>> ... I don't recall whether there were compiler (clang?) versions not
>>>> allowing (or at least warning about) use of this extension.
>>>
>>> Which extension?
>>
>> "return" with an expression of "void" type.
> 
> It was there in the original code, too:
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/schedule.c;h=fd587622f4c3ee13d57334f90b1eab4b17031c0b;hb=refs/heads/staging-4.12#l1536

Oh, indeed - I must have been blind: It's also there in context above,
among the code being replaced.

Jan


Thread overview: 33+ messages
2019-09-30  5:21 [Xen-devel] [PATCH v5 00/19] xen: add core scheduling support Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 01/19] xen/sched: add code to sync scheduling of all vcpus of a sched unit Juergen Gross
2019-09-30 10:36   ` Jan Beulich
2019-09-30 10:38     ` Andrew Cooper
2019-09-30 10:39       ` Jan Beulich
2019-09-30 10:42         ` Jürgen Groß
2019-09-30 10:56           ` Jan Beulich
2019-09-30 10:43         ` George Dunlap
2019-09-30 10:54   ` Jan Beulich
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 02/19] xen/sched: introduce unit_runnable_state() Juergen Gross
2019-09-30  7:22   ` Dario Faggioli
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 03/19] xen/sched: add support for multiple vcpus per sched unit where missing Juergen Gross
2019-09-30 10:41   ` Jan Beulich
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 04/19] xen/sched: modify cpupool_domain_cpumask() to be an unit mask Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 05/19] xen/sched: support allocating multiple vcpus into one sched unit Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 06/19] xen/sched: add a percpu resource index Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 07/19] xen/sched: add fall back to idle vcpu when scheduling unit Juergen Gross
2019-09-30  7:28   ` Dario Faggioli
2019-09-30 10:45   ` Jan Beulich
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 08/19] xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware Juergen Gross
2019-09-30  7:24   ` Dario Faggioli
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 09/19] xen/sched: move per-cpu variable scheduler to struct sched_resource Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 10/19] xen/sched: move per-cpu variable cpupool " Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 11/19] xen/sched: reject switching smt on/off with core scheduling active Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 12/19] xen/sched: prepare per-cpupool scheduling granularity Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 13/19] xen/sched: split schedule_cpu_switch() Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 14/19] xen/sched: protect scheduling resource via rcu Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 15/19] xen/sched: support multiple cpus per scheduling resource Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 16/19] xen/sched: support differing granularity in schedule_cpu_[add/rm]() Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 17/19] xen/sched: support core scheduling for moving cpus to/from cpupools Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 18/19] xen/sched: disable scheduling when entering ACPI deep sleep states Juergen Gross
2019-09-30  5:21 ` [Xen-devel] [PATCH v5 19/19] xen/sched: add scheduling granularity enum Juergen Gross
2019-09-30  9:37   ` Andrew Cooper
