* [PATCH 1/5] xenalyze: Handle start-of-day ->RUNNING transitions
@ 2023-04-28  8:08 George Dunlap
  2023-04-28  8:08 ` [PATCH 2/5] xenalyze: Basic TRC_HVM_EMUL handling George Dunlap
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anthony Perard, Wei Liu

A recent xentrace highlighted an unhandled corner case in the vcpu
"start-of-day" logic, if the trace starts after the last running ->
non-running transition, but before the first non-running -> running
transition.  Because start-of-day wasn't handled, vcpu_next_update()
was expecting p->current to be NULL, and tripping out with the
following error message when it wasn't:

vcpu_next_update: FATAL: p->current not NULL! (d32768dv$p, runstate RUNSTATE_INIT)

where 32768 is the DEFAULT_DOMAIN, and $p is the pcpu number.

Instead of calling vcpu_start() piecemeal throughout
sched_runstate_process(), call it at the top of the function if the
vcpu in question is still in RUNSTATE_INIT, so that we can handle all
the cases in one place.

Sketch out at the top of the function all cases which we need to
handle, and what to do in those cases.  Some transitions tell us where
v is running; some transitions tell us about what is (or is not)
running on p; some transitions tell us neither.

If a transition tells us where v is now running, update its state;
otherwise leave it in INIT, in order to avoid having to deal with TSC
skew on start-up.

If a transition tells us what is or is not running on p, update
p->current (either to v or NULL).  Otherwise leave it alone.

If neither, do nothing.

Reifying those rules:

- If we're continuing to run, set v to RUNNING, and use p->first_tsc
  as the runstate time.

- If we're starting to run, set v to RUNNING, and use ri->tsc as the
  runstate time.

- If v is being descheduled, leave v in the INIT state to avoid dealing
  with TSC skew; but set p->current to NULL so that whatever is
  scheduled next won't trigger the assert in vcpu_next_update().

- If a vcpu is waking up (switching from one non-running state to
  another non-running state), leave v in INIT, and p in whatever
  state it's in (which may be the default domain, or some other vcpu
  which has already run).
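
In rough pseudocode, the rules above condense to something like the
following (a sketch only, using names from the patch below; the real
logic, with its sanity checks, is in vcpu_start()):

    /* v is still in RUNSTATE_INIT; old/new are sevt.{old,new}_runstate */
    if ( old == RUNSTATE_RUNNING )
        /* v was running on p; clear p->current so whatever is
         * scheduled next doesn't trip the check in vcpu_next_update() */
        p->current = NULL;
    if ( new == RUNSTATE_RUNNING ) {
        /* running -> running uses p->first_tsc; otherwise use ri->tsc */
        runstate_update(v, RUNSTATE_RUNNING,
                        (old == RUNSTATE_RUNNING) ? p->first_tsc : ri->tsc);
        p->current = v;
    }
    /* In all other cases, leave v in INIT and p->current alone */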

While here, fix the comment above vcpu_start; it's called when the
vcpu state is INIT, not when current is the default domain.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
CC: Anthony Perard <anthony.perard@cloud.com>
CC: Wei Liu <wl@xenproject.org>
---
 tools/xentrace/xenalyze.c | 159 ++++++++++++++++++++++++--------------
 1 file changed, 101 insertions(+), 58 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 12dcca9646..ff9716cd12 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -6885,39 +6885,86 @@ void vcpu_next_update(struct pcpu_info *p, struct vcpu_data *next, tsc_t tsc)
     p->lost_record.seen_valid_schedule = 1;
 }
 
-/* If current is the default domain, we're fixing up from something
- * like start-of-day.  Update what we can. */
-void vcpu_start(struct pcpu_info *p, struct vcpu_data *v) {
-    /* If vcpus are created, or first show up, in a "dead zone", this will
-     * fail. */
-    if( !p->current || p->current->d->did != DEFAULT_DOMAIN) {
-        fprintf(stderr, "Strange, p->current not default domain!\n");
-        error(ERR_FILE, NULL);
-        return;
-    }
+/* 
+ * If the vcpu in question is in state INIT, we're fixing up from something
+ * like start-of-day.  Update what we can.
+ */
+void vcpu_start(struct pcpu_info *p, struct vcpu_data *v,
+                int old_runstate, int new_runstate, tsc_t ri_tsc) {
+    tsc_t tsc;
+
+    /*
+     * 
+     * Cases:
+     * running -> running:
+     *  v -> running, using p->first_tsc
+     * {runnable, blocked} -> running:
+     *  v -> running, using ri->tsc
+     * running -> {runnable, blocked}:
+     *  Leave v INIT, but clear p->current in case another vcpu is scheduled
+     * blocked -> runnable:
+     *  Leave INIT, and also leave p->current, since we still don't know who's scheduled here
+     */
+
+    /* 
+     * NB that a vcpu won't come out of INIT until it starts running somewhere.  
+     * If this event is on a pcpu that has already seen a scheduling event, p->current
+     * should be null; if this is the first scheduling event on this pcpu, 
+     * p->current should be the default domain.
+     */
+    if( old_runstate == RUNSTATE_RUNNING ) {
+        if ( !p->current ||  p->current->d->did != DEFAULT_DOMAIN) {
+            fprintf(stderr, "Strange, p->current not default domain!\n");
+            error(ERR_FILE, NULL);
+            return;
 
-    if(!p->first_tsc) {
-        fprintf(stderr, "Strange, p%d first_tsc 0!\n", p->pid);
-        error(ERR_FILE, NULL);
+        }
+        
+        if(!p->first_tsc) {
+            fprintf(stderr, "Strange, p%d first_tsc 0!\n", p->pid);
+            error(ERR_FILE, NULL);
+        }
+
+        if(p->first_tsc <= p->current->runstate.tsc) {
+            fprintf(stderr, "Strange, first_tsc %llx < default_domain runstate tsc %llx!\n",
+                    p->first_tsc,
+                    p->current->runstate.tsc);
+            error(ERR_FILE, NULL);
+        }
+    
+        /* Change default domain to 'queued' */
+        runstate_update(p->current, RUNSTATE_QUEUED, p->first_tsc);
+
+        /* 
+         * Set current to NULL, so that if another vcpu (not in INIT)
+         * is scheduled here, we don't trip over the check in
+         * vcpu_next_update()
+         */
+        p->current = NULL;
     }
 
-    if(p->first_tsc <= p->current->runstate.tsc) {
-        fprintf(stderr, "Strange, first_tsc %llx < default_domain runstate tsc %llx!\n",
-                p->first_tsc,
-                p->current->runstate.tsc);
-        error(ERR_FILE, NULL);
+    /* TSC skew at start-of-day is hard to deal with.  Don't
+     * bring a vcpu out of INIT until it's seen to be actually
+     * running somewhere. */
+    if ( new_runstate != RUNSTATE_RUNNING ) {
+        fprintf(warn, "First schedule for d%dv%d doesn't take us into a running state; leaving INIT\n",
+                v->d->did, v->vid);
+
+        return;
     }
 
-    /* Change default domain to 'queued' */
-    runstate_update(p->current, RUNSTATE_QUEUED, p->first_tsc);
+    tsc = ri_tsc;
+    if ( old_runstate == RUNSTATE_RUNNING ) {
+        /* FIXME: Copy over data from the default domain this interval */
+        fprintf(warn, "Using first_tsc for d%dv%d (%lld cycles)\n",
+                v->d->did, v->vid, p->last_tsc - p->first_tsc);
 
-    /* FIXME: Copy over data from the default domain this interval */
-    fprintf(warn, "Using first_tsc for d%dv%d (%lld cycles)\n",
-            v->d->did, v->vid, p->last_tsc - p->first_tsc);
+        tsc = p->first_tsc;
+    }
 
     /* Simulate the time since the first tsc */
-    runstate_update(v, RUNSTATE_RUNNING, p->first_tsc);
-    p->time.tsc = p->first_tsc;
+    runstate_update(v, RUNSTATE_RUNNING, tsc);
+    p->time.tsc = tsc;
     p->current = v;
     pcpu_string_draw(p);
     v->p = p;
@@ -7021,6 +7068,13 @@ void sched_runstate_process(struct pcpu_info *p)
     last_oldstate = v->runstate.last_oldstate;
     v->runstate.last_oldstate.wrong = RUNSTATE_INIT;
 
+    /* Handle all "start-of-day" issues in one place.  This can be
+     * done before any of the other tracking or sanity checks. */
+    if ( v->runstate.state == RUNSTATE_INIT ) {
+        vcpu_start(p, v, sevt.old_runstate, sevt.new_runstate, ri->tsc);
+        return;
+    }
+
     /* Close vmexits when the putative reason for blocking / &c stops.
      * This way, we don't account cpu contention to some other overhead. */
     if(sevt.new_runstate == RUNSTATE_RUNNABLE
@@ -7190,32 +7244,27 @@ update:
      * or stopping actually running on a physical cpu. */
     if ( type == CONTINUE )
     {
-        if( v->runstate.state == RUNSTATE_INIT ) {
-            /* Start-of-day; account first tsc -> now to v */
-            vcpu_start(p, v);
-        } else {
-            /* Continue running.  First, do some sanity checks */
-            if ( v->runstate.state == RUNSTATE_LOST ) {
-                fprintf(warn, "WARNING: continue with d%dv%d in RUNSTATE_LOST.  Resetting current.\n",
-                        v->d->did, v->vid);
-                if ( p->current )
-                    vcpu_prev_update(p, p->current, ri->tsc, RUNSTATE_LOST);
-                vcpu_next_update(p, v, ri->tsc);
-            }
-            else if( v->runstate.state != RUNSTATE_RUNNING ) {
-                /* This should never happen. */
-                fprintf(warn, "FATAL: sevt.old_runstate running, but d%dv%d runstate %s!\n",
-                        v->d->did, v->vid, runstate_name[v->runstate.state]);
-                error(ERR_FILE, NULL);
-            } else if ( v->p != p ) {
-                fprintf(warn, "FATAL: continue on p%d, but d%dv%d p%d!\n",
-                        p->pid, v->d->did, v->vid,
-                        v->p ? v->p->pid : -1);
-                error(ERR_FILE, NULL);
-            }
-
-            runstate_update(v, RUNSTATE_RUNNING, ri->tsc);
+        /* Continue running.  First, do some sanity checks */
+        if ( v->runstate.state == RUNSTATE_LOST ) {
+            fprintf(warn, "WARNING: continue with d%dv%d in RUNSTATE_LOST.  Resetting current.\n",
+                    v->d->did, v->vid);
+            if ( p->current )
+                vcpu_prev_update(p, p->current, ri->tsc, RUNSTATE_LOST);
+            vcpu_next_update(p, v, ri->tsc);
+        }
+        else if( v->runstate.state != RUNSTATE_RUNNING ) {
+            /* This should never happen. */
+            fprintf(warn, "FATAL: sevt.old_runstate running, but d%dv%d runstate %s!\n",
+                    v->d->did, v->vid, runstate_name[v->runstate.state]);
+            error(ERR_FILE, NULL);
+        } else if ( v->p != p ) {
+            fprintf(warn, "FATAL: continue on p%d, but d%dv%d p%d!\n",
+                    p->pid, v->d->did, v->vid,
+                    v->p ? v->p->pid : -1);
+            error(ERR_FILE, NULL);
         }
+
+        runstate_update(v, RUNSTATE_RUNNING, ri->tsc);
     }
     else if ( sevt.old_runstate == RUNSTATE_RUNNING
               || v->runstate.state == RUNSTATE_RUNNING )
@@ -7232,10 +7281,7 @@ update:
          *   # (should never happen)
          */
         if( sevt.old_runstate == RUNSTATE_RUNNING ) {
-            if( v->runstate.state == RUNSTATE_INIT ) {
-                /* Start-of-day; account first tsc -> now to v */
-                vcpu_start(p, v);
-            } else if( v->runstate.state != RUNSTATE_RUNNING
+            if( v->runstate.state != RUNSTATE_RUNNING
                        && v->runstate.state != RUNSTATE_LOST ) {
                 /* This should never happen. */
                 fprintf(warn, "FATAL: sevt.old_runstate running, but d%dv%d runstate %s!\n",
@@ -7264,11 +7310,8 @@ update:
 
         vcpu_next_update(p, v, ri->tsc);
     }
-    else if ( v->runstate.state != RUNSTATE_INIT )
+    else
     {
-        /* TSC skew at start-of-day is hard to deal with.  Don't
-         * bring a vcpu out of INIT until it's seen to be actually
-         * running somewhere. */
         runstate_update(v, sevt.new_runstate, ri->tsc);
     }
 
-- 
2.25.1




* [PATCH 2/5] xenalyze: Basic TRC_HVM_EMUL handling
  2023-04-28  8:08 [PATCH 1/5] xenalyze: Handle start-of-day ->RUNNING transitions George Dunlap
@ 2023-04-28  8:08 ` George Dunlap
  2023-04-28  8:08 ` [PATCH 3/5] credit: Limit load balancing to once per millisecond George Dunlap
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Wei Liu, Anthony PERARD

For now, mainly just do volume analysis and get rid of the warnings.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
CC: Wei Liu <wl@xen.org>
CC: Anthony PERARD <anthony.perard@cloud.com>
---
 tools/xentrace/xenalyze.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index ff9716cd12..f7f8943079 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -987,6 +987,7 @@ enum {
     HVM_VOL_VMENTRY,
     HVM_VOL_VMEXIT,
     HVM_VOL_HANDLER,
+    HVM_VOL_EMUL,
     HVM_VOL_MAX
 };
 
@@ -1013,6 +1014,7 @@ const char *hvm_vol_name[HVM_VOL_MAX] = {
     [HVM_VOL_VMENTRY]="vmentry",
     [HVM_VOL_VMEXIT] ="vmexit",
     [HVM_VOL_HANDLER]="handler",
+    [HVM_VOL_EMUL]="emul",
 };
 
 enum {
@@ -5275,15 +5277,18 @@ void hvm_process(struct pcpu_info *p)
     if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
         return;
 
-    if(ri->evt.sub == 2)
-    {
+    switch ( ri->evt.sub ) {
+    case 2: /* HVM_HANDLER */
         UPDATE_VOLUME(p, hvm[HVM_VOL_HANDLER], ri->size);
         hvm_handler_process(ri, h);
-    }
-    else
-    {
+        break;
+    case 4: /* HVM_EMUL */
+        UPDATE_VOLUME(p, hvm[HVM_VOL_EMUL], ri->size);
+        warn_once("WARNING: We don't yet analyze HVM_EMUL events.\n");
+        /* FIXME: Collect analysis on this */
+        break;
+    default:
         switch(ri->event) {
-            /* HVM */
         case TRC_HVM_VMEXIT:
         case TRC_HVM_VMEXIT64:
             UPDATE_VOLUME(p, hvm[HVM_VOL_VMEXIT], ri->size);
-- 
2.25.1




* [PATCH 3/5] credit: Limit load balancing to once per millisecond
  2023-04-28  8:08 [PATCH 1/5] xenalyze: Handle start-of-day ->RUNNING transitions George Dunlap
  2023-04-28  8:08 ` [PATCH 2/5] xenalyze: Basic TRC_HVM_EMUL handling George Dunlap
@ 2023-04-28  8:08 ` George Dunlap
  2023-04-28  8:10   ` George Dunlap
  2023-04-28  8:08 ` [PATCH 4/5] credit: Don't steal vcpus which have yielded George Dunlap
  2023-04-28  8:08 ` [PATCH 5/5] SUPPORT.md: Make all security support explicit George Dunlap
  3 siblings, 1 reply; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

The credit scheduler tries as hard as it can to ensure that it always
runs scheduling units with positive credit (PRI_TS_UNDER) before
running those with negative credit (PRI_TS_OVER).  If the next
runnable scheduling unit is of priority OVER, it will always run the
load balancer, which will scour the system looking for another
scheduling unit of the UNDER priority.

Unfortunately, as the number of cores on a system has grown, the cost
of the work-stealing algorithm has dramatically increased; a recent
trace on a system with 128 cores showed this taking over 50
microseconds.
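
Back-of-the-envelope (illustrative numbers, not from the trace): at
~50 microseconds per balance pass, a pcpu that hits the OVER path on
every scheduling decision, say every few hundred microseconds, can
burn a double-digit percentage of its time in the balancer; with a
1ms ratelimit the worst case drops to roughly 50us / 1000us = 5%.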

Add a parameter, load_balance_ratelimit, to limit the frequency of
load balance operations on a given pcpu.  Default this to 1
millisecond.
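
As an illustration (values are examples only; the option takes a
value in microseconds, per the patch below):

    # At most one load balance pass per 500us on each pcpu:
    load-balance-ratelimit=500

    # 0 effectively disables the limit, approximating the old behaviour:
    load-balance-ratelimit=0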

Invert the load balancing conditional to make it more clear, and line
up more closely with the comment above it.

Overall it might be cleaner to have the last_load_balance checking
happen inside csched_load_balance(), but that would require either
passing both now and spc into the function, or looking them up again;
both of which seemed to be worse than simply checking and setting the
values before calling it.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
 docs/misc/xen-command-line.pandoc |  6 +++++
 xen/common/sched/credit.c         | 40 ++++++++++++++++++++++++++-----
 xen/include/public/sysctl.h       |  6 +++++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e0b89b7d33..ae51a8cfa2 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1840,6 +1840,12 @@ By default, Xen will use the INVPCID instruction for TLB management if
 it is available.  This option can be used to cause Xen to fall back to
 older mechanisms, which are generally slower.
 
+### load-balance-ratelimit
+> `= <integer>`
+
+The minimum interval, in microseconds, between load balancing events on
+a given pcpu.
+At the moment only credit honors this parameter.
+
 ### noirqbalance (x86)
 > `= <boolean>`
 
diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index f2cd3d9da3..b8bdfd5f6a 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -50,6 +50,8 @@
 #define CSCHED_TICKS_PER_TSLICE     3
 /* Default timeslice: 30ms */
 #define CSCHED_DEFAULT_TSLICE_MS    30
+/* Default load balancing ratelimit: 1ms */
+#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
 #define CSCHED_CREDITS_PER_MSEC     10
 /* Never set a timer shorter than this value. */
 #define CSCHED_MIN_TIMER            XEN_SYSCTL_SCHED_RATELIMIT_MIN
@@ -153,6 +155,7 @@ struct csched_pcpu {
 
     unsigned int idle_bias;
     unsigned int nr_runnable;
+    s_time_t last_load_balance;
 
     unsigned int tick;
     struct timer ticker;
@@ -218,7 +221,7 @@ struct csched_private {
 
     /* Period of master and tick in milliseconds */
     unsigned int tick_period_us, ticks_per_tslice;
-    s_time_t ratelimit, tslice, unit_migr_delay;
+    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;
 
     struct list_head active_sdom;
     uint32_t weight;
@@ -612,6 +615,8 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
     spc->nr_runnable = 0;
+
+    spc->last_load_balance = NOW();
 }
 
 static void cf_check
@@ -1267,7 +1272,8 @@ csched_sys_cntl(const struct scheduler *ops,
                  && (params->ratelimit_us > XEN_SYSCTL_SCHED_RATELIMIT_MAX
                      || params->ratelimit_us < XEN_SYSCTL_SCHED_RATELIMIT_MIN))
              || MICROSECS(params->ratelimit_us) > MILLISECS(params->tslice_ms)
-             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US )
+             || params->vcpu_migr_delay_us > XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US
+             || params->load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US)
                 goto out;
 
         spin_lock_irqsave(&prv->lock, flags);
@@ -1278,6 +1284,7 @@ csched_sys_cntl(const struct scheduler *ops,
             printk(XENLOG_INFO "Disabling context switch rate limiting\n");
         prv->ratelimit = MICROSECS(params->ratelimit_us);
         prv->unit_migr_delay = MICROSECS(params->vcpu_migr_delay_us);
+        prv->load_balance_ratelimit = MICROSECS(params->load_balance_ratelimit_us);
         spin_unlock_irqrestore(&prv->lock, flags);
 
         /* FALLTHRU */
@@ -1285,6 +1292,7 @@ csched_sys_cntl(const struct scheduler *ops,
         params->tslice_ms = prv->tslice / MILLISECS(1);
         params->ratelimit_us = prv->ratelimit / MICROSECS(1);
         params->vcpu_migr_delay_us = prv->unit_migr_delay / MICROSECS(1);
+        params->load_balance_ratelimit_us = prv->load_balance_ratelimit / MICROSECS(1);
         rc = 0;
         break;
     }
@@ -1676,9 +1684,17 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
     return NULL;
 }
 
+/*
+ * Minimum delay, in microseconds, between load balance operations.
+ * This prevents spending too much time doing load balancing, particularly
+ * when the system has a high number of YIELDs due to spinlock priority inversion.
+ */
+static unsigned int __read_mostly load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
+
 static struct csched_unit *
 csched_load_balance(struct csched_private *prv, int cpu,
-    struct csched_unit *snext, bool *stolen)
+                    struct csched_unit *snext, bool *stolen)
 {
     const struct cpupool *c = get_sched_res(cpu)->cpupool;
     struct csched_unit *speer;
@@ -1963,10 +1979,12 @@ static void cf_check csched_schedule(
          * urgent work... If not, csched_load_balance() will return snext, but
          * already removed from the runq.
          */
-        if ( snext->pri > CSCHED_PRI_TS_OVER )
-            __runq_remove(snext);
-        else
+        if ( snext->pri <= CSCHED_PRI_TS_OVER
+             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
+            spc->last_load_balance = now;
             snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+        } else
+            __runq_remove(snext);
 
     } while ( !unit_runnable_state(snext->unit) );
 
@@ -2181,6 +2199,14 @@ csched_global_init(void)
                XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US, vcpu_migration_delay_us);
     }
 
+    if ( load_balance_ratelimit_us > XEN_SYSCTL_CSCHED_LB_RATE_MAX_US )
+    {
+        load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+        printk("WARNING: load-balance-ratelimit outside of valid range [0,%d]us.\n"
+               "Resetting to default: %u\n",
+               XEN_SYSCTL_CSCHED_LB_RATE_MAX_US, load_balance_ratelimit_us);
+    }
+
     return 0;
 }
 
@@ -2223,6 +2249,8 @@ csched_init(struct scheduler *ops)
 
     prv->unit_migr_delay = MICROSECS(vcpu_migration_delay_us);
 
+    prv->load_balance_ratelimit = MICROSECS(load_balance_ratelimit_us);
+
     return 0;
 }
 
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 2b24d6bfd0..192458d635 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -637,6 +637,12 @@ struct xen_sysctl_credit_schedule {
     */
 #define XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US (100 * 1000)
     uint32_t vcpu_migr_delay_us;
+    /*
+     * Minimum delay, in microseconds, between load balance
+     * operations; max 1 second.
+     */
+#define XEN_SYSCTL_CSCHED_LB_RATE_MAX_US (1000000)
+    uint32_t load_balance_ratelimit_us;
 };
 
 struct xen_sysctl_credit2_schedule {
-- 
2.25.1




* [PATCH 4/5] credit: Don't steal vcpus which have yielded
  2023-04-28  8:08 [PATCH 1/5] xenalyze: Handle start-of-day ->RUNNING transitions George Dunlap
  2023-04-28  8:08 ` [PATCH 2/5] xenalyze: Basic TRC_HVM_EMUL handling George Dunlap
  2023-04-28  8:08 ` [PATCH 3/5] credit: Limit load balancing to once per millisecond George Dunlap
@ 2023-04-28  8:08 ` George Dunlap
  2023-04-28  8:08 ` [PATCH 5/5] SUPPORT.md: Make all security support explicit George Dunlap
  3 siblings, 0 replies; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

On large systems with many vcpus yielding due to spinlock priority
inversion, it's not uncommon for a vcpu to yield its timeslice, only
to be immediately stolen by another pcpu looking for higher-priority
work.

To prevent this:

* Keep the YIELD flag until a vcpu is removed from a runqueue

* When looking for work to steal, skip vcpus which have yielded
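
Together these amount to the following in the work-stealing loop
(a sketch; the real hunks are below):

    /* In csched_runq_steal(): skip units which have yielded */
    if ( test_bit(CSCHED_FLAG_UNIT_YIELD, &speer->flags) )
        continue;

with CSCHED_FLAG_UNIT_YIELD now cleared in __runq_remove() rather
than at the top of csched_schedule(), so the flag survives until the
unit actually comes off a runqueue.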

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
 xen/common/sched/credit.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index b8bdfd5f6a..70a1a57ba6 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -319,6 +319,11 @@ __runq_remove(struct csched_unit *svc)
 {
     BUG_ON( !__unit_on_runq(svc) );
     list_del_init(&svc->runq_elem);
+
+    /*
+     * Clear YIELD flag when scheduling back in
+     */
+    clear_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags);
 }
 
 static inline void
@@ -1638,6 +1643,13 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         if ( speer->pri <= pri )
             break;
 
+        /*
+         * Don't steal a UNIT which has yielded; it's waiting for a
+         * reason
+         */
+        if (test_bit(CSCHED_FLAG_UNIT_YIELD, &speer->flags))
+            continue;
+
         /* Is this UNIT runnable on our PCPU? */
         unit = speer->unit;
         BUG_ON( is_idle_unit(unit) );
@@ -1955,11 +1967,6 @@ static void cf_check csched_schedule(
         dec_nr_runnable(sched_cpu);
     }
 
-    /*
-     * Clear YIELD flag before scheduling out
-     */
-    clear_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags);
-
     do {
         snext = __runq_elem(runq->next);
 
@@ -1974,10 +1981,11 @@ static void cf_check csched_schedule(
         /*
          * SMP Load balance:
          *
-         * If the next highest priority local runnable UNIT has already eaten
-         * through its credits, look on other PCPUs to see if we have more
-         * urgent work... If not, csched_load_balance() will return snext, but
-         * already removed from the runq.
+         * If the next highest priority local runnable UNIT has
+         * already eaten through its credits (and we're below the
+         * balancing ratelimit), look on other PCPUs to see if we have
+         * more urgent work... If we don't, csched_load_balance() will
+         * return snext, but already removed from the runq.
          */
         if ( snext->pri <= CSCHED_PRI_TS_OVER
              && now - spc->last_load_balance > prv->load_balance_ratelimit) {
-- 
2.25.1




* [PATCH 5/5] SUPPORT.md: Make all security support explicit
  2023-04-28  8:08 [PATCH 1/5] xenalyze: Handle start-of-day ->RUNNING transitions George Dunlap
                   ` (2 preceding siblings ...)
  2023-04-28  8:08 ` [PATCH 4/5] credit: Don't steal vcpus which have yielded George Dunlap
@ 2023-04-28  8:08 ` George Dunlap
  3 siblings, 0 replies; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:08 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Wei Liu, Andrew Cooper, Jan Beulich,
	Roger Pau Monne, Stefano Stabellini, Julien Grall

The initial goal of SUPPORT.md was to help both users, and the Xen
Project Security Team, determine what functionality was security
supported; i.e., what kinds of security bugs would trigger an XSA.

Our proposal is that as of 4.18, all functionality not explicitly
listed as security supported will be considered not security
supported.  Add some text to that effect.

The patch as written cannot be applied, since specifying "xl.cfg core
functionality" is a TODO; but it should do to start a discussion.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
CC: Wei Liu <wl@xen.org>
CC: Andrew Cooper <andrew.cooper@cloud.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Roger Pau Monne <roger.pau@cloud.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
---
 SUPPORT.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/SUPPORT.md b/SUPPORT.md
index aa1940e55f..fcbcb44c44 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -17,6 +17,36 @@ for the definitions of the support status levels etc.
 Release Notes
 : <a href="https://wiki.xenproject.org/wiki/Xen_Project_X.YY_Release_Notes">RN</a>
 
+# General security support
+
+An XSA will always be issued for security-related bugs which are
+present in a "plain vanilla" configuration.  A "plain vanilla"
+configuration is defined as follows:
+
+* The Xen hypervisor is built from a tagged release of Xen, or a
+  commit which was on the tip of one of the supported stable branches.
+
+* The Xen hypervisor was built with the default config for the platform
+
+* No Xen command-line parameters were specified
+
+* No parameters for Xen-related drivers in the Linux kernel were specified
+
+* No modifications were made to the default xl.conf
+
+* xl.cfg files use only core functionality
+
+* Alternate toolstacks only activate functionality activated by the
+  core functionality of xl.cfg files.
+
+Any system outside this configuration will only be considered security
+supported if the functionality is explicitly listed as supported in
+this document.
+
+If a security-related bug exists only in a configuration listed as not
+security supported, the security team will generally not issue an XSA;
+the bug will simply be handled in public.
+
 # Feature Support
 
 ## Kconfig
-- 
2.25.1




* Re: [PATCH 3/5] credit: Limit load balancing to once per millisecond
  2023-04-28  8:08 ` [PATCH 3/5] credit: Limit load balancing to once per millisecond George Dunlap
@ 2023-04-28  8:10   ` George Dunlap
  0 siblings, 0 replies; 6+ messages in thread
From: George Dunlap @ 2023-04-28  8:10 UTC (permalink / raw)
  To: xen-devel

Please ignore patches 1-4 of this series...

 -George

On Fri, Apr 28, 2023 at 9:08 AM George Dunlap <george.dunlap@cloud.com>
wrote:

> [Patch 3/5 quoted in full; trimmed here -- the text is identical to
> the message above.]

