xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [Xen-devel] [PATCH v2 0/2] xen: credit2: fix vcpu starvation due to too few credits
@ 2020-03-19  0:11 Dario Faggioli
  2020-03-19  0:11 ` [Xen-devel] [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle Dario Faggioli
  2020-03-19  0:12 ` [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times Dario Faggioli
  0 siblings, 2 replies; 7+ messages in thread
From: Dario Faggioli @ 2020-03-19  0:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Charles Arnold, Jan Beulich, Glen, George Dunlap,
	Tomas Mozes, Sarah Newman

Hello everyone,

Here's v2 of the series for fixing some starvation issues, under
Credit2, if as a consequence of some (still unclear) bug/circumstance,
vCPUs manage to run uninterrupted for long timeslices.

v1 is this:
https://lore.kernel.org/xen-devel/158402056376.753.7091379488590272336.stgit@Palanthas/

And the series is also available here:
https://gitlab.com/dfaggioli/xen/-/tree/sched/credit2/fix-credit2-vcpu-stall-v2
https://gitlab.com/dfaggioli/xen.git fix-credit2-vcpu-stall-v2

Difference between v1 and v2 is that I took a slighly different
approach, in patch 1. This is due to discussion happened within the v1
thread and on IRC (more below, and in patch 1 changelog).

Basically, there have been reports of a Credit2 issue due to which
vCPUs where being starved, to the point that guest kernel would complain
or even crash.

See the following xen-users and xen-devel threads:
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg01158.html

I did some investigations, and figured out that the vCPUs in question
are not scheduled for long time intervals because they somehow manage to
be given an amount of credits which is less than the credit the idle
vCPU has.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

This happens bacause --although very rarely-- vCPUs are allowed to
execute for much more than the scheduler would want them to.

For example, I have a trace showing that csched2_schedule() is invoked at
t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer is set to
fire at t=57979485083ns, i.e., 8738928ns in future. That's because credit
of snext is exactly that 8738928ns. Then, what I see is that the next
call to burn_credits(), coming from csched2_schedule() for the same vCPU
happens at t=60083283617ns. That is *a lot* (2103798534ns) later than
when we expected and asked. Of course, that also means that delta is
2112537462ns, and therefore credits will sink to -2103798534!

Also, to the best of my current knowledge, this does not look like
Credit2 related, as I've observed it when running with Credit1 as well.
I personally don't think it would be scheduling related, in general, but
I need to do more investigation to be sure about that (and/or to figure
out what the real root cause is).

The reason why Credit2 is affected much more than Credit1 is because of
how time accounting is done. Basically, there's very rudimental time
accounting in Credit1, which is a very bad thing, IMO, but indeed that
is also what prevented for this issue to cause severe stalls.

One more thing is that Credit2 gives -2^30 credits to the idle vCPU, which
was considered to be low enough, which is true. But it's not a robust
choice, should an issue like the one we're discussing occur, which is
happening. :-)

We can make things better by establishing a minimum value for the amount
of credits any vCPU will ever be able to reach. In fact, as soon as we
realize that the vCPU that should run next has a negative amount of
credits --no matter whether -1 or -2^30-- we do a credit reset. Which
means that we can just limit this "negative credits peak" during
accounting anyway. Then, for increased robustness, we use an even
smaller value for the credits we give to the idle vCPUs, so that we're
sure that we will never pick idle instead of an actual vCPU that is
ready to run.

This is what is done in the first patch of this series. This is a
robustness improvement and a fix (or at least the best way we can deal
with the it within the scheduler) for the issue at hand. It therefore
should be backported.

While looking into this, I also have found out that there is an actual
bug in Credit2 code. It is something I introduced myself with commit
5e4b4199667b9 ("xen: credit2: only reset credit on reset condition").
In fact, while it was and still is a good idea to avoid resetting
credits too often, the implementation of this was just wrong.

A fix for this bug is what is contained in patch 2. And it also should
be backported.

Note that patch 2 alone was also already mitigating the stall/starvation
issue quite substantially. Nevertheless, the proper fix for the issue
itself is making Credit2 more robust against similar problem, as done in
patch 1, while this other bug just happens to be something which
interact with the sympthoms.

This to say that, although both patches will be bugported, asboth are
actual bugfixes, if there is the need to apply something "in emergency"
to fix the starvation problem, applying only patch 1 is enough.

Thanks and Regards
---
Dario Faggioli (2):
      xen: credit2: avoid vCPUs to ever reach lower credits than idle
      xen: credit2: fix credit reset happening too few times

 tools/xentrace/formats     |    4 +-
 tools/xentrace/xenalyze.c  |   13 +++----
 xen/common/sched/credit2.c |   83 ++++++++++++++++++++++----------------------
 3 files changed, 48 insertions(+), 52 deletions(-)
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Xen-devel] [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle
  2020-03-19  0:11 [Xen-devel] [PATCH v2 0/2] xen: credit2: fix vcpu starvation due to too few credits Dario Faggioli
@ 2020-03-19  0:11 ` Dario Faggioli
  2020-04-02 18:21   ` George Dunlap
  2020-03-19  0:12 ` [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times Dario Faggioli
  1 sibling, 1 reply; 7+ messages in thread
From: Dario Faggioli @ 2020-03-19  0:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Charles Arnold, Tomas Mozes, Glen, George Dunlap,
	Jan Beulich, Sarah Newman

There have been report of stalls of guest vCPUs, when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for very long
time, even under light load conditions (e.g., during dom0 boot).

Investigations led to the discovery that --although rarely-- it can
happen that a vCPU manages to run for very long timeslices. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having less
credits than the idle vCPUs (-2^30). At this point, the scheduler will
pick the idle vCPU, instead of the ready to run vCPU, for a few
"epochs", which often times is enough for the guest kernel to think the
vCPU is not responding and crashing.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

We certainly don't want, under any circumstance, this to happen.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it will never get to have less than such minimum amount of
credits. Then, we set the credits of the idle vCPU to an even
smaller value.

NOTE: investigations have been done about _how_ it is possible for a
vCPU to execute for so much time that its credits becomes so low. While
still not completely clear, there are evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 (e.g., as it happens when
  running with Credit1 as well) issue, or a scheduler issue.

This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).

Reported-by: Glen <glenbarney@gmail.com>
Reported-by: Tomas Mozes <hydrapolic@gmail.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Charles Arnold <carnold@suse.com>
Cc: Sarah Newman <srn@prgmr.com>
---
Changes from v1:
- different approach. Instead than using INT_MIN for idle vCPUs'
  credits, limit the minimum number of credits regular vCPUs can have.
---
I will provide the backports myself, at least for 4.13 and 4.12.x (and
feel free to ask for more).
---
For Sarah, looking back at the various threads, I am not quite sure
whether you also experienced the issue and reported it. If yes, I'm
happy to add a "Reported-by:" line about you too (or, if this is fine to
go in, for this to be done while committing, if possible).
---
 tools/xentrace/formats     |    2 +-
 tools/xentrace/xenalyze.c  |    5 ++--
 xen/common/sched/credit2.c |   53 +++++++++++++++++++++++---------------------
 3 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index d6e7e3f800..8f126f65f1 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -55,7 +55,7 @@
 0x00022204  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_add
 0x00022205  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_check   [ dom:vcpu = 0x%(1)08x, credit = %(2)d, score = %(3)d ]
 0x00022206  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle         [ cpu = %(1)d ]
-0x00022207  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_reset   [ dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d, mult = %(4)d ]
+0x00022207  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:credit_reset   [ dom:vcpu = 0x%(1)08x, cr_start = %(2)d, cr_end = %(3)d ]
 0x00022208  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:sched_tasklet
 0x00022209  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:update_load
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index aa894673ad..d3c8368e9d 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7718,13 +7718,12 @@ void sched_process(struct pcpu_info *p)
                 struct {
                     unsigned int vcpuid:16, domid:16;
                     int credit_start, credit_end;
-                    unsigned int multiplier;
                 } *r = (typeof(r))ri->d;
 
                 printf(" %s csched2:reset_credits d%uv%u, "
-                       "credit_start = %d, credit_end = %d, mult = %u\n",
+                       "credit_start = %d, credit_end = %d\n",
                        ri->dump_header, r->domid, r->vcpuid,
-                       r->credit_start, r->credit_end, r->multiplier);
+                       r->credit_start, r->credit_end);
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 8):  /* SCHED_TASKLET    */
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index c7241944a8..cbf9ce2b97 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -229,12 +229,22 @@
  * before a reset event.
  */
 #define CSCHED2_CREDIT_INIT          MILLISECS(10)
+/*
+ * Minimum amount of credits VMs can have. Ideally, no VM would get
+ * close to this (unless a vCPU manages to execute for really long
+ * time uninterrupted). In case it happens, it makes no sense to
+ * track even deeper undershoots.
+ *
+ * NOTE: If making this smaller than -CSCHED2_CREDIT_INIT, adjust
+ * reset_credit() accordingly.
+ */
+#define CSCHED2_CREDIT_MIN           (-CSCHED2_CREDIT_INIT)
 /*
  * Amount of credit the idle units have. It never changes, as idle
  * units does not consume credits, and it must be lower than whatever
  * amount of credit 'regular' unit would end up with.
  */
-#define CSCHED2_IDLE_CREDIT          (-(1U<<30))
+#define CSCHED2_IDLE_CREDIT          (CSCHED2_CREDIT_MIN-1)
 /*
  * Carryover: How much "extra" credit may be carried over after
  * a reset.
@@ -781,10 +791,15 @@ static int get_fallback_cpu(struct csched2_unit *svc)
 static void t2c_update(const struct csched2_runqueue_data *rqd, s_time_t time,
                           struct csched2_unit *svc)
 {
-    uint64_t val = time * rqd->max_weight + svc->residual;
+    int64_t val = time * rqd->max_weight + svc->residual;
 
     svc->residual = do_div(val, svc->weight);
-    svc->credit -= val;
+    /* Getting to lower credit than CSCHED2_CREDIT_MIN makes no sense. */
+    val = svc->credit - val;
+    if ( unlikely(val < CSCHED2_CREDIT_MIN) )
+        svc->credit = CSCHED2_CREDIT_MIN;
+    else
+        svc->credit = val;
 }
 
 static s_time_t c2t(const struct csched2_runqueue_data *rqd, s_time_t credit,
@@ -1616,28 +1631,25 @@ static void reset_credit(int cpu, s_time_t now, struct csched2_unit *snext)
 {
     struct csched2_runqueue_data *rqd = c2rqd(cpu);
     struct list_head *iter;
-    int m;
+    int reset = CSCHED2_CREDIT_INIT;
 
     /*
      * Under normal circumstances, snext->credit should never be less
      * than -CSCHED2_MIN_TIMER.  However, under some circumstances, an
      * unit with low credits may be allowed to run long enough that
-     * its credits are actually less than -CSCHED2_CREDIT_INIT.
+     * its credits are actually much lower than that.
      * (Instances have been observed, for example, where an unit with
      * 200us of credit was allowed to run for 11ms, giving it -10.8ms
      * of credit.  Thus it was still negative even after the reset.)
      *
      * If this is the case for snext, we simply want to keep moving
-     * everyone up until it is in the black again.  This fair because
-     * none of the other units want to run at the moment.
-     *
-     * Rather than looping, however, we just calculate a multiplier,
-     * avoiding an integer division and multiplication in the common
-     * case.
+     * everyone up until it is in the black again. This means that,
+     * since CSCHED2_CREDIT_MIN is -CSCHED2_CREDIT_INIT, we need to
+     * actually add 2*CSCHED2_CREDIT_INIT.
      */
-    m = 1;
-    if ( snext->credit < -CSCHED2_CREDIT_INIT )
-        m += (-snext->credit) / CSCHED2_CREDIT_INIT;
+    ASSERT(snext->credit >= CSCHED2_CREDIT_MIN);
+    if ( unlikely(snext->credit == CSCHED2_CREDIT_MIN) )
+        reset += CSCHED2_CREDIT_INIT;
 
     list_for_each( iter, &rqd->svc )
     {
@@ -1668,15 +1680,7 @@ static void reset_credit(int cpu, s_time_t now, struct csched2_unit *snext)
         }
 
         start_credit = svc->credit;
-
-        /*
-         * Add INIT * m, avoiding integer multiplication in the common case.
-         */
-        if ( likely(m==1) )
-            svc->credit += CSCHED2_CREDIT_INIT;
-        else
-            svc->credit += m * CSCHED2_CREDIT_INIT;
-
+        svc->credit += reset;
         /* "Clip" credits to max carryover */
         if ( svc->credit > CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX )
             svc->credit = CSCHED2_CREDIT_INIT + CSCHED2_CARRYOVER_MAX;
@@ -1688,19 +1692,18 @@ static void reset_credit(int cpu, s_time_t now, struct csched2_unit *snext)
             struct {
                 unsigned unit:16, dom:16;
                 int credit_start, credit_end;
-                unsigned multiplier;
             } d;
             d.dom = svc->unit->domain->domain_id;
             d.unit = svc->unit->unit_id;
             d.credit_start = start_credit;
             d.credit_end = svc->credit;
-            d.multiplier = m;
             __trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
                         sizeof(d),
                         (unsigned char *)&d);
         }
     }
 
+    ASSERT(snext->credit > 0);
     SCHED_STAT_CRANK(credit_reset);
 
     /* No need to resort runqueue, as everyone's order should be the same. */


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times
  2020-03-19  0:11 [Xen-devel] [PATCH v2 0/2] xen: credit2: fix vcpu starvation due to too few credits Dario Faggioli
  2020-03-19  0:11 ` [Xen-devel] [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle Dario Faggioli
@ 2020-03-19  0:12 ` Dario Faggioli
  2020-04-02 18:32   ` George Dunlap
  2020-04-06 11:53   ` [Xen-devel] " Jan Beulich
  1 sibling, 2 replies; 7+ messages in thread
From: Dario Faggioli @ 2020-03-19  0:12 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Charles Arnold, Jan Beulich, Glen, George Dunlap,
	Tomas Mozes, Sarah Newman

There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
credit on reset condition"). In fact, the aim of that commit was to
make sure that we do not perform too many credit reset operations
(which are not super cheap, and in an hot-path). But the check used
to determine whether a reset is necessary was the wrong one.

In fact, knowing just that some vCPUs have been skipped, while
traversing the runqueue (in runq_candidate()), is not enough. We
need to check explicitly whether the first vCPU in the runqueue
has a negative amount of credit.

Since a trace record is changed, this patch updates xentrace format file
and xenalyze as well

This should be backported.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Charles Arnold <carnold@suse.com>
Cc: Glen <glenbarney@gmail.com>
Cc: Tomas Mozes <hydrapolic@gmail.com>
Cc: Sarah Newman <srn@prgmr.com>
---
About the Credit2 stall issue reported recently, and mentioned in patch
1 of this series. This second patch, alone, was already mitigating the
issue quite substantially.

Still, the proper fix for the issue itself is patch 1, while this is a
fix for a bug in the code, introduced with a previous change, which
happens to help to cure the sympthoms of the problem at hand.
---
 tools/xentrace/formats     |    2 +-
 tools/xentrace/xenalyze.c  |    8 +++-----
 xen/common/sched/credit2.c |   30 +++++++++++++-----------------
 3 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8f126f65f1..deac4d8598 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -67,7 +67,7 @@
 0x00022210  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_check     [ lrq_id[16]:orq_id[16] = 0x%(1)08x, delta = %(2)d ]
 0x00022211  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_balance   [ l_bavgload = 0x%(2)08x%(1)08x, o_bavgload = 0x%(4)08x%(3)08x, lrq_id[16]:orq_id[16] = 0x%(5)08x ]
 0x00022212  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:pick_cpu       [ b_avgload = 0x%(2)08x%(1)08x, dom:vcpu = 0x%(3)08x, rq_id[16]:new_cpu[16] = %(4)d ]
-0x00022213  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(4)d, skipped_vcpus = %(3)d, tickled_cpu = %(2)d ]
+0x00022213  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_candidate [ dom:vcpu = 0x%(1)08x, credit = %(3)d, tickled_cpu = %(2)d ]
 0x00022214  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:schedule       [ rq:cpu = 0x%(1)08x, tasklet[8]:idle[8]:smt_idle[8]:tickled[8] = %(2)08x ]
 0x00022215  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:ratelimit      [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
 0x00022216  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_cand_chk  [ dom:vcpu = 0x%(1)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index d3c8368e9d..b7f4e2bea8 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7855,14 +7855,12 @@ void sched_process(struct pcpu_info *p)
             if (opt.dump_all) {
                 struct {
                     unsigned vcpuid:16, domid:16;
-                    unsigned tickled_cpu, skipped;
+                    unsigned tickled_cpu;
                     int credit;
                 } *r = (typeof(r))ri->d;
 
-                printf(" %s csched2:runq_candidate d%uv%u, credit = %d, "
-                       "%u vcpus skipped, ",
-                       ri->dump_header, r->domid, r->vcpuid,
-                       r->credit, r->skipped);
+                printf(" %s csched2:runq_candidate d%uv%u, credit = %d, ",
+                       ri->dump_header, r->domid, r->vcpuid, r->credit);
                 if (r->tickled_cpu == (unsigned)-1)
                     printf("no cpu was tickled\n");
                 else
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index cbf9ce2b97..34f05c3e2a 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -3224,8 +3224,7 @@ csched2_runtime(const struct scheduler *ops, int cpu,
 static struct csched2_unit *
 runq_candidate(struct csched2_runqueue_data *rqd,
                struct csched2_unit *scurr,
-               int cpu, s_time_t now,
-               unsigned int *skipped)
+               int cpu, s_time_t now)
 {
     struct list_head *iter, *temp;
     const struct sched_resource *sr = get_sched_res(cpu);
@@ -3233,8 +3232,6 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     struct csched2_private *prv = csched2_priv(sr->scheduler);
     bool yield = false, soft_aff_preempt = false;
 
-    *skipped = 0;
-
     if ( unlikely(is_idle_unit(scurr->unit)) )
     {
         snext = scurr;
@@ -3328,12 +3325,9 @@ runq_candidate(struct csched2_runqueue_data *rqd,
                         (unsigned char *)&d);
         }
 
-        /* Only consider units that are allowed to run on this processor. */
+        /* Only consider vcpus that are allowed to run on this processor. */
         if ( !cpumask_test_cpu(cpu, svc->unit->cpu_hard_affinity) )
-        {
-            (*skipped)++;
             continue;
-        }
 
         /*
          * If an unit is meant to be picked up by another processor, and such
@@ -3342,7 +3336,6 @@ runq_candidate(struct csched2_runqueue_data *rqd,
         if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
              cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) )
         {
-            (*skipped)++;
             SCHED_STAT_CRANK(deferred_to_tickled_cpu);
             continue;
         }
@@ -3354,7 +3347,6 @@ runq_candidate(struct csched2_runqueue_data *rqd,
         if ( sched_unit_master(svc->unit) != cpu
              && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit )
         {
-            (*skipped)++;
             SCHED_STAT_CRANK(migrate_resisted);
             continue;
         }
@@ -3378,14 +3370,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     {
         struct {
             unsigned unit:16, dom:16;
-            unsigned tickled_cpu, skipped;
+            unsigned tickled_cpu;
             int credit;
         } d;
         d.dom = snext->unit->domain->domain_id;
         d.unit = snext->unit->unit_id;
         d.credit = snext->credit;
         d.tickled_cpu = snext->tickled_cpu;
-        d.skipped = *skipped;
         __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1,
                     sizeof(d),
                     (unsigned char *)&d);
@@ -3417,7 +3408,6 @@ static void csched2_schedule(
     struct csched2_runqueue_data *rqd;
     struct csched2_unit * const scurr = csched2_unit(currunit);
     struct csched2_unit *snext = NULL;
-    unsigned int skipped_units = 0;
     bool tickled;
     bool migrated = false;
 
@@ -3495,7 +3485,7 @@ static void csched2_schedule(
         snext = csched2_unit(sched_idle_unit(sched_cpu));
     }
     else
-        snext = runq_candidate(rqd, scurr, sched_cpu, now, &skipped_units);
+        snext = runq_candidate(rqd, scurr, sched_cpu, now);
 
     /* If switching from a non-idle runnable unit, put it
      * back on the runqueue. */
@@ -3507,6 +3497,8 @@ static void csched2_schedule(
     /* Accounting for non-idle tasks */
     if ( !is_idle_unit(snext->unit) )
     {
+        int top_credit;
+
         /* If switching, remove this from the runqueue and mark it scheduled */
         if ( snext != scurr )
         {
@@ -3534,11 +3526,15 @@ static void csched2_schedule(
          *  2) no other unit with higher credits wants to run.
          *
          * Here, where we want to check for reset, we need to make sure the
-         * proper unit is being used. In fact, runqueue_candidate() may have
-         * not returned the first unit in the runqueue, for various reasons
+         * proper unit is being used. In fact, runq_candidate() may have not
+         * returned the first unit in the runqueue, for various reasons
          * (e.g., affinity). Only trigger a reset when it does.
          */
-        if ( skipped_units == 0 && snext->credit <= CSCHED2_CREDIT_RESET )
+        if ( list_empty(&rqd->runq) )
+            top_credit = snext->credit;
+        else
+            top_credit = max(snext->credit, runq_elem(rqd->runq.next)->credit);
+        if ( top_credit <= CSCHED2_CREDIT_RESET )
         {
             reset_credit(sched_cpu, now, snext);
             balance_load(ops, sched_cpu, now);


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle
  2020-03-19  0:11 ` [Xen-devel] [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle Dario Faggioli
@ 2020-04-02 18:21   ` George Dunlap
  0 siblings, 0 replies; 7+ messages in thread
From: George Dunlap @ 2020-04-02 18:21 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, Charles Arnold, Tomas Mozes, Glen, Jan Beulich,
	Sarah Newman, xen-devel



> On Mar 19, 2020, at 12:11 AM, Dario Faggioli <dfaggioli@suse.com> wrote:
> 
> There have been report of stalls of guest vCPUs, when Credit2 was used.
> It seemed like these vCPUs were not getting scheduled for very long
> time, even under light load conditions (e.g., during dom0 boot).
> 
> Investigations led to the discovery that --although rarely-- it can
> happen that a vCPU manages to run for very long timeslices. In Credit2,
> this means that, when runtime accounting happens, the vCPU will lose a
> large quantity of credits. This in turn may lead to the vCPU having less
> credits than the idle vCPUs (-2^30). At this point, the scheduler will
> pick the idle vCPU, instead of the ready to run vCPU, for a few
> "epochs", which often times is enough for the guest kernel to think the
> vCPU is not responding and crashing.
> 
> An example of this situation is shown here. In fact, we can see d0v1
> sitting in the runqueue while all the CPUs are idle, as it has
> -1254238270 credits, which is smaller than -2^30 = −1073741824:
> 
>    (XEN) Runqueue 0:
>    (XEN)   ncpus              = 28
>    (XEN)   cpus               = 0-27
>    (XEN)   max_weight         = 256
>    (XEN)   pick_bias          = 22
>    (XEN)   instload           = 1
>    (XEN)   aveload            = 293391 (~111%)
>    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
>    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>    [...]
>    (XEN) Runqueue 0:
>    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
>    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
>    [...]
>    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
>    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
>    (XEN) RUNQ:
>    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)
> 
> We certainly don't want, under any circumstance, this to happen.
> Let's, therefore, define a minimum amount of credits a vCPU can have.
> During accounting, we make sure that, for however long the vCPU has
> run, it will never get to have less than such minimum amount of
> credits. Then, we set the credits of the idle vCPU to an even
> smaller value.
> 
> NOTE: investigations have been done about _how_ it is possible for a
> vCPU to execute for so much time that its credits becomes so low. While
> still not completely clear, there are evidence that:
> - it only happens very rarely,
> - it appears to be both machine and workload specific,
> - it does not look to be a Credit2 (e.g., as it happens when
>  running with Credit1 as well) issue, or a scheduler issue.
> 
> This patch makes Credit2 more robust to events like this, whatever
> the cause is, and should hence be backported (as far as possible).
> 
> Reported-by: Glen <glenbarney@gmail.com>
> Reported-by: Tomas Mozes <hydrapolic@gmail.com>
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times
  2020-03-19  0:12 ` [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times Dario Faggioli
@ 2020-04-02 18:32   ` George Dunlap
  2020-04-06  8:40     ` Dario Faggioli
  2020-04-06 11:53   ` [Xen-devel] " Jan Beulich
  1 sibling, 1 reply; 7+ messages in thread
From: George Dunlap @ 2020-04-02 18:32 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, Charles Arnold, Jan Beulich, Glen, Tomas Mozes,
	Sarah Newman, xen-devel



> On Mar 19, 2020, at 12:12 AM, Dario Faggioli <dfaggioli@suse.com> wrote:
> 
> There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
> credit on reset condition"). In fact, the aim of that commit was to
> make sure that we do not perform too many credit reset operations
> (which are not super cheap, and in an hot-path). But the check used
> to determine whether a reset is necessary was the wrong one.
> 
> In fact, knowing just that some vCPUs have been skipped, while
> traversing the runqueue (in runq_candidate()), is not enough. We
> need to check explicitly whether the first vCPU in the runqueue
> has a negative amount of credit.

Oh, so if the top of the runqueue has negative credit, but it’s not chosen, then the one we *do* run has even lower credit.  Still not quite sure how that leads to a situation where credit resets don’t happen for long periods of time.  But anyway...

> 
> Since a trace record is changed, this patch updates xentrace format file
> and xenalyze as well
> 
> This should be backported.
> 
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times
  2020-04-02 18:32   ` George Dunlap
@ 2020-04-06  8:40     ` Dario Faggioli
  0 siblings, 0 replies; 7+ messages in thread
From: Dario Faggioli @ 2020-04-06  8:40 UTC (permalink / raw)
  To: George Dunlap
  Cc: Juergen Gross, Charles Arnold, Jan Beulich, Glen, Tomas Mozes,
	Sarah Newman, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

On Thu, 2020-04-02 at 18:32 +0000, George Dunlap wrote:
> > On Mar 19, 2020, at 12:12 AM, Dario Faggioli <dfaggioli@suse.com>
> > wrote:
> > 
> > There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
> > credit on reset condition"). In fact, the aim of that commit was to
> > make sure that we do not perform too many credit reset operations
> > (which are not super cheap, and in an hot-path). But the check used
> > to determine whether a reset is necessary was the wrong one.
> > 
> > In fact, knowing just that some vCPUs have been skipped, while
> > traversing the runqueue (in runq_candidate()), is not enough. We
> > need to check explicitly whether the first vCPU in the runqueue
> > has a negative amount of credit.
> 
> Oh, so if the top of the runqueue has negative credit, but it’s not
> chosen, then the one we *do* run has even lower credit.  Still not
> quite sure how that leads to a situation where credit resets don’t
> happen for long periods of time.  But anyway...
> 
Fact is, without this patch, we wouldn't call reset_credit() if we
"skipped" a vcpu/unit.

That means we skip all the times we did not pick the head of the
runqueue, even if the one we picked also have negative credits (as the
check for 'skipped_units == 0' was the first condition of the '&&').

That's what was making us skip resets. :-)

> Acked-by: George Dunlap <george.dunlap@citrix.com>
> 
Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times
  2020-03-19  0:12 ` [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times Dario Faggioli
  2020-04-02 18:32   ` George Dunlap
@ 2020-04-06 11:53   ` Jan Beulich
  1 sibling, 0 replies; 7+ messages in thread
From: Jan Beulich @ 2020-04-06 11:53 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, Charles Arnold, Glen, George Dunlap, Tomas Mozes,
	Sarah Newman, xen-devel

On 19.03.2020 01:12, Dario Faggioli wrote:
> @@ -3328,12 +3325,9 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>                          (unsigned char *)&d);
>          }
>  
> -        /* Only consider units that are allowed to run on this processor. */
> +        /* Only consider vcpus that are allowed to run on this processor. */
>          if ( !cpumask_test_cpu(cpu, svc->unit->cpu_hard_affinity) )

While backporting this to 4.12 I noticed that this is a presumably
unintended comment adjustment.

Jan


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-04-06 11:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-19  0:11 [Xen-devel] [PATCH v2 0/2] xen: credit2: fix vcpu starvation due to too few credits Dario Faggioli
2020-03-19  0:11 ` [Xen-devel] [PATCH v2 1/2] xen: credit2: avoid vCPUs to ever reach lower credits than idle Dario Faggioli
2020-04-02 18:21   ` George Dunlap
2020-03-19  0:12 ` [Xen-devel] [PATCH v2 2/2] xen: credit2: fix credit reset happening too few times Dario Faggioli
2020-04-02 18:32   ` George Dunlap
2020-04-06  8:40     ` Dario Faggioli
2020-04-06 11:53   ` [Xen-devel] " Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).