* [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
@ 2017-03-02 10:37 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu

Hello,

This series introduces some optimizations and performance improvements for
Credit1 (which show up in certain specific situations), and lightly touches
Credit2 as well.

The core of the series is patches 3 and 4, which aim at both redistributing and
reducing spinlock contention during load balancing. In fact, Credit1 load
balancing is based on "work stealing": when a pCPU is about to go idle, it looks
around inside other pCPUs' runqueues, to see if there are vCPUs waiting to run,
and steals the first one it finds.

This scan of the pCPUs happens NUMA-node-wise, and always starts from the first
pCPU of each node. That may lead to higher scheduler lock pressure on lower-ID
pCPUs (of each node), as well as to work being stolen from them more frequently.
This is what patch 4 fixes. It is not necessarily expected to improve
performance per se, although a fairer lock pressure is likely to bring benefits.
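
Just to illustrate the idea, here is a minimal, standalone sketch (the names
mirror the actual code where possible, but this is not the real implementation:
patch 4 below uses Xen's cpumask_cycle() on a per-node prv->balance_bias):

    /*
     * Start the steal scan just after the pCPU we last stole from, instead
     * of always starting from the node's first pCPU, so that probing (and
     * runqueue lock pressure) spreads evenly across the node.
     */
    #define NODE_CPUS 8

    static unsigned int balance_bias;             /* pCPU we stole from last */

    /* Stand-in for probing one pCPU's runqueue (csched_runq_steal()). */
    static int try_steal(unsigned int cpu) { return 0; }

    int steal_from_node(void)
    {
        unsigned int i, cpu;

        for ( i = 1; i <= NODE_CPUS; i++ )
        {
            cpu = (balance_bias + i) % NODE_CPUS;  /* rotate the start point */
            if ( try_steal(cpu) )
            {
                balance_bias = cpu;       /* next scan starts after this one */
                return cpu;
            }
        }
        return -1;                        /* nothing to steal on this node */
    }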

Still about load balancing: when deciding whether or not to try to steal work
from a pCPU, we only consider the ones that are non-idle. But a pCPU which is
running a vCPU, and does not have any other vCPU waiting in its runqueue, is not
idle, and yet there is nothing we can steal from it. It's therefore possible
that we check a number of pCPUs, which includes at least trying to take their
runqueue locks, only to find out that there is no vCPU we can grab, and that we
need to keep checking other processors.
On a large system, in situations where the load (i.e., the number of runnable
and running vCPUs) is only _slightly_ higher than the number of pCPUs, this can
have a significant performance impact. A way of improving this situation is to
keep track not only of whether pCPUs are idle, but also of which ones have more
than one runnable vCPU, which basically means they have at least one vCPU ready
to be stolen by anyone that would otherwise go idle.
And this is exactly what patch 3 does.
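
Again as an illustration only, here is a minimal, standalone sketch of the
bookkeeping (a plain bitmask and simplified names are used here, while the
actual patch uses Xen's cpumask_t and the csched_pcpu/csched_private
structures; also, the real code counts the always-present idle vCPU too, so
its thresholds are 2 rather than 1):

    #define NR_CPUS 16

    static unsigned int nr_runnable[NR_CPUS];   /* runnable vCPUs per pCPU */
    static unsigned long overloaded;            /* one bit per pCPU */

    void runq_insert(unsigned int cpu)
    {
        /* More than one runnable vCPU: at least one of them can be stolen. */
        if ( ++nr_runnable[cpu] > 1 )
            overloaded |= 1UL << cpu;
    }

    void runq_remove(unsigned int cpu)
    {
        /* Back to one (or zero) runnable vCPU: nothing left to steal here. */
        if ( --nr_runnable[cpu] <= 1 )
            overloaded &= ~(1UL << cpu);
    }

The load balancer then only walks the pCPUs whose bit is set in the overloaded
mask, rather than all the non-idle ones.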

Finally, patch 6 does to Credit2 something similar to what patch 3 does to
Credit1, although the context is actually different. There are places in
Credit2 where we just want the scheduler to give us one pCPU from a certain
runqueue. We do that by means of cpumask_any(), which is great, but comes at a
price. As a matter of fact, we don't really care much which pCPU we get, as a
subsequent call to runq_tickle() will override that choice anyway. But, within
runq_tickle() itself, the chosen pCPU is at least used as a hint, so we don't
want to give up on it entirely and introduce biases (by, e.g., just using
cpumask_first()). We therefore use an approach similar to the one in patch 3,
i.e., we record which pCPU we picked last, and start from it the next time.
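
In code terms (this is essentially an excerpt of what patch 6 below does in
csched2_cpu_pick()), the change boils down to:

    /* Before: pick a (pseudo-)random pCPU from the mask. */
    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));

    /* After: cycle from the pCPU we picked last time, and remember it. */
    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
                            cpumask_scratch_cpu(cpu));
    prv->rqd[min_rqi].pick_bias = new_cpu;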

As said already, the performance benefits of this series are to be expected on
large systems, under rather specific load conditions. I've done some
benchmarking on a 16-CPU NUMA box that I have at hand.

I've run three experiments: a Xen compile ('MAKEXEN') inside a 16-vCPU guest;
two Xen compiles running concurrently inside two 16-vCPU VMs; and a Xen compile
and Iperf ('IPERF') running concurrently inside two 16-vCPU VMs.

Here are the results for Credit1. For MAKEXEN, lower is better, while for
IPERF, higher is better. The tables below show the average and standard
deviation over 10 runs.

    |CREDIT1                                                            |
    |-------------------------------------------------------------------|
    |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN     vm2: IPERF      |
    |baseline patched|baseline patched|baseline patched baseline patched|
    |----------------|----------------|---------------------------------|
avg | 18.154   17.906| 52.832   51.088| 29.306   28.936  15.840   18.580|
stdd|  0.580    0.059|  1.061    1.717|  0.757    0.296   4.264    2.492|

So, with this patch applied, Xen compiles a little bit faster, and Iperf
achieves higher throughput, which is great. :-D
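
(For reference, computed from the averages above, that is roughly a 1.4% and a
3.3% reduction in MAKEXEN time in the 1 VM and 2 VMs cases respectively, about
1.3% for MAKEXEN in the mixed case, and about a 17% increase in IPERF
throughput.)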

As far as Credit2 goes, here are the numbers:

    |CREDIT2                                                            |
    |-------------------------------------------------------------------|
    |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN     vm2: IPERF      |
    |baseline patched|baseline patched|baseline patched baseline patched|
    |----------------|----------------|---------------------------------|
avg | 18.062   17.894| 53.136   52.968| 32.754   32.880  18.160   19.240|
stdd|  0.331    0.205|  0.886    0.566|  0.787    0.548   1.910    1.842|

In this case, the expected impact of the series is smaller, and that in fact
matches what we get, with baseline and patched numbers very close. What I
wanted to verify is that I was not introducing regressions, and that seems to
be confirmed.
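
(Quantitatively, every baseline vs. patched difference above is within roughly
one standard deviation: e.g., MAKEXEN with 1 VM differs by 0.168 against
standard deviations of 0.331 and 0.205, and IPERF differs by 1.08 against
standard deviations of 1.910 and 1.842.)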

Thanks and Regards,
Dario
---
Dario Faggioli (6):
      xen: credit1: simplify csched_runq_steal() a little bit.
      xen: credit: (micro) optimize csched_runq_steal().
      xen: credit1: increase efficiency and scalability of load balancing.
      xen: credit1: treat pCPUs more evenly during balancing.
      xen/tools: tracing: add record for credit1 runqueue stealing.
      xen: credit2: avoid cpumask_any() in pick_cpu().

 tools/xentrace/formats       |    1
 tools/xentrace/xenalyze.c    |   11 ++
 xen/common/sched_credit.c    |  199 +++++++++++++++++++++++++++++-------------
 xen/common/sched_credit2.c   |   22 ++++-
 xen/include/xen/perfc_defn.h |    1
 5 files changed, 169 insertions(+), 65 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-03  9:35   ` anshul makkar
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such a pCPU to be idle, so we mark that
case as unlikely (and bail out early if it unfortunately is).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4649e64..63a8675 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1593,64 +1593,65 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
     struct vcpu *vc;
 
+    ASSERT(peer_pcpu != NULL);
+
     /*
      * Don't steal from an idle CPU's runq because it's about to
      * pick up work from it itself.
      */
-    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
+    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+        goto out;
+
+    list_for_each( iter, &peer_pcpu->runq )
     {
-        list_for_each( iter, &peer_pcpu->runq )
-        {
-            speer = __runq_elem(iter);
+        speer = __runq_elem(iter);
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+        /*
+         * If next available VCPU here is not of strictly higher
+         * priority than ours, this PCPU is useless to us.
+         */
+        if ( speer->pri <= pri )
+            break;
 
-            /* Is this VCPU runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+        /* Is this VCPU runnable on our PCPU? */
+        vc = speer->vcpu;
+        BUG_ON( is_idle_vcpu(vc) );
 
-            /*
-             * If the vcpu has no useful soft affinity, skip this vcpu.
-             * In fact, what we want is to check if we have any "soft-affine
-             * work" to steal, before starting to look at "hard-affine work".
-             *
-             * Notice that, if not even one vCPU on this runq has a useful
-             * soft affinity, we could have avoid considering this runq for
-             * a soft balancing step in the first place. This, for instance,
-             * can be implemented by taking note of on what runq there are
-             * vCPUs with useful soft affinities in some sort of bitmap
-             * or counter.
-             */
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
-                continue;
+        /*
+         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * In fact, what we want is to check if we have any "soft-affine
+         * work" to steal, before starting to look at "hard-affine work".
+         *
+         * Notice that, if not even one vCPU on this runq has a useful
+         * soft affinity, we could have avoid considering this runq for
+         * a soft balancing step in the first place. This, for instance,
+         * can be implemented by taking note of on what runq there are
+         * vCPUs with useful soft affinities in some sort of bitmap
+         * or counter.
+         */
+        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+            continue;
 
-            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
-            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
-            {
-                /* We got a candidate. Grab it! */
-                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
-                         vc->domain->domain_id, vc->vcpu_id);
-                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                SCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
-            }
+        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
+        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
+        {
+            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
+            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
+            SCHED_STAT_CRANK(migrate_queued);
+            WARN_ON(vc->is_urgent);
+            __runq_remove(speer);
+            vc->processor = cpu;
+            return speer;
         }
     }
-
+ out:
     SCHED_STAT_CRANK(steal_peer_idle);
     return NULL;
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-03  9:48   ` anshul makkar
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Checking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that as early as possible,
avoiding potentially useless, more complex checks
and cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a8675..2b13e99 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -708,12 +708,10 @@ static inline int
 __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
-     * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that prefers and/or is allowed to run
-     * on our CPU.
+     * Don't pick up work that's or hot on peer PCPU, or that can't (or
+     * would prefer not to) run on cpu.
      */
-    return !vc->is_running &&
-           !__csched_vcpu_is_cache_hot(vc) &&
+    return !__csched_vcpu_is_cache_hot(vc) &&
            cpumask_test_cpu(dest_cpu, mask);
 }
 
@@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         BUG_ON( is_idle_vcpu(vc) );
 
         /*
-         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * If the vcpu is still in peer_cpu's scheduling tail, or if it
+         * has no useful soft affinity, skip it.
+         *
          * In fact, what we want is to check if we have any "soft-affine
          * work" to steal, before starting to look at "hard-affine work".
          *
@@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * vCPUs with useful soft affinities in some sort of bitmap
          * or counter.
          */
-        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        if ( vc->is_running ||
+             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
             continue;
 
         csched_balance_cpumask(vc, balance_step, cpumask_scratch);


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper

During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce here the concept of
overloaded runqueues, and a cpumask where to record
what pCPUs are in such state.

An overloaded runqueue has at least 2 runnable vCPUs
(plus the idle one, which is always there). Typically,
this means 1 vCPU is running, and 1 is sitting in the
runqueue, and can hence be stolen.

Then, in csched_load_balance(), it is enough to go
over the overloaded pCPUs, rather than all non-idle
pCPUs, which is better.

signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
I'm Cc-ing Andy on this patch, because we've discussed once about doing
something like this upstream.
---
 xen/common/sched_credit.c |   56 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 2b13e99..529b6c7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -171,6 +171,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -221,6 +222,7 @@ struct csched_private {
     uint32_t ncpus;
     struct timer  master_ticker;
     unsigned int master;
+    cpumask_var_t overloaded;
     cpumask_var_t idlers;
     cpumask_var_t cpus;
     uint32_t weight;
@@ -263,7 +265,10 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -288,12 +293,37 @@ __runq_insert(struct csched_vcpu *svc)
     }
 
     list_add_tail(&svc->runq_elem, iter);
+
+    /*
+     * If there is more than just the idle vCPU and a "regular" vCPU runnable
+     * on the runqueue of this pCPU, mark it as overloaded (so other pCPU
+     * can come and pick up some work).
+     */
+    if ( ++spc->nr_runnable > 2 &&
+         !cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_set_cpu(cpu, prv->overloaded);
 }
 
 static inline void
 __runq_remove(struct csched_vcpu *svc)
 {
+    unsigned int cpu = svc->vcpu->processor;
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
+
     BUG_ON( !__vcpu_on_runq(svc) );
+
+    /*
+     * Mark the CPU as no longer overloaded when we drop to having only
+     * 1 vCPU in its runqueue. In fact, this means that just the idle
+     * vCPU and a "regular" vCPU are around.
+     */
+    if ( --spc->nr_runnable <= 2 &&
+         cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_clear_cpu(cpu, prv->overloaded);
+
+    ASSERT(spc->nr_runnable >= 1);
+
     list_del_init(&svc->runq_elem);
 }
 
@@ -590,6 +620,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 1;
 }
 
 static void
@@ -1704,8 +1735,8 @@ csched_load_balance(struct csched_private *prv, int cpu,
         peer_node = node;
         do
         {
-            /* Find out what the !idle are in this node */
-            cpumask_andnot(&workers, online, prv->idlers);
+            /* Select the pCPUs in this node that have work we can steal. */
+            cpumask_and(&workers, online, prv->overloaded);
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
@@ -1989,7 +2020,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
@@ -2027,7 +2059,7 @@ csched_dump(const struct scheduler *ops)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-#define idlers_buf keyhandler_scratch
+#define cpumask_buf keyhandler_scratch
 
     printk("info:\n"
            "\tncpus              = %u\n"
@@ -2055,8 +2087,10 @@ csched_dump(const struct scheduler *ops)
            prv->ticks_per_tslice,
            vcpu_migration_delay);
 
-    cpumask_scnprintf(idlers_buf, sizeof(idlers_buf), prv->idlers);
-    printk("idlers: %s\n", idlers_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->idlers);
+    printk("idlers: %s\n", cpumask_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->overloaded);
+    printk("overloaded: %s\n", cpumask_buf);
 
     printk("active vcpus:\n");
     loop = 0;
@@ -2079,7 +2113,7 @@ csched_dump(const struct scheduler *ops)
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
     }
-#undef idlers_buf
+#undef cpumask_buf
 
     spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -2093,8 +2127,11 @@ csched_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
     if ( !zalloc_cpumask_var(&prv->cpus) ||
-         !zalloc_cpumask_var(&prv->idlers) )
+         !zalloc_cpumask_var(&prv->idlers) ||
+         !zalloc_cpumask_var(&prv->overloaded) )
     {
+        free_cpumask_var(prv->overloaded);
+        free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->cpus);
         xfree(prv);
         return -ENOMEM;
@@ -2141,6 +2178,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        free_cpumask_var(prv->overloaded);
         xfree(prv);
     }
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower-ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, which pCPU we stole from last, and
start from it the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 529b6c7..bae29a7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -229,6 +229,7 @@ struct csched_private {
     uint32_t credit;
     int credit_balance;
     uint32_t runq_sort;
+    uint32_t *balance_bias;
     unsigned ratelimit_us;
     /* Period of master and tick in milliseconds */
     unsigned tslice_ms, tick_period_us, ticks_per_tslice;
@@ -548,6 +549,7 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
+    unsigned int node = cpu_to_node(cpu);
     unsigned long flags;
 
     /*
@@ -571,6 +573,12 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         prv->master = cpumask_first(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
     }
+    if ( prv->balance_bias[node] == cpu )
+    {
+        cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(node));
+        if ( !cpumask_empty(cpumask_scratch) )
+            prv->balance_bias[node] =  cpumask_first(cpumask_scratch);
+    }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
         kill_timer(&prv->master_ticker);
@@ -610,6 +618,10 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
                   NOW() + MILLISECS(prv->tslice_ms));
     }
 
+    cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(cpu_to_node(cpu)));
+    if ( cpumask_weight(cpumask_scratch) == 1 )
+        prv->balance_bias[cpu_to_node(cpu)] = cpu;
+
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
     set_timer(&spc->ticker, NOW() + MICROSECS(prv->tick_period_us) );
 
@@ -1696,7 +1708,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu, peer_node, bstep;
+    int peer_cpu, first_cpu, peer_node, bstep;
     int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
@@ -1740,9 +1752,10 @@ csched_load_balance(struct csched_private *prv, int cpu,
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
-            peer_cpu = cpumask_first(&workers);
-            if ( peer_cpu >= nr_cpu_ids )
+            first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
+            if ( first_cpu >= nr_cpu_ids )
                 goto next_node;
+            peer_cpu = first_cpu;
             do
             {
                 /*
@@ -1770,12 +1783,18 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( speer != NULL )
                 {
                     *stolen = 1;
+                    /*
+                     * Next time we'll look for work to steal on this node, we
+                     * will start from the next pCPU, with respect to this one,
+                     * so we don't risk stealing always from the same ones.
+                     */
+                    prv->balance_bias[peer_node] = peer_cpu;
                     return speer;
                 }
 
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
-            } while( peer_cpu != cpumask_first(&workers) );
+            } while( peer_cpu != first_cpu );
 
  next_node:
             peer_node = cycle_node(peer_node, node_online_map);
@@ -2126,6 +2145,14 @@ csched_init(struct scheduler *ops)
     prv = xzalloc(struct csched_private);
     if ( prv == NULL )
         return -ENOMEM;
+
+    prv->balance_bias = xzalloc_array(uint32_t, MAX_NUMNODES);
+    if ( prv->balance_bias == NULL )
+    {
+        xfree(prv);
+        return -ENOMEM;
+    }
+
     if ( !zalloc_cpumask_var(&prv->cpus) ||
          !zalloc_cpumask_var(&prv->idlers) ||
          !zalloc_cpumask_var(&prv->overloaded) )
@@ -2133,6 +2160,7 @@ csched_init(struct scheduler *ops)
         free_cpumask_var(prv->overloaded);
         free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->cpus);
+        xfree(prv->balance_bias);
         xfree(prv);
         return -ENOMEM;
     }
@@ -2179,6 +2207,7 @@ csched_deinit(struct scheduler *ops)
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->overloaded);
+        xfree(prv->balance_bias);
         xfree(prv);
     }
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Ian Jackson, George Dunlap

Including whether we actually tried stealing a vCPU from
a given pCPU, or skipped it because of lock contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/xentrace/formats       |    1 +
 tools/xentrace/xenalyze.c    |   11 +++++++++++
 xen/common/sched_credit.c    |    6 ++++++
 xen/include/xen/perfc_defn.h |    1 +
 4 files changed, 19 insertions(+)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index a055231..8b31780 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -47,6 +47,7 @@
 0x00022008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:unboost       [ dom:vcpu = 0x%(1)04x%(2)04x ]
 0x00022009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:schedule      [ cpu[16]:tasklet[8]:idle[8] = %(1)08x ]
 0x0002200A  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:ratelimit     [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
+0x0002200B  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:steal_check   [ peer_cpu = %(1)d, checked = %(2)d ]
 
 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 68ffcc2..fd8ddb9 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7651,6 +7651,17 @@ void sched_process(struct pcpu_info *p)
                        r->runtime / 1000, r->runtime % 1000);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int peer_cpu, check;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:load_balance %s %u\n",
+                       ri->dump_header, r->check ? "checking" : "skipping",
+                       r->peer_cpu);
+            }
+            break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index bae29a7..ed44d40 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -134,6 +134,7 @@
 #define TRC_CSCHED_BOOST_END     TRC_SCHED_CLASS_EVT(CSCHED, 8)
 #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
 #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
+#define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
 
 
 /*
@@ -1767,12 +1768,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  */
                 spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
 
+                SCHED_STAT_CRANK(steal_check);
+
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipped */ 0);
                     peer_cpu = cpumask_cycle(peer_cpu, &workers);
                     continue;
                 }
+                else
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
 
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 0d702f0..742d429 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -48,6 +48,7 @@ PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
 PERFCOUNTER(load_balance_over,      "csched: load_balance_over")
 PERFCOUNTER(load_balance_other,     "csched: load_balance_other")
+PERFCOUNTER(steal_check,            "csched: steal_check")
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu().
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
on achieving true randomization is pretty pointless.

Since the picked CPU, however, is still used as a hint
within runq_tickle(), we don't want to give up on it
entirely, so let's make sure we don't always return the
same CPU, or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from it
the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit2.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index af457c1..7b9e1a1 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -363,6 +363,7 @@ struct csched2_runqueue_data {
     struct list_head runq; /* Ordered list of runnable vms */
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
+    unsigned int pick_bias;/* Last CPU we picked. Start from it next time */
 
     cpumask_t idle,        /* Currently idle pcpus */
         smt_idle,          /* Fully idle-and-untickled cores (see below) */
@@ -1679,7 +1680,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         {
             cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->migrate_rqd->active);
-            new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+            new_cpu = cpumask_cycle(svc->migrate_rqd->pick_bias,
+                                    cpumask_scratch_cpu(cpu));
+            svc->migrate_rqd->pick_bias = new_cpu;
             goto out_up;
         }
         /* Fall-through to normal cpu pick */
@@ -1737,7 +1740,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                 &prv->rqd[min_rqi].active);
-    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
+                            cpumask_scratch_cpu(cpu));
+    prv->rqd[min_rqi].pick_bias = new_cpu;
     BUG_ON(new_cpu >= nr_cpu_ids);
 
  out_up:
@@ -1854,7 +1859,9 @@ static void migrate(const struct scheduler *ops,
                     cpupool_domain_cpumask(svc->vcpu->domain));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
-        svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
+        svc->vcpu->processor = cpumask_cycle(trqd->pick_bias,
+                                             cpumask_scratch_cpu(cpu));
+        trqd->pick_bias = svc->vcpu->processor;
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         _runq_assign(svc, trqd);
@@ -2821,13 +2828,15 @@ csched2_dump(const struct scheduler *ops)
         printk("Runqueue %d:\n"
                "\tncpus              = %u\n"
                "\tcpus               = %s\n"
-               "\tmax_weight         = %d\n"
+               "\tmax_weight         = %u\n"
+               "\tpick_bias          = %u\n"
                "\tinstload           = %d\n"
                "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
+               prv->rqd[i].pick_bias,
                prv->rqd[i].load,
                prv->rqd[i].avgload,
                fraction);
@@ -2930,6 +2939,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &prv->initialized);
     __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
+    if ( cpumask_weight(&rqd->active) == 1 )
+        rqd->pick_bias = cpu;
+
     return rqi;
 }
 
@@ -3042,6 +3054,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
+    else if ( rqd->pick_bias == cpu )
+        rqd->pick_bias = cpumask_first(&rqd->active);
 
     spin_unlock(&rqd->lock);
 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
@ 2017-03-02 10:58 ` Dario Faggioli
  2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:58 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu


[-- Attachment #1.1: Type: text/plain, Size: 1511 bytes --]

On Thu, 2017-03-02 at 11:37 +0100, Dario Faggioli wrote:
> ---
> Dario Faggioli (6):
>       xen: credit1: simplify csched_runq_steal() a little bit.
>       xen: credit: (micro) optimize csched_runq_steal().
>       xen: credit1: increase efficiency and scalability of load balancing.
>       xen: credit1: treat pCPUs more evenly during balancing.
>       xen/tools: tracing: add record for credit1 runqueue stealing.
>       xen: credit2: avoid cpumask_any() in pick_cpu().
> 
>  tools/xentrace/formats       |    1
>  tools/xentrace/xenalyze.c    |   11 ++
>  xen/common/sched_credit.c    |  199 +++++++++++++++++++++++++++++-------------
>  xen/common/sched_credit2.c   |   22 ++++-
>  xen/include/xen/perfc_defn.h |    1
>  5 files changed, 169 insertions(+), 65 deletions(-)
>
And there's a git branch available here:

 git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit1-credit2-optim-and-scalability
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit1-credit2-optim-and-scalability
 https://travis-ci.org/fdario/xen/builds/206774498

Sorry I forgot the links before.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 11:35     ` Dario Faggioli
  2017-04-06  7:37     ` Dario Faggioli
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Cooper @ 2017-03-02 11:06 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 2156 bytes --]

On 02/03/17 10:38, Dario Faggioli wrote:
> During load balancing, we check the non idle pCPUs to
> see if they have runnable but not running vCPUs that
> can be stolen by and set to run on currently idle pCPUs.
>
> If a pCPU has only one running (or runnable) vCPU,
> though, we don't want to steal it from there, and
> it's therefore pointless bothering with it
> (especially considering that bothering means trying
> to take its runqueue lock!).
>
> On large systems, when load is only slightly higher
> than the number of pCPUs (i.e., there are just a few
> more active vCPUs than the number of the pCPUs), this
> may mean that:
>  - we go through all the pCPUs,
>  - for each one, we (try to) take its runqueue locks,
>  - we figure out there's actually nothing to be stolen!
>
> To mitigate this, we introduce here the concept of
> overloaded runqueues, and a cpumask where to record
> what pCPUs are in such state.
>
> An overloaded runqueue has at least 2 runnable vCPUs
> (plus the idle one, which is always there). Typically,
> this means 1 vCPU is running, and 1 is sitting in the
> runqueue, and can hence be stolen.
>
> Then, in csched_load_balance(), it is enough to go
> over the overloaded pCPUs, rather than all non-idle
> pCPUs, which is better.
>
> signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Malcolm’s solution to this problem is
https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966 
This has been in 2 releases of XenServer now, and has a very visible
improvement for aggregate multi-queue multi-vm intrahost network
performance (although I can't find the numbers right now).

The root of the performance problems is that pcpu_schedule_trylock() is
expensive even for the local case, while cross-cpu locking is much
worse.  Locking every single pcpu in turn is terribly expensive, in
terms of hot cacheline pingpong, and the lock is frequently contended.

As a first opinion of this patch, you are adding another cpumask which
is going to play hot cacheline pingpong.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 3012 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 11:06   ` Andrew Cooper
@ 2017-03-02 11:35     ` Dario Faggioli
  2017-04-06  7:37     ` Dario Faggioli
  1 sibling, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 11:35 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 1854 bytes --]

On Thu, 2017-03-02 at 11:06 +0000, Andrew Cooper wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>  
> Malcolm’s solution to this problem is https://github.com/xenserver/xe
> n-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966  This has
> been in 2 releases of XenServer now, and has a very visible
> improvement for aggregate multi-queue multi-vm intrahost network
> performance (although I can't find the numbers right now).
> 
Yep, as you know, I had become aware of that.

> The root of the performance problems is that pcpu_schedule_trylock()
> is expensive even for the local case, while cross-cpu locking is much
> worse.  Locking every single pcpu in turn is terribly expensive, in
> terms of hot cacheline pingpong, and the lock is frequently
> contended.
> 
> As a first opinion of this patch, you are adding another cpumask
> which is going to play hot cacheline pingpong.
> 
Can you clarify? Inside csched_load_balance(), I'm actually trading one
currently existing cpumask_and() for another.

Sure, this new mask needs updating, but that only happens when a pCPU
acquires or leaves the "overloaded" status.

Malcolm's patch uses an atomic counter which does not fit very well in
Credit1's load balancer's logic, and in fact it (potentially) requires
an additional cpumask_cycle(). And it also comes with cache coherency
implications, doesn't it?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-03-03  9:35   ` anshul makkar
  2017-03-03 13:39     ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: anshul makkar @ 2017-03-03  9:35 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap



On 02/03/17 10:38, Dario Faggioli wrote:
> Since we're holding the lock on the pCPU from which we
> are trying to steal, it can't have disappeared, so we
> can drop the check for that (and convert it into an
> ASSERT()).
>
> And since we try to steal only from busy pCPUs, it's
> unlikely for such a pCPU to be idle, so we mark that
> case as unlikely (and bail out early if it unfortunately is).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
>   1 file changed, 44 insertions(+), 43 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 4649e64..63a8675 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1593,64 +1593,65 @@ static struct csched_vcpu *
>   csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> -    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
>       struct vcpu *vc;
>   
> +    ASSERT(peer_pcpu != NULL);
> +
>       /*
>        * Don't steal from an idle CPU's runq because it's about to
>        * pick up work from it itself.
>        */
> -    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
> +    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
> +        goto out;
We can just use if (!is_idle_vcpu(peer_vcpu)). Why replace it with
code that introduces an unnecessary branch statement?
> +
> +    list_for_each( iter, &peer_pcpu->runq )
>       {
> -        list_for_each( iter, &peer_pcpu->runq )
> -        {
> -            speer = __runq_elem(iter);
> +        speer = __runq_elem(iter);
>   
> -            /*
> -             * If next available VCPU here is not of strictly higher
> -             * priority than ours, this PCPU is useless to us.
> -             */
> -            if ( speer->pri <= pri )
> -                break;
> +        /*
> +         * If next available VCPU here is not of strictly higher
> +         * priority than ours, this PCPU is useless to us.
> +         */
> +        if ( speer->pri <= pri )
> +            break;
>   
> -            /* Is this VCPU runnable on our PCPU? */
> -            vc = speer->vcpu;
> -            BUG_ON( is_idle_vcpu(vc) );
> +        /* Is this VCPU runnable on our PCPU? */
> +        vc = speer->vcpu;
> +        BUG_ON( is_idle_vcpu(vc) );
>   
> -            /*
> -             * If the vcpu has no useful soft affinity, skip this vcpu.
> -             * In fact, what we want is to check if we have any "soft-affine
> -             * work" to steal, before starting to look at "hard-affine work".
> -             *
> -             * Notice that, if not even one vCPU on this runq has a useful
> -             * soft affinity, we could have avoid considering this runq for
> -             * a soft balancing step in the first place. This, for instance,
> -             * can be implemented by taking note of on what runq there are
> -             * vCPUs with useful soft affinities in some sort of bitmap
> -             * or counter.
> -             */
> -            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> -                continue;
> +        /*
> +         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * In fact, what we want is to check if we have any "soft-affine
> +         * work" to steal, before starting to look at "hard-affine work".
> +         *
> +         * Notice that, if not even one vCPU on this runq has a useful
> +         * soft affinity, we could have avoid considering this runq for
> +         * a soft balancing step in the first place. This, for instance,
> +         * can be implemented by taking note of on what runq there are
> +         * vCPUs with useful soft affinities in some sort of bitmap
> +         * or counter.
> +         */
Isn't it a better approach that, now that we have come across a vcpu which
doesn't have the desired soft affinity but is a potential candidate for
migration, instead of just forgetting it, we save the information about
such vcpus in some data structure, in some order, so that we don't have to
scan them again if we don't find anything useful in the present run?
> +        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +            continue;
>   
> -            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> -            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> -            {
> -                /* We got a candidate. Grab it! */
> -                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> -                         vc->domain->domain_id, vc->vcpu_id);
> -                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> -                SCHED_STAT_CRANK(migrate_queued);
> -                WARN_ON(vc->is_urgent);
> -                __runq_remove(speer);
> -                vc->processor = cpu;
> -                return speer;
> -            }
> +        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> +        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> +        {
> +            /* We got a candidate. Grab it! */
> +            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> +                     vc->domain->domain_id, vc->vcpu_id);
> +            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> +            SCHED_STAT_CRANK(migrate_queued);
> +            WARN_ON(vc->is_urgent);
> +            __runq_remove(speer);
> +            vc->processor = cpu;
> +            return speer;
>           }
>       }
> -
> + out:
>       SCHED_STAT_CRANK(steal_peer_idle);
>       return NULL;
>   }
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Anshul

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-03-03  9:48   ` anshul makkar
  2017-03-03 13:53     ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: anshul makkar @ 2017-03-03  9:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap



On 02/03/17 10:38, Dario Faggioli wrote:
> Checking whether or not a vCPU can be 'stolen'
> from a peer pCPU's runqueue is relatively cheap.
>
> Therefore, let's do that as early as possible,
> avoiding potentially useless, more complex checks
> and cpumask manipulations.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit.c |   17 +++++++++--------
>   1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 63a8675..2b13e99 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -708,12 +708,10 @@ static inline int
>   __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
> -     * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> -     * on our CPU.
> +     * Don't pick up work that's or hot on peer PCPU, or that can't (or
Not clear.
> +     * would prefer not to) run on cpu.
>        */
> -    return !vc->is_running &&
> -           !__csched_vcpu_is_cache_hot(vc) &&
> +    return !__csched_vcpu_is_cache_hot(vc) &&
>              cpumask_test_cpu(dest_cpu, mask);
Removing !vc->is_running doesn't ease the complexity and doesn't save much on
CPU cycles. In fact, I think (!vc->is_running) here makes the intention of the
migratability check much clearer.

Yeah, apart from the above reasons, it's safe to remove this check from here.
>   }
>   
> @@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>           BUG_ON( is_idle_vcpu(vc) );
>   
>           /*
> -         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * If the vcpu is still in peer_cpu's scheduling tail, or if it
> +         * has no useful soft affinity, skip it.
> +         *
>            * In fact, what we want is to check if we have any "soft-affine
>            * work" to steal, before starting to look at "hard-affine work".
>            *
> @@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>            * vCPUs with useful soft affinities in some sort of bitmap
>            * or counter.
>            */
> -        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +        if ( vc->is_running ||
> +             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
>               continue;
>   
>           csched_balance_cpumask(vc, balance_step, cpumask_scratch);
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-03  9:35   ` anshul makkar
@ 2017-03-03 13:39     ` Dario Faggioli
  0 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-03 13:39 UTC (permalink / raw)
  To: anshul makkar, xen-devel; +Cc: George Dunlap



On Fri, 2017-03-03 at 09:35 +0000, anshul makkar wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1593,64 +1593,65 @@ static struct csched_vcpu *
> >   csched_runq_steal(int peer_cpu, int cpu, int pri, int
> > balance_step)
> >   {
> >       const struct csched_pcpu * const peer_pcpu =
> > CSCHED_PCPU(peer_cpu);
> > -    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
> >       struct csched_vcpu *speer;
> >       struct list_head *iter;
> >       struct vcpu *vc;
> >   
> > +    ASSERT(peer_pcpu != NULL);
> > +
> >       /*
> >        * Don't steal from an idle CPU's runq because it's about to
> >        * pick up work from it itself.
> >        */
> > -    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
> > +    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
> > +        goto out;
> We can just use if (!is_idle_vcpu(peer_vcpu)). Why replace it with
> some code that introduces an unnecessary branch statement?
>
Mmm... I don't think I understand what this means.

> > +        /*
> > +         * If the vcpu has no useful soft affinity, skip this
> > vcpu.
> > +         * In fact, what we want is to check if we have any "soft-
> > affine
> > +         * work" to steal, before starting to look at "hard-affine 
> > work".
> > +         *
> > +         * Notice that, if not even one vCPU on this runq has a
> > useful
> > +         * soft affinity, we could have avoid considering this
> > runq for
> > +         * a soft balancing step in the first place. This, for
> > instance,
> > +         * can be implemented by taking note of on what runq there
> > are
> > +         * vCPUs with useful soft affinities in some sort of
> > bitmap
> > +         * or counter.
> > +         */
>
> Isn't it a better approach that, now that we have come across a vcpu
> which doesn't have the desired soft affinity but is a potential
> candidate for migration, instead of just forgetting it, we save the
> information for such vcpus in some data structure, in some order, so
> that we don't have to scan them again if we don't find anything useful
> in the present run?
>
So, AFAIUI, you're suggesting something like this:
 1. for each vcpu in the runqueue, we check soft-affinity. If it 
    matches, we're done;
 2. if it does not match, we check hard-affinity. If it matches, we 
    save that vcpu somewhere. We only need to save one vcpu, the first 
    one that we find to have matching hard-affinity;
 3. if we don't find any vcpu with matching soft affinity, we steal 
    the one we've saved.

Is this correct? If yes, and if there isn't anything I'm overlooking, I
think it could work.
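
In code, a rough, untested sketch of that (with a hypothetical
soft_aff_matches() helper, and glossing over the is_running/cache-hot
checks, locking and the exact field names) could be something like:

    static struct csched_vcpu *
    runq_steal_with_fallback(struct list_head *runq, int cpu)
    {
        struct csched_vcpu *speer, *fallback = NULL;
        struct list_head *iter;

        list_for_each( iter, runq )
        {
            speer = __runq_elem(iter);

            /* 1. soft-affinity match: steal this one right away. */
            if ( soft_aff_matches(speer->vcpu, cpu) )
                return speer;

            /* 2. remember the first vCPU whose hard affinity matches. */
            if ( fallback == NULL &&
                 cpumask_test_cpu(cpu, speer->vcpu->cpu_hard_affinity) )
                fallback = speer;
        }

        /* 3. no soft-affine work found: fall back (may well be NULL). */
        return fallback;
    }

i.e., the runqueue would be scanned only once, at the price of keeping
one extra pointer around.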

It's a separate patch, of course. I can try putting that together,
unless of course you want to give this a go yourself. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-03  9:48   ` anshul makkar
@ 2017-03-03 13:53     ` Dario Faggioli
  0 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-03 13:53 UTC (permalink / raw)
  To: anshul makkar, xen-devel; +Cc: George Dunlap



On Fri, 2017-03-03 at 09:48 +0000, anshul makkar wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -708,12 +708,10 @@ static inline int
> >   __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu,
> > cpumask_t *mask)
> >   {
> >       /*
> > -     * Don't pick up work that's in the peer's scheduling tail or
> > hot on
> > -     * peer PCPU. Only pick up work that prefers and/or is allowed
> > to run
> > -     * on our CPU.
> > +     * Don't pick up work that's or hot on peer PCPU, or that
> > can't (or
> Not clear.
>
Well, there's actually a typo (redundant 'or'). Good catch. :-)
> > 
> > +     * would prefer not to) run on cpu.
> >        */
> > -    return !vc->is_running &&
> > -           !__csched_vcpu_is_cache_hot(vc) &&
> > +    return !__csched_vcpu_is_cache_hot(vc) &&
> >              cpumask_test_cpu(dest_cpu, mask);
> !vc->is_running doesn't ease the complexity and doesn't save many
> CPU cycles. In fact, I think (!vc->is_running) makes the intention
> of the migratability check much clearer.
> 
But the point is not saving the overhead of a !vc->is_running check
here; it is actually to pull it out from within this function and check
it earlier. And that's ok because the value won't change, and it's a
good thing because what we save is a call to

  __vcpu_has_soft_affinity()

and, potentially, to

  csched_balance_cpumask()

i.e., more specifically...
> > @@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int
> > pri, int balance_step)
> >            * vCPUs with useful soft affinities in some sort of
> > bitmap
> >            * or counter.
> >            */
> > -        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> > -             && !__vcpu_has_soft_affinity(vc, vc-
> > >cpu_hard_affinity) )
> > +        if ( vc->is_running ||
> > +             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> > +              && !__vcpu_has_soft_affinity(vc, vc-
> > >cpu_hard_affinity)) )
> >               continue;
> >   
> >           csched_balance_cpumask(vc, balance_step,
> > cpumask_scratch);
> > 
...these ones here.

I agree that the check was a good fit for that function, but --with the
updated comments-- I don't think it's too terrible to have it outside.

Or were you suggesting to have it in _both_ places? If that's the case,
no... I agree it's cheap, but that would look confusing to me (I
totally see myself, in 3 months, sending a patch to remove the
redundant is_running check! :-P).

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (6 preceding siblings ...)
  2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-27  9:08 UTC (permalink / raw)
  To: xen-devel, George Dunlap; +Cc: Andrew Cooper, Anshul Makkar



On Thu, 2017-03-02 at 11:37 +0100, Dario Faggioli wrote:
> Hello,
> 
Hey, George,

About this series. I was re-looking at it, and I figured out that:

> Dario Faggioli (6):
>       xen: credit1: simplify csched_runq_steal() a little bit.
>       xen: credit: (micro) optimize csched_runq_steal().
>       xen: credit1: increase efficiency and scalability of load
> balancing.
>
Here in patch 3, overhead inside __runq_insert() and __runq_remove()
can be reduced.

>       xen: credit1: treat pCPUs more evenly during balancing.
>
And about patch 4, I like it, but I'm currently running more benchmarks
to make sure of its impact, and its cost-vs-benefit ratio.

I will send a v2 shortly so, maybe, even if you're desperately looking
for something of mine to review, feel free to skip (or just quickly
glance at) this series. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 11:35     ` Dario Faggioli
@ 2017-04-06  7:37     ` Dario Faggioli
  1 sibling, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-04-06  7:37 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap



On Thu, 2017-03-02 at 11:06 +0000, Andrew Cooper wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > 
> > To mitigate this, we introduce here the concept of
> > overloaded runqueues, and a cpumask where to record
> > what pCPUs are in such state.
> > 
> > An overloaded runqueue has at least 2 runnable vCPUs
> > (plus the idle one, which is always there). Typically,
> > this means 1 vCPU is running, and 1 is sitting in the
> > runqueue, and can hence be stolen.
> >
> > Then, in csched_balance_load(), it is enough to go
> > over the overloaded pCPUs, instead of over all non-idle
> > pCPUs, which is better.
> > 
> Malcolm’s solution to this problem is
> https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966
> This has been in 2 releases of XenServer now, and has a very visible
> improvement for aggregate multi-queue multi-vm intrahost network
> performance (although I can't find the numbers right now).
> 
> The root of the performance problems is that pcpu_schedule_trylock()
> is expensive even for the local case, while cross-cpu locking is much
> worse.  Locking every single pcpu in turn is terribly expensive, in
> terms of hot cacheline pingpong, and the lock is frequently
> contended.
> 
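(For illustration only, the bookkeeping described in the quoted commit
message could look roughly like the sketch below; the field and helper
names are hypothetical, the hooks would sit in __runq_insert() and
__runq_remove(), and this is not the actual patch.)

    static cpumask_t csched_overloaded;  /* pCPUs with stealable work */

    static inline void
    runq_count_up(unsigned int cpu, struct csched_pcpu *spc)
    {
        /* 2+ runnable vCPUs (idle excluded): others can steal from here. */
        if ( ++spc->nr_runnable >= 2 )
            cpumask_set_cpu(cpu, &csched_overloaded);
    }

    static inline void
    runq_count_down(unsigned int cpu, struct csched_pcpu *spc)
    {
        if ( --spc->nr_runnable < 2 )
            cpumask_clear_cpu(cpu, &csched_overloaded);
    }

Load balancing would then iterate over csched_overloaded only, rather
than over all non-idle pCPUs.
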
BTW, both my patch in this series, and the patch linked above are
_wrong_ in using __runq_insert() and __runq_remove() for counting the
runnable vCPUs.

In fact, in Credit1, during the main scheduling function
(csched_schedule()), we call runqueue insert to temporarily put back
the running vCPU. This increments the counter, making all the other
pCPUs think that there is a vCPU available for stealing in there, while:
1) that may not be true, if we end up choosing to run the same vCPU again;
2) even if true, they'll always fail on the trylock until we're out of
csched_schedule(), as it holds the runqueue lock itself.

So, yeah, it's not really a matter of correctness, but there's more
overhead to be cut.

In v2 of this series, that I'm about to send, I've "fixed" this (i.e.,
I'm only modifying the counter when really necessary).
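
In other words, something along these lines (a sketch only; the names
are hypothetical and the real v2 code is structured differently):

    /*
     * When (re)inserting a vCPU in the runqueue, only advertise it as
     * stealable if it is not the vCPU that csched_schedule() is
     * temporarily putting back, i.e., not the pCPU's scheduling tail.
     */
    static inline void
    runq_account_insert(struct csched_pcpu *spc, struct csched_vcpu *svc)
    {
        if ( !svc->vcpu->is_running )
            spc->nr_runnable++;   /* hypothetical per-pCPU counter */
    }

plus the symmetric check when removing.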

> As a first opinion of this patch, you are adding another cpumask
> which is going to play hot cacheline pingpong.
> 
Yeah, well, despite liking the cpumask based approach, I agree it's
overkill in this case. In v2, I got rid of it, and I am doing something
 even more similar to Malcolm's patch above.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-04-06  7:38 UTC | newest]

Thread overview: 16+ messages
2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
2017-03-03  9:35   ` anshul makkar
2017-03-03 13:39     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
2017-03-03  9:48   ` anshul makkar
2017-03-03 13:53     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
2017-03-02 11:06   ` Andrew Cooper
2017-03-02 11:35     ` Dario Faggioli
2017-04-06  7:37     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-03-27  9:08 ` Dario Faggioli
