xen-devel.lists.xenproject.org archive mirror
* [PATCH 00/19] Assorted fixes and improvements to Credit2
@ 2016-06-17 17:32 Dario Faggioli
  2016-06-17 23:08 ` Dario Faggioli
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Dario Faggioli @ 2016-06-17 17:32 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar, David Vrabel, Jan Beulich

Hi everyone,

Here is a collection of pseudo-random fixes and improvements to Credit2.

While working on Soft Affinity and Caps support, I stumbled upon these issues,
one after the other, and decided to take care of them.

It's been hard to test and run benchmarks, due to the "time goes backwards" bug
I uncovered [1], and this is at least part of the reason why the code for
affinity and caps is still missing. I've got it already, but I need to refine a
couple of things, after double checking benchmark results. So, now that we have
Jan's series [2] (thanks! [*]), and that I have managed to run some tests on
this preliminary set of patches, I decided I had better set this first group
free, while working on finishing the rest.

The various patches do a wide range of different things, so please refer to the
individual changelogs for more detailed explanations.

About the numbers I could collect so far, here's the situation. I've run rather
simple benchmarks, such as:
 - Xen build inside a VM. The metric is how long that takes (in seconds), so
   lower is better.
 - Iperf from a VM to its host. The metric is total aggregate throughput, so
   higher is better.

The host is a 16 pCPU / 2 NUMA node Xeon E5620, with 6GB of RAM per node. The
VM had 16 vCPUs and 4GB of memory. Dom0 had 16 vCPUs as well, and 1GB of RAM.

I ran the Xen build once with -j4 --representative of low VM load-- and once
with -j24 --representative of high VM load. For the Iperf test, I only used 8
parallel streams (I wanted to do both 4 and 8, but there was a bug in my
scripts! :-/).

I've run the above both with and without disturbing external (from the point of
view of the VM) load. Such load was generated just by running processes in
dom0. It's rather basic, but it certainly keeps dom0's vCPUs busy and stresses
the scheduler. This "noise", when present, was composed of:
 - 8 (v)CPU hog processes (`yes &> /dev/null'), running in dom0;
 - 4 processes alternating computation and sleep with a duty cycle of 35%.

So, there basically were 12 vCPUs of dom0 kept busy, in a heterogeneous fashion.
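A minimal sketch of what such a noise generator can look like (the 1s period
for the duty-cycled workers is my own assumption; the thread does not include
the actual scripts):

```shell
#!/bin/sh
# Sketch of a dom0 "noise" generator: 8 pure CPU hogs plus 4 workers
# alternating computation and sleep at a 35% duty cycle.

# Busy/idle slice lengths (in ms) of one period, for a given duty
# cycle (in percent): 35% of a 1000ms period -> 350ms busy, 650ms idle.
busy_ms() { echo $(( $1 * $2 / 100 )); }
idle_ms() { echo $(( $2 - $1 * $2 / 100 )); }

start_noise() {
    # 8 (v)CPU hogs, as in the description above.
    for _ in 1 2 3 4 5 6 7 8; do
        yes > /dev/null 2>&1 &
    done
    # 4 duty-cycled workers: ~350ms of CPU burn, then ~650ms of sleep.
    for _ in 1 2 3 4; do
        ( while :; do
              timeout 0.35s yes > /dev/null 2>&1
              sleep 0.65
          done ) &
    done
}
```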

I benchmarked Credit2 with runqueues arranged per-core (the current default)
and per-socket, and also Credit1, for reference. The baseline was current
staging plus Jan's monotonicity series.
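With the SMT-independent runqueue arrangement patch in this series, the
runqueue layout can be chosen at boot time. A hypothetical GRUB fragment
(option name as per the xen-command-line.markdown hunk in the diffstat below;
paths are illustrative):

```
multiboot /boot/xen.gz sched=credit2 credit2_runqueue=socket
module /boot/vmlinuz root=/dev/xvda1
```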

Actual numbers:

|=======================================================================|
| CREDIT 1 (for reference)                                              |
|=======================================================================|
| Xen build, low VM load, no noise    |
|-------------------------------------|
|               32.207                |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, no noise   | Iperf, high VM load, no noise   |
|-------------------------------------|---------------------------------|
|               18.500                |             22.633              |
|-------------------------------------|---------------------------------|
| Xen build, low VM load, with noise  |
|-------------------------------------|
|               38.700                |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, with noise | Iperf, high VM load, with noise |
|-------------------------------------|---------------------------------|
|               80.317                |             21.300              |
|=======================================================================|
| CREDIT 2                                                              |
|=======================================================================|
| Xen build, low VM load, no noise    |
|-------------------------------------|
|            runq=core   runq=socket  |
| baseline     34.543       38.070    |
| patched      35.200       33.433    |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, no noise   | Iperf, high VM load, no noise   |
|-------------------------------------|---------------------------------|
|            runq=core   runq=socket  |           runq=core runq=socket |
| baseline     18.710       19.397    | baseline    21.300     21.933   |
| patched      18.013       18.530    | patched     23.200     23.466   |
|-------------------------------------|---------------------------------|
| Xen build, low VM load, with noise  |
|-------------------------------------|
|            runq=core   runq=socket  |
| baseline     44.483       40.747    |
| patched      45.866       39.493    |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, with noise | Iperf, high VM load, with noise |
|-------------------------------------|---------------------------------|
|            runq=core   runq=socket  |           runq=core runq=socket |
| baseline     41.466       30.630    | baseline    20.333     20.633   |
| patched      36.840       29.080    | patched     19.967     21.000   |
|=======================================================================|

Which, summarizing, means:
 * as far as Credit2 is concerned, applying this series and using runq=socket
   is what _ALWAYS_ provides the best results.
 * when looking at Credit1 vs. patched Credit2 with runq=socket:
  - Xen build, low VM load,  no noise  : Credit1 slightly better
  - Xen build, high VM load, no noise  : on par
  - Xen build, low VM load,  with noise: Credit1 a bit better
  - Xen build, high VM load, with noise: Credit2 _ENORMOUSLY_ better (yes, I
    reran both cases a number of times!)
  - Iperf,     high VM load, no noise  : Credit2 a bit better
  - Iperf,     high VM load, with noise: Credit1 slightly better

So, Credit1 still wins a few rounds, but performance is very, very close, and
this series seems to help narrow the gap (for some of the cases,
significantly).

It also looks like, although rather naive, the 'Xen build, high VM load,
with noise' test case exposed another of those issues with Credit1 (more
investigation is necessary), while Credit2 keeps up just fine.

Another interesting thing to note is that, on Credit2 (with this series), 'Xen
build, high VM load, with noise' turns out to be quicker than 'Xen build, low
VM load, with noise'. This means that using a higher value for `make -j' for a
build, inside a guest, results in a quicker build time, which makes sense... But
that is _NOT_ what happens on Credit1, the whole thing (wildly :-P) hinting at
Credit2 being able to achieve better scalability and better fairness.

In any case, more benchmarking is necessary, and is already planned. More
investigation is also needed to figure out whether, once we have this series,
going back to runq=socket as the default would indeed be the best thing (which
I suspect it will be).

But from all I see, and from all the various perspectives, this series seems a
step in the right direction.

Thanks and Regards,
Dario

[1] http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
[2] http://lists.xen.org/archives/html/xen-devel/2016-06/msg01884.html

[*] Jan, I confirm that, with your series applied, I haven't yet seen any of
those "Time went backwards?" printks from Credit2, as you sort of were
expecting...

---
Dario Faggioli (19):
      xen: sched: leave CPUs doing tasklet work alone.
      xen: sched: make the 'tickled' perf counter clearer
      xen: credit2: insert and tickle don't need a cpu parameter
      xen: credit2: kill useless helper function choose_cpu
      xen: credit2: do not warn if calling burn_credits more than once
      xen: credit2: read NOW() with the proper runq lock held
      xen: credit2: prevent load balancing to go mad if time goes backwards
      xen: credit2: when tickling, check idle cpus first
      xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu
      xen: credit2: rework load tracking logic
      tools: tracing: adapt Credit2 load tracking events to new format
      xen: credit2: use non-atomic cpumask and bit operations
      xen: credit2: make the code less experimental
      xen: credit2: add yet some more tracing
      xen: credit2: only marshall trace point arguments if tracing enabled
      tools: tracing: deal with new Credit2 events
      xen: credit2: the private scheduler lock can be an rwlock.
      xen: credit2: implement SMT support independent runq arrangement
      xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu


 docs/misc/xen-command-line.markdown |   30 +
 tools/xentrace/formats              |   10 
 tools/xentrace/xenalyze.c           |  103 +++
 xen/common/sched_credit.c           |   22 -
 xen/common/sched_credit2.c          | 1158 +++++++++++++++++++++++++----------
 xen/common/sched_rt.c               |    8 
 xen/include/xen/cpumask.h           |    8 
 xen/include/xen/perfc_defn.h        |    5 
 8 files changed, 973 insertions(+), 371 deletions(-)

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [PATCH 00/19] Assorted fixes and improvements to Credit2
  2016-06-17 17:32 [PATCH 00/19] Assorted fixes and improvements to Credit2 Dario Faggioli
@ 2016-06-17 23:08 ` Dario Faggioli
  2016-06-20  7:43 ` Jan Beulich
  2016-07-08 10:11 ` George Dunlap
  2 siblings, 0 replies; 6+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:08 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar, David Vrabel, Jan Beulich



On Fri, 2016-06-17 at 19:32 +0200, Dario Faggioli wrote:
> Hi everyone,
> 
Mmm... I'm not sure why, but this time, 'stg mail' only managed to send
the cover letter, then it terminated with no errors! :-O

In any case, I'm resending... apologies for the noise.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 00/19] Assorted fixes and improvements to Credit2
  2016-06-17 17:32 [PATCH 00/19] Assorted fixes and improvements to Credit2 Dario Faggioli
  2016-06-17 23:08 ` Dario Faggioli
@ 2016-06-20  7:43 ` Jan Beulich
  2016-06-20 11:43   ` Dario Faggioli
  2016-07-08 10:11 ` George Dunlap
  2 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2016-06-20  7:43 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Anshul Makkar, David Vrabel

>>> On 17.06.16 at 19:32, <dario.faggioli@citrix.com> wrote:
> |-------------------------------------|---------------------------------|
> | Xen build, high VM load, with noise | Iperf, high VM load, with noise |
> |-------------------------------------|---------------------------------|
> |            runq=core   runq=socket  |           runq=core runq=socket |
> | baseline     41.466       30.630    | baseline    20.333     20.633   |
> | patched      36.840       29.080    | patched     19.967     21.000   |
> |=======================================================================|
> 
> Which, summarizing, means:
>  * as far as Credit2 is concerned,  applying this series and using runq=socket
>    is what _ALWAYS_ provides the best results.

Always? What about the increase on far the right side of the above
table fragment? It's not a big change, but anyway.

> [*] Jan, I confirm that, with your series applied, I haven't yet seen any of
> those "Time went backwards?" printk from Credit2, as you sort of were
> expecting...

Well, that's better than I had expected then: I didn't really think
they would be gone entirely. How long of an uptime did your tests
cover? As noted in the cover letter, I've observed remaining odd
TSC/stime jumps to increase in range over time, with no explanation
so far.

Also I wonder whether I may translate your statement above to
a Tested-by for part or all of the series (right now there's only a
coding style fix to one of the patches and a slight extension to
the rdtsc_ordered() one pending for an eventual v2).

Jan




* Re: [PATCH 00/19] Assorted fixes and improvements to Credit2
  2016-06-20  7:43 ` Jan Beulich
@ 2016-06-20 11:43   ` Dario Faggioli
  2016-06-20 11:53     ` Jan Beulich
  0 siblings, 1 reply; 6+ messages in thread
From: Dario Faggioli @ 2016-06-20 11:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Anshul Makkar, David Vrabel



On Mon, 2016-06-20 at 01:43 -0600, Jan Beulich wrote:
> > 
> > > 
> > > > 
> > > > On 17.06.16 at 19:32, <dario.faggioli@citrix.com> wrote:
> > > |-------------------------------------|---------------------------------|
> > > | Xen build, high VM load, with noise | Iperf, high VM load, with noise |
> > > |-------------------------------------|---------------------------------|
> > > |            runq=core   runq=socket  |           runq=core runq=socket |
> > > | baseline     41.466       30.630    | baseline    20.333     20.633   |
> > > | patched      36.840       29.080    | patched     19.967     21.000   |
> > > |=======================================================================|
> > Which, summarizing, means:
> >  * as far as Credit2 is concerned,  applying this series and using
> > runq=socket
> >    is what _ALWAYS_ provides the best results.
> Always? What about the increase on far the right side of the above
> table fragment? It's not a big change, but anyway.
> 
Not sure I follow. By 'far the right side' you mean the results of
"Iperf, high VM load, with noise"?

If yes, the 'patched' and 'runq=socket' element shows the highest
value, which in this case is a good thing, because this is Iperf and
the number is the total throughput in Gbps, and the higher it is, the
better.

> > [*] Jan, I confirm that, with your series applied, I haven't yet
> > seen any of
> > those "Time went backwards?" printk from Credit2, as you sort of
> > were
> > expecting...
> Well, that's better than I had expected then: I didn't really think
> they would be gone entirely. How long of an uptime did your tests
> cover? As noted in the cover letter, I've observed remaining odd
> TSC/stime jumps to increase in range over time, with no explanation
> so far.
> 
The total uptime of one run of this benchmarks is a handful of minutes,
so that's probably why I don't see any problem.

> Also I wonder whether I may translate your statement above to
> a Tested-by for part or all of the series (right now there's only a
> coding style fix to one of the patches and a slight extension to
> the rdtsc_ordered() one pending for an eventual v2).
> 
Indeed you can... I was in fact planning to reply directly to the
series' thread with that.

I've applied, and hence tested, the full series.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 00/19] Assorted fixes and improvements to Credit2
  2016-06-20 11:43   ` Dario Faggioli
@ 2016-06-20 11:53     ` Jan Beulich
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2016-06-20 11:53 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Anshul Makkar, David Vrabel

>>> On 20.06.16 at 13:43, <dario.faggioli@citrix.com> wrote:
> On Mon, 2016-06-20 at 01:43 -0600, Jan Beulich wrote:
>> > 
>> > > 
>> > > > 
>> > > > On 17.06.16 at 19:32, <dario.faggioli@citrix.com> wrote:
>> > > |-------------------------------------|---------------------------------|
>> > > | Xen build, high VM load, with noise | Iperf, high VM load, with noise |
>> > > |-------------------------------------|---------------------------------|
>> > > |            runq=core   runq=socket  |           runq=core runq=socket |
>> > > | baseline     41.466       30.630    | baseline    20.333     20.633   |
>> > > | patched      36.840       29.080    | patched     19.967     21.000   |
>> > > |=======================================================================|
>> > Which, summarizing, means:
>> >  * as far as Credit2 is concerned,  applying this series and using
>> > runq=socket
>> >    is what _ALWAYS_ provides the best results.
>> Always? What about the increase on far the right side of the above
>> table fragment? It's not a big change, but anyway.
>> 
> Not sure I follow. By 'far the right side' you mean the results of
> "Iperf, high VM load, with noise"?
> 
> If yes, the 'patched' and 'runq=socket' element shows the highest
> value, which in this case is a good thing, because this is Iperf and
> the number is the total throughput in Gbps, and the higher it is, the
> better.

Oh, I see. You certainly said so somewhere in the description;
it's not the first time that finding lower-is-better numbers right next
to higher-is-better ones has managed to confuse me.

Jan




* Re: [PATCH 00/19] Assorted fixes and improvements to Credit2
  2016-06-17 17:32 [PATCH 00/19] Assorted fixes and improvements to Credit2 Dario Faggioli
  2016-06-17 23:08 ` Dario Faggioli
  2016-06-20  7:43 ` Jan Beulich
@ 2016-07-08 10:11 ` George Dunlap
  2 siblings, 0 replies; 6+ messages in thread
From: George Dunlap @ 2016-07-08 10:11 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich

On Fri, Jun 17, 2016 at 6:32 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Hi everyone,
>
> Here you go a collection of pseudo-random fixes and improvement to Credit2.
>
> In the process of working on Soft Affinity and Caps support, I stumbled upon
> them, one after the other, and decided to take care.
>
> It's been hard to test and run benchmark, due to the "time goes backwards" bug
> I uncovered [1], and this is at least part of the reason why the code for
> affinity and caps is still missing. I've got it already, but need to refine a
> couple of things, after double checking benchmark results. So, now that we have
> Jan's series [2] (thanks! [*]), and that I managed to indeed run some tests on
> this preliminary set of patches, I decided I better set this first group free,
> while working on finishing the rest.
>
> The various patches do a wide range of different things, so, please, refer to
> Dario Faggioli (19):

I've pushed the following patches:

>       xen: sched: make the 'tickled' perf counter clearer
>       xen: credit2: insert and tickle don't need a cpu parameter
>       xen: credit2: kill useless helper function choose_cpu
>       xen: credit2: do not warn if calling burn_credits more than once
>       xen: credit2: when tickling, check idle cpus first
>       xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu
>       xen: credit2: use non-atomic cpumask and bit operations

The ones below either have outstanding comments, or don't apply
without patches which haven't been applied.

>       xen: sched: leave CPUs doing tasklet work alone.
>       xen: credit2: read NOW() with the proper runq lock held
>       xen: credit2: prevent load balancing to go mad if time goes backwards
>       xen: credit2: rework load tracking logic
>       tools: tracing: adapt Credit2 load tracking events to new format
>       xen: credit2: make the code less experimental
>       xen: credit2: add yet some more tracing
>       xen: credit2: only marshall trace point arguments if tracing enabled
>       tools: tracing: deal with new Credit2 events
>       xen: credit2: the private scheduler lock can be an rwlock.
>       xen: credit2: implement SMT support independent runq arrangement
>       xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu

 -George


