[RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities

* [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
@ 2015-04-04  2:14 Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Hi Everyone,

This RFC series is the outcome of an investigation I've been doing about
whether we can take better advantage of features like Intel CMT (and of PSR
features in general). By "take better advantage of" them I mean, for example,
use the data obtained from monitoring within the scheduler and/or within
libxl's automatic NUMA placement algorithm, or similar.

I'm putting here in the cover letter a markdown document I wrote to better
describe my findings and ideas (sorry if it's a bit long! :-D). You can also
fetch it at the following links:

 * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
 * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown

See the document itself and the changelog of the various patches for details.

The series includes one Chao's patch on top, as I found it convenient to build
on top of it. The series itself is available here:

  git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
  http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon

Thanks a lot to everyone that will read and reply! :-)

Regards,
Dario
---

# Intel Cache Monitoring: Present and Future

## About this document

This document represents the result of in investigation on whether it would be
possible to more extensively exploit the Platform Shared Resource Monitoring
(PSR) capabilities of recent Intel x86 server chips. Examples of such features
are the Cache Monitoring Technology (CMT) and the Memory Bandwidth Monitoring
(MBM).

More specifically, it focuses on Cache Monitoring Technology, support for which
has recently been introduced in Xen by Intel, trying to figure out whether it
can be used for high level load balancing, such as libxl automatic domain
placement, and/or within Xen vCPU scheduler(s).

Note that, although the document only speaks about CMT, most of the
considerations apply (or can easily be extended) to MBM as well.

The fact that, currently, support is provided for monitoring L3 cache only,
somewhat limits the benefits of more extensively exploiting such technology,
which is exactly the purpose here. Nevertheless, some improvements are possible
already, and if at some point support for monitoring other cache layers will be
available, this can be the basic building block for taking advantage of that
too.

Source for this document is available [here](http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown).
A PDF version is also available [here](http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf).

### Terminology

In the remainder of the document, the term core, processor and (physical) CPU
(abbreviated with pCPU), are used interchangeably, for referring to a logical
processor. So, for instance, a server with 2 sockets, each one containing 4
cores and with hyperthreading will be referred to as a system with 16 pCPUs.

## The Cache Monitoring Technology (CMT)

Cache Monitoring Technology is about the hardware making cache utilization
information available to the Operating System (OS) or  Hypervisor (VMM), so
that it can make better decisions on workload scheduling.

Official documentation from Intel about CMT is available at the following URLs:

 * [Benefits of Cache Monitoring](https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring)
 * [Intel CMT: Software Visible Interfaces](https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces)
 * [Intel CMT: Usage Models and Data](https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-use-models-and-data)
 * [Intel CMT: Software Support and Tools](https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools)

The family of chips that first includes CMT is this:

 * https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family

Developers' documentation is available, as usual, in [Intel's SDM, volume 3B,
section 17.14](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

Materials about how CMT is currently supported in Xen can be found
[here]( http://wiki.xenproject.org/wiki/Intel_Cache_Monitoring_Technology)
and [here](http://xenbits.xen.org/docs/unstable/man/xl.1.html#cache_monitoring_technology).

## Current status

Intel itself did the work of upstreaming CMT support for Xen. That happened by
mean of the following changesets:

 * _x86: expose CMT L3 event mask to user space_ : [877eda3223161b995feacce8d2356ced1f627fa8](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=877eda3223161b995feacce8d2356ced1f627fa8)
 * _tools: CMDs and APIs for Cache Monitoring Technology_ : [747187995dd8cb28dcac1db8851d60e54f85f8e4](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=747187995dd8cb28dcac1db8851d60e54f85f8e4)
 * _xsm: add CMT related xsm policies_ : [edc3103ef384277d05a2d4a1f3aebd555add8d11](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=edc3103ef384277d05a2d4a1f3aebd555add8d11)
 * _x86: add CMT related MSRs in allowed list_ : [758b3b4ac2c7967d80c952da943a1ebddc66d2a2](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=758b3b4ac2c7967d80c952da943a1ebddc66d2a2)
 * _x86: enable CMT for each domain RMID_ : [494005637de52e52e228c04d170497b2e6950b53](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=494005637de52e52e228c04d170497b2e6950b53)
 * _x86: collect global CMT information_ : [78ec83170a25b7b7cfd9b5f0324bacdb8bed10bf](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=78ec83170a25b7b7cfd9b5f0324bacdb8bed10bf)
 * _x86: dynamically attach/detach CMT service for a guest_ : [c80c2b4bf07212485a9dcb27134f659c741155f5](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=c80c2b4bf07212485a9dcb27134f659c741155f5)
 * _x86: detect and initialize Cache Monitoring Technology feature_ : [021871770023700a30aa7e196cf7355b1ea4c075](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=021871770023700a30aa7e196cf7355b1ea4c075)
 * _libxc: provide interface for generic resource access_ : [fc265934d83be3d6da2647dce470424170fb96e9](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=fc265934d83be3d6da2647dce470424170fb96e9)
 * _xsm: add resource operation related xsm policy_ : [2a5e086e0bd6729b4a25536b9f978dedf3be52de](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=2a5e086e0bd6729b4a25536b9f978dedf3be52de)
 * _x86: add generic resource (e.g. MSR) access hypercall_ : [443035c40ab6a0566133a55090532740c52d61d3](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=443035c40ab6a0566133a55090532740c52d61d3)

Quickly summarizing, it is possible, from Dom0, to start and stop monitoring
the L3 cache occupancy of a domain, by doing something like this:

    [root@redbrick ~]# xl psr-cmt-attach 7

Results can be seen as follows:

    [root@redbrick ~]# xl psr-cmt-show cache_occupancy
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    wheezy64                                     7          432 KB            0 KB         2016 KB            0 KB

What happens in Xen is that an RMID is assigned to the domain, and that RMID is
loaded in a specific register, during a context switch, when one of the domain's
vCPU starts executing. Then, when such a vCPU is switched out, another RMID is
loaded, and that could be the RMID assigned to another domain being monitored
or the special RMID 0, used to represent the 'nothing to monitor' situation.

## Main limitations of current approach

The existing CMT support in Xen suffers from some limitations, some imputable
to hardware, some to software, some to both.

### Per-domain monitoring

Xen scheduler's schedule vCPUs, not domains. So, if wanting to use results
coming out of CMT for scheduling, the monitoring would have to happen on a
per-vCPU basis, rather than on a per-domain one.

This is mostly matter of how CMT support has been implemented, and that can be
changed toward per-vCPU monitoring pretty easily, at least from a purely
software perspective. That would mean, potentially, more overhead (RMID being
switched more often than now, and each RMID switch means writing a Model
Specific Register). The biggest hurdle toward achieving per-vCPU CMT support,
however, most likely is the fact that the number of available RMID is limited
(see below).

### RMIDs are limited

And that is an hard limit, imposed, per the best of author's understanding, by
the underlying hardware.

For example, a test/development box at hand, with an Intel(R) Xeon(R) CPU
E7-8890 v3 as CPU, had 144 pCPUs and only 71 RMIDs available. 

This means, for instance, that, if we would like to turn CMT support into being
to per-vCPU, we will not be able to monitor more vCPUs than 1/2 the number of
the host's pCPUs.

### Only L3 monitoring

For now, it is only possible to monitor L3 cache occupancy. That is, as far as
the author of this document understands, a limitation currently imposed by
hardware. However, making it possible to monitor other cache layer, e.g., L2,
is something Intel is planning to introduce, still to the best of the
document's author's knowledge.

In any case, the fact that CMT is limited to L3 for now, makes it less
appealing than one may think to be used from within the Xen scheduler, to make
cache aware scheduling decisions. In fact, L3 is, in most architectures, the so
called LLC (Last Level Cache), shared by all the cores of a socket.
Furthermore, a socket, most of the times, coincides with a NUMA node.

Therefore, although knowing about L3 cache occupancy enables making better
informed decisions when it comes at load balancing among NUMA nodes, moving one
(or more) vCPUs of a domain to a different socket will most likely have some
rather bad implications, for instance if the domain has it's memory on the NUMA
node where it is running (and both the domain creation logic and the Xen
scheduler have heuristics in place for make this happen!). So, although more
cache efficient, this would lead to a lot of remote memory traffic, which is
certainly undesirable.

Not all the domains are always allocated and configured in such a way to
achieve as much locality of memory accesses as possible, so, theoretically,
there is some room for this feature to be useful. However, practically
speaking, until we will have L2 monitoring the additional complexity and the
overhead introduced would most likely outweighs the benefits.

### RMIDs reuse

It is unclear to the author of this document what behavior would be considered
to be the proper one when an RMID is "reused", or "recycled". What that means
is what should happen if a domain is assigned an RMID and then, at some point,
the domain is detached from CMT, so the RMID is freed and, when another domain
is attached, that same RMID is used.

This is exactly what happens in the current implementation. Result looks as
follows:

    [root@redbrick ~]# xl psr-cmt-attach 0
    [root@redbrick ~]# xl psr-cmt-attach 1
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    Domain-0                                     0         6768 KB            0 KB            0 KB            0 KB
    wheezy64                                     1            0 KB          144 KB          144 KB            0 KB

Let's assume that RMID 1 (RMID 0 is reserved) is used for Domain-0 and RMID 2
is used for wheezy64. Then:

    [root@redbrick ~]# xl psr-cmt-detach 0
    [root@redbrick ~]# xl psr-cmt-detach 1

So now both RMID 1 and 2 are free to be reused. Now, let's issue the following
commands:

    [root@redbrick ~]# xl psr-cmt-attach 1
    [root@redbrick ~]# xl psr-cmt-attach 0

Which means that RMID 1 is now assigned to wheezy64, and RMID 2 is given to
Domain-0. Here's the effect:

    [root@redbrick ~]# xl psr-cmt-show cache_occupancy
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    Domain-0                                     0          216 KB          144 KB          144 KB            0 KB
    wheezy64                                     1         7416 KB            0 KB         1872 KB            0 KB

It looks quite likely that the 144KB occupancy on sockets 1 and 2, now being
accounted to Domain-0, is really what has been allocated by domain wheezy64,
before the RMID "switch". The same applies to the 7416KB on socket 0 now
accounted to wheezy64, i.e., most of this is not accurate and was allocated
there by Domain-0.

This is only a simple example, others have been performed, restricting the
affinity of the various domains involved in order to control on what socket
cache load were to be expected, and all confirm the above reasoning.

It is rather easy to appreciate that any kind of 'flushing' mechanism, to be
triggered when reusing an RMID (if anything like that even exists!) would
impact system performance (e.g., it is not an option in hot paths), but the
situation outlined above needs to be fixed, before the mechanism could be
considered usable and reliable enough to do anything on top of it.

### RMID association asynchronousness

Associating an RMID to a domain, right now, happens upon toolstack request,
from Dom0, and it will be effective as soon as the next context switches
involving the vCPUs of the domain happen.

The worst case scenario is for an (new) RMID being assigned to a domain an
instant after all its vCPU started running on some pCPUs. After a while (e.g.,
at the end of their timeslice), they will be de-scheduled and context switched
in back when their turn come. It is only during this last context switch that
we load the RMID in the special register, i.e., we have lost a full instance of
execution and hence of cache load monitoring. This is not a too big issue for
the current use case, but it might be when thinking about using CMT from within
the Xen scheduler.

Detaching case is similar (actually, probably worse). In fact, let's assume
that just an instant after all the vCPUs of a domain with an RMID attached
started executing, the domain is detached from CMT and another domain is
attached, using the RMID which just got freed. What happens is that the new
domain will be accounted for the cache load generated by a full instance of
execution of the vCPUs of the old domain. Again, this is probably tolerable for
the current use case, but cause problems if wanting to use the CMT feature more
extensively.

## Potential improvements and/or alternative approaches

### Per-vCPU cache monitoring

This means being able to tell how much of the L3 is being used by each vCPU.
Monitoring the cache occupancy of a specific domain, would still be possible,
just by summing up the contributions from all the domain's vCPUs.

#### Benefits / Drawbacks

>From the scheduler perspective, this would be best possible situation, provided
the information is precise enough, and is available for all (or at least a
relevant, for some definition of relevant, subset of) vCPUs.

For instance, Credit1 already includes a function called
\_\_csched\_vcpu\_is\_cache\_hot(). That is called when, during load balancing,
we consider whether or not to "steal" a vCPU from a remote pCPU. That may be
good for load balancing, nevertheless, we want to leave where they are the ones
that have most of their data in the remote pCPU cache. Such function right now
tries to infer whether a vCPU is really "cache hot" by looking at how long ago
it run; with something like per-vCPU CMT, it could use actual cache occupancy
samples.

However, implementing per-vCPU cache monitoring would entail the following:

 * on large servers, running large (and/or many) guests, we may run out of
   RMIDs
 * more frequent RMID switching on the pCPUs, which (it being an MSR write)
   means increased overhead
 * rather frequent sampling of the cache occupancy value, which (it being an
   MSR read) means increased overhead

Moreover, right now that only L3 monitoring is supported, the information would
not be useful for deciding whether or not to migrate a vCPU between pCPUs
belonging to the same socket (in general, between pCPUs insisting on the same
LLC).

It therefore seems that, enabling per-vCPU monitoring and using the information
it could provide from withing the scheduler would require overcoming most of
the current (both hardware and software) limitations of CMT support.

### Per-pCPU cache monitoring

This means being able to tell how much of the L3 is being used by activities
running on each core that insists it. These activities are typically vCPUs
running on the various logical processors but, depending on the actual
implementation, it may or not include the effects on cache occupancy of
hypervisor code execution.

Implementing this would require reserving an RMID for each logical processor,
and keep it in the appropriate special register until the activity we are
interested in monitoring is running on it. Again, due to the "L3 only"
limitation, this is not so useful to know when inside the scheduler. However,
it could be a sensible and useful information for in-toolstack(s) balancing and
placement algorithms, as well as for the user, to decide how to manually place
domains and set o tweak things like vCPUs hard and soft affinities.

#### Benefits / Drawbacks

How much of the LLC is being used by each core is a fair measure of the load on
a core. As such, it can be reported all the way up to the user, as we currently
do with per-domain cache occupancy.

If such information is available for all cores, by summing up all their
contributions, and subtracting it from the size of such LLC, we obtain how much
free space we have in each L3. This is an information that, for example, the
automatic NUMA placement algorithm that we have in libxl can leverage, to have
a better idea of on what pCPUs/NUMA node would be best to place a new domain.

In this form of "how much free cache there is on each L3 (on each socket)", it
is possible for the Xen scheduler too to make a somewhat good use of it,
although, of course, only when the decisions involve more than one socket.
Also, the overhead being introduced by a similar CMT configuration, is
certainly not concerning.

If the association of RMIDs to the cores has to be dynamic, i.e., if it has to
be possible for an user to attach, detach and re-attach a core to CMT, the
issue described in the previous section about RMID reuse is very much relevant
for this use case too. If it is static (e.g., established at boot and never
changed), that is less of a concern.

If the information about the amount of free L3 has to be used from within the
scheduler, that requires some more MSR manipulations, as we not only need to
load the RMIDs, we also need to sample the cache occupancy value from time to
time. That still looks feasible, but it is worth mentioning and considering
(e.g., what periodicity should be used, etc.).

The biggest drawback is probably the fact that, if enabling this per-core CMT
setup, monitoring specific domains, i.e., what we support right now, would be
really difficult. In principle, it just requires that:

 * when a core is executing the vCPU of a non-monitored domain, the RMID of the
   core is used
 * when the vCPU of a monitored domain is context switched in, the RMID of the
   domain is used
 * the cache load generated while the RMID of the domain is used must be
   recorded and stored or accumulated somewhere
 * when the reading the cache occupancy of a core, the accumulated value of
   cache occupancy of domains that run on the core itself should be taken into
   account
 * when a domain stop being monitored, its cache occupancy value should be
   somehow incorporated back in the core's cache occupancy
 * when a domain is destroyed, its cache occupancy value should be discarded.

Currently, point 3 can't be achieved, due to the lack of precision of the
measurements, deriving from the asynchronous (with respect to scheduling)
nature of RMIDs association, as described in previous sections. Points below
it, are also quite difficult to implement, at least as far as reporting the
value up to toolstack is concerned, given how cache occupancy values read is
implemented (i.e., as a generic 'MSR read' platform hypercall).

Therefore, if wanting to go for the per-core CMT configuration, sane
alternatives seem to be to just disallow specific domain monitoring, to avoid
it screwing up the per-core monitoring, or the exact opposite. That is just
allow it and let it temporarily (i.e., as long as a domain is monitored plus,
likely, some settle time after that) screw up the per-core monitoring, of
course warning the user about it.

## Conclusion

In extreme summary, the outcome of this investigation is that CMT (and other
PSR provided facilities by Intel) is a nice to have and useful feature, with
the potential of being much more useful, but that depends on the evolution of
both hardware and software support for it.

Implementing per-core CMT setup looks to be a reasonably low hanging fruit,
although some design decisions needs to be taken. For this reason, an RFC patch
series drafting an implementation of that is provided, together with this
document.

Please, see the changelogs of the individual patches and find there more
opportunities to discuss the design of such a setup, if that is something
considered worth having in Xen.

Find the series in this git branch also:

 * [git://xenbits.xen.org/people/dariof/xen.git](http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=summary)  [wip/sched/icachemon](http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon)

---
Chao Peng (1):
      x86: improve psr scheduling code

Dario Faggioli (6):
      Xen: x86: print max usable RMID during init
      xen: psr: reserve an RMID for each core
      xen: libxc: libxl: report per-CPU cache occupancy up to libxl
      xen: libxc: libxl: allow for attaching and detaching a CPU to CMT
      xl: report per-CPU cache occupancy up to libxl
      xl: allow for attaching and detaching a CPU to CMT

 tools/libxc/include/xenctrl.h |    4 +
 tools/libxc/xc_psr.c          |   53 +++++++++++++
 tools/libxl/libxl.h           |    6 ++
 tools/libxl/libxl_psr.c       |   89 ++++++++++++++++++++++
 tools/libxl/xl_cmdimpl.c      |  132 +++++++++++++++++++++++++++++----
 tools/libxl/xl_cmdtable.c     |   14 ++--
 xen/arch/x86/domain.c         |    7 +-
 xen/arch/x86/psr.c            |  164 ++++++++++++++++++++++++++++++++---------
 xen/arch/x86/sysctl.c         |   42 +++++++++++
 xen/include/asm-x86/psr.h     |   14 +++-
 xen/include/public/sysctl.h   |    7 ++
 11 files changed, 467 insertions(+), 65 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread