* [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
@ 2015-04-04  2:14 Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Hi Everyone,

This RFC series is the outcome of an investigation I've been doing about
whether we can take better advantage of features like Intel CMT (and of PSR
features in general). By "take better advantage of" them I mean, for example,
use the data obtained from monitoring within the scheduler and/or within
libxl's automatic NUMA placement algorithm, or similar.

I'm putting here in the cover letter a markdown document I wrote to better
describe my findings and ideas (sorry if it's a bit long! :-D). You can also
fetch it at the following links:

 * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
 * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown

See the document itself and the changelog of the various patches for details.

The series includes one of Chao's patches at the bottom, as I found it convenient
to build on top of it. The series itself is available here:

  git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
  http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon

Thanks a lot to everyone that will read and reply! :-)

Regards,
Dario
---

# Intel Cache Monitoring: Present and Future

## About this document

This document presents the results of an investigation into whether it would be
possible to more extensively exploit the Platform Shared Resource Monitoring
(PSR) capabilities of recent Intel x86 server chips. Examples of such features
are Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM).

More specifically, it focuses on Cache Monitoring Technology, support for which
has recently been introduced in Xen by Intel, trying to figure out whether it
can be used for high level load balancing, such as libxl's automatic domain
placement, and/or within the Xen vCPU scheduler(s).

Note that, although the document only speaks about CMT, most of the
considerations apply (or can easily be extended) to MBM as well.

The fact that, currently, only L3 cache monitoring is supported somewhat limits
the benefits of more extensively exploiting this technology, which is exactly
the purpose here. Nevertheless, some improvements are already possible and, if
at some point support for monitoring other cache layers becomes available, they
can be the basic building blocks for taking advantage of that too.

Source for this document is available [here](http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown).
A PDF version is also available [here](http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf).

### Terminology

In the remainder of the document, the terms core, processor and (physical) CPU
(abbreviated as pCPU) are used interchangeably to refer to a logical processor.
So, for instance, a server with 2 sockets, each containing 4 cores, with
hyperthreading enabled, will be referred to as a system with 16 pCPUs.

## The Cache Monitoring Technology (CMT)

Cache Monitoring Technology is about the hardware making cache utilization
information available to the Operating System (OS) or Hypervisor (VMM), so
that they can make better workload scheduling decisions.

Official documentation from Intel about CMT is available at the following URLs:

 * [Benefits of Cache Monitoring](https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring)
 * [Intel CMT: Software Visible Interfaces](https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces)
 * [Intel CMT: Usage Models and Data](https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-use-models-and-data)
 * [Intel CMT: Software Support and Tools](https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools)

The first family of chips to include CMT is the following:

 * https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family

Developers' documentation is available, as usual, in [Intel's SDM, volume 3B,
section 17.14](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

Materials about how CMT is currently supported in Xen can be found
[here]( http://wiki.xenproject.org/wiki/Intel_Cache_Monitoring_Technology)
and [here](http://xenbits.xen.org/docs/unstable/man/xl.1.html#cache_monitoring_technology).

## Current status

Intel itself did the work of upstreaming CMT support into Xen. That happened by
means of the following changesets:

 * _x86: expose CMT L3 event mask to user space_ : [877eda3223161b995feacce8d2356ced1f627fa8](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=877eda3223161b995feacce8d2356ced1f627fa8)
 * _tools: CMDs and APIs for Cache Monitoring Technology_ : [747187995dd8cb28dcac1db8851d60e54f85f8e4](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=747187995dd8cb28dcac1db8851d60e54f85f8e4)
 * _xsm: add CMT related xsm policies_ : [edc3103ef384277d05a2d4a1f3aebd555add8d11](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=edc3103ef384277d05a2d4a1f3aebd555add8d11)
 * _x86: add CMT related MSRs in allowed list_ : [758b3b4ac2c7967d80c952da943a1ebddc66d2a2](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=758b3b4ac2c7967d80c952da943a1ebddc66d2a2)
 * _x86: enable CMT for each domain RMID_ : [494005637de52e52e228c04d170497b2e6950b53](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=494005637de52e52e228c04d170497b2e6950b53)
 * _x86: collect global CMT information_ : [78ec83170a25b7b7cfd9b5f0324bacdb8bed10bf](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=78ec83170a25b7b7cfd9b5f0324bacdb8bed10bf)
 * _x86: dynamically attach/detach CMT service for a guest_ : [c80c2b4bf07212485a9dcb27134f659c741155f5](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=c80c2b4bf07212485a9dcb27134f659c741155f5)
 * _x86: detect and initialize Cache Monitoring Technology feature_ : [021871770023700a30aa7e196cf7355b1ea4c075](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=021871770023700a30aa7e196cf7355b1ea4c075)
 * _libxc: provide interface for generic resource access_ : [fc265934d83be3d6da2647dce470424170fb96e9](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=fc265934d83be3d6da2647dce470424170fb96e9)
 * _xsm: add resource operation related xsm policy_ : [2a5e086e0bd6729b4a25536b9f978dedf3be52de](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=2a5e086e0bd6729b4a25536b9f978dedf3be52de)
 * _x86: add generic resource (e.g. MSR) access hypercall_ : [443035c40ab6a0566133a55090532740c52d61d3](http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=443035c40ab6a0566133a55090532740c52d61d3)

Quickly summarizing: from Dom0, it is possible to start and stop monitoring
the L3 cache occupancy of a domain by doing something like this:

    [root@redbrick ~]# xl psr-cmt-attach 7

Results can be seen as follows:

    [root@redbrick ~]# xl psr-cmt-show cache_occupancy
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    wheezy64                                     7          432 KB            0 KB         2016 KB            0 KB

What happens in Xen is that an RMID is assigned to the domain, and that RMID is
loaded into a specific register during a context switch, when one of the domain's
vCPUs starts executing. Then, when such a vCPU is switched out, another RMID is
loaded: either the RMID assigned to another domain being monitored, or the
special RMID 0, used to represent the 'nothing to monitor' situation.
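Concretely, the RMID switch boils down to a read-modify-write of the RMID field in a per-core MSR. Below is a minimal sketch of just the bit manipulation involved; the 10-bit field width is an assumption based on current hardware, and the real logic lives in xen/arch/x86/psr.c:

```c
#include <stdint.h>

/* Illustrative only: the RMID occupies the low bits of the per-core
 * IA32_PQR_ASSOC MSR, and the remaining bits must be preserved. */
#define RMID_MASK 0x3ffULL  /* assumed 10-bit RMID field */

/* Compute the new register value when context switching to a vCPU of a
 * domain with the given RMID (0 = 'nothing to monitor'). */
static inline uint64_t pqr_assoc_set_rmid(uint64_t pqr, uint64_t rmid)
{
    return (pqr & ~RMID_MASK) | (rmid & RMID_MASK);
}
```

Since the write is only needed when the value actually changes, caching the last written value per pCPU (as the first patch in the series does) avoids redundant MSR writes.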

## Main limitations of current approach

The existing CMT support in Xen suffers from some limitations, some attributable
to hardware, some to software, some to both.

### Per-domain monitoring

Xen's schedulers schedule vCPUs, not domains. So, if we want to use the results
coming out of CMT for scheduling, the monitoring would have to happen on a
per-vCPU basis, rather than on a per-domain one.

This is mostly a matter of how CMT support has been implemented, and it can be
changed toward per-vCPU monitoring pretty easily, at least from a purely
software perspective. It would mean, potentially, more overhead (RMIDs being
switched more often than now, and each RMID switch means writing a Model
Specific Register). The biggest hurdle toward achieving per-vCPU CMT support,
however, is most likely the fact that the number of available RMIDs is limited
(see below).

### RMIDs are limited

And that is a hard limit imposed, to the best of the author's understanding, by
the underlying hardware.

For example, a test/development box at hand, with an Intel(R) Xeon(R) CPU
E7-8890 v3, had 144 pCPUs but only 71 RMIDs available.

This means, for instance, that if we were to turn CMT support into per-vCPU
monitoring, we would not be able to monitor more vCPUs than about half the
number of the host's pCPUs.
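A trivial up-front feasibility check (function name hypothetical; the idea of comparing pCPU and RMID counts before enabling anything is mentioned in the changelog of patch 3) makes the constraint concrete:

```c
#include <stdbool.h>

/* Hypothetical sketch: before committing to per-vCPU monitoring, check
 * whether the host has enough RMIDs for the vCPUs we want to monitor. */
static bool enough_rmids(unsigned int nr_rmids, unsigned int nr_vcpus)
{
    return nr_vcpus <= nr_rmids;
}
```

On the box above, 144 fully loaded pCPUs' worth of vCPUs against 71 RMIDs would fail this check, so monitoring would have to be restricted to a subset of vCPUs.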

### Only L3 monitoring

For now, it is only possible to monitor L3 cache occupancy. That is, as far as
the author of this document understands, a limitation currently imposed by
hardware. However, making it possible to monitor other cache layers, e.g., L2,
is something Intel is planning to introduce, again to the best of the author's
knowledge.

In any case, the fact that CMT is limited to L3 for now makes it less appealing
than one may think for use within the Xen scheduler, to make cache-aware
scheduling decisions. In fact, L3 is, on most architectures, the so-called LLC
(Last Level Cache), shared by all the cores of a socket. Furthermore, a socket,
most of the time, coincides with a NUMA node.

Therefore, although knowing about L3 cache occupancy enables better informed
decisions when it comes to load balancing among NUMA nodes, moving one (or
more) vCPUs of a domain to a different socket will most likely have rather bad
implications, for instance if the domain has its memory on the NUMA node where
it is currently running (and both the domain creation logic and the Xen
scheduler have heuristics in place to make this happen!). So, although more
cache efficient, such a move would lead to a lot of remote memory traffic,
which is certainly undesirable.

Not all domains are always allocated and configured in such a way as to achieve
as much locality of memory accesses as possible, so, theoretically, there is
some room for this feature to be useful. Practically speaking, however, until
we have L2 monitoring, the additional complexity and overhead introduced would
most likely outweigh the benefits.

### RMIDs reuse

It is unclear to the author of this document what the proper behavior should be
when an RMID is "reused", or "recycled". That is: what should happen when a
domain is assigned an RMID and then, at some point, detached from CMT, so that
the RMID is freed and, when another domain is attached, that same RMID is used
again?

This is exactly what happens in the current implementation. The result looks as
follows:

    [root@redbrick ~]# xl psr-cmt-attach 0
    [root@redbrick ~]# xl psr-cmt-attach 1
    [root@redbrick ~]# xl psr-cmt-show cache_occupancy
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    Domain-0                                     0         6768 KB            0 KB            0 KB            0 KB
    wheezy64                                     1            0 KB          144 KB          144 KB            0 KB

Let's assume that RMID 1 (RMID 0 is reserved) is used for Domain-0 and RMID 2
is used for wheezy64. Then:

    [root@redbrick ~]# xl psr-cmt-detach 0
    [root@redbrick ~]# xl psr-cmt-detach 1

So now both RMID 1 and 2 are free to be reused. Now, let's issue the following
commands:

    [root@redbrick ~]# xl psr-cmt-attach 1
    [root@redbrick ~]# xl psr-cmt-attach 0

Which means that RMID 1 is now assigned to wheezy64, and RMID 2 is given to
Domain-0. Here's the effect:

    [root@redbrick ~]# xl psr-cmt-show cache_occupancy
    Total RMID: 71
    Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
    Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
    Domain-0                                     0          216 KB          144 KB          144 KB            0 KB
    wheezy64                                     1         7416 KB            0 KB         1872 KB            0 KB

It looks quite likely that the 144 KB occupancy on sockets 1 and 2, now being
accounted to Domain-0, is really what was allocated by domain wheezy64 before
the RMID "switch". The same applies to the 7416 KB on socket 0 now accounted to
wheezy64: most of it is not accurate, having been allocated there by Domain-0.

This is only a simple example; others have been performed, restricting the
affinity of the various domains involved in order to control on which socket
cache load was to be expected, and they all confirm the above reasoning.

It is rather easy to appreciate that any kind of 'flushing' mechanism, to be
triggered when reusing an RMID (if anything like that even exists!), would
impact system performance (e.g., it is not an option in hot paths). Still, the
situation outlined above needs fixing before the mechanism can be considered
usable and reliable enough to build anything on top of it.
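One conceivable software-only mitigation (not in the current implementation; purely a sketch, under the assumption that per-RMID occupancy can be read at recycle time) is to park freed RMIDs in a "limbo" state and only hand them out again once their residual occupancy has decayed below a threshold:

```c
#include <stdint.h>

#define NR_RMIDS        71
#define REUSE_THRESHOLD (64 * 1024)  /* bytes; arbitrary, for illustration */

enum rmid_state { RMID_FREE = 0, RMID_BUSY, RMID_LIMBO };
static enum rmid_state rmid_state[NR_RMIDS];

/* Stand-in for reading the occupancy counter of the given RMID. */
static uint64_t fake_occupancy[NR_RMIDS];
static uint64_t read_occupancy(unsigned int rmid)
{
    return fake_occupancy[rmid];
}

static int rmid_alloc(void)
{
    for (unsigned int r = 1; r < NR_RMIDS; r++)  /* RMID 0 is reserved */
        if (rmid_state[r] == RMID_FREE) {
            rmid_state[r] = RMID_BUSY;
            return r;
        }
    return -1;  /* out of RMIDs, or all freed ones still in limbo */
}

static void rmid_release(unsigned int rmid)
{
    /* Don't reuse immediately: lines allocated by the old domain stay
     * accounted to this RMID until they are evicted from the cache. */
    rmid_state[rmid] = RMID_LIMBO;
}

/* To be run periodically: recover limbo RMIDs whose occupancy decayed. */
static void rmid_scan_limbo(void)
{
    for (unsigned int r = 1; r < NR_RMIDS; r++)
        if (rmid_state[r] == RMID_LIMBO &&
            read_occupancy(r) < REUSE_THRESHOLD)
            rmid_state[r] = RMID_FREE;
}
```

With something like this in place, the mix-up shown above could not happen: the RMID previously used by wheezy64 would not be handed to Domain-0 while kilobytes of stale occupancy were still charged to it. The cost is that, on RMID-starved hosts, allocation may transiently fail even though RMIDs have been freed.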

### RMID association asynchronicity

Associating an RMID with a domain, right now, happens upon toolstack request
from Dom0, and it becomes effective only at the next context switch involving
each of the domain's vCPUs.

The worst case scenario is a (new) RMID being assigned to a domain an instant
after all its vCPUs started running on some pCPUs. After a while (e.g., at the
end of their timeslice), they will be de-scheduled, and context switched back
in when their turn comes. It is only during this latter context switch that we
load the RMID into the special register, i.e., we have lost a full instance of
execution, and hence of cache load monitoring. This is not too big an issue for
the current use case, but it might be when thinking about using CMT from within
the Xen scheduler.

The detaching case is similar (actually, probably worse). In fact, let's assume
that, just an instant after all the vCPUs of a domain with an RMID attached
started executing, the domain is detached from CMT and another domain is
attached, using the RMID that just got freed. What happens is that the new
domain will be accounted for the cache load generated by a full instance of
execution of the old domain's vCPUs. Again, this is probably tolerable for the
current use case, but it could cause problems if we want to use the CMT feature
more extensively.

## Potential improvements and/or alternative approaches

### Per-vCPU cache monitoring

This means being able to tell how much of the L3 is being used by each vCPU.
Monitoring the cache occupancy of a specific domain would still be possible,
just by summing up the contributions from all the domain's vCPUs.
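The aggregation itself is trivial; a sketch, with a hypothetical per-vCPU sample structure:

```c
#include <stdint.h>

/* Hypothetical per-vCPU sample, as a per-vCPU CMT setup could expose. */
struct vcpu_cmt_sample {
    uint64_t l3_occupancy;  /* bytes, sampled for this vCPU's RMID */
};

/* Per-domain occupancy reconstructed as the sum of its vCPUs' samples. */
static uint64_t domain_l3_occupancy(const struct vcpu_cmt_sample *v,
                                    unsigned int nr_vcpus)
{
    uint64_t total = 0;
    for (unsigned int i = 0; i < nr_vcpus; i++)
        total += v[i].l3_occupancy;
    return total;
}
```

So per-vCPU monitoring strictly generalizes the current per-domain view, at the cost of one RMID per vCPU instead of one per domain.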

#### Benefits / Drawbacks

From the scheduler's perspective, this would be the best possible situation,
provided the information is precise enough, and is available for all (or at
least a relevant, for some definition of relevant, subset of) vCPUs.

For instance, Credit1 already includes a function called
\_\_csched\_vcpu\_is\_cache\_hot(), which is called when, during load balancing,
we consider whether or not to "steal" a vCPU from a remote pCPU. That may be
good for load balancing; nevertheless, we want to leave where they are the
vCPUs that have most of their data in the remote pCPU's cache. Right now, that
function tries to infer whether a vCPU is really "cache hot" by looking at how
long ago it ran; with something like per-vCPU CMT, it could use actual cache
occupancy samples.
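A sketch of the difference between the two heuristics (thresholds and function names are made up for illustration; the real time-based check lives in Credit1's code):

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_HOT_PERIOD_NS (1000 * 1000)  /* illustrative recency window */
#define CACHE_HOT_BYTES     (256 * 1024)   /* hypothetical occupancy bar */

/* What Credit1 effectively does today: "ran recently => assume hot". */
static bool vcpu_is_cache_hot_time(uint64_t now_ns, uint64_t last_sched_ns)
{
    return (now_ns - last_sched_ns) < CACHE_HOT_PERIOD_NS;
}

/* With per-vCPU CMT, an actual occupancy sample could be used instead:
 * a vCPU with little data in the cache is cheap to migrate no matter
 * how recently it ran. */
static bool vcpu_is_cache_hot_cmt(uint64_t l3_occupancy_bytes)
{
    return l3_occupancy_bytes >= CACHE_HOT_BYTES;
}
```

The occupancy-based test avoids both failure modes of the time-based one: a recently-run vCPU that touched almost no data (false "hot"), and a long-descheduled vCPU whose working set is still resident (false "cold").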

However, implementing per-vCPU cache monitoring would entail the following:

 * on large servers, running large (and/or many) guests, we may run out of
   RMIDs
 * more frequent RMID switching on the pCPUs, which (it being an MSR write)
   means increased overhead
 * rather frequent sampling of the cache occupancy value, which (it being an
   MSR read) means increased overhead

Moreover, since only L3 monitoring is supported right now, the information
would not be useful for deciding whether or not to migrate a vCPU between
pCPUs belonging to the same socket (in general, between pCPUs sharing the
same LLC).

It therefore seems that enabling per-vCPU monitoring, and using the information
it could provide from within the scheduler, would require overcoming most of
the current (both hardware and software) limitations of CMT support.

### Per-pCPU cache monitoring

This means being able to tell how much of the L3 is being used by the activity
running on each core sharing it. Such activity typically consists of the vCPUs
running on the various logical processors but, depending on the actual
implementation, it may or may not include the effects on cache occupancy of
hypervisor code execution.

Implementing this would require reserving an RMID for each logical processor,
and keeping it in the appropriate special register for as long as the activity
we are interested in monitoring is running on it. Again, due to the "L3 only"
limitation, this is not so useful to know from inside the scheduler. However,
it could be sensible and useful information for in-toolstack(s) balancing and
placement algorithms, as well as for the user, when deciding how to manually
place domains and set or tweak things like vCPUs' hard and soft affinities.

#### Benefits / Drawbacks

How much of the LLC is being used by each core is a fair measure of the load on
that core. As such, it can be reported all the way up to the user, as we
currently do with per-domain cache occupancy.

If such information is available for all cores, by summing up all their
contributions and subtracting the result from the size of the LLC, we obtain
how much free space there is in each L3. This is information that, for
example, the automatic NUMA placement algorithm we have in libxl can leverage,
to get a better idea of which pCPUs/NUMA nodes would be best for placing a new
domain.
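For instance, the placement logic could compute free L3 per socket and prefer the emptiest one. A sketch (names and the flat-array layout are made up; this is not the libxl algorithm itself):

```c
#include <stdint.h>

/* Free space in one socket's L3, given its total size and the occupancy
 * sampled for each of that socket's per-core RMIDs. */
static uint64_t socket_free_l3(uint64_t l3_size,
                               const uint64_t *core_occ,
                               unsigned int nr_cores)
{
    uint64_t used = 0;
    for (unsigned int i = 0; i < nr_cores; i++)
        used += core_occ[i];
    return used < l3_size ? l3_size - used : 0;
}

/* Candidate socket for placing a new domain: the one with most free L3. */
static unsigned int emptiest_socket(const uint64_t *free_l3,
                                    unsigned int nr_sockets)
{
    unsigned int best = 0;
    for (unsigned int s = 1; s < nr_sockets; s++)
        if (free_l3[s] > free_l3[best])
            best = s;
    return best;
}
```

In a real placement algorithm this metric would, of course, be combined with the existing criteria (free memory on the node, number of domains already placed there), not used on its own.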

In this form of "how much free cache there is on each L3 (on each socket)",
the Xen scheduler too could make somewhat good use of it, although, of course,
only when decisions involve more than one socket. Also, the overhead introduced
by such a CMT configuration is certainly not a concern.

If the association of RMIDs to cores has to be dynamic, i.e., if it has to be
possible for a user to attach, detach and re-attach a core to CMT, the issue
described in the previous section about RMID reuse is very much relevant for
this use case too. If it is static (e.g., established at boot and never
changed), that is less of a concern.

If the information about the amount of free L3 is to be used from within the
scheduler, some more MSR manipulation is required: we not only need to load
the RMIDs, we also need to sample the cache occupancy value from time to time.
That still looks feasible, but it is worth mentioning and considering (e.g.,
what sampling period should be used, etc.).
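A sketch of what such a sampling step looks like, with the MSR accessors stubbed out so the logic stands alone. The MSR numbers and bit layout are as described in the Intel SDM (vol. 3B, sec. 17.14); the byte-conversion factor comes from CPUID leaf 0xF:

```c
#include <stdint.h>

#define MSR_IA32_QM_EVTSEL   0x0c8d
#define MSR_IA32_QM_CTR      0x0c8e
#define QM_EVT_L3_OCCUPANCY  1ULL          /* event ID, EVTSEL bits 7:0 */
#define QM_CTR_ERROR         (1ULL << 63)
#define QM_CTR_UNAVAILABLE   (1ULL << 62)

/* Stubs standing in for the real rdmsr/wrmsr accessors. */
static uint64_t fake_msr[0xd000];
static void wrmsrl(uint32_t msr, uint64_t val) { fake_msr[msr] = val; }
static void rdmsrl(uint32_t msr, uint64_t *val) { *val = fake_msr[msr]; }

/* Sample L3 occupancy (in bytes) for an RMID; 'upscale' is the
 * conversion factor reported by CPUID.(EAX=0xF,ECX=1):EBX.
 * Returns -1 if the hardware flags the reading as invalid. */
static int64_t sample_l3_occupancy(unsigned int rmid, uint64_t upscale)
{
    uint64_t ctr;

    /* Select the event: RMID in bits 41:32, event ID in bits 7:0. */
    wrmsrl(MSR_IA32_QM_EVTSEL, ((uint64_t)rmid << 32) | QM_EVT_L3_OCCUPANCY);
    rdmsrl(MSR_IA32_QM_CTR, &ctr);
    if (ctr & (QM_CTR_ERROR | QM_CTR_UNAVAILABLE))
        return -1;
    return (int64_t)(ctr * upscale);
}
```

So each sample costs one MSR write plus one MSR read per monitored RMID, which is what makes the choice of sampling period worth some thought.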

The biggest drawback is probably the fact that, when enabling this per-core CMT
setup, monitoring specific domains, i.e., what we support right now, would be
really difficult. In principle, it just requires that:

 * when a core is executing the vCPU of a non-monitored domain, the RMID of the
   core is used;
 * when the vCPU of a monitored domain is context switched in, the RMID of the
   domain is used;
 * the cache load generated while the RMID of the domain is in use must be
   recorded and stored or accumulated somewhere;
 * when reading the cache occupancy of a core, the accumulated cache occupancy
   of the domains that ran on the core itself should be taken into account;
 * when a domain stops being monitored, its cache occupancy value should
   somehow be incorporated back into the core's cache occupancy;
 * when a domain is destroyed, its cache occupancy value should be discarded.
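The bookkeeping the middle points imply could be sketched as follows (all names hypothetical; this deliberately glosses over exactly the precision problem discussed next):

```c
#include <stdint.h>

/* Hypothetical per-core CMT state for when domain monitoring is allowed
 * to temporarily override the core's own RMID. */
struct core_cmt {
    uint64_t core_rmid_occ;  /* sampled via the core's own RMID */
    uint64_t dom_rmid_occ;   /* accumulated while domains' RMIDs were loaded */
};

/* Accumulate cache load generated under a monitored domain's RMID
 * (or fold a domain's occupancy back when it stops being monitored). */
static void core_account_domain(struct core_cmt *c, uint64_t delta)
{
    c->dom_rmid_occ += delta;
}

/* A core's reported occupancy must include what monitored domains
 * allocated while running on it. */
static uint64_t core_occupancy(const struct core_cmt *c)
{
    return c->core_rmid_occ + c->dom_rmid_occ;
}
```

The hard part is not this arithmetic, but obtaining a trustworthy `delta` per scheduling instance, which is exactly what the asynchronous RMID association currently prevents.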

Currently, point 3 can't be achieved, due to the lack of precision of the
measurements, deriving from the asynchronous (with respect to scheduling)
nature of RMID association, as described in previous sections. The points
after it are also quite difficult to implement, at least as far as reporting
the values up to the toolstack is concerned, given how reading cache occupancy
values is implemented (i.e., as a generic 'MSR read' platform hypercall).

Therefore, if we want to go for the per-core CMT configuration, the sane
alternatives seem to be either to just disallow monitoring of specific domains,
to avoid it screwing up the per-core monitoring, or the exact opposite: allow
it, and let it temporarily (i.e., for as long as a domain is monitored plus,
likely, some settling time after that) screw up the per-core monitoring, of
course warning the user about it.

## Conclusion

In extreme summary, the outcome of this investigation is that CMT (and the
other PSR facilities provided by Intel) is a nice-to-have and useful feature,
with the potential to be much more useful, depending on the evolution of both
hardware and software support for it.

Implementing the per-core CMT setup looks to be reasonably low hanging fruit,
although some design decisions need to be taken. For this reason, an RFC patch
series drafting an implementation of it is provided together with this
document.

Please see the changelogs of the individual patches, and find there more
opportunities to discuss the design of such a setup, if that is something
considered worth having in Xen.

Find the series in this git branch also:

 * [git://xenbits.xen.org/people/dariof/xen.git](http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=summary)  [wip/sched/icachemon](http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon)

---
Chao Peng (1):
      x86: improve psr scheduling code

Dario Faggioli (6):
      Xen: x86: print max usable RMID during init
      xen: psr: reserve an RMID for each core
      xen: libxc: libxl: report per-CPU cache occupancy up to libxl
      xen: libxc: libxl: allow for attaching and detaching a CPU to CMT
      xl: report per-CPU cache occupancy up to libxl
      xl: allow for attaching and detaching a CPU to CMT

 tools/libxc/include/xenctrl.h |    4 +
 tools/libxc/xc_psr.c          |   53 +++++++++++++
 tools/libxl/libxl.h           |    6 ++
 tools/libxl/libxl_psr.c       |   89 ++++++++++++++++++++++
 tools/libxl/xl_cmdimpl.c      |  132 +++++++++++++++++++++++++++++----
 tools/libxl/xl_cmdtable.c     |   14 ++--
 xen/arch/x86/domain.c         |    7 +-
 xen/arch/x86/psr.c            |  164 ++++++++++++++++++++++++++++++++---------
 xen/arch/x86/sysctl.c         |   42 +++++++++++
 xen/include/asm-x86/psr.h     |   14 +++-
 xen/include/public/sysctl.h   |    7 ++
 11 files changed, 467 insertions(+), 65 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH 1/7] x86: improve psr scheduling code
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
@ 2015-04-04  2:14 ` Dario Faggioli
  2015-04-06 13:48   ` Konrad Rzeszutek Wilk
  2015-04-04  2:14 ` [RFC PATCH 2/7] Xen: x86: print max usable RMID during init Dario Faggioli
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

From: Chao Peng <chao.p.peng@linux.intel.com>

Switching the RMID from the previous vcpu to the next vcpu only needs one
write of MSR_IA32_PSR_ASSOC. Writing it with the value for the next vcpu is
enough; there is no need to write '0' first. The idle domain has its RMID set
to 0 and, since the MSR is already updated lazily, it can just be switched to
like any other domain.

Also move the initialization of the per-CPU variable used for the lazy
update from context switch time to CPU starting time.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 xen/arch/x86/domain.c     |    7 +---
 xen/arch/x86/psr.c        |   89 +++++++++++++++++++++++++++++++++++----------
 xen/include/asm-x86/psr.h |    3 +-
 3 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 393aa26..73f5d7f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1443,8 +1443,6 @@ static void __context_switch(void)
     {
         memcpy(&p->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
         vcpu_save_fpu(p);
-        if ( psr_cmt_enabled() )
-            psr_assoc_rmid(0);
         p->arch.ctxt_switch_from(p);
     }
 
@@ -1469,11 +1467,10 @@ static void __context_switch(void)
         }
         vcpu_restore_fpu_eager(n);
         n->arch.ctxt_switch_to(n);
-
-        if ( psr_cmt_enabled() && n->domain->arch.psr_rmid > 0 )
-            psr_assoc_rmid(n->domain->arch.psr_rmid);
     }
 
+    psr_ctxt_switch_to(n->domain);
+
     gdt = !is_pv_32on64_vcpu(n) ? per_cpu(gdt_table, cpu) :
                                   per_cpu(compat_gdt_table, cpu);
     if ( need_full_gdt(n) )
diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
index 2ef83df..c902625 100644
--- a/xen/arch/x86/psr.c
+++ b/xen/arch/x86/psr.c
@@ -22,7 +22,6 @@
 
 struct psr_assoc {
     uint64_t val;
-    bool_t initialized;
 };
 
 struct psr_cmt *__read_mostly psr_cmt;
@@ -115,14 +114,6 @@ static void __init init_psr_cmt(unsigned int rmid_max)
     printk(XENLOG_INFO "Cache Monitoring Technology enabled\n");
 }
 
-static int __init init_psr(void)
-{
-    if ( (opt_psr & PSR_CMT) && opt_rmid_max )
-        init_psr_cmt(opt_rmid_max);
-    return 0;
-}
-__initcall(init_psr);
-
 /* Called with domain lock held, no psr specific lock needed */
 int psr_alloc_rmid(struct domain *d)
 {
@@ -168,26 +159,84 @@ void psr_free_rmid(struct domain *d)
     d->arch.psr_rmid = 0;
 }
 
-void psr_assoc_rmid(unsigned int rmid)
+static inline void psr_assoc_init(void)
 {
-    uint64_t val;
-    uint64_t new_val;
     struct psr_assoc *psra = &this_cpu(psr_assoc);
 
-    if ( !psra->initialized )
-    {
+    if ( psr_cmt_enabled() )
         rdmsrl(MSR_IA32_PSR_ASSOC, psra->val);
-        psra->initialized = 1;
+}
+
+static inline void psr_assoc_reg_read(struct psr_assoc *psra, uint64_t *reg)
+{
+    *reg = psra->val;
+}
+
+static inline void psr_assoc_reg_write(struct psr_assoc *psra, uint64_t reg)
+{
+    if ( reg != psra->val )
+    {
+        wrmsrl(MSR_IA32_PSR_ASSOC, reg);
+        psra->val = reg;
     }
-    val = psra->val;
+}
+
+static inline void psr_assoc_rmid(uint64_t *reg, unsigned int rmid)
+{
+    *reg = (*reg & ~rmid_mask) | (rmid & rmid_mask);
+}
+
+void psr_ctxt_switch_to(struct domain *d)
+{
+    uint64_t reg;
+    struct psr_assoc *psra = &this_cpu(psr_assoc);
+
+    psr_assoc_reg_read(psra, &reg);
 
-    new_val = (val & ~rmid_mask) | (rmid & rmid_mask);
-    if ( val != new_val )
+    if ( psr_cmt_enabled() )
+        psr_assoc_rmid(&reg, d->arch.psr_rmid);
+
+    psr_assoc_reg_write(psra, reg);
+}
+
+static void psr_cpu_init(unsigned int cpu)
+{
+    psr_assoc_init();
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_STARTING:
+        psr_cpu_init(cpu);
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback
+};
+
+static int __init psr_presmp_init(void)
+{
+    if ( (opt_psr & PSR_CMT) && opt_rmid_max )
+        init_psr_cmt(opt_rmid_max);
+
+    if (  psr_cmt_enabled() )
     {
-        wrmsrl(MSR_IA32_PSR_ASSOC, new_val);
-        psra->val = new_val;
+        psr_cpu_init(smp_processor_id());
+        register_cpu_notifier(&cpu_nfb);
     }
+
+    return 0;
 }
+presmp_initcall(psr_presmp_init);
 
 /*
  * Local variables:
diff --git a/xen/include/asm-x86/psr.h b/xen/include/asm-x86/psr.h
index c6076e9..585350c 100644
--- a/xen/include/asm-x86/psr.h
+++ b/xen/include/asm-x86/psr.h
@@ -46,7 +46,8 @@ static inline bool_t psr_cmt_enabled(void)
 
 int psr_alloc_rmid(struct domain *d);
 void psr_free_rmid(struct domain *d);
-void psr_assoc_rmid(unsigned int rmid);
+
+void psr_ctxt_switch_to(struct domain *d);
 
 #endif /* __ASM_PSR_H__ */


* [RFC PATCH 2/7] Xen: x86: print max usable RMID during init
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
@ 2015-04-04  2:14 ` Dario Faggioli
  2015-04-06 13:48   ` Konrad Rzeszutek Wilk
  2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Just print it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 xen/arch/x86/psr.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
index c902625..0f2a6ce 100644
--- a/xen/arch/x86/psr.c
+++ b/xen/arch/x86/psr.c
@@ -111,7 +111,8 @@ static void __init init_psr_cmt(unsigned int rmid_max)
     for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
         psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
 
-    printk(XENLOG_INFO "Cache Monitoring Technology enabled\n");
+    printk(XENLOG_INFO "Cache Monitoring Technology enabled, RMIDs: %u\n",
+           psr_cmt->rmid_max);
 }
 
 /* Called with domain lock held, no psr specific lock needed */


* [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 2/7] Xen: x86: print max usable RMID during init Dario Faggioli
@ 2015-04-04  2:14 ` Dario Faggioli
  2015-04-06 13:59   ` Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2015-04-04  2:14 ` [RFC PATCH 4/7] xen: libxc: libxl: report per-CPU cache occupancy up to libxl Dario Faggioli
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

This allows for a new item to be passed as part of the psr=
boot option: "percpu_cmt". If that is specified, Xen tries,
at boot time, to associate an RMID to each core.

XXX This all looks rather straightforward, if it weren't
    for the fact that it is, apparently, more common than
    I thought to run out of RMIDs. For example, on a dev box
    we have in Cambridge, there are 144 pCPUs and only 71
    RMIDs.

    In this preliminary version, nothing particularly smart
    happens if we run out of RMIDs: we just fail attaching
    the remaining cores and that's it. In future, I'd
    probably like to:
     + check up front whether the operation has any chance
       to succeed (by comparing the number of pCPUs with
       the available RMIDs)
     + on unexpected failure, roll back everything... it
       seems to make more sense to me than just leaving
       the system half configured for per-cpu CMT

    Thoughts?
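    The up-front check suggested above could look something like the
    minimal sketch below. The helper name and its parameters are
    hypothetical, for illustration only; they are not part of the patch.

```c
/* Hypothetical feasibility check for per-cpu CMT: with nr_pcpus pCPUs
 * and rmid_max usable RMIDs (minus those already in use), refuse to
 * even start attaching unless every pCPU can get its own RMID. */
static int percpu_cmt_feasible(unsigned int nr_pcpus,
                               unsigned int rmid_max,
                               unsigned int rmids_in_use)
{
    unsigned int free_rmids = rmid_max - rmids_in_use;

    /* 1 if every pCPU can be given an RMID, 0 otherwise */
    return free_rmids >= nr_pcpus;
}
```

    On the Cambridge box mentioned above (144 pCPUs, 71 RMIDs) such a
    check would fail immediately, instead of leaving the system half
    configured.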

XXX Another idea I have is to allow the user to
    somehow specify a different 'granularity'. Something
    like allowing 'percpu_cmt'|'percore_cmt'|'persocket_cmt',
    with the following meaning:
     + 'percpu_cmt': as in this patch
     + 'percore_cmt': same RMID for the hthreads of the
        same core
     + 'persocket_cmt': same RMID for all cores of the
        same socket.

    'percore_cmt' would only allow gathering info on a
    per-core basis... still better than nothing if we
    do not have enough RMIDs for all pCPUs.

    'persocket_cmt' would basically only allow tracking the
    amount of free L3 on each socket (by subtracting the
    monitored value from the total). Again, still better
    than nothing: it would use very few RMIDs, and I can
    think of ways of using this information in a few
    places in the scheduler...

    Again, thoughts?
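
    As a sketch of what that granularity selection might look like:
    every pCPU that maps to the same monitoring "unit" would share a
    single RMID. The enum and helper below are purely illustrative,
    not part of the patch.

```c
/* Hypothetical granularity selection: pCPUs mapping to the same unit
 * id would be attached to the same RMID. */
enum cmt_granularity { CMT_PER_CPU, CMT_PER_CORE, CMT_PER_SOCKET };

static unsigned int cmt_unit(enum cmt_granularity g, unsigned int cpu,
                             unsigned int core, unsigned int socket)
{
    switch ( g )
    {
    case CMT_PER_CORE:   return core;    /* hthreads of a core share an RMID */
    case CMT_PER_SOCKET: return socket;  /* all cores of a socket share one */
    case CMT_PER_CPU:
    default:             return cpu;     /* one RMID per pCPU, as in this patch */
    }
}
```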

XXX Finally, when a domain with its own RMID executes on
    a core that also has its own RMID, domain monitoring
    just overrides per-CPU monitoring. That means the
    cache occupancy reported for that pCPU is not accurate.

    For the reasons why this situation is difficult to deal
    with properly, see the document in the cover letter.

    Ideas on how to deal with this, either about how to
    make it work or how to handle it from a policy
    perspective (i.e., which mechanism should be disabled
    or penalized?), are very welcome.
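
    The override described above boils down to the RMID selection done
    at context switch time (cf. psr_assoc_rmid() in the diff): the
    domain's RMID wins whenever it is non-zero. Restated as a
    standalone sketch (the helper name is illustrative only):

```c
/* A monitored domain's RMID always wins; the per-CPU RMID is only in
 * effect while an unmonitored domain (rmid == 0) runs.  This is exactly
 * why per-CPU occupancy under-counts when monitored domains run on an
 * attached pCPU. */
static unsigned int effective_rmid(unsigned int dom_rmid,
                                   unsigned int pcpu_rmid)
{
    return dom_rmid ? dom_rmid : pcpu_rmid;
}
```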

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 xen/arch/x86/psr.c        |   72 ++++++++++++++++++++++++++++++++++++---------
 xen/include/asm-x86/psr.h |   11 ++++++-
 2 files changed, 67 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
index 0f2a6ce..a71391c 100644
--- a/xen/arch/x86/psr.c
+++ b/xen/arch/x86/psr.c
@@ -26,10 +26,13 @@ struct psr_assoc {
 
 struct psr_cmt *__read_mostly psr_cmt;
 static bool_t __initdata opt_psr;
+static bool_t __initdata opt_cpu_cmt;
 static unsigned int __initdata opt_rmid_max = 255;
 static uint64_t rmid_mask;
 static DEFINE_PER_CPU(struct psr_assoc, psr_assoc);
 
+DEFINE_PER_CPU(unsigned int, pcpu_rmid);
+
 static void __init parse_psr_param(char *s)
 {
     char *ss, *val_str;
@@ -57,6 +60,8 @@ static void __init parse_psr_param(char *s)
                                     val_str);
             }
         }
+        else if ( !strcmp(s, "percpu_cmt") )
+            opt_cpu_cmt = 1;
         else if ( val_str && !strcmp(s, "rmid_max") )
             opt_rmid_max = simple_strtoul(val_str, NULL, 0);
 
@@ -94,8 +99,8 @@ static void __init init_psr_cmt(unsigned int rmid_max)
     }
 
     psr_cmt->rmid_max = min(psr_cmt->rmid_max, psr_cmt->l3.rmid_max);
-    psr_cmt->rmid_to_dom = xmalloc_array(domid_t, psr_cmt->rmid_max + 1UL);
-    if ( !psr_cmt->rmid_to_dom )
+    psr_cmt->rmids = xmalloc_array(domid_t, psr_cmt->rmid_max + 1UL);
+    if ( !psr_cmt->rmids )
     {
         xfree(psr_cmt);
         psr_cmt = NULL;
@@ -107,56 +112,86 @@ static void __init init_psr_cmt(unsigned int rmid_max)
      * with it. To reduce the waste of RMID, reserve RMID 0 for all CPUs that
      * have no domain being monitored.
      */
-    psr_cmt->rmid_to_dom[0] = DOMID_XEN;
+    psr_cmt->rmids[0] = DOMID_XEN;
     for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
-        psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
+        psr_cmt->rmids[rmid] = DOMID_INVALID;
 
     printk(XENLOG_INFO "Cache Monitoring Technology enabled, RMIDs: %u\n",
            psr_cmt->rmid_max);
 }
 
-/* Called with domain lock held, no psr specific lock needed */
-int psr_alloc_rmid(struct domain *d)
+static int _psr_alloc_rmid(unsigned int *trmid, unsigned int id)
 {
     unsigned int rmid;
 
     ASSERT(psr_cmt_enabled());
 
-    if ( d->arch.psr_rmid > 0 )
+    if ( *trmid > 0 )
         return -EEXIST;
 
     for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
     {
-        if ( psr_cmt->rmid_to_dom[rmid] != DOMID_INVALID )
+        if ( psr_cmt->rmids[rmid] != DOMID_INVALID )
             continue;
 
-        psr_cmt->rmid_to_dom[rmid] = d->domain_id;
+        psr_cmt->rmids[rmid] = id;
         break;
     }
 
     /* No RMID available, assign RMID=0 by default. */
     if ( rmid > psr_cmt->rmid_max )
     {
-        d->arch.psr_rmid = 0;
+        *trmid = 0;
         return -EUSERS;
     }
 
-    d->arch.psr_rmid = rmid;
+    *trmid = rmid;
 
     return 0;
 }
 
+int psr_alloc_pcpu_rmid(unsigned int cpu)
+{
+    int ret;
+
+    /* XXX Any locking required? */
+    ret = _psr_alloc_rmid(&per_cpu(pcpu_rmid, cpu), DOMID_XEN);
+    if ( !ret )
+        printk(XENLOG_DEBUG "using RMID %u for CPU %u\n",
+               per_cpu(pcpu_rmid, cpu), cpu);
+
+    return ret;
+}
+
 /* Called with domain lock held, no psr specific lock needed */
-void psr_free_rmid(struct domain *d)
+int psr_alloc_rmid(struct domain *d)
 {
-    unsigned int rmid;
+    return _psr_alloc_rmid(&d->arch.psr_rmid, d->domain_id);
+}
 
-    rmid = d->arch.psr_rmid;
+static void _psr_free_rmid(unsigned int rmid)
+{
     /* We do not free system reserved "RMID=0". */
     if ( rmid == 0 )
         return;
 
-    psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
+    psr_cmt->rmids[rmid] = DOMID_INVALID;
+}
+
+void psr_free_pcpu_rmid(unsigned int cpu)
+{
+    printk(XENLOG_DEBUG "Freeing RMID %u. CPU %u no longer monitored\n",
+           per_cpu(pcpu_rmid, cpu), cpu);
+
+    /* XXX Any locking required? */
+    _psr_free_rmid(per_cpu(pcpu_rmid, cpu));
+    per_cpu(pcpu_rmid, cpu) = 0;
+}
+
+/* Called with domain lock held, no psr specific lock needed */
+void psr_free_rmid(struct domain *d)
+{
+    _psr_free_rmid(d->arch.psr_rmid);
     d->arch.psr_rmid = 0;
 }
 
@@ -184,6 +219,10 @@ static inline void psr_assoc_reg_write(struct psr_assoc *psra, uint64_t reg)
 
 static inline void psr_assoc_rmid(uint64_t *reg, unsigned int rmid)
 {
+    /* Domain not monitored: switch to the RMID of the pcpu (if any) */
+    if ( rmid == 0 )
+        rmid = this_cpu(pcpu_rmid);
+
     *reg = (*reg & ~rmid_mask) | (rmid & rmid_mask);
 }
 
@@ -202,6 +241,9 @@ void psr_ctxt_switch_to(struct domain *d)
 
 static void psr_cpu_init(unsigned int cpu)
 {
+    if ( opt_cpu_cmt && !psr_alloc_pcpu_rmid(cpu) )
+        printk(XENLOG_INFO "pcpu %u: using RMID %u\n",
+                cpu, per_cpu(pcpu_rmid, cpu));
     psr_assoc_init();
 }
 
diff --git a/xen/include/asm-x86/psr.h b/xen/include/asm-x86/psr.h
index 585350c..b70f605 100644
--- a/xen/include/asm-x86/psr.h
+++ b/xen/include/asm-x86/psr.h
@@ -33,17 +33,26 @@ struct psr_cmt_l3 {
 struct psr_cmt {
     unsigned int rmid_max;
     unsigned int features;
-    domid_t *rmid_to_dom;
+    domid_t *rmids;
     struct psr_cmt_l3 l3;
 };
 
 extern struct psr_cmt *psr_cmt;
 
+/*
+ * RMID associated to each core, to track the cache
+ * occupancy contribution of the core itself.
+ */
+DECLARE_PER_CPU(unsigned int, pcpu_rmid);
+
 static inline bool_t psr_cmt_enabled(void)
 {
     return !!psr_cmt;
 }
 
+int psr_alloc_pcpu_rmid(unsigned int cpu);
+void psr_free_pcpu_rmid(unsigned int cpu);
+
 int psr_alloc_rmid(struct domain *d);
 void psr_free_rmid(struct domain *d);


* [RFC PATCH 4/7] xen: libxc: libxl: report per-CPU cache occupancy up to libxl
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (2 preceding siblings ...)
  2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
@ 2015-04-04  2:14 ` Dario Faggioli
  2015-04-04  2:14 ` [RFC PATCH 5/7] xen: libxc: libxl: allow for attaching and detaching a CPU to CMT Dario Faggioli
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxc/include/xenctrl.h |    2 +
 tools/libxc/xc_psr.c          |   19 +++++++++++++
 tools/libxl/libxl.h           |    4 +++
 tools/libxl/libxl_psr.c       |   59 +++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/sysctl.c         |   14 ++++++++++
 xen/include/public/sysctl.h   |    5 +++
 6 files changed, 103 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 4e9537e..d038e40 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2702,6 +2702,8 @@ int xc_psr_cmt_get_l3_upscaling_factor(xc_interface *xch,
 int xc_psr_cmt_get_l3_event_mask(xc_interface *xch, uint32_t *event_mask);
 int xc_psr_cmt_get_l3_cache_size(xc_interface *xch, uint32_t cpu,
                                  uint32_t *l3_cache_size);
+int xc_psr_cmt_get_cpu_rmid(xc_interface *xch, uint32_t cpu,
+                            uint32_t *rmid);
 int xc_psr_cmt_get_data(xc_interface *xch, uint32_t rmid, uint32_t cpu,
                         uint32_t psr_cmt_type, uint64_t *monitor_data,
                         uint64_t *tsc);
diff --git a/tools/libxc/xc_psr.c b/tools/libxc/xc_psr.c
index e367a80..088cf66 100644
--- a/tools/libxc/xc_psr.c
+++ b/tools/libxc/xc_psr.c
@@ -158,6 +158,25 @@ int xc_psr_cmt_get_l3_cache_size(xc_interface *xch, uint32_t cpu,
     return rc;
 }
 
+int xc_psr_cmt_get_cpu_rmid(xc_interface *xch, uint32_t cpu,
+                            uint32_t *rmid)
+{
+    int rc;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_psr_cmt_op;
+    sysctl.u.psr_cmt_op.cmd = XEN_SYSCTL_PSR_CMT_get_cpu_rmid;
+    sysctl.u.psr_cmt_op.u.cpu_rmid.cpu = cpu;
+    sysctl.u.psr_cmt_op.flags = 0;
+
+    rc = xc_sysctl(xch, &sysctl);
+    if ( rc )
+        return -1;
+
+    *rmid = sysctl.u.psr_cmt_op.u.data;
+    return 0;
+}
+
 int xc_psr_cmt_get_data(xc_interface *xch, uint32_t rmid, uint32_t cpu,
                         xc_psr_cmt_type type, uint64_t *monitor_data,
                         uint64_t *tsc)
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 6bc75c5..23e266d 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1521,6 +1521,10 @@ int libxl_psr_cmt_get_cache_occupancy(libxl_ctx *ctx,
                                       uint32_t domid,
                                       uint32_t socketid,
                                       uint32_t *l3_cache_occupancy);
+int libxl_psr_cmt_cpu_attached(libxl_ctx *ctx, uint32_t cpu);
+int libxl_psr_cmt_get_cpu_cache_occupancy(libxl_ctx *ctx,
+                                          uint32_t cpu,
+                                          uint32_t *l3_cache_occupancy);
 #endif
 
 #ifdef LIBXL_HAVE_PSR_MBM
diff --git a/tools/libxl/libxl_psr.c b/tools/libxl/libxl_psr.c
index 3e1c792..f5688a3 100644
--- a/tools/libxl/libxl_psr.c
+++ b/tools/libxl/libxl_psr.c
@@ -247,6 +247,65 @@ out:
     return rc;
 }
 
+int libxl_psr_cmt_cpu_attached(libxl_ctx *ctx, uint32_t cpu)
+{
+    int rc;
+    uint32_t rmid;
+
+    rc = xc_psr_cmt_get_cpu_rmid(ctx->xch, cpu, &rmid);
+    if (rc)
+        return ERROR_FAIL;
+
+    return !!rmid;
+}
+
+int libxl_psr_cmt_get_cpu_cache_occupancy(libxl_ctx *ctx,
+                                          uint32_t cpu,
+                                          uint32_t *l3_cache_occupancy)
+{
+    GC_INIT(ctx);
+    unsigned int rmid;
+    uint32_t upscale;
+    uint64_t data;
+    int rc = 0;
+
+    if (!l3_cache_occupancy) {
+        LOGE(ERROR, "invalid parameter for returning cpu cache occupancy");
+        rc = ERROR_INVAL;
+        goto out;
+    }
+
+    rc = xc_psr_cmt_get_cpu_rmid(ctx->xch, cpu, &rmid);
+    if (rc || rmid == 0) {
+        LOGE(ERROR, "fail to get rmid or cpu not attached to monitoring");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = xc_psr_cmt_get_data(ctx->xch, rmid, cpu,
+                             LIBXL_PSR_CMT_TYPE_CACHE_OCCUPANCY - 1,
+                             &data, NULL);
+    if (rc) {
+        LOGE(ERROR, "failed to get monitoring data");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = xc_psr_cmt_get_l3_upscaling_factor(ctx->xch, &upscale);
+    if (rc) {
+        LOGE(ERROR, "failed to get L3 upscaling factor");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    data *= upscale;
+    *l3_cache_occupancy = data / 1024;
+
+ out:
+    GC_FREE;
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index 611a291..fc9838c 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -160,6 +160,20 @@ long arch_do_sysctl(
         case XEN_SYSCTL_PSR_CMT_get_l3_event_mask:
             sysctl->u.psr_cmt_op.u.data = psr_cmt->l3.features;
             break;
+        case XEN_SYSCTL_PSR_CMT_get_cpu_rmid:
+        {
+            unsigned int cpu = sysctl->u.psr_cmt_op.u.cpu_rmid.cpu;
+
+            if ( (cpu >= nr_cpu_ids) || !cpu_online(cpu) )
+            {
+                ret = -ENODEV;
+                sysctl->u.psr_cmt_op.u.data = 0;
+                break;
+            }
+
+            sysctl->u.psr_cmt_op.u.data = per_cpu(pcpu_rmid, cpu);
+            break;
+        }
         default:
             sysctl->u.psr_cmt_op.u.data = 0;
             ret = -ENOSYS;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 711441f..11c26c6 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -647,6 +647,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_sysctl_coverage_op_t);
 #define XEN_SYSCTL_PSR_CMT_get_l3_cache_size         2
 #define XEN_SYSCTL_PSR_CMT_enabled                   3
 #define XEN_SYSCTL_PSR_CMT_get_l3_event_mask         4
+#define XEN_SYSCTL_PSR_CMT_get_cpu_rmid              5
 struct xen_sysctl_psr_cmt_op {
     uint32_t cmd;       /* IN: XEN_SYSCTL_PSR_CMT_* */
     uint32_t flags;     /* padding variable, may be extended for future use */
@@ -656,6 +657,10 @@ struct xen_sysctl_psr_cmt_op {
             uint32_t cpu;   /* IN */
             uint32_t rsvd;
         } l3_cache;
+        struct {
+            uint32_t cpu;   /* IN */
+            uint32_t rsvd;
+        } cpu_rmid;
     } u;
 };
 typedef struct xen_sysctl_psr_cmt_op xen_sysctl_psr_cmt_op_t;


* [RFC PATCH 5/7] xen: libxc: libxl: allow for attaching and detaching a CPU to CMT
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (3 preceding siblings ...)
  2015-04-04  2:14 ` [RFC PATCH 4/7] xen: libxc: libxl: report per-CPU cache occupancy up to libxl Dario Faggioli
@ 2015-04-04  2:14 ` Dario Faggioli
  2015-04-04  2:15 ` [RFC PATCH 6/7] xl: report per-CPU cache occupancy up to libxl Dario Faggioli
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:14 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxc/include/xenctrl.h |    2 ++
 tools/libxc/xc_psr.c          |   34 ++++++++++++++++++++++++++++++++++
 tools/libxl/libxl.h           |    2 ++
 tools/libxl/libxl_psr.c       |   30 ++++++++++++++++++++++++++++++
 xen/arch/x86/sysctl.c         |   28 ++++++++++++++++++++++++++++
 xen/include/public/sysctl.h   |    2 ++
 6 files changed, 98 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index d038e40..7c17e3e 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2707,6 +2707,8 @@ int xc_psr_cmt_get_cpu_rmid(xc_interface *xch, uint32_t cpu,
 int xc_psr_cmt_get_data(xc_interface *xch, uint32_t rmid, uint32_t cpu,
                         uint32_t psr_cmt_type, uint64_t *monitor_data,
                         uint64_t *tsc);
+int xc_psr_cmt_cpu_attach(xc_interface *xch, uint32_t cpu);
+int xc_psr_cmt_cpu_detach(xc_interface *xch, uint32_t cpu);
 int xc_psr_cmt_enabled(xc_interface *xch);
 #endif
 
diff --git a/tools/libxc/xc_psr.c b/tools/libxc/xc_psr.c
index 088cf66..3d1b1cb 100644
--- a/tools/libxc/xc_psr.c
+++ b/tools/libxc/xc_psr.c
@@ -177,6 +177,40 @@ int xc_psr_cmt_get_cpu_rmid(xc_interface *xch, uint32_t cpu,
     return 0;
 }
 
+int xc_psr_cmt_cpu_attach(xc_interface *xch, uint32_t cpu)
+{
+    int rc;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_psr_cmt_op;
+    sysctl.u.psr_cmt_op.cmd = XEN_SYSCTL_PSR_CMT_cpu_rmid_attach;
+    sysctl.u.psr_cmt_op.u.cpu_rmid.cpu = cpu;
+    sysctl.u.psr_cmt_op.flags = 0;
+
+    rc = xc_sysctl(xch, &sysctl);
+    if ( rc )
+        return -1;
+
+    return 0;
+}
+
+int xc_psr_cmt_cpu_detach(xc_interface *xch, uint32_t cpu)
+{
+    int rc;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_psr_cmt_op;
+    sysctl.u.psr_cmt_op.cmd = XEN_SYSCTL_PSR_CMT_cpu_rmid_detach;
+    sysctl.u.psr_cmt_op.u.cpu_rmid.cpu = cpu;
+    sysctl.u.psr_cmt_op.flags = 0;
+
+    rc = xc_sysctl(xch, &sysctl);
+    if ( rc )
+        return -1;
+
+    return 0;
+}
+
 int xc_psr_cmt_get_data(xc_interface *xch, uint32_t rmid, uint32_t cpu,
                         xc_psr_cmt_type type, uint64_t *monitor_data,
                         uint64_t *tsc)
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 23e266d..1c1d5f0 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1525,6 +1525,8 @@ int libxl_psr_cmt_cpu_attached(libxl_ctx *ctx, uint32_t cpu);
 int libxl_psr_cmt_get_cpu_cache_occupancy(libxl_ctx *ctx,
                                           uint32_t cpu,
                                           uint32_t *l3_cache_occupancy);
+int libxl_psr_cmt_cpu_attach(libxl_ctx *ctx, uint32_t cpu);
+int libxl_psr_cmt_cpu_detach(libxl_ctx *ctx, uint32_t cpu);
 #endif
 
 #ifdef LIBXL_HAVE_PSR_MBM
diff --git a/tools/libxl/libxl_psr.c b/tools/libxl/libxl_psr.c
index f5688a3..6b7a7ba 100644
--- a/tools/libxl/libxl_psr.c
+++ b/tools/libxl/libxl_psr.c
@@ -306,6 +306,36 @@ int libxl_psr_cmt_get_cpu_cache_occupancy(libxl_ctx *ctx,
     return rc;
 }
 
+int libxl_psr_cmt_cpu_attach(libxl_ctx *ctx, uint32_t cpu)
+{
+    GC_INIT(ctx);
+    int rc;
+
+    rc = xc_psr_cmt_cpu_attach(ctx->xch, cpu);
+    if (rc) {
+        libxl__psr_cmt_log_err_msg(gc, errno);
+        rc = ERROR_FAIL;
+    }
+
+    GC_FREE;
+    return rc;
+}
+
+int libxl_psr_cmt_cpu_detach(libxl_ctx *ctx, uint32_t cpu)
+{
+    GC_INIT(ctx);
+    int rc;
+
+    rc = xc_psr_cmt_cpu_detach(ctx->xch, cpu);
+    if (rc) {
+        libxl__psr_cmt_log_err_msg(gc, errno);
+        rc = ERROR_FAIL;
+    }
+
+    GC_FREE;
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index fc9838c..93837e5 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -174,6 +174,34 @@ long arch_do_sysctl(
             sysctl->u.psr_cmt_op.u.data = per_cpu(pcpu_rmid, cpu);
             break;
         }
+        case XEN_SYSCTL_PSR_CMT_cpu_rmid_attach:
+        {
+            unsigned int cpu = sysctl->u.psr_cmt_op.u.cpu_rmid.cpu;
+
+            if ( (cpu >= nr_cpu_ids) || !cpu_online(cpu) )
+            {
+                ret = -ENODEV;
+                break;
+            }
+            ret = psr_alloc_pcpu_rmid(cpu);
+            break;
+        }
+        case XEN_SYSCTL_PSR_CMT_cpu_rmid_detach:
+        {
+            unsigned int cpu = sysctl->u.psr_cmt_op.u.cpu_rmid.cpu;
+
+            if ( (cpu >= nr_cpu_ids) || !cpu_online(cpu) )
+            {
+                ret = -ENODEV;
+                break;
+            }
+
+            if ( per_cpu(pcpu_rmid, cpu) )
+                psr_free_pcpu_rmid(cpu);
+            else
+                return -ENOENT;
+            break;
+        }
         default:
             sysctl->u.psr_cmt_op.u.data = 0;
             ret = -ENOSYS;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 11c26c6..0ae8758 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -648,6 +648,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_sysctl_coverage_op_t);
 #define XEN_SYSCTL_PSR_CMT_enabled                   3
 #define XEN_SYSCTL_PSR_CMT_get_l3_event_mask         4
 #define XEN_SYSCTL_PSR_CMT_get_cpu_rmid              5
+#define XEN_SYSCTL_PSR_CMT_cpu_rmid_attach           6
+#define XEN_SYSCTL_PSR_CMT_cpu_rmid_detach           7
 struct xen_sysctl_psr_cmt_op {
     uint32_t cmd;       /* IN: XEN_SYSCTL_PSR_CMT_* */
     uint32_t flags;     /* padding variable, may be extended for future use */


* [RFC PATCH 6/7] xl: report per-CPU cache occupancy up to libxl
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (4 preceding siblings ...)
  2015-04-04  2:14 ` [RFC PATCH 5/7] xen: libxc: libxl: allow for attaching and detaching a CPU to CMT Dario Faggioli
@ 2015-04-04  2:15 ` Dario Faggioli
  2015-04-04  2:15 ` [RFC PATCH 7/7] xl: allow for attaching and detaching a CPU to CMT Dario Faggioli
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:15 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Now that the functionality is wired, from within Xen
up to libxl, use it to implement a new mode for
`xl psr-cmt-show', by means of the '-c' switch.

With some pCPUs attached to CMT, the output looks as
follows:

  [root@redbrick ~]# xl psr-cmt-show -c cache_occupancy
  Socket 0:      46080 KB Total L3 Cache Size
   CPU 0:         936 KB
   CPU 1:          72 KB
   CPU 3:           0 KB
  Socket 1:      46080 KB Total L3 Cache Size
   CPU 36:         144 KB
   CPU 48:           0 KB
  Socket 2:      46080 KB Total L3 Cache Size
   CPU 74:           0 KB
   CPU 92:           0 KB
  Socket 3:      46080 KB Total L3 Cache Size
   CPU 121:           0 KB

XXX Columns can be aligned better, I know. ;-P

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxl/xl_cmdimpl.c  |   88 +++++++++++++++++++++++++++++++++++++++++----
 tools/libxl/xl_cmdtable.c |    2 +
 2 files changed, 82 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 394b55d..d314947 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -8098,6 +8098,62 @@ static void psr_cmt_print_domain_info(libxl_dominfo *dominfo,
     printf("\n");
 }
 
+static bool psr_cmt_print_cpu_info(unsigned int cpu)
+{
+    uint32_t l3_occ;
+
+    if (libxl_psr_cmt_get_cpu_cache_occupancy(ctx, cpu, &l3_occ)) {
+        fprintf(stderr, "can't read the cache occupancy for cpu %u\n", cpu);
+        return false;
+    }
+
+    fprintf(stdout, " CPU %u: %11"PRIu32" KB\n", cpu, l3_occ);
+
+    return true;
+}
+
+static int psr_cmt_show_cpus(int cpu)
+{
+    libxl_cputopology *info;
+    int i, rc = 0, nr_cpus = 0;
+    int socket;
+
+    info = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (info == NULL) {
+        fprintf(stderr, "libxl_get_topologyinfo failed.\n");
+        return 1;
+    }
+
+    socket = -1;
+    for (i = cpu > 0 ? cpu : 0; i <= (cpu > 0 ? cpu : nr_cpus-1); i++) {
+        uint32_t l3_size;
+
+        if (!libxl_psr_cmt_cpu_attached(ctx, i))
+            continue;
+
+        if (socket != info[i].socket) {
+            if (libxl_psr_cmt_get_l3_cache_size(ctx, info[i].socket,
+                                                &l3_size)) {
+                fprintf(stderr, "Failed to get L3 cache size for socket:%d\n",
+                        info[i].socket);
+                rc = 1;
+                goto out;
+            }
+            fprintf(stdout, "Socket %u:%11"PRIu32" KB Total L3 Cache Size\n",
+                    info[i].socket, l3_size);
+            socket = info[i].socket;
+        }
+        if (!psr_cmt_print_cpu_info(i)) {
+            rc = 1;
+            goto out;
+        }
+    }
+
+ out:
+    libxl_cputopology_list_free(info, nr_cpus);
+    return rc;
+}
+
 static int psr_cmt_show(libxl_psr_cmt_type type, uint32_t domid)
 {
     uint32_t i, socketid, nr_sockets, total_rmid;
@@ -8105,11 +8161,6 @@ static int psr_cmt_show(libxl_psr_cmt_type type, uint32_t domid)
     libxl_physinfo info;
     int rc, nr_domains;
 
-    if (!libxl_psr_cmt_enabled(ctx)) {
-        fprintf(stderr, "CMT is disabled in the system\n");
-        return -1;
-    }
-
     if (!libxl_psr_cmt_type_supported(ctx, type)) {
         fprintf(stderr, "Monitor type '%s' is not supported in the system\n",
                 libxl_psr_cmt_type_to_string(type));
@@ -8213,11 +8264,20 @@ int main_psr_cmt_detach(int argc, char **argv)
 int main_psr_cmt_show(int argc, char **argv)
 {
     int opt, ret = 0;
+    bool cpus = false;
     uint32_t domid;
     libxl_psr_cmt_type type;
+    static struct option opts[] = {
+        {"cpus", 0, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
 
-    SWITCH_FOREACH_OPT(opt, "", NULL, "psr-cmt-show", 1) {
-        /* No options */
+
+    SWITCH_FOREACH_OPT(opt, "c", opts, "psr-cmt-show", 1) {
+        case 'c':
+            cpus = true;
+            break;
     }
 
     if (!strcmp(argv[optind], "cache_occupancy"))
@@ -8231,6 +8291,15 @@ int main_psr_cmt_show(int argc, char **argv)
         return 2;
     }
 
+    if (cpus) {
+        if (type != LIBXL_PSR_CMT_TYPE_CACHE_OCCUPANCY) {
+            fprintf(stderr, "CPU monitoring supported for cache only\n");
+            return -1;
+        }
+        ret = psr_cmt_show_cpus(-1);
+        return ret;
+    }
+
     if (optind + 1 >= argc)
         domid = INVALID_DOMID;
     else if (optind + 1 == argc - 1)
@@ -8240,6 +8309,11 @@ int main_psr_cmt_show(int argc, char **argv)
         return 2;
     }
 
+    if (!libxl_psr_cmt_enabled(ctx)) {
+        fprintf(stderr, "CMT is disabled in the system\n");
+        return -1;
+    }
+
     ret = psr_cmt_show(type, domid);
 
     return ret;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 9284887..5bbe406 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -537,7 +537,7 @@ struct cmd_spec cmd_table[] = {
     { "psr-cmt-show",
       &main_psr_cmt_show, 0, 1,
       "Show Cache Monitoring Technology information",
-      "<PSR-CMT-Type> <Domain>",
+      "<PSR-CMT-Type> [-c | <Domain>]",
       "Available monitor types:\n"
       "\"cache_occupancy\":         Show L3 cache occupancy(KB)\n"
       "\"total_mem_bandwidth\":     Show total memory bandwidth(KB/s)\n"


* [RFC PATCH 7/7] xl: allow for attaching and detaching a CPU to CMT
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (5 preceding siblings ...)
  2015-04-04  2:15 ` [RFC PATCH 6/7] xl: report per-CPU cache occupancy up to libxl Dario Faggioli
@ 2015-04-04  2:15 ` Dario Faggioli
  2015-04-07  8:19 ` [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Chao Peng
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-04  2:15 UTC (permalink / raw)
  To: Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3,
	Dongxiao Xu, JBeulich, Chao Peng

Now that the functionality is wired, from within
Xen up to libxl, use it to implement a new mode
for `xl psr-cmt-attach' and `xl psr-cmt-detach',
by means of a new '-c' switch:

[root@redbrick ~]# xl psr-cmt-detach -c 4
[root@redbrick ~]# xl psr-cmt-attach -c 121

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxl/xl_cmdimpl.c  |   44 ++++++++++++++++++++++++++++++++++----------
 tools/libxl/xl_cmdtable.c |   12 ++++++++----
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index d314947..7f7d995 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -8233,30 +8233,54 @@ static int psr_cmt_show(libxl_psr_cmt_type type, uint32_t domid)
 
 int main_psr_cmt_attach(int argc, char **argv)
 {
-    uint32_t domid;
+    uint32_t id;
+    bool cpu = false;
     int opt, ret = 0;
+    static struct option opts[] = {
+        {"cpu", 0, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
 
-    SWITCH_FOREACH_OPT(opt, "", NULL, "psr-cmt-attach", 1) {
-        /* No options */
+    SWITCH_FOREACH_OPT(opt, "c", opts, "psr-cmt-attach", 1) {
+        case 'c':
+            cpu = true;
+            break;
     }
 
-    domid = find_domain(argv[optind]);
-    ret = libxl_psr_cmt_attach(ctx, domid);
+    if (cpu) {
+        id = atoi(argv[optind]);
+        return libxl_psr_cmt_cpu_attach(ctx, id);
+    }
+    id = find_domain(argv[optind]);
+    ret = libxl_psr_cmt_attach(ctx, id);
 
     return ret;
 }
 
 int main_psr_cmt_detach(int argc, char **argv)
 {
-    uint32_t domid;
+    uint32_t id;
+    bool cpu = false;
     int opt, ret = 0;
+    static struct option opts[] = {
+        {"cpu", 0, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
 
-    SWITCH_FOREACH_OPT(opt, "", NULL, "psr-cmt-detach", 1) {
-        /* No options */
+    SWITCH_FOREACH_OPT(opt, "c", opts, "psr-cmt-detach", 1) {
+        case 'c':
+            cpu = true;
+            break;
     }
 
-    domid = find_domain(argv[optind]);
-    ret = libxl_psr_cmt_detach(ctx, domid);
+    if (cpu) {
+        id = atoi(argv[optind]);
+        return libxl_psr_cmt_cpu_detach(ctx, id);
+    }
+    id = find_domain(argv[optind]);
+    ret = libxl_psr_cmt_detach(ctx, id);
 
     return ret;
 }
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 5bbe406..886dd8a 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -526,13 +526,17 @@ struct cmd_spec cmd_table[] = {
 #ifdef LIBXL_HAVE_PSR_CMT
     { "psr-cmt-attach",
       &main_psr_cmt_attach, 0, 1,
-      "Attach Cache Monitoring Technology service to a domain",
-      "<Domain>",
+      "Attach Cache Monitoring Technology service to a domain or a pCPU",
+      "[-c|--cpu] <id>",
+      "By default (no -c), <id> is the domain id of the domain to start monitoring.\n"
+      "-c|--cpu <id>           Attach monitoring to CPU <id>."
     },
     { "psr-cmt-detach",
       &main_psr_cmt_detach, 0, 1,
-      "Detach Cache Monitoring Technology service from a domain",
-      "<Domain>",
+      "Detach Cache Monitoring Technology service from a domain or a pCPU",
+      "[-c|--cpu] <id>",
+      "By default (no -c), <id> is the domain id of the domain to stop monitoring.\n"
+      "-c|--cpu <id>           Detach monitoring from CPU <id>."
     },
     { "psr-cmt-show",
       &main_psr_cmt_show, 0, 1,


* Re: [RFC PATCH 1/7] x86: improve psr scheduling code
  2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
@ 2015-04-06 13:48   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-04-06 13:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3, Xen-devel,
	Dongxiao Xu, JBeulich, Chao Peng

On Sat, Apr 04, 2015 at 04:14:24AM +0200, Dario Faggioli wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Switching the RMID from the previous vcpu to the next one only needs a
> single write to MSR_IA32_PSR_ASSOC: writing it with the next vcpu's value
> is enough, there is no need to write '0' first. The idle domain has RMID 0
> and, since the MSR is already updated lazily, it can be switched just like
> any other domain.
> 
> Also move the initialization of the per-CPU variable used for the lazy
> update from context switch time to CPU bring-up.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  xen/arch/x86/domain.c     |    7 +---
>  xen/arch/x86/psr.c        |   89 +++++++++++++++++++++++++++++++++++----------
>  xen/include/asm-x86/psr.h |    3 +-
>  3 files changed, 73 insertions(+), 26 deletions(-)
> 
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 393aa26..73f5d7f 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1443,8 +1443,6 @@ static void __context_switch(void)
>      {
>          memcpy(&p->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
>          vcpu_save_fpu(p);
> -        if ( psr_cmt_enabled() )
> -            psr_assoc_rmid(0);
>          p->arch.ctxt_switch_from(p);
>      }
>  
> @@ -1469,11 +1467,10 @@ static void __context_switch(void)
>          }
>          vcpu_restore_fpu_eager(n);
>          n->arch.ctxt_switch_to(n);
> -
> -        if ( psr_cmt_enabled() && n->domain->arch.psr_rmid > 0 )
> -            psr_assoc_rmid(n->domain->arch.psr_rmid);
>      }
>  
> +    psr_ctxt_switch_to(n->domain);
> +
>      gdt = !is_pv_32on64_vcpu(n) ? per_cpu(gdt_table, cpu) :
>                                    per_cpu(compat_gdt_table, cpu);
>      if ( need_full_gdt(n) )
> diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
> index 2ef83df..c902625 100644
> --- a/xen/arch/x86/psr.c
> +++ b/xen/arch/x86/psr.c
> @@ -22,7 +22,6 @@
>  
>  struct psr_assoc {
>      uint64_t val;
> -    bool_t initialized;
>  };
>  
>  struct psr_cmt *__read_mostly psr_cmt;
> @@ -115,14 +114,6 @@ static void __init init_psr_cmt(unsigned int rmid_max)
>      printk(XENLOG_INFO "Cache Monitoring Technology enabled\n");
>  }
>  
> -static int __init init_psr(void)
> -{
> -    if ( (opt_psr & PSR_CMT) && opt_rmid_max )
> -        init_psr_cmt(opt_rmid_max);
> -    return 0;
> -}
> -__initcall(init_psr);
> -
>  /* Called with domain lock held, no psr specific lock needed */
>  int psr_alloc_rmid(struct domain *d)
>  {
> @@ -168,26 +159,84 @@ void psr_free_rmid(struct domain *d)
>      d->arch.psr_rmid = 0;
>  }
>  
> -void psr_assoc_rmid(unsigned int rmid)
> +static inline void psr_assoc_init(void)
>  {
> -    uint64_t val;
> -    uint64_t new_val;
>      struct psr_assoc *psra = &this_cpu(psr_assoc);
>  
> -    if ( !psra->initialized )
> -    {
> +    if ( psr_cmt_enabled() )
>          rdmsrl(MSR_IA32_PSR_ASSOC, psra->val);
> -        psra->initialized = 1;
> +}
> +
> +static inline void psr_assoc_reg_read(struct psr_assoc *psra, uint64_t *reg)
> +{
> +    *reg = psra->val;
> +}
> +
> +static inline void psr_assoc_reg_write(struct psr_assoc *psra, uint64_t reg)
> +{
> +    if ( reg != psra->val )
> +    {
> +        wrmsrl(MSR_IA32_PSR_ASSOC, reg);
> +        psra->val = reg;
>      }
> -    val = psra->val;
> +}
> +
> +static inline void psr_assoc_rmid(uint64_t *reg, unsigned int rmid)
> +{
> +    *reg = (*reg & ~rmid_mask) | (rmid & rmid_mask);
> +}
> +
> +void psr_ctxt_switch_to(struct domain *d)
> +{
> +    uint64_t reg;
> +    struct psr_assoc *psra = &this_cpu(psr_assoc);
> +
> +    psr_assoc_reg_read(psra, &reg);
>  
> -    new_val = (val & ~rmid_mask) | (rmid & rmid_mask);
> -    if ( val != new_val )
> +    if ( psr_cmt_enabled() )
> +        psr_assoc_rmid(&reg, d->arch.psr_rmid);
> +
> +    psr_assoc_reg_write(psra, reg);
> +}
> +
> +static void psr_cpu_init(unsigned int cpu)
> +{
> +    psr_assoc_init();
> +}
> +
> +static int cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned int cpu = (unsigned long)hcpu;
> +
> +    switch ( action )
> +    {
> +    case CPU_STARTING:
> +        psr_cpu_init(cpu);
> +        break;
> +    }

You could just make it

	if ( action == CPU_STARTING )
		psr_assoc_init();

	return NOTIFY_DONE;

Instead of this big switch statement with casting and such. Though,
oddly enough, your psr_assoc_init() figures out the CPU by running
with 'this_cpu'. Why not make psr_assoc_init() accept the CPU value?


> +
> +    return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block cpu_nfb = {
> +    .notifier_call = cpu_callback
> +};
> +
> +static int __init psr_presmp_init(void)
> +{
> +    if ( (opt_psr & PSR_CMT) && opt_rmid_max )
> +        init_psr_cmt(opt_rmid_max);
> +
> +    if (  psr_cmt_enabled() )

Extra space.
>      {
> -        wrmsrl(MSR_IA32_PSR_ASSOC, new_val);
> -        psra->val = new_val;
> +        psr_cpu_init(smp_processor_id());
> +        register_cpu_notifier(&cpu_nfb);
>      }
> +
> +    return 0;
>  }
> +presmp_initcall(psr_presmp_init);
>  
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-x86/psr.h b/xen/include/asm-x86/psr.h
> index c6076e9..585350c 100644
> --- a/xen/include/asm-x86/psr.h
> +++ b/xen/include/asm-x86/psr.h
> @@ -46,7 +46,8 @@ static inline bool_t psr_cmt_enabled(void)
>  
>  int psr_alloc_rmid(struct domain *d);
>  void psr_free_rmid(struct domain *d);
> -void psr_assoc_rmid(unsigned int rmid);
> +
> +void psr_ctxt_switch_to(struct domain *d);
>  
>  #endif /* __ASM_PSR_H__ */
>  
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [RFC PATCH 2/7] Xen: x86: print max usable RMID during init
  2015-04-04  2:14 ` [RFC PATCH 2/7] Xen: x86: print max usable RMID during init Dario Faggioli
@ 2015-04-06 13:48   ` Konrad Rzeszutek Wilk
  2015-04-07 10:11     ` Dario Faggioli
  0 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-04-06 13:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3, Xen-devel,
	Dongxiao Xu, JBeulich, Chao Peng

On Sat, Apr 04, 2015 at 04:14:33AM +0200, Dario Faggioli wrote:
> Just print it.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  xen/arch/x86/psr.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
> index c902625..0f2a6ce 100644
> --- a/xen/arch/x86/psr.c
> +++ b/xen/arch/x86/psr.c
> @@ -111,7 +111,8 @@ static void __init init_psr_cmt(unsigned int rmid_max)
>      for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
>          psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
>  
> -    printk(XENLOG_INFO "Cache Monitoring Technology enabled\n");
> +    printk(XENLOG_INFO "Cache Monitoring Technology enabled, RMIDs: %u\n",

max RMID: ?

> +           psr_cmt->rmid_max);
>  }
>  
>  /* Called with domain lock held, no psr specific lock needed */
> 
> 


* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
@ 2015-04-06 13:59   ` Konrad Rzeszutek Wilk
  2015-04-07 10:19     ` Dario Faggioli
  2015-04-07  8:24   ` Chao Peng
  2015-04-08 13:28   ` George Dunlap
  2 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-04-06 13:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3, Xen-devel,
	Dongxiao Xu, JBeulich, Chao Peng

On Sat, Apr 04, 2015 at 04:14:41AM +0200, Dario Faggioli wrote:
> This allows for a new item to be passed as part of the psr=
> boot option: "percpu_cmt". If that is specified, Xen tries,
> at boot time, to associate an RMID to each core.
> 
> XXX This all looks rather straightforward, if it weren't
>     for the fact that it is, apparently, more common than
>     I thought to run out of RMIDs. For example, on a dev box
>     we have in Cambridge, there are 144 pCPUs and only 71
>     RMIDs.
> 
>     In this preliminary version, nothing particularly smart
>     happens if we run out of RMIDs, we just fail attaching
>     the remaining cores and that's it. In future, I'd
>     probably like to:
>      + check whether the operation has any chance to
>        succeed up front (by comparing number of pCPUs with
>        available RMIDs)
>      + on unexpected failure, rollback everything... it
>        seems to make more sense to me than just leaving
>        the system half configured for per-cpu CMT
> 
>     Thoughts?
> 
> XXX Another idea I just have is to allow the user to
>     somehow specify a different 'granularity'. Something
>     like allowing 'percpu_cmt'|'percore_cmt'|'persocket_cmt'
>     with the following meaning:
>      + 'percpu_cmt': as in this patch
>      + 'percore_cmt': same RMID to hthreads of the same core
>      + 'persocket_cmt': same RMID to all cores of the same
>         socket.
> 
>     'percore_cmt' would only allow gathering info on a
>     per-core basis... still better than nothing if we
>     do not have enough RMIDs for each pCPU.

Could we allocate nr_online_cpus() / nr_rmids() and have
some CPUs share the same RMIDs?

> 
>     'persocket_cmt' would basically only allow to track the
>     amount of free L3 on each socket (by subtracting the
>     monitored value from the total). Again, still better
>     than nothing, would use very few RMIDs, and I could
>     think of ways of using this information in a few
>     places in the scheduler...
> 
>     Again, thoughts?
> 
> XXX Finally, when a domain with its own RMID executes on
>     a core that also has its own RMID, domain monitoring
>     just overrides per-CPU monitoring. That means the
>     cache occupancy reported for that pCPU is not accurate.
> 
>     For reasons why this situation is difficult to deal
>     with properly, see the document in the cover letter.
> 
>     Ideas on how to deal with this, either about how to
>     make it work or how to handle the thing from a
>     'policying' perspective (i.e., which one mechanism
>     should be disabled or penalized?), are very welcome
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  xen/arch/x86/psr.c        |   72 ++++++++++++++++++++++++++++++++++++---------
>  xen/include/asm-x86/psr.h |   11 ++++++-
>  2 files changed, 67 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
> index 0f2a6ce..a71391c 100644
> --- a/xen/arch/x86/psr.c
> +++ b/xen/arch/x86/psr.c
> @@ -26,10 +26,13 @@ struct psr_assoc {
>  
>  struct psr_cmt *__read_mostly psr_cmt;
>  static bool_t __initdata opt_psr;
> +static bool_t __initdata opt_cpu_cmt;
>  static unsigned int __initdata opt_rmid_max = 255;
>  static uint64_t rmid_mask;
>  static DEFINE_PER_CPU(struct psr_assoc, psr_assoc);
>  
> +DEFINE_PER_CPU(unsigned int, pcpu_rmid);
> +
>  static void __init parse_psr_param(char *s)
>  {
>      char *ss, *val_str;
> @@ -57,6 +60,8 @@ static void __init parse_psr_param(char *s)
>                                      val_str);
>              }
>          }
> +        else if ( !strcmp(s, "percpu_cmt") )
> +            opt_cpu_cmt = 1;
>          else if ( val_str && !strcmp(s, "rmid_max") )
>              opt_rmid_max = simple_strtoul(val_str, NULL, 0);
>  
> @@ -94,8 +99,8 @@ static void __init init_psr_cmt(unsigned int rmid_max)
>      }
>  
>      psr_cmt->rmid_max = min(psr_cmt->rmid_max, psr_cmt->l3.rmid_max);
> -    psr_cmt->rmid_to_dom = xmalloc_array(domid_t, psr_cmt->rmid_max + 1UL);
> -    if ( !psr_cmt->rmid_to_dom )
> +    psr_cmt->rmids = xmalloc_array(domid_t, psr_cmt->rmid_max + 1UL);
> +    if ( !psr_cmt->rmids )
>      {
>          xfree(psr_cmt);
>          psr_cmt = NULL;
> @@ -107,56 +112,86 @@ static void __init init_psr_cmt(unsigned int rmid_max)
>       * with it. To reduce the waste of RMID, reserve RMID 0 for all CPUs that
>       * have no domain being monitored.
>       */
> -    psr_cmt->rmid_to_dom[0] = DOMID_XEN;
> +    psr_cmt->rmids[0] = DOMID_XEN;
>      for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
> -        psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
> +        psr_cmt->rmids[rmid] = DOMID_INVALID;
>  
>      printk(XENLOG_INFO "Cache Monitoring Technology enabled, RMIDs: %u\n",
>             psr_cmt->rmid_max);
>  }
>  
> -/* Called with domain lock held, no psr specific lock needed */
> -int psr_alloc_rmid(struct domain *d)
> +static int _psr_alloc_rmid(unsigned int *trmid, unsigned int id)
>  {
>      unsigned int rmid;
>  
>      ASSERT(psr_cmt_enabled());
>  
> -    if ( d->arch.psr_rmid > 0 )
> +    if ( *trmid > 0 )
>          return -EEXIST;
>  
>      for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
>      {
> -        if ( psr_cmt->rmid_to_dom[rmid] != DOMID_INVALID )
> +        if ( psr_cmt->rmids[rmid] != DOMID_INVALID )
>              continue;
>  
> -        psr_cmt->rmid_to_dom[rmid] = d->domain_id;
> +        psr_cmt->rmids[rmid] = id;
>          break;
>      }
>  
>      /* No RMID available, assign RMID=0 by default. */
>      if ( rmid > psr_cmt->rmid_max )
>      {
> -        d->arch.psr_rmid = 0;
> +        *trmid = 0;
>          return -EUSERS;
>      }
>  
> -    d->arch.psr_rmid = rmid;
> +    *trmid = rmid;
>  
>      return 0;
>  }
>  
> +int psr_alloc_pcpu_rmid(unsigned int cpu)
> +{
> +    int ret;
> +
> +    /* XXX Any locking required? */

It shouldn't be needed on the per-cpu resources in the hotplug CPU routines.
> +    ret = _psr_alloc_rmid(&per_cpu(pcpu_rmid, cpu), DOMID_XEN);
> +    if ( !ret )
> +        printk(XENLOG_DEBUG "using RMID %u for CPU %u\n",
> +               per_cpu(pcpu_rmid, cpu), cpu);
> +
> +    return ret;
> +}
> +
>  /* Called with domain lock held, no psr specific lock needed */
> -void psr_free_rmid(struct domain *d)
> +int psr_alloc_rmid(struct domain *d)
>  {
> -    unsigned int rmid;
> +    return _psr_alloc_rmid(&d->arch.psr_rmid, d->domain_id);
> +}
>  
> -    rmid = d->arch.psr_rmid;
> +static void _psr_free_rmid(unsigned int rmid)
> +{
>      /* We do not free system reserved "RMID=0". */
>      if ( rmid == 0 )
>          return;
>  
> -    psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
> +    psr_cmt->rmids[rmid] = DOMID_INVALID;
> +}
> +
> +void psr_free_pcpu_rmid(unsigned int cpu)
> +{
> +    printk(XENLOG_DEBUG "Freeing RMID %u. CPU %u no longer monitored\n",
> +           per_cpu(pcpu_rmid, cpu), cpu);
> +
> +    /* XXX Any locking required? */

No idea. Not clear who calls this.

> +    _psr_free_rmid(per_cpu(pcpu_rmid, cpu));
> +    per_cpu(pcpu_rmid, cpu) = 0;
> +}
> +
> +/* Called with domain lock held, no psr specific lock needed */
> +void psr_free_rmid(struct domain *d)
> +{
> +    _psr_free_rmid(d->arch.psr_rmid);
>      d->arch.psr_rmid = 0;
>  }
>  
> @@ -184,6 +219,10 @@ static inline void psr_assoc_reg_write(struct psr_assoc *psra, uint64_t reg)
>  
>  static inline void psr_assoc_rmid(uint64_t *reg, unsigned int rmid)
>  {
> +    /* Domain not monitored: switch to the RMID of the pcpu (if any) */
> +    if ( rmid == 0 )
> +        rmid = this_cpu(pcpu_rmid);
> +
>      *reg = (*reg & ~rmid_mask) | (rmid & rmid_mask);
>  }
>  
> @@ -202,6 +241,9 @@ void psr_ctxt_switch_to(struct domain *d)
>  
>  static void psr_cpu_init(unsigned int cpu)
>  {
> +    if ( opt_cpu_cmt && !psr_alloc_pcpu_rmid(cpu) )
> +        printk(XENLOG_INFO "pcpu %u: using RMID %u\n",
> +                cpu, per_cpu(pcpu_rmid, cpu));
>      psr_assoc_init();
>  }
>  
> diff --git a/xen/include/asm-x86/psr.h b/xen/include/asm-x86/psr.h
> index 585350c..b70f605 100644
> --- a/xen/include/asm-x86/psr.h
> +++ b/xen/include/asm-x86/psr.h
> @@ -33,17 +33,26 @@ struct psr_cmt_l3 {
>  struct psr_cmt {
>      unsigned int rmid_max;
>      unsigned int features;
> -    domid_t *rmid_to_dom;
> +    domid_t *rmids;
>      struct psr_cmt_l3 l3;
>  };
>  
>  extern struct psr_cmt *psr_cmt;
>  
> +/*
> + * RMID associated to each core, to track the cache
> + * occupancy contribution of the core itself.
> + */
> +DECLARE_PER_CPU(unsigned int, pcpu_rmid);
> +
>  static inline bool_t psr_cmt_enabled(void)
>  {
>      return !!psr_cmt;
>  }
>  
> +int psr_alloc_pcpu_rmid(unsigned int cpu);
> +void psr_free_pcpu_rmid(unsigned int cpu);
> +
>  int psr_alloc_rmid(struct domain *d);
>  void psr_free_rmid(struct domain *d);
>  
> 
> 


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (6 preceding siblings ...)
  2015-04-04  2:15 ` [RFC PATCH 7/7] xl: allow for attaching and detaching a CPU to CMT Dario Faggioli
@ 2015-04-07  8:19 ` Chao Peng
  2015-04-07  9:51   ` Dario Faggioli
  2015-04-07 10:27 ` Andrew Cooper
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Chao Peng @ 2015-04-07  8:19 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3, Xen-devel,
	Dongxiao Xu, JBeulich

On Sat, Apr 04, 2015 at 04:14:15AM +0200, Dario Faggioli wrote:
> Hi Everyone,
> 
> This RFC series is the outcome of an investigation I've been doing about
> whether we can take better advantage of features like Intel CMT (and of PSR
> features in general). By "take better advantage of" them I mean, for example,
> use the data obtained from monitoring within the scheduler and/or within
> libxl's automatic NUMA placement algorithm, or similar.
> 
> I'm putting here in the cover letter a markdown document I wrote to better
> describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> fetch it at the following links:
> 
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
> 
> See the document itself and the changelog of the various patches for details.

Very good summary and analysis of possible usages. Most of the problems
do exist; some of them may be partially solved, but some look
unavoidable.

> 
> The series includes one Chao's patch on top, as I found it convenient to build
> on top of it. The series itself is available here:
> 
>   git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon
> 
> Thanks a lot to everyone that will read and reply! :-)
> 
> Regards,
> Dario
> ---
> 
> This is exactly what happens in the current implementation. Result looks as
> follows:
> 
>     [root@redbrick ~]# xl psr-cmt-attach 0
>     [root@redbrick ~]# xl psr-cmt-attach 1
>     Total RMID: 71
>     Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
>     Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
>     Domain-0                                     0         6768 KB            0 KB            0 KB            0 KB
>     wheezy64                                     1            0 KB          144 KB          144 KB            0 KB
> 
> Let's assume that RMID 1 (RMID 0 is reserved) is used for Domain-0 and RMID 2
> is used for wheezy64. Then:
> 
>     [root@redbrick ~]# xl psr-cmt-detach 0
>     [root@redbrick ~]# xl psr-cmt-detach 1
> 
> So now both RMID 1 and 2 are free to be reused. Now, let's issue the following
> commands:
> 
>     [root@redbrick ~]# xl psr-cmt-attach 1
>     [root@redbrick ~]# xl psr-cmt-attach 0
> 
> Which means that RMID 1 is now assigned to wheezy64, and RMID 2 is given to
> Domain-0. Here's the effect:
> 
>     [root@redbrick ~]# xl psr-cmt-show cache_occupancy
>     Total RMID: 71
>     Name                                        ID        Socket 0        Socket 1        Socket 2        Socket 3
>     Total L3 Cache Size                                   46080 KB        46080 KB        46080 KB        46080 KB
>     Domain-0                                     0          216 KB          144 KB          144 KB            0 KB
>     wheezy64                                     1         7416 KB            0 KB         1872 KB            0 KB
> 
> It looks quite likely that the 144KB occupancy on sockets 1 and 2, now being
> accounted to Domain-0, is really what has been allocated by domain wheezy64,
> before the RMID "switch". The same applies to the 7416KB on socket 0 now
> accounted to wheezy64, i.e., most of this is not accurate and was allocated
> there by Domain-0.
> 
> This is only a simple example, others have been performed, restricting the
> affinity of the various domains involved in order to control on what socket
> cache load were to be expected, and all confirm the above reasoning.
> 
> It is rather easy to appreciate that any kind of 'flushing' mechanism, to be
> triggered when reusing an RMID (if anything like that even exists!) would
> impact system performance (e.g., it is not an option in hot paths), but the
> situation outlined above needs to be fixed, before the mechanism could be
> considered usable and reliable enough to do anything on top of it.

As far as I know, no such 'flushing' mechanism is available at present.
One possible software solution to lighten this issue is rotating the
RMIDs with an algorithm like 'use the oldest unused RMID first'.

Chao


* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
  2015-04-06 13:59   ` Konrad Rzeszutek Wilk
@ 2015-04-07  8:24   ` Chao Peng
  2015-04-07 10:07     ` Dario Faggioli
  2015-04-08 13:28   ` George Dunlap
  2 siblings, 1 reply; 32+ messages in thread
From: Chao Peng @ 2015-04-07  8:24 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, andrew.cooper3, Xen-devel,
	Dongxiao Xu, JBeulich

On Sat, Apr 04, 2015 at 04:14:41AM +0200, Dario Faggioli wrote:
> This allows for a new item to be passed as part of the psr=
> boot option: "percpu_cmt". If that is specified, Xen tries,
> at boot time, to associate an RMID to each core.
> 
> XXX This all looks rather straightforward, if it weren't
>     for the fact that it is, apparently, more common than
>     I thought to run out of RMIDs. For example, on a dev box
>     we have in Cambridge, there are 144 pCPUs and only 71
>     RMIDs.
> 
>     In this preliminary version, nothing particularly smart
>     happens if we run out of RMIDs, we just fail attaching
>     the remaining cores and that's it. In future, I'd
>     probably like to:
>      + check whether the operation has any chance to
>        succeed up front (by comparing number of pCPUs with
>        available RMIDs)
>      + on unexpected failure, rollback everything... it
>        seems to make more sense to me than just leaving
>        the system half configured for per-cpu CMT
> 
>     Thoughts?
> 
> XXX Another idea I just have is to allow the user to
>     somehow specify a different 'granularity'. Something
>     like allowing 'percpu_cmt'|'percore_cmt'|'persocket_cmt'
>     with the following meaning:
>      + 'percpu_cmt': as in this patch
>      + 'percore_cmt': same RMID to hthreads of the same core
>      + 'persocket_cmt': same RMID to all cores of the same
>         socket.
> 
>     'percore_cmt' would only allow gathering info on a
>     per-core basis... still better than nothing if we
>     do not have enough RMIDs for each pCPU.
> 
>     'persocket_cmt' would basically only allow to track the
>     amount of free L3 on each socket (by subtracting the
>     monitored value from the total). Again, still better
>     than nothing, would use very few RMIDs, and I could
>     think of ways of using this information in a few
>     places in the scheduler...
> 
>     Again, thoughts?

This can even be extended to the concept of a 'cache monitoring group',
which can gather arbitrary CPUs into one group. The Linux implementation
actually does this, using the cgroup mechanism to allocate an RMID to a
group of threads. Such a design can somewhat alleviate the RMID shortage.

Chao


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-07  8:19 ` [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Chao Peng
@ 2015-04-07  9:51   ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-07  9:51 UTC (permalink / raw)
  To: chao.p.peng
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel,
	dongxiao.xu, JBeulich



On Tue, 2015-04-07 at 16:19 +0800, Chao Peng wrote:
> On Sat, Apr 04, 2015 at 04:14:15AM +0200, Dario Faggioli wrote:

> > I'm putting here in the cover letter a markdown document I wrote to better
> > describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> > fetch it at the following links:
> > 
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
> > 
> > See the document itself and the changelog of the various patches for details.
> 
> Very good summary and possible usage analysis. 
>
Thanks. :-)

> Most of the problems do
> exist; some of them may be partially solved, but some look
> unavoidable.
> 
I see.

> > It is rather easy to appreciate that any kind of 'flushing' mechanism, to be
> > triggered when reusing an RMID (if anything like that even exists!) would
> > impact system performance (e.g., it is not an option in hot paths), but the
> > situation outlined above needs to be fixed, before the mechanism could be
> > considered usable and reliable enough to do anything on top of it.
> 
> As far as I know, no such 'flushing' mechanism is available at present.
> One possible software solution to lighten this issue is rotating the
> RMIDs with an algorithm like 'use the oldest unused RMID first'.
> 
Ok. Yes, that was something I was thinking of as well. It certainly
would make the issue less severe / less likely to happen.

Let's see what others think.

Regards,
Dario



* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-07  8:24   ` Chao Peng
@ 2015-04-07 10:07     ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-07 10:07 UTC (permalink / raw)
  To: chao.p.peng
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel, JBeulich



On Tue, 2015-04-07 at 16:24 +0800, Chao Peng wrote:
> On Sat, Apr 04, 2015 at 04:14:41AM +0200, Dario Faggioli wrote:

> >     'persocket_cmt' would basically only allow to track the
> >     amount of free L3 on each socket (by subtracting the
> >     monitored value from the total). Again, still better
> >     than nothing, would use very few RMIDs, and I could
> >     think of ways of using this information in a few
> >     places in the scheduler...
> > 
> >     Again, thought?
> 
> This even can be extended to the concept of 'cache monitoring group',
> which can hold arbitrary cpus into one group. 
>
Yes, indeed.

> Actually Linux
> implementation does this by using the cgoup mechanism to allocate RMID
> to a group of threads. 
>
Does it? I dug through the threads of the Linux patches up to a certain
point, and it seemed that the maintainer disliked the idea of PSR
cgroups. I have to admit I stopped at some point, and did not check what
the checked-in code actually looks like.

Anyway, yes, I'll explore the 'grouping idea'. In fact, in order to be
able to use CMT (and other PSR features) from inside the scheduler, the
shortage of RMIDs is probably not the biggest issue.

I'm much more concerned about the difficulties in making both "static"
monitoring (i.e., the per-core/per-group monitoring required by the
scheduler and other Xen components) and "dynamic" monitoring (i.e., the
per-domain monitoring, happening on user request, via `xl
psr-cmt-attach') play well together, if enabled at the same time.

If you've got time and are interested in providing your view on that, it
would be great.

> Such design can solve the RMID-shortage somehow.
> 
Exactly. Actually, I've got a question: I think I remember reading in
Intel's SDM that RMIDs are scoped per socket. If that is the case, it
means I could use, say, RMID #18 for both core 4, on socket 0, and core
73, on socket 2; can you confirm that? I'm asking because, as I said, I
think I read it in the manual, but the Xen implementation does not seem
to take advantage of this (perhaps because it wasn't necessary?).

Thanks and Regards,
Dario



* Re: [RFC PATCH 2/7] Xen: x86: print max usable RMID during init
  2015-04-06 13:48   ` Konrad Rzeszutek Wilk
@ 2015-04-07 10:11     ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-07 10:11 UTC (permalink / raw)
  To: konrad.wilk
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel,
	JBeulich, chao.p.peng



On Mon, 2015-04-06 at 09:48 -0400, Konrad Rzeszutek Wilk wrote:
> On Sat, Apr 04, 2015 at 04:14:33AM +0200, Dario Faggioli wrote:

> > diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
> > index c902625..0f2a6ce 100644
> > --- a/xen/arch/x86/psr.c
> > +++ b/xen/arch/x86/psr.c
> > @@ -111,7 +111,8 @@ static void __init init_psr_cmt(unsigned int rmid_max)
> >      for ( rmid = 1; rmid <= psr_cmt->rmid_max; rmid++ )
> >          psr_cmt->rmid_to_dom[rmid] = DOMID_INVALID;
> >  
> > -    printk(XENLOG_INFO "Cache Monitoring Technology enabled\n");
> > +    printk(XENLOG_INFO "Cache Monitoring Technology enabled, RMIDs: %u\n",
> 
> max RMID: ?
> 
I considered it, but don't like it.

IMO 'max' tells something about the values one can use as RMIDs, not
about the fact that they are a limited resource.

Perhaps something like "available RMIDs:" or "total nr. of RMIDs" ?

Thanks and Regards,
Dario



* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-06 13:59   ` Konrad Rzeszutek Wilk
@ 2015-04-07 10:19     ` Dario Faggioli
  2015-04-07 13:57       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dario Faggioli @ 2015-04-07 10:19 UTC (permalink / raw)
  To: konrad.wilk
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel,
	JBeulich, chao.p.peng



On Mon, 2015-04-06 at 09:59 -0400, Konrad Rzeszutek Wilk wrote:
> On Sat, Apr 04, 2015 at 04:14:41AM +0200, Dario Faggioli wrote:

> > XXX Another idea I just have is to allow the user to
> >     somehow specify a different 'granularity'. Something
> >     like allowing 'percpu_cmt'|'percore_cmt'|'persocket_cmt'
> >     with the following meaning:
> >      + 'percpu_cmt': as in this patch
> >      + 'percore_cmt': same RMID to hthreads of the same core
> >      + 'persocket_cmt': same RMID to all cores of the same
> >         socket.
> > 
> >     'percore_cmt' would only allow gathering info on a
> >     per-core basis... still better than nothing if we
> >     do not have enough RMIDs for each pCPUs.
> 
> Could we allocate nr_online_cpus() / nr_pmids() and have
> some CPUs share the same PMIDs?
> 
Mmm... I hope we can (see the reply to Chao about the per-socket
nature of the RMIDs).

I'm not sure what you mean here, though. In the box I have at hand there
are 144 CPUs and 71 RMIDs. So, 144/71=2... maybe I'm missing something
in what you mean; how should I use these 2 RMIDs?

If RMIDs actually are per-socket, extending the existing Xen support to
reflect that, and taking advantage of it, would help a lot already. In such
a box, it would mean I could use RMIDs 1-36 on each socket for per-CPU
monitoring, and still have 35 RMIDs free (which could be 35x4=140,
depending on *how* we extend the support to match the per-socket nature of
RMIDs).
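The arithmetic above can be written down as a toy model (plain C, not Xen
code; the socket/CPU/RMID counts are just the ones for the box mentioned
here):

```c
#include <assert.h>

/* Toy model of per-socket RMID allocation: if RMIDs are a per-socket
 * resource, each socket can hand out its own copies of RMIDs 1..rmid_max,
 * instead of all sockets sharing one global pool. */
struct rmid_pool {
    unsigned int sockets;
    unsigned int cpus_per_socket;
    unsigned int rmid_max;        /* highest usable RMID per socket */
};

/* RMIDs left on a socket after giving one to each of its pCPUs */
static unsigned int rmids_free_per_socket(const struct rmid_pool *p)
{
    return p->rmid_max - p->cpus_per_socket;
}

/* Total spare RMIDs across the machine in the per-socket scheme */
static unsigned int rmids_free_total(const struct rmid_pool *p)
{
    return rmids_free_per_socket(p) * p->sockets;
}
```

With 4 sockets of 36 CPUs and 71 usable RMIDs per socket, this gives the
35-per-socket / 140-total figures quoted above.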

Let's see if that is confirmed... Of course, I can book the box again
here and test it myself (and will do that, if necessary :-D).

Thanks and Regards,
Dario


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (7 preceding siblings ...)
  2015-04-07  8:19 ` [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Chao Peng
@ 2015-04-07 10:27 ` Andrew Cooper
  2015-04-07 13:10   ` Dario Faggioli
  2015-04-08 11:27   ` George Dunlap
  2015-04-08 11:30 ` George Dunlap
  2015-04-09 15:37 ` Meng Xu
  10 siblings, 2 replies; 32+ messages in thread
From: Andrew Cooper @ 2015-04-07 10:27 UTC (permalink / raw)
  To: Dario Faggioli, Xen-devel
  Cc: wei.liu2, Ian.Campbell, George.Dunlap, Dongxiao Xu, JBeulich, Chao Peng

On 04/04/2015 03:14, Dario Faggioli wrote:
> Hi Everyone,
>
> This RFC series is the outcome of an investigation I've been doing about
> whether we can take better advantage of features like Intel CMT (and of PSR
> features in general). By "take better advantage of" them I mean, for example,
> use the data obtained from monitoring within the scheduler and/or within
> libxl's automatic NUMA placement algorithm, or similar.
>
> I'm putting here in the cover letter a markdown document I wrote to better
> describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> fetch it at the following links:
>
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
>
> See the document itself and the changelog of the various patches for details.
>
> The series includes one Chao's patch on top, as I found it convenient to build
> on top of it. The series itself is available here:
>
>   git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon
>
> Thanks a lot to everyone that will read and reply! :-)
>
> Regards,
> Dario
> ---

There seem to be several areas of confusion indicated in your document. 
I am unsure whether this is a side effect of the way you have written
it, but here are (hopefully) some words of clarification.  To the best
of my knowledge:

PSR CMT works by tagging cache lines with the currently-active RMID. 
The cache utilisation is a count of the number of lines which are tagged
with a specific RMID.  MBM on the other hand counts the number of cache
line fills and cache line evictions tagged with a specific RMID.

By this nature, the information will never reveal the exact state of
play.  e.g. a core with RMID A which gets a cache line hit against a
line currently tagged with RMID B will not alter any accounting. 
Furthermore, as alterations of the RMID only occur in
__context_switch(), Xen actions such as handling an interrupt will be
accounted against the currently active domain (or other future
granularity of RMID).

"max_rmid" is a per-socket property.  There is no requirement for it to
be the same for each socket in a system, although it is likely, given a
homogeneous system.  The limit on RMID is based on the size of the
accounting table.

As far as MSRs themselves go, an extra MSR write in the context switch
path is likely to pale into the noise.  However, querying the data is an
indirect MSR read (write to the event select MSR, read from the data
MSR).  Furthermore there is no way to atomically read all data at once
which means that activity on other cores can interleave with
back-to-back reads in the scheduler.
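Concretely, the two-step read described above goes through IA32_QM_EVTSEL
and IA32_QM_CTR. The sketch below models that sequence; the MSR indices and
field layout (RMID in bits 41:32 and event ID in the low bits of EVTSEL,
error/unavailable flags in bits 63:62 of the data MSR) are my reading of the
SDM, and the wrmsr/rdmsr stubs are stand-ins so the sketch is self-contained:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_QM_EVTSEL 0x0c8d
#define MSR_QM_CTR    0x0c8e
#define QOS_MONITOR_EVT_L3_OCCUPANCY 0x01

/* Stubs standing in for the real MSR accessors: they model a single
 * monitoring counter, just so the sequencing can be shown. */
static uint64_t evtsel, fake_occupancy;

static void wrmsr_stub(uint32_t msr, uint64_t val)
{
    if ( msr == MSR_QM_EVTSEL )
        evtsel = val;
}

static uint64_t rdmsr_stub(uint32_t msr)
{
    return (msr == MSR_QM_CTR) ? fake_occupancy : 0;
}

/* The indirect read: select (RMID, event) first, then read the data MSR.
 * Bit 63 (error) and bit 62 (unavailable) flag bad reads. */
static uint64_t read_l3_occupancy(unsigned int rmid)
{
    uint64_t val;

    wrmsr_stub(MSR_QM_EVTSEL,
               ((uint64_t)rmid << 32) | QOS_MONITOR_EVT_L3_OCCUPANCY);
    val = rdmsr_stub(MSR_QM_CTR);

    return (val & (3ULL << 62)) ? 0 : val;  /* treat error/unavail as 0 */
}
```

Nothing prevents another core from writing its own EVTSEL between the two
accesses, which is exactly the interleaving problem noted above.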


As far as the plans here go, I have some concerns.  PSR is only
available on server platforms, which will be 2/4 socket systems with
large numbers of cores.  As you have discovered, there are insufficient
RMIDs for redbrick's pcpus, and on a system that size, XenServer typically
gets 7x vcpus to pcpus.

I think it is unrealistic to expect to use any scheduler scheme which is
per-pcpu or per-vcpu while the RMID limit is as small as it is. 
Depending on workload, even a per-domain scheme might be problematic. 
One of our tests involves running 500xWin7 VMs on that particular box.

~Andrew

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-07 10:27 ` Andrew Cooper
@ 2015-04-07 13:10   ` Dario Faggioli
  2015-04-08  5:59     ` Chao Peng
  2015-04-09 15:44     ` Meng Xu
  2015-04-08 11:27   ` George Dunlap
  1 sibling, 2 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-07 13:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, George Dunlap, xen-devel, JBeulich, chao.p.peng


On Tue, 2015-04-07 at 11:27 +0100, Andrew Cooper wrote:
> On 04/04/2015 03:14, Dario Faggioli wrote:
>
> > I'm putting here in the cover letter a markdown document I wrote to better
> > describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> > fetch it at the following links:
> >
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
> >
> > See the document itself and the changelog of the various patches for details.

> 
> There seem to be several areas of confusion indicated in your document. 
>
I see. Sorry for that then.

> I am unsure whether this is a side effect of the way you have written
> it, but here are (hopefully) some words of clarification.
>
And thanks for this. :-)

> PSR CMT works by tagging cache lines with the currently-active RMID. 
> The cache utilisation is a count of the number of lines which are tagged
> with a specific RMID.  MBM on the other hand counts the number of cache
> line fills and cache line evictions tagged with a specific RMID.
> 
Ok.

> By this nature, the information will never reveal the exact state of
> play.  e.g. a core with RMID A which gets a cache line hit against a
> line currently tagged with RMID B will not alter any accounting. 
>
So, you're saying that the information we get is an approximation of
reality, not it's 100% accurate representation. That is no news, IMO.
When, inside Credit2, we try to track the average load on each runqueue,
that is an approximation. When, in Credit1, we consider a vcpu "cache
hot" if it ran recently, that is an approximation. Etc. These
approximations happen fully in software, because in those cases it is
possible.

PSR provides data and insights on something that, without hardware
support, we couldn't possibly hope to know anything about. Whether we
should think about using such data or not depends on whether they
represent (a base for) a reasonable enough approximation, or are
just a bunch of pseudo-random numbers.

It seems to me that you are suggesting the latter to be more likely than
the former, i.e., PSR does not provide a good enough approximation for
being used from inside Xen and toolstack, is my understanding correct?

> Furthermore, as alterations of the RMID only occur in
> __context_switch(), Xen actions such as handling an interrupt will be
> accounted against the currently active domain (or other future
> granularity of RMID).
> 
Yes, I thought about this. However, while this is certainly important for
per-domain, or for an (unlikely) future per-vcpu, monitoring, if you
attach an RMID to a pCPU (or a group of pCPUs) then that is not really a
problem.

Actually, it's the correct behavior: running Xen and serving interrupts
on a certain core, in that case, *does* need to be accounted for! So,
considering that both the document and the RFC series are mostly focused
on introducing per-pcpu/core/socket monitoring, rather than on
per-domain monitoring, and given that the document was becoming quite
long, I decided not to add a section about this.

> "max_rmid" is a per-socket property.  There is no requirement for it to
> be the same for each socket in a system, although it is likely, given a
> homogeneous system.
>
I know. Again this was not mentioned for document length reasons, but I
planned to ask about this (as I've done that already this morning, as
you can see. :-D).

In this case, though, it probably was something worth being mentioned,
so I will if there will ever be a v2 of the document. :-)

Mostly, I was curious to learn why that is not reflected in the current
implementation, i.e., whether there are any reasons why we should not
take advantage of per-socketness of RMIDs, as reported by SDM, as that
can greatly help mitigating RMID shortage in the per-CPU/core/socket
configuration (in general, actually, but it's per-cpu that I'm
interested in).

> The limit on RMID is based on the size of the
> accounting table.
> 
I did not know the details, but it makes sense. Getting feedback on what
should be expected as the number of available RMIDs in current and future
hardware, from Intel people and from everyone who knows (like you :-D),
was the main purpose of sending this out, so thanks.

> As far as MSRs themselves go, an extra MSR write in the context switch
> path is likely to pale into the noise.  However, querying the data is an
> indirect MSR read (write to the event select MSR, read from  the data
> MSR).  Furthermore there is no way to atomically read all data at once
> which means that activity on other cores can interleave with
> back-to-back reads in the scheduler.
> 
All true. And in fact, how, and how frequently, data should be gathered
remains to be decided (as said in the document). I was thinking more of
some periodic sampling, rather than throwing handfuls of rdmsr/wrmsr
against the code that makes scheduling decisions! :-D

> As far as the plans here go, I have some concerns.  PSR is only
> available on server platforms, which will be 2/4 socket systems with
> large numbers of cores.  As you have discovered, there insufficient
> RMIDs for redbrick pcpus, and on a system that size, XenServer typically
> gets 7x vcpus to pcpus.
> 
> I think it is unrealistic to expect to use any scheduler scheme which is
> per-pcpu or per-vcpu while the RMID limit is as small as it is. 
>
On the per-vcpu schemes, I fully agree. However, it was necessary to
mention it, IMO, and explain why that is the case... Being able to
monitor single vCPUs would be pretty cool, and whether it is possible
is likely one of the first things that someone looking at this
technology for the first time would want to know. It's not, and I
thought that not stating so, and not explaining the reasons why, would
have been quite a deficiency in such a document.

On per-pcpu schemes, I mostly agree. Although exploiting the per-socket
nature of RMIDs, if possible, seems to offer a viable solution.

What I'm not sure I got is your opinion on per-pcpu or per-socket
schemes.

> Depending on workload, even a per-domain scheme might be problematic. 
> One of our tests involves running 500xWin7 VMs on that particular box.
> 
Yep. And in fact, I didn't even mention using any per-domain scheme for
scheduling, as it has the same disadvantages as per-vcpu schemes in
terms of RMID usage (a few multi-vcpu domains == many single-vcpu
domains), and it's useless for the scheduler, which barely knows what
a domain is.

Regards, and Thanks a lot for your feedback. :-)
Dario


* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-07 10:19     ` Dario Faggioli
@ 2015-04-07 13:57       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-04-07 13:57 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel,
	JBeulich, chao.p.peng

On Tue, Apr 07, 2015 at 10:19:22AM +0000, Dario Faggioli wrote:
> On Mon, 2015-04-06 at 09:59 -0400, Konrad Rzeszutek Wilk wrote:
> > On Sat, Apr 04, 2015 at 04:14:41AM +0200, Dario Faggioli wrote:
> 
> > > XXX Another idea I just have is to allow the user to
> > >     somehow specify a different 'granularity'. Something
> > >     like allowing 'percpu_cmt'|'percore_cmt'|'persocket_cmt'
> > >     with the following meaning:
> > >      + 'percpu_cmt': as in this patch
> > >      + 'percore_cmt': same RMID to hthreads of the same core
> > >      + 'persocket_cmt': same RMID to all cores of the same
> > >         socket.
> > > 
> > >     'percore_cmt' would only allow gathering info on a
> > >     per-core basis... still better than nothing if we
> > >     do not have enough RMIDs for each pCPUs.
> > 
> > Could we allocate nr_online_cpus() / nr_pmids() and have
> > some CPUs share the same PMIDs?
> > 
> Mmm... I hope we can (see the reply to Chao about the per-socketness
> nature of the RMIDs).
> 
> I'm not sure what you mean here, though. In the box I have at hand there
> are 144 CPUs and 71 RMIDs. So, 144/71=2... maybe I'm missing something
> of what you mean, how should I use these 2 RMIDs?

The other way - so 2 CPUs use 1 RMID.

> 
> If RMIDs actually are per-socket, extending the existing Xen support to
> reflect that, and take advantage of it would help a lot already. In such
> box, it would mean I could use RMIDs 1-36, on each socket, per per-CPU
> monitoring, and still have 35 RMIDs free (which could be 35x4=140,
> depending *how* we extend te support to match the per-socket nature of
> RMIDs).
> 
> Let's see if that is confirmed... Of course, I can book the box again
> here and test it myself (and will do that, if necessary :-D).
> 
> Thanks and Regards,
> Dario


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-07 13:10   ` Dario Faggioli
@ 2015-04-08  5:59     ` Chao Peng
  2015-04-08  8:23       ` Dario Faggioli
  2015-04-09 15:44     ` Meng Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Chao Peng @ 2015-04-08  5:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel, JBeulich

> > "max_rmid" is a per-socket property.  There is no requirement for it to
> > be the same for each socket in a system, although it is likely, given a
> > homogeneous system.
> >
> I know. Again this was not mentioned for document length reasons, but I
> planned to ask about this (as I've done that already this morning, as
> you can see. :-D).
> 
> In this case, though, it probably was something worth being mentioned,
> so I will if there will ever be a v2 of the document. :-)
> 
> Mostly, I was curious to learn why that is not reflected in the current
> implementation, i.e., whether there are any reasons why we should not
> take advantage of per-socketness of RMIDs, as reported by SDM, as that
> can greatly help mitigating RMID shortage in the per-CPU/core/socket
> configuration (in general, actually, but it's per-cpu that I'm
> interested in).

Andrew is right, RMID is a per-socket property. One reason it's not used
in the current implementation, I think, is that max_rmid is normally the
same across sockets, though it can differ in theory. So the same RMID is
targeted at all the sockets. But the per-socketness of RMIDs can be used
anyway.

We do take this into account for CAT.

> > As far as MSRs themselves go, an extra MSR write in the context switch
> > path is likely to pale into the noise.  However, querying the data is an
> > indirect MSR read (write to the event select MSR, read from  the data
> > MSR).  Furthermore there is no way to atomically read all data at once
> > which means that activity on other cores can interleave with
> > back-to-back reads in the scheduler.
> > 
> All true. And in fact, how and how frequent data should be gathered
> remains to be decided (as said in the document). I was thinking more to
> some periodic sampling, rather than to throw handfuls of rdmsr/wrmsr
> against the code that makes scheduling decisions! :-D

Due to current hardware limitations, and in the case of scheduling
improvements, periodic sampling sounds like a feasible direction to me.
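A minimal sketch of what such periodic sampling could maintain: an
exponentially weighted moving average of occupancy per RMID, in the same
spirit as the load tracking Credit2 already does in software (the 1/8
weight and the struct layout are illustrative choices, not existing Xen
code):

```c
#include <assert.h>
#include <stdint.h>

/* Smoothed occupancy for one RMID: each periodic tick folds one fresh
 * sample (e.g. a read of the data MSR) into the running average. */
struct occ_avg {
    uint64_t avg;
};

/* Weight of 1/8 for each new sample -- an illustrative choice. */
#define OCC_AVG_WEIGHT 8

static void occ_avg_sample(struct occ_avg *a, uint64_t sample)
{
    /* new_avg = old_avg + (sample - old_avg) / 8 */
    a->avg += ((int64_t)sample - (int64_t)a->avg) / OCC_AVG_WEIGHT;
}
```

The scheduler would then consult `avg` when making decisions, instead of
issuing rdmsr/wrmsr pairs on its hot paths.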

Chao


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-08  5:59     ` Chao Peng
@ 2015-04-08  8:23       ` Dario Faggioli
  2015-04-08  8:53         ` Andrew Cooper
  2015-04-08  8:55         ` Chao Peng
  0 siblings, 2 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-08  8:23 UTC (permalink / raw)
  To: chao.p.peng
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel, JBeulich


On Wed, 2015-04-08 at 13:59 +0800, Chao Peng wrote:

> > Mostly, I was curious to learn why that is not reflected in the current
> > implementation, i.e., whether there are any reasons why we should not
> > take advantage of per-socketness of RMIDs, as reported by SDM, as that
> > can greatly help mitigating RMID shortage in the per-CPU/core/socket
> > configuration (in general, actually, but it's per-cpu that I'm
> > interested in).
> 
> Andrew is right, RMID is a per-socket property. One reason it's not used
> in current implementation, I think, is the fact that max_rmid is
> normally the same among sockets, though they can be different in theory.
> So the same RMID is targeted for all the sockets. But per-socketness of
> RMIDs can be used anyway. 
> 
Yeah, but rather than the maximum number of available RMIDs, what I'm
most interested in is whether I can use _the_ _same_ RMID for different
cores, if they belong to different sockets. AFAIUI, it is possible, is
that correct?

> > All true. And in fact, how and how frequent data should be gathered
> > remains to be decided (as said in the document). I was thinking more to
> > some periodic sampling, rather than to throw handfuls of rdmsr/wrmsr
> > against the code that makes scheduling decisions! :-D
> 
> Due to current hardware limitations and in the case of scheduling improvement,
> periodic sampling sounds a feasible direction to me.
> 
Good to know, thanks.

Regards,
Dario


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-08  8:23       ` Dario Faggioli
@ 2015-04-08  8:53         ` Andrew Cooper
  2015-04-08  8:55         ` Chao Peng
  1 sibling, 0 replies; 32+ messages in thread
From: Andrew Cooper @ 2015-04-08  8:53 UTC (permalink / raw)
  To: Dario Faggioli, chao.p.peng
  Cc: Ian Campbell, Wei Liu, George Dunlap, JBeulich, xen-devel

On 08/04/15 09:23, Dario Faggioli wrote:
> On Wed, 2015-04-08 at 13:59 +0800, Chao Peng wrote:
>
>>> Mostly, I was curious to learn why that is not reflected in the current
>>> implementation, i.e., whether there are any reasons why we should not
>>> take advantage of per-socketness of RMIDs, as reported by SDM, as that
>>> can greatly help mitigating RMID shortage in the per-CPU/core/socket
>>> configuration (in general, actually, but it's per-cpu that I'm
>>> interested in).
>> Andrew is right, RMID is a per-socket property. One reason it's not used
>> in current implementation, I think, is the fact that max_rmid is
>> normally the same among sockets, though they can be different in theory.
>> So the same RMID is targeted for all the sockets. But per-socketness of
>> RMIDs can be used anyway. 
>>
> Yeah, but rather than to the maximum number of available RMIDs, what I'm
> much interested in is whether I can use _the_ _same_ RMID for different
> cores, if they belong to different sockets. AFAIUI, it is possible, is
> that correct?

It is perfectly possible to track the same logical resource using
different RMIDs on different sockets.

~Andrew

>
>>> All true. And in fact, how and how frequent data should be gathered
>>> remains to be decided (as said in the document). I was thinking more to
>>> some periodic sampling, rather than to throw handfuls of rdmsr/wrmsr
>>> against the code that makes scheduling decisions! :-D
>> Due to current hardware limitations and in the case of scheduling improvement,
>> periodic sampling sounds a feasible direction to me.
>>
> Good to know, thanks.
>
> Regards,
> Dario


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-08  8:23       ` Dario Faggioli
  2015-04-08  8:53         ` Andrew Cooper
@ 2015-04-08  8:55         ` Chao Peng
  1 sibling, 0 replies; 32+ messages in thread
From: Chao Peng @ 2015-04-08  8:55 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel, JBeulich

On Wed, Apr 08, 2015 at 08:23:11AM +0000, Dario Faggioli wrote:
> On Wed, 2015-04-08 at 13:59 +0800, Chao Peng wrote:
> 
> > > Mostly, I was curious to learn why that is not reflected in the current
> > > implementation, i.e., whether there are any reasons why we should not
> > > take advantage of per-socketness of RMIDs, as reported by SDM, as that
> > > can greatly help mitigating RMID shortage in the per-CPU/core/socket
> > > configuration (in general, actually, but it's per-cpu that I'm
> > > interested in).
> > 
> > Andrew is right, RMID is a per-socket property. One reason it's not used
> > in current implementation, I think, is the fact that max_rmid is
> > normally the same among sockets, though they can be different in theory.
> > So the same RMID is targeted for all the sockets. But per-socketness of
> > RMIDs can be used anyway. 
> > 
> Yeah, but rather than to the maximum number of available RMIDs, what I'm
> much interested in is whether I can use _the_ _same_ RMID for different
> cores, if they belong to different sockets. AFAIUI, it is possible, is
> that correct?

You are correct. So you actually have 72 RMIDs (0~71) for each socket.
Normally there are 2 or more RMIDs per hardware thread.

Chao


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-07 10:27 ` Andrew Cooper
  2015-04-07 13:10   ` Dario Faggioli
@ 2015-04-08 11:27   ` George Dunlap
  2015-04-08 13:29     ` Dario Faggioli
  1 sibling, 1 reply; 32+ messages in thread
From: George Dunlap @ 2015-04-08 11:27 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli, Xen-devel
  Cc: Chao Peng, Dongxiao Xu, wei.liu2, Ian.Campbell, JBeulich

On 04/07/2015 11:27 AM, Andrew Cooper wrote:
> On 04/04/2015 03:14, Dario Faggioli wrote:
>> Hi Everyone,
>>
>> This RFC series is the outcome of an investigation I've been doing about
>> whether we can take better advantage of features like Intel CMT (and of PSR
>> features in general). By "take better advantage of" them I mean, for example,
>> use the data obtained from monitoring within the scheduler and/or within
>> libxl's automatic NUMA placement algorithm, or similar.
>>
>> I'm putting here in the cover letter a markdown document I wrote to better
>> describe my findings and ideas (sorry if it's a bit long! :-D). You can also
>> fetch it at the following links:
>>
>>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
>>
>> See the document itself and the changelog of the various patches for details.
>>
>> The series includes one Chao's patch on top, as I found it convenient to build
>> on top of it. The series itself is available here:
>>
>>   git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
>>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon
>>
>> Thanks a lot to everyone that will read and reply! :-)
>>
>> Regards,
>> Dario
>> ---
> 
> There seem to be several areas of confusion indicated in your document. 
> I am unsure whether this is a side effect of the way you have written
> it, but here are (hopefully) some words of clarification.  To the best
> of my knowledge:
> 
> PSR CMT works by tagging cache lines with the currently-active RMID. 
> The cache utilisation is a count of the number of lines which are tagged
> with a specific RMID.  MBM on the other hand counts the number of cache
> line fills and cache line evictions tagged with a specific RMID.

For an actual counter, like MBM, we don't actually need different RMIDs*
to implement a per-vcpu counter: we could just read the value on every
context switch, compare it to the last value, and store the delta in the
vcpu struct.  Having extra RMIDs just makes it easier -- is that right?

I haven't thought about it in detail, but it seems like an LRU algorithm
for allocating MBM RMIDs might work for that.

* Are they called RMIDs for MBM?  If not, replace "RMID" in this paragraph
with the appropriate term.
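The per-vcpu counter idea above -- read the free-running value at each
context switch and accumulate the wrap-safe delta -- might look something
like this (a toy sketch; the 24-bit counter width and the struct layout are
illustrative assumptions, the real width would come from CPUID):

```c
#include <assert.h>
#include <stdint.h>

/* Per-vcpu accounting of a free-running, wrapping hardware counter:
 * on every context switch, read the current raw value and add the
 * delta since the last read to the vcpu's running total. */
struct vcpu_mbm {
    uint64_t last_raw;   /* counter value at the previous read */
    uint64_t total;      /* accumulated count for this vcpu */
};

#define CTR_WIDTH 24     /* illustrative; real width comes from CPUID */
#define CTR_MASK  ((1ULL << CTR_WIDTH) - 1)

static void vcpu_mbm_update(struct vcpu_mbm *v, uint64_t raw)
{
    raw &= CTR_MASK;
    v->total += (raw - v->last_raw) & CTR_MASK;  /* handles one wrap */
    v->last_raw = raw;
}
```

This is why extra RMIDs only make the bookkeeping easier: the accumulation
itself needs no RMID beyond the one active while the vcpu ran.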

For CMT, we could imagine setting the RMID as giving the pcpu a
paintbrush with a specific color of paint, with which it paints that
color on the wall (which would represent the L3 cache).  If we give Red
to Andy and Blue to Dario, then after a while we can look at the red and
blue portions of the wall and know which belongs to which.  But if we
then give the red one to Konrad, we'll never be *really* sure how much
of the red on the wall was put there by Konrad and how much was put
there by Andy.  If Dario is a mad painter just painting over everything,
then within a relatively short period of time we can assume that
whatever red there is belongs to Konrad; but if Dario is more
constrained, Andy's paint may stay there indefinitely.

But what we *can* say, I suppose, is that Konrad's "footprint" is
certainly *less than* the amount of red paint on the wall; and that any
*increase* in the amount of red paint since we gave the brush to Konrad
certainly belongs to him.

So we could probably "bracket" the usage by any given vcpu: if the
original RMID occupancy was O, and the current RMID occupancy is N, then
the actual occupancy is between [N-O] and N.

Hmm, although I guess that's not true either -- a vcpu may still have
occupancy from all previous RMIDs that it's used.

Which makes me wonder -- if we were to use an RMID "recycling" scheme,
one of the best algorithms would probably be to recycle the RMID
which was 1) not in use on another core at the time, and 2) had the
lowest count.  With 71 RMIDs, it seems fairly likely to me that in
practice at least one of those will be nearly zero at any given time.
Reassigning only low-occupancy RMIDs also minimizes the effect mentioned
above, where a vcpu gets unaccounted occupancy from previously-used RMIDs.

What do you think?
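A sketch of that recycling policy -- among RMIDs not currently loaded on
any core, steal the one with the lowest occupancy -- could look like this
(names and structures here are hypothetical, not from the series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RMIDS 72

struct rmid_state {
    bool     in_use_on_cpu;   /* currently active in some core's PQR */
    uint64_t occupancy;       /* last sampled occupancy for this RMID */
};

/* Return the best RMID to recycle: idle, and with the smallest residual
 * occupancy, so the new owner inherits as little stale accounting as
 * possible.  Returns 0 if every RMID is busy (RMID 0 is reserved). */
static unsigned int pick_rmid_to_recycle(const struct rmid_state s[NR_RMIDS])
{
    unsigned int r, best = 0;

    for ( r = 1; r < NR_RMIDS; r++ )
    {
        if ( s[r].in_use_on_cpu )
            continue;
        if ( best == 0 || s[r].occupancy < s[best].occupancy )
            best = r;
    }
    return best;
}
```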

> As far as MSRs themselves go, an extra MSR write in the context switch
> path is likely to pale into the noise.  However, querying the data is an
> indirect MSR read (write to the event select MSR, read from  the data
> MSR).  Furthermore there is no way to atomically read all data at once
> which means that activity on other cores can interleave with
> back-to-back reads in the scheduler.

I don't think it's a given that an MSR write will be cheap.  Back when I
was doing my thesis (10 years ago now), logging some performance
counters on context switch (which was just an MSR read) added about 7%
to the overhead of a kernel build, IIRC.

Processors have changed quite a bit in that time, and we can hope that
Intel would have tried to make writing the IDs pretty fast.  But before
we enabled anything by default I think we'd want to make sure and take a
look at the overhead first.

 -George


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (8 preceding siblings ...)
  2015-04-07 10:27 ` Andrew Cooper
@ 2015-04-08 11:30 ` George Dunlap
  2015-04-08 13:16   ` Dario Faggioli
  2015-04-09 15:37 ` Meng Xu
  10 siblings, 1 reply; 32+ messages in thread
From: George Dunlap @ 2015-04-08 11:30 UTC (permalink / raw)
  To: Dario Faggioli, Xen-devel
  Cc: wei.liu2, Ian.Campbell, andrew.cooper3, Dongxiao Xu, JBeulich, Chao Peng

On 04/04/2015 03:14 AM, Dario Faggioli wrote:
> ### Per-vCPU cache monitoring
> 
> This means being able to tell how much of the L3 is being used by each vCPU.
> Monitoring the cache occupancy of a specific domain, would still be possible,
> just by summing up the contributions from all the domain's vCPUs.

One note about this -- vcpu cache utilization may be predictive
short-term, but long-term it's probably less important because the guest
may move processes between vcpus.  So it may make sense to leave the
occupancy stats on a per-domain basis anyway.

Thoughts?

 -George


* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-08 11:30 ` George Dunlap
@ 2015-04-08 13:16   ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-08 13:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, xen-devel, JBeulich, chao.p.peng


On Wed, 2015-04-08 at 12:30 +0100, George Dunlap wrote:
> On 04/04/2015 03:14 AM, Dario Faggioli wrote:
> > ### Per-vCPU cache monitoring
> > 
> > This means being able to tell how much of the L3 is being used by each vCPU.
> > Monitoring the cache occupancy of a specific domain, would still be possible,
> > just by summing up the contributions from all the domain's vCPUs.
> 
> One note about this -- vcpu cache utilization may be predictive
> short-term, but long-term it's probably less important because the guest
> may move processes between vcpus.
>
True. That, however, applies to any measurement / estimation of
per-vcpu load, for any definition of 'load', doesn't it? So, yes, we can
use it for short-term decisions and/or time-average it (i.e., exactly as
we do with per-vcpu and runqueue load, e.g., in Credit2).

>   So it may make sense to leave the
> occupancy stats on a per-domain basis anyway.
> 
Indeed, but then I'm not sure I see a way to use those stats, at least
not from inside the scheduler (if that's what we're talking about), do you?

> Thoughts?
> 
IMO, a nice way to use CMT in a per-vcpu configuration, from within the
scheduler, would be to know, for a given vcpu, how much of the data it
uses is (still) resident in a given cache layer of a given pCPU at a
given time instant. That sort of info could be used to decide whether it
is wise to move the vcpu away from that pCPU or not, in a way that is
complementary to other metrics which we already have and/or, in general,
that can be implemented in software (e.g., load average stats).

Doing that, however, requires too many RMIDs, and it's not terribly
useful if done on L3. Therefore, I think it is worth investing time in
enabling and trying to exploit per-*pCPU* monitoring instead.

Thoughts? :-D

Thanks and Regards,
Dario



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
  2015-04-06 13:59   ` Konrad Rzeszutek Wilk
  2015-04-07  8:24   ` Chao Peng
@ 2015-04-08 13:28   ` George Dunlap
  2015-04-08 14:03     ` Dario Faggioli
  2 siblings, 1 reply; 32+ messages in thread
From: George Dunlap @ 2015-04-08 13:28 UTC (permalink / raw)
  To: Dario Faggioli, Xen-devel
  Cc: wei.liu2, Ian.Campbell, andrew.cooper3, Dongxiao Xu, JBeulich, Chao Peng

On 04/04/2015 03:14 AM, Dario Faggioli wrote:
> This allows for a new item to be passed as part of the psr=
> boot option: "percpu_cmt". If that is specified, Xen tries,
> at boot time, to associate an RMID to each core.
> 
> XXX This all looks rather straightforward, if it weren't
>     for the fact that it is, apparently, more common than
>     I thought to run out of RMIDs. For example, on a dev box
>     we have in Cambridge, there are 144 pCPUs and only 71
>     RMIDs.

Is that because you have 2 sockets?

There's no need to keep RMIDs unique across sockets, is there?  E.g.,
socket 0 cpu 0 and socket 1 cpu 0 can have the same RMID, because cache
and the MSRs are per-socket.

If we're doing things on a per-domain basis, having the same RMID
allocated for each socket sort of makes sense; but even then, if you
know a domain is only going to run on a given socket, there's no reason
in theory we couldn't use same RMID for a different domain on the other
socket (assuming it was only going to run on the other socket).

One advantage of doing things at a per-vcpu level is that you wouldn't
have to worry about inter-socket RMID issues.

 -George

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-08 11:27   ` George Dunlap
@ 2015-04-08 13:29     ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-08 13:29 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, xen-devel, dongxiao.xu,
	JBeulich, chao.p.peng



On Wed, 2015-04-08 at 12:27 +0100, George Dunlap wrote:
> On 04/07/2015 11:27 AM, Andrew Cooper wrote:

> > There seem to be several areas of confusion indicated in your document. 
> > I am unsure whether this is a side effect of the way you have written
> > it, but here are (hopefully) some words of clarification.  To the best
> > of my knowledge:
> > 
> > PSR CMT works by tagging cache lines with the currently-active RMID. 
> > The cache utilisation is a count of the number of lines which are tagged
> > with a specific RMID.  MBM on the other hand counts the number of cache
> > line fills and cache line evictions tagged with a specific RMID.
> 
> For an actual counter, like MBM, we don't actually need different RMIDs* to
> implement a per-vcpu counter: we could just read the value on every
> context switch, compare it to the last value, and store it in the vcpu
> struct.  Having extra RMIDs just makes it easier -- is that right?
> 
I'm not sure I'm following.

As per Andrew's description, both are counters. And in fact, if
sampling-&-subtracting at every context switch is an option, both CMT
and MBM stats of a particular instance of execution of a vcpu can be
collected, I think, just by using one RMID for each pCPU.

I'm not sure what your 'last value' refers to, though. Last value of
what? I mean, last value of the counter associated with what RMID? What
entity were you thinking of associating an RMID with: a vcpu? A pCPU? A
domain? Were you thinking of a static or dynamic kind of association?

Anyway, sampling at every context switch means one MSR write and one MSR
read (to get one sample), which, as you say yourself below, may not be
that cheap.

> I haven't thought about it in detail, but it seems like for that having
> an LRU algorithm for allocating MBM RMIDs might work.
> 
> * Are they called RMIDs for MBM?  If not, replace "RMID" in this paragraph
> with the appropriate term.
> 
They are called RMIDs, and they are the same for both MBM and CMT,
AFAIUI. I mean, once you have associated logical entity X with RMID y,
you can monitor both X's cache occupancy and memory bandwidth via RMID
y. OTOH, it is not possible to associate RMID y with X for CMT and RMID
z for MBM.

> For CMT, we could imagine setting the RMID as giving the pcpu a
> paintbrush with a specific color of paint, with which it paints that
> color on the wall (which would represent the L3 cache).  If we give Red
> to Andy and Blue to Dario, then after a while we can look at the red and
> blue portions of the wall and know which belongs to which.  But if we
> then give the red one to Konrad, we'll never be *really* sure how much
> of the red on the wall was put there by Konrad and how much was put
> there by Andy.  If Dario is a mad painter just painting over everything,
> then within a relatively short period of time we can assume that
> whatever red there is belongs to Konrad; but if Dario is more
> constrained, Andy's paint may stay there indefinitely.
> 
> But what we *can* say, I suppose, is that Konrad's "footprint" is
> certainly *less than* the amount of red paint on the wall; and that any
> *increase* in the amount of red paint since we gave the brush to Konrad
> certainly belongs to him.
> 
> So we could probably "bracket" the usage by any given vcpu: if the
> original RMID occupancy was O, and the current RMID occupancy is N, then
> the actual occupancy is between [N-O] and N.
> 
> Hmm, although I guess that's not true either -- a vcpu may still have
> occupancy from all previous RMIDs that it's used.
> 
This is about the problem of 'recycling' RMIDs but, having not
understood how you are thinking of allocating them, I'm not getting the
recycling part either. :-)

It seems that you're suggesting some kind of dynamic RMID-to-vcpu
allocation scheme, is that the case?

> > As far as MSRs themselves go, an extra MSR write in the context switch
> > path is likely to pale into the noise.  However, querying the data is an
> > indirect MSR read (write to the event select MSR, read from  the data
> > MSR).  Furthermore there is no way to atomically read all data at once
> > which means that activity on other cores can interleave with
> > back-to-back reads in the scheduler.
> 
> I don't think it's a given that an MSR write will be cheap.  Back when I
> was doing my thesis (10 years ago now), logging some performance
> counters on context switch (which was just an MSR read) added about 7%
> to the overhead of a kernel build, IIRC.
> 
> Processors have changed quite a bit in that time, and we can hope that
> Intel would have tried to make writing the IDs pretty fast.  But before
> we enabled anything by default I think we'd want to make sure and take a
> look at the overhead first.
> 
>  -George
>
Thanks and Regards,
Dario




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 3/7] xen: psr: reserve an RMID for each core
  2015-04-08 13:28   ` George Dunlap
@ 2015-04-08 14:03     ` Dario Faggioli
  0 siblings, 0 replies; 32+ messages in thread
From: Dario Faggioli @ 2015-04-08 14:03 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, xen-devel, dongxiao.xu,
	JBeulich, chao.p.peng



On Wed, 2015-04-08 at 14:28 +0100, George Dunlap wrote:
> On 04/04/2015 03:14 AM, Dario Faggioli wrote:
> > This allows for a new item to be passed as part of the psr=
> > boot option: "percpu_cmt". If that is specified, Xen tries,
> > at boot time, to associate an RMID to each core.
> > 
> > XXX This all looks rather straightforward, if it weren't
> >     for the fact that it is, apparently, more common than
> >     I thought to run out of RMIDs. For example, on a dev box
> >     we have in Cambridge, there are 144 pCPUs and only 71
> >     RMIDs.
> 
> Is that because you have 2 sockets?
> 
It has 4 sockets. Chao explained in one of his mails that there usually
are 2 or more RMIDs per hardware thread.

> There's no need to keep RMIDs unique across sockets, is there?  E.g.,
> socket 0 cpu 0 and socket 1 cpu 0 can have the same RMID, because cache
> and the MSRs are per-socket.
> 
Exactly. And in fact, I just added to my TODO list improving Xen's
current PSR support to take per-socketness of RMIDs correctly into
account.

> If we're doing things on a per-domain basis, having the same RMID
> allocated for each socket sort of makes sense; but even then, if you
> know a domain is only going to run on a given socket, there's no reason
> in theory we couldn't use same RMID for a different domain on the other
> socket (assuming it was only going to run on the other socket).
> 
Right now CMT is per-domain and, on a box with 71 available RMIDs
_per_each_socket_, we have (in Xen) an array of 71 possible RMIDs (72,
but RMID 0 is treated specially) to be assigned to domains.

Independently of where a domain will ever run, we can use separate
arrays, and each domain can have an RMID on each socket. If, as you say,
there are well-known restrictions, a domain can avoid having RMIDs for
certain sockets. In the worst case (all domains can run everywhere), we
use the same amount of RMIDs; in better/best cases, we use fewer of them.

This is even more important when looking at per-pCPU CMT configurations.
In fact, right now, on that box, I have more than 71 pCPUs, so I can't
associate an RMID with every pCPU. However, with proper per-socket
support implemented, not only will I be able to associate an RMID with
all the pCPUs, but I'll also have 35 RMIDs free on each socket. :-)

> One advantage of doing things at a per-vcpu level is that you wouldn't
> have to worry about inter-socket RMID issues.
>
Sorry, but I'm lost again, I'm afraid. What do you mean by this?

Thanks and Regards,
Dario




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
                   ` (9 preceding siblings ...)
  2015-04-08 11:30 ` George Dunlap
@ 2015-04-09 15:37 ` Meng Xu
  10 siblings, 0 replies; 32+ messages in thread
From: Meng Xu @ 2015-04-09 15:37 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Xen-devel,
	Dongxiao Xu, Jan Beulich, Chao Peng

Hi Dario,

2015-04-03 22:14 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> Hi Everyone,
>
> This RFC series is the outcome of an investigation I've been doing about
> whether we can take better advantage of features like Intel CMT (and of PSR
> features in general). By "take better advantage of" them I mean, for example,
> use the data obtained from monitoring within the scheduler and/or within
> libxl's automatic NUMA placement algorithm, or similar.
>
> I'm putting here in the cover letter a markdown document I wrote to better
> describe my findings and ideas (sorry if it's a bit long! :-D). You can also
> fetch it at the following links:
>
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
>
> See the document itself and the changelog of the various patches for details.
>
> The series includes one Chao's patch on top, as I found it convenient to build
> on top of it. The series itself is available here:
>
>   git://xenbits.xen.org/people/dariof/xen.git  wip/sched/icachemon
>   http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/icachemon
>
> Thanks a lot to everyone that will read and reply! :-)
>
> Regards,
> Dario
> ---
>
> # Intel Cache Monitoring: Present and Future
>
> ## About this document
>
> This document represents the result of an investigation on whether it would be
> possible to more extensively exploit the Platform Shared Resource Monitoring
> (PSR) capabilities of recent Intel x86 server chips. Examples of such features
> are the Cache Monitoring Technology (CMT) and the Memory Bandwidth Monitoring
> (MBM).
>
> More specifically, it focuses on Cache Monitoring Technology, support for which
> has recently been introduced in Xen by Intel, trying to figure out whether it
> can be used for high level load balancing, such as libxl automatic domain
> placement, and/or within Xen vCPU scheduler(s).
>
> Note that, although the document only speaks about CMT, most of the
> considerations apply (or can easily be extended) to MBM as well.
>
> The fact that, currently, support is provided for monitoring L3 cache only,
> somewhat limits the benefits of more extensively exploiting such technology,
> which is exactly the purpose here. Nevertheless, some improvements are possible
> already, and if at some point support for monitoring other cache layers will be
> available, this can be the basic building block for taking advantage of that
> too.

I'm wondering: if you really want to know the cache usage at different
levels of cache, you could use the (4) general-purpose PMCs on each
logical core to monitor that. This could bypass the limitation of the
current HW, but the concern is that it may affect other mechanisms in
Xen, like perf, which also use the PMCs.

Another thought on CMT is that Intel seems to have introduced it along
with CAT. So I assume they want CMT to be used together with CAT, so
that it gives some hint on how to allocate the LLC to different guests?
For example, if a crazy guest is thrashing the LLC, they can apply CAT
to constrain/calm down this crazy guest.


Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities
  2015-04-07 13:10   ` Dario Faggioli
  2015-04-08  5:59     ` Chao Peng
@ 2015-04-09 15:44     ` Meng Xu
  1 sibling, 0 replies; 32+ messages in thread
From: Meng Xu @ 2015-04-09 15:44 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, George Dunlap, xen-devel,
	JBeulich, chao.p.peng

2015-04-07 9:10 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> On Tue, 2015-04-07 at 11:27 +0100, Andrew Cooper wrote:
>> On 04/04/2015 03:14, Dario Faggioli wrote:
>>
>> > I'm putting here in the cover letter a markdown document I wrote to better
>> > describe my findings and ideas (sorry if it's a bit long! :-D). You can also
>> > fetch it at the following links:
>> >
>> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.pdf
>> >  * http://xenbits.xen.org/people/dariof/CMT-in-scheduling.markdown
>> >
>> > See the document itself and the changelog of the various patches for details.
>
>>
>> There seem to be several areas of confusion indicated in your document.
>>
> I see. Sorry for that then.
>
>> I am unsure whether this is a side effect of the way you have written
>> it, but here are (hopefully) some words of clarification.
>>
> And thanks for this. :-)
>
>> PSR CMT works by tagging cache lines with the currently-active RMID.
>> The cache utilisation is a count of the number of lines which are tagged
>> with a specific RMID.  MBM on the other hand counts the number of cache
>> line fills and cache line evictions tagged with a specific RMID.
>>
> Ok.
>
>> By this nature, the information will never reveal the exact state of
>> play.  e.g. a core with RMID A which gets a cache line hit against a
>> line currently tagged with RMID B will not alter any accounting.
>>
> So, you're saying that the information we get is an approximation of
> reality, not its 100% accurate representation. That is no news, IMO.
> When, inside Credit2, we try to track the average load on each runqueue,
> that is an approximation. When, in Credit1, we consider a vcpu "cache
> hot" if it ran recently, that is an approximation. Etc. These
> approximations happens fully in software, because it is possible, in
> those cases.
>
> PSR provides data and insights on something that, without hardware
> support, we couldn't possibly hope to know anything about. Whether we
> should think about using such data or not, it depends whether they are
> represents a (base for a) reasonable enough approximation, or they are
> just a bunch of pseudo random numbers.
>
> It seems to me that you are suggesting the latter to be more likely than
> the former, i.e., PSR does not provide a good enough approximation for
> being used from inside Xen and toolstack, is my understanding correct?
>
>> Furthermore, as alterations of the RMID only occur in
>> __context_switch(), Xen actions such as handling an interrupt will be
>> accounted against the currently active domain (or other future
>> granularity of RMID).
>>
> Yes, I thought about this. However, this is certainly important for
> per-domain, or for a (unlikely) future per-vcpu, monitoring, but if you
> attach an RMID to a pCPU (or groups of pCPU) then that is not really a
> problem.
>
> Actually, it's the correct behavior: running Xen and serving interrupts
> on a certain core, in that case, *does* need to be accounted! So,
> considering that both the document and the RFC series are mostly focused
> on introducing per-pcpu/core/socket monitoring, rather than on
> per-domain monitoring, and given that the document was becoming quite
> long, I decided not to add a section about this.
>
>> "max_rmid" is a per-socket property.  There is no requirement for it to
>> be the same for each socket in a system, although it is likely, given a
>> homogeneous system.
>>
> I know. Again this was not mentioned for document length reasons, but I
> planned to ask about this (as I've done that already this morning, as
> you can see. :-D).
>
> In this case, though, it probably was something worth being mentioned,
> so I will if there will ever be a v2 of the document. :-)
>
> Mostly, I was curious to learn why that is not reflected in the current
> implementation, i.e., whether there are any reasons why we should not
> take advantage of per-socketness of RMIDs, as reported by SDM, as that
> can greatly help mitigating RMID shortage in the per-CPU/core/socket
> configuration (in general, actually, but it's per-cpu that I'm
> interested in).
>
>> The limit on RMID is based on the size of the
>> accounting table.
>>
> Did not know in details, but it makes sense. Getting feedback on what
> should be expected as number of available RMIDs in current and future
> hardware, from Intel people and from everyone who knows (like you :-D ),
> was the main purpose of sending this out, so thanks.
>
>> As far as MSRs themselves go, an extra MSR write in the context switch
>> path is likely to pale into the noise.  However, querying the data is an
>> indirect MSR read (write to the event select MSR, read from  the data
>> MSR).  Furthermore there is no way to atomically read all data at once
>> which means that activity on other cores can interleave with
>> back-to-back reads in the scheduler.
>>
> All true. And in fact, how and how frequent data should be gathered
> remains to be decided (as said in the document). I was thinking more to
> some periodic sampling, rather than to throw handfuls of rdmsr/wrmsr
> against the code that makes scheduling decisions! :-D
>


Actually, I'm considering whether periodic sampling is a better idea
than event-based/situation-based sampling. For example, as you and
George mentioned, the cache affinity information may only be useful in
the short term, which means you may not need to issue the MSR reads to
get the cache information once a vcpu has run long enough. IMHO, there
should exist some heuristics to indicate when the "near-accurate" cache
usage information will be very useful to guide scheduling decisions.

For example, another situation in my mind which does not need such
frequent sampling is this: if a domain has shown very little cache
usage for the last several "event-based" cache-usage samples, we (or
the scheduler) can speculate that this domain is not cache-intensive,
and make decisions based on this speculation. Then we only sample the
cache usage of this domain at a very low frequency, until the domain
changes from not-cache-intensive to cache-intensive mode, at which
point we switch it back to event-based sampling.

So I think maybe a hybrid approach may be better. :-)

Best,

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-04-09 15:44 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-04  2:14 [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Dario Faggioli
2015-04-04  2:14 ` [RFC PATCH 1/7] x86: improve psr scheduling code Dario Faggioli
2015-04-06 13:48   ` Konrad Rzeszutek Wilk
2015-04-04  2:14 ` [RFC PATCH 2/7] Xen: x86: print max usable RMID during init Dario Faggioli
2015-04-06 13:48   ` Konrad Rzeszutek Wilk
2015-04-07 10:11     ` Dario Faggioli
2015-04-04  2:14 ` [RFC PATCH 3/7] xen: psr: reserve an RMID for each core Dario Faggioli
2015-04-06 13:59   ` Konrad Rzeszutek Wilk
2015-04-07 10:19     ` Dario Faggioli
2015-04-07 13:57       ` Konrad Rzeszutek Wilk
2015-04-07  8:24   ` Chao Peng
2015-04-07 10:07     ` Dario Faggioli
2015-04-08 13:28   ` George Dunlap
2015-04-08 14:03     ` Dario Faggioli
2015-04-04  2:14 ` [RFC PATCH 4/7] xen: libxc: libxl: report per-CPU cache occupancy up to libxl Dario Faggioli
2015-04-04  2:14 ` [RFC PATCH 5/7] xen: libxc: libxl: allow for attaching and detaching a CPU to CMT Dario Faggioli
2015-04-04  2:15 ` [RFC PATCH 6/7] xl: report per-CPU cache occupancy up to libxl Dario Faggioli
2015-04-04  2:15 ` [RFC PATCH 7/7] xl: allow for attaching and detaching a CPU to CMT Dario Faggioli
2015-04-07  8:19 ` [RFC PATCH 0/7] Intel Cache Monitoring: Current Status and Future Opportunities Chao Peng
2015-04-07  9:51   ` Dario Faggioli
2015-04-07 10:27 ` Andrew Cooper
2015-04-07 13:10   ` Dario Faggioli
2015-04-08  5:59     ` Chao Peng
2015-04-08  8:23       ` Dario Faggioli
2015-04-08  8:53         ` Andrew Cooper
2015-04-08  8:55         ` Chao Peng
2015-04-09 15:44     ` Meng Xu
2015-04-08 11:27   ` George Dunlap
2015-04-08 13:29     ` Dario Faggioli
2015-04-08 11:30 ` George Dunlap
2015-04-08 13:16   ` Dario Faggioli
2015-04-09 15:37 ` Meng Xu
