Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support

From: Luiz Capitulino <lcapitulino@redhat.com>
To: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H Peter Anvin" <hpa@zytor.com>, "Ingo Molnar" <mingo@redhat.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"linux-kernel" <linux-kernel@vger.kernel.org>,
	"x86" <x86@kernel.org>,
	"Vikas Shivappa" <vikas.shivappa@linux.intel.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	tj@kernel.org, riel@redhat.com
Subject: Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
Date: Wed, 4 Nov 2015 09:42:27 -0500
Message-ID: <20151104094227.5aafdf2c@redhat.com>
In-Reply-To: <1443766185-61618-1-git-send-email-fenghua.yu@intel.com>

On Thu,  1 Oct 2015 23:09:34 -0700
Fenghua Yu <fenghua.yu@intel.com> wrote:

> This series has some preparatory patches and Intel cache allocation
> support.

Ping? What's the status of this series?

We badly need this series for KVM-RT workloads. I did try it and it
seems to work. Apart from small fixable issues, which I'll point out
in replies to the specific patches, there are some design issues on
which I need some clarification. They are in order of relevance:

 o Cache reservations are global to all NUMA nodes

   CAT is mostly intended for real-time and high performance
   computing. For both of them the most common setup is to
   pin your threads to specific cores on a specific NUMA node.

   So, suppose I have two HPC threads pinned to specific cores
   on node1. I want to reserve 80% of the L3 cache to those
   threads. With current patches I'd do this:

    1. Create an "all-tasks" cgroup which can only access 20% of
       the cache
    2. Create a "hpc" cgroup which can access 80% of the cache
    3. Move my HPC threads to "hpc" and all the other threads to
       "all-tasks"

   This has the intended behavior on node1: the "hpc" threads
   will write into 80% of the L3 cache and any "all-tasks" threads
   executing there will only write into 20% of the cache.

   However, this is also true for node0! So, the "all-tasks"
   threads can only write into 20% of the cache in node0 even
   though "hpc" threads will never execute there.
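
   To make the numbers concrete, here is a rough sketch of the two
   masks involved. It assumes a 20-bit L3 capacity bitmask (the real
   length is reported by CPUID and may differ) and relies on the
   hardware rule that the set bits in a CBM must be contiguous. The
   helper name and constants are made up for illustration and do not
   come from the patches.

```python
# Hypothetical sketch: none of these names come from the patch series.
CBM_LEN = 20  # assumed L3 capacity bitmask length; real value comes from CPUID


def contiguous_cbm(n_bits, shift=0, cbm_len=CBM_LEN):
    """Return a mask of n_bits consecutive set bits starting at bit `shift`."""
    if n_bits < 1 or shift < 0 or shift + n_bits > cbm_len:
        raise ValueError("mask does not fit in the CBM")
    return ((1 << n_bits) - 1) << shift


hpc_bits = CBM_LEN * 80 // 100        # 80% of the bits
all_tasks_bits = CBM_LEN - hpc_bits   # the remaining 20%

# Place the two groups in disjoint, contiguous regions of the mask.
all_tasks_cbm = contiguous_cbm(all_tasks_bits)
hpc_cbm = contiguous_cbm(hpc_bits, shift=all_tasks_bits)

print(f"all-tasks: {all_tasks_cbm:#07x}")
print(f"hpc:       {hpc_cbm:#07x}")
```

   With the current patches, each of these masks is a single value
   per cgroup, so the same "all-tasks" mask applies on node0 and
   node1 alike.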

   Is this intended by design? Like, is this a hardware limitation
   (given that the IA32_L3_MASK_n MSRs are global anyways) or maybe
   a way to enforce cache coherence?

   I was wondering if we could have masks per NUMA node, where
   they are applied to processes whenever they migrate among
   NUMA nodes.
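
   A toy model of what I mean, assuming a 20-bit CBM and an 80/20
   split that should only apply on node1. Everything here (the table
   layout, cbm_for(), on_migrate()) is hypothetical and only meant to
   illustrate the lookup, not how the patches or the hardware would
   actually implement it.

```python
# Hypothetical model of the per-node idea; not how the patches work today.
FULL_CBM = 0xFFFFF  # assumed 20-bit CBM with every bit set

# group -> {node: CBM}; a node without an entry falls back to the full mask
per_node_cbm = {
    "hpc":       {1: 0xFFFF0},
    "all-tasks": {1: 0x0000F},
}


def cbm_for(group, node):
    """Mask a task in `group` would get while running on `node`."""
    return per_node_cbm.get(group, {}).get(node, FULL_CBM)


def on_migrate(task, new_node):
    """Re-evaluate the task's mask when it moves to another node."""
    task["node"] = new_node
    task["cbm"] = cbm_for(task["group"], new_node)
    return task


task = on_migrate({"group": "all-tasks"}, 1)
print(hex(task["cbm"]))  # restricted while on node1

task = on_migrate(task, 0)
print(hex(task["cbm"]))  # unrestricted once it moves to node0
```

   The point is only that the restriction on "all-tasks" would follow
   the node where the reservation matters instead of applying
   everywhere.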

 o How does this feature apply to kernel threads?

   I'm just unable to move kernel threads out of the root
   cgroup. This means that kernel threads can always write
   into the entire cache no matter what the reservation scheme is.

   Is this intended by design? Why? Unless I'm missing
   something, reservations could and should be applied to
   kernel threads as well.

 o You can't change the root cgroup's CBM

   I can understand this makes the implementation a lot simpler.
   However, the reality is that there are way too few CBMs
   and losing one for the root group seems like a waste.

   Can we change this or are there strong reasons not to do so?

 o cgroups hierarchy is limited by the number of CBMs

   Today on my Haswell system, this means that I can only have 3
   directories in my cgroups hierarchy. If the number of CBMs
   is expected to grow in future processors, then I think having
   this feature as cgroups makes sense. However, if we're still
   going to be this limited in terms of directory structure, then
   it seems a bit overkill to me to have this as cgroups