Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Vikas Shivappa <vikas.shivappa@intel.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, hpa@zytor.com,
	tglx@linutronix.de, mingo@kernel.org, tj@kernel.org,
	peterz@infradead.org, matt.fleming@intel.com,
	will.auld@intel.com, glenn.p.williamson@intel.com,
	kanaka.d.juvva@intel.com
Subject: Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
Date: Thu, 26 Mar 2015 22:29:27 -0300
Message-ID: <20150327012927.GA1709@amt.cnet>
In-Reply-To: <alpine.DEB.2.10.1503261133070.19649@vshiva-Udesk>

On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> 
> Hello Marcelo,

Hi Vikas,

> On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> 
> >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> >>This patch adds a description of Cache allocation technology, overview
> >>of kernel implementation and usage of CAT cgroup interface.
> >>
> >>Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> >>---
> >> Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 183 insertions(+)
> >> create mode 100644 Documentation/cgroups/rdt.txt
> >>
> >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> >>new file mode 100644
> >>index 0000000..98eb4b8
> >>--- /dev/null
> >>+++ b/Documentation/cgroups/rdt.txt
> >>@@ -0,0 +1,183 @@
> >>+        RDT
> >>+        ---
> >>+
> >>+Copyright (C) 2014 Intel Corporation
> >>+Written by vikas.shivappa@linux.intel.com
> >>+(based on contents and format from cpusets.txt)
> >>+
> >>+CONTENTS:
> >>+=========
> >>+
> >>+1. Cache Allocation Technology
> >>+  1.1 What is RDT and CAT ?
> >>+  1.2 Why is CAT needed ?
> >>+  1.3 CAT implementation overview
> >>+  1.4 Assignment of CBM and CLOS
> >>+  1.5 Scheduling and Context Switch
> >>+2. Usage Examples and Syntax
> >>+
> >>+1. Cache Allocation Technology(CAT)
> >>+===================================
> >>+
> >>+1.1 What is RDT and CAT
> >>+-----------------------
> >>+
> >>+CAT is a part of Resource Director Technology (RDT), also called
> >>+Platform Shared Resource Control, which provides support for
> >>+controlling platform shared resources such as cache. Currently cache
> >>+is the only resource that is supported in RDT.
> >>+More information can be found in the Intel SDM section 17.15.
> >>+
> >>+Cache Allocation Technology provides a way for the software (OS/VMM)
> >>+to restrict cache allocation to a defined 'subset' of cache which may
> >>+overlap with other 'subsets'.  This feature is used when allocating a
> >>+line in cache, i.e. when pulling new data into the cache.
> >>+The hardware is programmed via MSRs.
> >>+
> >>+The different cache subsets are identified by a CLOS (class of
> >>+service) identifier and each CLOS has a CBM (cache bit mask).  The CBM
> >>+is a contiguous set of bits which defines the amount of cache resource
> >>+that is available for each 'subset'.
> >>+
> >>+1.2 Why is CAT needed
> >>+---------------------
> >>+
> >>+CAT enables more cache resources to be made available for higher
> >>+priority applications based on guidance from the execution
> >>+environment.
> >>+
> >>+The architecture also allows dynamically changing these subsets during
> >>+runtime to further optimize the performance of the higher priority
> >>+application with minimal degradation to the low priority app.
> >>+Additionally, resources can be rebalanced for system throughput
> >>+benefit.  (Refer to Section 17.15 in the Intel SDM)
> >>+
> >>+This technique may be useful in managing large computer systems with
> >>+a large LLC. Examples may be large servers running instances of
> >>+webservers or database servers. In such complex systems, these subsets
> >>+can be used for more careful placement of the available cache
> >>+resources.
> >>+
> >>+The CAT kernel patch would provide a basic kernel framework for users
> >>+to be able to implement such cache subsets.
> >>+
> >>+1.3 CAT implementation Overview
> >>+-------------------------------
> >>+
> >>+The kernel implements a cgroup subsystem to support cache allocation.
> >>+
> >>+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
> >>+A CLOS (class of service) is represented by a CLOSid. The CLOSid is
> >>+internal to the kernel and not exposed to the user.  Each cgroup has
> >>+one CBM and represents just one cache 'subset'.
> >>+
> >>+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
> >>+cgroup never fail.  When a child cgroup is created it inherits the
> >>+CLOSid and the CBM from its parent.  When a user changes the default
> >>+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> >>+used before.  Changing 'cbm' may fail with -ENOSPC once the kernel
> >>+runs out of the maximum number of CLOSids it can support.
> >>+Users can create as many cgroups as they want, but the number of
> >>+different CBMs in use at the same time is limited by the maximum
> >>+number of CLOSids (multiple cgroups can have the same CBM).
> >>+The kernel maintains a CLOSid<->cbm mapping which keeps a reference
> >>+count of the cgroups using each CLOSid.
> >>+
> >>+The tasks in the cgroup get to fill the part of the LLC represented
> >>+by the cgroup's 'cbm' file.
> >>+
> >>+The root directory has all available bits set in its 'cbm' file by
> >>+default.
> >>+
> >>+1.4 Assignment of CBM,CLOS
> >>+--------------------------
> >>+
> >>+The 'cbm' needs to be a subset of the parent node's 'cbm'.
> >>+Any contiguous subset of these bits (with a minimum of 2 bits) may be
> >>+set to indicate the cache mapping desired.  The 'cbm' of 2
> >>+directories can overlap. The 'cbm' represents the cache 'subset'
> >>+of the CAT cgroup.  For example, on a system with a 16-bit maximum
> >>+cbm, if the directory has the least significant 4 bits set in its
> >>+'cbm' file (meaning the 'cbm' is just 0xf), it is allocated the right
> >>+quarter of the last level cache, which means the tasks belonging to
> >>+this CAT cgroup can use the right quarter of the cache to fill. If it
> >>+has the most significant 8 bits set, it is allocated the left
> >>+half of the cache (8 bits out of 16 represents 50%).
> >>+
> >>+The cache portion defined in the CBM file is available to all tasks
> >>+within the cgroup to fill and these tasks are not allowed to allocate
> >>+space in other parts of the cache.
> >
> >Is there a reason to expose the hardware interface rather
> >than ratios to userspace ?
> >
> >Say, i'd like to allocate 20% of L3 cache to cgroup A,
> >80% to cgroup B.
> >
> >Well, you'd have to expose the shared percentages between
> >any two cgroups (that information is there in the
> >cbm bitmaps, but not in "ratios").
> >
> >One problem i see with exposing cbm bitmasks is that on hardware
> >updates that change cache size or bitmask length, userspace must
> >recalculate the bitmaps.
> >
> >Another is that it's vendor dependent, while ratios (plus shared
> >information for two given cgroups) are not.
> >
> 
> Agree that this interface does not give options to directly allocate
> in terms of percentage. But note that specifying in bitmasks allows
> the user to allocate overlapping cache areas and also since we use
> cgroup we naturally follow the cgroup hierarchy. User should be able
> to convert the bitmasks into intended percentage or size values
> based on the other available cache size info in hooks like cpuinfo.
> 
> We discussed more on this before in the older patches and here is
> one thread where we discussed it for your reference -
> http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> 
> Thanks,
> Vikas

I can't find any discussion about exposing the CBM interface
directly to userspace in that thread.

cpu.shares is written in ratio form, which is much more natural.
Do you see any advantage in maintaining the

(ratio -> cbm bitmasks)

translation in userspace rather than in the kernel?
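
To make the question concrete, here is a rough sketch of what that
(ratio -> cbm bitmasks) translation involves, wherever it ends up living.
The function name, the rounding rule and the 'shift' parameter are my
own assumptions for illustration, not anything from the patch; the
2-bit minimum and the contiguity requirement come from the quoted rdt.txt.

```python
def ratio_to_cbm(percent, cbm_len, shift=0, min_bits=2):
    """Hypothetical helper: build a contiguous bitmask covering roughly
    `percent` of a cache whose full mask is `cbm_len` bits wide.

    `shift` places the mask above the least significant bit; `min_bits`
    mirrors the 2-bit minimum mentioned in the quoted documentation.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Round to the nearest whole number of mask bits (integer arithmetic).
    nbits = (cbm_len * percent + 50) // 100
    # Clamp to the allowed range of bits.
    nbits = max(min_bits, min(nbits, cbm_len))
    if shift + nbits > cbm_len:
        raise ValueError("mask does not fit in the available bits")
    return ((1 << nbits) - 1) << shift


if __name__ == "__main__":
    # 25% of a 16-bit mask, placed at the low end.
    print(hex(ratio_to_cbm(25, 16)))
    # 50% of a 16-bit mask, placed at the high end.
    print(hex(ratio_to_cbm(50, 16, shift=8)))
```

Note that the result depends on cbm_len, so a percentage that maps to a
whole number of bits on one machine may be rounded on another. That is
the recalculation userspace would have to redo on every hardware change.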

What about something like:

		      root cgroup
		   /		  \
		  /		    \
		/		      \
	cgroupA-80			cgroupB-30

So that whatever exceeds 100% is the ratio of cache
shared at that level (80 + 30 = 110, so cgroup A and B share 10%
of cache at that level).
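
As a sketch of how the 80/30 example could map onto bitmasks: assume a
hypothetical 20-bit CBM (so that the percentages land on whole bits),
give cgroup A the most significant bits and cgroup B the least
significant ones, and let them meet in the middle. The function name and
the placement policy are assumptions of mine, only meant to show that the
overlap falls out of the two ratios.

```python
def overlapping_cbms(pct_a, pct_b, cbm_len):
    """Hypothetical layout: A takes the top bits, B takes the bottom bits.

    Returns (cbm_a, cbm_b, shared), where `shared` is the mask of bits
    that both groups may fill.
    """
    bits_a = cbm_len * pct_a // 100
    bits_b = cbm_len * pct_b // 100
    if not (0 < bits_a <= cbm_len and 0 < bits_b <= cbm_len):
        raise ValueError("each ratio must map to between 1 and cbm_len bits")
    # A: contiguous run ending at the most significant bit.
    cbm_a = ((1 << bits_a) - 1) << (cbm_len - bits_a)
    # B: contiguous run starting at the least significant bit.
    cbm_b = (1 << bits_b) - 1
    return cbm_a, cbm_b, cbm_a & cbm_b


if __name__ == "__main__":
    a, b, shared = overlapping_cbms(80, 30, 20)
    print(hex(a), hex(b), hex(shared))
    # Percentage of the cache shared between the two groups.
    print(bin(shared).count("1") * 100 // 20)
```

With these inputs A gets 16 of 20 bits, B gets 6, and 2 bits (10%) are
common to both, which matches the 10% in the diagram.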

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html

cpu — the cpu.shares parameter determines the share of CPU resources
available to each process in all cgroups. Setting the parameter to 250,
250, and 500 in the finance, sales, and engineering cgroups respectively
means that processes started in these groups will split the resources
with a 1:1:2 ratio. Note that when a single process is running, it
consumes as much CPU as necessary no matter which cgroup it is placed
in. The CPU limitation only comes into effect when two or more processes
compete for CPU resources.
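
The arithmetic in that paragraph can be checked with a few lines. This is
only an illustration of how share values reduce to ratios; the function
names are made up and this is not how the scheduler computes anything.

```python
from functools import reduce
from math import gcd


def share_fractions(shares):
    """Fraction of the resource each group gets when all groups compete."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def share_ratio(shares):
    """Reduce the share values to their smallest integer ratio."""
    divisor = reduce(gcd, shares.values())
    return [value // divisor for value in shares.values()]


if __name__ == "__main__":
    # Values taken from the quoted Resource Management Guide example.
    shares = {"finance": 250, "sales": 250, "engineering": 500}
    print(share_fractions(shares))
    print(share_ratio(shares))
```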