* Cache Allocation Technology Design
@ 2014-10-16 18:44 vikas
  2014-10-20 16:18 ` Matt Fleming
  2014-11-03 23:29 ` Vikas Shivappa
  0 siblings, 2 replies; 39+ messages in thread
From: vikas @ 2014-10-16 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: matt.fleming, will.auld, tj, vikas.shivappa

Hi All,

We have put together a draft design document for Cache Allocation
Technology below. Please review it and let us know any feedback.

Make sure you cc my email vikas.shivappa@linux.intel.com when replying.

Thanks,
Vikas

What is Cache Allocation Technology ( CAT )
-------------------------------------------

Cache Allocation Technology provides a way for the software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'. This feature is used when allocating a
line in the cache, i.e. when pulling new data into the cache. The
hardware is programmed via MSRs.

The different cache subsets are identified by a CLOS identifier (class
of service) and each CLOS has a CBM (cache bit mask). The CBM is a
contiguous set of bits which defines the amount of cache resource that
is available for each 'subset'.

Why is CAT (cache allocation technology) needed
------------------------------------------------

CAT enables more cache resources to be made available for higher
priority applications based on guidance from the execution environment.

The architecture also allows these subsets to be changed dynamically at
runtime to further optimize the performance of the higher priority
application with minimal degradation to the low priority application.
Additionally, resources can be rebalanced for system throughput
benefit. (Refer to Section 17.15 in the Intel SDM
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

This technique may be useful in managing large computer systems with a
large LLC, for example large servers running instances of web servers
or database servers. In such complex systems, these subsets can be used
for more careful placement of the available cache resources.

The CAT kernel patch would provide a basic kernel framework for users
to implement such cache subsets.


Kernel implementation Overview
-------------------------------

The kernel implements a cgroup subsystem to support Cache Allocation.

Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each
cgroup would have one CBM and would represent just one cache 'subset'.

The user would be allowed to create as many directories as there are
CLOSs defined by the hardware. If the user tries to create more than
the available CLOSs, -ENOSPC is returned (a rough sketch of this
mapping is shown after the mode descriptions below). Currently we
support only one level of directory, i.e. directories can be created
only under the root.

There are 2 modes supported:

1. Affinitized mode: Each CAT cgroup is affinitized to a set of CPUs
specified by the 'cpus' file. The tasks in the CAT cgroup would be
constrained to run only on the CPUs in the 'cpus' file. The CPUs in
this file are exclusively used for this cgroup. Requests by a task
using sched_setaffinity() would be filtered through the task's 'cpus'.

These tasks would get to fill the part of the LLC represented by the
cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as the
existing cpumask data structure.

2. Non-affinitized mode: Each CAT cgroup (in turn a 'subset') would be
for a group of tasks. There is no 'cpus' file, and the CPUs that the
tasks run on are not restricted by the CAT cgroup.
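As a rough illustration of the CLOS <-> CBM mapping and the -ENOSPC
behaviour described in the overview above, a minimal sketch of the
per-cgroup state and the directory-creation path might look as follows.
This is not the posted patch: the names cat_cgroup, css_to_cat and
MAX_CLOS are hypothetical, and MAX_CLOS would really be enumerated via
CPUID.

#include <linux/cgroup.h>
#include <linux/slab.h>
#include <linux/bitmap.h>
#include <linux/err.h>

#define MAX_CLOS 16			/* sketch only; really from CPUID */

struct cat_cgroup {
	struct cgroup_subsys_state css;
	u32 closid;			/* hardware class of service backing this group */
	u64 cbm;			/* cache bit mask of this cache 'subset' */
};

static DECLARE_BITMAP(closid_used, MAX_CLOS);

static struct cat_cgroup *css_to_cat(struct cgroup_subsys_state *css)
{
	return container_of(css, struct cat_cgroup, css);
}

/* mkdir of a new CAT cgroup: one directory consumes one hardware CLOS */
static struct cgroup_subsys_state *
cat_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct cat_cgroup *cq;
	int id;

	id = find_first_zero_bit(closid_used, MAX_CLOS);
	if (id >= MAX_CLOS)
		return ERR_PTR(-ENOSPC);	/* hardware CLOS ids exhausted */

	cq = kzalloc(sizeof(*cq), GFP_KERNEL);
	if (!cq)
		return ERR_PTR(-ENOMEM);

	set_bit(id, closid_used);
	cq->closid = id;
	/* one possible default: start with the parent's mask (root gets all bits) */
	cq->cbm = parent_css ? css_to_cat(parent_css)->cbm : ~0ULL;
	return &cq->css;
}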
Assignment of CBM, CLOS and modes
---------------------------------

The root directory would have all bits set in its 'cbm' file by
default.

The cbm_max file in the root defines the maximum number of bits
describing the available cache units. For example, if cbm_max is 16
then the 'cbm' cannot have more than 16 bits set.

The 'affinitized' file is either 0 or 1, representing the two modes.
The system would boot in affinitized mode with all bits in 'cbm' set,
meaning all CPUs have 100% of the cache (effectively, cache allocation
is not in effect).

The 'cbm' file is restricted to having no more than its cbm_max least
significant bits set. Any contiguous subset of these bits may be set to
indicate the cache mapping desired. The 'cbm' of 2 directories can
overlap. The 'cbm' would represent the cache 'subset' of the CAT
cgroup. For example, on a system with 16 max cbm bits, if a directory
has the least significant 4 bits set in its 'cbm' file, it would be
allocated the right quarter of the last level cache, which means the
tasks belonging to this CAT cgroup can fill only the right quarter of
the cache. If it has the most significant 8 bits set, it would be
allocated the left half of the cache (8 bits out of 16 represents 50%).

The cache subset would be affinitized to a set of CPUs in affinitized
mode. The CPUs to which this allocation is affinitized are represented
by the 'cpus' file. The 'cpus' of different directories need to be
mutually exclusive.

The cache portion defined in the CBM file is available to all tasks
within the CAT group, and these tasks are not allowed to allocate space
in other parts of the cache.

The 'cbm' file is used in both modes, whereas the 'cpus' file is
relevant only in affinitized mode and would disappear in
non-affinitized mode.


Scheduling and Context Switch
------------------------------

In affinitized mode, the cache 'subset' and the tasks in a CAT cgroup
are affinitized to the CPUs represented by the CAT cgroup's 'cpus'
file, i.e. when the user sets 'cbm' to 'portion', 'cpus' to c and
'tasks' to t, the tasks 't' would always be scheduled on CPUs 'c' and
will get to fill the allocated 'portion' of the last level cache.

As noted above, in affinitized mode the tasks in a CAT cgroup would
also be affinitized to the CPUs in the 'cpus' file of the directory.
The following hooks in the kernel are required to implement this (along
the lines of the cpuset code):
- in sched_setaffinity, to mask the requested cpu mask with what is
  present in the task's 'cpus'
- in migrate_task, to migrate the tasks only to those CPUs in the
  'cpus' file if possible
- in select_task_rq

In non-affinitized mode 'affinitized' is 0, and the 'tasks' file
indicates the tasks the cache subset is affinitized to. When the user
adds tasks to the tasks file, the tasks would get to fill the cache
subset represented by the CAT cgroup's 'cbm' file.

During a context switch the kernel implements this by writing the
corresponding CLOSid (maintained internally by the kernel) of the CAT
cgroup to the CPU's IA32_PQR_ASSOC MSR.
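A minimal sketch of that context-switch hook is shown below. This is
not the posted patch: task_cat_closid() and the per-CPU variable are
hypothetical, and the exact bit layout of IA32_PQR_ASSOC (class of
service in the upper 32 bits, monitoring RMID in the low bits) should
be checked against the SDM.

#include <linux/sched.h>
#include <linux/percpu.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

static DEFINE_PER_CPU(u32, cat_cur_closid);

/*
 * Hypothetical hook called on the context-switch path when 'next' is
 * about to run.  task_cat_closid() stands in for "CLOSid of the CAT
 * cgroup this task belongs to".
 */
static inline void cat_sched_in(struct task_struct *next)
{
	u32 closid = task_cat_closid(next);

	/* MSR writes are expensive: skip if the CLOS does not change */
	if (closid == this_cpu_read(cat_cur_closid))
		return;

	/* low dword: RMID (0 here, monitoring unused); high dword: CLOS */
	wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
	this_cpu_write(cat_cur_closid, closid);
}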
Usage and Example
-----------------

The following would mount the cache allocation cgroup subsystem and
create 2 directories. Please refer to Documentation/cgroups/cgroups.txt
for details about how to use cgroups.

  cd /sys/fs/cgroup
  mkdir cachealloc
  mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc
  cd cachealloc

Create 2 CAT cgroups:

  mkdir group1
  mkdir group2

Following are some of the files in the directory:

  ls
  cachealloc.cbm
  cachealloc.cpus        (the cpus file only appears in affinitized mode)
  cgroup.procs
  tasks
  cbm_max                (root only)
  affinitized            (root only; by default it is affinitized mode)

Say the cache is 2MB and the cbm supports 16 bits; then the setting
below allocates the right quarter (512KB) of the cache to group2.

Edit the CBM for group2 to set the least significant 4 bits. This
allocates the 'right quarter' of the cache:

  cd group2
  /bin/echo 0xf > cachealloc.cbm

Change the cpus of the directory:

  /bin/echo 1-4 > cachealloc.cpus

Edit the CBM for group2 to set the least significant 8 bits. This
allocates the right half of the cache to 'group2':

  /bin/echo 0xff > cachealloc.cbm

Assign tasks to group2:

  /bin/echo PID1 > tasks
  /bin/echo PID2 > tasks

Now threads PID1 and PID2 run on CPUs 1-4 and get to fill the 'right
half' of the cache. The tasks PID1 and PID2 can only have a subset of
the cpu affinity defined in the 'cpus' file.

Set 'affinitized' to 0; the mode is changed in the root directory:

  cd ..
  /bin/echo 0 > cachealloc.affinitized

Now the tasks and the cache allocation are not affinitized to the CPUs,
and the tasks' cpu affinity is no longer restricted to a subset of the
'cpus' cpumask.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-16 18:44 Cache Allocation Technology Design vikas
@ 2014-10-20 16:18 ` Matt Fleming
  2014-10-24 10:53   ` Peter Zijlstra
  2014-11-03 23:29 ` Vikas Shivappa
  1 sibling, 1 reply; 39+ messages in thread
From: Matt Fleming @ 2014-10-20 16:18 UTC (permalink / raw)
  To: vikas
  Cc: linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa, Peter Zijlstra

(Cc'ing Peter Zijlstra for comments)

On Thu, 16 Oct, at 11:44:10AM, vikas wrote:
> [...]
--
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-20 16:18 ` Matt Fleming @ 2014-10-24 10:53 ` Peter Zijlstra 2014-10-28 23:22 ` Matt Fleming 2014-10-29 17:26 ` Vikas Shivappa 0 siblings, 2 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-24 10:53 UTC (permalink / raw) To: Matt Fleming Cc: vikas, linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote: > > What is Cache Allocation Technology ( CAT ) > > ------------------------------------------- Its a horrible name is what it is, please consider using the old name, that at least was clear in purpose. > > Kernel implementation Overview > > ------------------------------- > > > > Kernel implements a cgroup subsystem to support Cache Allocation. > > > > Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each > > cgroup would have one CBM and would just represent one cache 'subset'. > > > > The user would be allowed to create as many directories as there are > > CLOSs defined by the h/w. If user tries to create more than the > > available CLOSs , -ENOSPC is returned. Currently we support only one > > level of directory, ie directory can be created only under the root. NAK, cgroups must support full hierarchies, simply enforce that the child cgroup's mask is a subset of the parent's. > > There are 2 modes supported > > > > 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs > > specified by the 'cpus' file. The tasks in the CAT cgroup would be > > constrained only on the CPUs in the 'cpus' file. The CPUs in this file > > are exclusively used for this cgroup. Requests by task > > using the sched_setaffinity() would be filtered through the tasks > > 'cpus'. NAK, we will not have yet another cgroup mucking about with task affinities. > > These tasks would get to fill the LLC cache represented by the > > cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as > > the existing cpumask datastructure. > > > > 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be > > for a group of tasks. There is no 'cpus' file and the CPUs that the > > tasks run are not restricted by the CAT cgroup It appears to me this 'mode' thing is entirely superfluous and can be constructed by voluntary operation of this and cpusets or manual affinity calls. > > Assignment of CBM,CLOS and modes > > --------------------------------- > > > > Root directory would have all bits in 'cbm' file by default. > > > > The cbm_max file in the root defines the maximum number of bits > > describing the available cache units. Say if cbm_max is 16 then the > > 'cbm' cannot have more than 16 bits. This seems redundant, if you've already stated that the root cbm is the full set, there is no need to further provide this. > > The 'cbm' file is restricted to having no more than its cbm_max least > > significant bits set. Any contiguous subset of these bits maybe set to > > indication the cache mapping desired. The 'cbm' between 2 directories > > can overlap. The 'cbm' would represent the cache 'subset' of the CAT > > cgroup. This would follow from the hierarchy requirement/conditions. > > Scheduling and Context Switch > > ------------------------------ > > In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file > > indicate the tasks the cache subset is affinitized to. When user adds > > tasks to the tasks file , the tasks would get to fill the cache subset > > represented by the CAT cgroup's 'cbm' file. 
> > > > During context switch kernel implements this by writing the > > corresponding CLOSid (internally maintained by kernel) of the CAT > > cgroup to the CPU's IA32_PQR_ASSOC MSR. Right. ^ permalink raw reply [flat|nested] 39+ messages in thread
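For illustration, the kind of check being asked for here -- a 'cbm'
that is non-empty, contiguous, within cbm_max bits and a subset of its
parent's mask -- could be a small helper along these lines (a sketch
only; the function name is hypothetical, not from the posted patches):

#include <linux/types.h>

static bool cat_cbm_valid(u64 cbm, u64 parent_cbm, unsigned int cbm_max)
{
	u64 max_mask = (cbm_max < 64) ? (1ULL << cbm_max) - 1 : ~0ULL;

	if (!cbm || (cbm & ~max_mask))
		return false;		/* empty, or bits above cbm_max */

	if (cbm & ~parent_cbm)
		return false;		/* must be a subset of the parent's mask */

	/*
	 * Contiguity check: adding the lowest set bit to a contiguous run
	 * clears the whole run, so nothing of the original mask survives.
	 */
	return ((cbm + (cbm & -cbm)) & cbm) == 0;
}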
* Re: Cache Allocation Technology Design 2014-10-24 10:53 ` Peter Zijlstra @ 2014-10-28 23:22 ` Matt Fleming 2014-10-29 8:16 ` Peter Zijlstra 2014-10-29 17:26 ` Vikas Shivappa 1 sibling, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-10-28 23:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote: > > NAK, cgroups must support full hierarchies, simply enforce that the > child cgroup's mask is a subset of the parent's. For the specific case of Cache Allocation, if we're creating hierarchies from bitmasks, there's a very clear limit to how we can divide up the bits - we can't support an indefinite number of cgroup directories. What do you mean by "full hierarchies"? -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-28 23:22     ` Matt Fleming
@ 2014-10-29  8:16       ` Peter Zijlstra
  2014-10-29 12:48         ` Matt Fleming
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2014-10-29 8:16 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa

On Tue, Oct 28, 2014 at 11:22:15PM +0000, Matt Fleming wrote:
> On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote:
> >
> > NAK, cgroups must support full hierarchies, simply enforce that the
> > child cgroup's mask is a subset of the parent's.
>
> For the specific case of Cache Allocation, if we're creating hierarchies
> from bitmasks, there's a very clear limit to how we can divide up the
> bits - we can't support an indefinite number of cgroup directories.
>
> What do you mean by "full hierarchies"?

Ah, so one way around that is to only assign a (whats the CQE equivalent
of RMIDs again?) once you stick a task in.

But basically it means you need to allow things like:

  root/virt/more/crap/hostA
                     /hostB
                     /sanityA
       /random/other/yunk

Now, the root will have the entire bitmask set, any child, say
virt/more/crap can also have them all set, and you can maybe only start
differentiating in the /host[AB] bits.

Whether or not it makes sense, libvirt likes to create these pointless
deep hierarchies, as do a lot of other people for that matter.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-29  8:16       ` Peter Zijlstra
@ 2014-10-29 12:48         ` Matt Fleming
  2014-10-29 13:45           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Matt Fleming @ 2014-10-29 12:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa

On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote:
>
> Ah, so one way around that is to only assign a (whats the CQE equivalent
> of RMIDs again?) once you stick a task in.

I think you're after "Class of Service" (CLOS) ID.

Yeah we can do the CLOS ID assignment on-demand but what we can't do
on-demand is the cache bitmask assignment, i.e. how we carve up the LLC.
These need to persist irrespective of which task is running. And it's
the cache bitmask that I'm specifically talking about not allowing
arbitrarily deep nesting.

So if I create a cgroup directory with a mask of 0x3 in the root cgroup
directory for CAT (meow). Then, create two sub-directories, and split my
0x3 bitmask into 0x2 and 0x1, it's impossible to nest any further, i.e.

  /sys/fs/cgroup/cacheqe    0xffffffff
           |
           |
         meow               0x3
          / \
         /   \
       sub1  sub2           0x1, 0x2

Of course the pathological case is creating a cgroup directory with
bitmask 0x1, so you can't have sub-directories because you can't split
the cache allocation at all.

Does this fly in the face of "full hierarchies"? Or is this a reasonable
limitation?

> But basically it means you need to allow things like:
>
>   root/virt/more/crap/hostA
>                      /hostB
>                      /sanityA
>        /random/other/yunk
>
> Now, the root will have the entire bitmask set, any child, say
> virt/more/crap can also have them all set, and you can maybe only start
> differentiating in the /host[AB] bits.
>
> Whether or not it makes sense, libvirt likes to create these pointless
> deep hierarchies, as do a lot of other people for that matter.

OK, this is something I hadn't considered; that you may *not* want to
split the cache bitmask as you move down the hierarchy.

I think that's something we could do without too much pain, though
actually programming that from a user perspective makes my head hurt.

--
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 12:48 ` Matt Fleming @ 2014-10-29 13:45 ` Peter Zijlstra 2014-10-29 16:32 ` Auld, Will 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 13:45 UTC (permalink / raw) To: Matt Fleming Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa On Wed, Oct 29, 2014 at 12:48:34PM +0000, Matt Fleming wrote: > On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote: > > > > Ah, so one way around that is to only assign a (whats the CQE equivalent > > of RMIDs again?) once you stick a task in. > > I think you're after "Class of Service" (CLOS) ID. > > Yeah we can do the CLOS ID assignment on-demand but what we can't do > on-demand is the cache bitmask assignment, i.e. how we carve up the LLC. > These need to persist irrespective of which task is running. And it's > the cache bitmask that I'm specifically talking about not allowing > arbitrarly deep nesting. > > So if I create a cgroup directory with a mask of 0x3 in the root cgroup > directory for CAT (meow). All we now need is a DOG to go woof :-) and they can have a party. > Then, create two sub-directories, and split my > 0x3 bitmask into 0x2 and 0x1, it's impossible to nest any further, i.e. > > /sys/fs/cgroup/cacheqe 0xffffffff > | > | > meow 0x3 > / \ > / \ > sub1 sub2 0x1, 0x2 > > Of course the pathological case is creating a cgroup directory with > bitmask 0x1, so you can't have sub-directories because you can't split > the cache allocation at all. > > Does this fly in the face of "full hierarchies"? Or is this a reasonable > limitation? I don't see a reason why we should not allow further children of sub1, they'll all have to have 0x1, but that should be fine, pointless perhaps, but perfectly consistent. > > But basically it means you need to allow things like: > > > > root/virt/more/crap/hostA > > /hostB > > /sanityA > > /random/other/yunk > > > > Now, the root will have the entire bitmask set, any child, say > > virt/more/crap can also have them all set, and you can maybe only start > > differentiating in the /host[AB] bits. > > > > Whether or not it makes sense, libvirt likes to create these pointless > > deep hierarchies, as do a lot of other people for that matter. > > OK, this is something I hadn't considered; that you may *not* want to > split the cache bitmask as you move down the hierarchy. > > I think that's something we could do without too much pain, though > actually programming that from a user perspective makes my head hurt. Right, also note that in the libvirt case, most of the intermediate groups are empty (of tasks) and would thus not actually instantiate a CLOS thingy. ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: Cache Allocation Technology Design
  2014-10-29 13:45           ` Peter Zijlstra
@ 2014-10-29 16:32             ` Auld, Will
  2014-10-29 17:28               ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Auld, Will @ 2014-10-29 16:32 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas, Auld, Will

I may be repeating what Peter has just said, but for elements in the
hierarchy where the mask is the same as its parent's mask there is no
need for a separate CLOS, even in the case where there are tasks in the
group. So we can inherit the CLOS of the parent until both the mask is
different from the parent's and there are tasks in the group.

Thanks,

Will

> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, October 29, 2014 6:45 AM
> To: Matt Fleming
> Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Fleming, Matt; Auld,
> Will; Tejun Heo; Shivappa, Vikas
> Subject: Re: Cache Allocation Technology Design
>
> [...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 16:32 ` Auld, Will @ 2014-10-29 17:28 ` Peter Zijlstra 2014-10-29 17:41 ` Vikas Shivappa 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 17:28 UTC (permalink / raw) To: Auld, Will Cc: Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote: > I maybe repeating what Peter has just said but for elements in the > hierarchy where the mask is the same as its parents mask there is no > need for a separate CLOS even in the case where there are tasks in the > group. So we can inherit the CLOS of the parent until which time both > the mask is different than the parent and there are tasks in the > group. I did not state that explicitly, but I did think about that. We could still wait to allocate a CLOS until at least one such group acquires a task. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 17:28 ` Peter Zijlstra @ 2014-10-29 17:41 ` Vikas Shivappa 2014-10-29 18:22 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-29 17:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas On Wed, 29 Oct 2014, Peter Zijlstra wrote: > On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote: >> I maybe repeating what Peter has just said but for elements in the >> hierarchy where the mask is the same as its parents mask there is no >> need for a separate CLOS even in the case where there are tasks in the >> group. So we can inherit the CLOS of the parent until which time both >> the mask is different than the parent and there are tasks in the >> group. > > I did not state that explicitly, but I did think about that. We could > still wait to allocate a CLOS until at least one such group acquires a > task. > Was wondering if it is a requirement of the 'full hierarchy' for the child to inherit the cbm of parent ? . Alternately we can allocate the CLOSid when a cgroup is created and have an empty cbm - but dont let the tasks to be added unless the user assigns a cbm. Cpuset does something similar where its necessary to set the cpu mask(empty by default) of a cgroup before adding tasks. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 17:41 ` Vikas Shivappa @ 2014-10-29 18:22 ` Tejun Heo 2014-10-30 7:07 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-29 18:22 UTC (permalink / raw) To: Vikas Shivappa Cc: Peter Zijlstra, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote: > Was wondering if it is a requirement of the 'full hierarchy' for the child > to inherit the cbm of parent ? . > Alternately we can allocate the CLOSid when a cgroup is created and have an > empty cbm - but dont let the tasks to be added unless the user assigns a Please don't do that. All controllers must be fully hierarchical, shouldn't fail task migration and always allow execution of member tasks. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 18:22 ` Tejun Heo @ 2014-10-30 7:07 ` Peter Zijlstra 2014-10-30 7:14 ` Peter Zijlstra 2014-10-30 12:43 ` Tejun Heo 0 siblings, 2 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 7:07 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Wed, Oct 29, 2014 at 02:22:34PM -0400, Tejun Heo wrote: > On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote: > > Was wondering if it is a requirement of the 'full hierarchy' for the child > > to inherit the cbm of parent ? . > > Alternately we can allocate the CLOSid when a cgroup is created and have an > > empty cbm - but dont let the tasks to be added unless the user assigns a > > Please don't do that. All controllers must be fully hierarchical, With you so far. > shouldn't fail task migration If this means echo $tid > tasks, then sorry we can't do. There is a limited number of hardware resources backing this thing. At some point they're consumed and something must give. So either we fail mkdir, but that means allocating CLOS IDs for possibly empty cgroups, or we allocate on demand which means failing task assignment. The same -- albeit for a different reason -- is true of the RT sched groups, we simply cannot instantiate them such that tasks can join, sysads _have_ to configure them before we can add tasks to them. > and always allow execution of member tasks. If we accept tasks, they'll run. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:07 ` Peter Zijlstra @ 2014-10-30 7:14 ` Peter Zijlstra 2014-10-30 12:44 ` Tejun Heo 2014-10-30 12:43 ` Tejun Heo 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 7:14 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > and always allow execution of member tasks. This too btw is not strictly speaking possible for all controllers. Most all sched controllers live by the grace of forcing tasks not to run at times (eg. the bandwidth controls), falsifying the 'always'. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:14 ` Peter Zijlstra @ 2014-10-30 12:44 ` Tejun Heo 2014-10-30 13:19 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 12:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > and always allow execution of member tasks. > > This too btw is not strictly speaking possible for all controllers. Most > all sched controllers live by the grace of forcing tasks not to run at > times (eg. the bandwidth controls), falsifying the 'always'. Oh sure, the a task just has to run in a foreseeable future, or rather, a task must not be blocked indefinitely requiring userland intervention to become executable again. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:44 ` Tejun Heo @ 2014-10-30 13:19 ` Peter Zijlstra 2014-10-30 15:25 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 13:19 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote: > On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > > and always allow execution of member tasks. > > > > This too btw is not strictly speaking possible for all controllers. Most > > all sched controllers live by the grace of forcing tasks not to run at > > times (eg. the bandwidth controls), falsifying the 'always'. > > Oh sure, the a task just has to run in a foreseeable future, or > rather, a task must not be blocked indefinitely requiring userland > intervention to become executable again. Like the freezer cgroup you mean? ;-) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 13:19 ` Peter Zijlstra @ 2014-10-30 15:25 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 15:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Thu, Oct 30, 2014 at 02:19:04PM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote: > > On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > > > and always allow execution of member tasks. > > > > > > This too btw is not strictly speaking possible for all controllers. Most > > > all sched controllers live by the grace of forcing tasks not to run at > > > times (eg. the bandwidth controls), falsifying the 'always'. > > > > Oh sure, the a task just has to run in a foreseeable future, or > > rather, a task must not be blocked indefinitely requiring userland > > intervention to become executable again. > > Like the freezer cgroup you mean? ;-) Oh yeah, that's horribly broken. Merging it with jobctl stop is a todo item. This "stuck in a random place in kernel" thing made sense for suspend/hibernation only because the kernel wasn't gonna run anymore. The fact that this got exposed to userland on a running system just shows how little we were thinking while implementing all the controllers. It should be equivalent to layered job control stop so that what's prevented from running is the userland part, not kernel. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:07 ` Peter Zijlstra 2014-10-30 7:14 ` Peter Zijlstra @ 2014-10-30 12:43 ` Tejun Heo 2014-10-30 13:18 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 12:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > If this means echo $tid > tasks, then sorry we can't do. There is a > limited number of hardware resources backing this thing. At some point > they're consumed and something must give. And that something shouldn't be disallowing task migration across cgroups. This simply doesn't work with co-mounting or unified hierarchy. cpuset automatically takes on the nearest ancestor's configuration which has enough execution resources. Maybe that can be an option for this too? One of the problems is that we generally assume that a task can run some point in time in a lot of places in the kernel and can't just not run a task indefinitely because it's in a cgroup configured certain way. > So either we fail mkdir, but that means allocating CLOS IDs for possibly > empty cgroups, or we allocate on demand which means failing task > assignment. Can't fail mkdir or css enabling either. Again, co-mounting and unified hierarchy. Also, the behavior is just horrible to use from userland. > The same -- albeit for a different reason -- is true of the RT sched > groups, we simply cannot instantiate them such that tasks can join, > sysads _have_ to configure them before we can add tasks to them. Yeah, RT is one of the main items which is problematic, more so because it's currently coupled with the normal sched controller and the default config doesn't have any RT slice. Do we completely block RT task w/o slice? Is that okay? Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo @ 2014-10-30 13:18 ` Peter Zijlstra 2014-10-30 17:03 ` Tejun Heo 2014-10-30 14:14 ` Matt Fleming ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 13:18 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > Hello, Peter. > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > If this means echo $tid > tasks, then sorry we can't do. There is a > > limited number of hardware resources backing this thing. At some point > > they're consumed and something must give. > > And that something shouldn't be disallowing task migration across > cgroups. This simply doesn't work with co-mounting or unified > hierarchy. cpuset automatically takes on the nearest ancestor's > configuration which has enough execution resources. Maybe that can be > an option for this too? It will give very random and nondeterministic behaviour and basically destroy the entire purpose of the controller (which are the very same reasons I detest that 'new' behaviour in cpusets). > One of the problems is that we generally assume that a task can run > some point in time in a lot of places in the kernel and can't just not > run a task indefinitely because it's in a cgroup configured certain > way. Refusing tasks into a previously empty cgroup creates no such problems. Its already in a cgroup (wherever its parent was) and it can run there, failing to move it to another does not affect things. > > So either we fail mkdir, but that means allocating CLOS IDs for possibly > > empty cgroups, or we allocate on demand which means failing task > > assignment. > > Can't fail mkdir or css enabling either. Again, co-mounting and > unified hierarchy. Also, the behavior is just horrible to use from > userland. In order to fix the co-mounting and unified hierarchy I still need to hear a proposal for that tasks vs processes thing. Traditionally the cgroups were task based, but many controllers are process based (simply because what they control is process wide, not per task), and there was talk (2-3 years ago or so) about making the entire cgroup thing per process, which obviously fails for all scheduler related cgroups. > > The same -- albeit for a different reason -- is true of the RT sched > > groups, we simply cannot instantiate them such that tasks can join, > > sysads _have_ to configure them before we can add tasks to them. > > Yeah, RT is one of the main items which is problematic, more so > because it's currently coupled with the normal sched controller and > the default config doesn't have any RT slice. Simply because you cannot give a slice on creation; or if you did that would mean failing mkdir when a new cgroup would exceed the available time. Also any !0 slice is wrong because it will not match the requirements of the proposed workload, the administrator will have to set it to match the workload. Therefore 0. > Do we completely block RT task w/o slice? Is that okay? We will not allow an RT task in, the write to the tasks file will fail. The same will be true for deadline tasks, we'll fail entry into a cgroup when the combined requirements of the tasks exceed the provisions of the group. There is just no way around that and still provide sane semantics. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 13:18 ` Peter Zijlstra @ 2014-10-30 17:03 ` Tejun Heo 2014-10-30 21:43 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 17:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hey, Peter. On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > > And that something shouldn't be disallowing task migration across > > cgroups. This simply doesn't work with co-mounting or unified > > hierarchy. cpuset automatically takes on the nearest ancestor's > > configuration which has enough execution resources. Maybe that can be > > an option for this too? > > It will give very random and nondeterministic behaviour and basically > destroy the entire purpose of the controller (which are the very same > reasons I detest that 'new' behaviour in cpusets). I agree with you that this is a corner case behavior which deviates from the usual behavior; however, the deviation is inherent. This stems from the fact that the kernel in general doesn't allow tasks which cannot be run. You say that you detest the new behaviors of cpuset; however, the old behaviors were just as sucky - bouncing tasks to an ancestor cgroup forcifully and without any indication or way to restore the previous configuration. What's different with the new behavior is that it explicitly distinguishes between the configured and effective configurations as the kernel isn't capable for actually enforcing certain subset of configurations. So, the inherent problem is always there no matter what we do and the question is that of a policy to deal with it. One of the main issues I see with failing cgroup-level operations for controller specific reasons is lack of visibility. All you can get out of a failed operation is a single error return and there's no good way to communicate why something isn't working, well not even who's the culprit. Having "effective" vs "configured" makes it explicit that the kernel isn't capable of honoring all configurations and make the details of the situation visible. Another part is inconsistencies across controllers. This sure is worse when there are multiple controllers involved but inconsistent behaviors across different hierarchies are annoying all the same with single controller multiple hierarchies. Userland often manages some of those hierarchies together and it can get horribly confusing. No matter what, we need to settle on a single policy and having effective configuration seems like the better one. > > One of the problems is that we generally assume that a task can run > > some point in time in a lot of places in the kernel and can't just not > > run a task indefinitely because it's in a cgroup configured certain > > way. > > Refusing tasks into a previously empty cgroup creates no such problems. > Its already in a cgroup (wherever its parent was) and it can run there, > failing to move it to another does not affect things. Yeah, sure, hard failing can work too. It didn't work well for cpuset because a runnable configuration may become not so if the system config changes afterwards but this probably doesn't have an issue like that. I'm not saying something like the above won't work. It'd but I don't think that's the right place to fail. This controller might not even require the distinction between configured and effective tho? 
Can't a new child just inherit the parent's configuration and never allow the config to become completely empty? The problem cpuset faces is that of underlying hardware configuration changing. This one doesn't have that. > > > So either we fail mkdir, but that means allocating CLOS IDs for possibly > > > empty cgroups, or we allocate on demand which means failing task > > > assignment. > > > > Can't fail mkdir or css enabling either. Again, co-mounting and > > unified hierarchy. Also, the behavior is just horrible to use from > > userland. > > In order to fix the co-mounting and unified hierarchy I still need to > hear a proposal for that tasks vs processes thing. > > Traditionally the cgroups were task based, but many controllers are > process based (simply because what they control is process wide, not per > task), and there was talk (2-3 years ago or so) about making the entire > cgroup thing per process, which obviously fails for all scheduler > related cgroups. Yeah, it needs to be a separate interface where a given userland task can access its own knobs in a race-free way (cgroup interface can't even do that) whether that's a pseudo filesystem, say, /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless of what happens with cgroup. cgroup simply isn't a suitable mechanism to expose these types of knobs to individual userland threads. > > Yeah, RT is one of the main items which is problematic, more so > > because it's currently coupled with the normal sched controller and > > the default config doesn't have any RT slice. > > Simply because you cannot give a slice on creation; or if you did that > would mean failing mkdir when a new cgroup would exceed the available > time. > > Also any !0 slice is wrong because it will not match the requirements of > the proposed workload, the administrator will have to set it to match > the workload. > > Therefore 0. As long as RT is separate from normal sched controller, this *could* be fine. The main problem now is that userland which wants to use the cpu controller but doesn't want to fully manage RT slices end up disabling RT slices. It might work if a new child can share the parent's slice till explicitly configured. Another problem is when you wanna change the configuration after the hierarchy is already populated. I don't know. I'd even be happy with cgroup not having anything to do with RT slice distribution. Do you have any ideas which can make RT slice distribution more palatable? If we can't decouple the two, we'd be effectively requiring whoever is managing the cpu controller to also become a full-fledged RT slice arbitrator, which might actually work too. > > Do we completely block RT task w/o slice? Is that okay? > > We will not allow an RT task in, the write to the tasks file will fail. > > The same will be true for deadline tasks, we'll fail entry into a cgroup > when the combined requirements of the tasks exceed the provisions of the > group. > > There is just no way around that and still provide sane semantics. Can't a task just lose RT / deadline properties when migrating into a different RT / deadline domain? We already modify task properties on migration for cpuset after all. It'd be far simpler that way. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 17:03 ` Tejun Heo @ 2014-10-30 21:43 ` Peter Zijlstra 2014-10-30 22:22 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 21:43 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 01:03:31PM -0400, Tejun Heo wrote: > Hey, Peter. > > On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote: > > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > > > And that something shouldn't be disallowing task migration across > > > cgroups. This simply doesn't work with co-mounting or unified > > > hierarchy. cpuset automatically takes on the nearest ancestor's > > > configuration which has enough execution resources. Maybe that can be > > > an option for this too? > > > > It will give very random and nondeterministic behaviour and basically > > destroy the entire purpose of the controller (which are the very same > > reasons I detest that 'new' behaviour in cpusets). > > I agree with you that this is a corner case behavior which deviates > from the usual behavior; however, the deviation is inherent. This > stems from the fact that the kernel in general doesn't allow tasks > which cannot be run. You say that you detest the new behaviors of > cpuset; however, the old behaviors were just as sucky - bouncing tasks > to an ancestor cgroup forcifully and without any indication or way to > restore the previous configuration. What's different with the new > behavior is that it explicitly distinguishes between the configured > and effective configurations as the kernel isn't capable for actually > enforcing certain subset of configurations. If a cpu bounces (by accident or whatever) then there is no trace left behind that the system didn't in fact observe/obey its constraints. It should have provided an error or failed the hotplug. But we digress, lets not have this discussion (again :) and focus on the new thing. > So, the inherent problem is always there no matter what we do and the > question is that of a policy to deal with it. One of the main issues > I see with failing cgroup-level operations for controller specific > reasons is lack of visibility. All you can get out of a failed > operation is a single error return and there's no good way to > communicate why something isn't working, well not even who's the > culprit. Having "effective" vs "configured" makes it explicit that > the kernel isn't capable of honoring all configurations and make the > details of the situation visible. Right, so that is a short coming of the co-mount idea. Your effective vs configured thing is misleading and surprising though. Operations might 'succeed' and still have failed, without any clear indication/notification of change. > Another part is inconsistencies across controllers. This sure is > worse when there are multiple controllers involved but inconsistent > behaviors across different hierarchies are annoying all the same with > single controller multiple hierarchies. Userland often manages some > of those hierarchies together and it can get horribly confusing. No > matter what, we need to settle on a single policy and having effective > configuration seems like the better one. I'm not entirely sure I follow. Without co-mounting its entirely obvious which one is failing. 
Also, per the previous point, since you need a notification channel anyway, you might as well do the expected fail and report more details through that. > > > One of the problems is that we generally assume that a task can run > > > some point in time in a lot of places in the kernel and can't just not > > > run a task indefinitely because it's in a cgroup configured certain > > > way. > > > > Refusing tasks into a previously empty cgroup creates no such problems. > > Its already in a cgroup (wherever its parent was) and it can run there, > > failing to move it to another does not affect things. > > Yeah, sure, hard failing can work too. It didn't work well for cpuset > because a runnable configuration may become not so if the system > config changes afterwards but this probably doesn't have an issue like > that. I'm not saying something like the above won't work. It'd but I > don't think that's the right place to fail. Right, this thing doesn't suffer that particular problem if its good it stays good. > This controller might not even require the distinction between > configured and effective tho? Can't a new child just inherit the > parent's configuration and never allow the config to become completely > empty? It can do that. But that still has a problem, there is a mapping in hardware which restricts the number of active configurations. The total configuration space is larger than the supported active configurations. So _something_ must fail. The initial proposal was mkdir failing when there were more than the hardware supported active config cgroup directories. The alternative was on-demand activation where we only allocate the hardware resource when the first task gets moved into the group -- which then clearly can fail. > > Traditionally the cgroups were task based, but many controllers are > > process based (simply because what they control is process wide, not per > > task), and there was talk (2-3 years ago or so) about making the entire > > cgroup thing per process, which obviously fails for all scheduler > > related cgroups. > > Yeah, it needs to be a separate interface where a given userland task > can access its own knobs in a race-free way (cgroup interface can't > even do that) whether that's a pseudo filesystem, say, > /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless > of what happens with cgroup. cgroup simply isn't a suitable mechanism > to expose these types of knobs to individual userland threads. I'm not sure what you're saying there. You want to replace the task-controllers with another pseudo filesystem that does it differently but still is a hierarchical controller?, how is that different from just not co-mounting the task and process based controllers, either way you end up with 2 separate hierarchies. > > > Yeah, RT is one of the main items which is problematic, more so > > > because it's currently coupled with the normal sched controller and > > > the default config doesn't have any RT slice. > > > > Simply because you cannot give a slice on creation; or if you did that > > would mean failing mkdir when a new cgroup would exceed the available > > time. > > > > Also any !0 slice is wrong because it will not match the requirements of > > the proposed workload, the administrator will have to set it to match > > the workload. > > > > Therefore 0. > > As long as RT is separate from normal sched controller, this *could* > be fine. 
> The main problem now is that userland which wants to use the > cpu controller but doesn't want to fully manage RT slices end up > disabling RT slices. I don't get this: who but the admin manages things, and how would you accidentally have an RT app and not know about it? And if you're in that situation you're screwed anyhow, since you've no f'ing clue how to configure your system for it anyhow. At which point you're in deep. > It might work if a new child can share the > parent's slice till explicitly configured. Principle of least surprise. That's surprising behaviour. Why move it in the first place? > Another problem is when > you wanna change the configuration after the hierarchy is already > populated. We fail the configuration change. For RR/FIFO we won't allow you to set the slice to 0 if there are tasks. For deadline we would fail everything that tries to lower things below the utilization required by the tasks (and child groups). > I don't know. I'd even be happy with cgroup not having > anything to do with RT slice distribution. Do you have any ideas > which can make RT slice distribution more palatable? If we can't > decouple the two, we'd be effectively requiring whoever is managing > the cpu controller to also become a full-fledged RT slice arbitrator, > which might actually work too. The admin you mean? He had better know what the heck he's doing if he's running RT apps; great fail is otherwise fairly deterministic in his future. The thing is, you cannot arbitrate this stuff; RR/FIFO are horrible pieces of shit interfaces, they don't describe near enough. People need to be involved. > > > Do we completely block RT task w/o slice? Is that okay? > > > > We will not allow an RT task in, the write to the tasks file will fail. > > > > The same will be true for deadline tasks, we'll fail entry into a cgroup > > when the combined requirements of the tasks exceed the provisions of the > > group. > > > > There is just no way around that and still provide sane semantics. > Can't a task just lose RT / deadline properties when migrating into a > different RT / deadline domain? We already modify task properties on > migration for cpuset after all. It'd be far simpler that way. Again, why move it in the first place? This all sounds like whoever is doing this is clueless. You don't move RT tasks about if you're not intimately aware of them and their requirements. ^ permalink raw reply [flat|nested] 39+ messages in thread
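The rejection policy Peter describes for the RR/FIFO and deadline cases can be modelled in a few lines. The following is only an illustrative userspace model, not the scheduler code; the struct, the -EBUSY error value and the utilization arithmetic are assumptions made for the example:

#include <errno.h>
#include <stdio.h>

/* Illustrative model only: a group with an RT provision and some tasks. */
struct rt_group {
	unsigned long long rt_runtime_us;   /* slice within rt_period */
	unsigned long long rt_period_us;
	int nr_rt_tasks;                    /* RR/FIFO tasks in the group */
	double dl_utilization;              /* summed runtime/period of deadline tasks */
};

/* Reject configuration changes that would invalidate already-admitted tasks. */
static int set_rt_runtime(struct rt_group *g, unsigned long long new_runtime_us)
{
	/* RR/FIFO: a zero slice is refused while the group still has RT tasks. */
	if (new_runtime_us == 0 && g->nr_rt_tasks > 0)
		return -EBUSY;

	/* Deadline: refuse lowering the provision below what the tasks already need. */
	if ((double)new_runtime_us / g->rt_period_us < g->dl_utilization)
		return -EBUSY;

	g->rt_runtime_us = new_runtime_us;
	return 0;
}

int main(void)
{
	struct rt_group g = { 500000, 1000000, 2, 0.3 };

	printf("shrink to 40%%: %d\n", set_rt_runtime(&g, 400000));    /* ok */
	printf("shrink below need: %d\n", set_rt_runtime(&g, 200000)); /* rejected */
	printf("zero with RT tasks: %d\n", set_rt_runtime(&g, 0));     /* rejected */
	return 0;
}

The write to the configuration file is where the failure surfaces, so whoever is changing the slice gets the error directly rather than some task elsewhere silently losing its guarantees.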
* Re: Cache Allocation Technology Design 2014-10-30 21:43 ` Peter Zijlstra @ 2014-10-30 22:22 ` Tejun Heo 2014-10-30 22:47 ` Peter Zijlstra 2014-10-31 13:07 ` Peter Zijlstra 0 siblings, 2 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 22:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote: > If a cpu bounces (by accident or whatever) then there is no trace left > behind that the system didn't in fact observe/obey its constraints. It > should have provided an error or failed the hotplug. But we digress, > lets not have this discussion (again :) and focus on the new thing. Oh, we sure can have notifications / persistent markers to track deviation from the configuration. It's not like the old scheme did much better in this respect. It just wrecked the configuration without telling anyone. If this matters enough, we need error recording / reporting no matter which way we choose. I'm not against that at all. > > So, the inherent problem is always there no matter what we do and the > > question is that of a policy to deal with it. One of the main issues > > I see with failing cgroup-level operations for controller specific > > reasons is lack of visibility. All you can get out of a failed > > operation is a single error return and there's no good way to > > communicate why something isn't working, well not even who's the > > culprit. Having "effective" vs "configured" makes it explicit that > > the kernel isn't capable of honoring all configurations and make the > > details of the situation visible. > > Right, so that is a short coming of the co-mount idea. Your effective vs > configured thing is misleading and surprising though. Operations might > 'succeed' and still have failed, without any clear > indication/notification of change. Hmmm... it gets more pronounced w/ co-mounting but it's also a problem with isolated hierarchies too. How is changing configuration irreversibly without any notification any less surprising? It's the same end result. The only difference is that there's no way to go back when the resource which went offline comes back. I really don't think configuration being silently changed counts as a valid notification mechanism to userland. > > Another part is inconsistencies across controllers. This sure is > > worse when there are multiple controllers involved but inconsistent > > behaviors across different hierarchies are annoying all the same with > > single controller multiple hierarchies. Userland often manages some > > of those hierarchies together and it can get horribly confusing. No > > matter what, we need to settle on a single policy and having effective > > configuration seems like the better one. > > I'm not entirely sure I follow. Without co-mounting its entirely obvious > which one is failing. Sure, "which" is easier w/o co-mounting. "Why" can still be hard tho as migration is an "apply all the configs" event. > Also, per the previous point, since you need a notification channel > anyway, you might as well do the expected fail and report more details > through that. How do you match the failure to the specific migration attempt tho? I really can't think of a good and simple interface for that given the interface that we have. For most controllers, it is fairly straightforward to avoid controller specific migration failures. Sure, cpuset is special but it has to be special one way or the other. 
> > This controller might not even require the distinction between > > configured and effective tho? Can't a new child just inherit the > > parent's configuration and never allow the config to become completely > > empty? > > It can do that. But that still has a problem, there is a mapping in > hardware which restricts the number of active configurations. The total > configuration space is larger than the supported active configurations. > > So _something_ must fail. The initial proposal was mkdir failing when > there were more than the hardware supported active config cgroup > directories. The alternative was on-demand activation where we only > allocate the hardware resource when the first task gets moved into the > group -- which then clearly can fail. Hmmm... why can't it just refuse creating a different configuration when its config space is full? Make children inherit the parent's configuration and refuse config writes which require it to create a new one if the config space is full. Seems pretty straight-forward. What am I missing? > > Yeah, it needs to be a separate interface where a given userland task > > can access its own knobs in a race-free way (cgroup interface can't > > even do that) whether that's a pseudo filesystem, say, > > /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless > > of what happens with cgroup. cgroup simply isn't a suitable mechanism > > to expose these types of knobs to individual userland threads. > > I'm not sure what you're saying there. You want to replace the > task-controllers with another pseudo filesystem that does it differently > but still is a hierarchical controller?, how is that different from just > not co-mounting the task and process based controllers, either way you > end up with 2 separate hierarchies. It doesn't have much to do with co-mounting. The process itself often has to be involved in assigning different properties to its threads. It requires intimate knowledge of which one is doing what, meaning that accessing self's knobs is the most common use case rather than an external entity reaching inside. This means that this should be a programmable interface accessible from each binary. cgroup is horrible for this. A process has to read its path from /proc/self/cgroups and then access the cgroup that it's in, which BTW could have changed in between. It really needs a proper programmable interface which guarantees self access. I don't know what the exact form should be. It can be an extension to sched_setattr(), a new syscall or a pseudo filesystem scoped to the process. > > I don't know. I'd even be happy with cgroup not having > > anything to do with RT slice distribution. Do you have any ideas > > which can make RT slice distribution more palatable? If we can't > > decouple the two, we'd be effectively requiring whoever is managing > > the cpu controller to also become a full-fledged RT slice arbitrator, > > which might actually work too. > > The admin you mean? He had better know what the heck he's doing if he's Resource management is automated in a lot of cases and it's only gonna be more so in the future. It's about having behaviors which are more palatable to that but please read on. > running RT apps, great fail is otherwise fairly deterministic in his > future. > > The thing is, you cannot arbitrate this stuff, RR/FIFO are horrible pieces > of shit interfaces, they don't describe near enough. People need to be > involved. 
So, I think it'd be best if RT/deadline stuff can be separated out so that grouping the usual BE scheduling doesn't affect them, but if that's not feasible, yeah, I agree the only thing which we can do is requiring the entity which is controlling the cpu hierarchy, which may be a human admin or whatever manager, to distribute them explicitly. There doesn't seem to be any way around it. > > Can't a task just lose RT / deadline properties when migrating into a > > different RT / deadline domain? We already modify task properties on > > migration for cpuset after all. It'd be far simpler that way. > > Again, why move it in the first place? This all sounds like whomever is > doing this is clueless. You don't move RT tasks about if you're not > intimately aware of them and their requirements. Oh, seriously, if I could build this thing from ground up, I'd just tie it to process hierarchy and make the associations static. It's just that we can't do that at this point and I'm trying to find a behaviorally simple and acceptable way to deal with task migrations so that neither kernel or userland has to be too complex. So, behaviors which blow configs across migrations and consider them as "fresh" is completely fine by me. I mostly wanna avoid requiring complicated failure handling from the users which most likely won't be tested a lot and crap out when something exceptional happens. If it blows RT/deadline settings reliably on each and every migration and refuse RT priorities or cpu controller configs which can lead the invalid configs, it'd be perfect. This whole thing is really about having consistent behavior patterns which avoid obscure failure modes whenever possible. Unified hierarchy does build on top of those but we do want these consistencies regardless of that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:22 ` Tejun Heo @ 2014-10-30 22:47 ` Peter Zijlstra 2014-11-06 16:27 ` Matt Fleming 2014-10-31 13:07 ` Peter Zijlstra 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 22:47 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Let me reply to just this one, I'll do the rest tomorrow, need sleeps. On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > > > This controller might not even require the distinction between > > > configured and effective tho? Can't a new child just inherit the > > > parent's configuration and never allow the config to become completely > > > empty? > > > > It can do that. But that still has a problem, there is a mapping in > > hardware which restricts the number of active configurations. The total > > configuration space is larger than the supported active configurations. > > > > So _something_ must fail. The initial proposal was mkdir failing when > > there were more than the hardware supported active config cgroup > > directories. The alternative was on-demand activation where we only > > allocate the hardware resource when the first task gets moved into the > > group -- which then clearly can fail. > > Hmmm... why can't it just refuse creating a different configuration > when its config space is full? Make children inherit the parent's > configuration and refuse config writes which require it to create a > new one if the config space is full. Seems pretty straight-forward. > What am I missing? We could do that I suppose; there is one corner case it would not allow: intermediate directories with a restricted config that also have priv restrictions but no actual tasks. Not sure that makes sense though. Are there any other cases I might have missed? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:47 ` Peter Zijlstra @ 2014-11-06 16:27 ` Matt Fleming 2014-11-06 17:20 ` Vikas Shivappa 0 siblings, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-11-06 16:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote: > > Let me reply to just this one, I'll do the rest tomorrow, need sleeps. > > On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > > > > > This controller might not even require the distinction between > > > > configured and effective tho? Can't a new child just inherit the > > > > parent's configuration and never allow the config to become completely > > > > empty? > > > > > > It can do that. But that still has a problem, there is a mapping in > > > hardware which restricts the number of active configurations. The total > > > configuration space is larger than the supported active configurations. > > > > > > So _something_ must fail. The initial proposal was mkdir failing when > > > there were more than the hardware supported active config cgroup > > > directories. The alternative was on-demand activation where we only > > > allocate the hardware resource when the first task gets moved into the > > > group -- which then clearly can fail. > > > > Hmmm... why can't it just refuse creating a different configuration > > when its config space is full? Make children inherit the parent's > > configuration and refuse config writes which require it to create a > > new one if the config space is full. Seems pretty straight-forward. > > What am I missing? > > We could do that I suppose, there is the one corner case that would not > allow, intermediate directories with a restricted config that also have > priv restrictions but no actual tasks. Not sure that makes sense though. Could you elaborate on this configuration? > Are there any other cases I might have missed? I don't think so. So, for the specific CAT case what you're proposing is make the failure case happen when writing to the cache bitmask file instead of failing mkdir() or echo $tid > tasks ? I think that's OK. If we've run out of CLOS ids I would expect to see -ENOSPC returned, whereas if we try and set an invalid bitmask we'd get -EINVAL. Vikas, Will? -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
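As a rough illustration of the -EINVAL half of that, a bitmask check along the lines of the design document could look like the sketch below. This is illustrative userspace C rather than the actual patch; cbm_max is assumed to be 16 as in the earlier example, and the contiguity rule comes from the design text. The -ENOSPC half, running out of CLOS ids, is sketched after Vikas's reply below.

#include <errno.h>
#include <stdio.h>

#define CBM_MAX 16   /* assumed value of the root cgroup's cbm_max file */

/* Validate a proposed cache bitmask: non-empty, within cbm_max bits,
 * and a single contiguous run of set bits. */
static int cbm_validate(unsigned long cbm)
{
	unsigned long filled;

	if (cbm == 0 || cbm >> CBM_MAX)
		return -EINVAL;
	/* Fill in the zeros below the lowest set bit; a contiguous mask then
	 * becomes 2^k - 1, for which x & (x + 1) == 0. */
	filled = cbm | (cbm - 1);
	if ((filled & (filled + 1)) != 0)
		return -EINVAL;
	return 0;
}

int main(void)
{
	printf("0x00f0 -> %d\n", cbm_validate(0x00f0));   /* contiguous: 0 */
	printf("0x0505 -> %d\n", cbm_validate(0x0505));   /* gaps: -EINVAL */
	printf("0x30000 -> %d\n", cbm_validate(0x30000)); /* beyond cbm_max: -EINVAL */
	return 0;
}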
* Re: Cache Allocation Technology Design 2014-11-06 16:27 ` Matt Fleming @ 2014-11-06 17:20 ` Vikas Shivappa 0 siblings, 0 replies; 39+ messages in thread From: Vikas Shivappa @ 2014-11-06 17:20 UTC (permalink / raw) To: Matt Fleming Cc: Peter Zijlstra, Tejun Heo, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 6 Nov 2014, Matt Fleming wrote: > On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote: >> >> Let me reply to just this one, I'll do the rest tomorrow, need sleeps. >> >> On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: >> >>>>> This controller might not even require the distinction between >>>>> configured and effective tho? Can't a new child just inherit the >>>>> parent's configuration and never allow the config to become completely >>>>> empty? >>>> >>>> It can do that. But that still has a problem, there is a mapping in >>>> hardware which restricts the number of active configurations. The total >>>> configuration space is larger than the supported active configurations. >>>> >>>> So _something_ must fail. The initial proposal was mkdir failing when >>>> there were more than the hardware supported active config cgroup >>>> directories. The alternative was on-demand activation where we only >>>> allocate the hardware resource when the first task gets moved into the >>>> group -- which then clearly can fail. >>> >>> Hmmm... why can't it just refuse creating a different configuration >>> when its config space is full? Make children inherit the parent's >>> configuration and refuse config writes which require it to create a >>> new one if the config space is full. Seems pretty straight-forward. >>> What am I missing? >> >> We could do that I suppose, there is the one corner case that would not >> allow, intermediate directories with a restricted config that also have >> priv restrictions but no actual tasks. Not sure that makes sense though. > > Could you elaborate on this configuration? > >> Are there any other cases I might have missed? > > I don't think so. > > So, for the specific CAT case what you're proposing is make the failure > case happen when writing to the cache bitmask file instead of failing > mkdir() or echo $tid > tasks ? > > I think that's OK. If we've run out of CLOS ids I would expect to see > -ENOSPC returned, whereas if we try and set an invalid bitmask we'd get > -EINVAL. > > Vikas, Will? Yes that is correct. You can always create more cgroups and the new cgroup just inherits the mask from the parent and uses the same CLOSid as its parent , so it wont fail because of lack of CLOSids. The only case of failure as you said is when user tries to modify a cbm to a different one. > > -- > Matt Fleming, Intel Open Source Technology Center > ^ permalink raw reply [flat|nested] 39+ messages in thread
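Vikas's answer condenses into a small model: mkdir shares the parent's CLOSid and therefore cannot run out of ids, and only a 'cbm' write that needs a fresh id can fail with -ENOSPC. The refcounting scheme and the four-id limit below are assumptions made up for the illustration, not the actual implementation:

#include <errno.h>
#include <stdio.h>

#define NR_CLOS 4   /* assumed number of CLOS ids provided by the hardware */

static int clos_refcnt[NR_CLOS];          /* cgroups sharing each CLOS id */
static unsigned long clos_cbm[NR_CLOS];   /* CBM programmed for each id */

struct cat_cgroup { int closid; };

/* mkdir never runs out of CLOS ids: the child simply shares its parent's. */
static void cat_create(struct cat_cgroup *child, const struct cat_cgroup *parent)
{
	child->closid = parent->closid;
	clos_refcnt[child->closid]++;
}

/* Writing 'cbm' is the only operation that may need a fresh CLOS id. */
static int cat_set_cbm(struct cat_cgroup *cg, unsigned long cbm)
{
	int i;

	if (clos_refcnt[cg->closid] == 1) {   /* sole user: reprogram in place */
		clos_cbm[cg->closid] = cbm;
		return 0;
	}
	for (i = 0; i < NR_CLOS; i++) {
		if (clos_refcnt[i] == 0) {    /* grab a free CLOS id */
			clos_refcnt[cg->closid]--;
			cg->closid = i;
			clos_refcnt[i] = 1;
			clos_cbm[i] = cbm;
			return 0;
		}
	}
	return -ENOSPC;                       /* all CLOS ids are in use */
}

int main(void)
{
	struct cat_cgroup root = { 0 }, a, b, c, d;

	clos_refcnt[0] = 1;
	clos_cbm[0] = 0xffff;                 /* root: all of the cache */
	cat_create(&a, &root);
	cat_create(&b, &root);
	cat_create(&c, &root);
	cat_create(&d, &root);                /* none of these can fail */

	printf("a: %d\n", cat_set_cbm(&a, 0x00ff));  /* 0 */
	printf("b: %d\n", cat_set_cbm(&b, 0x0f00));  /* 0 */
	printf("c: %d\n", cat_set_cbm(&c, 0x000f));  /* 0 */
	printf("d: %d\n", cat_set_cbm(&d, 0xf000));  /* -ENOSPC */
	return 0;
}

A real implementation would presumably also reuse an existing CLOS whose mask already matches, but the placement of the failure at the 'cbm' write is the point here.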
* Re: Cache Allocation Technology Design 2014-10-30 22:22 ` Tejun Heo 2014-10-30 22:47 ` Peter Zijlstra @ 2014-10-31 13:07 ` Peter Zijlstra 2014-10-31 15:58 ` Tejun Heo 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-31 13:07 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > Hello, > > On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote: > > If a cpu bounces (by accident or whatever) then there is no trace left > > behind that the system didn't in fact observe/obey its constraints. It > > should have provided an error or failed the hotplug. But we digress, > > lets not have this discussion (again :) and focus on the new thing. > > Oh, we sure can have notifications / persistent markers to track > deviation from the configuration. It's not like the old scheme did > much better in this respect. It just wrecked the configuration > without telling anyone. If this matters enough, we need error > recording / reporting no matter which way we choose. I'm not against > that at all. True; then again, hotplug isn't a magical thing, you do it yourself -- with the suspend case being special, I'll grant you that. > > > So, the inherent problem is always there no matter what we do and the > > > question is that of a policy to deal with it. One of the main issues > > > I see with failing cgroup-level operations for controller specific > > > reasons is lack of visibility. All you can get out of a failed > > > operation is a single error return and there's no good way to > > > communicate why something isn't working, well not even who's the > > > culprit. Having "effective" vs "configured" makes it explicit that > > > the kernel isn't capable of honoring all configurations and make the > > > details of the situation visible. > > > > Right, so that is a short coming of the co-mount idea. Your effective vs > > configured thing is misleading and surprising though. Operations might > > 'succeed' and still have failed, without any clear > > indication/notification of change. > > Hmmm... it gets more pronounced w/ co-mounting but it's also problem > with isolated hierarchies too. How is changing configuration > irreversibly without any notificaiton any less surprising? It's the > same end result. The only difference is that there's no way to go > back when the resource which went offline comes back. I really don't > think configuration being silently changed counts as a valid > notification mechanism to userland. I think we're talking past one another here. You said the problem was that failing migrate is that you've no clue which controller failed in the co-mount case. With isolated hierarchies you do know. But then you continue talk about cpuset and hotplug. Now the thing with that is, the only one doing hotplug is the admin (I know there's a few kernel side hotplug but they're BUGs and I even NAKed a few, which didn't stop them from being merged) -- the exception being suspend, suspend is special because 1) there's a guarantee the CPU will actually come back and 2) its unobservable, userspace never sees the CPUs go away and come back because its frozen. The only real way to hotplug is if you do it your damn self, and its also you who setup the cpuset, so its fully on you if shit happens. No real magic there. 
Except now people seem to want to wrap it into magic and hide it all from the admin, pretend it's not there and make it uncontrollable. Kernel side hotplug is broken for a myriad of reasons, but let's not diverge too far here. > > > Another part is inconsistencies across controllers. This sure is > > > worse when there are multiple controllers involved but inconsistent > > > behaviors across different hierarchies are annoying all the same with > > > single controller multiple hierarchies. Userland often manages some > > > of those hierarchies together and it can get horribly confusing. No > > > matter what, we need to settle on a single policy and having effective > > > configuration seems like the better one. > > > > I'm not entirely sure I follow. Without co-mounting its entirely obvious > > which one is failing. > > Sure, "which" is easier w/o co-mounting. "Why" can still be hard tho as > migration is an "apply all the configs" event. Typically controllers don't control too many configs at once and the specific return error could be a good hint there. > > Also, per the previous point, since you need a notification channel > > anyway, you might as well do the expected fail and report more details > > through that. > > How do you match the failure to the specific migration attempt tho? I > really can't think of a good and simple interface for that given the > interface that we have. For most controllers, it is fairly straight > forward to avoid controller specific migration failures. Sure, cpuset > is special but it has to be special one way or the other. You can include in the msg the pid that was just attempted, in the pid namespace of the observer; if the pid is not available in that namespace, discard the message since the observer could not possibly have done the deed. > It doesn't have much to do with co-mounting. > > The process itself often has to be involved in assigning different > properties to its threads. It requires intimate knowledge of which > one is doing what, meaning that accessing self's knobs is the most > common use case rather than an external entity reaching inside. This > means that this should be a programmable interface accessible from > each binary. cgroup is horrible for this. A process has to read its path > from /proc/self/cgroups and then access the cgroup that it's in, which > BTW could have changed in between. > > It really needs a proper programmable interface which guarantees self > access. I don't know what the exact form should be. It can be an > extension to sched_setattr(), a new syscall or a pseudo filesystem > scoped to the process. That's an entirely separate issue; and I don't see that solving the task vs process issue at all. > > The admin you mean? He had better know what the heck he's doing if he's > > Resource management is automated in a lot of cases and it's only gonna > be more so in the future. It's about having behaviors which are more > palatable to that but please read on. > > > running RT apps, great fail is otherwise fairly deterministic in his > > future. > > > > The thing is, you cannot arbitrate this stuff, RR/FIFO are horrible pieces > > of shit interfaces, they don't describe near enough. People need to be > > involved. 
> > So, I think it'd be best if RT/deadline stuff can be separated out so > that grouping the usual BE scheduling doesn't affect them, but if > that's not feasible, yeah, I agree the only thing which we can do is > requiring the entity which is controlling the cpu hierarchy, which may > be a human admin or whatever manager, to distribute them explicitly. > There doesn't seem to be any way around it. Automation is nice and all, but RT is about providing determinism and guarantees. Unless you morph into a full blown RT aware muddleware and have all your RT apps communicate their requirements (ie. rewrite them all) to it, this is a non-starter. Given that the RR/FIFO APIs are not communicating enough and we need to support them anyhow, human intervention it is. > > > Can't a task just lose RT / deadline properties when migrating into a > > > different RT / deadline domain? We already modify task properties on > > > migration for cpuset after all. It'd be far simpler that way. > > > > Again, why move it in the first place? This all sounds like whoever is > > doing this is clueless. You don't move RT tasks about if you're not > > intimately aware of them and their requirements. > > Oh, seriously, if I could build this thing from ground up, I'd just > tie it to process hierarchy and make the associations static. This thing being cgroups? I'm not sure static associations cater for the various use cases that people have. > It's > just that we can't do that at this point and I'm trying to find a > behaviorally simple and acceptable way to deal with task migrations so > that neither kernel or userland has to be too complex. Sure, simple and consistent is all good, but we should also not make it too simple and thereby exclude useful things. > So, behaviors > which blow configs across migrations and consider them as "fresh" is > completely fine by me. It's not by me; it's completely surprising and counter-intuitive. > I mostly wanna avoid requiring complicated > failure handling from the users which most likely won't be tested a > lot and crap out when something exceptional happens. Smells like you just want to pretend nothing bad happens when you do stupid. I prefer to fail early and fail hard over pretend happy and surprise behaviour any day. > This whole thing is really about having consistent behavior patterns > which avoid obscure failure modes whenever possible. Unified > hierarchy does build on top of those but we do want these > consistencies regardless of that. I'm all for consistency, but I abhor make-believe. And while I like the unified hierarchy thing conceptually, I'm by now fairly sure reality is about to ruin it. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-31 13:07 ` Peter Zijlstra @ 2014-10-31 15:58 ` Tejun Heo 2014-11-04 13:13 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-31 15:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Fri, Oct 31, 2014 at 02:07:38PM +0100, Peter Zijlstra wrote: > I think we're talking past one another here. You said the problem was > that failing migrate is that you've no clue which controller failed in > the co-mount case. With isolated hierarchies you do know. Yes, with co-mounting, the issue becomes worse but I think it's still not ideal even without co-mounting because the error reporting ends up conflating task organization operation and application of configurations. More on this later. > But then you continue talk about cpuset and hotplug. Now the thing with > that is, the only one doing hotplug is the admin (I know there's a few > kernel side hotplug but they're BUGs and I even NAKed a few, which > didn't stop them from being merged) -- the exception being suspend, > suspend is special because 1) there's a guarantee the CPU will actually > come back and 2) its unobservable, userspace never sees the CPUs go away > and come back because its frozen. > > The only real way to hotplug is if you do it your damn self, and its > also you who setup the cpuset, so its fully on you if shit happens. > > No real magic there. Except now people seem to want to wrap it into > magic and hide it all from the admin, pretend its not there and make it > uncontrollable. Hmmm... I think a difference is how we perceive userspace is composed and interacts with the various aspects of kernel. But even in the presence of a competent admin that you're suggesting, interactions of different aspects of a system are often compartmentalized. e.g. an admin configuring cpuset to accomodate a given set of persistent and important workload isn't too likely to expect a memory unit soft failure in several weeks and the need to hot-swap the memory module. It just isn't cost-effective enough to lump those two planes of planning into the same activity especially if the admin is hand-crafting the configuration. The issue that I see with the current method is that a much rare exception condition ends up messing up configurations which is on a different plane and that there's no recourse once that happens. If the said workload keeps forking, there's no easy way to recover the previous configuration. Both ways of handling the situation have components of surprise but as I wrote before that surprise is inherent and comes from the fact that the kernel can't afford tasks which aren't runnable. As a policy of handling the surprising situation, having explicit configured / effective settings seems like a better option to me because 1. it makes it explicit that the effective configuration may differ from the requested one 2. it makes handling exception cases easier. I think #1 is important because hard errors which rarely but do happen are very difficult to deal with properly because it's usually nearly invisible. > > Sure, "which" is easier w/o co-mounting. Why can still be hard tho as > > migration is an "apply all the configs" event. > > Typically controllers don;'t control too many configs at once and the > specific return error could be a good hint there. Usually, yeah. I still end up scratching my head with migration rejections w/ cpuset or blkcg tho. 
> > > Also, per the previous point, since you need a notification channel > > > anyway, you might as well do the expected fail and report more details > > > through that. > > > > How do you match the failure to the specific migration attempt tho? I > > really can't think of a good and simple interface for that given the > > interface that we have. For most controllers, it is fairly straight > > forward to avoid controller specific migration failures. Sure, cpuset > > is special but it has to be special one way or the other. > > You can include in the msg with the pid that was just attempted in the > pid namespace of the observer, if the pid is not available in that > namespace discard the message since the observer could not possibly have > done the deed. I don't know. Is that a good interface? If a human admin is echoing and dmesg'ing afterwards, it should work but scraping the log for an unstructured plain text error usually isn't a very good interface to build tools around. For example, for CAT and its limit on the numbers of possible configurations, it can technically be made to work by reporting errors on mkdir or task migration; however, it is *far* better and clearer to report, say, -ENOSPC when you're actually trying to change the configuration. The error is directly tied to the operation requested. That's just how it should be whenever possible. > > It really needs a proper programmable interface which guarantees self > > access. I don't know what the exact form should be. It can be an > > extension to sched_setattr(), a new syscall or a pseudo filesystem > > scoped to the process. > > That's an entirely separate issue; and I don't see that solving the task > vs process issue at all. Hmm... I don't see it that way tho. In-process configuration is primarily something to be done by the process while cgroup management is to be done by external adminy entity. They are on different planes. Individual binaries accessing their own cgroups doesn't make a lot of sense and is actually broken. Likewise, external management entity meddling with individual threads of a process is at best cumbersome. It can be allowed but that's often not how it's useful. I really don't see why cgroup would be involved with per-thread settings. > Automation is nice and all, but RT is about providing determinism and > guarantees. Unless you morph into a full blown RT aware muddleware and > have all your RT apps communicate their requirements to it (ie. rewrite > them all) to it, this is a non starter. > > Given that the RR/FIFO APIs are not communicating enough and we need to > support them anyhow, human intervention it is. Yeah, I fully agree with you there. The issue is not that RT/FIFO requires explicit actions from userland but that they're currently tied to BE scheduling. Conceptually, they don't have to be but they're in practice and that ends up requiring whoever, be that an admin or automated tool, is managing the BE grouping to also manage RT/FIFO slices, which isn't ideal but should be workable. I was mostly curious whether they can be separated with a reasonable amount of effort. That's a no, right? > > Oh, seriously, if I could build this thing from ground up, I'd just > > tie it to process hierarchy and make the associations static. > > This thing being cgroups? I'm not sure static associations cater for the > various use cases that people have. 
Sure, we have no chance of changing it at this point, but I'm pretty sure if we started by tying it to the process hierarchy, we and the userland would have been able to achieve about the same set of functionalities without all these migration business. > > It's > > just that we can't do that at this point and I'm trying to find a > > behaviorally simple and acceptable way to deal with task migrations so > > that neither kernel or userland has to be too complex. > > Sure simple and consistent is all good, but we should also not make it > too simple and thereby exclude useful things. What are we excluding tho? Previously, cgroup didn't have rules, policies or conventions. It just had this skeletal features to group tasks and every controller did its own thing diverging the way they treat hierarchies, errors, migrations, configurations, notifications and so on. It didn't put in the effort to actually identify the required functionalities or characterize what belongs where. Every controller was doing its own brownian motion in the design space. Most of the properties being identified and policies being set up are actually fundamental and inherent. e.g. Creating a subhierarchy and organizing the children in them is fundamentally a task sub-categorizing operation. Conceptually, doing so shouldn't be impeded by or affect the resource configured for the parent of that sub hierarchy and for most controllers this can be achieved in a straight-forward manner by making children not putting further restrictions on the resources from its parent on creation. This is a rule which should be inherent and this type of conventions ultimately lead to better designs and implementations. I think this is evident for the controller in question being discussed on this thread. Task organization - creating cgroups and moving tasks around tasks between them - is an inherently different operation from configuring each controller. They shouldn't be conflated. It doesn't make any sense to fail creation of a cgroup or failing task migration later because controller can't be configured certain way. They should be orthogonal as much as possible. If there's restriction on controller configuration, that should be enforced on controller configuration. > > So, behaviors > > which blow configs across migrations and consider them as "fresh" is > > completely fine by me. > > Its not by me, its completely surprising and counter intuitive. I don't get it. This is one of few cases where controller is distributing hard-walled resources and as you said userland intervention is a must in facilitating such distribution. Isn't this pretty well in line with what you've been saying? The admin is moving a RT / deadline task into a different scheduling domain and if such operation always requires setting scheduling policies again, what's surprising about it? It makes conceptual sense - the task is moving across two scheduling domains with different set of hard resources. It'd work well and reliably too in practice and userland only has one less vector of failure while achieving the same thing. > > I mostly wanna avoid requiring complicated > > failure handling from the users which most likely won't be tested a > > lot and crap out when something exceptional happens. > > Smells like you just want to pretend nothing bad happens when you do > stupid. I prefer to fail early and fail hard over pretend happy and > surprise behaviour any day. But where am I losing anything? 
I'm not saying everything is always better this way but if I look at the overall compromises, it seems like a clear win to me. > > This whole thing is really about having consistent behavior patterns > > which avoid obscure failure modes whenever possible. Unified > > hierarchy does build on top of those but we do want these > > consistencies regardless of that. > > I'm all for consistency, but I abhor make-believe. And while I like the > unified hierarchy thing conceptually, I'm by now fairly sure reality is > about to ruin it. Hmm... I get exactly the opposite feeling. A lot of fundamental properties are being identified and things mostly fall into place. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
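As for what the configured-versus-effective split argued above amounts to in practice, here is a toy model; it is not the cpuset code, and the names plus the fall-back-to-the-nearest-ancestor rule are paraphrased from the discussion:

#include <stdio.h>

/* Toy model: one bitmask per set, configured by the admin, effective as
 * computed by the kernel. */
struct cs {
	struct cs *parent;
	unsigned long configured;   /* what the admin asked for */
	unsigned long effective;    /* what can currently be honored */
};

/* Recompute 'effective' when the set of online CPUs changes: intersect the
 * configured mask with what is online and, if that leaves nothing, fall back
 * to the nearest ancestor that still has usable CPUs. The configured value
 * is never touched. Parents must be updated before their children. */
static void update_effective(struct cs *c, unsigned long online)
{
	unsigned long eff = c->configured & online;
	struct cs *p = c->parent;

	while (eff == 0 && p) {
		eff = p->effective;
		p = p->parent;
	}
	c->effective = eff;
}

int main(void)
{
	struct cs root  = { NULL,  0xf, 0xf };
	struct cs child = { &root, 0x8, 0x8 };   /* pinned to CPU 3 */

	update_effective(&root, 0x7);            /* CPU 3 goes offline */
	update_effective(&child, 0x7);
	printf("configured=%#lx effective=%#lx\n", child.configured, child.effective);

	update_effective(&root, 0xf);            /* CPU 3 comes back */
	update_effective(&child, 0xf);
	printf("configured=%#lx effective=%#lx\n", child.configured, child.effective);
	return 0;
}

Because the configured mask is never rewritten, the old placement comes back by itself once the CPUs return, which is the reversibility Tejun is after; whether the kernel should instead refuse, or at least loudly report, the intermediate state is exactly what is being argued here.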
* Re: Cache Allocation Technology Design 2014-10-31 15:58 ` Tejun Heo @ 2014-11-04 13:13 ` Peter Zijlstra 2014-11-05 20:41 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-11-04 13:13 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Thomas Gleixner On Fri, Oct 31, 2014 at 11:58:06AM -0400, Tejun Heo wrote: > > No real magic there. Except now people seem to want to wrap it into > > magic and hide it all from the admin, pretend it's not there and make it > > uncontrollable. > > Hmmm... I think a difference is how we perceive userspace is composed > and interacts with the various aspects of kernel. But even in the > presence of a competent admin that you're suggesting, interactions of > different aspects of a system are often compartmentalized. e.g. an > admin configuring cpuset to accomodate a given set of persistent and > important workload isn't too likely to expect a memory unit soft > failure in several weeks and the need to hot-swap the memory module. > It just isn't cost-effective enough to lump those two planes of > planning into the same activity especially if the admin is > hand-crafting the configuration. The issue that I see with the > current method is that a much rare exception condition ends up messing > up configurations which is on a different plane and that there's no > recourse once that happens. If the said workload keeps forking, > there's no easy way to recover the previous configuration. > > Both ways of handling the situation have components of surprise but as > I wrote before that surprise is inherent and comes from the fact that > the kernel can't afford tasks which aren't runnable. As a policy of > handling the surprising situation, having explicit configured / > effective settings seems like a better option to me because 1. it > makes it explicit that the effective configuration may differ from the > requested one 2. it makes handling exception cases easier. I think #1 > is important because hard errors which rarely but do happen are very > difficult to deal with properly because it's usually nearly invisible. So there are scenarios where you want to hard fail the machine if the constraints are not met. It's better to just give up than to pretend. This effective/requested split is policy, a hardcoded kernel policy. One that doesn't work for a number of cases. Fail and let userspace sort it out is a much safer option. Some people want hard guarantees; if you're not willing to cater to them with cgroups they'll go off and invent yet more muck :/ Do you want to shut down the saw, or pretend it's still controlled and lose your fingers because it missed a deadline? Even HPC might not want to pretend and continue, they might want to notify the jobs scheduler and get a different job split, rather than continue half-arsed. A persistent delay on the job completion barrier is way bad for them. > > Typically controllers don't control too many configs at once and the > > specific return error could be a good hint there. > > Usually, yeah. I still end up scratching my head with migration > rejections w/ cpuset or blkcg tho. This means you already need to deal with this, so how about we try and make that work instead of saying we cannot fail migration. 
> > You can include in the msg with the pid that was just attempted in the > > pid namespace of the observer, if the pid is not available in that > > namespace discard the message since the observer could not possibly have > > done the deed. > > I don't know. Is that a good interface? If a human admin is echoing > and dmesg'ing afterwards, it should work but scraping the log for an > unstructured plain text error usually isn't a very good interface to > build tools around. > > For example, for CAT and its limit on the numbers of possible > configurations, it can technically be made to work by reporting errors > on mkdir or task migration; however, it is *far* better and clearer to > report, say, -ENOSPC when you're actually trying to change the > configuration. The error is directly tied to the operation requested. > That's just how it should be whenever possible. I never suggested dmesg, I was thinking of a cgroup.notifier file that reports all 'events' for that cgroup. If you listen to it while performing your operation, you get the msgs: $ cat cgroup.notifier & echo $pid > tasks ; kill -INT $! Or something like that. Seeing how the entire cgroup thing is text based, this would end up spewing text like: $cgroup-path failed attach $pid: $reason Where everything is in the namespace of the observer; and if there is no namespace translation possible, drop the event, because you can't have seen or done anything anyhow. > > That's an entirely separate issue; and I don't see that solving the task > > vs process issue at all. > > Hmm... I don't see it that way tho. In-process configuration is > primarily something to be done by the process while cgroup management > is to be done by external adminy entity. They are on different > planes. Individual binaries accessing their own cgroups doesn't make > a lot of sense and is actually broken. Likewise, external management > entity meddling with individual threads of a process is at best > cumbersome. It can be allowed but that's often not how it's useful. > I really don't see why cgroup would be involved with per-thread > settings. Well, people are doing it now. And it 'works' if you assume nobody is going to do 'crazy' things behind your back, which is a fair assumption (most of the time). Its just that some people seem hell bend on doing crazy things behind your back in the name of progress or whatnot ;-) Take one would be making sure this background crap can be shot in the head. I'm not arguing against an atomic interface, I'm just saying its not required for useful things. > > Automation is nice and all, but RT is about providing determinism and > > guarantees. Unless you morph into a full blown RT aware muddleware and > > have all your RT apps communicate their requirements to it (ie. rewrite > > them all) to it, this is a non starter. > > > > Given that the RR/FIFO APIs are not communicating enough and we need to > > support them anyhow, human intervention it is. > > Yeah, I fully agree with you there. The issue is not that RT/FIFO > requires explicit actions from userland but that they're currently > tied to BE scheduling. Conceptually, they don't have to be but > they're in practice and that ends up requiring whoever, be that an > admin or automated tool, is managing the BE grouping to also manage > RT/FIFO slices, which isn't ideal but should be workable. I was > mostly curious whether they can be separated with a reasonable amount > of effort. That's a no, right? What's a BE? 
Separating them is technically possible (painful maybe), but doesn't make any kind of sense to me. > > > Oh, seriously, if I could build this thing from ground up, I'd just > > > tie it to process hierarchy and make the associations static. > > > > This thing being cgroups? I'm not sure static associations cater for the > > various use cases that people have. > > Sure, we have no chance of changing it at this point, but I'm pretty > sure if we started by tying it to the process hierarchy, we and the > userland would have been able to achieve about the same set of > functionalities without all these migration business. How would we do things like per-cgroup workqueues? We'd need to somehow spawn kthreads outside of the normal kthreadd hierarchy. (this btw is something we need to sort, but lets not have that discussion here -- this email is getting too big as is). > > Sure simple and consistent is all good, but we should also not make it > > too simple and thereby exclude useful things. > > What are we excluding tho? Hard guarantees it seems. > Previously, cgroup didn't have rules, > policies or conventions. It just had this skeletal features to group > tasks and every controller did its own thing diverging the way they > treat hierarchies, errors, migrations, configurations, notifications > and so on. It didn't put in the effort to actually identify the > required functionalities or characterize what belongs where. Every > controller was doing its own brownian motion in the design space. Sure, agreed, we need more sanity there. I do however think we need to put in the effort to map out all use cases. > Most of the properties being identified and policies being set up are > actually fundamental and inherent. e.g. Creating a subhierarchy and > organizing the children in them is fundamentally a task > sub-categorizing operation. > Conceptually, doing so shouldn't be > impeded by or affect the resource configured for the parent of that > sub hierarchy Uh what? No you want exactly that in a hierarchy. You want children to submit to the configuration of the parent. > and for most controllers this can be achieved in a > straight-forward manner by making children not putting further > restrictions on the resources from its parent on creation. The other way around, children can only put further restrictions on, they cannot relax restrictions from the parent. > I think this is evident for the controller in question being discussed > on this thread. Task organization - creating cgroups and moving tasks > around tasks between them - is an inherently different operation from > configuring each controller. They shouldn't be conflated. It doesn't > make any sense to fail creation of a cgroup or failing task migration > later because controller can't be configured certain way. They should > be orthogonal as much as possible. If there's restriction on > controller configuration, that should be enforced on controller > configuration. I'd mostly agree with that, but note how you put it in relative terms :-) I did give one (probably strained) example where putting the fail on the config side was more constrained than placing it at the migrate. > > > So, behaviors > > > which blow configs across migrations and consider them as "fresh" is > > > completely fine by me. > > > > Its not by me, its completely surprising and counter intuitive. > > I don't get it. 
This is one of few cases where controller is > distributing hard-walled resources and as you said userland > intervention is a must in facilitating such distribution. Isn't this > pretty well in line with what you've been saying? The admin is moving > a RT / deadline task into a different scheduling domain and if such > operation always requires setting scheduling policies again, what's > surprising about it? It would make cgroups useless. It would break running applications. You might as well not allow migration at all. But the very fact that migration would destroy configuration of an existing task would surprise me, I would -- like stated before -- much rather refuse the migration than destroy existing state. > It makes conceptual sense - the task is moving across two scheduling > domains with different set of hard resources. It'd work well and > reliably too in practice and userland only has one less vector of > failure while achieving the same thing. No its absolutely certified insane is what. It introduces a massive ton of fail. Tasks that were running fine and predictable are then all of a sudden a complete trainwreck. > > Smells like you just want to pretend nothing bad happens when you do > > stupid. I prefer to fail early and fail hard over pretend happy and > > surprise behaviour any day. > > But where am I losing anything? I'm not saying everything is always > better this way but if I look at the overall compromises, it seems > like a clear win to me. You allow the creation of fail and want to mop up the pieces afterwards -- if at all possible. I want to avoid the creation of fail. By allowing an effective config different from the requested -- be it either using less CPUs than specified, or a different scheduling policy or the forced use of remote memory, you could have lost your finger before you can fix up. Would it not be better to keep your finger? ^ permalink raw reply [flat|nested] 39+ messages in thread
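A consumer of the cgroup.notifier stream floated earlier in this exchange would indeed be small. The file and the message format are a proposal rather than an existing interface, so the following listener is a purely hypothetical sketch:

#include <stdio.h>

/* Hypothetical consumer of the proposed cgroup.notifier stream. Messages are
 * assumed to look like: "<cgroup-path> failed attach <pid>: <reason>". */
int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "cgroup.notifier";
	char cgpath[256], reason[256];
	int pid;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Block on the stream and react to attach failures as they happen,
	 * instead of scraping an unstructured log after the fact. */
	while (fscanf(f, "%255s failed attach %d: %255[^\n]", cgpath, &pid, reason) == 3)
		fprintf(stderr, "attach of %d to %s rejected: %s\n", pid, cgpath, reason);

	fclose(f);
	return 0;
}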
* Re: Cache Allocation Technology Design 2014-11-04 13:13 ` Peter Zijlstra @ 2014-11-05 20:41 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-11-05 20:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Thomas Gleixner Hello, Peter. On Tue, Nov 04, 2014 at 02:13:50PM +0100, Peter Zijlstra wrote: > So there are scenarios where you want to hard fail the machine if the > constraints are not met. Its better to just give up than to pretend. > > This effective/requested split is policy, a hardcoded kernel policy. One > that doesn't work for a number of cases. Fail and let userspace sort it > out is a much safer option. cpuset simply never implemented hard failing. the old policy wasn't a hard fail. It did the same thing as applying the effective setting. The only difference is that the process was irreversible. The kind of hard fail you're talking about would be rejecting CPU down command if downing a CPU would create a non-executable cpuset, which would be a silly conflation of layers. > Some people want hard guarantees, if you're not willing to cater to them > with cgroups they'll go off and invent yet more muck :/ > > Do you want to shut down the saw, or pretend its still controlled and > loose your fingers because it missed a deadline? > > Even HPC might not want to pretend continue, they might want to notify > the jobs scheduler and get a different job split, rather than continue > half-arsed. A persistent delay on the job completion barrier is way bad > for them. Again, we never had hard failures for cpuset. The old behavior was *more* surprising than the new one in that it was all implicit and the actions taken were out of ordinary (no other controller action moves tasks to other cgroups) and irreversible. I agree with your point that things should be as little surprising as possible but the facts you're using aren't in support of that point. One thing which is debatable is whether to allow configuring cpumasks which make the effective set empty. I don't think we fail that now but failing that is completely fine and doesn't create discrepancies with having configured and effective settings. > > > Typically controllers don;'t control too many configs at once and the > > > specific return error could be a good hint there. > > > > Usually, yeah. I still end up scratching my head with migration > > rejections w/ cpuset or blkcg tho. > > This means you already need to deal with this, so how about we try and > make that work instead of saying we cannot fail migration. My point is that failing these types of things at configuration time is a lot better approach. Everything sure is a trade-off but the benefits here seem pretty clear to me. > I never suggested dmesg, I was thinking of a cgroup.notifier file that > reports all 'events' for that cgroup. > > If you listen to it while performing your operation, you get the msgs: > > $ cat cgroup.notifier & echo $pid > tasks ; kill -INT $! > > Or something like that. Seeing how the entire cgroup thing is text > based, this would end up spewing text like: > > $cgroup-path failed attach $pid: $reason > > Where everything is in the namespace of the observer; and if there is > no namespace translation possible, drop the event, because you can't > have seen or done anything anyhow. 
Technically, we can do that or any number of other complex schemes but isn't it obviously better if we can confine controller configuration failures to actual configuration attempts? Simple -errno failures would be enough. > > Yeah, I fully agree with you there. The issue is not that RT/FIFO > > requires explicit actions from userland but that they're currently > > tied to BE scheduling. Conceptually, they don't have to be but > > they're in practice and that ends up requiring whoever, be that an > > admin or automated tool, is managing the BE grouping to also manage > > RT/FIFO slices, which isn't ideal but should be workable. I was > > mostly curious whether they can be separated with a reasonable amount > > of effort. That's a no, right? > > What's a BE? Separating them is technically possible (painful maybe), > but doesn't make any kind of sense to me. Oops, best effort. I was using a term from io scheduling. Sorry about that. I meant fair_sched_class. At least conceptually, the hierarchies of different scheduling classes are orthogonal, so I was wondering whether separating them out would be possible. If that's not practically feasible, I don't think it's a big problem. Userland would just have to adapt to it. > > Sure, we have no chance of changing it at this point, but I'm pretty > > sure if we started by tying it to the process hierarchy, we and the > > userland would have been able to achieve about the same set of > > functionalities without all these migration business. > > How would we do things like per-cgroup workqueues? We'd need to somehow > spawn kthreads outside of the normal kthreadd hierarchy. We can either have proxy kthreadd's or just reparent tasks once they're created. We already reparent after all. > (this btw is something we need to sort, but lets not have that > discussion here -- this email is getting too big as is). I don't think discussing this is meaningful. This train left a long time ago and I don't see any realistic chance of backtracking to this route. > Sure, agreed, we need more sanity there. I do however think we need to > put in the effort to map out all use cases. I've been doing that for over a year now. I haven't mapped out *all* use cases but I do have pretty clear ideas on what matters in achieving the core functionalities. > > Conceptually, doing so shouldn't be > > impeded by or affect the resource configured for the parent of that > > sub hierarchy > > Uh what? No you want exactly that in a hierarchy. You want children to > submit to the configuration of the parent. You misunderstood. Yes, children should submit to the configuration of the parent but the act of merely creating a new child or moving tasks there shouldn't deviate the configuration from what the parent has. Using CAT as an example, creating a child shouldn't create a new configuration. It should in effect have the same configuration as its parent. As such, moving tasks in there shouldn't fail as long as tasks can be moved to the parent, which is a property we want to maintain. This is really fundamental: the operation of sub-categorization shouldn't affect controller configuration. They should and can remain orthogonal. > > and for most controllers this can be achieved in a > > straight-forward manner by making children not putting further > > restrictions on the resources from its parent on creation. > > The other way around, children can only put further restrictions on, > they cannot relax restrictions from the parent. I meant on creation. 
Putting further restrictions is the only thing a child can do, but on creation it should have the same effective configuration as its parent.

> > I think this is evident for the controller in question being discussed on this thread. Task organization - creating cgroups and moving tasks between them - is an inherently different operation from configuring each controller. They shouldn't be conflated. It doesn't make any sense to fail creation of a cgroup or to fail task migration later because a controller can't be configured a certain way. They should be orthogonal as much as possible. If there's a restriction on controller configuration, that should be enforced on controller configuration.
>
> I'd mostly agree with that, but note how you put it in relative terms :-)

But everything is relative. The moment we lose sight of that, we lose the ability to make sensible and healthy trade-offs. I could have written the above in absolutes but I actively avoid that whenever possible.

> I did give one (probably strained) example where putting the fail on the config side was more constrained than placing it at the migrate.

If you're referring to cpuset, it wasn't a good example.

> > I don't get it. This is one of the few cases where a controller is distributing hard-walled resources and, as you said, userland intervention is a must in facilitating such distribution. Isn't this pretty well in line with what you've been saying? The admin is moving a RT / deadline task into a different scheduling domain and if such an operation always requires setting scheduling policies again, what's surprising about it?
>
> It would make cgroups useless. It would break running applications. You might as well not allow migration at all.

Task migrations will be a low-priority managerial operation. It's mostly used to set up the initial hierarchy. Tasks should be put in a logical structure on startup and resource control changes should happen through specific controller enable/disable and configuration changes. This is inherent in the unified hierarchy design and the reason why controllers are individually enabled and disabled at each level. Task categorization is an orthogonal operation to resource restriction. Tasks are logically organized and resource controls are dynamically configured over the logical structure. So, yes, the role of migration is diminished in the unified hierarchy and that's by design. We can't go full static process hierarchy at this point but this way we can get reasonably close while accommodating a gradual transition.

> But the very fact that migration would destroy the configuration of an existing task would surprise me, I would -- like stated before -- much rather refuse the migration than destroy existing state.

I suppose this depends on the perspective, but if the RT config is reliably reset on migration, I don't see why it'd be surprising. It's a well-defined behavior which happens without exception, and we already have a precedent for changing per-task settings according to a task's cgroup membership - cpuset overrides the cpu and node masks on migration.

> By allowing an effective config different from the requested -- be it using fewer CPUs than specified, a different scheduling policy or the forced use of remote memory, you could have lost your finger before you can fix up.

I don't get why you're lumping the cpuset and cpu situations together. They're different and cpu doesn't deal with any "effective" settings.
Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
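[Editorial note: to make the configuration-time-failure model described above concrete, here is a minimal shell sketch. The 'cat.cbm' file name, layout and exact errno are illustrative assumptions, not an existing interface; the point is only where a failure would surface.]

cd /sys/fs/cgroup/cat           # assumed CAT-style controller hierarchy

mkdir low_prio                  # never fails for lack of CLOS IDs
cat low_prio/cat.cbm            # child starts with the parent's mask, e.g. 0xffff

echo $$ > low_prio/tasks        # migration never fails for CAT reasons either

# Only an explicit configuration change can fail, and it fails with a plain
# errno at write time once the hardware runs out of CLOS IDs:
echo 0x00ff > low_prio/cat.cbm  # may return e.g. -ENOSPC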
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo 2014-10-30 13:18 ` Peter Zijlstra @ 2014-10-30 14:14 ` Matt Fleming [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> 2014-10-30 23:18 ` Vikas Shivappa 3 siblings, 0 replies; 39+ messages in thread From: Matt Fleming @ 2014-10-30 14:14 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 30 Oct, at 08:43:33AM, Tejun Heo wrote: > Hello, Peter. > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > If this means echo $tid > tasks, then sorry we can't do. There is a > > limited number of hardware resources backing this thing. At some point > > they're consumed and something must give. > > And that something shouldn't be disallowing task migration across > cgroups. This simply doesn't work with co-mounting or unified > hierarchy. cpuset automatically takes on the nearest ancestor's > configuration which has enough execution resources. Maybe that can be > an option for this too? Oh, you can always add more tasks to a cgroup, or move tasks between cgroups. What you can't always do is create more cgroups. -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> @ 2014-10-30 17:12 ` Tejun Heo 2014-10-30 22:35 ` Tim Hockin 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 17:12 UTC (permalink / raw) To: Tim Hockin Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
> Another reason unified hierarchy is a bad model.

Things wrong with this message:

1. Top posted. It isn't clear which part you're referring to, and this was pointed out to you multiple times in the past.

2. No real thoughts or technical details. Maybe you had some in your head but nothing was elaborated. This forces me to guess what you had in mind when you produced the above sentence and, of course, me not being you, this takes a considerable amount of brain cycles and I'd still end up with multiple alternative scenarios that I'll have to cover.

3. Needlessly loaded expression, which forces me to respond.

Combined, this is just rude and you've been showing this type of behavior multiple times. Behave yourself.

-- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 17:12 ` Tejun Heo @ 2014-10-30 22:35 ` Tim Hockin 2014-10-31 16:57 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Tim Hockin @ 2014-10-30 22:35 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

On Thu, Oct 30, 2014 at 10:12 AM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
>> Another reason unified hierarchy is a bad model.
>
> Things wrong with this message.
>
> 1. Top posted. It isn't clear which part you're referring to and this was pointed out to you multiple times in the past.

I occasionally fall victim to gmail's defaults. I apologize for that.

> 2. No real thoughts or technical details. Maybe you had some in your head but nothing was elaborated. This forces me to guess what you had in mind when you produced the above sentence and, of course, me not being you, this takes a considerable amount of brain cycles and I'd still end up with multiple alternative scenarios that I'll have to cover.

I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that hard. That said, it was overly snark-tastic, and sent in haste.

My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model. It must be managed more carefully than an arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary-hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy), will it fail when you add processes to it (awkward at best), or will it fail when you flip some control file to enable the feature?

I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something.

Tim ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:35 ` Tim Hockin @ 2014-10-31 16:57 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-31 16:57 UTC (permalink / raw) To: Tim Hockin Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

Hello, Tim.

On Thu, Oct 30, 2014 at 03:35:44PM -0700, Tim Hockin wrote:
> I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that

I really don't think it is. cgroups in general isn't that well understood, and while some may be familiar with what they've been working on, most aren't too well acquainted with what changes are being made and why. I surely am responsible for not communicating this better, but it took me quite a while and I'm still in the process of crystallizing those ideas myself.

> hard. That said, it was overly snark-tastic, and sent in haste.

The problem with this type of snarky one-liner is that it undermines the fundamentals of technical discussion on the mailing list. It requires too much effort from the other party for speculation, and if the other party doesn't respond, the snarky comment succeeds at establishing the vague negativity that it carried. If you have a technical opinion, form and communicate it properly so that it can be analyzed and discussed properly. I think my wording in my previous messages was too strong and apologize for that, but please don't do this.

> My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model.

I'm pretty sure that conclusion is premature. As I wrote in my reply to Peter, I strongly believe that a set of reasonable constraints and conventions leads to a much better and more functional design, interface and implementation. It sure can feel like an annoyance if one is used to doing whatever and now has to follow these new constraints, but we were paying heavily elsewhere for the lack of consistency and, in general, sense.

I could have communicated it more clearly, but the fundamental issue that I see with the original proposal is that it conflates task organization and controller configuration. They belong to different planes of control and should be orthogonal as much as possible. This shows up evidently, for example, in how errors are reported. A write to a knob of the involved controller failing with the proper error code is a far superior way compared to failing mkdir or task migration. The only reason we even think that doing anything else is fine is because we've never thought about what the right thing to do is all along and just did whatever was convenient in terms of immediate implementation for each individual case.

> It must be managed more carefully than arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary-hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy) or will it fail when you add processes to it (awkward at best) or will it fail when you flip some control file to enable the feature?

Please see above. It's more a matter of finding the *right* place to put operations and their failures.
Task migration sure can fail due to memory pressure or basic cgroup organizational constraints; however, it's outright wrong to fail it because a given controller can support only a limited number of configurations. Again, being able to do whatever one wants often doesn't lead to a good design.

> I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something.

If you properly compose your ideas and concerns, I can think about and discuss them and make adjustments where appropriate, and it seems to me that your impression, at least in this instance, isn't very well warranted. The snarky comment can achieve none of the productive things which can come from proper discussion. All it can do is aggravate the tone of the discussion, so, again, please refrain from it in the future.

Thanks.

-- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo ` (2 preceding siblings ...) [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> @ 2014-10-30 23:18 ` Vikas Shivappa 2014-11-04 13:17 ` Peter Zijlstra 3 siblings, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-30 23:18 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin

On Thu, 30 Oct 2014, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
>> If this means echo $tid > tasks, then sorry we can't do. There is a limited number of hardware resources backing this thing. At some point they're consumed and something must give.
>
> And that something shouldn't be disallowing task migration across cgroups. This simply doesn't work with co-mounting or unified hierarchy. cpuset automatically takes on the nearest ancestor's configuration which has enough execution resources. Maybe that can be an option for this too?

One way to do it is to merge the CAT cgroups into the cpuset. In essence there is no separate CAT cgroup and we just have a new file 'cbm' in the cpuset. This would be visible only when the system has Cache Allocation support, and the user can manipulate the cache bit mask there. The user can use the already existing cpu_exclusive file in the cpuset to mark the cgroups that should use exclusive CPUs. That way we simplify and reuse the cpuset code/hierarchy?

Thanks, Vikas

> One of the problems is that we generally assume that a task can run at some point in time in a lot of places in the kernel and can't just not run a task indefinitely because it's in a cgroup configured a certain way.
>
>> So either we fail mkdir, but that means allocating CLOS IDs for possibly empty cgroups, or we allocate on demand which means failing task assignment.
>
> Can't fail mkdir or css enabling either. Again, co-mounting and unified hierarchy. Also, the behavior is just horrible to use from userland.
>
>> The same -- albeit for a different reason -- is true of the RT sched groups, we simply cannot instantiate them such that tasks can join, sysads _have_ to configure them before we can add tasks to them.
>
> Yeah, RT is one of the main items which is problematic, more so because it's currently coupled with the normal sched controller and the default config doesn't have any RT slice. Do we completely block RT tasks w/o a slice? Is that okay?
>
> Thanks.
>
> --
> tejun
>

^ permalink raw reply [flat|nested] 39+ messages in thread
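[Editorial note: a rough shell sketch of what the interface proposed above could look like from userspace. The 'cpuset.cbm' file is only the proposal, not something that exists; a fuller example appears later in the thread.]

cd /sys/fs/cgroup/cpuset
mkdir hipri
/bin/echo 2-3 > hipri/cpuset.cpus
/bin/echo 0   > hipri/cpuset.mems
/bin/echo 1   > hipri/cpuset.cpu_exclusive   # reuse the existing exclusivity knob
/bin/echo 0xff00 > hipri/cpuset.cbm          # proposed new file: upper half of the cache
/bin/echo PID1 > hipri/tasks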
* Re: Cache Allocation Technology Design 2014-10-30 23:18 ` Vikas Shivappa @ 2014-11-04 13:17 ` Peter Zijlstra 2014-11-06 17:03 ` Matt Fleming 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-11-04 13:17 UTC (permalink / raw) To: Vikas Shivappa Cc: Tejun Heo, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin

On Thu, Oct 30, 2014 at 04:18:33PM -0700, Vikas Shivappa wrote:
> One way to do it is to merge the CAT cgroups into the cpuset. In essence there is no separate CAT cgroup and we just have a new file 'cbm' in the cpuset. This would be visible only when the system has Cache Allocation support, and the user can manipulate the cache bit mask there.
> The user can use the already existing cpu_exclusive file in the cpuset to mark the cgroups that should use exclusive CPUs.
> That way we simplify and reuse the cpuset code/hierarchy?

I don't like extending cpusets further. It's already weird and too big a controller.

What is wrong with having a specific CQM controller and using it together with cpusets where desired?

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-11-04 13:17 ` Peter Zijlstra @ 2014-11-06 17:03 ` Matt Fleming 2014-11-10 15:50 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-11-06 17:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Tejun Heo, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote: > > I don't like extending cpusets further. Its already a weird and too big > controller. > > What is wrong with having a specific CQM controller and using it > together with cpusets where desired? The specific problem that conflating cpusets and the CAT controller is trying to solve is that on some platforms the CLOS ID doesn't move with data that travels up the cache hierarchy, i.e. we lose the CLOS ID when data moves from LLC to L2. I think the idea with pinning CLOS IDs to a specific cpu and any tasks that are using that ID is that it works around this problem out of the box, rather than requiring sysadmins to configure things. -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-11-06 17:03 ` Matt Fleming @ 2014-11-10 15:50 ` Peter Zijlstra 0 siblings, 0 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-11-10 15:50 UTC (permalink / raw) To: Matt Fleming Cc: Vikas Shivappa, Tejun Heo, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin On Thu, Nov 06, 2014 at 05:03:23PM +0000, Matt Fleming wrote: > On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote: > > > > I don't like extending cpusets further. Its already a weird and too big > > controller. > > > > What is wrong with having a specific CQM controller and using it > > together with cpusets where desired? > > The specific problem that conflating cpusets and the CAT controller is > trying to solve is that on some platforms the CLOS ID doesn't move with > data that travels up the cache hierarchy, i.e. we lose the CLOS ID when > data moves from LLC to L2. > > I think the idea with pinning CLOS IDs to a specific cpu and any tasks > that are using that ID is that it works around this problem out of the > box, rather than requiring sysadmins to configure things. So either the user needs to set that mode _and_ set cpu masks, or the user needs to use cpusets and set masks, same difference to me. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-24 10:53 ` Peter Zijlstra 2014-10-28 23:22 ` Matt Fleming @ 2014-10-29 17:26 ` Vikas Shivappa 2014-10-29 18:16 ` Peter Zijlstra 1 sibling, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-29 17:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Matt Fleming, vikas, linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa On Fri, 24 Oct 2014, Peter Zijlstra wrote: > On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote: >>> What is Cache Allocation Technology ( CAT ) >>> ------------------------------------------- > > Its a horrible name is what it is, please consider using the old name, > that at least was clear in purpose. > >>> Kernel implementation Overview >>> ------------------------------- >>> >>> Kernel implements a cgroup subsystem to support Cache Allocation. >>> >>> Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each >>> cgroup would have one CBM and would just represent one cache 'subset'. >>> >>> The user would be allowed to create as many directories as there are >>> CLOSs defined by the h/w. If user tries to create more than the >>> available CLOSs , -ENOSPC is returned. Currently we support only one >>> level of directory, ie directory can be created only under the root. > > NAK, cgroups must support full hierarchies, simply enforce that the > child cgroup's mask is a subset of the parent's. > >>> There are 2 modes supported >>> >>> 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs >>> specified by the 'cpus' file. The tasks in the CAT cgroup would be >>> constrained only on the CPUs in the 'cpus' file. The CPUs in this file >>> are exclusively used for this cgroup. Requests by task >>> using the sched_setaffinity() would be filtered through the tasks >>> 'cpus'. > > NAK, we will not have yet another cgroup mucking about with task > affinities. > >>> These tasks would get to fill the LLC cache represented by the >>> cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as >>> the existing cpumask datastructure. >>> >>> 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be >>> for a group of tasks. There is no 'cpus' file and the CPUs that the >>> tasks run are not restricted by the CAT cgroup > > It appears to me this 'mode' thing is entirely superfluous and can be > constructed by voluntary operation of this and cpusets or manual > affinity calls. Do you mean user would would just user the cpusets for cpu affinity and CAT cgroup for cache allocation as shown in example below ? In other words say affinitize the PID1 and PID2 to CPUs 1 and 2 and then set the desired cache allocation as well like below - then we have the desired cpu affinity and cache allocation for these PIDs.. cd /sys/fs/cgroup/cpuset mkdir group1_specialuse /bin/echo 1-2 > cpuset.cpus /bin/echo PID1 > tasks /bin/echo PID2 > tasks Now come to CAT and do the cache allocation for the same tasks PID1 and PID2. cd /sys/fs/cgroup/cat (CAT cgroup) mkdir group1_specialuse (keeping same name just for understanding) /bin/echo 0xf > cat.cbm (set the cache bit mask) /bin/echo PID1 > tasks /bin/echo PID2 > tasks > >>> Assignment of CBM,CLOS and modes >>> --------------------------------- >>> >>> Root directory would have all bits in 'cbm' file by default. >>> >>> The cbm_max file in the root defines the maximum number of bits >>> describing the available cache units. Say if cbm_max is 16 then the >>> 'cbm' cannot have more than 16 bits. 
> > This seems redundant, if you've already stated that the root cbm is the > full set, there is no need to further provide this. > >>> The 'cbm' file is restricted to having no more than its cbm_max least >>> significant bits set. Any contiguous subset of these bits maybe set to >>> indication the cache mapping desired. The 'cbm' between 2 directories >>> can overlap. The 'cbm' would represent the cache 'subset' of the CAT >>> cgroup. > > This would follow from the hierarchy requirement/conditions. > >>> Scheduling and Context Switch >>> ------------------------------ > >>> In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file >>> indicate the tasks the cache subset is affinitized to. When user adds >>> tasks to the tasks file , the tasks would get to fill the cache subset >>> represented by the CAT cgroup's 'cbm' file. >>> >>> During context switch kernel implements this by writing the >>> corresponding CLOSid (internally maintained by kernel) of the CAT >>> cgroup to the CPU's IA32_PQR_ASSOC MSR. > > Right. > ^ permalink raw reply [flat|nested] 39+ messages in thread
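[Editorial note: as a concrete reading of the 'cbm' constraints quoted above (a contiguous mask no wider than cbm_max and, per the hierarchy requirement, a subset of the parent's mask), a validation helper might look like the C sketch below. Function names and the exact error codes are assumptions for illustration, not code from the posted patches.]

#include <errno.h>
#include <stdint.h>

/* Hypothetical check applied when userspace writes a new 'cbm' value. */
static int validate_cbm(uint64_t cbm, uint64_t parent_cbm, unsigned int cbm_max)
{
	uint64_t max_mask = (cbm_max >= 64) ? ~0ULL : (1ULL << cbm_max) - 1;
	uint64_t chunk;

	if (cbm == 0 || (cbm & ~max_mask))
		return -EINVAL;		/* empty mask or bits above cbm_max */

	/* Contiguity: shifting out trailing zeroes must leave 2^k - 1. */
	chunk = cbm >> __builtin_ctzll(cbm);
	if (chunk & (chunk + 1))
		return -EINVAL;		/* holes in the mask */

	if ((cbm & parent_cbm) != cbm)
		return -EPERM;		/* must be a subset of the parent's cbm */

	return 0;
}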
* Re: Cache Allocation Technology Design 2014-10-29 17:26 ` Vikas Shivappa @ 2014-10-29 18:16 ` Peter Zijlstra 0 siblings, 0 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 18:16 UTC (permalink / raw) To: Vikas Shivappa Cc: Matt Fleming, vikas, linux-kernel, matt.fleming, will.auld, tj On Wed, Oct 29, 2014 at 10:26:16AM -0700, Vikas Shivappa wrote: > >It appears to me this 'mode' thing is entirely superfluous and can be > >constructed by voluntary operation of this and cpusets or manual > >affinity calls. > > Do you mean user would would just user the cpusets for cpu affinity and CAT > cgroup for cache allocation as shown in example below ? > > In other words say affinitize the PID1 and PID2 to CPUs 1 and 2 > and then set the desired cache allocation as well like below - then we have > the desired cpu affinity and cache allocation for these PIDs.. > > cd /sys/fs/cgroup/cpuset > > mkdir group1_specialuse > /bin/echo 1-2 > cpuset.cpus > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > > Now come to CAT and do the cache allocation for the same tasks PID1 and > PID2. > > cd /sys/fs/cgroup/cat (CAT cgroup) > > mkdir group1_specialuse (keeping same name just for understanding) > /bin/echo 0xf > cat.cbm (set the cache bit mask) > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > Yah, except I have a strong urge to mount cpusets under /dog when you put it like that ;-) Or co-mount cpusets and pets and do it that way. ^ permalink raw reply [flat|nested] 39+ messages in thread
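[Editorial note: for reference, co-mounting two cgroup v1 controllers as suggested above would look roughly like the following; 'cacheqos' is only a stand-in name for the separate cache controller being discussed, not an existing controller.]

mkdir -p /sys/fs/cgroup/dog
mount -t cgroup -o cpuset,cacheqos none /sys/fs/cgroup/dog
cd /sys/fs/cgroup/dog
mkdir group1    # one directory now carries both the cpuset and the cache knobs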
* Re: Cache Allocation Technology Design 2014-10-16 18:44 Cache Allocation Technology Design vikas 2014-10-20 16:18 ` Matt Fleming @ 2014-11-03 23:29 ` Vikas Shivappa 1 sibling, 0 replies; 39+ messages in thread From: Vikas Shivappa @ 2014-11-03 23:29 UTC (permalink / raw) To: vikas Cc: linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa, hpa, tglx, mingo

Hello All,

Thanks for all the feedback so far. Below is the modified 'Kernel Implementation' section for review. The rest of the sections are the same as before apart from minor text changes reflecting the new implementation, so they can be skipped. Also adding Peter Anvin, Thomas Gleixner and Ingo Molnar for comments.

Kernel implementation Overview
-------------------------------

The kernel adds a file 'cbm' (cache bit mask) to the existing cpuset cgroup subsystem to support Cache Allocation. A CLOS (Class of Service) is represented by a CLOSid. The CLOSid is internal to the kernel and not exposed to the user. Each cgroup has one CBM and represents exactly one cache 'subset'.

The cgroup follows the normal cgroup hierarchy; mkdir and adding tasks to the cgroup never fail (these operations already exist in cpuset). When a child cgroup is created it inherits the CLOSid and the CBM from its parent. When a user changes the default CBM for a cgroup, a new CLOSid is allocated. Changing the 'cbm' may fail once the kernel runs out of the maximum number of CLOSids it can support. The tasks in the cgroup get to fill the portion of the LLC represented by the cgroup's 'cbm' file.

The user can use the existing 'cpu_exclusive' file in the cpuset cgroup to affinitize the tasks in a cgroup to an exclusive set of CPUs.

The root directory has all bits set in its 'cbm' file by default. Since all children inherit the parent's 'cbm', the feature effectively does not take effect until the user changes a cbm - in other words, the 'cbm' of every cgroup stays all 1s if the user never modifies any 'cbm' file, which means all tasks get to fill the entire cache and cache allocation is not in effect.

Assignment of CBM, CLOS
---------------------------------

The 'cbm' needs to be a subset of the parent node's 'cbm'. Any contiguous subset of these bits may be set to indicate the cache mapping desired. The 'cbm' of two directories can overlap. The 'cbm' represents the cache 'subset' of the CAT cgroup. For example, on a system with 16 max cbm bits, if a directory has the least significant 4 bits set in its 'cbm' file (i.e. the 'cbm' is 0xf), it is allocated the right quarter of the last level cache, which means the tasks belonging to this CAT cgroup can fill only that right quarter. If it has the most significant 8 bits set, it is allocated the left half of the cache (8 bits out of 16 represents 50%).

The cache portion defined in the CBM file is available for all tasks within the cgroup to fill, and these tasks are not allowed to allocate space in other parts of the cache.

Scheduling and Context Switch
------------------------------

During a context switch the kernel implements this by writing the CLOSid (internally maintained by the kernel) of the cgroup to which the task belongs into the CPU's IA32_PQR_ASSOC MSR.

Usage and Example
-----------------

With this patch the cpuset cgroup shows a new file, cpuset.cbm.

cd /sys/fs/cgroup/cpuset

Create 2 cpuset cgroups

mkdir group1
mkdir group2

Following are some of the files in the directory

ls
cpuset.cpus
cpuset.cpu_exclusive
cpuset.mems
cpuset.mem_exclusive
...
cpuset.cbm
...
Say if the cache is 2MB and cbm supports 16 bits, then setting the below allocates the 'right 1/4th(512KB)' of the cache to group2 Assign cpus and memory node to the group2. cd group2 /bin/echo 1-2 > cpuset.cpus /bin/echo 0 > cpuset.mems Make the CPUs exclusive for the cgroup /bin/echo 1 > cpuset.cpus_exclusive Edit the CBM for group2 to set the least significant 4 bits. This allocates 'right quarter' of the cache. /bin/echo 0xf > cpuset.cbm Change cpus in the directory. /bin/echo 1-4 > cpuset.cpus Edit the CBM for group2 to set the least significant 8 bits.This allocates the right half of the cache to 'group2'. cd group2 /bin/echo 0xff > cpuset.cbm Assign tasks to the group2 /bin/echo PID1 > tasks /bin/echo PID2 > tasks Meaning now threads PID1 and PID2 runs on CPUs 1-2 , and get to fill the 'right half' of the cache. Thanks, Vikas On Thu, 16 Oct 2014, vikas wrote: > Hi All , We have put together a draft design document for cache > allocation technology below. Please review the same and let us know any > feedback. > > Make sure you cc my email vikas.shivappa@linux.intel.com when replying > > Thanks, > Vikas > > What is Cache Allocation Technology ( CAT ) > ------------------------------------------- > > Cache Allocation Technology provides a way for the Software (OS/VMM) > to restrict cache allocation to a defined 'subset' of cache which may > be overlapping with other 'subsets'. This feature is used when > allocating a line in cache ie when pulling new data into the cache. > The programming of the h/w is done via programming MSRs. > > The different cache subsets are identified by CLOS identifier (class > of service) and each CLOS has a CBM (cache bit mask). The CBM is a > contiguous set of bits which defines the amount of cache resource that > is available for each 'subset'. > > Why is CAT (cache allocation technology) needed > ------------------------------------------------ > > The CAT enables more cache resources to be made available for higher > priority applications based on guidance from the execution > environment. > > The architecture also allows dynamically changing these subsets during > runtime to further optimize the performance of the higher priority > application with minimal degradation to the low priority app. > Additionally, resources can be rebalanced for system throughput > benefit. (Refer to Section 17.15 in the Intel SDM > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf) > > This technique may be useful in managing large computer systems which > large LLC. Examples may be large servers running instances of > webservers or database servers. In such complex systems, these subsets > can be used for more careful placing of the available cache > resources. > > The CAT kernel patch would provide a basic kernel framework for users > to be able to implement such cache subsets. > > > Kernel implementation Overview > ------------------------------- > > Kernel implements a cgroup subsystem to support Cache Allocation. > > Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each > cgroup would have one CBM and would just represent one cache 'subset'. > > The user would be allowed to create as many directories as there are > CLOSs defined by the h/w. If user tries to create more than the > available CLOSs , -ENOSPC is returned. Currently we support only one > level of directory, ie directory can be created only under the root. > > There are 2 modes supported > > 1. 
Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs > specified by the 'cpus' file. The tasks in the CAT cgroup would be > constrained only on the CPUs in the 'cpus' file. The CPUs in this file > are exclusively used for this cgroup. Requests by task > using the sched_setaffinity() would be filtered through the tasks > 'cpus'. > > These tasks would get to fill the LLC cache represented by the > cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as > the existing cpumask datastructure. > > 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be > for a group of tasks. There is no 'cpus' file and the CPUs that the > tasks run are not restricted by the CAT cgroup > > > Assignment of CBM,CLOS and modes > --------------------------------- > > Root directory would have all bits in 'cbm' file by default. > > The cbm_max file in the root defines the maximum number of bits > describing the available cache units. Say if cbm_max is 16 then the > 'cbm' cannot have more than 16 bits. > > The 'affinitized' file is either 0 or 1 which represent the two modes. > System would boot with affinitized mode and all CPUs would have all > bits in cbm set meaning all CPUs have 100% cache(effectively cache > allocation is not in effect). > > The 'cbm' file is restricted to having no more than its cbm_max least > significant bits set. Any contiguous subset of these bits maybe set to > indication the cache mapping desired. The 'cbm' between 2 directories > can overlap. The 'cbm' would represent the cache 'subset' of the CAT > cgroup. For ex: on a system with 16 bits of max cbm bits , if the > directory has the least significant 4 bits set in its 'cbm' file, it > would be allocated the right quarter of the Last level cache which > means the tasks belonging to this CAT cgroup can use the right quarter > of the cache to fill. If it has the most significant 8 bits set ,it > would be allocated the left half of the cache(8 bits out of 16 > represents 50%). > > The cache subset would be affinitized to a set of cpus in affinitized > mode. The CPUs to which this allocation is affinitized to is > represented by the 'cpus' file. The 'cpus' need to be mutually > exclusive from cpus of other directories. > > The cache portion defined in the CBM file is available to all tasks > within the CAT group and these task are not allowed to allocate space > in other parts of the cache. > > 'cbm' file is used in both modes where as the 'cpus' file is relevant > in affinitized mode and would disappear in non-affinitized mode. > > > Scheduling and Context Switch > ------------------------------ > > In affinitized mode , the cache 'subset' and the tasks in a CAT cgroup > are affinitized to the CPUs represented by the CAT cgroup's 'cpus' > file i.e when user sets the 'cbm' to 'portion' and 'cpus' to c and > 'tasks' to t, the tasks 't' would always be scheduled on cpus 'c' and > will get to fill in the allocated 'portion' in last level cache. > > As noted above ,in the affinitized mode the tasks in a CAT cgroup > would also be affinitized to the CPUs in the 'cpus' file of the > directory. Following hooks in the kernel are required to implement > this (on the lines of cpuset code) > - in sched_setaffinity to mask the requested cpu mask with what is > present in the task's 'cpus' > - in migrate_task to migrate the tasks only to those CPUs in the > 'cpus' file if possible. 
> - in select_task_rq > > In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file > indicate the tasks the cache subset is affinitized to. When user adds > tasks to the tasks file , the tasks would get to fill the cache subset > represented by the CAT cgroup's 'cbm' file. > > During context switch kernel implements this by writing the > corresponding CLOSid (internally maintained by kernel) of the CAT > cgroup to the CPU's IA32_PQR_ASSOC MSR. > > Usage and Example > ----------------- > > > Following would mount the cache allocation cgroup subsystem and create > 2 directories. Please refer to Documentation/cgroups/cgroups.txt on > details about how to use cgroups. > > cd /sys/fs/cgroup > mkdir cachealloc > mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc > cd cachealloc > > Create 2 cat cgroups > > mkdir group1 > mkdir group2 > > Following are some of the Files in the directory > > ls > cachea.cbm > cachea.cpus . cpus file only appears in the affinitized mode > cgroup.procs > tasks > cbm_max (root only) > affinitized (root only) . by default itsaffinitized mode > > Say if the cache is 2MB and cbm supports 16 bits, then setting the > below allocates the 'right 1/4th(512KB)' of the cache to group2 > > Edit the CBM for group2 to set the least significant 4 bits. This > allocates 'right quarter' of the cache. > > cd group2 > /bin/echo 0xf > cachealloc.cbm > > Change cpus in the directory. > > /bin/echo 1-4 > cachealloc.cpus > > Edit the CBM for group2 to set the least significant 8 bits.This > allocates the right half of the cache to 'group2'. > > cd group2 > /bin/echo 0xff > cachea.cbm > > Assign tasks to the group2 > > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > Meaning now threads > PID1 and PID2 runs on CPUs 1-4 , and get to fill the 'right half' of > the cache. The tasks PID1 and PID2 can only have a subset of the cpu > affinity defined in the 'cpus' file > > Edit the affinitized to 0.mode is changed in root directory cd .. > > /bin/echo 0 > cachealloc.affinitized > > Now the tasks and the cache allocation is not affinitized to the CPUs > and the task's cpu affinity is not restricted to being with the subset > of 'cpus' cpumask. > > > > > > > ^ permalink raw reply [flat|nested] 39+ messages in thread
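[Editorial note: to illustrate the context-switch step described in the revised section above, a minimal kernel-style C sketch of the MSR update might look like this. The helper and field names (task_closid, current_closid, cat_sched_in) are assumptions; only the MSR itself (IA32_PQR_ASSOC, 0xc8f, with the class of service in its upper 32 bits) comes from the SDM section referenced earlier in the thread.]

#include <linux/percpu.h>
#include <linux/sched.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

static DEFINE_PER_CPU(u32, current_closid);

/* Sketch of a hook run when 'next' is switched in (not the posted patch). */
static inline void cat_sched_in(struct task_struct *next)
{
	u32 closid = task_closid(next);	/* assumed lookup via the task's cgroup */

	if (closid == this_cpu_read(current_closid))
		return;			/* avoid a redundant, costly MSR write */

	this_cpu_write(current_closid, closid);
	/* RMID (low 32 bits) left at 0 here; the CLOS goes in the high 32 bits. */
	wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
}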