linux-kernel.vger.kernel.org archive mirror
* [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
@ 2015-10-02  6:09 Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 01/11] x86/intel_cqm: Modify hot cpu notification handling Fenghua Yu
                   ` (13 more replies)
  0 siblings, 14 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

This series has some preparatory patches and Intel cache allocation
support.

	Prep patches :

	These make some changes to the hot cpu handling code in the existing
cache monitoring and RAPL kernel code. They improve hot cpu notification
handling by not looping through all online cpus, which could be expensive
on large systems.

	Intel Cache allocation support:

	The cache allocation patches add a cgroup subsystem to support the
new Cache Allocation feature found in future Intel Xeon processors.
Cache Allocation is a sub-feature within the Resource Director
Technology (RDT) feature. The current patches support only L3 cache
allocation.

Cache Allocation provides a way for the software (OS/VMM) to restrict
cache allocation to a defined 'subset' of cache which may overlap
with other 'subsets'. This feature is used when a thread is allocating
a cache line, i.e. when pulling new data into the cache.

Threads are associated with a CLOS (Class of Service). The OS specifies
the CLOS of a thread by writing the IA32_PQR_ASSOC MSR during context
switch. The cache capacity associated with CLOS 'n' is specified by
writing to the IA32_L3_MASK_n MSR.
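
For illustration only, here is a minimal sketch of those two MSR writes
(the helper names and the standalone form are made up for this cover
letter; the actual kernel code is in the patches below; wrmsr/wrmsrl are
assumed from <asm/msr.h>):

	#define MSR_IA32_PQR_ASSOC	0x0c8f
	#define MSR_IA32_L3_MASK_BASE	0x0c90

	/* Program the capacity bitmask for class of service 'clos'. */
	static void set_clos_cbm(unsigned int clos, u64 cbm)
	{
		wrmsrl(MSR_IA32_L3_MASK_BASE + clos, cbm);
	}

	/*
	 * Associate the current CPU with 'clos'. The CLOSid lives in the
	 * upper 32 bits of IA32_PQR_ASSOC, the RMID in the lower bits.
	 */
	static void assoc_cpu_with_clos(unsigned int clos, u32 rmid)
	{
		wrmsr(MSR_IA32_PQR_ASSOC, rmid, clos);
	}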

More information about cache allocation can be found in the Intel SDM,
Volume 3, section 17.16. The SDM does not use the 'RDT' term yet; that
is planned to change at a later time.

	Why Cache Allocation ?

In today's processors the number of cores is continuously increasing,
which in turn increases the number of threads or workloads that can
be run simultaneously. When multi-threaded applications run
concurrently, they compete for shared resources including L3 cache. At
times, this L3 cache resource contention may result in inefficient space
utilization. For example, a higher-priority thread may end up with less
L3 cache resource, or a cache-sensitive app may not get optimal cache
occupancy, thereby degrading performance. The Cache Allocation kernel
patches provide a framework for sharing L3 cache so that users can
allocate the resource according to their requirements.

*All the patches apply on top of 4.3-rc2.

Changes in v15:
 - Add a global IPI to update the closid on CPUs for current tasks.
 - Other minor changes: removed the code that set the clos_cbm_table
 entries to all 1s during init.
 - Fix a few compilation warnings.
 - Port the patches to 4.3-rc.

Changes in v12:
 - Based on Matt's feedback, replaced the function-scope static cpumask_t
 tmp with a file-scope static cpumask_t tmp_cpumask. This is a
 temporary mask used during handling of hot cpu notifications in the
 cqm/rapl and rdt code. Although all the usage was serialized by hot
 cpu locking, this makes it more readable.

Changes in V11:  As per feedback from Thomas and discussions:

  - Removed cpumask_any_online_but. Its usage could easily be replaced by
  'and'ing with the cpu_online mask during hot cpu notifications. Thomas
  pointed out that the API had an issue where its tmp mask wasn't thread
  safe. I also realized the support it intends to give does not match the
  other helpers in cpumask.h.
  - The cqm patch which added a mutex to hot cpu notification had been
  merged with the cqm hot plug patch to improve notification handling
  without a commit log and wasn't correct. Separated them; this series
  sends just the cqm hot plug patch, and the mutex cqm patch will be
  sent separately.
  - Fixed issues in the hot cpu rdt handling. Since cpu_starting was
  replaced with cpu_online, the wrmsr now needs to actually be
  scheduled on the target cpu - which the previous patch wasn't doing.
  Replaced cpu_dead with cpu_down_prepare. cpu_down_failed is handled
  the same way as cpu_online. By waiting until cpu_dead to update
  rdt_cpumask, we may miss some of the msr updates.

Changes in V10:

- Changed the hot cpu notifications we handle in cqm and cache allocation
  to cpu_online and cpu_dead and removed the others, as the
  cpu_*_prepare notifications also have corresponding cancel notifications
  which we did not handle.
- Changed the file in the rdt cgroup to l3_cache_mask to indicate that it
  is for the l3 cache.

Changes as per Thomas and PeterZ feedback:
- Fixed the cpumask declarations in cpumask.h and in the rdt, cmt and rapl
  code to be static so that they don't burden stack space when large.
- Removed the mutex in cpu_starting notifications; replaced the locking
  with cpu_online.
- Changed the name from hsw_probetest to cache_alloc_hsw_probe.
- Changed x86_rdt_max_closid to x86_cache_max_closid and
  x86_rdt_max_cbm_len to x86_cache_max_cbm_len as they are only related
  to cache allocation and not to all of rdt.

Changes in V9:
Changes made as per Thomas feedback:
- Added a comment where we call the schedule-in code only when RDT is
  enabled.
- Reordered the local declarations to follow the convention in
  intel_cqm_xchg_rmid.

Changes in V8: Thanks to feedback from Thomas; the following changes are
made based on his feedback:

Generic changes/Preparatory patches:
- Added a new cpumask_any_online_but which returns the next
core sibling that is online.
- Made changes in the Intel Cache monitoring and Intel RAPL (Running
Average Power Limit) code to use the new function above to find the next
cpu that can be a designated reader for the package. Also changed the way
the package masks are computed, which can be simplified using
topology_core_cpumask.

Cache allocation specific changes:
- Moved the documentation to the beginning of the patch series.
- Added more documentation for the rdt cgroup files.
- Changed the dmesg output when cache alloc is enabled to be more helpful
and updated a few other comments to be more readable.
- Removed the __ prefix from functions like clos_get which were not
following convention.
- Added code to take action on a WARN_ON in clos_put. Made a few other
changes to reduce code text.
- Updated the comments for the call to rdt_css_alloc and the data
structures to a more readable/kernel-doc format.
- Removed cgroup_init.
- Changed the names of functions to only have the intel_ prefix for
external APIs.
- Replaced (void *)&closid with (void *)closid when calling
on_each_cpu_mask.
- Fixed the reference release of closid during cache bitmask write.
- Changed the code to not ignore a cache mask which has bits set outside
of the max bits allowed. It returns an error instead.
- Replaced bitmap_set(&max_mask, 0, max_cbm_len) with max_mask =
(1ULL << max_cbm) - 1.
- Update the rdt_cpumask, which has one cpu for each package, using
topology_core_cpumask instead of looping through the existing
rdt_cpumask. Realized the topology_core_cpumask name is misleading and
it actually returns the cores in a cpu package!
- Arranged the code better to keep code relating to similar tasks
together.
- Improved searching for the next online cpu sibling and maintaining the
rdt_cpumask which has one cpu per package.
- Removed the unnecessary wrapper rdt_enabled.
- Removed the unnecessary spin lock and rcu lock in the scheduling code.
- Merged all scheduling code into one patch, not separating the RDT
common software cache code.

Changes in V7: Based on feedback from PeterZ and Matt and following
discussions:
- Changed a lot of naming to reflect the data structures which are common
to RDT and those specific to Cache allocation.
- Removed all usage of 'cat'; replaced with the more friendly 'cache
allocation'.
- Fixed a lot of convention issues (whitespace, return paradigm etc).
- Changed the scheduling hook for RDT to not use an inline.
- Removed adding a new scheduling hook and just reused the existing one,
similar to the perf hook.

Changes in V6:
- Rebased to 4.1-rc1 which has the CMT (cache monitoring) support included.
- (Thanks to Marcelo's feedback) Fixed support for hot cpu handling for the
IA32_L3_QOS MSRs. Although during deep C states the MSRs need not be
restored, this is needed when a new package is physically added.
- Some other coding convention changes, including renaming to cache_mask and
using a refcnt to track the number of cgroups using a closid in the
clos_cbm map.
- 1-bit cbm support for non-hsw SKUs. HSW is an exception which needs the
cache bit masks to be at least 2 bits.

Changes in v5:
- Added support to propagate the cache bit mask update for each
  package.
- Removed the cache bit mask reference in the intel_rdt structure as
  there was no need for that and we already maintain a separate
  closid<->cbm mapping.
- Made a few coding convention changes which include adding the
  assertion while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed the naming of the cgroup to rdt (resource director technology)
  from cat (cache allocation technology). This was done as RDT is the
  umbrella term for platform shared resource allocation. Hence in
  future it would be easier to add resource allocation to the same
  cgroup.
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address
  a lot of questions from academia and industry regarding
  cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR.
- Implements support for hsw Cache Allocation enumeration. This does not
  use brand strings like the earlier version but does a probe test. The
  probe test is done only on the hsw family of processors.
- Made a few coding convention and name changes.
- Check for the lock being held when ClosID manipulation happens.

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include them later as
  a separate patch.
- Fixed the code in prep_arch_switch to be specific for x86 and removed
  x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.
- Changed some of the manual bitmap manipulation to use the predefined
  bitmap APIs to make the code more readable.
- Changed the name in sources from cqe to cat.
- Global cat enable flag changed to a static_key and disabled cgroup
  early_init.

Fenghua Yu (11):
  x86/intel_cqm: Modify hot cpu notification handling
  x86/intel_rapl: Modify hot cpu notification handling
  x86/intel_rdt: Cache Allocation documentation
  x86/intel_rdt: Add support for Cache Allocation detection
  x86/intel_rdt: Add Class of service management
  x86/intel_rdt: Add L3 cache capacity bitmask management
  x86/intel_rdt: Implement scheduling support for Intel RDT
  x86/intel_rdt: Hot cpu support for Cache Allocation
  x86/intel_rdt: Intel haswell Cache Allocation enumeration
  x86,cgroup/intel_rdt : Add intel_rdt cgroup documentation
  x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache
    allocation

 Documentation/cgroups/rdt.txt               | 133 +++++++
 Documentation/x86/intel_rdt.txt             | 109 ++++++
 arch/x86/include/asm/cpufeature.h           |   6 +-
 arch/x86/include/asm/intel_rdt.h            |  79 ++++
 arch/x86/include/asm/pqr_common.h           |  27 ++
 arch/x86/include/asm/processor.h            |   3 +
 arch/x86/kernel/cpu/Makefile                |   1 +
 arch/x86/kernel/cpu/common.c                |  15 +
 arch/x86/kernel/cpu/intel_rdt.c             | 585 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c  |  60 +--
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |  35 +-
 arch/x86/kernel/process_64.c                |   6 +
 include/linux/cgroup_subsys.h               |   4 +
 init/Kconfig                                |  12 +
 14 files changed, 1016 insertions(+), 59 deletions(-)
 create mode 100644 Documentation/cgroups/rdt.txt
 create mode 100644 Documentation/x86/intel_rdt.txt
 create mode 100644 arch/x86/include/asm/intel_rdt.h
 create mode 100644 arch/x86/include/asm/pqr_common.h
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

-- 
1.8.1.2


* [PATCH V15 01/11] x86/intel_cqm: Modify hot cpu notification handling
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 02/11] x86/intel_rapl: " Fenghua Yu
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

 - In cqm_pick_event_reader, use the existing package<->core map instead
 of looping through all cpus in cqm_cpumask.

 - In intel_cqm_cpu_exit, use the same map instead of looping through
 all online cpus. In large systems with a large number of cpus the time
 taken to loop may be expensive and also increases linearly.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 +++++++++++++++---------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 377e8f8..93e54ad 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -62,6 +62,12 @@ static LIST_HEAD(cache_groups);
  */
 static cpumask_t cqm_cpumask;
 
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
+
 #define RMID_VAL_ERROR		(1ULL << 63)
 #define RMID_VAL_UNAVAIL	(1ULL << 62)
 
@@ -1244,15 +1250,13 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	cpumask_and(&tmp_cpumask, &cqm_cpumask, topology_core_cpumask(cpu));
 
-	for_each_cpu(i, &cqm_cpumask) {
-		if (phys_id == topology_physical_package_id(i))
-			return;	/* already got reader for this socket */
-	}
-
-	cpumask_set_cpu(cpu, &cqm_cpumask);
+	/*
+	 * Pick a reader if there isn't one already.
+	 */
+	if (cpumask_empty(&tmp_cpumask))
+		cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,7 +1274,6 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
 	int i;
 
 	/*
@@ -1279,15 +1282,12 @@ static void intel_cqm_cpu_exit(unsigned int cpu)
 	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
 		return;
 
-	for_each_online_cpu(i) {
-		if (i == cpu)
-			continue;
+	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+	cpumask_clear_cpu(cpu, &tmp_cpumask);
+	i = cpumask_any(&tmp_cpumask);
 
-		if (phys_id == topology_physical_package_id(i)) {
-			cpumask_set_cpu(i, &cqm_cpumask);
-			break;
-		}
-	}
+	if (i < nr_cpu_ids)
+		cpumask_set_cpu(i, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,
-- 
1.8.1.2


* [PATCH V15 02/11] x86/intel_rapl: Modify hot cpu notification handling
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 01/11] x86/intel_cqm: Modify hot cpu notification handling Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 03/11] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

 - In rapl_cpu_init, use the existing package<->core map instead of
 looping through all cpus in rapl_cpumask.

 - In rapl_cpu_exit, use the same mapping instead of looping through all
 online cpus. In large systems with a large number of cpus the time taken
 to loop may be expensive and also increases linearly.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_rapl.c | 35 ++++++++++++++---------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index 81431c0..eeec6b8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -136,6 +136,12 @@ static struct pmu rapl_pmu_class;
 static cpumask_t rapl_cpu_mask;
 static int rapl_cntr_mask;
 
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
+
 static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
 static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_to_free);
 
@@ -539,18 +545,16 @@ static struct pmu rapl_pmu_class = {
 static void rapl_cpu_exit(int cpu)
 {
 	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
-	int i, phys_id = topology_physical_package_id(cpu);
 	int target = -1;
+	int i;
 
 	/* find a new cpu on same package */
-	for_each_online_cpu(i) {
-		if (i == cpu)
-			continue;
-		if (phys_id == topology_physical_package_id(i)) {
-			target = i;
-			break;
-		}
-	}
+	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+	cpumask_clear_cpu(cpu, &tmp_cpumask);
+	i = cpumask_any(&tmp_cpumask);
+	if (i < nr_cpu_ids)
+		target = i;
+
 	/*
 	 * clear cpu from cpumask
 	 * if was set in cpumask and still some cpu on package,
@@ -572,15 +576,10 @@ static void rapl_cpu_exit(int cpu)
 
 static void rapl_cpu_init(int cpu)
 {
-	int i, phys_id = topology_physical_package_id(cpu);
-
-	/* check if phys_is is already covered */
-	for_each_cpu(i, &rapl_cpu_mask) {
-		if (phys_id == topology_physical_package_id(i))
-			return;
-	}
-	/* was not found, so add it */
-	cpumask_set_cpu(cpu, &rapl_cpu_mask);
+	/* check if cpu's package is already covered. If not, add it. */
+	cpumask_and(&tmp_cpumask, &rapl_cpu_mask, topology_core_cpumask(cpu));
+	if (cpumask_empty(&tmp_cpumask))
+		cpumask_set_cpu(cpu, &rapl_cpu_mask);
 }
 
 static __init void rapl_hsw_server_quirk(void)
-- 
1.8.1.2


* [PATCH V15 03/11] x86/intel_rdt: Cache Allocation documentation
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 01/11] x86/intel_cqm: Modify hot cpu notification handling Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 02/11] x86/intel_rapl: " Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

Adds a description of Cache Allocation Technology and an overview of the
kernel framework implementation. The framework has APIs to manage class
of service, the capacity bitmask (CBM), scheduling support and other
architecture specific implementation details. The APIs are used to build
the cgroup interface in later patches.

Cache allocation is a sub-feature of Resource Director Technology (RDT)
or Platform Shared resource control which provides support to control
platform shared resources like the L3 cache.

Cache Allocation Technology provides a way for the software (OS/VMM) to
restrict cache allocation to a defined 'subset' of cache which may
overlap with other 'subsets'. This feature is used when allocating a
line in cache, i.e. when pulling new data into the cache. Tasks are
grouped into a CLOS (class of service). The OS uses MSR writes to
indicate the CLOSid of the thread when scheduling in and to indicate the
cache capacity associated with the CLOSid. Currently cache allocation is
supported for the L3 cache.

More information can be found in the Intel SDM June 2015, Volume 3,
section 17.16.
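
As a concrete, made-up example of overlapping 'subsets' on a part with a
20-bit CBM (the values below are purely illustrative and are not taken
from the SDM or from this patch):

	/*
	 * Illustrative capacity bitmasks; value n would be written to
	 * IA32_L3_MASK_n.
	 */
	static const unsigned int example_cbm[] = {
		0xfffff,	/* CLOS0: may fill anywhere in the cache */
		0x0000f,	/* CLOS1: confined to the 4 low ways */
		0x3fff0,	/* CLOS2: overlaps CLOS0 but not CLOS1 */
	};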

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 Documentation/x86/intel_rdt.txt | 109 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt.txt

diff --git a/Documentation/x86/intel_rdt.txt b/Documentation/x86/intel_rdt.txt
new file mode 100644
index 0000000..05ec819
--- /dev/null
+++ b/Documentation/x86/intel_rdt.txt
@@ -0,0 +1,109 @@
+        Intel RDT
+        ---------
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@linux.intel.com
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+  1.1 What is RDT and Cache allocation ?
+  1.2 Why is Cache allocation needed ?
+  1.3 Cache allocation implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+
+1. Cache Allocation Technology
+===================================
+
+1.1 What is RDT and Cache allocation
+------------------------------------
+
+Cache allocation is a sub-feature of Resource Director Technology (RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache. Currently L3 Cache is
+the only resource that is supported in RDT. More information can be
+found in the Intel SDM June 2015, Volume 3, section 17.16.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM) to
+restrict cache allocation to a defined 'subset' of cache which may be
+overlapping with other 'subsets'. This feature is used when allocating a
+line in cache ie when pulling new data into the cache. The programming
+of the h/w is done via programming MSRs.
+
+The different cache subsets are identified by CLOS identifier (class of
+service) and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In today's processors the number of cores is continuously increasing,
+especially in large scale usage models where VMs are used, like
+webservers and datacenters. The number of cores increases the number of
+threads or workloads that can be run simultaneously. When
+multi-threaded applications, VMs and workloads run concurrently they
+compete for shared resources including L3 cache.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer server systems
+with large L3 cache, in the cloud and container context. Examples may be
+large servers running instances of webservers or database servers. In
+such complex systems, these subsets can be used for more careful placing
+of the available cache resources by a centralized root accessible
+interface.
+
+A specific use case may be to solve the noisy neighbour issue when an app
+which is constantly copying data, like a streaming app, is using a large
+amount of cache which could otherwise have been used by a high priority
+computing application. Using the cache allocation feature, the streaming
+application can be confined to use a smaller cache and the high priority
+application be awarded a larger amount of cache space.
+
+1.3 Cache allocation implementation Overview
+--------------------------------------------
+
+Kernel has a new field in the task_struct called 'closid' which
+represents the Class of service ID of the task.
+
+There is a 1:1 CLOSid <-> CBM (capacity bit mask) mapping. A CLOS (Class
+of service) is represented by a CLOSid. Each closid would have one CBM
+and would just represent one cache 'subset'.  The tasks would get to
+fill the L3 cache represented by the capacity bit mask or CBM.
+
+The APIs to manage the closid and CBM can be used to develop user
+interfaces.
+
+1.4 Assignment of CBM, CLOS
+---------------------------
+
+The framework provides APIs to manage the closid and CBM which can be
+used to develop user/kernel mode interfaces.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During context switch kernel implements this by writing the CLOSid of
+the task to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when
+there is a change in the CLOSid for the CPU in order to minimize the
+latency incurred during context switch.
+
+The following considerations are done for the PQR MSR write so that it
+has minimal impact on scheduling hot path:
+ - This path doesn't exist on any non-intel platforms.
+ - On Intel platforms, this would not exist by default unless INTEL_RDT
+ is enabled.
+ - remains a no-op when INTEL_RDT is enabled and intel hardware does
+ not support the feature.
+ - When feature is available, does not do MSR write till the user
+ starts using the feature *and* assigns a new cache capacity mask.
+ - per cpu PQR values are cached and the MSR write is only done when
+ there is a task with different PQR is scheduled on the CPU. Typically
+ if the task groups are bound to be scheduled on a set of CPUs, the
+ number of MSR writes is greatly reduced.
-- 
1.8.1.2


* [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (2 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 03/11] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-11-04 14:51   ` Luiz Capitulino
  2015-10-02  6:09 ` [PATCH V15 05/11] x86/intel_rdt: Add Class of service management Fenghua Yu
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

This patch adds CPUID enumeration routines for Cache Allocation and new
fields to track the resource limits in the cpuinfo_x86 structure.

Cache allocation provides a way for the software (OS/VMM) to restrict
cache allocation to a defined 'subset' of cache which may overlap
with other 'subsets'. This feature is used when allocating a line in
cache, i.e. when pulling new data into the cache. The programming of the
hardware is done via programming MSRs (model specific registers).
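
For reference, a minimal user-space sketch of the same enumeration
(assuming GCC's <cpuid.h>; the sub-leaf field layout follows the SDM
description of leaf 0x10 and is not taken from this patch):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 0x10, sub-leaf 0: EBX bit 1 enumerates L3 cache allocation */
		if (!__get_cpuid_count(0x10, 0, &eax, &ebx, &ecx, &edx) ||
		    !(ebx & (1 << 1))) {
			printf("L3 cache allocation not enumerated\n");
			return 0;
		}

		/* Leaf 0x10, sub-leaf 1: EAX[4:0]+1 = cbm length, EDX[15:0]+1 = CLOSids */
		__get_cpuid_count(0x10, 1, &eax, &ebx, &ecx, &edx);
		printf("max cbm length: %u\n", (eax & 0x1f) + 1);
		printf("max closids:    %u\n", (edx & 0xffff) + 1);

		return 0;
	}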

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/cpufeature.h |  6 +++++-
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/kernel/cpu/Makefile      |  1 +
 arch/x86/kernel/cpu/common.c      | 15 +++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c   | 40 +++++++++++++++++++++++++++++++++++++++
 init/Kconfig                      | 12 ++++++++++++
 6 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e6cf2ad..4e93006 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -231,6 +231,7 @@
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
 #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT		( 9*32+15) /* Resource Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
 #define X86_FEATURE_ADX		( 9*32+19) /* The ADCX and ADOX instructions */
@@ -255,6 +256,9 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3	(13*32 + 1) /* Cache Allocation L3 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 19577dd..b932ec4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -120,6 +120,9 @@ struct cpuinfo_x86 {
 	int			x86_cache_occ_scale;	/* scale to bytes */
 	int			x86_power;
 	unsigned long		loops_per_jiffy;
+	/* Cache Allocation values: */
+	u16			x86_cache_max_cbm_len;
+	u16			x86_cache_max_closid;
 	/* cpuid returned max cores value: */
 	u16			 x86_max_cores;
 	u16			apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4eb065c..5e873c7 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_msr.o
 obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_msr.o
 endif
 
+obj-$(CONFIG_INTEL_RDT)			+= intel_rdt.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index de22ea7..026a416 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -654,6 +654,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		}
 	}
 
+	/* Additional Intel-defined flags: level 0x00000010 */
+	if (c->cpuid_level >= 0x00000010) {
+		u32 eax, ebx, ecx, edx;
+
+		cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[13] = ebx;
+
+		if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_cache_max_closid = edx + 1;
+			c->x86_cache_max_cbm_len = eax + 1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	xlvl = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..f49e970
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,40 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2015-05-25 Written by
+ *    Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2015, volume 3, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static int __init intel_rdt_late_init(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	if (!cpu_has(c, X86_FEATURE_CAT_L3))
+		return -ENODEV;
+
+	pr_info("Intel cache allocation detected\n");
+
+	return 0;
+}
+
+late_initcall(intel_rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..9fe3f11 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -938,6 +938,18 @@ menuconfig CGROUPS
 
 	  Say N if unsure.
 
+config INTEL_RDT
+	bool "Intel Resource Director Technology support"
+	depends on X86_64 && CPU_SUP_INTEL
+	help
+	  This option provides support for Cache allocation which is a
+	  sub-feature of Intel Resource Director  Technology(RDT).
+	  Current implementation supports L3 cache allocation.
+	  Using this feature a user can specify the amount of L3 cache space
+	  into which an application can fill.
+
+	  Say N if unsure.
+
 if CGROUPS
 
 config CGROUP_DEBUG
-- 
1.8.1.2


* [PATCH V15 05/11] x86/intel_rdt: Add Class of service management
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (3 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-11-04 14:55   ` Luiz Capitulino
  2015-10-02  6:09 ` [PATCH V15 06/11] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

Adds some data structures and APIs to support Class of service
(closid) management. There is a new clos_cbm table which keeps a 1:1
mapping between closid and capacity bit mask (cbm)
and a count of usage of each closid. Each task is associated with one
closid at a time, and this patch adds a new field, closid, to task_struct
to keep track of it.
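
To show how these APIs fit together, here is a sketch of a hypothetical
caller living next to the helpers in intel_rdt.c (the real callers only
arrive with the cgroup interface patch later in the series):

	/* Hypothetical caller; the APIs expect rdt_group_mutex to be held. */
	static int example_attach_to_new_clos(struct task_struct *tsk)
	{
		u32 closid;
		int err;

		mutex_lock(&rdt_group_mutex);
		err = closid_alloc(&closid);	/* takes the initial reference */
		if (!err)
			tsk->closid = closid;
		mutex_unlock(&rdt_group_mutex);

		return err;
	}

	static void example_detach(struct task_struct *tsk)
	{
		mutex_lock(&rdt_group_mutex);
		closid_put(tsk->closid);	/* frees the closid when refcnt hits 0 */
		tsk->closid = 0;
		mutex_unlock(&rdt_group_mutex);
	}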

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 12 ++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 82 +++++++++++++++++++++++++++++++++++++++-
 include/linux/sched.h            |  3 ++
 3 files changed, 95 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..88b7643
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,12 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_INTEL_RDT
+
+struct clos_cbm_table {
+	unsigned long l3_cbm;
+	unsigned int clos_refcnt;
+};
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index f49e970..d79213a 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,17 +24,95 @@
 
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
+ */
+static struct clos_cbm_table *cctable;
+/*
+ * closid availability bit map.
+ */
+unsigned long *closmap;
+static DEFINE_MUTEX(rdt_group_mutex);
+
+static inline void closid_get(u32 closid)
+{
+	struct clos_cbm_table *cct = &cctable[closid];
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	cct->clos_refcnt++;
+}
+
+static int closid_alloc(u32 *closid)
+{
+	u32 maxid;
+	u32 id;
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	maxid = boot_cpu_data.x86_cache_max_closid;
+	id = find_first_zero_bit(closmap, maxid);
+	if (id == maxid)
+		return -ENOSPC;
+
+	set_bit(id, closmap);
+	closid_get(id);
+	*closid = id;
+
+	return 0;
+}
+
+static inline void closid_free(u32 closid)
+{
+	clear_bit(closid, closmap);
+	cctable[closid].l3_cbm = 0;
+}
+
+static void closid_put(u32 closid)
+{
+	struct clos_cbm_table *cct = &cctable[closid];
+
+	lockdep_assert_held(&rdt_group_mutex);
+	if (WARN_ON(!cct->clos_refcnt))
+		return;
+
+	if (!--cct->clos_refcnt)
+		closid_free(closid);
+}
 
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
+	u32 maxid, max_cbm_len;
+	int err = 0, size;
 
 	if (!cpu_has(c, X86_FEATURE_CAT_L3))
 		return -ENODEV;
 
-	pr_info("Intel cache allocation detected\n");
+	maxid = c->x86_cache_max_closid;
+	max_cbm_len = c->x86_cache_max_cbm_len;
 
-	return 0;
+	size = maxid * sizeof(struct clos_cbm_table);
+	cctable = kzalloc(size, GFP_KERNEL);
+	if (!cctable) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	size = BITS_TO_LONGS(maxid) * sizeof(long);
+	closmap = kzalloc(size, GFP_KERNEL);
+	if (!closmap) {
+		kfree(cctable);
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	pr_info("Intel cache allocation enabled\n");
+out_err:
+
+	return err;
 }
 
 late_initcall(intel_rdt_late_init);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501..24bfbac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1668,6 +1668,9 @@ struct task_struct {
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
 #endif
+#ifdef CONFIG_INTEL_RDT
+	u32 closid;
+#endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
 #ifdef CONFIG_COMPAT
-- 
1.8.1.2


* [PATCH V15 06/11] x86/intel_rdt: Add L3 cache capacity bitmask management
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (4 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 05/11] x86/intel_rdt: Add Class of service management Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 07/11] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

This patch adds APIs to manage the L3 cache capacity bitmask.
The capacity bit mask (CBM) must have only contiguous bits set. The
current implementation has a global CBM for each class of service id.
APIs are added to update the CBM via MSR writes to IA32_L3_MASK_n
on all packages. Other APIs read and write entries of the
clos_cbm_table. A standalone illustration of the contiguity rule follows.
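
As an illustration of the contiguity requirement (a self-contained
user-space sketch; the helper name here is made up and is not the
cbm_validate() added by this patch):

	#include <stdbool.h>
	#include <stdio.h>

	/*
	 * A non-zero mask is contiguous if adding its lowest set bit to it
	 * clears the entire run of set bits.
	 */
	static bool cbm_is_contiguous(unsigned long cbm)
	{
		if (!cbm)
			return false;
		return ((cbm + (cbm & -cbm)) & cbm) == 0;
	}

	int main(void)
	{
		printf("0x000f0 -> %d\n", cbm_is_contiguous(0x000f0)); /* 1: bits 4-7 */
		printf("0x000a5 -> %d\n", cbm_is_contiguous(0x000a5)); /* 0: has gaps */
		return 0;
	}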

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |   4 ++
 arch/x86/kernel/cpu/intel_rdt.c  | 133 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 88b7643..4f45dc8 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,6 +3,10 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#define MAX_CBM_LENGTH			32
+#define IA32_L3_CBM_BASE		0xc90
+#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+
 struct clos_cbm_table {
 	unsigned long l3_cbm;
 	unsigned int clos_refcnt;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index d79213a..6ad5b48 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -34,8 +34,22 @@ static struct clos_cbm_table *cctable;
  * closid availability bit map.
  */
 unsigned long *closmap;
+/*
+ * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
+ */
+static cpumask_t rdt_cpumask;
+/*
+ * Temporary cpumask used during hot cpu notification handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdt_group_mutex);
 
+struct rdt_remote_data {
+	int msr;
+	u64 val;
+};
+
 static inline void closid_get(u32 closid)
 {
 	struct clos_cbm_table *cct = &cctable[closid];
@@ -82,11 +96,126 @@ static void closid_put(u32 closid)
 		closid_free(closid);
 }
 
+static bool cbm_validate(unsigned long var)
+{
+	u32 max_cbm_len = boot_cpu_data.x86_cache_max_cbm_len;
+	unsigned long first_bit, zero_bit;
+	u64 max_cbm;
+
+	if (bitmap_weight(&var, max_cbm_len) < 1)
+		return false;
+
+	max_cbm = (1ULL << max_cbm_len) - 1;
+	if (var & ~max_cbm)
+		return false;
+
+	first_bit = find_first_bit(&var, max_cbm_len);
+	zero_bit = find_next_zero_bit(&var, max_cbm_len, first_bit);
+
+	if (find_next_bit(&var, max_cbm_len, zero_bit) < max_cbm_len)
+		return false;
+
+	return true;
+}
+
+static int clos_cbm_table_read(u32 closid, unsigned long *l3_cbm)
+{
+	u32 maxid = boot_cpu_data.x86_cache_max_closid;
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	if (closid >= maxid)
+		return -EINVAL;
+
+	*l3_cbm = cctable[closid].l3_cbm;
+
+	return 0;
+}
+
+/*
+ * clos_cbm_table_update() - Update a clos cbm table entry.
+ * @closid: the closid whose cbm needs to be updated
+ * @cbm: the new cbm value that has to be updated
+ *
+ * This assumes the cbm is validated as per the interface requirements
+ * and the cache allocation requirements(through the cbm_validate).
+ */
+static int clos_cbm_table_update(u32 closid, unsigned long cbm)
+{
+	u32 maxid = boot_cpu_data.x86_cache_max_closid;
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	if (closid >= maxid)
+		return -EINVAL;
+
+	cctable[closid].l3_cbm = cbm;
+
+	return 0;
+}
+
+static bool cbm_search(unsigned long cbm, u32 *closid)
+{
+	u32 maxid = boot_cpu_data.x86_cache_max_closid;
+	u32 i;
+
+	for (i = 0; i < maxid; i++) {
+		if (cctable[i].clos_refcnt &&
+		    bitmap_equal(&cbm, &cctable[i].l3_cbm, MAX_CBM_LENGTH)) {
+			*closid = i;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void closcbm_map_dump(void)
+{
+	u32 i;
+
+	pr_debug("CBMMAP\n");
+	for (i = 0; i < boot_cpu_data.x86_cache_max_closid; i++) {
+		pr_debug("l3_cbm: 0x%x,clos_refcnt: %u\n",
+		 (unsigned int)cctable[i].l3_cbm, cctable[i].clos_refcnt);
+	}
+}
+
+static void msr_cpu_update(void *arg)
+{
+	struct rdt_remote_data *info = arg;
+
+	wrmsrl(info->msr, info->val);
+}
+
+/*
+ * msr_update_all() - Update the msr for all packages.
+ */
+static inline void msr_update_all(int msr, u64 val)
+{
+	struct rdt_remote_data info;
+
+	info.msr = msr;
+	info.val = val;
+	on_each_cpu_mask(&rdt_cpumask, msr_cpu_update, &info, 1);
+}
+
+static inline bool rdt_cpumask_update(int cpu)
+{
+	cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu));
+	if (cpumask_empty(&tmp_cpumask)) {
+		cpumask_set_cpu(cpu, &rdt_cpumask);
+		return true;
+	}
+
+	return false;
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
 	u32 maxid, max_cbm_len;
-	int err = 0, size;
+	int err = 0, size, i;
 
 	if (!cpu_has(c, X86_FEATURE_CAT_L3))
 		return -ENODEV;
@@ -109,6 +238,8 @@ static int __init intel_rdt_late_init(void)
 		goto out_err;
 	}
 
+	for_each_online_cpu(i)
+		rdt_cpumask_update(i);
 	pr_info("Intel cache allocation enabled\n");
 out_err:
 
-- 
1.8.1.2


* [PATCH V15 07/11] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (5 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 06/11] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 08/11] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For
Cache Allocation, the MSR write lets the task fill in the cache
'subset' represented by the task's capacity bit mask.

The high 32 bits in the per processor MSR IA32_PQR_ASSOC represent the
CLOSid. During a context switch the kernel implements this by writing the
CLOSid of the task to the CPU's IA32_PQR_ASSOC MSR.

This patch also implements a common software cache for IA32_PQR_MSR
(RMID 0:9, CLOSid 32:63) to be used by both Cache Monitoring (CMT) and
Cache Allocation. CMT updates the RMID whereas cache_alloc updates the
CLOSid in the software cache. During scheduling, when the new RMID/CLOSid
value is different from the cached values, IA32_PQR_MSR is updated.
Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250
cycles), this software cache is necessary to avoid reading the MSR to
compare the current CLOSid value.

The following considerations are made for the PQR MSR write so that it
minimally impacts the scheduler hot path:
 - This path does not exist on any non-intel platforms.
 - On Intel platforms, this would not exist by default unless INTEL_RDT
 is enabled.
 - Remains a no-op when INTEL_RDT is enabled and the Intel SKU does not
 support the feature.
 - When the feature is available and enabled, never does an MSR write
 till the user manually starts using one of the capacity bit masks.
 - The MSR write is only done when a task with a different CLOSid is
 scheduled on the CPU. Typically if the task groups are bound to be
 scheduled on a set of CPUs, the number of MSR writes is greatly
 reduced.
 - A per CPU cache of CLOSids is maintained to do the check so that we
 don't have to do a rdmsr which actually costs a lot of cycles.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h           | 28 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pqr_common.h          | 27 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c            | 25 +++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 26 +++-----------------------
 arch/x86/kernel/process_64.c               |  6 ++++++
 5 files changed, 89 insertions(+), 23 deletions(-)
 create mode 100644 arch/x86/include/asm/pqr_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 4f45dc8..afb6da3 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,14 +3,42 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#include <linux/jump_label.h>
+
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
 
+extern struct static_key rdt_enable_key;
+void __intel_rdt_sched_in(void *dummy);
+
 struct clos_cbm_table {
 	unsigned long l3_cbm;
 	unsigned int clos_refcnt;
 };
 
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports L3 cache allocation.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+	/*
+	 * Call the schedule in code only when RDT is enabled.
+	 */
+	if (static_key_false(&rdt_enable_key))
+		__intel_rdt_sched_in(NULL);
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
 #endif
 #endif
diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
new file mode 100644
index 0000000..11e985c
--- /dev/null
+++ b/arch/x86/include/asm/pqr_common.h
@@ -0,0 +1,27 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid:		The cached Resource Monitoring ID
+ * @closid:		The cached Class Of Service ID
+ * @rmid_usecnt:	The usage counter for rmid
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+	u32			rmid;
+	u32			closid;
+	int			rmid_usecnt;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 6ad5b48..8379df8 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,6 +24,8 @@
 
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/sched.h>
+#include <asm/pqr_common.h>
 #include <asm/intel_rdt.h>
 
 /*
@@ -44,12 +46,33 @@ static cpumask_t rdt_cpumask;
  */
 static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdt_group_mutex);
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
 
 struct rdt_remote_data {
 	int msr;
 	u64 val;
 };
 
+void __intel_rdt_sched_in(void *dummy)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	u32 closid = current->closid;
+
+	if (closid == state->closid)
+		return;
+
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
+	state->closid = closid;
+}
+
+/*
+ * Synchronize the IA32_PQR_ASSOC MSR of all currently running tasks.
+ */
+static inline void closid_tasks_sync(void)
+{
+	on_each_cpu_mask(cpu_online_mask, __intel_rdt_sched_in, NULL, 1);
+}
+
 static inline void closid_get(u32 closid)
 {
 	struct clos_cbm_table *cct = &cctable[closid];
@@ -240,6 +263,8 @@ static int __init intel_rdt_late_init(void)
 
 	for_each_online_cpu(i)
 		rdt_cpumask_update(i);
+
+	static_key_slow_inc(&rdt_enable_key);
 	pr_info("Intel cache allocation enabled\n");
 out_err:
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 93e54ad..04a696f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,41 +7,22 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/pqr_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid:		The cached Resource Monitoring ID
- * @closid:		The cached Class Of Service ID
- * @rmid_usecnt:	The usage counter for rmid
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
-	u32			rmid;
-	u32			closid;
-	int			rmid_usecnt;
-};
-
 /*
  * The cached intel_pqr_state is strictly per CPU and can never be
  * updated from a remote CPU. Both functions which modify the state
  * (intel_cqm_event_start and intel_cqm_event_stop) are called with
  * interrupts disabled, which is sufficient for the protection.
  */
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -408,9 +389,9 @@ static void __intel_cqm_event_count(void *info);
  */
 static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
 {
-	struct perf_event *event;
 	struct list_head *head = &group->hw.cqm_group_entry;
 	u32 old_rmid = group->hw.cqm_rmid;
+	struct perf_event *event;
 
 	lockdep_assert_held(&cache_mutex);
 
@@ -1265,7 +1246,6 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	state->rmid = 0;
-	state->closid = 0;
 	state->rmid_usecnt = 0;
 
 	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3c1bbcf..82440e4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -48,6 +48,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/intel_rdt.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -447,6 +448,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			loadsegment(ss, __KERNEL_DS);
 	}
 
+	/*
+	 * Load the Intel cache allocation PQR MSR.
+	 */
+	intel_rdt_sched_in();
+
 	return prev_p;
 }
 
-- 
1.8.1.2


* [PATCH V15 08/11] x86/intel_rdt: Hot cpu support for Cache Allocation
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (6 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 07/11] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 09/11] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

This patch adds hot plug cpu support for Intel Cache Allocation. Support
includes updating the cache bitmask MSRs IA32_L3_QOS_n when a new CPU
package comes online or goes offline. The IA32_L3_QOS_n MSRs are one per
class of service on each CPU package. The new package's MSRs are
synchronized with the values of the existing MSRs. Also, the software
cache for the IA32_PQR_ASSOC MSR is reset during hot cpu notifications.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 76 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8379df8..31f8588 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,6 +24,7 @@
 
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/cpu.h>
 #include <linux/sched.h>
 #include <asm/pqr_common.h>
 #include <asm/intel_rdt.h>
@@ -234,6 +235,75 @@ static inline bool rdt_cpumask_update(int cpu)
 	return false;
 }
 
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * which are one per CLOSid on the current package.
+ */
+static void cbm_update_msrs(void *dummy)
+{
+	int maxid = boot_cpu_data.x86_cache_max_closid;
+	struct rdt_remote_data info;
+	unsigned int i;
+
+	for (i = 0; i < maxid; i++) {
+		if (cctable[i].clos_refcnt) {
+			info.msr = CBM_FROM_INDEX(i);
+			info.val = cctable[i].l3_cbm;
+			msr_cpu_update(&info);
+		}
+	}
+}
+
+static inline void intel_rdt_cpu_start(int cpu)
+{
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+	state->closid = 0;
+	mutex_lock(&rdt_group_mutex);
+	if (rdt_cpumask_update(cpu))
+		smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
+	mutex_unlock(&rdt_group_mutex);
+}
+
+static void intel_rdt_cpu_exit(unsigned int cpu)
+{
+	int i;
+
+	mutex_lock(&rdt_group_mutex);
+	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+		mutex_unlock(&rdt_group_mutex);
+		return;
+	}
+
+	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+	cpumask_clear_cpu(cpu, &tmp_cpumask);
+	i = cpumask_any(&tmp_cpumask);
+
+	if (i < nr_cpu_ids)
+		cpumask_set_cpu(i, &rdt_cpumask);
+	mutex_unlock(&rdt_group_mutex);
+}
+
+static int intel_rdt_cpu_notifier(struct notifier_block *nb,
+				  unsigned long action, void *hcpu)
+{
+	unsigned int cpu  = (unsigned long)hcpu;
+
+	switch (action) {
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		intel_rdt_cpu_start(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		intel_rdt_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -261,9 +331,15 @@ static int __init intel_rdt_late_init(void)
 		goto out_err;
 	}
 
+	cpu_notifier_register_begin();
+
 	for_each_online_cpu(i)
 		rdt_cpumask_update(i);
 
+	__hotcpu_notifier(intel_rdt_cpu_notifier, 0);
+
+	cpu_notifier_register_done();
+
 	static_key_slow_inc(&rdt_enable_key);
 	pr_info("Intel cache allocation enabled\n");
 out_err:
-- 
1.8.1.2


* [PATCH V15 09/11] x86/intel_rdt: Intel haswell Cache Allocation enumeration
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (7 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 08/11] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 10/11] x86,cgroup/intel_rdt : Add intel_rdt cgroup documentation Fenghua Yu
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

This patch is specific to Intel Haswell (hsw) server SKUs. Cache
Allocation on hsw servers needs to be enumerated separately as hsw does
not have support for CPUID enumeration of Cache Allocation. This patch
does a probe by writing a CLOSid (Class of service id) into the high 32
bits of IA32_PQR_MSR and seeing if the bits stick. The probe is only done
after confirming that the CPU is an hsw server. Other hardcoded values
are:

 - L3 cache bit mask must be at least two bits.
 - Maximum CLOSids supported is always 4.
 - Maximum bits supported in the cache bit mask is always 20.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 59 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 31f8588..ecaf8e6 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -38,6 +38,10 @@ static struct clos_cbm_table *cctable;
  */
 unsigned long *closmap;
 /*
+ * Minimum bits required in Cache bitmask.
+ */
+static unsigned int min_bitmask_len = 1;
+/*
  * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
  */
 static cpumask_t rdt_cpumask;
@@ -54,6 +58,57 @@ struct rdt_remote_data {
 	u64 val;
 };
 
+/*
+ * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
+ * as it does not have CPUID enumeration support for Cache allocation.
+ *
+ * Probes by writing to the high 32 bits(CLOSid) of the IA32_PQR_MSR and
+ * testing if the bits stick. Max CLOSids is always 4 and max cbm length
+ * is always 20 on hsw server parts. The minimum cache bitmask length
+ * allowed for HSW server is always 2 bits. Hardcode all of them.
+ */
+static inline bool cache_alloc_hsw_probe(void)
+{
+	u32 l, h_old, h_new, h_tmp;
+
+	if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+		return false;
+
+	/*
+	 * Default value is always 0 if feature is present.
+	 */
+	h_tmp = h_old ^ 0x1U;
+	if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+	    rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+		return false;
+
+	if (h_tmp != h_new)
+		return false;
+
+	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+	boot_cpu_data.x86_cache_max_closid = 4;
+	boot_cpu_data.x86_cache_max_cbm_len = 20;
+	min_bitmask_len = 2;
+
+	return true;
+}
+
+static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_CAT_L3))
+		return true;
+
+	/*
+	 * Probe for Haswell server CPUs.
+	 */
+	if (c->x86 == 0x6 && c->x86_model == 0x3f)
+		return cache_alloc_hsw_probe();
+
+	return false;
+}
+
+
 void __intel_rdt_sched_in(void *dummy)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
@@ -126,7 +181,7 @@ static bool cbm_validate(unsigned long var)
 	unsigned long first_bit, zero_bit;
 	u64 max_cbm;
 
-	if (bitmap_weight(&var, max_cbm_len) < 1)
+	if (bitmap_weight(&var, max_cbm_len) < min_bitmask_len)
 		return false;
 
 	max_cbm = (1ULL << max_cbm_len) - 1;
@@ -310,7 +365,7 @@ static int __init intel_rdt_late_init(void)
 	u32 maxid, max_cbm_len;
 	int err = 0, size, i;
 
-	if (!cpu_has(c, X86_FEATURE_CAT_L3))
+	if (!cache_alloc_supported(c))
 		return -ENODEV;
 
 	maxid = c->x86_cache_max_closid;
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V15 10/11] x86,cgroup/intel_rdt : Add intel_rdt cgroup documentation
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (8 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 09/11] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

Add documentation on using the cache allocation cgroup interface with
examples.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 Documentation/cgroups/rdt.txt | 133 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..9fa6c6a
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,133 @@
+        RDT
+        ---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@linux.intel.com
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+  1.1 Why is Cache allocation needed?
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology
+===================================
+
+1.1 Why is Cache allocation needed
+----------------------------------
+
+In today's new processors the number of cores is continuously
+increasing, especially in large scale usage models such as webservers
+and datacenters where VMs are used. The growing number of cores
+increases the number of threads or workloads that can run
+simultaneously. When multi-threaded applications, VMs and workloads run
+concurrently, they compete for shared resources including L3 cache.
+
+The architecture also allows these cache subsets to be changed
+dynamically at runtime to further optimize the performance of the
+higher priority application with minimal degradation to the low
+priority app. Additionally, resources can be rebalanced for system
+throughput benefit. This technique may be useful in managing large
+computer systems with large L3 caches.
+
+Cloud/Container use case:
+The key use case scenarios are in large server clusters in a typical
+cloud or container context. A central 'managing agent' would control
+resource allocations to a set of VMs or containers. In today's resource
+management, cgroups are widely used already and a significant amount of
+plumbing in user space is already done to perform tasks like
+allocating/configuring resources dynamically and statically. An
+important example is dockers using systemd and systemd in turn using
+cgroups in its core to manage resources. This makes cgroup interface an
+easily adaptable interface for cache allocation.
+
+Noisy neighbour use case:
+A more specific use case is a streaming application which is constantly
+copying data and accessing a linear address space larger than the L3
+cache, thereby evicting a large amount of cache which could otherwise
+have been used by a high priority computing application. Using the
+cache allocation feature, 'noisy neighbours' like the streaming
+application can be confined to a smaller portion of the cache while the
+high priority application is awarded a larger amount of cache space. A
+managing agent can monitor the cache allocation using cache monitoring
+through libperf and make resource adjustments either statically or
+dynamically.
+This interface hence helps in maintaining a resource policy to meet
+quality of service requirements such as the number of requests handled
+and response time.
+
+More information can be found in the Intel SDM June 2015, Volume 3,
+section 17.16. More information on kernel implementation details can be
+found in Documentation/x86/intel_rdt.txt.
+
+2. Usage examples and syntax
+============================
+
+Following is an example of how a system administrator/root user can
+configure L3 cache allocation for threads.
+
+To enable cache allocation at compile time, set
+CONFIG_INTEL_RDT=y.
+
+To check if Cache allocation was enabled on your system
+  $ dmesg | grep -i intel_rdt
+  intel_rdt: Intel Cache Allocation enabled
+
+  $ cat /proc/cpuinfo
+The output will show 'rdt' (if RDT is enabled) and 'cat_l3' (if L3
+cache allocation is enabled).
+
+example1: The following commands mount the cache allocation cgroup
+subsystem and create 2 directories.
+
+  $ cd /sys/fs/cgroup
+  $ mkdir rdt
+  $ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
+  $ cd rdt
+  $ mkdir group1
+  $ mkdir group2
+
+Following are some of the Files in the directory
+
+  $ ls
+  intel_rdt.l3_cbm
+  tasks
+
+Say the cache is 4MB (looked up from /proc/cpuinfo) and the max cbm is
+16 bits (indicated by the root node's cbm). The following assigns 1MB
+of cache each to group1 and group2, exclusive of each other.
+
+  $ cd group1
+  $ /bin/echo 0xf > intel_rdt.l3_cbm
+
+  $ cd group2
+  $ /bin/echo 0xf0 > intel_rdt.l3_cbm
+
+Assign tasks to group2
+
+  $ /bin/echo PID1 > tasks
+  $ /bin/echo PID2 > tasks
+
+Now threads PID1 and PID2 get to fill the 1MB of cache that was
+allocated to group2. Similarly assign tasks to group1.
+
+example2: The commands below allocate '1MB of L3 cache on socket1 to
+group1' and '2MB of L3 cache on socket2 to group2'.
+This mounts both cpuset and intel_rdt, hence ls would list the files
+of both subsystems.
+  $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
+
+Assign the cache
+  $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
+  $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm
+
+Assign tasks for group1 and group2
+  $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
+  $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
+  $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
+  $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks
+
+Tie group1 to socket1 and group2 to socket2
+  $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
+  $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
-- 
1.8.1.2
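
As a worked complement to the sizing example in the documentation above,
here is a small hypothetical helper (not part of the patch set; the 4MB
cache and 16-bit CBM figures are simply the ones used in the example)
that turns a requested allocation into a contiguous l3_cbm value:

#include <stdio.h>

/*
 * Hypothetical helper: given the cache size, the CBM width shown by the
 * root group, and a desired allocation, build a contiguous mask of the
 * matching number of bits starting at 'first_bit'. Assumes sane inputs
 * (cache_kb >= max_cbm_len, bits fit within the CBM width).
 */
static unsigned long cbm_for(unsigned int cache_kb, unsigned int max_cbm_len,
			     unsigned int want_kb, unsigned int first_bit)
{
	unsigned int kb_per_bit = cache_kb / max_cbm_len;
	unsigned int bits = (want_kb + kb_per_bit - 1) / kb_per_bit;

	return ((1UL << bits) - 1) << first_bit;
}

int main(void)
{
	/* 1MB out of a 4MB cache with a 16-bit CBM needs 4 bits each. */
	printf("group1 l3_cbm: 0x%lx\n", cbm_for(4096, 16, 1024, 0)); /* 0xf  */
	printf("group2 l3_cbm: 0x%lx\n", cbm_for(4096, 16, 1024, 4)); /* 0xf0 */
	return 0;
}

The two printed masks match the 0xf and 0xf0 values echoed into
intel_rdt.l3_cbm in example1 above.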


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (9 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 10/11] x86,cgroup/intel_rdt : Add intel_rdt cgroup documentation Fenghua Yu
@ 2015-10-02  6:09 ` Fenghua Yu
  2015-11-18 20:58   ` Marcelo Tosatti
                     ` (2 more replies)
  2015-10-11 19:50 ` [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Thomas Gleixner
                   ` (2 subsequent siblings)
  13 siblings, 3 replies; 42+ messages in thread
From: Fenghua Yu @ 2015-10-02  6:09 UTC (permalink / raw)
  To: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, x86, Fenghua Yu, Vikas Shivappa

Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
directory is associated with a class of service id (closid). To map a
task to a closid during scheduling, this patch removes the closid field
from task_struct and uses the already existing 'cgroups' field in
task_struct.

The cgroup has a file 'l3_cbm' which represents the L3 cache capacity
bitmask (CBM). The CBM is currently global for the whole system. The
capacity bitmask must have only contiguous bits set, and the number of
bits set cannot exceed the maximum CBM length. The tasks belonging to a
cgroup get to fill the portion of the L3 cache represented by the
cgroup's capacity bitmask. For example, if the maximum number of bits
in the CBM is 10 and the cache size is 10MB, each bit represents 1MB of
cache capacity.

The root cgroup always has all the bits set in l3_cbm. Users can create
more cgroups with the mkdir syscall. By default child cgroups inherit
the capacity bitmask (CBM) from their parent. Users can change the CBM,
specified in hex, for each cgroup. Each unique bitmask is associated
with a class of service ID, and -ENOSPC is returned once we run out of
closids.
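
For illustration, a minimal userspace-style sketch of the bitmask rules
described above (a hypothetical helper, not part of this patch; the
in-kernel check is cbm_validate()):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the l3_cbm rules: the mask must be non-empty, contiguous,
 * and no wider than the maximum CBM length reported by the hardware.
 * With max_cbm_len == 10 and a 10MB cache, each set bit stands for
 * roughly 1MB of capacity.
 */
static bool l3_cbm_is_valid(uint64_t cbm, unsigned int max_cbm_len)
{
	uint64_t max_mask = (1ULL << max_cbm_len) - 1;

	if (cbm == 0 || (cbm & ~max_mask))
		return false;

	/* Drop trailing zeros; a contiguous run then looks like 2^n - 1. */
	cbm >>= __builtin_ctzll(cbm);

	return (cbm & (cbm + 1)) == 0;
}

Under these rules a mask like 0x3f0 is accepted, while 0x505 is
rejected because its set bits are not contiguous.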

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  37 +++++++-
 arch/x86/kernel/cpu/intel_rdt.c  | 194 +++++++++++++++++++++++++++++++++++++--
 include/linux/cgroup_subsys.h    |   4 +
 include/linux/sched.h            |   3 -
 init/Kconfig                     |   4 +-
 5 files changed, 229 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index afb6da3..fbe1e00 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,6 +3,7 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#include <linux/cgroup.h>
 #include <linux/jump_label.h>
 
 #define MAX_CBM_LENGTH			32
@@ -12,20 +13,54 @@
 extern struct static_key rdt_enable_key;
 void __intel_rdt_sched_in(void *dummy);
 
+struct intel_rdt {
+	struct cgroup_subsys_state css;
+	u32 closid;
+};
+
 struct clos_cbm_table {
 	unsigned long l3_cbm;
 	unsigned int clos_refcnt;
 };
 
 /*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+	return css_rdt(ir->css.parent);
+}
+
+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+	return css_rdt(task_css(task, intel_rdt_cgrp_id));
+}
+
+/*
  * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
  *
  * Following considerations are made so that this has minimal impact
  * on scheduler hot path:
  * - This will stay as no-op unless we are running on an Intel SKU
  * which supports L3 cache allocation.
+ * - When support is present and enabled, does not do any
+ * IA32_PQR_MSR writes until the user starts really using the feature
+ * i.e. creates an rdt cgroup directory and assigns a cache_mask that's
+ * different from the root cgroup's cache_mask.
  * - Caches the per cpu CLOSid values and does the MSR write only
- * when a task with a different CLOSid is scheduled in.
+ * when a task with a different CLOSid is scheduled in. That
+ * means the task belongs to a different cgroup.
+ * - Closids are allocated so that different cgroup directories
+ * with the same cache_mask get the same CLOSid. This minimizes CLOSids
+ * used and reduces MSR write frequency.
  */
 static inline void intel_rdt_sched_in(void)
 {
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index ecaf8e6..cb4d2ef 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -53,6 +53,10 @@ static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdt_group_mutex);
 struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
 
+static struct intel_rdt rdt_root_group;
+#define rdt_for_each_child(pos_css, parent_ir)		\
+	css_for_each_child((pos_css), &(parent_ir)->css)
+
 struct rdt_remote_data {
 	int msr;
 	u64 val;
@@ -108,17 +112,16 @@ static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
 	return false;
 }
 
-
 void __intel_rdt_sched_in(void *dummy)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 closid = current->closid;
+	struct intel_rdt *ir = task_rdt(current);
 
-	if (closid == state->closid)
+	if (ir->closid == state->closid)
 		return;
 
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
-	state->closid = closid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid);
+	state->closid = ir->closid;
 }
 
 /*
@@ -359,15 +362,175 @@ static int intel_rdt_cpu_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static struct cgroup_subsys_state *
+intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct intel_rdt *parent = css_rdt(parent_css);
+	struct intel_rdt *ir;
+
+	/*
+	 * cgroup_init cannot handle failures gracefully.
+	 * Return rdt_root_group.css instead of failure
+	 * always even when Cache allocation is not supported.
+	 */
+	if (!parent)
+		return &rdt_root_group.css;
+
+	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+	if (!ir)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&rdt_group_mutex);
+	ir->closid = parent->closid;
+	closid_get(ir->closid);
+	mutex_unlock(&rdt_group_mutex);
+
+	return &ir->css;
+}
+
+static void intel_rdt_css_free(struct cgroup_subsys_state *css)
+{
+	struct intel_rdt *ir = css_rdt(css);
+
+	mutex_lock(&rdt_group_mutex);
+	closid_put(ir->closid);
+	kfree(ir);
+	mutex_unlock(&rdt_group_mutex);
+}
+
+static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v)
+{
+	struct intel_rdt *ir = css_rdt(seq_css(m));
+	unsigned long l3_cbm = 0;
+
+	clos_cbm_table_read(ir->closid, &l3_cbm);
+	seq_printf(m, "%08lx\n", l3_cbm);
+
+	return 0;
+}
+
+static int cbm_validate_rdt_cgroup(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+	struct cgroup_subsys_state *css;
+	struct intel_rdt *par, *c;
+	unsigned long cbm_tmp = 0;
+	int err = 0;
+
+	if (!cbm_validate(cbmvalue)) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	par = parent_rdt(ir);
+	clos_cbm_table_read(par->closid, &cbm_tmp);
+	if (!bitmap_subset(&cbmvalue, &cbm_tmp, MAX_CBM_LENGTH)) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	rcu_read_lock();
+	rdt_for_each_child(css, ir) {
+		c = css_rdt(css);
+		clos_cbm_table_read(par->closid, &cbm_tmp);
+		if (!bitmap_subset(&cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+			rcu_read_unlock();
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+	rcu_read_unlock();
+out_err:
+
+	return err;
+}
+
+/*
+ * intel_cache_alloc_cbm_write() - Validates and writes the
+ * cache bit mask(cbm) to the IA32_L3_MASK_n
+ * and also store the same in the cctable.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * This helps to use the scant CLOSids optimally. This also
+ * implies that at context switch write to PQR-MSR is done
+ * only when a task with a different bitmask is scheduled in.
+ */
+static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 cbmvalue)
+{
+	struct intel_rdt *ir = css_rdt(css);
+	unsigned long ccbm = 0;
+	int err = 0;
+	u32 closid;
+
+	if (ir == &rdt_root_group)
+		return -EPERM;
+
+	/*
+	 * Need global mutex as cbm write may allocate a closid.
+	 */
+	mutex_lock(&rdt_group_mutex);
+
+	clos_cbm_table_read(ir->closid, &ccbm);
+	if (cbmvalue == ccbm)
+		goto out;
+
+	err = cbm_validate_rdt_cgroup(ir, cbmvalue);
+	if (err)
+		goto out;
+
+	/*
+	 * Try to get a reference for a different CLOSid and release the
+	 * reference to the current CLOSid.
+	 * Need to put down the reference here and get it back in case we
+	 * run out of closids. Otherwise we run into a problem when
+	 * we could be using the last closid that could have been available.
+	 */
+	closid_put(ir->closid);
+	if (cbm_search(cbmvalue, &closid)) {
+		ir->closid = closid;
+		closid_get(closid);
+	} else {
+		closid = ir->closid;
+		err = closid_alloc(&ir->closid);
+		if (err) {
+			closid_get(ir->closid);
+			goto out;
+		}
+
+		clos_cbm_table_update(ir->closid, cbmvalue);
+		msr_update_all(CBM_FROM_INDEX(ir->closid), cbmvalue);
+	}
+	closid_tasks_sync();
+	closcbm_map_dump();
+out:
+	mutex_unlock(&rdt_group_mutex);
+
+	return err;
+}
+
+static void rdt_cgroup_init(void)
+{
+	int max_cbm_len = boot_cpu_data.x86_cache_max_cbm_len;
+	u32 closid;
+
+	closid_alloc(&closid);
+
+	WARN_ON(closid != 0);
+
+	rdt_root_group.closid = closid;
+	clos_cbm_table_update(closid, (1ULL << max_cbm_len) - 1);
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
 	u32 maxid, max_cbm_len;
 	int err = 0, size, i;
 
-	if (!cache_alloc_supported(c))
+	if (!cache_alloc_supported(c)) {
+		rdt_root_group.css.ss->disabled = 1;
 		return -ENODEV;
-
+	}
 	maxid = c->x86_cache_max_closid;
 	max_cbm_len = c->x86_cache_max_cbm_len;
 
@@ -394,6 +557,7 @@ static int __init intel_rdt_late_init(void)
 	__hotcpu_notifier(intel_rdt_cpu_notifier, 0);
 
 	cpu_notifier_register_done();
+	rdt_cgroup_init();
 
 	static_key_slow_inc(&rdt_enable_key);
 	pr_info("Intel cache allocation enabled\n");
@@ -403,3 +567,19 @@ out_err:
 }
 
 late_initcall(intel_rdt_late_init);
+
+static struct cftype rdt_files[] = {
+	{
+		.name		= "l3_cbm",
+		.seq_show	= intel_cache_alloc_cbm_read,
+		.write_u64	= intel_cache_alloc_cbm_write,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys intel_rdt_cgrp_subsys = {
+	.css_alloc		= intel_rdt_css_alloc,
+	.css_free		= intel_rdt_css_free,
+	.legacy_cftypes		= rdt_files,
+	.early_init		= 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 1a96fda..c559ef5 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -58,6 +58,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_INTEL_RDT)
+SUBSYS(intel_rdt)
+#endif
+
 /*
  * Subsystems that implement the can_fork() family of callbacks.
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24bfbac..b7b9501 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1668,9 +1668,6 @@ struct task_struct {
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
 #endif
-#ifdef CONFIG_INTEL_RDT
-	u32 closid;
-#endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
 #ifdef CONFIG_COMPAT
diff --git a/init/Kconfig b/init/Kconfig
index 9fe3f11..e0e18d5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -938,6 +938,8 @@ menuconfig CGROUPS
 
 	  Say N if unsure.
 
+if CGROUPS
+
 config INTEL_RDT
 	bool "Intel Resource Director Technology support"
 	depends on X86_64 && CPU_SUP_INTEL
@@ -950,8 +952,6 @@ config INTEL_RDT
 
 	  Say N if unsure.
 
-if CGROUPS
-
 config CGROUP_DEBUG
 	bool "Example debug cgroup subsystem"
 	default n
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (10 preceding siblings ...)
  2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
@ 2015-10-11 19:50 ` Thomas Gleixner
  2015-10-12 18:52   ` Yu, Fenghua
  2015-10-13 21:31   ` Marcelo Tosatti
  2015-11-04 14:42 ` Luiz Capitulino
  2015-11-05  2:19 ` [PATCH 1/2] x86/intel_rdt,intel_cqm: Remove build dependency of RDT code on CQM code David Carrillo-Cisneros
  13 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2015-10-11 19:50 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Peter Zijlstra, linux-kernel, x86,
	Vikas Shivappa, Marcelo Tosatti

Fenghua,

On Thu, 1 Oct 2015, Fenghua Yu wrote:

+Cc: Marcelo

> This series has some preparatory patches and Intel cache allocation
> support.

<snip>

> Changes in v15:
>  - Add a global IPI to update the closid on CPUs for current tasks.
>  - Other minor changes where I remove the updating of clos_cbm_table to be
>  set to all 1s during init.
>  - Fix a few compilation warnings.
>  - Port the patches to 4.3-rc.
 
What's the state of the interface discussion? I have not yet seen any
agreement on that, unless I missed the important mail.

Aside of that, the patches miss a proper

From: Vikas ...

in the patch body at least for those which are untouched by you.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-11 19:50 ` [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Thomas Gleixner
@ 2015-10-12 18:52   ` Yu, Fenghua
  2015-10-12 19:58     ` Thomas Gleixner
  2015-10-13 22:40     ` Marcelo Tosatti
  2015-10-13 21:31   ` Marcelo Tosatti
  1 sibling, 2 replies; 42+ messages in thread
From: Yu, Fenghua @ 2015-10-12 18:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H Peter Anvin, Ingo Molnar, Peter Zijlstra, linux-kernel, x86,
	Vikas Shivappa, Marcelo Tosatti

> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Sunday, October 11, 2015 12:50 PM
> To: Yu, Fenghua
> Cc: H Peter Anvin; Ingo Molnar; Peter Zijlstra; linux-kernel; x86; Vikas
> Shivappa; Marcelo Tosatti
> Subject: Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology
> Support
> 
> Fenghua,
> 
> On Thu, 1 Oct 2015, Fenghua Yu wrote:
> 
> +Cc: Marcelo
> 
> > This series has some preparatory patches and Intel cache allocation
> > support.
> 
> <snip>
> 
> > Changes in v15:
> >  - Add a global IPI to update the closid on CPUs for current tasks.
> >  - Other minor changes where I remove the updating of clos_cbm_table
> > to be  set to all 1s during init.
> >  - Fix a few compilation warnings.
> >  - Port the patches to 4.3-rc.
> 
> What's the state of the interface discussion? I have not yet seen any
> agreement on that, unless I missed the important mail.

Peter Anvin will discuss the interface with Tejun during the Kernel Summit. Hopefully Tejun will agree with the current cgroup interface in the patch set.

> 
> Aside of that, the patches miss a proper
> 
> From: Vikas ...
> 
> in the patch body at least for those which are untouched by you.

I'll send patch set v16 today with "From: Vikas Shivappa <vikas.shivappa@linux.intel.com>" in the patches.

Vikas is on sabbatical until Dec. I'm covering for him during his leave.

Thanks.

-Fenghua


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-12 18:52   ` Yu, Fenghua
@ 2015-10-12 19:58     ` Thomas Gleixner
  2015-10-13 22:40     ` Marcelo Tosatti
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2015-10-12 19:58 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: H Peter Anvin, Ingo Molnar, Peter Zijlstra, linux-kernel, x86,
	Vikas Shivappa, Marcelo Tosatti

On Mon, 12 Oct 2015, Yu, Fenghua wrote:
> > What's the state of the interface discussion? I have not yet seen any
> > agreement on that, unless I missed the important mail.
> 
> Peter Anvin will discuss the interface with Tejun during the Kernel
> Summit. Hopefully Tejun will agree with the current cgroup interface
> in the patch set.

I hope Marcelo is there as well. He had very good arguments and actual
use cases.

> I'll send patch set v16 today with "From: Vikas Shivappa
> <vikas.shivappa@linux.intel.com>" in the patches.

Please don't before the interface is sorted out.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-11 19:50 ` [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Thomas Gleixner
  2015-10-12 18:52   ` Yu, Fenghua
@ 2015-10-13 21:31   ` Marcelo Tosatti
  2015-10-15 11:36     ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-13 21:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Sun, Oct 11, 2015 at 09:50:12PM +0200, Thomas Gleixner wrote:
> Fenghua,
> 
> On Thu, 1 Oct 2015, Fenghua Yu wrote:
> 
> +Cc: Marcelo
> 
> > This series has some preparatory patches and Intel cache allocation
> > support.
> 
> <snip>
> 
> > Changes in v15:
> >  - Add a global IPI to update the closid on CPUs for current tasks.
> >  - Other minor changes where I remove the updating of clos_cbm_table to be
> >  set to all 1s during init.
> >  - Fix a few compilation warnings.
> >  - Port the patches to 4.3-rc.
>  
> What's the state of the interface discussion? I have not yet seen any
> agreement on that, unless I missed the important mail.
> 
> Aside of that, the patches miss a proper
> 
> From: Vikas ...
> 
> in the patch body at least for those which are untouched by you.
> 
> Thanks,
> 
> 	tglx

There are a number of problems with the patches as discussed in the
thread.

I am rewriting the interface with ioctls, with commands similar to the
syscall interface proposed.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-12 18:52   ` Yu, Fenghua
  2015-10-12 19:58     ` Thomas Gleixner
@ 2015-10-13 22:40     ` Marcelo Tosatti
  2015-10-15 11:37       ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-13 22:40 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Thomas Gleixner, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Mon, Oct 12, 2015 at 06:52:49PM +0000, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:tglx@linutronix.de]
> > Sent: Sunday, October 11, 2015 12:50 PM
> > To: Yu, Fenghua
> > Cc: H Peter Anvin; Ingo Molnar; Peter Zijlstra; linux-kernel; x86; Vikas
> > Shivappa; Marcelo Tosatti
> > Subject: Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology
> > Support
> > 
> > Fenghua,
> > 
> > On Thu, 1 Oct 2015, Fenghua Yu wrote:
> > 
> > +Cc: Marcelo
> > 
> > > This series has some preparatory patches and Intel cache allocation
> > > support.
> > 
> > <snip>
> > 
> > > Changes in v15:
> > >  - Add a global IPI to update the closid on CPUs for current tasks.
> > >  - Other minor changes where I remove the updating of clos_cbm_table
> > > to be  set to all 1s during init.
> > >  - Fix a few compilation warnings.
> > >  - Port the patches to 4.3-rc.
> > 
> > What's the state of the interface discussion? I have not yet seen any
> > agreement on that, unless I missed the important mail.
> 
> Peter Anvin will discuss the interface with Tejun during the Kernel Summit. Hopefully Tejun will agree with the current cgroup interface in the patch set.

How can you fix the issue of sockets with different reserved cache
regions with hw in the cgroup interface?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-13 21:31   ` Marcelo Tosatti
@ 2015-10-15 11:36     ` Peter Zijlstra
  2015-10-16  2:28       ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-15 11:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Thomas Gleixner, Fenghua Yu, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Tue, Oct 13, 2015 at 06:31:27PM -0300, Marcelo Tosatti wrote:
> I am rewriting the interface with ioctls, with commands similar to the
> syscall interface proposed.

Which is horrible for other use cases. I really don't see the problem
with the cgroup stuff.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-13 22:40     ` Marcelo Tosatti
@ 2015-10-15 11:37       ` Peter Zijlstra
  2015-10-16  0:17         ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-15 11:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, Thomas Gleixner, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> How can you fix the issue of sockets with different reserved cache
> regions with hw in the cgroup interface?

No idea what you're referring to. But IOCTLs blow.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-15 11:37       ` Peter Zijlstra
@ 2015-10-16  0:17         ` Marcelo Tosatti
  2015-10-16  9:44           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-16  0:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu, Fenghua, Thomas Gleixner, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > How can you fix the issue of sockets with different reserved cache
> > regions with hw in the cgroup interface?
> 
> No idea what you're referring to. But IOCTLs blow.

Tejun brought up syscalls. Syscalls seem too generic.
So ioctls were chosen instead.

It is necessary to perform the following operations:

1) create cache reservation (params = size, type).
2) delete cache reservation.
3) attach cache reservation (params = cache reservation id, pid).
4) detach cache reservation (params = cache reservation id, pid).

Can it done via cgroups? If so, works for me.

A list of problems with the cgroup interface has been written,
in the thread... and we found another problem.


List of problems with cgroup interface:

1) Global IPI on CBM <---> task change does not scale.

 * cbm_update_all() - Update the cache bit mask for all packages.
 */
static inline void cbm_update_all(u32 closid)
{
       on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid,
1);
}

Consider a machine with 32 sockets.

2) Syscall interface specification is in kbytes, not
cache ways (which is what must be recorded by the OS
to allow migration of the OS between different
hardware systems).

3) Compilers are able to configure cache optimally for
given ranges of code inside applications, easily,
if desired.

4) Problem-2: The decision to allocate cache is tied to application
initialization / destruction, and application initialization is
essentially random from the POV of the system (the events which trigger
the execution of the application are not visible from the system).

Think of a server running two different servers: one database
with requests that are received with poisson distribution, average 30
requests per hour, and every request takes 1 minute.

One httpd server with nearly constant load.

Without cache reservations, database requests takes 2 minutes.
That is not acceptable for the database clients.
But with cache reservation, database requests takes 1 minute.

You want to maximize performance of httpd and database requests
What you do? You allow the database server to perform cache
reservation once a request comes in, and to undo the reservation
once the request is finished.

Its impossible to perform this with a centralized interface.

5) Modify scenario 2 above as follows: each database request
is handled by two newly created threads, and they share a certain
percentage
of data cache, and a certain percentage of code cache.

So the dispatcher thread, on arrival of request, has to:

        - create data cache reservation = tcrid-A.
        - create code cache reservation = tcrid-B.
        - create thread-1.
        - assign tcird-A and B to thread-1.
        - create thread-2.
        - assign tcird-A and B to thread-2.

6) Create reservations in such a way that the sum is larger than
total amount of cache, and CPU pinning (example from Karen Noel):

VM-1 on socket-1 with 80% of reservation.
VM-2 on socket-2 with 80% of reservation.
VM-1 pinned to socket-1.
VM-2 pinned to socket-2.

Cgroups interface attempts to set a cache mask globally. This is the
problem the "expand" proposal solves:
https://lkml.org/lkml/2015/7/29/682

7) Consider two sockets with different region of L3 cache
shared with HW:

— CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
within the length of the CBM
indicates the corresponding unit of the L3 allocation may be used by
other entities in the platform (e.g. an
integrated graphics engine or hardware units outside the processor core
and have direct access to L3).
Each cleared bit within the length of the CBM indicates the
corresponding allocation unit can be configured
to implement a priority-based allocation scheme chosen by an OS/VMM
without interference with other
hardware agents in the system. Bits outside the length of the CBM are
reserved.

You want the kernel to maintain different bitmasks in the CBM:

        socket1 [range-A]
        socket2 [range-B]

And the kernel will automatically switch from range A to range B
when the thread switches sockets.

---------------------

Problems 6, 7 and 2 are fatal for us. If you can fix them in the cgroup
interface, we can use it (please understand these problems, you seem to 
ignore them for some reason).

Problems 1 4 and 5 seem to come from Tejun.

Problem 3 could be a possibility.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-15 11:36     ` Peter Zijlstra
@ 2015-10-16  2:28       ` Marcelo Tosatti
  2015-10-16  9:50         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-16  2:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Fenghua Yu, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 15, 2015 at 01:36:14PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 06:31:27PM -0300, Marcelo Tosatti wrote:
> > I am rewriting the interface with ioctls, with commands similar to the
> > syscall interface proposed.
> 
> Which is horrible for other use cases. I really don't see the problem
> with the cgroup stuff.

Can you detail what "horrible" means? 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16  0:17         ` Marcelo Tosatti
@ 2015-10-16  9:44           ` Peter Zijlstra
  2015-10-16 20:24             ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-16  9:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, Thomas Gleixner, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 15, 2015 at 09:17:16PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > > How can you fix the issue of sockets with different reserved cache
> > > regions with hw in the cgroup interface?
> > 
> > No idea what you're referring to. But IOCTLs blow.
> 
> Tejun brought up syscalls. Syscalls seem too generic.
> So ioctls were chosen instead.
> 
> It is necessary to perform the following operations:
> 
> 1) create cache reservation (params = size, type).

mkdir

> 2) delete cache reservation.

rmdir

> 3) attach cache reservation (params = cache reservation id, pid).
> 4) detach cache reservation (params = cache reservation id, pid).

echo $pid > tasks

> Can it done via cgroups? If so, works for me.

Trivially.

> A list of problems with the cgroup interface has been written,
> in the thread... and we found another problem.

Which was endless and tiresome so I stopped reading.

> List of problems with cgroup interface:
> 
> 1) Global IPI on CBM <---> task change does not scale.
> 
>  * cbm_update_all() - Update the cache bit mask for all packages.
>  */
> static inline void cbm_update_all(u32 closid)
> {
>        on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid,
> 1);
> }

There is no way around that, the moment you view the CBM as a global
resource; ie. a CBM is configured the same on all sockets; you need to
do this for a task using that CBM might run on any CPU at any time.

This is not because of the cgroup interface at all. This is because you
want CBMs to be the same machine wide.

The only way to actually change that is to _be_ a cgroup and co-mount
with cpusets and be incestuous and look at the cpusets state and
discover disjoint groups.

> 2) Syscall interface specification is in kbytes, not
> cache ways (which is what must be recorded by the OS
> to allow migration of the OS between different
> hardware systems).

Meh, that again is nothing fundamental. The cgroup interface could do
bytes just the same.

> 3) Compilers are able to configure cache optimally for
> given ranges of code inside applications, easily,
> if desired.

Yeah, so? Every SKU has a different cache size, so once you're down to
that level you're pretty hard set in your configuration and it really
doesn't matter if you give bytes or ways, you _KNOW_ what your
configuration will be.

> 4) Problem-2: The decision to allocate cache is tied to application
> initialization / destruction, and application initialization is
> essentially random from the POV of the system (the events which trigger
> the execution of the application are not visible from the system).
> 
> Think of a server running two different servers: one database
> with requests that are received with poisson distribution, average 30
> requests per hour, and every request takes 1 minute.
> 
> One httpd server with nearly constant load.
> 
> Without cache reservations, database requests takes 2 minutes.
> That is not acceptable for the database clients.
> But with cache reservation, database requests takes 1 minute.
> 
> You want to maximize performance of httpd and database requests
> What you do? You allow the database server to perform cache
> reservation once a request comes in, and to undo the reservation
> once the request is finished.

> Its impossible to perform this with a centralized interface.

Not so; just a wee bit more fragile that desired. But, this is a
pre-existing problem with cgroups and needs to be solved, not using
cgroups because of this is silly.

Every cgroup that can work on tasks suffers this and arguably a few
more.

> 5) Modify scenario 2 above as follows: each database request
> is handled by two newly created threads, and they share a certain
> percentage
> of data cache, and a certain percentage of code cache.
> 
> So the dispatcher thread, on arrival of request, has to:
> 
>         - create data cache reservation = tcrid-A.
>         - create code cache reservation = tcrid-B.
>         - create thread-1.
>         - assign tcird-A and B to thread-1.
>         - create thread-2.
>         - assign tcird-A and B to thread-2.
> 
> 6) Create reservations in such a way that the sum is larger than
> total amount of cache, and CPU pinning (example from Karen Noel):
> 
> VM-1 on socket-1 with 80% of reservation.
> VM-2 on socket-2 with 80% of reservation.
> VM-1 pinned to socket-1.
> VM-2 pinned to socket-2.
> 
> Cgroups interface attempts to set a cache mask globally. This is the
> problem the "expand" proposal solves:
> https://lkml.org/lkml/2015/7/29/682

That email is unparsable. But the only way to sanely do so it do closely
intertwine oneself with cpusets, doing that with anything other than
another cgroup controller absolutely full on insane.

> 7) Consider two sockets with different region of L3 cache
> shared with HW:
> 
> — CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
> within the length of the CBM
> indicates the corresponding unit of the L3 allocation may be used by
> other entities in the platform (e.g. an
> integrated graphics engine or hardware units outside the processor core
> and have direct access to L3).
> Each cleared bit within the length of the CBM indicates the
> corresponding allocation unit can be configured
> to implement a priority-based allocation scheme chosen by an OS/VMM
> without interference with other
> hardware agents in the system. Bits outside the length of the CBM are
> reserved.
> 
> You want the kernel to maintain different bitmasks in the CBM:
> 
>         socket1 [range-A]
>         socket2 [range-B]
> 
> And the kernel will automatically switch from range A to range B
> when the thread switches sockets.

This is firmly in the insane range of things.. not going to happen full
stop.

It a thread can freely schedule between two CPUs its configuration on
those two CPUs had better bloody be the same.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16  2:28       ` Marcelo Tosatti
@ 2015-10-16  9:50         ` Peter Zijlstra
  2015-10-26 20:02           ` Marcelo Tosatti
  2015-11-02 22:20           ` cat cgroup interface proposal (non hierarchical) was " Marcelo Tosatti
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-16  9:50 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Thomas Gleixner, Fenghua Yu, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 15, 2015 at 11:28:52PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 15, 2015 at 01:36:14PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 13, 2015 at 06:31:27PM -0300, Marcelo Tosatti wrote:
> > > I am rewriting the interface with ioctls, with commands similar to the
> > > syscall interface proposed.
> > 
> > Which is horrible for other use cases. I really don't see the problem
> > with the cgroup stuff.
> 
> Can you detail what "horrible" means? 

Say an RT scenario; you set up your machine with cgroups. You create a
cpuset with is disjoint from the others, you frob around with the cpu
cgroup, etc..

So once you're all done, you start your RT app into a cgroup.

But oh, fail, now you have to go muck about with ioctl()s to get the
cache allocation cruft to work.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16  9:44           ` Peter Zijlstra
@ 2015-10-16 20:24             ` Marcelo Tosatti
  2015-10-19 23:49               ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-16 20:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu, Fenghua, Thomas Gleixner, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Fri, Oct 16, 2015 at 11:44:52AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 15, 2015 at 09:17:16PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > > > How can you fix the issue of sockets with different reserved cache
> > > > regions with hw in the cgroup interface?
> > > 
> > > No idea what you're referring to. But IOCTLs blow.
> > 
> > Tejun brought up syscalls. Syscalls seem too generic.
> > So ioctls were chosen instead.
> > 
> > It is necessary to perform the following operations:
> > 
> > 1) create cache reservation (params = size, type).
> 
> mkdir
> 
> > 2) delete cache reservation.
> 
> rmdir
> 
> > 3) attach cache reservation (params = cache reservation id, pid).
> > 4) detach cache reservation (params = cache reservation id, pid).
> 
> echo $pid > tasks
> 
> > Can it done via cgroups? If so, works for me.
> 
> Trivially.

Fine. 

Tejun brought the problem of locking: how do you coordinate locking
between different users?  (on the mkdir / rmdir scenario above).

> 
> > A list of problems with the cgroup interface has been written,
> > in the thread... and we found another problem.
> 
> Which was endless and tiresome so I stopped reading.
> 
> > List of problems with cgroup interface:
> > 
> > 1) Global IPI on CBM <---> task change does not scale.
> > 
> >  * cbm_update_all() - Update the cache bit mask for all packages.
> >  */
> > static inline void cbm_update_all(u32 closid)
> > {
> >        on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid,
> > 1);
> > }
> 
> There is no way around that, the moment you view the CBM as a global
> resource; ie. a CBM is configured the same on all sockets; you need to
> do this for a task using that CBM might run on any CPU at any time.
> 
> This is not because of the cgroup interface at all. This is because you
> want CBMs to be the same machine wide.

You don't, for two reasons:

1) Item 6 below.
2) Item 7 below.

Please follow on with the discussion (just scroll down and read and
reply inline: item 6 and machine wide CBMs are not incompatible
because...).

> The only way to actually change that is to _be_ a cgroup and co-mount
> with cpusets and be incestuous and look at the cpusets state and
> discover disjoint groups.
> 
> > 2) Syscall interface specification is in kbytes, not
> > cache ways (which is what must be recorded by the OS
> > to allow migration of the OS between different
> > hardware systems).
> 
> Meh, that again is nothing fundamental. The cgroup interface could do
> bytes just the same.

Yes.

> > 3) Compilers are able to configure cache optimally for
> > given ranges of code inside applications, easily,
> > if desired.
> 
> Yeah, so? Every SKU has a different cache size, so once you're down to
> that level you're pretty hard set in your configuration and it really
> doesn't matter if you give bytes or ways, you _KNOW_ what your
> configuration will be.

That item has nothing to do with cache ways in bytes or ways.

> > 4) Problem-2: The decision to allocate cache is tied to application
> > initialization / destruction, and application initialization is
> > essentially random from the POV of the system (the events which trigger
> > the execution of the application are not visible from the system).
> > 
> > Think of a server running two different servers: one database
> > with requests that are received with poisson distribution, average 30
> > requests per hour, and every request takes 1 minute.
> > 
> > One httpd server with nearly constant load.
> > 
> > Without cache reservations, database requests takes 2 minutes.
> > That is not acceptable for the database clients.
> > But with cache reservation, database requests takes 1 minute.
> > 
> > You want to maximize performance of httpd and database requests
> > What you do? You allow the database server to perform cache
> > reservation once a request comes in, and to undo the reservation
> > once the request is finished.
> 
> > Its impossible to perform this with a centralized interface.
> 
> Not so; just a wee bit more fragile that desired. But, this is a
> pre-existing problem with cgroups and needs to be solved, not using
> cgroups because of this is silly.
> 
> Every cgroup that can work on tasks suffers this and arguably a few
> more.
> 
> > 5) Modify scenario 2 above as follows: each database request
> > is handled by two newly created threads, and they share a certain
> > percentage
> > of data cache, and a certain percentage of code cache.
> > 
> > So the dispatcher thread, on arrival of request, has to:
> > 
> >         - create data cache reservation = tcrid-A.
> >         - create code cache reservation = tcrid-B.
> >         - create thread-1.
> >         - assign tcird-A and B to thread-1.
> >         - create thread-2.
> >         - assign tcird-A and B to thread-2.
> > 
> > 6) Create reservations in such a way that the sum is larger than
> > total amount of cache, and CPU pinning (example from Karen Noel):
> > 
> > VM-1 on socket-1 with 80% of reservation.
> > VM-2 on socket-2 with 80% of reservation.
> > VM-1 pinned to socket-1.
> > VM-2 pinned to socket-2.
> > 
> > Cgroups interface attempts to set a cache mask globally. This is the
> > problem the "expand" proposal solves:
> > https://lkml.org/lkml/2015/7/29/682
> 
> That email is unparsable.

Look at item 6. If you create reservations in such a way that the sum
is larger than the total amount of cache, "cosid0", which is the
"unconstrained set of tasks" (ie: rest of the system), has 0 bytes of
L3 cache to reclaim from.

> But the only way to sanely do so it do closely
> intertwine oneself with cpusets, doing that with anything other than
> another cgroup controller absolutely full on insane.

void __intel_rdt_sched_in(void)
{
        struct task_struct *task = current;
        unsigned int cpu = smp_processor_id();
        unsigned int this_socket = topology_physical_package_id(cpu);
        unsigned int start, end;

        /*
         * The CBM bitmask for a particular task is enforced
         * on sched-in to a given processor, and only for the
         * range (cbm_start_bit,cbm_end_bit) which the
         * tcr_list (COSid) owns.
         * This way we allow COSid0 (global task pool) to use
         * reserved L3 cache on sockets where the tasks that
         * reserve the cache have not been scheduled.
         *
         * Since reading the MSRs is slow, it is necessary to
         * cache the MSR CBM map on each socket.
         *
         */

        if (test_bit(this_socket,
                     task->tcrlist->synced_to_socket) == 0) {

Makes sense?

> 
> > 7) Consider two sockets with different region of L3 cache
> > shared with HW:
> > 
> > — CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
> > within the length of the CBM
> > indicates the corresponding unit of the L3 allocation may be used by
> > other entities in the platform (e.g. an
> > integrated graphics engine or hardware units outside the processor core
> > and have direct access to L3).
> > Each cleared bit within the length of the CBM indicates the
> > corresponding allocation unit can be configured
> > to implement a priority-based allocation scheme chosen by an OS/VMM
> > without interference with other
> > hardware agents in the system. Bits outside the length of the CBM are
> > reserved.
> > 
> > You want the kernel to maintain different bitmasks in the CBM:
> > 
> >         socket1 [range-A]
> >         socket2 [range-B]
> > 
> > And the kernel will automatically switch from range A to range B
> > when the thread switches sockets.
> 
> This is firmly in the insane range of things.. not going to happen full
> stop.

Are you saying that hardware will guarantee the reserved region is the
same for all sockets? I asked Vikas and he said this is not the case.

> It a thread can freely schedule between two CPUs its configuration on
> those two CPUs had better bloody be the same.

Its just the (start,end) of the CBM which changes, so on
__intel_rdt_sched_in you do:

                struct per_socket_data *psd = get_socket_data(this_socket);
                struct cache_layout *layout = psd->layout;

                start = task->tcrlist->psd[layout->id].cbm_start;
                end = task->tcrlist->psd[layout->id].cbm_end;
                sync_to_msr(tcrlist, start, end);

Please clarify what you mean.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16 20:24             ` Marcelo Tosatti
@ 2015-10-19 23:49               ` Marcelo Tosatti
  0 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-19 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu, Fenghua, Thomas Gleixner, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

On Fri, Oct 16, 2015 at 05:24:42PM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 16, 2015 at 11:44:52AM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 15, 2015 at 09:17:16PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > > > > How can you fix the issue of sockets with different reserved cache
> > > > > regions with hw in the cgroup interface?
> > > > 
> > > > No idea what you're referring to. But IOCTLs blow.
> > > 
> > > Tejun brought up syscalls. Syscalls seem too generic.
> > > So ioctls were chosen instead.
> > > 
> > > It is necessary to perform the following operations:
> > > 
> > > 1) create cache reservation (params = size, type).
> > 
> > mkdir

You need to specify type and size. So could be:

cd ../cgroups/intel_cat_cgroup/
mkdir reservation-a
cd reservation-a
echo 1000 > size
echo code > type
echo $pid > tasks

So each directory in the intel cat cgroup specifies a reservation, which
contains:
	* size.
	* type.
	* tasks which the reservation is attached to.

> > > 2) delete cache reservation.
> > 
> > rmdir

Detach would simply work when removing tasks from "tasks" field.

> > > 3) attach cache reservation (params = cache reservation id, pid).
> > > 4) detach cache reservation (params = cache reservation id, pid).
> > 
> > echo $pid > tasks
> > 
> > > Can it done via cgroups? If so, works for me.
> > 
> > Trivially.
> 
> Fine. 
> 
> Tejun brought the problem of locking: how do you coordinate locking
> between different users?  (on the mkdir / rmdir scenario above).

Can't see locking issue with "reservation" based interface (that is each
directory is a cache reservation).

Tejun, any comments on this non hierarchical cgroup interface?

(still waiting on you to reply the other emails on this thread, Peter).


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16  9:50         ` Peter Zijlstra
@ 2015-10-26 20:02           ` Marcelo Tosatti
  2015-11-02 22:20           ` cat cgroup interface proposal (non hierarchical) was " Marcelo Tosatti
  1 sibling, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2015-10-26 20:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Fenghua Yu, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa

Hi Peter,

On Fri, Oct 16, 2015 at 11:50:22AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 15, 2015 at 11:28:52PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 15, 2015 at 01:36:14PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 13, 2015 at 06:31:27PM -0300, Marcelo Tosatti wrote:
> > > > I am rewriting the interface with ioctls, with commands similar to the
> > > > syscall interface proposed.
> > > 
> > > Which is horrible for other use cases. I really don't see the problem
> > > with the cgroup stuff.
> > 
> > Can you detail what "horrible" means? 
> 
> Say an RT scenario; you set up your machine with cgroups. You create a
> cpuset which is disjoint from the others, 

taskset. sys_sched_setaffinity().

> you frob around with the cpu
> cgroup, etc..
> 
> So once you're all done, you start your RT app into a cgroup.
> 
> But oh, fail, now you have to go muck about with ioctl()s to get the
> cache allocation cruft to work.

1) It's a command similar to taskset.

2) The cgroup interface as you propose it seems to go against the use case
indicated by Tejun, where applications set the cache allocation themselves.

(These two points are why I can't see the benefit of the cgroup
interface suggestion; please clarify.)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* cat cgroup interface proposal (non hierarchical) was Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-16  9:50         ` Peter Zijlstra
  2015-10-26 20:02           ` Marcelo Tosatti
@ 2015-11-02 22:20           ` Marcelo Tosatti
  1 sibling, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2015-11-02 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Fenghua Yu, H Peter Anvin, Ingo Molnar,
	linux-kernel, x86, Vikas Shivappa, Karen Noel, Paolo Bonzini,
	Luiz Capitulino, Clark Williams, Tejun Heo

On Fri, Oct 16, 2015 at 11:50:22AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 15, 2015 at 11:28:52PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 15, 2015 at 01:36:14PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 13, 2015 at 06:31:27PM -0300, Marcelo Tosatti wrote:
> > > > I am rewriting the interface with ioctls, with commands similar to the
> > > > syscall interface proposed.
> > > 
> > > Which is horrible for other use cases. I really don't see the problem
> > > with the cgroup stuff.
> > 
> > Can you detail what "horrible" means? 
> 
> Say an RT scenario; you set up your machine with cgroups. You create a
> cpuset which is disjoint from the others, you frob around with the cpu
> cgroup, etc..
> 
> So once you're all done, you start your RT app into a cgroup.
> 
> But oh, fail, now you have to go muck about with ioctl()s to get the
> cache allocation cruft to work.

Peter, what follows is your cgroup proposal (extended), but
at the end there is a point about the impossibility of sharing cache
between tasks with this cgroup interface, which IMO renders it unusable
(as it blocks any threads from sharing reserved cache).

If you have any ideas on how to circumvent this, they are appreciated.

A non-hierarchical cgroup CAT interface proposal follows. Thanks to some of the
CC'ed Red Hat folks for early comments.

cgroup CAT interface (non hierarchical):
---------------------------------------

0) Main directory files:

cat_hw_info
-----------
CAT HW information: CBM length, CDP supported, etc.
Information is reported per socket, as sockets can have
different configurations. Perhaps this should live in
sysfs.

1) Sub-directories represent cache reservations (size,type).

mkdir cache_reservation_for_forwarding_app
cd cache_reservation_for_forwarding_app
echo "80KB" > size
echo "data_and_code" > type
echo "socketmask=0xfffff" > socketmask (optional)
echo "1" > enable_reservation
echo "pid-of-forwarding-main-thread pid-of-helper-thread ..." > tasks

Files:

type
----------------
{data_and_code, only_code, only_data}. Type of
L3 CAT cache allocation to use. only_code and only_data are only
supported on CDP-capable processors.

size
----
size of L3 cache reservation.

rounding
--------
{round_down,round_up}: whether to round the allocation size
(in kbytes) up or down to the cache-way size.

Default: round_up

socketmask
----------
Mask of sockets where the reservation is in effect.
A zero bit means that, on that socket, the task will not have
the L3 cache portion referenced by this cgroup reserved.
Default: all sockets set.

enable
------
Allocate reservation with parameters set above.

When a reservation is enabled, it reserves L3 cache
space on any socket that is specified in "socketmask".

After the cgroup has been enabled by writing "1" to the
"enable" file, only the "tasks" file can be modified.
To change the size of a cgroup reservation, recreate the directory.

tasks
-----

Contains the list of tasks which use this cache reservation.

Error reporting
---------------

Errors are reported in response to writes as appropriate:
for example, writing 1 to "enable" when there is not enough space
for "socketmask" would return -ENOSPC;
writing to "enable" without size being set would return -EINVAL; etc.

Listing
-------
To list which reservations are in place, search for subdirectories
where the "enable" file has value 1.

Semantics: A task has a guaranteed cache reservation on any CPU where it
is scheduled in, for the lifetime of the cgroup, as long as that task is
not attached to further cgroups.

That is, a task belonging to cgroup-A can have its cache reservation
invalidated when attached to cgroup-B (reasoning: it might be necessary
to reallocate the CBMs to respect the contiguous-bits restriction of
the CAT HW interface).
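
To make the HW restriction above concrete, here is a minimal sketch of
the kind of contiguity check the kernel has to apply to any CBM
(hypothetical helper, not taken from the posted patches):

#include <linux/bitops.h>

/*
 * CAT requires the capacity bitmask to be a single contiguous run of
 * set bits.  max_cbm_len is the CPUID-reported mask length.
 */
static bool cbm_is_contiguous(unsigned long cbm, unsigned int max_cbm_len)
{
	unsigned long first_bit, first_zero;

	first_bit = find_first_bit(&cbm, max_cbm_len);
	if (first_bit == max_cbm_len)	/* an empty mask is invalid */
		return false;

	first_zero = find_next_zero_bit(&cbm, max_cbm_len, first_bit);

	/* no set bit may follow the first clear bit after the run */
	return find_next_bit(&cbm, max_cbm_len, first_zero) == max_cbm_len;
}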


-------
BLOCKER
-------

Can't use cgroups for CAT because:

"Control Groups extends the kernel as follows:

 - Each task in the system has a reference-counted pointer to a
   css_set.

 - A css_set contains a set of reference-counted pointers to
   cgroup_subsys_state objects, one for each cgroup subsystem
   registered in the system."

You need a task to be part of two cgroups at one time, 
to support the following configuration:

Task-A: 70% of cache reserved exclusively (reservation-0).
        20% of cache reserved (reservation-1).

Task-B: 20% of cache reserved (reservation-1).

Unless reservations are created separately, then added to cgroups:

mount -t cgroup ... /../catcgroup/
cd /../catcgroup/
# create reservations
cd reservations
mkdir reservation-1
cd reservation-1
echo "80K" > size
echo "socketmask" > ...
echo "1" > enable
cd ..
mkdir reservation-2
cd reservation-2
echo "160K" > size
echo "socketmask" > ...
echo "1" > enable
cd ..
# attach reservation to cgroups
cd /../catcgroup/
mkdir cgroup-for-threaded-app
cd cgroup-for-threaded-app
echo reservation-1 reservation-2 > reservations
echo "mainthread" > tasks
cd ..
mkdir cgroup-for-helper-thread
cd cgroup-for-helper-thread
echo reservation-1 > reservations
echo "helperthread" > tasks
cd ..

This way mainthread and helperthread can share "reservation-1".

But this is abusing cgroups in a way they have not been designed for.
Who is going to maintain the linkage between reservations and cgroups?



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (11 preceding siblings ...)
  2015-10-11 19:50 ` [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Thomas Gleixner
@ 2015-11-04 14:42 ` Luiz Capitulino
  2015-11-04 14:57   ` Thomas Gleixner
  2015-11-05  2:19 ` [PATCH 1/2] x86/intel_rdt,intel_cqm: Remove build dependency of RDT code on CQM code David Carrillo-Cisneros
  13 siblings, 1 reply; 42+ messages in thread
From: Luiz Capitulino @ 2015-11-04 14:42 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Thu,  1 Oct 2015 23:09:34 -0700
Fenghua Yu <fenghua.yu@intel.com> wrote:

> This series has some preparatory patches and Intel cache allocation
> support.

Ping? What's the status of this series?

We badly need this series for KVM-RT workloads. I did try it and it
seems to work but, apart from small fixable issues which I'll point out
in replies to the specific patches, there are some design issues on
which I need some clarification. They are, in order of relevance:

 o Cache reservations are global to all NUMA nodes

   CAT is mostly intended for real-time and high performance
   computing. For both of them the most common setup is to
   pin your threads to specific cores on a specific NUMA node.

   So, suppose I have two HPC threads pinned to specific cores
   on node1. I want to reserve 80% of the L3 cache for those
   threads. With the current patches I'd do this:

    1. Create a "all-tasks" cgroup which can only access 20% of
       the cache
    2. Create a "hpc" cgroup which can access 80% of the cache
    3. Move my HPC threads to "hpc" and all the other threads to
       "all-tasks"

   This has the intended behavior on node1: the "hpc" threads
   will write into 80% of the L3 cache and any "all-tasks" threads
   executing there will only write into 20% of the cache.

   However, this is also true for node0! So, the "all-tasks"
   threads can only write into 20% of the cache in node0 even
   though "hpc" threads will never execute there.

   Is this intended by design? Like, is this a hardware limitation
   (given that the IA32_L3_MASK_n MSRs are programmed globally anyway)
   or maybe a way to enforce cache coherence?

   I was wondering if we could have masks per NUMA node, applied
   to a process whenever it migrates among NUMA nodes (a rough
   sketch of what I mean is at the end of this mail).

 o How does this feature apply to kernel threads?

   I'm just unable to move kernel threads out of the root
   cgroup. This means that kernel threads can always write
   into all of the cache no matter what the reservation scheme is.

   Is this intended by design? Why? Unless I'm missing
   something, reservations could and should be applied to
   kernel threads as well.

 o You can't change the root cgroup's CBM

   I can understand this makes the implementation a lot simpler.
   However, the reality is that there are way too few CBMs
   and losing one for the root group seems like a waste.

   Can we change this or is there strong reasons not to do so?

 o cgroups hierarchy is limited by the number of CBMs

   Today on my Haswell system, this means that I can only have 3
   directories in my cgroups hierarchy. If the number of CBMs
   is expected to grow in future processors, then I think having
   this feature as cgroups makes sense. However, if we're still
   going to be this limited in terms of directory structure, then
   it seems a bit of an overkill to me to have this as cgroups.
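
Going back to the first item, here is roughly what I mean by per-node
masks: program IA32_L3_MASK_n only on the CPUs of one package instead
of globally. Untested sketch with hypothetical helpers; the MSR base is
taken from my reading of the SDM:

#include <linux/smp.h>
#include <asm/msr.h>

#define IA32_L3_CBM_BASE	0xc90	/* IA32_L3_MASK_0 */

struct cbm_update {
	u32 closid;
	u64 cbm;
};

static void __update_l3_mask(void *info)
{
	struct cbm_update *u = info;

	wrmsrl(IA32_L3_CBM_BASE + u->closid, u->cbm);
}

/*
 * Apply a CBM for one closid on a single package only.  The mask MSRs
 * are package-scoped, so writing them on one CPU of the target package
 * is enough; pkg_cpu is any online CPU on that package.
 */
static void update_l3_mask_on_package(int pkg_cpu, u32 closid, u64 cbm)
{
	struct cbm_update u = { .closid = closid, .cbm = cbm };

	smp_call_function_single(pkg_cpu, __update_l3_mask, &u, 1);
}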

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection
  2015-10-02  6:09 ` [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
@ 2015-11-04 14:51   ` Luiz Capitulino
  0 siblings, 0 replies; 42+ messages in thread
From: Luiz Capitulino @ 2015-11-04 14:51 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Thu,  1 Oct 2015 23:09:38 -0700
Fenghua Yu <fenghua.yu@intel.com> wrote:

> This patch includes CPUID enumeration routines for Cache allocation and
> new values to track resources to the cpuinfo_x86 structure.
> 
> Cache allocation provides a way for the Software (OS/VMM) to restrict
> cache allocation to a defined 'subset' of cache which may be overlapping
> with other 'subsets'. This feature is used when allocating a line in
> cache ie when pulling new data into the cache. The programming of the
> hardware is done via programming MSRs (model specific registers).
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
>  arch/x86/include/asm/cpufeature.h |  6 +++++-
>  arch/x86/include/asm/processor.h  |  3 +++
>  arch/x86/kernel/cpu/Makefile      |  1 +
>  arch/x86/kernel/cpu/common.c      | 15 +++++++++++++++
>  arch/x86/kernel/cpu/intel_rdt.c   | 40 +++++++++++++++++++++++++++++++++++++++
>  init/Kconfig                      | 12 ++++++++++++
>  6 files changed, 76 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/kernel/cpu/intel_rdt.c
> 
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index e6cf2ad..4e93006 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -12,7 +12,7 @@
>  #include <asm/disabled-features.h>
>  #endif
>  
> -#define NCAPINTS	13	/* N 32-bit words worth of info */
> +#define NCAPINTS	14	/* N 32-bit words worth of info */
>  #define NBUGINTS	1	/* N 32-bit bug flags */
>  
>  /*
> @@ -231,6 +231,7 @@
>  #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
>  #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
>  #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
> +#define X86_FEATURE_RDT		( 9*32+15) /* Resource Allocation */
>  #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
>  #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
>  #define X86_FEATURE_ADX		( 9*32+19) /* The ADCX and ADOX instructions */
> @@ -255,6 +256,9 @@
>  /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
>  #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
>  
> +/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
> +#define X86_FEATURE_CAT_L3	(13*32 + 1) /* Cache Allocation L3 */
> +
>  /*
>   * BUG word(s)
>   */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 19577dd..b932ec4 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -120,6 +120,9 @@ struct cpuinfo_x86 {
>  	int			x86_cache_occ_scale;	/* scale to bytes */
>  	int			x86_power;
>  	unsigned long		loops_per_jiffy;
> +	/* Cache Allocation values: */
> +	u16			x86_cache_max_cbm_len;
> +	u16			x86_cache_max_closid;
>  	/* cpuid returned max cores value: */
>  	u16			 x86_max_cores;
>  	u16			apicid;
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index 4eb065c..5e873c7 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -50,6 +50,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_msr.o
>  obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_msr.o
>  endif
>  
> +obj-$(CONFIG_INTEL_RDT)			+= intel_rdt.o
>  
>  obj-$(CONFIG_X86_MCE)			+= mcheck/
>  obj-$(CONFIG_MTRR)			+= mtrr/
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index de22ea7..026a416 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -654,6 +654,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
>  		}
>  	}
>  
> +	/* Additional Intel-defined flags: level 0x00000010 */
> +	if (c->cpuid_level >= 0x00000010) {
> +		u32 eax, ebx, ecx, edx;
> +
> +		cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +		c->x86_capability[13] = ebx;
> +
> +		if (cpu_has(c, X86_FEATURE_CAT_L3)) {
> +
> +			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +			c->x86_cache_max_closid = edx + 1;
> +			c->x86_cache_max_cbm_len = eax + 1;
> +		}

Both registers contain reserved bits. Shouldn't they be masked
out as recommended by the SDM?

Also, I'm surprised you're not storing ebx. How is it going to be
possible to create a CBM to fully isolate L3 entries without
knowing which entries are being used by the hardware?
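
Something along these lines, maybe?  (Untested; the field widths are
from my reading of SDM CPUID leaf 0x10, sub-leaf 1, and
x86_cache_shareable_map is a made-up field name, just to illustrate
storing ebx.)

		cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
		/* EAX[4:0]: CBM length - 1; EDX[15:0]: max COS number */
		c->x86_cache_max_cbm_len = (eax & 0x1f) + 1;
		c->x86_cache_max_closid  = (edx & 0xffff) + 1;
		/* EBX: bitmask of allocation units shared with other agents */
		c->x86_cache_shareable_map = ebx;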

> +	}
> +
>  	/* AMD-defined flags: level 0x80000001 */
>  	xlvl = cpuid_eax(0x80000000);
>  	c->extended_cpuid_level = xlvl;
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> new file mode 100644
> index 0000000..f49e970
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -0,0 +1,40 @@
> +/*
> + * Resource Director Technology(RDT)
> + * - Cache Allocation code.
> + *
> + * Copyright (C) 2014 Intel Corporation
> + *
> + * 2015-05-25 Written by
> + *    Vikas Shivappa <vikas.shivappa@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * More information about RDT be found in the Intel (R) x86 Architecture
> + * Software Developer Manual June 2015, volume 3, section 17.15.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/slab.h>
> +#include <linux/err.h>
> +
> +static int __init intel_rdt_late_init(void)
> +{
> +	struct cpuinfo_x86 *c = &boot_cpu_data;
> +
> +	if (!cpu_has(c, X86_FEATURE_CAT_L3))
> +		return -ENODEV;
> +
> +	pr_info("Intel cache allocation detected\n");
> +
> +	return 0;
> +}
> +
> +late_initcall(intel_rdt_late_init);
> diff --git a/init/Kconfig b/init/Kconfig
> index c24b6f7..9fe3f11 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -938,6 +938,18 @@ menuconfig CGROUPS
>  
>  	  Say N if unsure.
>  
> +config INTEL_RDT
> +	bool "Intel Resource Director Technology support"
> +	depends on X86_64 && CPU_SUP_INTEL
> +	help
> +	  This option provides support for Cache allocation which is a
> +	  sub-feature of Intel Resource Director  Technology(RDT).
> +	  Current implementation supports L3 cache allocation.
> +	  Using this feature a user can specify the amount of L3 cache space
> +	  into which an application can fill.
> +
> +	  Say N if unsure.
> +
>  if CGROUPS
>  
>  config CGROUP_DEBUG


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 05/11] x86/intel_rdt: Add Class of service management
  2015-10-02  6:09 ` [PATCH V15 05/11] x86/intel_rdt: Add Class of service management Fenghua Yu
@ 2015-11-04 14:55   ` Luiz Capitulino
  0 siblings, 0 replies; 42+ messages in thread
From: Luiz Capitulino @ 2015-11-04 14:55 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Thu,  1 Oct 2015 23:09:39 -0700
Fenghua Yu <fenghua.yu@intel.com> wrote:

> Adds some data-structures and APIs to support Class of service
> management(closid). There is a new clos_cbm table which keeps a 1:1
> mapping between closid and capacity bit mask (cbm)
> and a count of usage of closid. Each task would be associated with a
> Closid at a time and this patch adds a new field closid to task_struct
> to keep track of the same.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
>  arch/x86/include/asm/intel_rdt.h | 12 ++++++
>  arch/x86/kernel/cpu/intel_rdt.c  | 82 +++++++++++++++++++++++++++++++++++++++-
>  include/linux/sched.h            |  3 ++
>  3 files changed, 95 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/intel_rdt.h
> 
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> new file mode 100644
> index 0000000..88b7643
> --- /dev/null
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -0,0 +1,12 @@
> +#ifndef _RDT_H_
> +#define _RDT_H_
> +
> +#ifdef CONFIG_INTEL_RDT
> +
> +struct clos_cbm_table {
> +	unsigned long l3_cbm;
> +	unsigned int clos_refcnt;
> +};

Isn't this a single entry?

> +
> +#endif
> +#endif
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index f49e970..d79213a 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -24,17 +24,95 @@
>  
>  #include <linux/slab.h>
>  #include <linux/err.h>
> +#include <asm/intel_rdt.h>
> +
> +/*
> + * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
> + */
> +static struct clos_cbm_table *cctable;
> +/*
> + * closid availability bit map.
> + */
> +unsigned long *closmap;
> +static DEFINE_MUTEX(rdt_group_mutex);
> +
> +static inline void closid_get(u32 closid)
> +{
> +	struct clos_cbm_table *cct = &cctable[closid];
> +
> +	lockdep_assert_held(&rdt_group_mutex);
> +
> +	cct->clos_refcnt++;
> +}
> +
> +static int closid_alloc(u32 *closid)
> +{
> +	u32 maxid;
> +	u32 id;
> +
> +	lockdep_assert_held(&rdt_group_mutex);
> +
> +	maxid = boot_cpu_data.x86_cache_max_closid;
> +	id = find_first_zero_bit(closmap, maxid);
> +	if (id == maxid)
> +		return -ENOSPC;

This causes echo to print "No space left on device" when you run
out of CBMs. I think -ENOMEM makes more sense.

> +
> +	set_bit(id, closmap);
> +	closid_get(id);
> +	*closid = id;
> +
> +	return 0;
> +}
> +
> +static inline void closid_free(u32 closid)
> +{
> +	clear_bit(closid, closmap);
> +	cctable[closid].l3_cbm = 0;
> +}
> +
> +static void closid_put(u32 closid)
> +{
> +	struct clos_cbm_table *cct = &cctable[closid];
> +
> +	lockdep_assert_held(&rdt_group_mutex);
> +	if (WARN_ON(!cct->clos_refcnt))
> +		return;
> +
> +	if (!--cct->clos_refcnt)
> +		closid_free(closid);
> +}

This is very minor, but IMHO you got closid_put() and closid_free()
naming backwards. closid_free() is the opposite operation of
closid_alloc(). Likewise, closid_put() is the opposite of
closid_get().

>  
>  static int __init intel_rdt_late_init(void)
>  {
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
> +	u32 maxid, max_cbm_len;
> +	int err = 0, size;
>  
>  	if (!cpu_has(c, X86_FEATURE_CAT_L3))
>  		return -ENODEV;
>  
> -	pr_info("Intel cache allocation detected\n");
> +	maxid = c->x86_cache_max_closid;
> +	max_cbm_len = c->x86_cache_max_cbm_len;
>  
> -	return 0;
> +	size = maxid * sizeof(struct clos_cbm_table);
> +	cctable = kzalloc(size, GFP_KERNEL);
> +	if (!cctable) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	size = BITS_TO_LONGS(maxid) * sizeof(long);
> +	closmap = kzalloc(size, GFP_KERNEL);
> +	if (!closmap) {
> +		kfree(cctable);
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	pr_info("Intel cache allocation enabled\n");
> +out_err:
> +
> +	return err;
>  }
>  
>  late_initcall(intel_rdt_late_init);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b7b9501..24bfbac 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1668,6 +1668,9 @@ struct task_struct {
>  	/* cg_list protected by css_set_lock and tsk->alloc_lock */
>  	struct list_head cg_list;
>  #endif
> +#ifdef CONFIG_INTEL_RDT
> +	u32 closid;
> +#endif
>  #ifdef CONFIG_FUTEX
>  	struct robust_list_head __user *robust_list;
>  #ifdef CONFIG_COMPAT


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-11-04 14:42 ` Luiz Capitulino
@ 2015-11-04 14:57   ` Thomas Gleixner
  2015-11-04 15:12     ` Luiz Capitulino
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2015-11-04 14:57 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Wed, 4 Nov 2015, Luiz Capitulino wrote:

> On Thu,  1 Oct 2015 23:09:34 -0700
> Fenghua Yu <fenghua.yu@intel.com> wrote:
> 
> > This series has some preparatory patches and Intel cache allocation
> > support.
> 
> Ping? What's the status of this series?

We still need to agree on the user space interface which is the
hardest part of it....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-11-04 14:57   ` Thomas Gleixner
@ 2015-11-04 15:12     ` Luiz Capitulino
  2015-11-04 15:28       ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Luiz Capitulino @ 2015-11-04 15:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Wed, 4 Nov 2015 15:57:41 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> 
> > On Thu,  1 Oct 2015 23:09:34 -0700
> > Fenghua Yu <fenghua.yu@intel.com> wrote:
> > 
> > > This series has some preparatory patches and Intel cache allocation
> > > support.
> > 
> > Ping? What's the status of this series?
> 
> We still need to agree on the user space interface which is the
> hardest part of it....

My understanding is that two interfaces have been proposed: the cgroups
one and an API based on syscalls or ioctls.

Are those proposals mutually exclusive? What about having the cgroups one
merged IFF it's useful, and having the syscall API later if really
needed?

I don't want to make the wrong decision, but the cgroups interface is
here. Holding it while we discuss a perfect interface that doesn't
even exist will just do users a disservice.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-11-04 15:12     ` Luiz Capitulino
@ 2015-11-04 15:28       ` Thomas Gleixner
  2015-11-04 15:35         ` Luiz Capitulino
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2015-11-04 15:28 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> On Wed, 4 Nov 2015 15:57:41 +0100 (CET)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> > 
> > > On Thu,  1 Oct 2015 23:09:34 -0700
> > > Fenghua Yu <fenghua.yu@intel.com> wrote:
> > > 
> > > > This series has some preparatory patches and Intel cache allocation
> > > > support.
> > > 
> > > Ping? What's the status of this series?
> > 
> > We still need to agree on the user space interface which is the
> > hardest part of it....
> 
> My understanding is that two interfaces have been proposed: the cgroups
> one and an API based on syscalls or ioctls.
> 
> Are those proposals mutual exclusive? What about having the cgroups one
> merged IFF it's useful, and having the syscall API later if really
> needed?
> 
> I don't want to make the wrong decision, but the cgroups interface is
> here. Holding it while we discuss a perfect interface that doesn't
> even exist will just do a bad service for users.

Well, no. We do not just introduce a random user space ABI, simply
because we have to support it forever.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-11-04 15:28       ` Thomas Gleixner
@ 2015-11-04 15:35         ` Luiz Capitulino
  2015-11-04 15:50           ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Luiz Capitulino @ 2015-11-04 15:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Wed, 4 Nov 2015 16:28:04 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> > On Wed, 4 Nov 2015 15:57:41 +0100 (CET)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> > > 
> > > > On Thu,  1 Oct 2015 23:09:34 -0700
> > > > Fenghua Yu <fenghua.yu@intel.com> wrote:
> > > > 
> > > > > This series has some preparatory patches and Intel cache allocation
> > > > > support.
> > > > 
> > > > Ping? What's the status of this series?
> > > 
> > > We still need to agree on the user space interface which is the
> > > hardest part of it....
> > 
> > My understanding is that two interfaces have been proposed: the cgroups
> > one and an API based on syscalls or ioctls.
> > 
> > Are those proposals mutual exclusive? What about having the cgroups one
> > merged IFF it's useful, and having the syscall API later if really
> > needed?
> > 
> > I don't want to make the wrong decision, but the cgroups interface is
> > here. Holding it while we discuss a perfect interface that doesn't
> > even exist will just do a bad service for users.
> 
> Well, no. We do not just introduce a random user space ABI simply
> because we have to support it forever.

I don't think it's random; it's been under discussion for a long time and
Peter seems to be in favor of it.

But I'm all for progress here whatever route we take. In that regard,
what's your opinion on the best way to move forward?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
  2015-11-04 15:35         ` Luiz Capitulino
@ 2015-11-04 15:50           ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2015-11-04 15:50 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Fenghua Yu, H Peter Anvin, Ingo Molnar, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Marcelo Tosatti, tj, riel

On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> On Wed, 4 Nov 2015 16:28:04 +0100 (CET)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> > > On Wed, 4 Nov 2015 15:57:41 +0100 (CET)
> > > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > 
> > > > On Wed, 4 Nov 2015, Luiz Capitulino wrote:
> > > > 
> > > > > On Thu,  1 Oct 2015 23:09:34 -0700
> > > > > Fenghua Yu <fenghua.yu@intel.com> wrote:
> > > > > 
> > > > > > This series has some preparatory patches and Intel cache allocation
> > > > > > support.
> > > > > 
> > > > > Ping? What's the status of this series?
> > > > 
> > > > We still need to agree on the user space interface which is the
> > > > hardest part of it....
> > > 
> > > My understanding is that two interfaces have been proposed: the cgroups
> > > one and an API based on syscalls or ioctls.
> > > 
> > > Are those proposals mutual exclusive? What about having the cgroups one
> > > merged IFF it's useful, and having the syscall API later if really
> > > needed?
> > > 
> > > I don't want to make the wrong decision, but the cgroups interface is
> > > here. Holding it while we discuss a perfect interface that doesn't
> > > even exist will just do a bad service for users.
> > 
> > Well, no. We do not just introduce a random user space ABI simply
> > because we have to support it forever.
> 
> I don't think it's random, it's in discussion for a long time and
> Peter seems to be in favor of it.

It does not matter whether it's been under discussion for a long time. We have
requests for functionality which cannot be covered with that
interface.
 
> But I'm all for progress here whatever route we take. In that regard,
> what's your opinion on the best way to move forward?

Talk to the people in your very company, who have a different
opinion and requests for functionality which cannot be handled by the
currently proposed interface. You yourself had a list of things you want
to see handled.

So feel free to come up with patches which implement that instead of
telling us that your company needs it badly for some reason.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/2] x86/intel_rdt,intel_cqm: Remove build dependency of RDT code on CQM code.
  2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
                   ` (12 preceding siblings ...)
  2015-11-04 14:42 ` Luiz Capitulino
@ 2015-11-05  2:19 ` David Carrillo-Cisneros
  2015-11-05  2:19   ` [PATCH 2/2] x86/intel_rdt: Fix bug in initialization, locks and write cbm mask David Carrillo-Cisneros
  13 siblings, 1 reply; 42+ messages in thread
From: David Carrillo-Cisneros @ 2015-11-05  2:19 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Stephane Eranian, Paul Turner, linux-kernel, David Carrillo-Cisneros

Minor code move to remove the build dependency of RDT code on
perf_event_intel_cqm.c.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/include/asm/pqr_common.h          |  3 +++
 arch/x86/kernel/cpu/Makefile               |  6 +++++-
 arch/x86/kernel/cpu/perf_event_intel_cqm.c |  8 --------
 arch/x86/kernel/cpu/pqr_common.c           | 10 ++++++++++
 4 files changed, 18 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/pqr_common.c

diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
index 11e985c..228e943 100644
--- a/arch/x86/include/asm/pqr_common.h
+++ b/arch/x86/include/asm/pqr_common.h
@@ -1,6 +1,9 @@
 #ifndef _X86_RDT_H_
 #define _X86_RDT_H_
 
+#include <linux/types.h>
+#include <asm/percpu.h>
+
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
 /**
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index b3292a4..5eb0f6e 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,7 +39,8 @@ obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o perf_event_intel_cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= pqr_common.o perf_event_intel_cqm.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o perf_event_intel_bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_cstate.o
 
@@ -49,6 +50,9 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= perf_event_intel_uncore.o \
 					   perf_event_intel_uncore_nhmex.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_msr.o
 obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_msr.o
+
+else
+obj-$(CONFIG_INTEL_RDT)			+= pqr_common.o
 endif
 
 obj-$(CONFIG_INTEL_RDT)			+= intel_rdt.o
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 04a696f..eee960d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -17,14 +17,6 @@ static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
 /*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-
-/*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
  * Also protects event->hw.cqm_rmid
  *
diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
new file mode 100644
index 0000000..abcb432
--- /dev/null
+++ b/arch/x86/kernel/cpu/pqr_common.c
@@ -0,0 +1,10 @@
+#include <asm/pqr_common.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Both functions which modify the state
+ * (intel_cqm_event_start and intel_cqm_event_stop) are called with
+ * interrupts disabled, which is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
-- 
2.6.0.rc2.230.g3dd15c0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/2] x86/intel_rdt: Fix bug in initialization, locks and write cbm mask.
  2015-11-05  2:19 ` [PATCH 1/2] x86/intel_rdt,intel_cqm: Remove build dependency of RDT code on CQM code David Carrillo-Cisneros
@ 2015-11-05  2:19   ` David Carrillo-Cisneros
  0 siblings, 0 replies; 42+ messages in thread
From: David Carrillo-Cisneros @ 2015-11-05  2:19 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Stephane Eranian, Paul Turner, linux-kernel, David Carrillo-Cisneros

Fix bugs in the patch series "x86: Intel Cache Allocation Technology Support"
by Fenghua Yu. Changes are:
  1) Instruct task_css_check not to print an unnecessary lockdep
    warning when called from __intel_rdt_sched_in, since all callers
    are already synchronized by task_rq_lock().
  2) Add missing mutex_locks surrounding accesses to clos_cbm_table.
  3) Properly initialize online cpus in intel_rdt_late_init by using
    intel_rdt_cpu_start() instead of rdt_cpumask_update().
  4) Make cbm_validate_rdt_cgroup actually use each child's mask
    when validating children's masks (as it should).

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/include/asm/intel_rdt.h | 12 +++++++++---
 arch/x86/kernel/cpu/intel_rdt.c  | 24 ++++++++++++++++++------
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index fbe1e00..f487a93 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -37,11 +37,17 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
 }
 
 /*
- * Return rdt group to which this task belongs.
+ * Return rdt group to which this task belongs without checking for lockdep.
  */
-static inline struct intel_rdt *task_rdt(struct task_struct *task)
+static inline struct intel_rdt *task_rdt_nocheck(struct task_struct *task)
 {
-	return css_rdt(task_css(task, intel_rdt_cgrp_id));
+	/*
+	 * The checks for lockdep performed by task_subsys_state are not
+	 * necessary when callers are properly synchronized by other locks.
+	 * If the caller for this function is not properly synchronized
+	 * use task_css instead.
+	 */
+	return css_rdt(task_css_check(task, intel_rdt_cgrp_id, true));
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index cb4d2ef..d5fa76f 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -115,7 +115,13 @@ static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
 void __intel_rdt_sched_in(void *dummy)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	struct intel_rdt *ir = task_rdt(current);
+
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we call task_rdt_nocheck that avoids
+	 * the lockdep checks.
+	 */
+	struct intel_rdt *ir = task_rdt_nocheck(current);
 
 	if (ir->closid == state->closid)
 		return;
@@ -403,7 +409,9 @@ static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v)
 	struct intel_rdt *ir = css_rdt(seq_css(m));
 	unsigned long l3_cbm = 0;
 
+	mutex_lock(&rdt_group_mutex);
 	clos_cbm_table_read(ir->closid, &l3_cbm);
+	mutex_unlock(&rdt_group_mutex);
 	seq_printf(m, "%08lx\n", l3_cbm);
 
 	return 0;
@@ -431,7 +439,7 @@ static int cbm_validate_rdt_cgroup(struct intel_rdt *ir, unsigned long cbmvalue)
 	rcu_read_lock();
 	rdt_for_each_child(css, ir) {
 		c = css_rdt(css);
-		clos_cbm_table_read(par->closid, &cbm_tmp);
+		clos_cbm_table_read(c->closid, &cbm_tmp);
 		if (!bitmap_subset(&cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
 			rcu_read_unlock();
 			err = -EINVAL;
@@ -504,7 +512,6 @@ static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
 	closcbm_map_dump();
 out:
 	mutex_unlock(&rdt_group_mutex);
-
 	return err;
 }
 
@@ -513,12 +520,16 @@ static void rdt_cgroup_init(void)
 	int max_cbm_len = boot_cpu_data.x86_cache_max_cbm_len;
 	u32 closid;
 
+	mutex_lock(&rdt_group_mutex);
+
 	closid_alloc(&closid);
 
 	WARN_ON(closid != 0);
 
 	rdt_root_group.closid = closid;
 	clos_cbm_table_update(closid, (1ULL << max_cbm_len) - 1);
+
+	mutex_unlock(&rdt_group_mutex);
 }
 
 static int __init intel_rdt_late_init(void)
@@ -552,15 +563,16 @@ static int __init intel_rdt_late_init(void)
 	cpu_notifier_register_begin();
 
 	for_each_online_cpu(i)
-		rdt_cpumask_update(i);
-
+		intel_rdt_cpu_start(i);
 	__hotcpu_notifier(intel_rdt_cpu_notifier, 0);
 
 	cpu_notifier_register_done();
+
 	rdt_cgroup_init();
 
 	static_key_slow_inc(&rdt_enable_key);
-	pr_info("Intel cache allocation enabled\n");
+	pr_info("Intel cache allocation enabled\n"
+		"max_closid:%u, max_cbm_len:%u\n", maxid, max_cbm_len);
 out_err:
 
 	return err;
-- 
2.6.0.rc2.230.g3dd15c0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
@ 2015-11-18 20:58   ` Marcelo Tosatti
  2015-11-18 21:27   ` Marcelo Tosatti
  2015-11-18 22:15   ` Marcelo Tosatti
  2 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2015-11-18 20:58 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> directory is associated with a class of service id(closid). To map a
> task with closid during scheduling, this patch removes the closid field
> from task_struct and uses the already existing 'cgroups' field in
> task_struct.
> 
> The cgroup has a file 'l3_cbm' which represents the L3 cache capacity
> bitmask(CBM). The CBM is global for the whole system currently. The
> capacity bitmask needs to have only contiguous bits set and number of
> bits that can be set is less than the max bits that can be set. The
> tasks belonging to a cgroup get to fill in the L3 cache represented by
> the capacity bitmask of the cgroup. For ex: if the max bits in the CBM
> is 10 and the cache size is 10MB, each bit represents 1MB of cache
> capacity.
> 
> Root cgroup always has all the bits set in the l3_cbm. User can create
> more cgroups with mkdir syscall. By default the child cgroups inherit
> the capacity bitmask(CBM) from parent. User can change the CBM specified
> in hex for each cgroup. Each unique bitmask is associated with a class
> of service ID and an -ENOSPC is returned once we run out of
> closids.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
>  arch/x86/include/asm/intel_rdt.h |  37 +++++++-
>  arch/x86/kernel/cpu/intel_rdt.c  | 194 +++++++++++++++++++++++++++++++++++++--
>  include/linux/cgroup_subsys.h    |   4 +
>  include/linux/sched.h            |   3 -
>  init/Kconfig                     |   4 +-
>  5 files changed, 229 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index afb6da3..fbe1e00 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -3,6 +3,7 @@
>  
>  #ifdef CONFIG_INTEL_RDT
>  
> +#include <linux/cgroup.h>
>  #include <linux/jump_label.h>
>  
>  #define MAX_CBM_LENGTH			32
> @@ -12,20 +13,54 @@
>  extern struct static_key rdt_enable_key;
>  void __intel_rdt_sched_in(void *dummy);
>  
> +struct intel_rdt {
> +	struct cgroup_subsys_state css;
> +	u32 closid;
> +};
> +
>  struct clos_cbm_table {
>  	unsigned long l3_cbm;
>  	unsigned int clos_refcnt;
>  };
>  
>  /*
> + * Return rdt group corresponding to this container.
> + */
> +static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct intel_rdt, css) : NULL;
> +}
> +
> +static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
> +{
> +	return css_rdt(ir->css.parent);
> +}
> +
> +/*
> + * Return rdt group to which this task belongs.
> + */
> +static inline struct intel_rdt *task_rdt(struct task_struct *task)
> +{
> +	return css_rdt(task_css(task, intel_rdt_cgrp_id));
> +}
> +
> +/*
>   * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>   *
>   * Following considerations are made so that this has minimal impact
>   * on scheduler hot path:
>   * - This will stay as no-op unless we are running on an Intel SKU
>   * which supports L3 cache allocation.
> + * - When support is present and enabled, does not do any
> + * IA32_PQR_MSR writes until the user starts really using the feature
> + * ie creates a rdt cgroup directory and assigns a cache_mask thats
> + * different from the root cgroup's cache_mask.
>   * - Caches the per cpu CLOSid values and does the MSR write only
> - * when a task with a different CLOSid is scheduled in.
> + * when a task with a different CLOSid is scheduled in. That
> + * means the task belongs to a different cgroup.
> + * - Closids are allocated so that different cgroup directories
> + * with same cache_mask gets the same CLOSid. This minimizes CLOSids
> + * used and reduces MSR write frequency.
>   */
>  static inline void intel_rdt_sched_in(void)
>  {
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index ecaf8e6..cb4d2ef 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -53,6 +53,10 @@ static cpumask_t tmp_cpumask;
>  static DEFINE_MUTEX(rdt_group_mutex);
>  struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
>  
> +static struct intel_rdt rdt_root_group;
> +#define rdt_for_each_child(pos_css, parent_ir)		\
> +	css_for_each_child((pos_css), &(parent_ir)->css)
> +
>  struct rdt_remote_data {
>  	int msr;
>  	u64 val;
> @@ -108,17 +112,16 @@ static inline bool cache_alloc_supported(struct cpuinfo_x86 *c)
>  	return false;
>  }
>  
> -
>  void __intel_rdt_sched_in(void *dummy)
>  {
>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> -	u32 closid = current->closid;
> +	struct intel_rdt *ir = task_rdt(current);
>  
> -	if (closid == state->closid)
> +	if (ir->closid == state->closid)
>  		return;
>  
> -	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
> -	state->closid = closid;
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->closid);

What if another CPU runs

intel_cache_alloc_cbm_write()
        if (cbm_search(cbmvalue, &closid)) {
               ir->closid = closid;

Here? Probably a spinlock is necessary.
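
Or, instead of a spinlock, the writer could force every CPU to reload
the closid after updating it. A rough, untested sketch of what I mean:

/*
 * Hypothetical writer-side helper: after publishing the new closid in
 * the cgroup, run __intel_rdt_sched_in() on all CPUs so that any stale
 * closid cached in IA32_PQR_ASSOC gets rewritten.
 */
static void rdt_sync_closid_all_cpus(void)
{
	on_each_cpu(__intel_rdt_sched_in, NULL, 1);
}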

> +	state->closid = ir->closid;
>  }
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
  2015-11-18 20:58   ` Marcelo Tosatti
@ 2015-11-18 21:27   ` Marcelo Tosatti
  2015-12-16 22:00     ` Yu, Fenghua
  2015-11-18 22:15   ` Marcelo Tosatti
  2 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-11-18 21:27 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> directory is associated with a class of service id(closid). To map a
> task with closid during scheduling, this patch removes the closid field
> from task_struct and uses the already existing 'cgroups' field in
> task_struct.
> 
> The cgroup has a file 'l3_cbm' which represents the L3 cache capacity
> bitmask(CBM). The CBM is global for the whole system currently. The
> capacity bitmask needs to have only contiguous bits set and number of
> bits that can be set is less than the max bits that can be set. The
> tasks belonging to a cgroup get to fill in the L3 cache represented by
> the capacity bitmask of the cgroup. For ex: if the max bits in the CBM
> is 10 and the cache size is 10MB, each bit represents 1MB of cache
> capacity.
> 
> Root cgroup always has all the bits set in the l3_cbm. User can create
> more cgroups with mkdir syscall. By default the child cgroups inherit
> the capacity bitmask(CBM) from parent. User can change the CBM specified
> in hex for each cgroup. Each unique bitmask is associated with a class
> of service ID and an -ENOSPC is returned once we run out of
> closids.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>


+	clos_cbm_table_read(ir->closid, &ccbm);
+	if (cbmvalue == ccbm)
+		goto out;
+
+	err = cbm_validate_rdt_cgroup(ir, cbmvalue);
+	if (err)
+		goto out;
+
+	/*
+	 * Try to get a reference for a different CLOSid and release the
+	 * reference to the current CLOSid.
+	 * Need to put down the reference here and get it back in case we
+	 * run out of closids. Otherwise we run into a problem when
+	 * we could be using the last closid that could have been available.
+	 */
+	closid_put(ir->closid);
+	if (cbm_search(cbmvalue, &closid)) {

Can't you move closid_put here?

+		ir->closid = closid;
+		closid_get(closid);
+	} else {
+		closid = ir->closid;

Variable unused.

+		err = closid_alloc(&ir->closid);
+		if (err) {
+			closid_get(ir->closid);
+			goto out;
+		}

This makes you cycle the closid when changing the cbm, which is not
necessary. (Not very important, but closid_put is unnerving because it
can possibly set l3_cbm to zero.)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
  2015-11-18 20:58   ` Marcelo Tosatti
  2015-11-18 21:27   ` Marcelo Tosatti
@ 2015-11-18 22:15   ` Marcelo Tosatti
  2015-12-14 22:58     ` Yu, Fenghua
  2 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2015-11-18 22:15 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa

On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> directory is associated with a class of service id(closid). To map a
> task with closid during scheduling, this patch removes the closid field
> from task_struct and uses the already existing 'cgroups' field in
> task_struct.
> 
> The cgroup has a file 'l3_cbm' which represents the L3 cache capacity
> bitmask(CBM). The CBM is global for the whole system currently. The
> capacity bitmask needs to have only contiguous bits set and number of
> bits that can be set is less than the max bits that can be set. The
> tasks belonging to a cgroup get to fill in the L3 cache represented by
> the capacity bitmask of the cgroup. For ex: if the max bits in the CBM
> is 10 and the cache size is 10MB, each bit represents 1MB of cache
> capacity.
> 
> Root cgroup always has all the bits set in the l3_cbm. User can create
> more cgroups with mkdir syscall. By default the child cgroups inherit
> the capacity bitmask(CBM) from parent. User can change the CBM specified
> in hex for each cgroup. Each unique bitmask is associated with a class
> of service ID and an -ENOSPC is returned once we run out of
> closids.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
>  arch/x86/include/asm/intel_rdt.h |  37 +++++++-
>  arch/x86/kernel/cpu/intel_rdt.c  | 194 +++++++++++++++++++++++++++++++++++++--
>  include/linux/cgroup_subsys.h    |   4 +
>  include/linux/sched.h            |   3 -
>  init/Kconfig                     |   4 +-
>  5 files changed, 229 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index afb6da3..fbe1e00 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -3,6 +3,7 @@
>  
>  #ifdef CONFIG_INTEL_RDT
>  
> +#include <linux/cgroup.h>
>  #include <linux/jump_label.h>
>  
>  #define MAX_CBM_LENGTH			32
> @@ -12,20 +13,54 @@
>  extern struct static_key rdt_enable_key;
>  void __intel_rdt_sched_in(void *dummy);
>  
> +struct intel_rdt {
> +	struct cgroup_subsys_state css;
> +	u32 closid;
> +};
> +
>  struct clos_cbm_table {
>  	unsigned long l3_cbm;
>  	unsigned int clos_refcnt;
>  };
>  
>  /*
> + * Return rdt group corresponding to this container.
> + */
> +static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct intel_rdt, css) : NULL;
> +}
> +
> +static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
> +{
> +	return css_rdt(ir->css.parent);
> +}
> +
> +/*
> + * Return rdt group to which this task belongs.
> + */
> +static inline struct intel_rdt *task_rdt(struct task_struct *task)
> +{
> +	return css_rdt(task_css(task, intel_rdt_cgrp_id));
> +}
> +
> +/*
>   * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>   *
>   * Following considerations are made so that this has minimal impact
>   * on scheduler hot path:
>   * - This will stay as no-op unless we are running on an Intel SKU
>   * which supports L3 cache allocation.
> + * - When support is present and enabled, does not do any
> + * IA32_PQR_MSR writes until the user starts really using the feature
> + * ie creates a rdt cgroup directory and assigns a cache_mask thats
> + * different from the root cgroup's cache_mask.
>   * - Caches the per cpu CLOSid values and does the MSR write only
> - * when a task with a different CLOSid is scheduled in.

Why is this even allowed? 

	socket CBM bits:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[ | | | | | | | | | |  |  |  |  |  ]

 x x x x x x x 				    
			 x  x  x  x

 x x x x x

cgroupA.bits = [ 0 - 6 ] cgroupB.bits = [ 10 - 14]  (level 1)
cgroupA-A.bits = [ 0 - 4 ]			    (level 2)

Two ways to create a cgroup with bits [ 0 - 4] set:

1) Create a cgroup C in level 1 with a different name.
Useful to have same cgroup with two different names.

2) Create a cgroup A-B under cgroup-A with bits [0 - 4].

It just creates confusion, having two or more cgroups under 
different levels of the hierarchy with the same bits set.
(can't see any organizational benefit).

Why not return -EINVAL ? Ah, cgroups are hierarchical, right.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-11-18 22:15   ` Marcelo Tosatti
@ 2015-12-14 22:58     ` Yu, Fenghua
  0 siblings, 0 replies; 42+ messages in thread
From: Yu, Fenghua @ 2015-12-14 22:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Luck, Tony

> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Wednesday, November 18, 2015 2:15 PM
> To: Yu, Fenghua <fenghua.yu@intel.com>
> Cc: H Peter Anvin <hpa@zytor.com>; Ingo Molnar <mingo@redhat.com>;
> Thomas Gleixner <tglx@linutronix.de>; Peter Zijlstra
> <peterz@infradead.org>; linux-kernel <linux-kernel@vger.kernel.org>; x86
> <x86@kernel.org>; Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Subject: Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface
> to manage Intel cache allocation
> 
> On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> > Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> > directory is associated with a class of service id(closid). To map a
> > task with closid during scheduling, this patch removes the closid field
> > from task_struct and uses the already existing 'cgroups' field in
> > task_struct.
> >
> > +
> > +/*
> >   * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> >   *
> >   * Following considerations are made so that this has minimal impact
> >   * on scheduler hot path:
> >   * - This will stay as no-op unless we are running on an Intel SKU
> >   * which supports L3 cache allocation.
> > + * - When support is present and enabled, does not do any
> > + * IA32_PQR_MSR writes until the user starts really using the feature
> > + * ie creates a rdt cgroup directory and assigns a cache_mask thats
> > + * different from the root cgroup's cache_mask.
> >   * - Caches the per cpu CLOSid values and does the MSR write only
> > - * when a task with a different CLOSid is scheduled in.
> 
> Why is this even allowed?
> 
> 	socket CBM bits:
> 
>  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
> [ | | | | | | | | | |  |  |  |  |  ]
> 
>  x x x x x x x
> 			 x  x  x  x
> 
>  x x x x x
> 
> cgroupA.bits = [ 0 - 6 ] cgroupB.bits = [ 10 - 14]  (level 1)
> cgroupA-A.bits = [ 0 - 4 ]			    (level 2)
> 
> Two ways to create a cgroup with bits [ 0 - 4] set:
> 
> 1) Create a cgroup C in level 1 with a different name.
> Useful to have same cgroup with two different names.
> 
> 2) Create a cgroup A-B under cgroup-A with bits [0 - 4].
> 
> It just creates confusion, having two or more cgroups under
> different levels of the hierarchy with the same bits set.
> (can't see any organizational benefit).
> 
> Why not return -EINVAL ? Ah, cgroups are hierarchical, right.

I would let this situation be handled by the user-space management tool; the kernel handles only the minimal case.
The management tool should have more knowledge about how to create CLOSIDs; the kernel only passes that info to the hardware.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation
  2015-11-18 21:27   ` Marcelo Tosatti
@ 2015-12-16 22:00     ` Yu, Fenghua
  0 siblings, 0 replies; 42+ messages in thread
From: Yu, Fenghua @ 2015-12-16 22:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: H Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	linux-kernel, x86, Vikas Shivappa, Luck, Tony, Shankar, Ravi V

> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Wednesday, November 18, 2015 1:27 PM
> To: Yu, Fenghua <fenghua.yu@intel.com>
> Cc: H Peter Anvin <hpa@zytor.com>; Ingo Molnar <mingo@redhat.com>;
> Thomas Gleixner <tglx@linutronix.de>; Peter Zijlstra
> <peterz@infradead.org>; linux-kernel <linux-kernel@vger.kernel.org>; x86
> <x86@kernel.org>; Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Subject: Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface
> to manage Intel cache allocation
> 
> On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> > Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> +	/*
> +	 * Try to get a reference for a different CLOSid and release the
> +	 * reference to the current CLOSid.
> +	 * Need to put down the reference here and get it back in case we
> +	 * run out of closids. Otherwise we run into a problem when
> +	 * we could be using the last closid that could have been available.
> +	 */
> +	closid_put(ir->closid);
> +	if (cbm_search(cbmvalue, &closid)) {
> 
> Can't you move closid_put here?

No, it cannot be moved there.

If it's moved there, it won't work in the case where the current rdt cgroup is
the only user of the closid. In that case the closid is released from the cbm
table by closid_put() and a new cbm can then be allocated. So closid_put() is
in the right place; it handles both the single-user case and the recycling
case, I think.
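Roughly, the cbm update path looks like this (a simplified sketch of the
flow being discussed, not the exact patch code; the function name and
helper signatures are assumptions):

/*
 * Dropping the reference first lets cbm_search(), or the subsequent
 * allocation, reuse the closid when this cgroup was its only user.
 */
static int rdt_update_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
{
	u32 closid;
	int err = 0;

	closid_put(ir->closid);		/* may free the old closid */
	if (cbm_search(cbmvalue, &closid)) {
		/* Another closid already carries this cbm: share it. */
		ir->closid = closid;
		closid_get(closid);
	} else {
		err = closid_alloc(&ir->closid);
		if (err) {
			/*
			 * Out of closids: restore the reference on the
			 * old closid (assuming closid_alloc() leaves
			 * ir->closid untouched on failure).
			 */
			closid_get(ir->closid);
			goto out;
		}
		/* ...program IA32_L3_MASK_n for the new closid... */
	}
out:
	return err;
}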

> 
> +		ir->closid = closid;
> +		closid_get(closid);
> +	} else {
> +		closid = ir->closid;
> 
> Variable unused.

You are right. I'll remove this statement.

> 
> +		err = closid_alloc(&ir->closid);
> +		if (err) {
> +			closid_get(ir->closid);
> +			goto out;
> +		}
> 
> This makes you cycle closid when changing the cbm, not necessary.
> (not very important, but closid_put is nerving because it can possibly set
> l3_cbm to zero).

I think the current code is ok. If closid_put() sets l3_cbm to zero (i.e. the
current cgroup was the only user of that closid), a new closid allocation is
started to get a new closid.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2015-12-16 22:00 UTC | newest]

Thread overview: 42+ messages
2015-10-02  6:09 [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 01/11] x86/intel_cqm: Modify hot cpu notification handling Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 02/11] x86/intel_rapl: " Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 03/11] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 04/11] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
2015-11-04 14:51   ` Luiz Capitulino
2015-10-02  6:09 ` [PATCH V15 05/11] x86/intel_rdt: Add Class of service management Fenghua Yu
2015-11-04 14:55   ` Luiz Capitulino
2015-10-02  6:09 ` [PATCH V15 06/11] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 07/11] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 08/11] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 09/11] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 10/11] x86,cgroup/intel_rdt : Add intel_rdt cgroup documentation Fenghua Yu
2015-10-02  6:09 ` [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation Fenghua Yu
2015-11-18 20:58   ` Marcelo Tosatti
2015-11-18 21:27   ` Marcelo Tosatti
2015-12-16 22:00     ` Yu, Fenghua
2015-11-18 22:15   ` Marcelo Tosatti
2015-12-14 22:58     ` Yu, Fenghua
2015-10-11 19:50 ` [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support Thomas Gleixner
2015-10-12 18:52   ` Yu, Fenghua
2015-10-12 19:58     ` Thomas Gleixner
2015-10-13 22:40     ` Marcelo Tosatti
2015-10-15 11:37       ` Peter Zijlstra
2015-10-16  0:17         ` Marcelo Tosatti
2015-10-16  9:44           ` Peter Zijlstra
2015-10-16 20:24             ` Marcelo Tosatti
2015-10-19 23:49               ` Marcelo Tosatti
2015-10-13 21:31   ` Marcelo Tosatti
2015-10-15 11:36     ` Peter Zijlstra
2015-10-16  2:28       ` Marcelo Tosatti
2015-10-16  9:50         ` Peter Zijlstra
2015-10-26 20:02           ` Marcelo Tosatti
2015-11-02 22:20           ` cat cgroup interface proposal (non hierarchical) was " Marcelo Tosatti
2015-11-04 14:42 ` Luiz Capitulino
2015-11-04 14:57   ` Thomas Gleixner
2015-11-04 15:12     ` Luiz Capitulino
2015-11-04 15:28       ` Thomas Gleixner
2015-11-04 15:35         ` Luiz Capitulino
2015-11-04 15:50           ` Thomas Gleixner
2015-11-05  2:19 ` [PATCH 1/2] x86/intel_rdt,intel_cqm: Remove build dependency of RDT code on CQM code David Carrillo-Cisneros
2015-11-05  2:19   ` [PATCH 2/2] x86/intel_rdt: Fix bug in initialization, locks and write cbm mask David Carrillo-Cisneros
