* [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support
@ 2015-05-11 19:02 Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection Vikas Shivappa
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

This patch adds a cgroup subsystem to support the new Cache Allocation
feature found in future Intel Xeon processors. Cache Allocation is
a sub-feature within the Resource Director Technology (RDT) feature. RDT
provides support to control sharing of platform resources like the L3
cache.

Cache Allocation Technology provides a way for the software (OS/VMM) to
restrict L3 cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'.  This feature is used when allocating a
line in the cache, i.e. when pulling new data into the cache.  The
hardware is programmed via MSRs. The patch series only supports L3
cache allocation. Note that some of the data structures and framework
which are common to RDT are also implemented, but they are currently
used only by the cache allocation sub-feature.
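
As a rough standalone illustration of that programming model (not part of
the patches; the MSR addresses are the ones used later in this series, and
the sketch only computes and prints values rather than touching hardware):
each class of service (CLOSid) owns the cache bit mask held in its
IA32_L3_MASK_n MSR, and a task is tied to that 'subset' by placing its
CLOSid in the upper 32 bits of IA32_PQR_ASSOC at context switch.

  #include <stdint.h>
  #include <stdio.h>

  #define MSR_IA32_PQR_ASSOC   0x0c8f            /* CLOSid lives in bits 63:32 */
  #define IA32_L3_CBM_BASE     0x0c90            /* one mask MSR per CLOSid    */
  #define CBM_FROM_INDEX(x)    (IA32_L3_CBM_BASE + (x))

  int main(void)
  {
      unsigned int max_cbm_len = 16;  /* hypothetical; comes from CPUID on real h/w */
      unsigned int closid = 1;        /* hypothetical class of service              */
      uint64_t cbm = 0x000f;          /* 4 contiguous bits out of 16 = 1/4 of L3    */

      printf("CLOSid %u: CBM 0x%llx -> MSR 0x%x (~%u%% of L3)\n",
             closid, (unsigned long long)cbm, CBM_FROM_INDEX(closid),
             100u * (unsigned int)__builtin_popcountll(cbm) / max_cbm_len);
      printf("tasks in this class: IA32_PQR_ASSOC (0x%x) bits 63:32 = %u\n",
             MSR_IA32_PQR_ASSOC, closid);
      return 0;
  }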

In today's processors the number of cores is continuously increasing,
which in turn increases the number of threads or workloads that can run
simultaneously. When multi-threaded applications run concurrently, they
compete for shared resources, including the L3 cache. At times, this L3
cache contention may result in inefficient space utilization. For example,
a higher-priority thread may end up with less L3 cache, or a
cache-sensitive application may not get optimal cache occupancy, thereby
degrading its performance. The Cache Allocation kernel patches provide a
framework for sharing the L3 cache so that users can allocate the resource
according to their requirements.

More information about the feature can be found in the Intel SDM, Volume 3,
section 17.15. The SDM does not use the 'RDT' term yet; that is planned to
change at a later time.

*All the patches will apply on 4.1-rc0*.

Changes in V7: (Thanks to feedback from PeterZ and Matt and the following
discussions)
- Changed a lot of naming to reflect which data structures are common
to RDT and which are specific to Cache Allocation.
- Removed all usage of 'cat'; replaced it with the friendlier 'cache
allocation'.
- Fixed a lot of convention issues (whitespace, return paradigm etc).
- Changed the scheduling hook for RDT to not use an inline.
- Removed the new scheduling hook and just reused the existing one,
similar to the perf hook.

Changes in V6:
- Rebased to 4.1-rc1 which has the CMT (cache monitoring) support included.
- (Thanks to Marcelo's feedback.) Fixed support for hot cpu handling of the
IA32_L3_QOS MSRs. Although during deep C-states the MSRs need not be restored,
this is needed when a new package is physically added.
- Some other coding convention changes, including renaming to cache_mask and
  using a refcnt to track the number of cgroups using a closid in the clos_cbm map.
- 1-bit CBM support for non-HSW SKUs. HSW is an exception which needs the cache
  bit masks to be at least 2 bits.

Changes in V5:
- Added support to propagate the cache bit mask update for each 
package.
- Removed the cache bit mask reference in the intel_rdt structure as
  there was no need for that and we already maintain a separate
  closid<->cbm mapping.
- Made a few coding convention changes which include adding the 
assertion while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed the naming of the cgroup to rdt (Resource Director Technology) from
  cat (Cache Allocation Technology). This was done as RDT is the
  umbrella term for platform shared resource allocation; hence in
  future it would be easier to add other resource allocation to the same
  cgroup.
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address
  the many questions from academia and industry regarding
  cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR.
- Implements support for HSW Cache Allocation enumeration. This does not use the brand
strings like the earlier version but does a probe test. The probe test is done only
on the HSW family of processors.
- Made a few coding convention and naming changes.
- Check for the lock being held when CLOSid manipulation happens.

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include it later as a
  separate patch.  
- Fixed the code in prep_arch_switch to be specific for x86 and removed
  x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.  
- Changed some of the manual bitmap
  manipulation to use the predefined bitmap APIs to make the code more readable
- Changed name in sources from cqe to cat
- Global cat enable flag changed to static_key and disabled cgroup early_init

[PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection
[PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
[PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
[PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
[PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
[PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration
[PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and user
guide


* [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

This patch adds support for the Cache Allocation Technology feature found in
future Intel Xeon processors. Cache allocation is a sub-feature of Intel
Resource Director Technology (RDT), which enables control over sharing of
processor resources.  This patch includes CPUID enumeration routines for
Cache allocation and adds new fields to the cpuinfo_x86 structure to track
the enumerated resource values.

Cache allocation provides a way for the software (OS/VMM) to restrict
cache allocation to a defined 'subset' of the cache, which may overlap
with other 'subsets'.  This feature is used when allocating a line in
the cache, i.e. when pulling new data into the cache.  The hardware is
programmed via MSRs.

More information about Cache allocation can be found in the Intel (R) x86
Architecture Software Developer Manual, Volume 3, section 17.15.
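
For reference, a user-space sketch of the same enumeration (not part of the
patch): it mirrors the leaf 0x10 handling this patch adds to get_cpu_cap(),
including the bare eax + 1 / edx + 1 adjustments, rather than the exact
field widths documented in the SDM.

  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      if (__get_cpuid_max(0, 0) < 0x10)
          return 1;                   /* CPUID leaf 0x10 not available */

      /* Leaf 0x10, subleaf 0: EBX bit 1 advertises L3 Cache Allocation */
      __cpuid_count(0x10, 0, eax, ebx, ecx, edx);
      if (!(ebx & (1 << 1)))
          return 1;

      /* Leaf 0x10, subleaf 1: bitmask length and number of CLOSids */
      __cpuid_count(0x10, 1, eax, ebx, ecx, edx);
      printf("Max bitmask length: %u, Max ClosIds: %u\n", eax + 1, edx + 1);
      return 0;
  }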

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h |  6 +++++-
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/kernel/cpu/Makefile      |  1 +
 arch/x86/kernel/cpu/common.c      | 15 +++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c   | 44 +++++++++++++++++++++++++++++++++++++++
 init/Kconfig                      | 11 ++++++++++
 6 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 3d6606f..ae5ae9d 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -229,6 +229,7 @@
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
 #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT		( 9*32+15) /* Resource Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
 #define X86_FEATURE_ADX		( 9*32+19) /* The ADCX and ADOX instructions */
@@ -252,6 +253,9 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3	(13*32 + 1) /* Cache Allocation L3 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 23ba676..e84de35 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -114,6 +114,9 @@ struct cpuinfo_x86 {
 	int			x86_cache_occ_scale;	/* scale to bytes */
 	int			x86_power;
 	unsigned long		loops_per_jiffy;
+	/* Resource Allocation values */
+	u16			x86_rdt_max_cbm_len;
+	u16			x86_rdt_max_closid;
 	/* cpuid returned max cores value: */
 	u16			 x86_max_cores;
 	u16			apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..4ff7a1f 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= perf_event_intel_uncore.o \
 					   perf_event_intel_uncore_nhmex.o
 endif
 
+obj-$(CONFIG_CGROUP_RDT) 		+= intel_rdt.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..4133d3c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -670,6 +670,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		}
 	}
 
+	/* Additional Intel-defined flags: level 0x00000010 */
+	if (c->cpuid_level >= 0x00000010) {
+		u32 eax, ebx, ecx, edx;
+
+		cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[13] = ebx;
+
+		if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_rdt_max_closid = edx + 1;
+			c->x86_rdt_max_cbm_len = eax + 1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	xlvl = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..f393d3e
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,44 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2014-09-10 Written by
+ *    Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual, volume 3, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static int __init intel_rdt_late_init(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+	int maxid, max_cbm_len;
+
+	if (!cpu_has(c, X86_FEATURE_CAT_L3))
+		return -ENODEV;
+
+	maxid = c->x86_rdt_max_closid;
+	max_cbm_len = c->x86_rdt_max_cbm_len;
+
+	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);
+
+	return 0;
+}
+
+late_initcall(intel_rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..d97ff5e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -983,6 +983,17 @@ config CPUSETS
 
 	  Say N if unsure.
 
+config CGROUP_RDT
+	bool "Resource Director Technology cgroup subsystem"
+	depends on X86_64 && CPU_SUP_INTEL
+	help
+	  This option provides a cgroup to allocate Platform shared
+	  resources. Among the shared resources, current implementation
+	  focuses on L3 Cache. Using the interface user can specify the
+	  amount of L3 cache space into which an application can fill.
+
+	  Say N if unsure.
+
 config PROC_PID_CPUSET
 	bool "Include legacy /proc/<pid>/cpuset file"
 	depends on CPUSETS
-- 
1.9.1



* [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-15 19:18   ` Thomas Gleixner
  2015-05-11 19:02 ` [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

This patch adds a cgroup subsystem for the Intel Resource Director
Technology (RDT) feature and Class of service (CLOSid) management, which is
part of the common RDT framework.  This cgroup would eventually be used by
all sub-features of RDT and hence is associated with the common RDT
framework as well as the sub-feature specific framework.  However, the
current patch series only adds cache allocation sub-feature specific code.

When a cgroup directory is created it has a CLOSid associated with it,
which is inherited from its parent.  The CLOSid is mapped to a
cache_mask which represents the L3 cache allocation for the cgroup.
Tasks belonging to the cgroup get to fill the cache represented by the
cache_mask.

The CLOSid is internal to the kernel and not exposed to the user.  The
kernel uses several ways to optimize the allocation of CLOSids, so exposing
the available CLOSids may actually provide wrong information to users, as
the set may change dynamically depending on usage.

CLOSid allocation is tracked using a separate bitmap. The maximum number
of CLOSids is specified by the h/w during CPUID enumeration and the
kernel simply returns -ENOSPC when it runs out of CLOSids.
Each cache_mask (CBM) has an associated CLOSid.  However, if multiple
cgroups have the same cache mask they also share the same CLOSid.
The reference count in the CLOSid<->CBM map keeps track of how many
cgroups are using each CLOSid<->CBM mapping.
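
A condensed standalone model of that bookkeeping (names and sizes are
simplified here; the actual code in this patch uses a kmalloc'd closmap
bitmap, a clos_cbm_map array and a mutex):

  #include <errno.h>
  #include <stdint.h>

  #define MAX_CLOSIDS 16                  /* reported by CPUID on real hardware */

  struct clos_cbm_entry {
      uint64_t     cache_mask;            /* CBM shared by all users of this CLOSid */
      unsigned int refcnt;                /* how many cgroups use this CLOSid       */
  };

  static struct clos_cbm_entry ccmap[MAX_CLOSIDS];
  static uint32_t closmap;                /* bit n set => CLOSid n is in use */

  /* Hand out the first free CLOSid, or -ENOSPC once the h/w limit is hit. */
  static int alloc_closid(void)
  {
      for (int id = 0; id < MAX_CLOSIDS; id++) {
          if (!(closmap & (1u << id))) {
              closmap |= 1u << id;
              ccmap[id].refcnt = 1;
              return id;
          }
      }
      return -ENOSPC;
  }

  /* Dropping the last reference frees the CLOSid for reuse. */
  static void put_closid(int id)
  {
      if (ccmap[id].refcnt && !--ccmap[id].refcnt)
          closmap &= ~(1u << id);
  }

  int main(void)
  {
      int id = alloc_closid();            /* first cgroup gets CLOSid 0     */
      ccmap[id].refcnt++;                 /* second cgroup shares same CBM  */
      put_closid(id);
      put_closid(id);                     /* last user gone: CLOSid is free */
      return alloc_closid() == id ? 0 : 1;
  }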

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 112 +++++++++++++++++++++++++++++++++++++--
 include/linux/cgroup_subsys.h    |   4 ++
 3 files changed, 150 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..6f62891
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,38 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_CGROUP_RDT
+
+#include <linux/cgroup.h>
+
+struct rdt_subsys_info {
+	/* Clos Bitmap to keep track of available CLOSids.*/
+	unsigned long *closmap;
+};
+
+struct intel_rdt {
+	struct cgroup_subsys_state css;
+	/* Class of service for the cgroup.*/
+	unsigned int clos;
+};
+
+struct clos_cbm_map {
+	unsigned long cache_mask;
+	unsigned int clos_refcnt;
+};
+
+/*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+	return css_rdt(ir->css.parent);
+}
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index f393d3e..e8134a9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -24,21 +24,125 @@
 
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/spinlock.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * ccmap maintains 1:1 mapping between CLOSid and cache bitmask.
+ */
+static struct clos_cbm_map *ccmap;
+static struct rdt_subsys_info rdtss_info;
+static DEFINE_MUTEX(rdt_group_mutex);
+struct intel_rdt rdt_root_group;
+
+static void intel_rdt_free_closid(unsigned int clos)
+{
+	lockdep_assert_held(&rdt_group_mutex);
+
+	clear_bit(clos, rdtss_info.closmap);
+}
+
+static void __clos_get(unsigned int closid)
+{
+	struct clos_cbm_map *ccm = &ccmap[closid];
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	ccm->clos_refcnt += 1;
+}
+
+static void __clos_put(unsigned int closid)
+{
+	struct clos_cbm_map *ccm = &ccmap[closid];
+
+	lockdep_assert_held(&rdt_group_mutex);
+	WARN_ON(!ccm->clos_refcnt);
+
+	ccm->clos_refcnt -= 1;
+	if (!ccm->clos_refcnt)
+		intel_rdt_free_closid(closid);
+}
+
+static struct cgroup_subsys_state *
+intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct intel_rdt *parent = css_rdt(parent_css);
+	struct intel_rdt *ir;
+
+	/*
+	 * Cannot return failure on systems with no Cache Allocation
+	 * as the cgroup_init does not handle failures gracefully.
+	 */
+	if (!parent)
+		return &rdt_root_group.css;
+
+	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+	if (!ir)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&rdt_group_mutex);
+	ir->clos = parent->clos;
+	__clos_get(ir->clos);
+	mutex_unlock(&rdt_group_mutex);
+
+	return &ir->css;
+}
+
+static void intel_rdt_css_free(struct cgroup_subsys_state *css)
+{
+	struct intel_rdt *ir = css_rdt(css);
+
+	mutex_lock(&rdt_group_mutex);
+	__clos_put(ir->clos);
+	kfree(ir);
+	mutex_unlock(&rdt_group_mutex);
+}
 
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
-	int maxid, max_cbm_len;
+	static struct clos_cbm_map *ccm;
+	int maxid, max_cbm_len, err = 0;
+	size_t sizeb;
 
-	if (!cpu_has(c, X86_FEATURE_CAT_L3))
+	if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
+		rdt_root_group.css.ss->disabled = 1;
 		return -ENODEV;
-
+	}
 	maxid = c->x86_rdt_max_closid;
 	max_cbm_len = c->x86_rdt_max_cbm_len;
 
+	sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
+	rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
+	if (!rdtss_info.closmap) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	sizeb = maxid * sizeof(struct clos_cbm_map);
+	ccmap = kzalloc(sizeb, GFP_KERNEL);
+	if (!ccmap) {
+		kfree(rdtss_info.closmap);
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	set_bit(0, rdtss_info.closmap);
+	rdt_root_group.clos = 0;
+	ccm = &ccmap[0];
+	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
+	ccm->clos_refcnt = 1;
+
 	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);
+out_err:
 
-	return 0;
+	return err;
 }
 
 late_initcall(intel_rdt_late_init);
+
+struct cgroup_subsys intel_rdt_cgrp_subsys = {
+	.css_alloc		= intel_rdt_css_alloc,
+	.css_free		= intel_rdt_css_free,
+	.early_init		= 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..0339312 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDT)
+SUBSYS(intel_rdt)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
-- 
1.9.1



* [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-15 19:25   ` Thomas Gleixner
  2015-05-11 19:02 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

The change adds a file, cache_mask, to the RDT cgroup which represents the
cache bit mask (CBM) for the cgroup. cache_mask is specific to the Cache
allocation sub-feature of RDT. The tasks in the RDT cgroup get to
fill the L3 cache represented by the cgroup's cache_mask file.

The CBM is updated by writing to the IA32_L3_MASK_n MSRs.
The RDT cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
cgroup never fail.  When a child cgroup is created it inherits the
CLOSid and the cache_mask from its parent.  When a user changes the
default CBM for a cgroup, a new CLOSid may be allocated if the
cache_mask was not used before. If the new CBM is one that is
already in use, the count for that CLOSid<->CBM mapping is incremented. The
changing of 'cache_mask' may fail with -ENOSPC once the kernel runs out of
the maximum number of CLOSids it can support.
Users can create as many cgroups as they want, but having different CBMs at
the same time is restricted by the maximum number of CLOSids. The kernel
maintains a CLOSid<->CBM mapping which keeps count of the cgroups using a
CLOSid.

Reuse of CLOSids for cgroups with the same bitmask also has the following
advantages:
- This helps to use the scant CLOSids optimally.
- This also implies that during context switch, the write to the PQR MSR is
done only when a task with a different bitmask is scheduled in.

During CPU bringup due to a hotplug event, the IA32_L3_MASK_n MSRs are
synchronized from the clos_cbm map if they are used by any cgroup for the
package.
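
One rule the write path below enforces (see cbm_is_contiguous() and
validate_cbm()) is that the mask is a single contiguous run of set bits and
a subset of the parent's mask. An equivalent standalone check for the
contiguity part, just as an illustration:

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * A valid cache bit mask is one contiguous run of set bits: 0x00f0 is
   * accepted, 0x00f1 is rejected (equivalent to cbm_is_contiguous() below,
   * which walks the bits with find_next_bit()/find_next_zero_bit()).
   */
  static bool cbm_is_contiguous(uint64_t cbm)
  {
      if (!cbm)
          return false;

      cbm >>= __builtin_ctzll(cbm);       /* strip trailing zeros...         */
      return (cbm & (cbm + 1)) == 0;      /* ...what remains must be 2^n - 1 */
  }

  int main(void)
  {
      return (cbm_is_contiguous(0x00f0) && !cbm_is_contiguous(0x00f1)) ? 0 : 1;
  }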

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h |   3 +
 arch/x86/kernel/cpu/intel_rdt.c  | 314 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 316 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 6f62891..9e9dbbe 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#define MAX_CBM_LENGTH			32
+#define IA32_L3_CBM_BASE		0xc90
+#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
 
 struct rdt_subsys_info {
 	/* Clos Bitmap to keep track of available CLOSids.*/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index e8134a9..125318d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/spinlock.h>
+#include <linux/cpu.h>
 #include <asm/intel_rdt.h>
 
 /*
@@ -35,6 +36,42 @@ static struct rdt_subsys_info rdtss_info;
 static DEFINE_MUTEX(rdt_group_mutex);
 struct intel_rdt rdt_root_group;
 
+/*
+ * Mask of CPUs for writing CBM values. We only need one per-socket.
+ */
+static cpumask_t rdt_cpumask;
+
+#define rdt_for_each_child(pos_css, parent_ir)		\
+	css_for_each_child((pos_css), &(parent_ir)->css)
+
+static void __clos_init(unsigned int closid)
+{
+	struct clos_cbm_map *ccm = &ccmap[closid];
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	ccm->clos_refcnt = 1;
+}
+
+static int intel_rdt_alloc_closid(struct intel_rdt *ir)
+{
+	unsigned int id;
+	unsigned int maxid;
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	maxid = boot_cpu_data.x86_rdt_max_closid;
+	id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
+	if (id == maxid)
+		return -ENOSPC;
+
+	set_bit(id, rdtss_info.closmap);
+	__clos_init(id);
+	ir->clos = id;
+
+	return 0;
+}
+
 static void intel_rdt_free_closid(unsigned int clos)
 {
 	lockdep_assert_held(&rdt_group_mutex);
@@ -98,11 +135,267 @@ static void intel_rdt_css_free(struct cgroup_subsys_state *css)
 	mutex_unlock(&rdt_group_mutex);
 }
 
+static inline bool cbm_is_contiguous(unsigned long var)
+{
+	unsigned long maxcbm = MAX_CBM_LENGTH;
+	unsigned long first_bit, zero_bit;
+
+	if (!var)
+		return false;
+
+	first_bit = find_next_bit(&var, maxcbm, 0);
+	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
+
+	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
+		return false;
+
+	return true;
+}
+
+static int intel_cache_alloc_cbm_read(struct seq_file *m, void *v)
+{
+	struct intel_rdt *ir = css_rdt(seq_css(m));
+
+	seq_printf(m, "%08lx\n", ccmap[ir->clos].cache_mask);
+
+	return 0;
+}
+
+static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+	struct cgroup_subsys_state *css;
+	struct intel_rdt *par, *c;
+	unsigned long *cbm_tmp;
+	int err = 0;
+
+	if (!cbm_is_contiguous(cbmvalue)) {
+		pr_err("bitmask should have >= 1 bit and be contiguous\n");
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	par = parent_rdt(ir);
+	cbm_tmp = &ccmap[par->clos].cache_mask;
+	if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH)) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	rcu_read_lock();
+	rdt_for_each_child(css, ir) {
+		c = css_rdt(css);
+		cbm_tmp = &ccmap[c->clos].cache_mask;
+		if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+			rcu_read_unlock();
+			pr_err("Children's mask not a subset\n");
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+	rcu_read_unlock();
+out_err:
+
+	return err;
+}
+
+static bool cbm_search(unsigned long cbm, int *closid)
+{
+	int maxid = boot_cpu_data.x86_rdt_max_closid;
+	unsigned int i;
+
+	for (i = 0; i < maxid; i++) {
+		if (bitmap_equal(&cbm, &ccmap[i].cache_mask, MAX_CBM_LENGTH)) {
+			*closid = i;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void cbmmap_dump(void)
+{
+	int i;
+
+	pr_debug("CBMMAP\n");
+	for (i = 0; i < boot_cpu_data.x86_rdt_max_closid; i++) {
+		pr_debug("cache_mask: 0x%x,clos_refcnt: %u\n",
+		 (unsigned int)ccmap[i].cache_mask, ccmap[i].clos_refcnt);
+	}
+}
+
+static void __cpu_cbm_update(void *info)
+{
+	unsigned int closid = *((unsigned int *)info);
+
+	wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cache_mask);
+}
+
+/*
+ * cbm_update_all() - Update the cache bit mask for all packages.
+ */
+static inline void cbm_update_all(unsigned int closid)
+{
+	on_each_cpu_mask(&rdt_cpumask, __cpu_cbm_update, &closid, 1);
+}
+
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * which are one per CLOSid, except IA32_L3_MASK_0 on the current package.
+ * @cpu : the cpu on which the mask is updated.
+ */
+static inline void cbm_update_msrs(int cpu)
+{
+	int maxid = boot_cpu_data.x86_rdt_max_closid;
+	unsigned int i;
+
+	if (WARN_ON(cpu != smp_processor_id()))
+		return;
+
+	for (i = 1; i < maxid; i++) {
+		if (ccmap[i].clos_refcnt)
+			__cpu_cbm_update(&i);
+	}
+}
+
+/*
+ * intel_cache_alloc_cbm_write() - Validates and writes the
+ * cache bit mask(cbm) to the IA32_L3_MASK_n
+ * and also store the same in the ccmap.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * - This helps to use the scant CLOSids optimally.
+ * - This also implies that at context switch write
+ * to PQR-MSR is done only when a task with a
+ * different bitmask is scheduled in.
+ */
+static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 cbmvalue)
+{
+	u32 max_cbm = boot_cpu_data.x86_rdt_max_cbm_len;
+	struct intel_rdt *ir = css_rdt(css);
+	unsigned long cache_mask, max_mask;
+	unsigned long *cbm_tmp;
+	unsigned int closid;
+	ssize_t err = 0;
+
+	if (ir == &rdt_root_group)
+		return -EPERM;
+	bitmap_set(&max_mask, 0, max_cbm);
+
+	/*
+	 * Need global mutex as cbm write may allocate a closid.
+	 */
+	mutex_lock(&rdt_group_mutex);
+	bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
+	cbm_tmp = &ccmap[ir->clos].cache_mask;
+
+	if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
+		goto out;
+
+	err = validate_cbm(ir, cache_mask);
+	if (err)
+		goto out;
+
+	/*
+	 * At this point we are sure to change the cache_mask.Hence release the
+	 * reference to the current CLOSid and try to get a reference for
+	 * a different CLOSid.
+	 */
+	__clos_put(ir->clos);
+
+	if (cbm_search(cache_mask, &closid)) {
+		ir->clos = closid;
+		__clos_get(closid);
+	} else {
+		err = intel_rdt_alloc_closid(ir);
+		if (err)
+			goto out;
+
+		ccmap[ir->clos].cache_mask = cache_mask;
+		cbm_update_all(ir->clos);
+	}
+	cbmmap_dump();
+out:
+	mutex_unlock(&rdt_group_mutex);
+
+	return err;
+}
+
+static inline bool intel_rdt_update_cpumask(int cpu)
+{
+	int phys_id = topology_physical_package_id(cpu);
+	struct cpumask *mask = &rdt_cpumask;
+	int i;
+
+	for_each_cpu(i, mask) {
+		if (phys_id == topology_physical_package_id(i))
+			return false;
+	}
+	cpumask_set_cpu(cpu, mask);
+
+	return true;
+}
+
+/*
+ * intel_rdt_cpu_start() - If a new package has come up, update all
+ * the Cache bitmasks on the package.
+ */
+static inline void intel_rdt_cpu_start(int cpu)
+{
+	mutex_lock(&rdt_group_mutex);
+	if (intel_rdt_update_cpumask(cpu))
+		cbm_update_msrs(cpu);
+	mutex_unlock(&rdt_group_mutex);
+}
+
+static void intel_rdt_cpu_exit(unsigned int cpu)
+{
+	int phys_id = topology_physical_package_id(cpu);
+	int i;
+
+	mutex_lock(&rdt_group_mutex);
+	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+		mutex_unlock(&rdt_group_mutex);
+		return;
+	}
+
+	for_each_online_cpu(i) {
+		if (i == cpu)
+			continue;
+
+		if (phys_id == topology_physical_package_id(i)) {
+			cpumask_set_cpu(i, &rdt_cpumask);
+			break;
+		}
+	}
+	mutex_unlock(&rdt_group_mutex);
+}
+
+static int intel_rdt_cpu_notifier(struct notifier_block *nb,
+				  unsigned long action, void *hcpu)
+{
+	unsigned int cpu  = (unsigned long)hcpu;
+
+	switch (action) {
+	case CPU_STARTING:
+		intel_rdt_cpu_start(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		intel_rdt_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
+	int maxid, max_cbm_len, err = 0, i;
 	static struct clos_cbm_map *ccm;
-	int maxid, max_cbm_len, err = 0;
 	size_t sizeb;
 
 	if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
@@ -133,6 +426,14 @@ static int __init intel_rdt_late_init(void)
 	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
 	ccm->clos_refcnt = 1;
 
+	cpu_notifier_register_begin();
+	for_each_online_cpu(i)
+		intel_rdt_update_cpumask(i);
+
+	__hotcpu_notifier(intel_rdt_cpu_notifier, 0);
+
+	cpu_notifier_register_done();
+
 	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);
 out_err:
 
@@ -141,8 +442,19 @@ out_err:
 
 late_initcall(intel_rdt_late_init);
 
+static struct cftype rdt_files[] = {
+	{
+		.name = "cache_mask",
+		.seq_show = intel_cache_alloc_cbm_read,
+		.write_u64 = intel_cache_alloc_cbm_write,
+		.mode = 0666,
+	},
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys intel_rdt_cgrp_subsys = {
 	.css_alloc		= intel_rdt_css_alloc,
 	.css_free		= intel_rdt_css_free,
+	.legacy_cftypes		= rdt_files,
 	.early_init		= 0,
 };
-- 
1.9.1



* [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
                   ` (2 preceding siblings ...)
  2015-05-11 19:02 ` [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-15 19:39   ` Thomas Gleixner
  2015-05-11 19:02 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.
For Cache Allocation, the MSR write lets the task fill in the cache
'subset' represented by its cgroup's cache_mask.

The high 32 bits of the per-processor MSR IA32_PQR_ASSOC hold the
CLOSid. During context switch the kernel implements this by writing the
CLOSid of the cgroup to which the task belongs to the CPU's
IA32_PQR_ASSOC MSR.

The following considerations are made for the PQR MSR write so that it
minimally impacts the scheduler hot path:
- This path does not exist on any non-Intel platforms.
- On Intel platforms, it does not exist by default unless CGROUP_RDT
is enabled.
- It remains a no-op when CGROUP_RDT is enabled and the Intel SKU does not
support the feature.
- When the feature is available and enabled, no MSR write is done until the
user manually creates a cgroup directory *and* assigns it a cache_mask
different from the root cgroup's. Since a child node inherits its
parent's cache mask, cgroup creation alone adds no scheduling hot path
impact.
- The MSR write is only done when a task with a different CLOSid is
scheduled on the CPU. Typically, if task groups are bound to be
scheduled on a set of CPUs, the number of MSR writes is greatly
reduced.
- For cgroup directories having the same cache_mask the CLOSids are reused.
This minimizes the number of CLOSids used and hence reduces the MSR
write frequency.
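
A condensed standalone model of that hot-path decision (the per-cpu
variable and the wrmsr() are replaced by a plain global and a printf here;
the real hook is wired through finish_arch_switch() in the patch below):

  #include <stdio.h>

  static unsigned int cpu_cached_closid;  /* per-cpu x86_cpu_clos in the patch */

  /* Called at context switch with the CLOSid of the incoming task's cgroup. */
  static void rdt_sched_in(unsigned int next_closid)
  {
      /* Most switches stay within one class of service: skip the MSR write. */
      if (next_closid == cpu_cached_closid)
          return;

      /* stand-in for: wrmsr(MSR_IA32_PQR_ASSOC, 0, next_closid); */
      printf("IA32_PQR_ASSOC <- CLOSid %u\n", next_closid);
      cpu_cached_closid = next_closid;
  }

  int main(void)
  {
      rdt_sched_in(0);    /* same CLOSid as cached: no MSR write     */
      rdt_sched_in(1);    /* task from another cgroup: one write     */
      rdt_sched_in(1);    /* and none again until the CLOSid changes */
      return 0;
  }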

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 44 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/switch_to.h |  3 +++
 arch/x86/kernel/cpu/intel_rdt.c  | 30 +++++++++++++++++++++++++++
 3 files changed, 77 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 9e9dbbe..589394b 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,9 +4,15 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+
+#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+extern struct static_key rdt_enable_key;
+extern void __rdt_sched_in(void);
+
 
 struct rdt_subsys_info {
 	/* Clos Bitmap to keep track of available CLOSids.*/
@@ -24,6 +30,11 @@ struct clos_cbm_map {
 	unsigned int clos_refcnt;
 };
 
+static inline bool rdt_enabled(void)
+{
+	return static_key_false(&rdt_enable_key);
+}
+
 /*
  * Return rdt group corresponding to this container.
  */
@@ -37,5 +48,38 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
 	return css_rdt(ir->css.parent);
 }
 
+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+	return css_rdt(task_css(task, intel_rdt_cgrp_id));
+}
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports L3 cache allocation.
+ * - When support is present and enabled, does not do any
+ * IA32_PQR_MSR writes until the user starts really using the feature
+ * ie creates a rdt cgroup directory and assigns a cache_mask thats
+ * different from the root cgroup's cache_mask.
+ * - Closids are allocated so that different cgroup directories
+ * with same cache_mask gets the same CLOSid. This minimizes CLOSids
+ * used and reduces MSR write frequency.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+	if (rdt_enabled())
+		__rdt_sched_in();
+}
+
+#else
+
+static inline void intel_rdt_sched_in(struct task_struct *task) {}
+
 #endif
 #endif
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 751bf4b..9149577 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,9 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+#include <asm/intel_rdt.h>
+#define finish_arch_switch(prev)	intel_rdt_sched_in()
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 125318d..fe3ce4e 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -35,6 +35,8 @@ static struct clos_cbm_map *ccmap;
 static struct rdt_subsys_info rdtss_info;
 static DEFINE_MUTEX(rdt_group_mutex);
 struct intel_rdt rdt_root_group;
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+DEFINE_PER_CPU(unsigned int, x86_cpu_clos);
 
 /*
  * Mask of CPUs for writing CBM values. We only need one per-socket.
@@ -79,6 +81,33 @@ static void intel_rdt_free_closid(unsigned int clos)
 	clear_bit(clos, rdtss_info.closmap);
 }
 
+void __rdt_sched_in(void)
+{
+	struct task_struct *task = current;
+	struct intel_rdt *ir;
+	unsigned int clos;
+
+	/*
+	 * This needs to be fixed
+	 * to cache the whole PQR instead of just CLOSid.
+	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
+	 * Should not write a 0 to the low 10 bits of PQR
+	 * and corrupt RMID.
+	 */
+	clos = this_cpu_read(x86_cpu_clos);
+
+	rcu_read_lock();
+	ir = task_rdt(task);
+	if (ir->clos == clos) {
+		rcu_read_unlock();
+		return;
+	}
+
+	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
+	this_cpu_write(x86_cpu_clos, ir->clos);
+	rcu_read_unlock();
+}
+
 static void __clos_get(unsigned int closid)
 {
 	struct clos_cbm_map *ccm = &ccmap[closid];
@@ -433,6 +462,7 @@ static int __init intel_rdt_late_init(void)
 	__hotcpu_notifier(intel_rdt_cpu_notifier, 0);
 
 	cpu_notifier_register_done();
+	static_key_slow_inc(&rdt_enable_key);
 
 	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);
 out_err:
-- 
1.9.1



* [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
                   ` (3 preceding siblings ...)
  2015-05-11 19:02 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-15 20:15   ` Thomas Gleixner
  2015-05-11 19:02 ` [PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration Vikas Shivappa
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

This patch implements a common software cache for IA32_PQR_MSR (RMID 0:9,
CLOSid 32:63) to be used by both Cache Monitoring (CMT) and Cache
Allocation. CMT updates the RMID whereas cache allocation updates the
CLOSid in the software cache.  During scheduling, when the new RMID/CLOSid
value is different from the cached values, IA32_PQR_MSR is updated.  Since
the measured rdmsr latency for IA32_PQR_MSR is very high (~250 cycles), this
software cache is necessary to avoid reading the MSR to compare the
current CLOSid value. Caching reduces the frequency of MSR writes during
the scheduler hot path for cache allocation.  During CPU hotplug the pqr
cache is updated to zero.
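
A standalone model of that shared state (illustrative only; in the patch
below intel_pqr_state is per-cpu, protected by a raw spinlock, and the
printf is a real wrmsr):

  #include <stdint.h>
  #include <stdio.h>

  /*
   * One cached copy of IA32_PQR_ASSOC: the RMID (low bits, owned by CMT)
   * and the CLOSid (high bits, owned by cache allocation).
   */
  struct pqr_state {
      uint32_t rmid;
      uint32_t clos;
  };

  static struct pqr_state state;          /* per-cpu in the real code */

  static void pqr_update(uint32_t rmid, uint32_t clos)
  {
      if (rmid == state.rmid && clos == state.clos)
          return;                         /* nothing changed: skip the MSR write */

      state.rmid = rmid;
      state.clos = clos;
      /* stand-in for: wrmsr(MSR_IA32_PQR_ASSOC, rmid, clos); */
      printf("IA32_PQR_ASSOC <- closid=%u rmid=%u\n", clos, rmid);
  }

  int main(void)
  {
      pqr_update(state.rmid, 2);          /* cache allocation changes only the CLOSid */
      pqr_update(5, state.clos);          /* CMT changes only the RMID                */
      return 0;
  }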

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Conflicts:
	arch/x86/kernel/cpu/perf_event_intel_cqm.c
---
 arch/x86/include/asm/intel_rdt.h           |  9 ++++++---
 arch/x86/include/asm/rdt_common.h          | 13 +++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c            | 30 +++++++++++++++++-------------
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++-------------
 4 files changed, 43 insertions(+), 29 deletions(-)
 create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 589394b..f4372d8 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,16 +4,16 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#include <asm/rdt_common.h>
 
-#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 extern struct static_key rdt_enable_key;
 extern void __rdt_sched_in(void);
 
-
 struct rdt_subsys_info {
 	/* Clos Bitmap to keep track of available CLOSids.*/
 	unsigned long *closmap;
@@ -67,6 +67,9 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
  * IA32_PQR_MSR writes until the user starts really using the feature
  * ie creates a rdt cgroup directory and assigns a cache_mask thats
  * different from the root cgroup's cache_mask.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in. That
+ * means the task belongs to a different cgroup.
  * - Closids are allocated so that different cgroup directories
  * with same cache_mask gets the same CLOSid. This minimizes CLOSids
  * used and reduces MSR write frequency.
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..33fd8ea
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+struct intel_pqr_state {
+	raw_spinlock_t	lock;
+	int		rmid;
+	int		clos;
+	int		cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index fe3ce4e..2415965 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -85,27 +85,28 @@ void __rdt_sched_in(void)
 {
 	struct task_struct *task = current;
 	struct intel_rdt *ir;
-	unsigned int clos;
-
-	/*
-	 * This needs to be fixed
-	 * to cache the whole PQR instead of just CLOSid.
-	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
-	 * Should not write a 0 to the low 10 bits of PQR
-	 * and corrupt RMID.
-	 */
-	clos = this_cpu_read(x86_cpu_clos);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	unsigned long flags;
 
+	raw_spin_lock_irqsave(&state->lock, flags);
 	rcu_read_lock();
 	ir = task_rdt(task);
-	if (ir->clos == clos) {
+	if (ir->clos == state->clos) {
 		rcu_read_unlock();
+		raw_spin_unlock_irqrestore(&state->lock, flags);
 		return;
 	}
 
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
-	this_cpu_write(x86_cpu_clos, ir->clos);
+	/*
+	 * PQR has closid in high 32 bits and CQM-RMID
+	 * in low 10 bits. Rewrite the existing rmid from
+	 * software cache.
+	 */
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+	state->clos = ir->clos;
 	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&state->lock, flags);
+
 }
 
 static void __clos_get(unsigned int closid)
@@ -372,6 +373,9 @@ static inline bool intel_rdt_update_cpumask(int cpu)
  */
 static inline void intel_rdt_cpu_start(int cpu)
 {
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+	state->clos = 0;
 	mutex_lock(&rdt_group_mutex);
 	if (intel_rdt_update_cpumask(cpu))
 		cbm_update_msrs(cpu);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e4d1b8b..fd039899 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static unsigned int cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-struct intel_cqm_state {
-	raw_spinlock_t		lock;
-	int			rmid;
-	int			cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -961,7 +955,7 @@ out:
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
@@ -978,14 +972,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 		WARN_ON_ONCE(state->rmid);
 
 	state->rmid = rmid;
-	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);
 
 	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned long flags;
 
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -998,7 +992,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 	if (!--state->cnt) {
 		state->rmid = 0;
-		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
 	} else {
 		WARN_ON_ONCE(!state->rmid);
 	}
@@ -1243,7 +1237,7 @@ static inline void cqm_pick_event_reader(int cpu)
 
 static void intel_cqm_cpu_prepare(unsigned int cpu)
 {
-	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	raw_spin_lock_init(&state->lock);
-- 
1.9.1



* [PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
                   ` (4 preceding siblings ...)
  2015-05-11 19:02 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-11 19:02 ` [PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and usage guide Vikas Shivappa
  2015-05-13 16:52 ` [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
  7 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

Cache Allocation on HSW needs to be enumerated
separately as HSW does not have support for CPUID enumeration for Cache
Allocation. Cache Allocation is only supported on certain HSW SKUs. This
patch does a probe test for HSW CPUs by writing a CLOSid (Class of
service id) into the high 32 bits of IA32_PQR_MSR and seeing if the bits
stick. The probe test is only done after confirming that the CPU is HSW.
Other HSW specific quirks are:
- HSW requires the L3 cache bit mask to be at least two bits.
- The maximum number of CLOSids supported is always 4.
- The maximum number of bits supported in the cache bit mask is always 20.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 63 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 2415965..0abb0f6 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -39,6 +39,11 @@ struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
 DEFINE_PER_CPU(unsigned int, x86_cpu_clos);
 
 /*
+ * Minimum bits required in Cache bitmask.
+ */
+static unsigned int min_bitmask_len = 1;
+
+/*
  * Mask of CPUs for writing CBM values. We only need one per-socket.
  */
 static cpumask_t rdt_cpumask;
@@ -46,6 +51,57 @@ static cpumask_t rdt_cpumask;
 #define rdt_for_each_child(pos_css, parent_ir)		\
 	css_for_each_child((pos_css), &(parent_ir)->css)
 
+/*
+ * hsw_probetest() - Have to do probe
+ * test for Intel haswell CPUs as it does not have
+ * CPUID enumeration support for Cache allocation.
+ *
+ * Probes by writing to the high 32 bits(CLOSid)
+ * of the IA32_PQR_MSR and testing if the bits stick.
+ * Then hardcode the max CLOS and max bitmask length on hsw.
+ * The minimum cache bitmask length allowed for HSW is 2 bits.
+ */
+static inline bool hsw_probetest(void)
+{
+	u32 l, h_old, h_new, h_tmp;
+
+	if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+		return false;
+
+	/*
+	 * Default value is always 0 if feature is present.
+	 */
+	h_tmp = h_old ^ 0x1U;
+	if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+	    rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+		return false;
+
+	if (h_tmp != h_new)
+		return false;
+
+	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+	boot_cpu_data.x86_rdt_max_closid = 4;
+	boot_cpu_data.x86_rdt_max_cbm_len = 20;
+	min_bitmask_len = 2;
+
+	return true;
+}
+
+static inline bool intel_cache_alloc_supported(struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_CAT_L3))
+		return true;
+
+	/*
+	 * Probe test for Haswell CPUs.
+	 */
+	if (c->x86 == 0x6 && c->x86_model == 0x3f)
+		return hsw_probetest();
+
+	return false;
+}
+
 static void __clos_init(unsigned int closid)
 {
 	struct clos_cbm_map *ccm = &ccmap[closid];
@@ -170,7 +226,7 @@ static inline bool cbm_is_contiguous(unsigned long var)
 	unsigned long maxcbm = MAX_CBM_LENGTH;
 	unsigned long first_bit, zero_bit;
 
-	if (!var)
+	if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
 		return false;
 
 	first_bit = find_next_bit(&var, maxcbm, 0);
@@ -199,7 +255,8 @@ static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
 	int err = 0;
 
 	if (!cbm_is_contiguous(cbmvalue)) {
-		pr_err("bitmask should have >= 1 bit and be contiguous\n");
+		pr_err("bitmask should have >=%d bits and be contiguous\n",
+			 min_bitmask_len);
 		err = -EINVAL;
 		goto out_err;
 	}
@@ -431,7 +488,7 @@ static int __init intel_rdt_late_init(void)
 	static struct clos_cbm_map *ccm;
 	size_t sizeb;
 
-	if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
+	if (!intel_cache_alloc_supported(c)) {
 		rdt_root_group.css.ss->disabled = 1;
 		return -ENODEV;
 	}
-- 
1.9.1



* [PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and usage guide
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
                   ` (5 preceding siblings ...)
  2015-05-11 19:02 ` [PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration Vikas Shivappa
@ 2015-05-11 19:02 ` Vikas Shivappa
  2015-05-13 16:52 ` [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
  7 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-11 19:02 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	mtosatti, vikas.shivappa

Adds a description of Cache Allocation Technology, an overview of the
kernel implementation and usage of the Cache Allocation cgroup interface.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 Documentation/cgroups/rdt.txt | 206 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 206 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..1af77d5
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,206 @@
+        RDT
+        ---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@linux.intel.com
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+  1.1 What is RDT and Cache allocation ?
+  1.2 Why is Cache allocation needed ?
+  1.3 Cache allocation implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(Cache allocation)
+===================================
+
+1.1 What is RDT and Cache allocation
+-----------------------
+
+Cache allocation is a part of Resource Director Technology(RDT) or 
+Platform Shared resource control which provides support to control 
+Platform shared resources like L3 cache. Currently Cache is the only 
+resource that is supported in RDT.
+More information can be found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+be overlapping with other 'subsets'.  This feature is used when
+allocating a line in cache ie when pulling new data into the cache.
+The programming of the h/w is done via programming  MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In today's processors the number of cores is continuously increasing,
+especially in large scale usage models where VMs are used, like
+webservers and datacenters. The number of cores increases the number
+of threads or workloads that can simultaneously be run. When
+multi-threaded applications, VMs and workloads run concurrently they
+compete for shared resources including L3 cache.
+
+The Cache allocation  enables more cache resources to be made available
+for higher priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput
+benefit.  (Refer to Section 17.15 in the Intel SDM)
+
+This technique may be useful in managing large computer systems with a
+large L3 cache. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 Cache allocation implementation Overview
+--------------------------------------------
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
+A CLOS(Class of service) is represented by a CLOSid. The CLOSid is
+internal to the kernel and not exposed to user.  Each cgroup would have
+one CBM and would just represent one cache 'subset'.
+
+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
+cgroup never fail.  When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent.  When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before.  The changing of 'cache_mask' may fail with -ENOSPC once
+the kernel runs out of the maximum CLOSids it can support.
+Users can create as many cgroups as they want, but having different
+CBMs at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).
+The kernel maintains a CLOSid<->CBM mapping which keeps a reference
+counter for each cgroup using a CLOSid.
+
+The tasks in the cgroup would get to fill the L3 cache represented by
+the cgroup's 'cache_mask' file.
+
+Root directory would have all available  bits set in 'cache_mask' file
+by default.
+
+1.4 Assignment of CBM,CLOS
+--------------------------
+
+The 'cache_mask' needs to be a subset of the parent node's
+'cache_mask'. Any contiguous subset of these bits (with a minimum of 2
+bits on hsw SKUs) may be set to indicate the cache mapping desired. The
+'cache_mask' between 2 directories can overlap. The 'cache_mask' would
+represent the cache 'subset' of the Cache allocation cgroup. For ex: on
+a system with 16 bits of max cbm bits, if the directory has the least
+significant 4 bits set in its 'cache_mask' file (meaning the 'cache_mask'
+is just 0xf), it would be allocated the right quarter of the last level
+cache, which means the tasks belonging to this Cache allocation cgroup
+can use that right quarter of the cache to fill. If it
+has the most significant 8 bits set, it would be allocated the left
+half of the cache (8 bits out of 16 represents 50%).
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill, and these tasks are not allowed to allocate
+space in other parts of the cache.
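+
+As an illustration, a validity check for a new 'cache_mask' along the
+lines described above could look as follows.  This is only a sketch
+using the kernel bitmap helpers; the helper name and the explicit
+parent-subset check are illustrative, not the exact code in this
+series (on hsw SKUs min_bits would be 2, otherwise 1):
+
+  static bool cbm_is_valid(unsigned long mask, unsigned long parent_mask,
+                           unsigned int max_cbm, unsigned int min_bits)
+  {
+          unsigned long first_bit, zero_bit;
+
+          /* Need at least min_bits bits set. */
+          if (bitmap_weight(&mask, max_cbm) < min_bits)
+                  return false;
+
+          /* All set bits must be contiguous. */
+          first_bit = find_next_bit(&mask, max_cbm, 0);
+          zero_bit = find_next_zero_bit(&mask, max_cbm, first_bit);
+          if (find_next_bit(&mask, max_cbm, zero_bit) < max_cbm)
+                  return false;
+
+          /* Must be a subset of the parent's 'cache_mask'. */
+          return bitmap_subset(&mask, &parent_mask, max_cbm);
+  }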
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During a context switch the kernel writes the CLOSid (internally
+maintained by the kernel) of the cgroup to which the incoming task
+belongs into the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
+when the CLOSid for the CPU actually changes, in order to minimize
+the latency incurred during a context switch.
+
+The following considerations are made for the PQR MSR write so that it
+has minimal impact on the scheduling hot path:
+- This path does not exist on any non-Intel platforms.
+- On Intel platforms, it does not exist by default unless CGROUP_RDT
+is enabled.
+- It remains a no-op when CGROUP_RDT is enabled but the Intel hardware
+does not support the feature.
+- When the feature is available, it still remains a no-op until the
+user manually creates a cgroup *and* assigns a new cache mask. Since a
+child node inherits its parent's cache mask, merely creating a cgroup
+has no scheduling hot path impact.
+- Per-CPU PQR values are cached and the MSR write is only done when a
+task with a different PQR value is scheduled on the CPU. Typically, if
+the task groups are bound to be scheduled on a set of CPUs, the number
+of MSR writes is greatly reduced (see the sketch below).
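+
+A simplified sketch of that scheduling hook, based on the code in this
+series (locking, RCU and error handling are omitted here for brevity):
+
+  static void rdt_sched_in_sketch(void)
+  {
+          struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+          struct intel_rdt *ir = task_rdt(current);
+
+          /* Skip the MSR write when the CLOSid is unchanged. */
+          if (ir->clos == state->clos)
+                  return;
+
+          /*
+           * IA32_PQR_ASSOC has the CLOSid in the high 32 bits and the
+           * CQM RMID in the low bits; rewrite the cached RMID so it is
+           * not corrupted.
+           */
+          wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+          state->clos = ir->clos;
+  }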
+
+2. Usage examples and syntax
+============================
+
+To check if Cache Allocation was enabled on your system
+
+  dmesg | grep -i intel_rdt
+
+should output: intel_rdt: Max bitmask length: xx,Max ClosIds: xx
+The length of the cache_mask and the number of CLOSids depend on the
+system you use.
+
+The following commands mount the cache allocation cgroup subsystem and
+create 2 directories. Please refer to Documentation/cgroups/cgroups.txt
+for details about how to use cgroups.
+
+  cd /sys/fs/cgroup
+  mkdir rdt
+  mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
+  cd rdt
+
+Create 2 rdt cgroups
+
+  mkdir group1
+  mkdir group2
+
+Following are some of the files in the directory
+
+  ls
+  rdt.cache_mask
+  tasks
+
+Say the cache is 2MB and the cbm supports 16 bits; then setting the
+mask below allocates the 'right quarter' (512KB) of the cache to group2.
+
+Edit the CBM for group2 to set the least significant 4 bits.  This
+allocates the 'right quarter' of the cache.
+
+  cd group2
+  /bin/echo 0xf > rdt.cache_mask
+
+
+Edit the CBM for group2 to set the least significant 8 bits. This
+allocates the right half of the cache to 'group2'.
+
+  cd group2
+  /bin/echo 0xff > rdt.cache_mask
+
+Assign tasks to the group2
+
+  /bin/echo PID1 > tasks
+  /bin/echo PID2 > tasks
+
+Meaning the threads PID1 and PID2 now get to fill the 'right half' of
+the cache, as they belong to cgroup group2.
+
+Create a group under group2
+
+  cd group2
+  mkdir group21
+  cat rdt.cache_mask
+   0xff - inherits the parent's mask.
+
+  /bin/echo 0xfff > rdt.cache_mask - fails with an error, as the mask
+  has to be a subset of the parent's mask.
+
+In order to restrict RDT cgroups to a specific set of CPUs, rdt can be
+co-mounted with cpusets.
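+
+For example, an illustrative co-mount (the mount point and device
+name here are arbitrary, not mandated by the kernel):
+
+  mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt /sys/fs/cgroup/rdt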
+
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support
  2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
                   ` (6 preceding siblings ...)
  2015-05-11 19:02 ` [PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and usage guide Vikas Shivappa
@ 2015-05-13 16:52 ` Vikas Shivappa
  7 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-13 16:52 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti


Hello PeterZ,

V7 mostly tries to address all the issues you raised in V6. Please let me 
know if you think any more changes should be done to this.

Thanks,
Vikas

On Mon, 11 May 2015, Vikas Shivappa wrote:

> This patch adds a cgroup subsystem to support the new Cache Allocation
> feature found in future Intel Xeon Intel processors. Cache Allocation is
> a sub-feature with in Resource Director Technology(RDT) feature. RDT
> provides support to control sharing of platform resources like L3
> cache.
>
> Cache Allocation Technology provides a way for the Software (OS/VMM) to
> restrict L3 cache allocation to a defined 'subset' of cache which may be
> overlapping with other 'subsets'.  This feature is used when allocating
> a line in cache ie when pulling new data into the cache.  The
> programming of the h/w is done via programming  MSRs. The patch series
> only supports to perform L3 cache allocation. However note that some of
> the datastructures and framework which is common to RDT are also
> implemented but its currently used only by the cache allocation
> sub-feature.
>
> In todays new processors the number of cores is continuously increasing
> which in turn increase the number of threads or workloads that can
> simultaneously be run. When multi-threaded
> applications run concurrently, they compete for shared
> resources including L3 cache.  At times, this L3 cache resource contention may
> result in inefficient space utilization. For example a higher priority thread
> may end up with lesser L3 cache resource or a cache sensitive app may not get
> optimal cache occupancy thereby degrading the performance.
> Cache Allocation kernel patch helps provides a framework for sharing L3 cache so that users
> can allocate the resource according to set requirements.
>
> More information about the feature can be found in the Intel SDM, Volume 3
> section 17.15.  SDM does not yet use the 'RDT' term yet and it is planned to be
> changed at a later time.
>
> *All the patches will apply on 4.1-rc0*.
>
> Changes in V7: (Thanks to feedback from PeterZ and Matt and following
> discussions)
> - changed lot of naming to reflect the data structures which are common
> to RDT and specific to Cache allocation.
> - removed all usage of 'cat'. replace with more friendly cache
> allocation
> - fixed lot of convention issues (whitespace, return paradigm etc)
> - changed the scheduling hook for RDT to not use a inline.
> - removed adding new scheduling hook and just reused the existing one
> similar to perf hook.
>
> Changes in V6:
> - rebased to 4.1-rc1 which has the CMT(cache monitoring) support included.
> - (Thanks to Marcelo's feedback).Fixed support for hot cpu handling for
> IA32_L3_QOS MSRs. Although during deep C states the MSR need not be restored
> this is needed when physically a new package is added.
> -some other coding convention changes including renaming to cache_mask using a
> refcnt to track the number of cgroups using a closid in clos_cbm map.
> -1b cbm support for non-hsw SKUs. HSW is an exception which needs the cache
> bit masks to be at least 2 bits.
>
> Changes in v5:
> - Added support to propagate the cache bit mask update for each
> package.
> - Removed the cache bit mask reference in the intel_rdt structure as
>  there was no need for that and we already maintain a separate
>  closid<->cbm mapping.
> - Made a few coding convention changes which include adding the
> assertion while freeing the CLOSID.
>
> Changes in V4:
> - Integrated with the latest V5 CMT patches.
> - Changed naming of cgroup to rdt(resource director technology) from
>  cat(cache allocation technology). This was done as the RDT is the
>  umbrella term for platform shared resources allocation. Hence in
>  future it would be easier to add resource allocation to the same
>  cgroup
> - Naming changes also applied to a lot of other data structures/APIs.
> - Added documentation on cgroup usage for cache allocation to address
>  a lot of questions from various academic and industry regarding
>  cache allocation usage.
>
> Changes in V3:
> - Implements a common software cache for IA32_PQR_MSR
> - Implements support for hsw Cache Allocation enumeration. This does not use the brand
> strings like earlier version but does a probe test. The probe test is done only
> on hsw family of processors
> - Made a few coding convention, name changes
> - Check for lock being held when ClosID manipulation happens
>
> Changes in V2:
> - Removed HSW specific enumeration changes. Plan to include it later as a
>  separate patch.
> - Fixed the code in prep_arch_switch to be specific for x86 and removed
>  x86 defines.
> - Fixed cbm_write to not write all 1s when a cgroup is freed.
> - Fixed one possible memory leak in init.
> - Changed some of manual bitmap
>  manipulation to use the predefined bitmap APIs to make code more readable
> - Changed name in sources from cqe to cat
> - Global cat enable flag changed to static_key and disabled cgroup early_init
>
> [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection
> [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
> [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
> [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
> [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
> [PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration
> [PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and user
> guide
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-11 19:02 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
@ 2015-05-15 19:18   ` Thomas Gleixner
  2015-05-18 17:59     ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-15 19:18 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Mon, 11 May 2015, Vikas Shivappa wrote:
>  arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++
>  arch/x86/kernel/cpu/intel_rdt.c  | 112 +++++++++++++++++++++++++++++++++++++--
>  include/linux/cgroup_subsys.h    |   4 ++

Where is the Documentation for the new cgroup subsystem?

> +struct rdt_subsys_info {
> +	/* Clos Bitmap to keep track of available CLOSids.*/

If you want to document your struct members, please use KernelDoc
formatting, so tools can pick it up as well.

> +	unsigned long *closmap;
> +};
> +
> +struct intel_rdt {
> +	struct cgroup_subsys_state css;

Ditto

> +	/* Class of service for the cgroup.*/
> +	unsigned int clos;
> +};
> +
> +struct clos_cbm_map {
> +	unsigned long cache_mask;
> +	unsigned int clos_refcnt;
> +};

> +/*
> + * ccmap maintains 1:1 mapping between CLOSid and cache bitmask.
> + */
> +static struct clos_cbm_map *ccmap;
> +static struct rdt_subsys_info rdtss_info;
> +static DEFINE_MUTEX(rdt_group_mutex);
> +struct intel_rdt rdt_root_group;
> +
> +static void intel_rdt_free_closid(unsigned int clos)
> +{
> +	lockdep_assert_held(&rdt_group_mutex);
> +
> +	clear_bit(clos, rdtss_info.closmap);
> +}
> +
> +static void __clos_get(unsigned int closid)

What's the purpose of these double underscores?

> +{
> +	struct clos_cbm_map *ccm = &ccmap[closid];
> +
> +	lockdep_assert_held(&rdt_group_mutex);

Do we really need all these lockdep asserts for a handfull of call
sites?

> +	ccm->clos_refcnt += 1;

What's wrong with:

       ccmap[closid].clos_refcnt++;

Hmm?

> +}
> +
> +static void __clos_put(unsigned int closid)
> +{
> +	struct clos_cbm_map *ccm = &ccmap[closid];
> +
> +	lockdep_assert_held(&rdt_group_mutex);
> +	WARN_ON(!ccm->clos_refcnt);

So we have a warning but we do not act on it.

> +
> +	ccm->clos_refcnt -= 1;
> +	if (!ccm->clos_refcnt)

You can do that in a single line w/o the pointer magic. We want easy
to read and small code rather than pointlessly blown up helper
functions which just eat screen space.

> +		intel_rdt_free_closid(closid);
> +}
> +
> +static struct cgroup_subsys_state *
> +intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> +	struct intel_rdt *parent = css_rdt(parent_css);
> +	struct intel_rdt *ir;
> +
> +	/*
> +	 * Cannot return failure on systems with no Cache Allocation
> +	 * as the cgroup_init does not handle failures gracefully.

This comment is blatantly wrong. 5 lines further down you return
-ENOMEM. Now what?

> +	 */
> +	if (!parent)
> +		return &rdt_root_group.css;
> +
> +	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
> +	if (!ir)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_lock(&rdt_group_mutex);
> +	ir->clos = parent->clos;
> +	__clos_get(ir->clos);
> +	mutex_unlock(&rdt_group_mutex);
> +
> +	return &ir->css;
> +}
> +
> +static void intel_rdt_css_free(struct cgroup_subsys_state *css)
> +{
> +	struct intel_rdt *ir = css_rdt(css);
> +
> +	mutex_lock(&rdt_group_mutex);
> +	__clos_put(ir->clos);
> +	kfree(ir);
> +	mutex_unlock(&rdt_group_mutex);
> +}
>  
>  static int __init intel_rdt_late_init(void)
>  {
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
> -	int maxid, max_cbm_len;
> +	static struct clos_cbm_map *ccm;
> +	int maxid, max_cbm_len, err = 0;
> +	size_t sizeb;
>  
> -	if (!cpu_has(c, X86_FEATURE_CAT_L3))
> +	if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
> +		rdt_root_group.css.ss->disabled = 1;
>  		return -ENODEV;
> -
> +	}
>  	maxid = c->x86_rdt_max_closid;
>  	max_cbm_len = c->x86_rdt_max_cbm_len;
>  
> +	sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
> +	rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);

What's the point of this bitmap? In later patches you have a
restriction to 64bits and the SDM says that CBM_LEN is limited to 32.

So where is the point for allocating a bitmap?

> +	if (!rdtss_info.closmap) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	sizeb = maxid * sizeof(struct clos_cbm_map);
> +	ccmap = kzalloc(sizeb, GFP_KERNEL);
> +	if (!ccmap) {
> +		kfree(rdtss_info.closmap);
> +		err = -ENOMEM;
> +		goto out_err;

What's wrong with return -ENOMEM? We only use goto if we have to
cleanup stuff, certainly not for just returning err.

> +	}
> +
> +	set_bit(0, rdtss_info.closmap);
> +	rdt_root_group.clos = 0;
> +	ccm = &ccmap[0];
> +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
> +	ccm->clos_refcnt = 1;
> +
>  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);

We surely do not want to sprinkle these all over dmesg.

+out_err:
 
-       return 0;
+       return err;


Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-11 19:02 ` [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa
@ 2015-05-15 19:25   ` Thomas Gleixner
  2015-05-18 19:17     ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-15 19:25 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Mon, 11 May 2015, Vikas Shivappa wrote:
> @@ -35,6 +36,42 @@ static struct rdt_subsys_info rdtss_info;
>  static DEFINE_MUTEX(rdt_group_mutex);
>  struct intel_rdt rdt_root_group;
>  
> +/*
> + * Mask of CPUs for writing CBM values. We only need one per-socket.

One mask? One CPU? One what? Please add comments which are
understandable and useful for people who did NOT write that code.

> + */
> +static cpumask_t rdt_cpumask;

> +static void __cpu_cbm_update(void *info)
> +{
> +	unsigned int closid = *((unsigned int *)info);

So all you want is to transfer closid, right? Why having a pointer
instead of just doing

	on_each_cpu_mask(...., (void *) id);

and
	id = (unsigned long) id;

Looks too simple and readable, right?

> +/*
> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> + * which are one per CLOSid, except IA32_L3_MASK_0 on the current package.
> + * @cpu : the cpu on which the mask is updated.
> + */
> +static inline void cbm_update_msrs(int cpu)
> +{
> +	int maxid = boot_cpu_data.x86_rdt_max_closid;
> +	unsigned int i;
> +
> +	if (WARN_ON(cpu != smp_processor_id()))
> +		return;
> +
> +	for (i = 1; i < maxid; i++) {

Lacks a comment why this starts with 1 and not with 0 as one would expect.

> +		if (ccmap[i].clos_refcnt)
> +			__cpu_cbm_update(&i);
> +	}
> +}

This is only used by the cpu hotplug code. So why does it not live
close to it? Because its more fun to search for functions than having
them just in place?

> +/*
> + * intel_cache_alloc_cbm_write() - Validates and writes the
> + * cache bit mask(cbm) to the IA32_L3_MASK_n
> + * and also store the same in the ccmap.
> + *
> + * CLOSids are reused for cgroups which have same bitmask.
> + * - This helps to use the scant CLOSids optimally.
> + * - This also implies that at context switch write
> + * to PQR-MSR is done only when a task with a
> + * different bitmask is scheduled in.
> + */
> +static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,

Can you please drop these intel_ prefixes? They provide no value and
just eat space.

> +				 struct cftype *cft, u64 cbmvalue)
> +{
> +	u32 max_cbm = boot_cpu_data.x86_rdt_max_cbm_len;
> +	struct intel_rdt *ir = css_rdt(css);
> +	unsigned long cache_mask, max_mask;
> +	unsigned long *cbm_tmp;
> +	unsigned int closid;
> +	ssize_t err = 0;
> +
> +	if (ir == &rdt_root_group)
> +		return -EPERM;
> +	bitmap_set(&max_mask, 0, max_cbm);
> +
> +	/*
> +	 * Need global mutex as cbm write may allocate a closid.
> +	 */
> +	mutex_lock(&rdt_group_mutex);
> +	bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
> +	cbm_tmp = &ccmap[ir->clos].cache_mask;
> +
> +	if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
> +		goto out;
> +

Sigh, what's the point of this bitmap shuffling? Nothing as far as I
can tell, because you first set all bits of maxmask and then AND
cbmvalue.

All you achieve is, that you truncate cbmvalue to max_cbm bits. So if
cbmvalue has any of the invalid bits set, you happily ignore that.

This whole bitmap usage is just making the code obfuscated.

     u64 max_mask = (1ULL << max_cbm) - 1;

     if (cbm_value & ~max_mask)
     	   return -EINVAL;

     if (cbm_value == (u64)ccmap[ir->clos].cache_mask)
     	   return 0;	   	   

No bitmap_set, bitmap_and nothing. Simple and readable, right?

The bitmap would only be necessary, IF max_cbm is > 64. But that's not
the case and nothing in your code would survive ma_ cbm > 64. And as I
said before CBM_LEN is 32bit for now and I serioulsy doubt, that it
can ever go over 64bit simply because that would not fit into the mask
MSRs.

So can you please get rid of that bitmap nonsense including the
dynamic allocation and  use simple and readable code constructs?

Just for the record:

     struct bla {
     	    u64			cbm_mask;
     };

versus:

     struct bla {
     	    unsigned long	*cbm_mask;
     };
	    
   plus

     bla.cbm_mask = kzalloc(sizeof(u64));

Is just utter waste of data and text space.

> +	/*
> +	 * At this point we are sure to change the cache_mask.Hence release the
> +	 * reference to the current CLOSid and try to get a reference for
> +	 * a different CLOSid.
> +	 */
> +	__clos_put(ir->clos);

So you drop the reference, before you make sure that you can get a new one?

> +	if (cbm_search(cache_mask, &closid)) {
> +		ir->clos = closid;
> +		__clos_get(closid);
> +	} else {
> +		err = intel_rdt_alloc_closid(ir);
> +		if (err)
> +			goto out;

What happens if intel_rdt_alloc_closid() fails? ir->clos is unchanged,
but you have no reference anymore. The next call to this function or
anything else which calls __clos_put() will trigger the underflow
warning and the warning will be completely useless because the
wreckage happened before that.

> +static inline bool intel_rdt_update_cpumask(int cpu)
> +{
> +	int phys_id = topology_physical_package_id(cpu);
> +	struct cpumask *mask = &rdt_cpumask;
> +	int i;
> +
> +	for_each_cpu(i, mask) {
> +		if (phys_id == topology_physical_package_id(i))
> +			return false;
> +	}

You must be kidding. You know how many packages are possible. What's
wrong with having a bitmap of physical packages which have a CPU
handling it?

	id =  topology_physical_package_id(cpu);
	if (!__test_and_set_bit(id, &package_map)) {
		cpumask_set_cpu(cpu, &rdt_cpumask);
		cbm_update_msrs(cpu);
	}

Four lines of code without loops and hoops. And here is a bitmap the
proper data structure.
	
> +static void intel_rdt_cpu_exit(unsigned int cpu)
> +{
> +	int phys_id = topology_physical_package_id(cpu);
> +	int i;
> +
> +	mutex_lock(&rdt_group_mutex);
> +	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> +		mutex_unlock(&rdt_group_mutex);
> +		return;
> +	}
> +
> +	for_each_online_cpu(i) {
> +		if (i == cpu)
> +			continue;
> +
> +		if (phys_id == topology_physical_package_id(i)) {
> +			cpumask_set_cpu(i, &rdt_cpumask);
> +			break;
> +		}
> +	}

There is a better way than iterating over all online cpus ....

>  
> +static struct cftype rdt_files[] = {
> +	{
> +		.name = "cache_mask",
> +		.seq_show = intel_cache_alloc_cbm_read,
> +		.write_u64 = intel_cache_alloc_cbm_write,
> +		.mode = 0666,
> +	},
> +	{ }	/* terminate */
> +};

Where is the Documentation for that file? What's the content, format,
error codes ....

Sigh.

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-05-11 19:02 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
@ 2015-05-15 19:39   ` Thomas Gleixner
  2015-05-18 18:01     ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-15 19:39 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Mon, 11 May 2015, Vikas Shivappa wrote:
>  struct rdt_subsys_info {
>  	/* Clos Bitmap to keep track of available CLOSids.*/
> @@ -24,6 +30,11 @@ struct clos_cbm_map {
>  	unsigned int clos_refcnt;
>  };
>  
> +static inline bool rdt_enabled(void)
> +{
> +	return static_key_false(&rdt_enable_key);

So again, why this useless helper function for a single call site?

> +static inline void intel_rdt_sched_in(void)
> +{
> +	if (rdt_enabled())
> +		__rdt_sched_in();
> +}
  
> +void __rdt_sched_in(void)
> +{
> +	struct task_struct *task = current;
> +	struct intel_rdt *ir;
> +	unsigned int clos;
> +
> +	/*
> +	 * This needs to be fixed
> +	 * to cache the whole PQR instead of just CLOSid.
> +	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
> +	 * Should not write a 0 to the low 10 bits of PQR
> +	 * and corrupt RMID.

And why is this not fixed __BEFORE__ this patch? You can do the
changes to struct intel_cqm_state in a seperate patch and then do the
proper implementation from the beginning instead of providing a half
broken variant which gets replaced in the next patch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-11 19:02 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
@ 2015-05-15 20:15   ` Thomas Gleixner
  2015-05-20 17:18     ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-15 20:15 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Mon, 11 May 2015, Vikas Shivappa wrote:

> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> 
> Conflicts:
> 	arch/x86/kernel/cpu/perf_event_intel_cqm.c

And that's interesting for what?

> --- /dev/null
> +++ b/arch/x86/include/asm/rdt_common.h

> @@ -0,0 +1,13 @@
> +#ifndef _X86_RDT_H_
> +#define _X86_RDT_H_
> +
> +#define MSR_IA32_PQR_ASSOC	0x0c8f
> +
> +struct intel_pqr_state {
> +	raw_spinlock_t	lock;
> +	int		rmid;
> +	int		clos;

Those want to be u32. We have no type checkes there, but u32 is the
proper data type for wrmsr.

> +	int		cnt;
> +};
> +

What's wrong with having this in intel_rdt.h? It seems you're a big
fan of sprinkling stuff all over the place so reading becomes a
chasing game.

>  {
>  	struct task_struct *task = current;
>  	struct intel_rdt *ir;
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	unsigned long flags;
>  
> +	raw_spin_lock_irqsave(&state->lock, flags);

finish_arch_switch() is called with interrupts disabled already ...

>  	rcu_read_lock();

So we now have a spin_lock() and rcu_read_lock() and no explanation
what is protecting what.

>  	ir = task_rdt(task);
> -	if (ir->clos == clos) {
> +	if (ir->clos == state->clos) {

And of course taking the spin lock for checking state->clos is
complete and utter nonsense.

state->clos can only be changed by this code and the only reason why
we need the lock is to protect against concurrent modification of
state->rmid.

So the check for ir->clos == state->clos can be done lockless.

And I seriously doubt, that state->lock is necessary at all.

Q: What is it protecting?
A: state->clos, state->rmid, state->cnt

Q: What's the context?
A: Per cpu context. The per cpu pqr state is NEVER modified from a
   remote cpu.

Q: What is the lock for?
A: Nothing.

Q: Why?
A: Because interrupt disabled regions protect per cpu state perfectly
   fine and there is is no memory ordering issue which would require a
   lock or barrier either.

Peter explained it to you several times already that context switch is
one the most sensitive hot pathes where we care about every cycle. But
you just go ahead and mindlessly create pointless overhead.

> +	/*
> +	 * PQR has closid in high 32 bits and CQM-RMID
> +	 * in low 10 bits. Rewrite the exsting rmid from
> +	 * software cache.

This comment wouldnt be necessary if you would have proper documented
struct pqr_state.

> +	 */
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
> +	state->clos = ir->clos;
>  	rcu_read_unlock();
> +	raw_spin_unlock_irqrestore(&state->lock, flags);
> +
>  }

> -static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

With CONFIG_PERF=n and CONFIG_CGROUP_RDT=y the linker will fail.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-15 19:18   ` Thomas Gleixner
@ 2015-05-18 17:59     ` Vikas Shivappa
  2015-05-18 18:41       ` Thomas Gleixner
  2015-05-18 18:52       ` Thomas Gleixner
  0 siblings, 2 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 17:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Fri, 15 May 2015, Thomas Gleixner wrote:

> On Mon, 11 May 2015, Vikas Shivappa wrote:
>>  arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++
>>  arch/x86/kernel/cpu/intel_rdt.c  | 112 +++++++++++++++++++++++++++++++++++++--
>>  include/linux/cgroup_subsys.h    |   4 ++
>
> Where is the Documentation for the new cgroup subsystem?

Documentation/cgroups/rdt.txt

>
>> +struct rdt_subsys_info {
>> +	/* Clos Bitmap to keep track of available CLOSids.*/
>
> If you want to document your struct members, please use KernelDoc
> formatting, so tools can pick it up as well.
>

Will fix

>> +	unsigned long *closmap;
>> +};
>> +
>> +struct intel_rdt {
>> +	struct cgroup_subsys_state css;
>
> Ditto

Will fix

>
>> +	/* Class of service for the cgroup.*/
>> +	unsigned int clos;
>> +};
>> +
>> +struct clos_cbm_map {
>> +	unsigned long cache_mask;
>> +	unsigned int clos_refcnt;
>> +};
>
>> +/*
>> + * ccmap maintains 1:1 mapping between CLOSid and cache bitmask.
>> + */
>> +static struct clos_cbm_map *ccmap;
>> +static struct rdt_subsys_info rdtss_info;
>> +static DEFINE_MUTEX(rdt_group_mutex);
>> +struct intel_rdt rdt_root_group;
>> +
>> +static void intel_rdt_free_closid(unsigned int clos)
>> +{
>> +	lockdep_assert_held(&rdt_group_mutex);
>> +
>> +	clear_bit(clos, rdtss_info.closmap);
>> +}
>> +
>> +static void __clos_get(unsigned int closid)
>
> What's the purpose of these double underscores?
>

Similar to __get_rmid.

>> +{
>> +	struct clos_cbm_map *ccm = &ccmap[closid];
>> +
>> +	lockdep_assert_held(&rdt_group_mutex);
>
> Do we really need all these lockdep asserts for a handfull of call
> sites?

Well, we still have some calls some may be frequent depending on the usage.. 
should the assert decided based on number of times its called ?

>
>> +	ccm->clos_refcnt += 1;
>
> What's wrong with:
>
>       ccmap[closid].clos_refcnt++;
>
> Hmm?

Yes , this can be fixed.

>
>> +}
>> +
>> +static void __clos_put(unsigned int closid)
>> +{
>> +	struct clos_cbm_map *ccm = &ccmap[closid];
>> +
>> +	lockdep_assert_held(&rdt_group_mutex);
>> +	WARN_ON(!ccm->clos_refcnt);
>
> So we have a warning but we do not act on it.

Ok , will change to printing the WARN_ON debug message and just returning.

>
>> +
>> +	ccm->clos_refcnt -= 1;
>> +	if (!ccm->clos_refcnt)
>
> You can do that in a single line w/o the pointer magic. We want easy
> to read and small code rather than pointlessly blown up helper
> functions which just eat screen space.

Will change this.

>
>> +		intel_rdt_free_closid(closid);
>> +}
>> +
>> +static struct cgroup_subsys_state *
>> +intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
>> +{
>> +	struct intel_rdt *parent = css_rdt(parent_css);
>> +	struct intel_rdt *ir;
>> +
>> +	/*
>> +	 * Cannot return failure on systems with no Cache Allocation
>> +	 * as the cgroup_init does not handle failures gracefully.
>
> This comment is blatantly wrong. 5 lines further down you return
> -ENOMEM. Now what?
>

Well , this is for cgroup_init called with parent=null which is when its 
initializing the cgroup subsystem. I cant return error in this case but I can 
otherwise.

static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
{

 	/* Create the root cgroup state for this subsystem */
 	ss->root = &cgrp_dfl_root;
 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
 	/* We don't handle early failures gracefully */
 	BUG_ON(IS_ERR(css));

>> +	 */
>> +	if (!parent)
>> +		return &rdt_root_group.css;
>> +
>> +	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
>> +	if (!ir)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	mutex_lock(&rdt_group_mutex);
>> +	ir->clos = parent->clos;
>> +	__clos_get(ir->clos);
>> +	mutex_unlock(&rdt_group_mutex);
>> +
>> +	return &ir->css;
>> +}
>> +
>> +static void intel_rdt_css_free(struct cgroup_subsys_state *css)
>> +{
>> +	struct intel_rdt *ir = css_rdt(css);
>> +
>> +	mutex_lock(&rdt_group_mutex);
>> +	__clos_put(ir->clos);
>> +	kfree(ir);
>> +	mutex_unlock(&rdt_group_mutex);
>> +}
>>
>>  static int __init intel_rdt_late_init(void)
>>  {
>>  	struct cpuinfo_x86 *c = &boot_cpu_data;
>> -	int maxid, max_cbm_len;
>> +	static struct clos_cbm_map *ccm;
>> +	int maxid, max_cbm_len, err = 0;
>> +	size_t sizeb;
>>
>> -	if (!cpu_has(c, X86_FEATURE_CAT_L3))
>> +	if (!cpu_has(c, X86_FEATURE_CAT_L3)) {
>> +		rdt_root_group.css.ss->disabled = 1;
>>  		return -ENODEV;
>> -
>> +	}
>>  	maxid = c->x86_rdt_max_closid;
>>  	max_cbm_len = c->x86_rdt_max_cbm_len;
>>
>> +	sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
>> +	rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
>
> What's the point of this bitmap? In later patches you have a
> restriction to 64bits and the SDM says that CBM_LEN is limited to 32.
>
> So where is the point for allocating a bitmap?

The cache bitmask is really a bitmask where every bit represents a cache way , 
so its way based mapping . Although currently the max is 32 bits we still need 
to treat it as a bitmask.

In the first patch version you are the one who suggested to use all the bitmap 
functions to check the presence of contiguous bits (although I was already using 
the bitmaps, I had not used for some cases). So i made the change and other 
similar code was changed to using bitmap/bit manipulation APIs. There are 
scenarios where in we need to check for subset of 
bitmasks and other cases but agree they can all be done without the bitmask APIs 
as well. But now your comment contradicts the old comment ?

>
>> +	if (!rdtss_info.closmap) {
>> +		err = -ENOMEM;
>> +		goto out_err;
>> +	}
>> +
>> +	sizeb = maxid * sizeof(struct clos_cbm_map);
>> +	ccmap = kzalloc(sizeb, GFP_KERNEL);
>> +	if (!ccmap) {
>> +		kfree(rdtss_info.closmap);
>> +		err = -ENOMEM;
>> +		goto out_err;
>
> What's wrong with return -ENOMEM? We only use goto if we have to
> cleanup stuff, certainly not for just returning err.

This was due to PeterZ's feedback that the return paradigm needs to not have 
too many exit points or return a lot of times from the middle of code..
Both contradict now ?

>
>> +	}
>> +
>> +	set_bit(0, rdtss_info.closmap);
>> +	rdt_root_group.clos = 0;
>> +	ccm = &ccmap[0];
>> +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
>> +	ccm->clos_refcnt = 1;
>> +
>>  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len, maxid);
>
> We surely do not want to sprinkle these all over dmesg.

This is just printed once! how is that sprinke all over?   - we have a dmsg 
print for Cache monitoring as well when cqm is enabled.

>
> +out_err:
>
> -       return 0;
> +       return err;
>
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-05-15 19:39   ` Thomas Gleixner
@ 2015-05-18 18:01     ` Vikas Shivappa
  2015-05-18 18:45       ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 18:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, matt.fleming, will.auld, peter.zijlstra,
	h.peter.anvin, kanaka.d.juvva, mtosatti



On Fri, 15 May 2015, Thomas Gleixner wrote:

> On Mon, 11 May 2015, Vikas Shivappa wrote:
>>  struct rdt_subsys_info {
>>  	/* Clos Bitmap to keep track of available CLOSids.*/
>> @@ -24,6 +30,11 @@ struct clos_cbm_map {
>>  	unsigned int clos_refcnt;
>>  };
>>
>> +static inline bool rdt_enabled(void)
>> +{
>> +	return static_key_false(&rdt_enable_key);
>
> So again, why this useless helper function for a single call site?
>
>> +static inline void intel_rdt_sched_in(void)
>> +{
>> +	if (rdt_enabled())
>> +		__rdt_sched_in();
>> +}
>
>> +void __rdt_sched_in(void)
>> +{
>> +	struct task_struct *task = current;
>> +	struct intel_rdt *ir;
>> +	unsigned int clos;
>> +
>> +	/*
>> +	 * This needs to be fixed
>> +	 * to cache the whole PQR instead of just CLOSid.
>> +	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
>> +	 * Should not write a 0 to the low 10 bits of PQR
>> +	 * and corrupt RMID.
>
> And why is this not fixed __BEFORE__ this patch? You can do the
> changes to struct intel_cqm_state in a seperate patch and then do the
> proper implementation from the beginning instead of providing a half
> broken variant which gets replaced in the next patch.

Ok , can fix both items in your comments. Reason I had it seperately is that the 
cache affects both cmt and cache allocation patches.

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 17:59     ` Vikas Shivappa
@ 2015-05-18 18:41       ` Thomas Gleixner
  2015-05-18 19:20         ` Borislav Petkov
  2015-05-18 19:44         ` Vikas Shivappa
  2015-05-18 18:52       ` Thomas Gleixner
  1 sibling, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-18 18:41 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Mon, 18 May 2015, Vikas Shivappa wrote:
> > > +static void __clos_get(unsigned int closid)
> > 
> > What's the purpose of these double underscores?
> > 
> 
> Similar to __get_rmid.

Errm. We use double underscore for such cases:

__bla()
{
	do_stuff();
}
	
bla()
{
	lock();
	__bla();
	unlock();
}

So you have two sorts of callers. Locked and unlocked. But I don't see
something like this. It's just random underscores for no value.

> > > +{
> > > +	struct clos_cbm_map *ccm = &ccmap[closid];
> > > +
> > > +	lockdep_assert_held(&rdt_group_mutex);
> > 
> > Do we really need all these lockdep asserts for a handfull of call
> > sites?
> 
> Well, we still have some calls some may be frequent depending on the usage..
> should the assert decided based on number of times its called ?

We add these things for complex locking scenarios, but for single
callsites where locking is obvious its not really valuable. But that's
the least of my worries.
 
> > > +static struct cgroup_subsys_state *
> > > +intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
> > > +{
> > > +	struct intel_rdt *parent = css_rdt(parent_css);
> > > +	struct intel_rdt *ir;
> > > +
> > > +	/*
> > > +	 * Cannot return failure on systems with no Cache Allocation
> > > +	 * as the cgroup_init does not handle failures gracefully.
> > 
> > This comment is blatantly wrong. 5 lines further down you return
> > -ENOMEM. Now what?
> > 
> 
> Well , this is for cgroup_init called with parent=null which is when its
> initializing the cgroup subsystem. I cant return error in this case but I can
> otherwise.

And then you run into the same BUG_ON ...

> static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
> {
> 
> 	/* Create the root cgroup state for this subsystem */
> 	ss->root = &cgrp_dfl_root;
> 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
> 	/* We don't handle early failures gracefully */
> 	BUG_ON(IS_ERR(css));

I know. But still the comment is misleading.

  	/* 
	 * cgroup_init() expects  valid css pointer. Return
	 * the rdt_root_group.css instead of a failure code.
	 */

Can you see the difference?
 
> > > +	 */
> > > +	if (!parent)
> > > +		return &rdt_root_group.css;
> > > +
> > > +	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
> > > +	if (!ir)
> > > +		return ERR_PTR(-ENOMEM);

And why can't you return something useful here instead of letting the
thing run into a BUG?

> > >  	maxid = c->x86_rdt_max_closid;
> > >  	max_cbm_len = c->x86_rdt_max_cbm_len;
> > > 
> > > +	sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
> > > +	rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
> > 
> > What's the point of this bitmap? In later patches you have a
> > restriction to 64bits and the SDM says that CBM_LEN is limited to 32.
> > 
> > So where is the point for allocating a bitmap?
> 
> The cache bitmask is really a bitmask where every bit represents a cache way ,
> so its way based mapping . Although currently the max is 32 bits we still need
> to treat it as a bitmask.
> 
> In the first patch version you are the one who suggested to use all the bitmap
> functions to check the presence of contiguous bits (although I was already
> using the bitmaps, I had not used for some cases). So i made the change and
> other similar code was changed to using bitmap/bit manipulation APIs. There
> are scenarios where in we need to check for subset of bitmasks and other cases
> but agree they can all be done without the bitmask APIs as well. But now your
> comment contradicts the old comment ?

Oh well. You can still use bitmap functions on an u64 where
appropriate.
 
> > 
> > > +	if (!rdtss_info.closmap) {
> > > +		err = -ENOMEM;
> > > +		goto out_err;
> > > +	}
> > > +
> > > +	sizeb = maxid * sizeof(struct clos_cbm_map);
> > > +	ccmap = kzalloc(sizeb, GFP_KERNEL);
> > > +	if (!ccmap) {
> > > +		kfree(rdtss_info.closmap);
> > > +		err = -ENOMEM;
> > > +		goto out_err;
> > 
> > What's wrong with return -ENOMEM? We only use goto if we have to
> > cleanup stuff, certainly not for just returning err.
> 
> This was due to PeterZ's feedback that the return paradigm needs to not have
> too many exit points or return a lot of times from the middle of code..
> Both contradict now ?

There are different opinions about that. But again, that's the least
of my worries.
 
> > 
> > > +	}
> > > +
> > > +	set_bit(0, rdtss_info.closmap);
> > > +	rdt_root_group.clos = 0;
> > > +	ccm = &ccmap[0];
> > > +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
> > > +	ccm->clos_refcnt = 1;
> > > +
> > >  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len,
> > > maxid);
> > 
> > We surely do not want to sprinkle these all over dmesg.
> 
> This is just printed once! how is that sprinke all over?   - we have a dmsg
> print for Cache monitoring as well when cqm is enabled.

Sorry, mapped that to the wrong function. Though the message itself is
horrible. 

	  "Max bitmask length:32,Max ClosIds: 16"	

With some taste and formatting applied this would read:

     	  "Max. bitmask length: 32, max. closids: 16"

Can you spot the difference?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-05-18 18:01     ` Vikas Shivappa
@ 2015-05-18 18:45       ` Thomas Gleixner
  2015-05-18 19:18         ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-18 18:45 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Mon, 18 May 2015, Vikas Shivappa wrote:
> On Fri, 15 May 2015, Thomas Gleixner wrote:
> > On Mon, 11 May 2015, Vikas Shivappa wrote:
> > > +	/*
> > > +	 * This needs to be fixed
> > > +	 * to cache the whole PQR instead of just CLOSid.
> > > +	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
> > > +	 * Should not write a 0 to the low 10 bits of PQR
> > > +	 * and corrupt RMID.
> > 
> > And why is this not fixed __BEFORE__ this patch? You can do the
> > changes to struct intel_cqm_state in a seperate patch and then do the
> > proper implementation from the beginning instead of providing a half
> > broken variant which gets replaced in the next patch.
> 
> Ok , can fix both items in your comments. Reason I had it seperately is that
> the cache affects both cmt and cache allocation patches.

And that's the wrong reason. Sure it affects both, but we first
prepare the changes to the existing code and then build new stuff on
top of it not the other way round. Building the roof before the
basement is almost never a good idea.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 17:59     ` Vikas Shivappa
  2015-05-18 18:41       ` Thomas Gleixner
@ 2015-05-18 18:52       ` Thomas Gleixner
  2015-05-18 19:27         ` Vikas Shivappa
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-18 18:52 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Mon, 18 May 2015, Vikas Shivappa wrote:
> On Fri, 15 May 2015, Thomas Gleixner wrote:
> 
> > On Mon, 11 May 2015, Vikas Shivappa wrote:
> > >  arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++
> > >  arch/x86/kernel/cpu/intel_rdt.c  | 112
> > > +++++++++++++++++++++++++++++++++++++--
> > >  include/linux/cgroup_subsys.h    |   4 ++
> > 
> > Where is the Documentation for the new cgroup subsystem?
> 
> Documentation/cgroups/rdt.txt

That's where it should be, but according to the diffstat it is not. It
comes as the last patch in the series, but it should be there with the
patch which introduces the subsystem or even before that.

And the features which are added in subsequent patches should be
reflected with updates to the documentation.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-15 19:25   ` Thomas Gleixner
@ 2015-05-18 19:17     ` Vikas Shivappa
  2015-05-18 20:15       ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 19:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Fri, 15 May 2015, Thomas Gleixner wrote:

> On Mon, 11 May 2015, Vikas Shivappa wrote:
>> @@ -35,6 +36,42 @@ static struct rdt_subsys_info rdtss_info;
>>  static DEFINE_MUTEX(rdt_group_mutex);
>>  struct intel_rdt rdt_root_group;
>>
>> +/*
>> + * Mask of CPUs for writing CBM values. We only need one per-socket.
>
> One mask? One CPU? One what? Please add comments which are
> understandable and useful for people who did NOT write that code.

When there is already upstream code with an exactly consistent comment - 
if that made sense, this should too?

/*
  * Mask of CPUs for reading CQM values. We only need one per-socket.
  */
static cpumask_t cqm_cpumask;

>
>> + */
>> +static cpumask_t rdt_cpumask;
>
>> +static void __cpu_cbm_update(void *info)
>> +{
>> +	unsigned int closid = *((unsigned int *)info);
>
> So all you want is to transfer closid, right? Why having a pointer
> instead of just doing
>
> 	on_each_cpu_mask(...., (void *) id);
>
> and
> 	id = (unsigned long) id;

will fix.

>
> Looks too simple and readable, right?
>
>> +/*
>> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
>> + * which are one per CLOSid, except IA32_L3_MASK_0 on the current package.
>> + * @cpu : the cpu on which the mask is updated.
>> + */
>> +static inline void cbm_update_msrs(int cpu)
>> +{
>> +	int maxid = boot_cpu_data.x86_rdt_max_closid;
>> +	unsigned int i;
>> +
>> +	if (WARN_ON(cpu != smp_processor_id()))
>> +		return;
>> +
>> +	for (i = 1; i < maxid; i++) {
>
> Lacks a comment why this starts with 1 and not with 0 as one would expect.

It's in the function comment just above - "except IA32_L3_MASK_0 on the 
current package."

>
>> +		if (ccmap[i].clos_refcnt)
>> +			__cpu_cbm_update(&i);
>> +	}
>> +}
>
> This is only used by the cpu hotplug code. So why does it not live
> close to it?

ok will fix

Because its more fun to search for functions than having
> them just in place?
>
>> +/*
>> + * intel_cache_alloc_cbm_write() - Validates and writes the
>> + * cache bit mask(cbm) to the IA32_L3_MASK_n
>> + * and also store the same in the ccmap.
>> + *
>> + * CLOSids are reused for cgroups which have same bitmask.
>> + * - This helps to use the scant CLOSids optimally.
>> + * - This also implies that at context switch write
>> + * to PQR-MSR is done only when a task with a
>> + * different bitmask is scheduled in.
>> + */
>> +static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
>
> Can you please drop these intel_ prefixes? They provide no value and
> just eat space.

Well, I got ample comments which wanted me to specify intel, as this 
feature is specific to Intel.
The cache monitoring code has similar prefixes as well - is that perf 
specific then? I just don't want people to come back and ask to make it 
clear that things are Intel specific.

>
>> +				 struct cftype *cft, u64 cbmvalue)
>> +{
>> +	u32 max_cbm = boot_cpu_data.x86_rdt_max_cbm_len;
>> +	struct intel_rdt *ir = css_rdt(css);
>> +	unsigned long cache_mask, max_mask;
>> +	unsigned long *cbm_tmp;
>> +	unsigned int closid;
>> +	ssize_t err = 0;
>> +
>> +	if (ir == &rdt_root_group)
>> +		return -EPERM;
>> +	bitmap_set(&max_mask, 0, max_cbm);
>> +
>> +	/*
>> +	 * Need global mutex as cbm write may allocate a closid.
>> +	 */
>> +	mutex_lock(&rdt_group_mutex);
>> +	bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask, max_cbm);
>> +	cbm_tmp = &ccmap[ir->clos].cache_mask;
>> +
>> +	if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
>> +		goto out;
>> +
>
> Sigh, what's the point of this bitmap shuffling? Nothing as far as I
> can tell, because you first set all bits of maxmask and then AND
> cbmvalue.

Again, this contradicts your earlier comment where you asked me to use the 
bit manipulation APIs to make it more *readable*. So do you want it more 
readable now, or is that not clear now?

https://lkml.org/lkml/2014/11/19/1145

+	/* Reset the least significant bit.*/
+	i = var & (var - 1);
+
+	if (i != (var & (var << 1)))
+		return false;

was replaced with the

 	if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
 		return false;

 	first_bit = find_next_bit(&var, maxcbm, 0);
 	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);

 	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
 		return false;

3 lines replaced with more to make it more readable ! so does this come back 
then ?

>
> All you achieve is, that you truncate cbmvalue to max_cbm bits. So if
> cbmvalue has any of the invalid bits set, you happily ignore that.

taskset ignores the extra bits. is that a problem here ?

>
> This whole bitmap usage is just making the code obfuscated.
>
>     u64 max_mask = (1ULL << max_cbm) - 1;
>
>     if (cbm_value & ~max_mask)
>     	   return -EINVAL;
>
>     if (cbm_value == (u64)ccmap[ir->clos].cache_mask)
>     	   return 0;
>
> No bitmap_set, bitmap_and nothing. Simple and readable, right?
>
> The bitmap would only be necessary, IF max_cbm is > 64. But that's not
> the case and nothing in your code would survive ma_ cbm > 64. And as I
> said before CBM_LEN is 32bit for now and I serioulsy doubt, that it
> can ever go over 64bit simply because that would not fit into the mask
> MSRs.
>
> So can you please get rid of that bitmap nonsense including the
> dynamic allocation and  use simple and readable code constructs?
>
> Just for the record:
>
>     struct bla {
>     	    u64			cbm_mask;
>     };
>
> versus:
>
>     struct bla {
>     	    unsigned long	*cbm_mask;
>     };
>
>   plus
>
>     bla.cbm_mask = kzalloc(sizeof(u64));
>
> Is just utter waste of data and text space.
>
>> +	/*
>> +	 * At this point we are sure to change the cache_mask.Hence release the
>> +	 * reference to the current CLOSid and try to get a reference for
>> +	 * a different CLOSid.
>> +	 */
>> +	__clos_put(ir->clos);
>
> So you drop the reference, before you make sure that you can get a new one?

I was horrible , will add fix. thanks for pointing out .

>
>> +	if (cbm_search(cache_mask, &closid)) {
>> +		ir->clos = closid;
>> +		__clos_get(closid);
>> +	} else {
>> +		err = intel_rdt_alloc_closid(ir);
>> +		if (err)
>> +			goto out;
>
> What happens if intel_rdt_alloc_closid() fails? ir->clos is unchanged,
> but you have no reference anymore. The next call to this function or
> anything else which calls __clos_put() will trigger the underflow
> warning and the warning will be completely useless because the
> wreckage happened before that.
>
>> +static inline bool intel_rdt_update_cpumask(int cpu)
>> +{
>> +	int phys_id = topology_physical_package_id(cpu);
>> +	struct cpumask *mask = &rdt_cpumask;
>> +	int i;
>> +
>> +	for_each_cpu(i, mask) {
>> +		if (phys_id == topology_physical_package_id(i))
>> +			return false;
>> +	}
>
> You must be kidding.

The rapl and cqm code use similar code. You want me to keep a separate 
package mask for this code path, which would not be that frequent at all?

For c-state transitions only the idle_notifier gets called - so this path 
is only taken when a new package is physically added, and we don't need 
anything for cache alloc for idle transitions. Not really frequent?

You know how many packages are possible. What's
> wrong with having a bitmap of physical packages which have a CPU
> handling it?
>
> 	id =  topology_physical_package_id(cpu);
> 	if (!__test_and_set_bit(id, &package_map)) {
> 		cpumask_set_cpu(cpu, &rdt_cpumask);
> 		cbm_update_msrs(cpu);
> 	}
>
> Four lines of code without loops and hoops. And here is a bitmap the
> proper data structure.
>
>> +static void intel_rdt_cpu_exit(unsigned int cpu)
>> +{
>> +	int phys_id = topology_physical_package_id(cpu);
>> +	int i;
>> +
>> +	mutex_lock(&rdt_group_mutex);
>> +	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
>> +		mutex_unlock(&rdt_group_mutex);
>> +		return;
>> +	}
>> +
>> +	for_each_online_cpu(i) {
>> +		if (i == cpu)
>> +			continue;
>> +
>> +		if (phys_id == topology_physical_package_id(i)) {
>> +			cpumask_set_cpu(i, &rdt_cpumask);
>> +			break;
>> +		}
>> +	}
>
> There is a better way than iterating over all online cpus ....
>
>>
>> +static struct cftype rdt_files[] = {
>> +	{
>> +		.name = "cache_mask",
>> +		.seq_show = intel_cache_alloc_cbm_read,
>> +		.write_u64 = intel_cache_alloc_cbm_write,
>> +		.mode = 0666,
>> +	},
>> +	{ }	/* terminate */
>> +};
>
> Where is the Documentation for that file? What's the content, format,
> error codes ....

Will add a section to rdt.txt in Documentation/cgroups - and will also move 
the documentation patch before all these patches.

>
> Sigh.
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-05-18 18:45       ` Thomas Gleixner
@ 2015-05-18 19:18         ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 19:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, matt.fleming, will.auld, peter.zijlstra,
	h.peter.anvin, kanaka.d.juvva, mtosatti



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Mon, 18 May 2015, Vikas Shivappa wrote:
>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>> On Mon, 11 May 2015, Vikas Shivappa wrote:
>>>> +	/*
>>>> +	 * This needs to be fixed
>>>> +	 * to cache the whole PQR instead of just CLOSid.
>>>> +	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
>>>> +	 * Should not write a 0 to the low 10 bits of PQR
>>>> +	 * and corrupt RMID.
>>>
>>> And why is this not fixed __BEFORE__ this patch? You can do the
>>> changes to struct intel_cqm_state in a seperate patch and then do the
>>> proper implementation from the beginning instead of providing a half
>>> broken variant which gets replaced in the next patch.
>>
>> Ok , can fix both items in your comments. Reason I had it seperately is that
>> the cache affects both cmt and cache allocation patches.
>
> And that's the wrong reason. Sure it affects both, but we first
> prepare the changes to the existing code and then build new stuff on
> top of it not the other way round. Building the roof before the
> basement is almost never a good idea.

Ok, will merge all scheduling changes into one patch if you think that's better.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 18:41       ` Thomas Gleixner
@ 2015-05-18 19:20         ` Borislav Petkov
  2015-05-19 17:33           ` Vikas Shivappa
  2015-05-18 19:44         ` Vikas Shivappa
  1 sibling, 1 reply; 39+ messages in thread
From: Borislav Petkov @ 2015-05-18 19:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti

On Mon, May 18, 2015 at 08:41:53PM +0200, Thomas Gleixner wrote:
> > > > +	}
> > > > +
> > > > +	set_bit(0, rdtss_info.closmap);
> > > > +	rdt_root_group.clos = 0;
> > > > +	ccm = &ccmap[0];
> > > > +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
> > > > +	ccm->clos_refcnt = 1;
> > > > +
> > > >  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len,
> > > > maxid);
> > > 
> > > We surely do not want to sprinkle these all over dmesg.
> > 
> > This is just printed once! how is that sprinke all over?   - we have a dmsg
> > print for Cache monitoring as well when cqm is enabled.
> 
> Sorry, mapped that to the wrong function. Though the message itself is
> horrible. 
> 
> 	  "Max bitmask length:32,Max ClosIds: 16"	
> 
> With some taste and formatting applied this would read:
> 
>      	  "Max. bitmask length: 32, max. closids: 16"
> 
> Can you spot the difference?

I sure can.

Also, I'd still like to ask about the usability of that message. What
does it bring us?

And if the dmesg ring buffer wraps around and it gets overwritten, what
do we do then?

So basically this message does tell us about some max bitmap length and
so on. IOW, useless, right?

Can it be read out from CPUID maybe? If so, stuff which is interested in
it can generate it then.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 18:52       ` Thomas Gleixner
@ 2015-05-18 19:27         ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 19:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Mon, 18 May 2015, Vikas Shivappa wrote:
>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>
>>> On Mon, 11 May 2015, Vikas Shivappa wrote:
>>>>  arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++
>>>>  arch/x86/kernel/cpu/intel_rdt.c  | 112
>>>> +++++++++++++++++++++++++++++++++++++--
>>>>  include/linux/cgroup_subsys.h    |   4 ++
>>>
>>> Where is the Documentation for the new cgroup subsystem?
>>
>> Documentation/cgroups/rdt.txt
>
> That's where it should be, but according to the diffstat it is not. It
> comes as the last patch in the series, but it should be there with the
> patch which introduces the subsystem or even before that.

Agree :). I realized this later after I sent the email and mentioned it in the
next email, but that way I made the same exact mistake - eh! Can't even call
that recursion, I don't know.

Thanks,
Vikas

>
> And the features which are added in subsequent patches should be
> reflected with updates to the documentation.
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 18:41       ` Thomas Gleixner
  2015-05-18 19:20         ` Borislav Petkov
@ 2015-05-18 19:44         ` Vikas Shivappa
  1 sibling, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 19:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Mon, 18 May 2015, Vikas Shivappa wrote:
>>>> +static void __clos_get(unsigned int closid)
>>>
>>> What's the purpose of these double underscores?
>>>
>>
>> Similar to __get_rmid.
>
> Errm. We use double underscore for such cases:
>
> __bla()
> {
> 	do_stuff();
> }
>
> bla()
> {
> 	lock();
> 	__bla();
> 	unlock();
> }
>
> So you have two sorts of callers. Locked and unlocked. But I don't see
> something like this. It's just random underscores for no value.

OK - I see. Will remove the underscores then, since cpuset doesn't use these
anyway.
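
(For reference, if both locked and unlocked callers ever do show up, the split
would look roughly like the sketch below - clos_get() is just an illustrative
name here, not something the series adds, and it reuses rdt_group_mutex and
ccmap from the posted code:)

	/* Internal variant: caller already holds rdt_group_mutex. */
	static void __clos_get(unsigned int closid)
	{
		lockdep_assert_held(&rdt_group_mutex);
		ccmap[closid].clos_refcnt++;
	}

	/* Public variant: takes and drops the mutex itself. */
	static void clos_get(unsigned int closid)
	{
		mutex_lock(&rdt_group_mutex);
		__clos_get(closid);
		mutex_unlock(&rdt_group_mutex);
	}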

>
>>>> +{
>>>> +	struct clos_cbm_map *ccm = &ccmap[closid];
>>>> +
>>>> +	lockdep_assert_held(&rdt_group_mutex);
>>>
>>> Do we really need all these lockdep asserts for a handfull of call
>>> sites?
>>
>> Well, we still have some calls some may be frequent depending on the usage..
>> should the assert decided based on number of times its called ?
>
> We add these things for complex locking scenarios, but for single
> callsites where locking is obvious its not really valuable. But that's
> the least of my worries.

OK. I just wanted the asserts there to help debugging, just in case, as the
code isn't used by too many people yet.

>
>>>> +static struct cgroup_subsys_state *
>>>> +intel_rdt_css_alloc(struct cgroup_subsys_state *parent_css)
>>>> +{
>>>> +	struct intel_rdt *parent = css_rdt(parent_css);
>>>> +	struct intel_rdt *ir;
>>>> +
>>>> +	/*
>>>> +	 * Cannot return failure on systems with no Cache Allocation
>>>> +	 * as the cgroup_init does not handle failures gracefully.
>>>
>>> This comment is blatantly wrong. 5 lines further down you return
>>> -ENOMEM. Now what?
>>>
>>
>> Well , this is for cgroup_init called with parent=null which is when its
>> initializing the cgroup subsystem. I cant return error in this case but I can
>> otherwise.
>
> And then you run into the same BUG_ON ...

Maybe I misled you - I pasted the code from cgroup.c below (cgroup_init_subsys,
which is not part of cache alloc) to show that cgroup_init doesn't handle
failure. The cache allocation code doesn't trigger that BUG_ON, so I never run
into it (right?)

>
>> static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>> {
>>
>> 	/* Create the root cgroup state for this subsystem */
>> 	ss->root = &cgrp_dfl_root;
>> 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
>> 	/* We don't handle early failures gracefully */
>> 	BUG_ON(IS_ERR(css));
>
> I know. But still the comment is misleading.
>
>  	/*
> 	 * cgroup_init() expects  valid css pointer. Return
> 	 * the rdt_root_group.css instead of a failure code.
> 	 */
>
> Can you see the difference?

OK, can fix the comment. Reminds me of my middle school days when English was
my hardest subject.

>
>>>> +	 */
>>>> +	if (!parent)
>>>> +		return &rdt_root_group.css;
>>>> +
>>>> +	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
>>>> +	if (!ir)
>>>> +		return ERR_PTR(-ENOMEM);
>
> And why can't you return something useful here instead of letting the
> thing run into a BUG?

This is a return for the no-memory case - it is not for the call from
cgroup_init_subsys where failure turns into a BUG. This path is hit when the
user is trying to create new cgroups.

>
>>>>  	maxid = c->x86_rdt_max_closid;
>>>>  	max_cbm_len = c->x86_rdt_max_cbm_len;
>>>>
>>>> +	sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
>>>> +	rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
>>>
>>> What's the point of this bitmap? In later patches you have a
>>> restriction to 64bits and the SDM says that CBM_LEN is limited to 32.
>>>
>>> So where i the point for allocating a bitmap?
>>
>> The cache bitmask is really a bitmask where every bit represents a cache way ,
>> so its way based mapping . Although currently the max is 32 bits we still need
>> to treat it as a bitmask.
>>
>> In the first patch version you are the one who suggested to use all the bitmap
>> functions to check the presence of contiguous bits (although I was already
>> using the bitmaps, I had not used for some cases). So i made the change and
>> other similar code was changed to using bitmap/bit manipulation APIs. There
>> are scenarios where in we need to check for subset of bitmasks and other cases
>> but agree they can all be done without the bitmask APIs as well. But now your
>> comment contradicts the old comment ?
>
> Oh well. You can still use bitmap functions on an u64 where
> appropriate.

I see - I did not notice that this bitmap is not for the CBM though. This is
the closmap which manages the closids; the CBM is just kept as an unsigned
long - I don't do a bitmap allocation for that. So we are on the same page,
I guess.

>
>>>
>>>> +	if (!rdtss_info.closmap) {
>>>> +		err = -ENOMEM;
>>>> +		goto out_err;
>>>> +	}
>>>> +
>>>> +	sizeb = maxid * sizeof(struct clos_cbm_map);
>>>> +	ccmap = kzalloc(sizeb, GFP_KERNEL);
>>>> +	if (!ccmap) {
>>>> +		kfree(rdtss_info.closmap);
>>>> +		err = -ENOMEM;
>>>> +		goto out_err;
>>>
>>> What's wrong with return -ENOMEM? We only use goto if we have to
>>> cleanup stuff, certainly not for just returning err.
>>
>> This was due to PeterZ's feedback that the return paradigm needs to not have
>> too many exit points or return a lot of times from the middle of code..
>> Both contradict now ?
>
> There are different opinions about that. But again, that's the least
> of my worries.

Great.

>
>>>
>>>> +	}
>>>> +
>>>> +	set_bit(0, rdtss_info.closmap);
>>>> +	rdt_root_group.clos = 0;
>>>> +	ccm = &ccmap[0];
>>>> +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
>>>> +	ccm->clos_refcnt = 1;
>>>> +
>>>>  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len,
>>>> maxid);
>>>
>>> We surely do not want to sprinkle these all over dmesg.
>>
>> This is just printed once! how is that sprinke all over?   - we have a dmsg
>> print for Cache monitoring as well when cqm is enabled.
>
> Sorry, mapped that to the wrong function. Though the message itself is
> horrible.
>
> 	  "Max bitmask length:32,Max ClosIds: 16"
>
> With some taste and formatting applied this would read:
>
>     	  "Max. bitmask length: 32, max. closids: 16"
>
> Can you spot the difference?

Will fix comment.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-18 19:17     ` Vikas Shivappa
@ 2015-05-18 20:15       ` Thomas Gleixner
  2015-05-18 21:09         ` Vikas Shivappa
  2015-05-20 17:22         ` Vikas Shivappa
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-18 20:15 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Mon, 18 May 2015, Vikas Shivappa wrote:
> On Fri, 15 May 2015, Thomas Gleixner wrote:
> > > +/*
> > > + * Mask of CPUs for writing CBM values. We only need one per-socket.
> > 
> > One mask? One CPU? One what? Please add comments which are
> > understandable and useful for people who did NOT write that code.
> 
> when there is already code in the upstream code which has comments exactly
> consistent - when that made sense , this should ?
> 
> /*
>  * Mask of CPUs for reading CQM values. We only need one per-socket.
>  */
> static cpumask_t cqm_cpumask;

And just because there is a lousy comment upstream it does not become
better by copying it.
 
> > > +/*
> > > + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> > > + * which are one per CLOSid, except IA32_L3_MASK_0 on the current
> > > package.
> > > + * @cpu : the cpu on which the mask is updated.
> > > + */
> > > +static inline void cbm_update_msrs(int cpu)
> > > +{
> > > +	int maxid = boot_cpu_data.x86_rdt_max_closid;
> > > +	unsigned int i;
> > > +
> > > +	if (WARN_ON(cpu != smp_processor_id()))
> > > +		return;
> > > +
> > > +	for (i = 1; i < maxid; i++) {
> > 
> > Lacks a comment why this starts with 1 and not with 0 as one would expect.
> 
> Its in the function comment just above - "except IA32_L3_MASK_0 on the current
> package."

No. That comment explains WHAT it does, not WHY. Aside from that, the comment
is not in correct KernelDoc format.
 
> > > +/*
> > > + * intel_cache_alloc_cbm_write() - Validates and writes the
> > > + * cache bit mask(cbm) to the IA32_L3_MASK_n
> > > + * and also store the same in the ccmap.
> > > + *
> > > + * CLOSids are reused for cgroups which have same bitmask.
> > > + * - This helps to use the scant CLOSids optimally.
> > > + * - This also implies that at context switch write
> > > + * to PQR-MSR is done only when a task with a
> > > + * different bitmask is scheduled in.
> > > + */
> > > +static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
> > 
> > Can you please drop these intel_ prefixes? They provide no value and
> > just eat space.
> 
> well , i got ample comments which wanted me to me to specify intel as this
> feature is specific to intel.

It's fine for public interfaces, but for functions which are only used
inside a file which contains only intel specific code it's just silly.

> the cache monitoring code has similar prefixes as well - is that pref specific

Sigh. There is so much crap in the kernel, you'll find an example for
everything.

> then ? I just dont want people to come back and ask to make it clear that
> things are intel specific

The file name tells everyone its INTEL specific, right? And following
your argumentation you should have named your helper functions
intel_get_closid() and intel_put_closid() as well.

If your code would have been in a proper shape otherwise, I would
probably have just let it slip.

> > > +				 struct cftype *cft, u64 cbmvalue)
> > > +{
> > > +	u32 max_cbm = boot_cpu_data.x86_rdt_max_cbm_len;
> > > +	struct intel_rdt *ir = css_rdt(css);
> > > +	unsigned long cache_mask, max_mask;
> > > +	unsigned long *cbm_tmp;
> > > +	unsigned int closid;
> > > +	ssize_t err = 0;
> > > +
> > > +	if (ir == &rdt_root_group)
> > > +		return -EPERM;
> > > +	bitmap_set(&max_mask, 0, max_cbm);
> > > +
> > > +	/*
> > > +	 * Need global mutex as cbm write may allocate a closid.
> > > +	 */
> > > +	mutex_lock(&rdt_group_mutex);
> > > +	bitmap_and(&cache_mask, (unsigned long *)&cbmvalue, &max_mask,
> > > max_cbm);
> > > +	cbm_tmp = &ccmap[ir->clos].cache_mask;
> > > +
> > > +	if (bitmap_equal(&cache_mask, cbm_tmp, MAX_CBM_LENGTH))
> > > +		goto out;
> > > +
> > 
> > Sigh, what's the point of this bitmap shuffling? Nothing as far as I
> > can tell, because you first set all bits of maxmask and then AND
> > cbmvalue.
> 
> Again , this contraditcs your earlier comment where you asked me to use the
> bit manipulation APIs to make it more *readable*. So do you want more readable
> now or is that not clear now ?

> https://lkml.org/lkml/2014/11/19/1145

 
> +	/* Reset the least significant bit.*/
> +	i = var & (var - 1);
> +
> +	if (i != (var & (var << 1)))
> +		return false;
> 
> was replaced with the
> 
> 	if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
> 		return false;
> 
> 	first_bit = find_next_bit(&var, maxcbm, 0);
> 	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
> 
> 	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
> 		return false;
> 
> 3 lines replaced with more to make it more readable ! so does this come back
> then ?

3 lines? 

> +static inline bool cbm_minbits(unsigned long var)
> +{
> +
> +	unsigned long i;
> +
> +	/*Minimum of 2 bits must be set.*/
> +
> +	i = var & (var - 1);
> +	if (!i || !var)
> +	   return false;
> +
> +	return true;
> +
> +}

These 14 lines are replaced by two lines:

	if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
		return false;

> static inline bool cbm_iscontiguous(unsigned long var)
> +{
> +
> +	unsigned long i;
> +
> +	/* Reset the least significant bit.*/
> +	i = var & (var - 1);
> +
> +	/*
> +	 * We would have a set of non-contiguous bits when
> +	 * there is at least one zero
> +	 * between the most significant 1 and least significant 1.
> +	 * In the below '&' operation,(var <<1) would have zero in
> +	 * at least 1 bit position in var apart from least
> +	 * significant bit if it does not have contiguous bits.
> +	 * Multiple sets of contiguous bits wont succeed in the below
> +	 * case as well.
> +	 */
> +
> +	if (i != (var & (var << 1)))
> +	   return false;
> +
> +	return true;
> +
> +}

And these 25 lines are replaced by 5 lines.

 	first_bit = find_next_bit(&var, maxcbm, 0);
 	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
 
 	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
 		return false;


I asked you to use it for cbm_minbits and cbm_iscontiguous() instead
of the horrible open coded mess you had. But I certainly did not ask
you to blindly use bitmaps just everywhere.
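
Put together, a compact validator along those lines would look roughly like
this (just a sketch, using the names from the snippets above; it also rejects
bits above maxcbm instead of silently truncating them):

	static bool cbm_validate(unsigned long var)
	{
		unsigned long first_bit, zero_bit;

		/* At least min_bitmask_len bits must be set. */
		if (bitmap_weight(&var, maxcbm) < min_bitmask_len)
			return false;

		/* No bits beyond the hardware supported mask length. */
		if (find_next_bit(&var, BITS_PER_LONG, maxcbm) < BITS_PER_LONG)
			return false;

		first_bit = find_next_bit(&var, maxcbm, 0);
		zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);

		/* A set bit after the first zero bit means a hole. */
		return find_next_bit(&var, maxcbm, zero_bit) >= maxcbm;
	}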

> > All you achieve is, that you truncate cbmvalue to max_cbm bits. So if
> > cbmvalue has any of the invalid bits set, you happily ignore that.
> 
> taskset ignores the extra bits. is that a problem here ?

I do not care what taskset does. I care what this code does.

It tells the user that 0x100ffff is fine, even if the valid bits are
just 0xffff.

And taskset is a different story due to cpu hotplug.

But for CAT the maskbits are defined by hardware and cannot change
ever. So why would we allow to set invalid ones and claim that its
correct?

If there is a reason to do so, then it needs a proper comment in the
code and a proper explanation in Documentation/.....

> > > +static inline bool intel_rdt_update_cpumask(int cpu)
> > > +{
> > > +	int phys_id = topology_physical_package_id(cpu);
> > > +	struct cpumask *mask = &rdt_cpumask;
> > > +	int i;
> > > +
> > > +	for_each_cpu(i, mask) {
> > > +		if (phys_id == topology_physical_package_id(i))
> > > +			return false;
> > > +	}
> > 
> > You must be kidding.
> 
> the rapl and cqm use similar code. You want me to keep a seperate package mask
> for this code which not would be that frequent at all ?

You find for everything a place where you copied your stuff from
without thinking about it, right?

Other people desperately try to reduce the cpu online times, which get more
and more interesting the larger the systems become. So it might be a good idea
to come up with a proper, fast implementation which can be used everywhere
instead of blindly copying code.
 
> for c-state transitions only idle_notifier gets called - so this is only when
> a new package is physically added. and we dont need anything for cache alloc
> for idle transitions. not really frequent ?

Crap. It's called for every cpu which comes online and goes offline,
not for new packages.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-18 20:15       ` Thomas Gleixner
@ 2015-05-18 21:09         ` Vikas Shivappa
  2015-05-20 17:22         ` Vikas Shivappa
  1 sibling, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-18 21:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Mon, 18 May 2015, Vikas Shivappa wrote:
>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>>> +/*
>>>> + * Mask of CPUs for writing CBM values. We only need one per-socket.
>>>
>>> One mask? One CPU? One what? Please add comments which are
>>> understandable and useful for people who did NOT write that code.
>>
>> when there is already code in the upstream code which has comments exactly
>> consistent - when that made sense , this should ?
>>
>> /*
>>  * Mask of CPUs for reading CQM values. We only need one per-socket.
>>  */
>> static cpumask_t cqm_cpumask;
>
> And just because there is a lousy comment upstream it does not become
> better by copying it.

Nice , will fix.

>
>>>> +/*
>>>> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
>>>> + * which are one per CLOSid, except IA32_L3_MASK_0 on the current
>>>> package.
>>>> + * @cpu : the cpu on which the mask is updated.
>>>> + */
>>>> +static inline void cbm_update_msrs(int cpu)
>>>> +{
>>>> +	int maxid = boot_cpu_data.x86_rdt_max_closid;
>>>> +	unsigned int i;
>>>> +
>>>> +	if (WARN_ON(cpu != smp_processor_id()))
>>>> +		return;
>>>> +
>>>> +	for (i = 1; i < maxid; i++) {
>>>
>>> Lacks a comment why this starts with 1 and not with 0 as one would expect.
>>
>> Its in the function comment just above - "except IA32_L3_MASK_0 on the current
>> package."
>
> No. That comment explains WHAT it does not WHY. Asided of that that
> comment is not correct KernelDoc format.

Will fix to kerneldoc format.

>
>>>> +/*
>>>> + * intel_cache_alloc_cbm_write() - Validates and writes the
>>>> + * cache bit mask(cbm) to the IA32_L3_MASK_n
>>>> + * and also store the same in the ccmap.
>>>> + *
>>>> + * CLOSids are reused for cgroups which have same bitmask.
>>>> + * - This helps to use the scant CLOSids optimally.
>>>> + * - This also implies that at context switch write
>>>> + * to PQR-MSR is done only when a task with a
>>>> + * different bitmask is scheduled in.
>>>> + */
>>>> +static int intel_cache_alloc_cbm_write(struct cgroup_subsys_state *css,
>>>
>>> Can you please drop these intel_ prefixes? They provide no value and
>>> just eat space.
>>
>> well , i got ample comments which wanted me to me to specify intel as this
>> feature is specific to intel.
>
> It's fine for public interfaces, but for functions which are only used
> inside a file which contains only intel specific code it's just silly.

OK, will keep the prefix for cgroup interfaces or something similar and remove
it for the others.

>
>> the cache monitoring code has similar prefixes as well - is that pref specific
>
> Sigh. There is so much crap in the kernel, you'll find an example for
> everything.

It's the example from the same RDT feature!

>
>> then ? I just dont want people to come back and ask to make it clear that
>> things are intel specific
>
> The file name tells everyone its INTEL specific, right? And following
> your argumentation you should have named your helper functions
> intel_get_closid() and intel_put_closid() as well.
>
> If your code would have been in a proper shape otherwise, I would
> probably have just let it slip.
>


>> taskset ignores the extra bits. is that a problem here ?
>
> I do not care, what taskset does. I care what this code does.
>
> It tells the user that: 0x100ffff is fine, even if the valid bits are
> just 0xffff.
>
> And taskset is a different story due to cpu hotplug.

>
> But for CAT the maskbits are defined by hardware and cannot change
> ever. So why would we allow to set invalid ones and claim that its
> correct?

OK, will return an error for all invalid bitmasks.

>
> If there is a reason to do so, then it needs a proper comment in the
> code and a proper explanation in Documentation/.....
>
>>>> +static inline bool intel_rdt_update_cpumask(int cpu)
>>>> +{
>>>> +	int phys_id = topology_physical_package_id(cpu);
>>>> +	struct cpumask *mask = &rdt_cpumask;
>>>> +	int i;
>>>> +
>>>> +	for_each_cpu(i, mask) {
>>>> +		if (phys_id == topology_physical_package_id(i))
>>>> +			return false;
>>>> +	}
>>>
>>> You must be kidding.
>>
>> the rapl and cqm use similar code. You want me to keep a seperate package mask
>> for this code which not would be that frequent at all ?
>
> You find for everything a place where you copied your stuff from
> without thinking about it, right?

It's the code for the same RDT feature, which was upstreamed just a few weeks
ago - not ancient!

>
> Other people dessperately try to fix the cpu online times which are
> more and more interesting the larger the systems become. So it might
> be a good idea to come up with a proper fast implementation which can
> be used everywhere instead of blindly copying code.
>
>> for c-state transitions only idle_notifier gets called - so this is only when
>> a new package is physically added. and we dont need anything for cache alloc
>> for idle transitions. not really frequent ?
>
> Crap. It's called for every cpu which comes online and goes offline,
> not for new packages.

I erred in my comment - I meant that we update the masks for every package, but
I get your point: the loop nonetheless runs for every cpu.
Will fix.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-18 19:20         ` Borislav Petkov
@ 2015-05-19 17:33           ` Vikas Shivappa
  2015-05-19 20:35             ` Borislav Petkov
  0 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-19 17:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Vikas Shivappa, Vikas Shivappa, x86,
	linux-kernel, hpa, mingo, tj, peterz, Matt Fleming, Auld, Will,
	peter.zijlstra, h.peter.anvin, Juvva, Kanaka D, mtosatti



On Mon, 18 May 2015, Borislav Petkov wrote:

> On Mon, May 18, 2015 at 08:41:53PM +0200, Thomas Gleixner wrote:
>>>>> +	}
>>>>> +
>>>>> +	set_bit(0, rdtss_info.closmap);
>>>>> +	rdt_root_group.clos = 0;
>>>>> +	ccm = &ccmap[0];
>>>>> +	bitmap_set(&ccm->cache_mask, 0, max_cbm_len);
>>>>> +	ccm->clos_refcnt = 1;
>>>>> +
>>>>>  	pr_info("Max bitmask length:%u,Max ClosIds: %u\n", max_cbm_len,
>>>>> maxid);
>>>>
>>>> We surely do not want to sprinkle these all over dmesg.
>>>
>>> This is just printed once! how is that sprinke all over?   - we have a dmsg
>>> print for Cache monitoring as well when cqm is enabled.
>>
>> Sorry, mapped that to the wrong function. Though the message itself is
>> horrible.
>>
>> 	  "Max bitmask length:32,Max ClosIds: 16"
>>
>> With some taste and formatting applied this would read:
>>
>>      	  "Max. bitmask length: 32, max. closids: 16"
>>
>> Can you spot the difference?
>
> I sure can.
>
> Also, I'd still like to ask about the usability of that message. What
> does it bring us?
>
> And if the dmesg ring buffer wraps around and it gets overwritten, what
> do we do then?
>
> So basically this message does tell us about some max bitmap length and
> so on. IOW, useless, right?
>
> Can it be read out from CPUID maybe? If so, stuff which is interested in
> it can generate it then.

Yes, the max bitmask length and max closid can be read from CPUID. The presence
of RDT and L3 cache allocation is indicated by the rdt and cat_l3 flags in
/proc/cpuinfo. So even if this message gets overwritten, the user can still see
that the feature is enabled in the kernel.

The message can also be changed to something like "Intel cache allocation
enabled".
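
Something like the following from user space would read it out (a rough sketch;
it assumes the L3 CAT enumeration in CPUID leaf 0x10, sub-leaf 1, as described
in the SDM):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.(EAX=10H, ECX=0): EBX bit 1 => L3 cache allocation. */
		__cpuid_count(0x10, 0, eax, ebx, ecx, edx);
		if (!(ebx & (1 << 1))) {
			printf("L3 cache allocation not supported\n");
			return 1;
		}

		/*
		 * CPUID.(EAX=10H, ECX=1): EAX[4:0] = CBM length - 1,
		 * EDX[15:0] = highest supported closid.
		 */
		__cpuid_count(0x10, 1, eax, ebx, ecx, edx);
		printf("Max bitmask length: %u, max closids: %u\n",
		       (eax & 0x1f) + 1, (edx & 0xffff) + 1);
		return 0;
	}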

Thanks,
Vikas

>
> -- 
> Regards/Gruss,
>    Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-05-19 17:33           ` Vikas Shivappa
@ 2015-05-19 20:35             ` Borislav Petkov
  0 siblings, 0 replies; 39+ messages in thread
From: Borislav Petkov @ 2015-05-19 20:35 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Thomas Gleixner, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti

On Tue, May 19, 2015 at 10:33:55AM -0700, Vikas Shivappa wrote:
> Yes the max bitmask and max closid can be read from cpuid. the presence of
> rdt and l3 cache alloc is indicated by rdt and cat_l3 in /proc/cpuinfo. So
> even if this message gets overwritten , user can see the feature enabled in
> kernel.
> 
> Can be modified somthing similar to "Intel cache allocation enabled" as well.

Yeah, that would have more value to the user than a cryptic message
about closids and bitmask length.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-15 20:15   ` Thomas Gleixner
@ 2015-05-20 17:18     ` Vikas Shivappa
  2015-05-20 18:50       ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-20 17:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, matt.fleming, will.auld, peter.zijlstra,
	h.peter.anvin, kanaka.d.juvva, mtosatti



On Fri, 15 May 2015, Thomas Gleixner wrote:

> On Mon, 11 May 2015, Vikas Shivappa wrote:
>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>>
>> Conflicts:
>> 	arch/x86/kernel/cpu/perf_event_intel_cqm.c
>
> And that's interesting for what?

Will remove this; I fixed some conflicts since this code changes both cqm and
cache allocation.

>
>> --- /dev/null
>> +++ b/arch/x86/include/asm/rdt_common.h
>
>> @@ -0,0 +1,13 @@
>> +#ifndef _X86_RDT_H_
>> +#define _X86_RDT_H_
>> +
>> +#define MSR_IA32_PQR_ASSOC	0x0c8f
>> +
>> +struct intel_pqr_state {
>> +	raw_spinlock_t	lock;
>> +	int		rmid;
>> +	int		clos;
>
> Those want to be u32. We have no type checkes there, but u32 is the
> proper data type for wrmsr.

will fix.

>
>> +	int		cnt;
>> +};
>> +
>
> What's wrong with having this in intel_rdt.h?

intel_rdt.h is specific to cache allocation and the RDT cgroup bits only. Well,
that's a criterion to separate them, isn't it? cqm won't use most of the
(growing) contents of intel_rdt.h.

> It seems you're a big
> fan of sprinkling stuff all over the place so reading becomes a
> chasing game.
>
>>  {
>>  	struct task_struct *task = current;
>>  	struct intel_rdt *ir;
>> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +	unsigned long flags;
>>
>> +	raw_spin_lock_irqsave(&state->lock, flags);
>
> finish_arch_switch() is called with interrupts disabled already ...

OK, I somehow could not locate this cli when I thought about it. So if this
runs with interrupts disabled and preemption off, then we don't need the spin
lock or the rcu lock (since rcu synchronization would obviously wait for the
preempt-disabled section).

Will remove both of them.
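
The hook would then shrink to roughly the following (sketch only, using the
per-cpu pqr_state introduced in this patch):

	static inline void rdt_sched_in(struct task_struct *task)
	{
		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
		struct intel_rdt *ir;

		if (!rdt_enabled())
			return;

		/*
		 * Runs from finish_arch_switch() with interrupts off, so
		 * the per-cpu state needs no extra locking here.
		 */
		ir = task_rdt(task);
		if (ir->clos == state->clos)
			return;

		/* Keep the cached RMID in the low bits of IA32_PQR_ASSOC. */
		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
		state->clos = ir->clos;
	}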

>
>>  	rcu_read_lock();
>
> So we now have a spin_lock() and rcu_read_lock() and no explanation
> what is protecting what.
>
>>  	ir = task_rdt(task);
>> -	if (ir->clos == clos) {
>> +	if (ir->clos == state->clos) {
>
> And of course taking the spin lock for checking state->clos is
> complete and utter nonsense.
>
> state->clos can only be changed by this code and the only reason why
> we need the lock is to protect against concurrent modification of
> state->rmid.
>
> So the check for ir->clos == state->clos can be done lockless.
>
> And I seriously doubt, that state->lock is necessary at all.
>
> Q: What is it protecting?
> A: state->clos, state->rmid, state->cnt
>
> Q: What's the context?
> A: Per cpu context. The per cpu pqr state is NEVER modified from a
>   remote cpu.
>
> Q: What is the lock for?
> A: Nothing.
>
> Q: Why?
> A: Because interrupt disabled regions protect per cpu state perfectly
>   fine and there is is no memory ordering issue which would require a
>   lock or barrier either.
>
> Peter explained it to you several times already that context switch is
> one the most sensitive hot pathes where we care about every cycle. But
> you just go ahead and mindlessly create pointless overhead.
>
>> +	/*
>> +	 * PQR has closid in high 32 bits and CQM-RMID
>> +	 * in low 10 bits. Rewrite the exsting rmid from
>> +	 * software cache.
>
> This comment wouldnt be necessary if you would have proper documented
> struct pqr_state.

Will fix, that would be a lot better.

>
>> +	 */
>> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
>> +	state->clos = ir->clos;
>>  	rcu_read_unlock();
>> +	raw_spin_unlock_irqrestore(&state->lock, flags);
>> +
>>  }
>
>> -static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
>> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>
> With CONFIG_PERF=n and CONFIG_CGROUP_RDT=y the linker will fail.

Copying from the Makefile below -

obj-$(CONFIG_CPU_SUP_INTEL)             += perf_event_intel_rapl.o perf_event_intel_cqm.o

shouldn't this work with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y?

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-18 20:15       ` Thomas Gleixner
  2015-05-18 21:09         ` Vikas Shivappa
@ 2015-05-20 17:22         ` Vikas Shivappa
  2015-05-20 19:02           ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-20 17:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Mon, 18 May 2015, Vikas Shivappa wrote:
>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>>> +static inline bool intel_rdt_update_cpumask(int cpu)
>>>> +{
>>>> +	}
>>>
>>> You must be kidding.
>>
>> the rapl and cqm use similar code. You want me to keep a seperate package mask
>> for this code which not would be that frequent at all ?
>
> You find for everything a place where you copied your stuff from
> without thinking about it, right?
>
> Other people dessperately try to fix the cpu online times which are
> more and more interesting the larger the systems become. So it might
> be a good idea to come up with a proper fast implementation which can
> be used everywhere instead of blindly copying code.

OK, I can try to do this as a separate patch after the cache allocation, to get
support for a faster implementation for traversing packages and the cpus in a
package which can be used by everyone. We would need to start from scratch with
a packagemask_t equivalent to cpumask_t. Hope that is fair?

Thanks,
Vikas

>
>> for c-state transitions only idle_notifier gets called - so this is only when
>> a new package is physically added. and we dont need anything for cache alloc
>> for idle transitions. not really frequent ?
>
> Crap. It's called for every cpu which comes online and goes offline,
> not for new packages.
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-20 17:18     ` Vikas Shivappa
@ 2015-05-20 18:50       ` Thomas Gleixner
  2015-05-20 20:43         ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-20 18:50 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	matt.fleming, will.auld, peter.zijlstra, h.peter.anvin,
	kanaka.d.juvva, mtosatti

On Wed, 20 May 2015, Vikas Shivappa wrote:
> On Fri, 15 May 2015, Thomas Gleixner wrote:
> > > -static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
> > > +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> > 
> > With CONFIG_PERF=n and CONFIG_CGROUP_RDT=y the linker will fail.
> 
> copy from Makefile below -
> obj-$(CONFIG_CPU_SUP_INTEL)             += perf_event_intel_rapl.o
> perf_event_intel_cqm.o
> 
> should work with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y ?

Groan. Did you try to compile it? Certainly not.

Simply because the whole section which contains perf_event* object
files is conditional on

ifdef CONFIG_PERF_EVENTS

I'm starting to get really grumpy and tired of your attitude.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-20 17:22         ` Vikas Shivappa
@ 2015-05-20 19:02           ` Thomas Gleixner
  2015-05-21  0:54             ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-20 19:02 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Wed, 20 May 2015, Vikas Shivappa wrote:
> On Mon, 18 May 2015, Thomas Gleixner wrote:
> 
> > On Mon, 18 May 2015, Vikas Shivappa wrote:
> > > On Fri, 15 May 2015, Thomas Gleixner wrote:
> > > > > +static inline bool intel_rdt_update_cpumask(int cpu)
> > > > > +{
> > > > > +	}
> > > > 
> > > > You must be kidding.
> > > 
> > > the rapl and cqm use similar code. You want me to keep a seperate package
> > > mask
> > > for this code which not would be that frequent at all ?
> > 
> > You find for everything a place where you copied your stuff from
> > without thinking about it, right?
> > 
> > Other people dessperately try to fix the cpu online times which are
> > more and more interesting the larger the systems become. So it might
> > be a good idea to come up with a proper fast implementation which can
> > be used everywhere instead of blindly copying code.
> 
> Ok , i can try to do this as a seperate patch after the cache allocation to

Hell no. We do preparatory patches first. I don't believe in 'can try'
promises.

> get a support for faster implementation for traversing package and cpus in the
> packages which can be used by everyone. we would need to start from scratch
> with having packagemask_t equivalent to cpumask_t. hope that is fair ?

Yes, that's what I want to see.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-20 18:50       ` Thomas Gleixner
@ 2015-05-20 20:43         ` Vikas Shivappa
  2015-05-20 21:14           ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-20 20:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Wed, 20 May 2015, Thomas Gleixner wrote:

> On Wed, 20 May 2015, Vikas Shivappa wrote:
>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>>> -static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
>>>> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>>>
>>> With CONFIG_PERF=n and CONFIG_CGROUP_RDT=y the linker will fail.
>>
>> copy from Makefile below -
>> obj-$(CONFIG_CPU_SUP_INTEL)             += perf_event_intel_rapl.o
>> perf_event_intel_cqm.o
>>
>> should work with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y ?
>
> Groan. Did you try to compile it? Certainly not.

Of course I did compile it, successfully, after I changed PERF_EVENTS=n and
RDT=y in the .config.
sorry, my reason was wrong though.

What I had not noticed is that the .config simply ends up with
CONFIG_PERF_EVENTS=y again even though I disable it by editing .config,
because x86 selects it.
So PERF_EVENTS can't be disabled on x86, and RDT is only available on
x86 && SUP_INTEL.

How did you configure CONFIG_PERF_EVENTS=n and CGROUP_RDT=y and see the error?

Thanks,
Vikas

>
> Simply because the whole section which contains perf_event* object
> files is conditional on
>
> ifdef CONFIG_PERF_EVENTS
>
> I'm starting to get really grumpy and tired of your attitude.
>
> Thanks,
>
> 	tglx
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-20 20:43         ` Vikas Shivappa
@ 2015-05-20 21:14           ` Thomas Gleixner
  2015-05-20 22:51             ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-20 21:14 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Wed, 20 May 2015, Vikas Shivappa wrote:
> On Wed, 20 May 2015, Thomas Gleixner wrote:
> 
> > On Wed, 20 May 2015, Vikas Shivappa wrote:
> > > On Fri, 15 May 2015, Thomas Gleixner wrote:
> > > > > -static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
> > > > > +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> > > > 
> > > > With CONFIG_PERF=n and CONFIG_CGROUP_RDT=y the linker will fail.
> > > 
> > > copy from Makefile below -
> > > obj-$(CONFIG_CPU_SUP_INTEL)             += perf_event_intel_rapl.o
> > > perf_event_intel_cqm.o
> > > 
> > > should work with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y ?
> > 
> > Groan. Did you try to compile it? Certainly not.
> 
> of course I did compile successfully after I changed PERF_EVENTS=n and RDT=y
> in the .config.
> sorry, my reason was wrong though.
> 
> What I had not noticed was the .config is simply added to enabling
> CONFIG_PERF_EVENTS=y even though i disable it by editing .config.
> Thats because x86 is selecting it.
> So PERF_EVENTS cant be disabled on x86 and RDT is only in x86 && SUP_INTEL.
> 
> How did you configure CONFIG_PERF_EVENTS=n and CGROUP_RDT=y and see the error

I did not compile it. I looked at the config switches. And my two
observations still stand:

1) It will fail with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y

2) perf_event_intel_rapl.o depends on CONFIG_PERF_EVENTS=y

It's up to you to provide me a proper reason why it will always be
built. I'm reviewing code and it's not my job to figure out why
something magically works.

It's your job to provide proper answers to my review observations. And
your answer to my observation #1 definitely does not fall into that
category:

> > > copy from Makefile below -
> > > obj-$(CONFIG_CPU_SUP_INTEL)             += perf_event_intel_rapl.o
> > > perf_event_intel_cqm.o
> > > 
> > > should work with CONFIG_PERF_EVENTS=n and CGROUP_RDT=y ?

Then I told you that this is crap, because of #2

So now you gave me an explanation WHY it works magically:

> So PERF_EVENTS cant be disabled on x86 and RDT is only in x86 && SUP_INTEL.

That should have been your answer to #1, right?

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-20 21:14           ` Thomas Gleixner
@ 2015-05-20 22:51             ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-20 22:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Wed, 20 May 2015, Thomas Gleixner wrote:

> On Wed, 20 May 2015, Vikas Shivappa wrote:
>> On Wed, 20 May 2015, Thomas Gleixner wrote:
>>
>>> On Wed, 20 May 2015, Vikas Shivappa wrote:
>>>> On Fri, 15 May 2015, Thomas Gleixner wrote:

>> sorry, my reason was wrong though.
>>

>
>> So PERF_EVENTS cant be disabled on x86 and RDT is only in x86 && SUP_INTEL.
>
> That should have been your answer to #1, right?

Yes, that's why I said '>> sorry, my reason was wrong though.' above.

I thought the .config edit would have spit out an error message when
PERF_EVENTS was reselected (basically printing any such dependency change it
makes when starting to compile).

I was only curious whether you had compiled it by any chance, as some of the
folks were trying to get hold of Intel systems with RDT.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-20 19:02           ` Thomas Gleixner
@ 2015-05-21  0:54             ` Thomas Gleixner
  2015-05-21 16:36               ` Vikas Shivappa
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2015-05-21  0:54 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, mingo, tj, peterz,
	Matt Fleming, Auld, Will, peter.zijlstra, h.peter.anvin, Juvva,
	Kanaka D, mtosatti

On Wed, 20 May 2015, Thomas Gleixner wrote:
> On Wed, 20 May 2015, Vikas Shivappa wrote:
> > On Mon, 18 May 2015, Thomas Gleixner wrote:
> > 
> > > On Mon, 18 May 2015, Vikas Shivappa wrote:
> > > > On Fri, 15 May 2015, Thomas Gleixner wrote:
> > > > > > +static inline bool intel_rdt_update_cpumask(int cpu)
> > > > > > +{
> > > > > > +	}
> > > > > 
> > > > > You must be kidding.
> > > > 
> > > > the rapl and cqm use similar code. You want me to keep a seperate package
> > > > mask
> > > > for this code which not would be that frequent at all ?
> > > 
> > > You find for everything a place where you copied your stuff from
> > > without thinking about it, right?
> > > 
> > > Other people dessperately try to fix the cpu online times which are
> > > more and more interesting the larger the systems become. So it might
> > > be a good idea to come up with a proper fast implementation which can
> > > be used everywhere instead of blindly copying code.
> > 
> > Ok , i can try to do this as a seperate patch after the cache allocation to
> 
> Hell no. We do preparatory patches first. I'm not believing in 'can
> try' promises.
> 
> > get a support for faster implementation for traversing package and cpus in the
> > packages which can be used by everyone. we would need to start from scratch
> > with having packagemask_t equivalent to cpumask_t. hope that is fair ?
> 
> Yes, that's what I want to see.

I was kidding. You need two functionalities:

  1) A way to figure out whether you already have a CPU taking care of the
     package to which the newly online CPU belongs.

     That's something you need to track in your own code and I already
     gave you the 5 lines of code which can handle that. Remember?

        id =  topology_physical_package_id(cpu);
        if (!__test_and_set_bit(id, &package_map)) {
                cpumask_set_cpu(cpu, &rdt_cpumask);
                cbm_update_msrs(cpu);
        }

     So you cannot add much infrastructure for that because you need
     to track your own state in the CAT relevant package bitmap.

  2) A way to find out whether the to be offlined CPU is the last
     online CPU of a package. If it's not the last one, then you need
     a fast way to find the next online cpu which belongs to that
     package. If it's the last one you need to kill the cat.

     The information is already available, so it's not rocket
     science. And it takes maximal 7 lines of code to implement
     it.

     Q: Why 'maximal' 7?
     A: Because I'm a lazy bastard and cannot be bothered to figure
     	out whether one of the lines is superfluous.
     Q: Why can't I be bothered?
     A: It's none of my problems and it actually does not matter much.

     Here is the pseudo code:

     	   do_magic_stuff(tmp, TCC, COM);
	   clr(c, tmp);
	   n = do_more_stuff(tmp);
     	   if (notavailable(n))
	      	kill_the_cat();
	   else
	        set(n, RCM);

     Hint: One of the NNN placeholders resolves to topology_core_cpumask()

Now once you figured that out, you will notice that the above 5
lines of code to solve problem #1 can be simplified because you can
avoid the package_map bitmap completely.

Sorry, no pseudo code for this. But you get more than one hint this
time:
	- Hint #1 still applies
	- The logic is very similar to the above #2 pseudo code
	- It takes maximal 6 lines of code to implement it

There is one caveat:

    If Hint #1 cannot solve your problem, then you need to figure out
    why and then work with the people who are responsible for it to
    figure out how it can be resolved.

A few general hints:

    - The line counts are neither including the conditionals which
      invoke that code nor the function body nor variable
      declarations. But they include braces,

      All I care about is the logic itself. See the pseudo code
      example above.

    - Please provide a solution for #2 and #1 before you bother me
      with another patch series.

Thanks,

	tglx


     

     	       

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management
  2015-05-21  0:54             ` Thomas Gleixner
@ 2015-05-21 16:36               ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-21 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, mingo,
	tj, peterz, Matt Fleming, Auld, Will, peter.zijlstra,
	h.peter.anvin, Juvva, Kanaka D, mtosatti



On Wed, 20 May 2015, Thomas Gleixner wrote:

> On Wed, 20 May 2015, Thomas Gleixner wrote:
>> On Wed, 20 May 2015, Vikas Shivappa wrote:
>>> On Mon, 18 May 2015, Thomas Gleixner wrote:
>>>
>>>> On Mon, 18 May 2015, Vikas Shivappa wrote:
>>>>> On Fri, 15 May 2015, Thomas Gleixner wrote:
>>>>>>> +static inline bool intel_rdt_update_cpumask(int cpu)
>>>>>>> +{
>>>>>>> +	}
>>>>>>
>>>>>> You must be kidding.
>>>>>
>>>>> the rapl and cqm use similar code. You want me to keep a seperate package
>>>>> mask
>>>>> for this code which not would be that frequent at all ?
>>>>
>>>> You find for everything a place where you copied your stuff from
>>>> without thinking about it, right?
>>>>
>>>> Other people dessperately try to fix the cpu online times which are
>>>> more and more interesting the larger the systems become. So it might
>>>> be a good idea to come up with a proper fast implementation which can
>>>> be used everywhere instead of blindly copying code.
>>>
>>> Ok , i can try to do this as a seperate patch after the cache allocation to
>>
>> Hell no. We do preparatory patches first. I'm not believing in 'can
>> try' promises.
>>
>>> get a support for faster implementation for traversing package and cpus in the
>>> packages which can be used by everyone. we would need to start from scratch
>>> with having packagemask_t equivalent to cpumask_t. hope that is fair ?
>>
>> Yes, that's what I want to see.
>
> I was kidding. You need two functionalities:
>
>  1) A way to figure out whether you already have a CPU taking care of the
>     package to which the newly online CPU belongs.
>
>     That's something you need to track in your own code and I already
>     gave you the 5 lines of code which can handle that. Remember?
>
>        id =  topology_physical_package_id(cpu);
>        if (!__test_and_set_bit(id, &package_map)) {
>                cpumask_set_cpu(cpu, &rdt_cpumask);
>                cbm_update_msrs(cpu);
>        }
>
>     So you cannot add much infrastructure for that because you need
>     to track your own state in the CAT relevant package bitmap.
>
>  2) A way to find out whether the to be offlined CPU is the last
>     online CPU of a package. If it's not the last one, then you need
>     a fast way to find the next online cpu which belongs to that
>     package. If it's the last one you need to kill the cat.
>
>     The information is already available, so it's not rocket
>     science. And it takes maximal 7 lines of code to implement
>     it.
>
>     Q: Why 'maximal' 7?
>     A: Because I'm a lazy bastard and cannot be bothered to figure
>     	out whether one of the lines is superfluous.
>     Q: Why can't I be bothered?
>     A: It's none of my problems and it actually does not matter much.
>
>     Here is the pseudo code:
>
>     	   do_magic_stuff(tmp, TCC, COM);
> 	   clr(c, tmp);
> 	   n = do_more_stuff(tmp);
>     	   if (notavailable(n))
> 	      	kill_the_cat();
> 	   else
> 	        set(n, RCM);
>
>     Hint: One of the NNN placeholders resolves to topology_core_cpumask()
>
> Now once you figured that out, you will notice that the above 5
> lines of code to solve problem #1 can be simplified because you can
> avoid the package_map bitmap completely.
>
> Sorry, no pseudo code for this. But you get more than one hint this
> time:
> 	- Hint #1 still applies
> 	- The logic is very similar to the above #2 pseudo code
> 	- It takes maximal 6 lines of code to implement it
>
> There is one caveat:
>
>    If Hint #1 cannot solve your problem, then you need to figure out
>    why and then work with the people who are responsible for it to
>    figure out how it can be resolved.
>
> A few general hints:
>
>    - The line counts are neither including the conditionals which
>      invoke that code nor the function body nor variable
>      declarations. But they include braces,
>
>      All I care about is the logic itself. See the pseudo code
>      example above.
>
>    - Please provide a solution for #2 and #1 before you bother me
>      with another patch series.

Thanks a lot for the very detailed expectations. I appreciate your time. Will 
work on the requirements and send a patch before anything else.
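
Roughly what I will try, based on the hints above (untested sketch; rdt_cpumask
and cbm_update_msrs() as in the current series, and assuming the offline path
runs before the topology masks are updated):

	static void rdt_cpu_online(unsigned int cpu)
	{
		/* First online cpu of a package takes over the CBM updates. */
		if (!cpumask_intersects(&rdt_cpumask, topology_core_cpumask(cpu))) {
			cpumask_set_cpu(cpu, &rdt_cpumask);
			cbm_update_msrs(cpu);
		}
	}

	static void rdt_cpu_offline(unsigned int cpu)
	{
		int next;

		if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask))
			return;

		/* Hand over to another online cpu in the same package, if any. */
		next = cpumask_any_but(topology_core_cpumask(cpu), cpu);
		if (next < nr_cpu_ids)
			cpumask_set_cpu(next, &rdt_cpumask);
	}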

Vikas

>
> Thanks,
>
> 	tglx
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-05-02  1:36 [PATCH V6 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
@ 2015-05-02  1:36 ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-05-02  1:36 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, peter.zijlstra, h.peter.anvin, kanaka.d.juvva,
	vikas.shivappa

This patch implements a common software cache for IA32_PQR_MSR(RMID 0:9,
    CLOSId 32:63) to be used by both CMT and CAT. CMT updates the RMID
where as CAT updates the CLOSid in the software cache. When the new
RMID/CLOSid value is different from the cached values, IA32_PQR_MSR is
updated. Since the measured rdmsr latency for IA32_PQR_MSR is very
high(~250 cycles) this software cache is necessary to avoid reading the
MSR to compare the current CLOSid value.
During CPU hotplug the pqr cache is updated to zero.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Conflicts:
	arch/x86/kernel/cpu/perf_event_intel_cqm.c
---
 arch/x86/include/asm/intel_rdt.h           | 31 +++++++++++++++---------------
 arch/x86/include/asm/rdt_common.h          | 13 +++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c            |  3 +++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++------------
 4 files changed, 39 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 2fc496f..6aae109 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,12 +4,13 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#include <asm/rdt_common.h>
 
-#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 extern struct static_key rdt_enable_key;
 
 struct rdt_subsys_info {
@@ -61,30 +62,30 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
 static inline void rdt_sched_in(struct task_struct *task)
 {
 	struct intel_rdt *ir;
-	unsigned int clos;
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	unsigned long flags;
 
 	if (!rdt_enabled())
 		return;
 
-	/*
-	 * This needs to be fixed
-	 * to cache the whole PQR instead of just CLOSid.
-	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
-	 * Should not write a 0 to the low 10 bits of PQR
-	 * and corrupt RMID.
-	 */
-	clos = this_cpu_read(x86_cpu_clos);
-
+	raw_spin_lock_irqsave(&state->lock, flags);
 	rcu_read_lock();
 	ir = task_rdt(task);
-	if (ir->clos == clos) {
+	if (ir->clos == state->clos) {
 		rcu_read_unlock();
+		raw_spin_unlock_irqrestore(&state->lock, flags);
 		return;
 	}
 
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
-	this_cpu_write(x86_cpu_clos, ir->clos);
+	/*
+	 * PQR has closid in high 32 bits and CQM-RMID
+	 * in low 10 bits. Rewrite the exsting rmid from
+	 * software cache.
+	 */
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+	state->clos = ir->clos;
 	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 #else
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..33fd8ea
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+struct intel_pqr_state {
+	raw_spinlock_t	lock;
+	int		rmid;
+	int		clos;
+	int		cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 74b1e28..9da61b2 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -347,6 +347,9 @@ static inline bool rdt_update_cpumask(int cpu)
  */
 static inline void rdt_cpu_start(int cpu)
 {
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+	state->clos = 0;
 	mutex_lock(&rdt_group_mutex);
 	if (rdt_update_cpumask(cpu))
 		cbm_update_msrs(cpu);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e4d1b8b..fd039899 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static unsigned int cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-struct intel_cqm_state {
-	raw_spinlock_t		lock;
-	int			rmid;
-	int			cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -961,7 +955,7 @@ out:
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
@@ -978,14 +972,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 		WARN_ON_ONCE(state->rmid);
 
 	state->rmid = rmid;
-	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);
 
 	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned long flags;
 
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -998,7 +992,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 	if (!--state->cnt) {
 		state->rmid = 0;
-		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
 	} else {
 		WARN_ON_ONCE(!state->rmid);
 	}
@@ -1243,7 +1237,7 @@ static inline void cqm_pick_event_reader(int cpu)
 
 static void intel_cqm_cpu_prepare(unsigned int cpu)
 {
-	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	raw_spin_lock_init(&state->lock);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch implements a common software cache for IA32_PQR_MSR (RMID
0:9, CLOSid 32:63) to be used by both CMT and CAT. CMT updates the RMID
whereas CAT updates the CLOSid in the software cache. When the new
RMID/CLOSid value differs from the cached value, IA32_PQR_MSR is
updated. Since the measured rdmsr latency for IA32_PQR_MSR is very high
(~250 cycles), this software cache avoids having to read the MSR just
to compare against the current CLOSid value.
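
To illustrate the caching idea, here is a minimal sketch (not part of
the applied diff; pqr_update_closid is a hypothetical helper name, the
other names are taken from the hunks below):

static inline void pqr_update_closid(struct intel_pqr_state *state, int clos)
{
	/* Skip the costly MSR write when the cached CLOSid already matches. */
	if (state->clos == clos)
		return;

	/* Rewrite the cached RMID in the low bits so CMT state is preserved. */
	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, clos);
	state->clos = clos;
}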

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h           | 31 +++++++++++++++---------------
 arch/x86/include/asm/rdt_common.h          | 13 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++------------
 3 files changed, 36 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 6383a24..5a8139e 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,12 +4,13 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#include <asm/rdt_common.h>
 
-#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 extern struct static_key rdt_enable_key;
 
 struct rdt_subsys_info {
@@ -62,30 +63,30 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
 static inline void rdt_sched_in(struct task_struct *task)
 {
 	struct intel_rdt *ir;
-	unsigned int clos;
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	unsigned long flags;
 
 	if (!rdt_enabled())
 		return;
 
-	/*
-	 * This needs to be fixed after CQM code stabilizes
-	 * to cache the whole PQR instead of just CLOSid.
-	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
-	 * Should not write a 0 to the low 10 bits of PQR
-	 * and corrupt RMID.
-	 */
-	clos = this_cpu_read(x86_cpu_clos);
-
+	raw_spin_lock_irqsave(&state->lock, flags);
 	rcu_read_lock();
 	ir = task_rdt(task);
-	if (ir->clos == clos) {
+	if (ir->clos == state->clos) {
 		rcu_read_unlock();
+		raw_spin_unlock_irqrestore(&state->lock, flags);
 		return;
 	}
 
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
-	this_cpu_write(x86_cpu_clos, ir->clos);
+	/*
+	 * PQR has the CLOSid in the high 32 bits and the CQM RMID in
+	 * the low 10 bits. Rewrite the existing RMID from the software
+	 * cache so it is not clobbered.
+	 */
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+	state->clos = ir->clos;
 	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 #else
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..c87f908
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+struct intel_pqr_state {
+	raw_spinlock_t	lock;
+	int		rmid;
+	int		clos;
+	int		cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 596d1ec..63c52e0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static unsigned int cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-struct intel_cqm_state {
-	raw_spinlock_t		lock;
-	int			rmid;
-	int 			cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -931,7 +925,7 @@ out:
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
@@ -948,14 +942,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 		WARN_ON_ONCE(state->rmid);
 
 	state->rmid = rmid;
-	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);
 
 	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned long flags;
 
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -968,7 +962,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 	if (!--state->cnt) {
 		state->rmid = 0;
-		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
 	} else {
 		WARN_ON_ONCE(!state->rmid);
 	}
@@ -1213,7 +1207,7 @@ static inline void cqm_pick_event_reader(int cpu)
 
 static void intel_cqm_cpu_prepare(unsigned int cpu)
 {
-	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	raw_spin_lock_init(&state->lock);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-02-24 23:16 [PATCH V4 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
@ 2015-02-24 23:16 ` Vikas Shivappa
  0 siblings, 0 replies; 39+ messages in thread
From: Vikas Shivappa @ 2015-02-24 23:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: vikas.shivappa, vikas.shivappa, matt.fleming, hpa, tglx, mingo,
	tj, peterz, will.auld, dave.hansen, andi.kleen, tony.luck,
	kanaka.d.juvva

This patch implements a common software cache for IA32_PQR_MSR (RMID
0:9, CLOSid 32:63) to be used by both CMT and CAT. CMT updates the RMID
whereas CAT updates the CLOSid in the software cache. When the new
RMID/CLOSid value differs from the cached value, IA32_PQR_MSR is
updated. Since the measured rdmsr latency for IA32_PQR_MSR is very high
(~250 cycles), this software cache avoids having to read the MSR just
to compare against the current CLOSid value.
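
The CMT side reuses the same per-cpu cached state so that starting or
stopping an event never clobbers the CLOSid. A minimal sketch of that
half (pqr_update_rmid is a hypothetical helper name; the real logic is
in the intel_cqm_event_start/stop hunks below):

static inline void pqr_update_rmid(struct intel_pqr_state *state, unsigned int rmid)
{
	/* Write the new RMID while rewriting the cached CLOSid in the high bits. */
	state->rmid = rmid;
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->clos);
}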

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h           | 31 +++++++++++++++---------------
 arch/x86/include/asm/rdt_common.h          | 13 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++------------
 3 files changed, 36 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index bc57b56..27621c8 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,12 +4,13 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#include <asm/rdt_common.h>
 
-#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 extern struct static_key rdt_enable_key;
 
 struct rdt_subsys_info {
@@ -64,30 +65,30 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
 static inline void rdt_sched_in(struct task_struct *task)
 {
 	struct intel_rdt *ir;
-	unsigned int clos;
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	unsigned long flags;
 
 	if (!rdt_enabled())
 		return;
 
-	/*
-	 * This needs to be fixed after CQM code stabilizes
-	 * to cache the whole PQR instead of just CLOSid.
-	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
-	 * Should not write a 0 to the low 10 bits of PQR
-	 * and corrupt RMID.
-	 */
-	clos = this_cpu_read(x86_cpu_clos);
-
+	raw_spin_lock_irqsave(&state->lock, flags);
 	rcu_read_lock();
 	ir = task_rdt(task);
-	if (ir->clos == clos) {
+	if (ir->clos == state->clos) {
 		rcu_read_unlock();
+		raw_spin_unlock_irqrestore(&state->lock, flags);
 		return;
 	}
 
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
-	this_cpu_write(x86_cpu_clos, ir->clos);
+	/*
+	 * PQR has the CLOSid in the high 32 bits and the CQM RMID in
+	 * the low 10 bits. Rewrite the existing RMID from the software
+	 * cache so it is not clobbered.
+	 */
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+	state->clos = ir->clos;
 	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 #else
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..c87f908
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+struct intel_pqr_state {
+	raw_spinlock_t	lock;
+	int		rmid;
+	int		clos;
+	int		cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 596d1ec..63c52e0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static unsigned int cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-struct intel_cqm_state {
-	raw_spinlock_t		lock;
-	int			rmid;
-	int 			cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -931,7 +925,7 @@ out:
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
@@ -948,14 +942,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 		WARN_ON_ONCE(state->rmid);
 
 	state->rmid = rmid;
-	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);
 
 	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned long flags;
 
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -968,7 +962,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 	if (!--state->cnt) {
 		state->rmid = 0;
-		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
 	} else {
 		WARN_ON_ONCE(!state->rmid);
 	}
@@ -1213,7 +1207,7 @@ static inline void cqm_pick_event_reader(int cpu)
 
 static void intel_cqm_cpu_prepare(unsigned int cpu)
 {
-	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	raw_spin_lock_init(&state->lock);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2015-05-21 16:38 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-11 19:02 [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
2015-05-11 19:02 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation detection Vikas Shivappa
2015-05-11 19:02 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
2015-05-15 19:18   ` Thomas Gleixner
2015-05-18 17:59     ` Vikas Shivappa
2015-05-18 18:41       ` Thomas Gleixner
2015-05-18 19:20         ` Borislav Petkov
2015-05-19 17:33           ` Vikas Shivappa
2015-05-19 20:35             ` Borislav Petkov
2015-05-18 19:44         ` Vikas Shivappa
2015-05-18 18:52       ` Thomas Gleixner
2015-05-18 19:27         ` Vikas Shivappa
2015-05-11 19:02 ` [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management Vikas Shivappa
2015-05-15 19:25   ` Thomas Gleixner
2015-05-18 19:17     ` Vikas Shivappa
2015-05-18 20:15       ` Thomas Gleixner
2015-05-18 21:09         ` Vikas Shivappa
2015-05-20 17:22         ` Vikas Shivappa
2015-05-20 19:02           ` Thomas Gleixner
2015-05-21  0:54             ` Thomas Gleixner
2015-05-21 16:36               ` Vikas Shivappa
2015-05-11 19:02 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
2015-05-15 19:39   ` Thomas Gleixner
2015-05-18 18:01     ` Vikas Shivappa
2015-05-18 18:45       ` Thomas Gleixner
2015-05-18 19:18         ` Vikas Shivappa
2015-05-11 19:02 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
2015-05-15 20:15   ` Thomas Gleixner
2015-05-20 17:18     ` Vikas Shivappa
2015-05-20 18:50       ` Thomas Gleixner
2015-05-20 20:43         ` Vikas Shivappa
2015-05-20 21:14           ` Thomas Gleixner
2015-05-20 22:51             ` Vikas Shivappa
2015-05-11 19:02 ` [PATCH 6/7] x86/intel_rdt: Intel haswell Cache Allocation enumeration Vikas Shivappa
2015-05-11 19:02 ` [PATCH 7/7] x86/intel_rdt: Add Cache Allocation documentation and usage guide Vikas Shivappa
2015-05-13 16:52 ` [PATCH V7 0/7] x86/intel_rdt: Intel Cache Allocation support Vikas Shivappa
  -- strict thread matches above, loose matches on Subject: below --
2015-05-02  1:36 [PATCH V6 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
2015-05-02  1:36 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
2015-03-12 23:16 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
2015-02-24 23:16 [PATCH V4 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
2015-02-24 23:16 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
