All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/18] Intel Cache Allocation Technology
@ 2016-10-15  2:12 Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id Fenghua Yu
                   ` (17 more replies)
  0 siblings, 18 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Change log in v4:

* Changed CONFIG_INTEL_RDT to CONFIG_RDT_A. Plain "RDT" refers to all
  resource features, both monitoring (CQM, MBM, ...) and control (CAT L3,
  L3/CDP, L2, ...). Adding the "_A" matches with the feature bit name
  for all the control features "X86_FEATURE_RDT_A".
* Nilay: Cleaned up duplicate declarations (and ones that appear in the
  wrong patch)
* Boris: Add comment on which specific Haswell models support CAT L3
* Boris: Use boot_cpu_data and boot_cpu_has()
* Don't call smp_call_function_single() from hot add notifier. We are already
  on the right cpu.
* Thomas (from earlier postings) check return value from rdtgroup_init() and
  cleanup if it failed
* Boris: Be more descriptive than "cache allocation" (we now say "RDT cache
  allocation")
* Nilay: Move some code out of mount() [mostly into closid_alloc()]
* Nilay: Drop wrapper functions rdtgroup_{alloc,free}, just call kzalloc/kfree
* smp_call_function_many() skips current cpu (to Tony's surprise). Make sure
  we make the call locally if current cpu is included in the mask.
* Also be preempt aware around calls to smp_call_function_many()
* Nilay: Add extra parens in:
	if ((attr == &dev_attr_id.attr) && (this_leaf->attributes & CACHE_ID))
* Added a comment for closid_alloc that current allocation is global and will
  do per resource domain allocation later.
* Added CAT L2 support and changed commit message in patch #8.

The patches are in the same order as V3 and have small commit messages changes
in patch #7 and #8.

0001-Documentation-ABI-Add-a-document-entry-for-cache-id.patch
0002-cacheinfo-Introduce-cache-id.patch
0003-x86-intel_cacheinfo-Enable-cache-id-in-x86.patch

	These three define an "id" for each cache ... we need a "name"
	for a cache so we can say what restrictions to apply to each
	cache in the system.  All you will see at this point is an
	extra "id" file in each /sys/devices/system/cpu/cpu*/cache/index*/
	directory.

0004-x86-intel_rdt-Feature-discovery.patch

	Look at CPUID for the features related to cache allocation.
	At this point /proc/cpuinfo shows extra flags for the features
	found on your system.

0005-Documentation-x86-Documentation-for-Intel-resource-a.patch

	Documentation patch could be anywhere in this sequence. We
	put in early so you can read it to see how to use the
	interface.

0006-x86-intel_rdt-Add-CONFIG-Makefile-and-basic-initiali.patch

	Add CONFIG_INTEL_RDT (default "n" ... you'll have to set
	it to have this, and all the following patches do anything).
	Template driver here just checks for features and spams
	the console with one line for each.

0007-x86-intel_rdt-Add-Haswell-feature-discovery.patch

	There are some Haswell systems that support cache allocation,
	but they were made before the CPUID bits were fully defined.
	So we check by probing the CBM base MSR to see if CLOSID
	bits stick. Unless you have one of these Haswells, you won't
	see any difference here.

0008-x86-intel_rdt-Pick-up-L3-L2-RDT-parameters-from-CPUID.patch

	This is all new code, not seen in the previous versions of this
	patch series. L3 and L2 cache allocations are just the first of
	several resource control features. Define rdt_resource structure
	that contains all the useful things we need to know about a
	resource. Pick up the parameters for the resource from CPUID.
	The console spam strings change format here.

0009-x86-cqm-Move-PQR_ASSOC-management-code-into-generic-.patch

	The PQR_ASSOC MSR has a field for the CLOSID (which we need
	define which allocation rules are in effect). But it also
	contains the RMID (used by CQM and MBM perf monitoring).
	The perf code got here first, but defined structures that
	make it easy for the two systems to co-exist without stomping
	on each other. This patch moves the relevant parts into a
	common header file and changes the scope from "static" to
	global so we can access them. No visible change.

0010-x86-intel_rdt-Build-structures-for-each-resource-bas.patch

	For each enabled resource, we build a list of "rdt_domains" based
	on hotplug cpu notifications. Since we only have L3 at this point,
	this is just a list of L3 caches (named by the "id" established
	in the first three patches). As each cache is found we initialize
	the array of CBMs (cache bit masks). No visible change here.

0011-x86-intel_rdt-Add-basic-resctrl-filesystem-support.patch

	Our interface is a kernfs backed file system. Establish the
	mount point, and provide mount/unmount functionality.
	At this point "/sys/fs/resctrl" appears. You can mount and
	unmount the resctrl file system (if your system supports
	code/data prioritization, you can use the "cdp" mount option).
	The file system is empty and doesn't allow creation of any
	files or subdirectories.

0012-x86-intel_rdt-Add-info-files-to-resctrl-file-system.patch

	Parameters for each resource are buried in CPUID leaf 0x10.
	This isn't very user friendly for scripts and applications
	that want to configure resource allocation. Create an
	"info" directory, with a subdirectory for each resource
	containing a couple of useful parameters. Visible change:
	$ ls -l /sys/fs/resctrl/info/L3
	total 0
	-r--r--r-- 1 root root 0 Oct  7 11:20 cbm_val
	-r--r--r-- 1 root root 0 Oct  7 11:20 num_closid

0013-x86-intel_rdt-Add-mkdir-to-resctrl-file-system.patch

	Each resource group is represented by a directory in the
	resctrl file system. The root directory is the default group.
	Use "mkdir" to create new groups and "rmdir" to remove them.
	The maximum number of groups is defined by the effective
	number of CLOSIDs.
	Visible change: If you have CDP (and enable with the "cdp"
	mount option) you will find that you can only create half
	as many groups as without (e.g. 8 vs. 16 on Broadwell, but
	the default group uses one ... so actually 7, 15).

0014-x86-intel_rdt-Add-cpus-file.patch

	One of the control mechanisms for a resource group is the
	logical CPU. Initially all CPUs are assigned to the default
	group. They can be reassigned to other groups by writing
	a cpumask to the "cpus" file. See the documentation for what
	this means.
	Visible change: "cpus" file in the root, and automatically
	in each created subdirectory. You can "echo" masks to these
	files and watch as CPUs added to one group are removed from
	whatever group they previously belonged to. Removing a directory
	will give all CPUs owned by it back to the default (root)
	group.

0015-x86-intel_rdt-Add-tasks-files.patch

	Tasks can be assigned to resource groups by writing their PID
	to a "tasks" file (which removes the task from its previous
	group). Forked/cloned tasks inherit the group from their
	parent. You cannot remove a group (directory) that has any
	tasks assigned.
	Visible change: "tasks" files appear. E.g. (we see two tasks
	in the group, our shell, and the "cat" that it spawned).
	# echo $$ > p0/tasks; cat p0/tasks
	268890
	268914

0016-x86-intel_rdt-Add-schemata-file.patch

	The "schemata" file in each group/directory defines what
	access tasks controlled by this resource are permitted.
	One line per resource type. Fields for each instance of
	the resource. You redefine the access by wrting to the
	file in the same format.
	Visible change: "schemata" file which starts out with maximum
	allowed resources. E.g.
	$ cat schemata
	L3:0=fffff;1=fffff
	Now restrict this group to just 20% of L3 on first cache, but
	allow 50% on the second
	# echo L3:0=f;1=3ff > schemata

0017-x86-intel_rdt-Add-scheduler-hook.patch

	When context switching we check if we are changing resource
	groups for the new process, and update the PQR_ASSOC MSR with
	the new CLOSID if needed.
	Visble change: Everything should be working now. Tasks run with
	the permitted access to L3 cache.

0018-MAINTAINERS-Add-maintainer-for-Intel-RDT-resource-al.patch

	New files ... need a maintainer. Fenghua has the job.

Fenghua Yu (15):
  Documentation, ABI: Add a document entry for cache id
  cacheinfo: Introduce cache id
  x86, intel_cacheinfo: Enable cache id in x86
  x86/intel_rdt: Feature discovery
  Documentation, x86: Documentation for Intel resource allocation user
    interface
  x86/intel_rdt: Add CONFIG, Makefile, and basic initialization
  x86/intel_rdt: Add Haswell feature discovery
  x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  x86/cqm: Move PQR_ASSOC management code into generic code used by both
    CQM and CAT
  x86/intel_rdt: Add basic resctrl filesystem support
  x86/intel_rdt: Add "info" files to resctrl file system
  x86/intel_rdt: Add mkdir to resctrl file system
  x86/intel_rdt: Add tasks files
  x86/intel_rdt: Add scheduler hook
  MAINTAINERS: Add maintainer for Intel RDT resource allocation

Tony Luck (3):
  x86/intel_rdt: Build structures for each resource based on cache
    topology
  x86/intel_rdt: Add cpus file
  x86/intel_rdt: Add schemata file

 Documentation/ABI/testing/sysfs-devices-system-cpu |  16 +
 Documentation/x86/intel_rdt_ui.txt                 | 162 ++++
 MAINTAINERS                                        |   8 +
 arch/x86/Kconfig                                   |  12 +
 arch/x86/events/intel/cqm.c                        |  23 +-
 arch/x86/include/asm/cpufeatures.h                 |   5 +
 arch/x86/include/asm/intel_rdt.h                   | 206 +++++
 arch/x86/include/asm/intel_rdt_common.h            |  27 +
 arch/x86/kernel/cpu/Makefile                       |   2 +
 arch/x86/kernel/cpu/intel_cacheinfo.c              |  20 +
 arch/x86/kernel/cpu/intel_rdt.c                    | 304 +++++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c           | 973 +++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_schemata.c           | 266 ++++++
 arch/x86/kernel/cpu/scattered.c                    |   3 +
 arch/x86/kernel/process_32.c                       |   4 +
 arch/x86/kernel/process_64.c                       |   4 +
 drivers/base/cacheinfo.c                           |   5 +
 include/linux/cacheinfo.h                          |   3 +
 include/linux/sched.h                              |   3 +
 include/uapi/linux/magic.h                         |   1 +
 20 files changed, 2026 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/x86/intel_rdt_ui.txt
 create mode 100644 arch/x86/include/asm/intel_rdt.h
 create mode 100644 arch/x86/include/asm/intel_rdt_common.h
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c

-- 
2.5.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 10:31   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 02/18] cacheinfo: Introduce " Fenghua Yu
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Add an ABI document entry for /sys/devices/system/cpu/cpu*/cache/index*/id.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 4987417..b1c3d69 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -272,6 +272,22 @@ Description:	Parameters for the CPU cache attributes
 				     the modified cache line is written to main
 				     memory only when it is replaced
 
+
+What:		/sys/devices/system/cpu/cpu*/cache/index*/id
+Date:		September 2016
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	Cache id
+
+		The id provides a unique name for a specific instance of
+		a cache of a particular type. E.g. there may be a level
+		3 unified cache on each socket in a server and we may
+		assign them ids 0, 1, 2, ...
+
+		Note that id value may not be contiguous. E.g. level 1
+		caches typically exist per core, but there may not be a
+		power of two cores on a socket, so these caches may be
+		numbered 0, 1, 2, 3, 4, 5, 8, 9, 10, ...
+
 What:		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats
 		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat
 		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub_turbo_stat
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 02/18] cacheinfo: Introduce cache id
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 10:32   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Cache management software needs a name for each instance of a cache of
a particular type.

The current cacheinfo structure does not provide any information about
the underlying hardware so there is no way to expose it.

Hardware with cache management features provides means (cpuid, enumeration
etc.) to retrieve the hardware id of a particular cache instance. Cache
instances which share hardware have the same hardware id.

Add an 'id' field to struct cacheinfo to store this information. Expose
this information under the /sys/devices/system/cpu/cpu*/cache/index*/
directory as well.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/base/cacheinfo.c  | 5 +++++
 include/linux/cacheinfo.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index e9fd32e..00a9688 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -233,6 +233,7 @@ static ssize_t file_name##_show(struct device *dev,		\
 	return sprintf(buf, "%u\n", this_leaf->object);		\
 }
 
+show_one(id, id);
 show_one(level, level);
 show_one(coherency_line_size, coherency_line_size);
 show_one(number_of_sets, number_of_sets);
@@ -314,6 +315,7 @@ static ssize_t write_policy_show(struct device *dev,
 	return n;
 }
 
+static DEVICE_ATTR_RO(id);
 static DEVICE_ATTR_RO(level);
 static DEVICE_ATTR_RO(type);
 static DEVICE_ATTR_RO(coherency_line_size);
@@ -327,6 +329,7 @@ static DEVICE_ATTR_RO(shared_cpu_list);
 static DEVICE_ATTR_RO(physical_line_partition);
 
 static struct attribute *cache_default_attrs[] = {
+	&dev_attr_id.attr,
 	&dev_attr_type.attr,
 	&dev_attr_level.attr,
 	&dev_attr_shared_cpu_map.attr,
@@ -350,6 +353,8 @@ cache_default_attrs_is_visible(struct kobject *kobj,
 	const struct cpumask *mask = &this_leaf->shared_cpu_map;
 	umode_t mode = attr->mode;
 
+	if ((attr == &dev_attr_id.attr) && (this_leaf->attributes & CACHE_ID))
+		return mode;
 	if ((attr == &dev_attr_type.attr) && this_leaf->type)
 		return mode;
 	if ((attr == &dev_attr_level.attr) && this_leaf->level)
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 2189935..0bcbb67 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -18,6 +18,7 @@ enum cache_type {
 
 /**
  * struct cacheinfo - represent a cache leaf node
+ * @id: This cache's id. It is unique among caches with the same (type, level).
  * @type: type of the cache - data, inst or unified
  * @level: represents the hierarchy in the multi-level cache
  * @coherency_line_size: size of each cache line usually representing
@@ -44,6 +45,7 @@ enum cache_type {
  * keeping, the remaining members form the core properties of the cache
  */
 struct cacheinfo {
+	unsigned int id;
 	enum cache_type type;
 	unsigned int level;
 	unsigned int coherency_line_size;
@@ -61,6 +63,7 @@ struct cacheinfo {
 #define CACHE_WRITE_ALLOCATE	BIT(3)
 #define CACHE_ALLOCATE_POLICY_MASK	\
 	(CACHE_READ_ALLOCATE | CACHE_WRITE_ALLOCATE)
+#define CACHE_ID		BIT(4)
 
 	struct device_node *of_node;
 	bool disable_sysfs;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 02/18] cacheinfo: Introduce " Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 10:48   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 04/18] x86/intel_rdt: Feature discovery Fenghua Yu
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Cache id is retrieved from APIC ID and CPUID leaf 4 on x86.

For more details see the section on "Cache ID Extraction Parameters" in
"Intel 64 Architecture Processor Topology Enumeration" at
https://software.intel.com/sites/default/files/63/1a/Kuo_CpuTopology_rc1.rh1.final.pdf

Also "Intel 64 and IA-32 Architectures Software Developer's Manual" volume 2,
table 3-18 "information Returned by CPUID Instruction" at
http://www.intel.com/sdm

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_cacheinfo.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index de6626c..8dc5720 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -153,6 +153,7 @@ struct _cpuid4_info_regs {
 	union _cpuid4_leaf_eax eax;
 	union _cpuid4_leaf_ebx ebx;
 	union _cpuid4_leaf_ecx ecx;
+	unsigned int id;
 	unsigned long size;
 	struct amd_northbridge *nb;
 };
@@ -894,6 +895,8 @@ static void __cache_cpumap_setup(unsigned int cpu, int index,
 static void ci_leaf_init(struct cacheinfo *this_leaf,
 			 struct _cpuid4_info_regs *base)
 {
+	this_leaf->id = base->id;
+	this_leaf->attributes = CACHE_ID;
 	this_leaf->level = base->eax.split.level;
 	this_leaf->type = cache_type_map[base->eax.split.type];
 	this_leaf->coherency_line_size =
@@ -920,6 +923,22 @@ static int __init_cache_level(unsigned int cpu)
 	return 0;
 }
 
+/*
+ * The max shared threads number comes from CPUID.4:EAX[25-14] with input
+ * ECX as cache index. Then right shift apicid by the number's order to get
+ * cache id for this cache node.
+ */
+static void get_cache_id(int cpu, struct _cpuid4_info_regs *id4_regs)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	unsigned long num_threads_sharing;
+	int index_msb;
+
+	num_threads_sharing = 1 + id4_regs->eax.split.num_threads_sharing;
+	index_msb = get_count_order(num_threads_sharing);
+	id4_regs->id = c->apicid >> index_msb;
+}
+
 static int __populate_cache_leaves(unsigned int cpu)
 {
 	unsigned int idx, ret;
@@ -931,6 +950,7 @@ static int __populate_cache_leaves(unsigned int cpu)
 		ret = cpuid4_cache_lookup_regs(idx, &id4_regs);
 		if (ret)
 			return ret;
+		get_cache_id(cpu, &id4_regs);
 		ci_leaf_init(this_leaf++, &id4_regs);
 		__cache_cpumap_setup(cpu, idx, &id4_regs);
 	}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 04/18] x86/intel_rdt: Feature discovery
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (2 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 05/18] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Check CPUID leaves for all the Resource Director Technology (RDT)
Cache Allocation Technology (CAT) bits.

Presence of allocation features:
  CPUID.(EAX=7H, ECX=0):EBX[bit 15]	X86_FEATURE_RDT_A

L2 and L3 caches are each separately enabled:
  CPUID.(EAX=10H, ECX=0):EBX[bit 1]	X86_FEATURE_CAT_L3
  CPUID.(EAX=10H, ECX=0):EBX[bit 2]	X86_FEATURE_CAT_L2

L3 cache may support independent control of allocation for
code and data (CDP = Code/Data Prioritization):
  CPUID.(EAX=10H, ECX=1):ECX[bit 2]	X86_FEATURE_CDP_L3

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/cpufeatures.h | 5 +++++
 arch/x86/kernel/cpu/scattered.c    | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 92a8308..64dd8274 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -196,6 +196,10 @@
 
 #define X86_FEATURE_INTEL_PT	( 7*32+15) /* Intel Processor Trace */
 
+#define X86_FEATURE_CAT_L3	( 7*32+16) /* Cache Allocation Technology L3 */
+#define X86_FEATURE_CAT_L2	( 7*32+17) /* Cache Allocation Technology L2 */
+#define X86_FEATURE_CDP_L3	( 7*32+18) /* Code and Data Prioritization L3 */
+
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  ( 8*32+ 0) /* Intel TPR Shadow */
 #define X86_FEATURE_VNMI        ( 8*32+ 1) /* Intel Virtual NMI */
@@ -220,6 +224,7 @@
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
 #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT_A	( 9*32+15) /* Resource Director Technology Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_AVX512DQ	( 9*32+17) /* AVX-512 DQ (Double/Quad granular) Instructions */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 8cb57df..11f39a2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -34,6 +34,9 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_INTEL_PT,		CR_EBX,25, 0x00000007, 0 },
 		{ X86_FEATURE_APERFMPERF,	CR_ECX, 0, 0x00000006, 0 },
 		{ X86_FEATURE_EPB,		CR_ECX, 3, 0x00000006, 0 },
+		{ X86_FEATURE_CAT_L3,		CR_EBX, 1, 0x00000010, 0 },
+		{ X86_FEATURE_CAT_L2,		CR_EBX, 2, 0x00000010, 0 },
+		{ X86_FEATURE_CDP_L3,		CR_ECX, 2, 0x00000010, 1 },
 		{ X86_FEATURE_HW_PSTATE,	CR_EDX, 7, 0x80000007, 0 },
 		{ X86_FEATURE_CPB,		CR_EDX, 9, 0x80000007, 0 },
 		{ X86_FEATURE_PROC_FEEDBACK,	CR_EDX,11, 0x80000007, 0 },
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 05/18] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (3 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 04/18] x86/intel_rdt: Feature discovery Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization Fenghua Yu
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

The documentation describes user interface of how to allocate resource
in Intel RDT.

Please note that the documentation covers generic user interface. Current
patch set code only implemente CAT L3. CAT L2 code will be sent later.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt | 162 +++++++++++++++++++++++++++++++++++++
 1 file changed, 162 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt_ui.txt

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
new file mode 100644
index 0000000..e567819
--- /dev/null
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -0,0 +1,162 @@
+User Interface for Resource Allocation in Intel Resource Director Technology
+
+Copyright (C) 2016 Intel Corporation
+
+Fenghua Yu <fenghua.yu@intel.com>
+Tony Luck <tony.luck@intel.com>
+
+This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+
+To use the feature mount the file system:
+
+ # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl
+
+mount options are:
+
+"cdp": Enable code/data prioritization in L3 cache allocations.
+
+
+Resource groups
+---------------
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory. Other groups may be
+created as desired by the system administrator using the "mkdir(1)"
+command, and removed using "rmdir(1)".
+
+There are three files associated with each group:
+
+"tasks": A list of tasks that belongs to this group. Tasks can be
+	added to a group by writing the task ID to the "tasks" file
+	(which will automatically remove them from the previous
+	group to which they belonged). New tasks created by fork(2)
+	and clone(2) are added to the same group as their parent.
+	If a pid is not in any sub partition, it is in root partition
+	(i.e. default partition).
+
+"cpus": A bitmask of logical CPUs assigned to this group. Writing
+	a new mask can add/remove CPUs from this group. Added CPUs
+	are removed from their previous group. Removed ones are
+	given to the default (root) group. You cannot remove CPUs
+	from the default group.
+
+"schemata": A list of all the resources available to this group.
+	Each resource has its own line and format - see below for
+	details.
+
+When a task is running the following rules define which resources
+are available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+CPU that is assigned to some specific group, then the schemata for
+the CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical cpus sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps).  To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/max_cbm_val". X86 hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
+of the capacity of the cache. You could partition the cache into four
+equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+
+
+L3 details (code and data prioritization disabled)
+--------------------------------------------------
+With CDP disabled the L3 schemata format is:
+
+	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 details (CDP enabled via mount option to resctrl)
+----------------------------------------------------
+When CDP is enabled, you need to specify separate cache bit masks for
+code and data access. The generic format is:
+
+	L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
+
+where the d_cbm masks are for data access, and the i_cbm masks for code.
+
+
+Example 1
+---------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Example 2
+---------
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 cannot be used by ordinary tasks:
+
+# echo "L3:0=3ff;1=fffff" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+
+# mkdir p0
+# echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+
+# echo 1234 > p0/tasks
+# taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache):
+
+# mkdir p1
+# echo "L3:0=7c00;1=fffff" > p1/schemata
+# echo 5678 > p1/tasks
+# taskset -cp 2 5678
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (4 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 05/18] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 10:57   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery Fenghua Yu
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Introduce CONFIG_INTEL_RDT_A (default: no, dependent on X86 and
CPU_SUP_INTEL) to control inclusion of Resource Director Technology in
the build.

Simple init() routine just checks which features are present. If they are
pr_info() one line summary for each feature.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/Kconfig                | 12 +++++++++
 arch/x86/kernel/cpu/Makefile    |  2 ++
 arch/x86/kernel/cpu/intel_rdt.c | 54 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a1f0ce..b245fcb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -406,6 +406,18 @@ config GOLDFISH
        def_bool y
        depends on X86_GOLDFISH
 
+config INTEL_RDT_A
+	bool "Intel Resource Director Technology Allocation support"
+	default n
+	depends on X86 && CPU_SUP_INTEL
+	help
+	  Select to enable resource allocation which is a sub-feature of
+	  Intel Resource Director Technology(RDT). More information about
+	  RDT can be found in the Intel x86 Architecture Software
+	  Developer Manual June 2016, volume 3, section 17.17.
+
+	  Say N if unsure.
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..cf4bfd0 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,6 +34,8 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
+obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o
+
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
 obj-$(CONFIG_MICROCODE)			+= microcode/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..7d7aebe
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,54 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ *    Fenghua Yu <fenghua.yu@intel.com>
+ *    Tony Luck <tony.luck@intel.com>
+ *    Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static inline bool get_rdt_resources(void)
+{
+	bool ret = false;
+
+	if (!boot_cpu_has(X86_FEATURE_RDT_A))
+		return false;
+	if (boot_cpu_has(X86_FEATURE_CAT_L3))
+		ret = true;
+
+	return ret;
+}
+
+static int __init intel_rdt_late_init(void)
+{
+	if (!get_rdt_resources())
+		return -ENODEV;
+
+	pr_info("Intel RDT cache allocation detected\n");
+	if (boot_cpu_has(X86_FEATURE_CDP_L3))
+		pr_info("Intel RDT code data prioritization detected\n");
+
+	return 0;
+}
+
+late_initcall(intel_rdt_late_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (5 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 11:03   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID Fenghua Yu
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Some Haswell generation CPUs support RDT, but they don't enumerate this
using CPUID.  Use rdmsr_safe() and wrmsr_safe() to probe the MSRs on
cpu model 63 (INTEL_FAM6_HASWELL_X)

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/events/intel/cqm.c             |  2 +-
 arch/x86/include/asm/intel_rdt.h        |  6 +++++
 arch/x86/include/asm/intel_rdt_common.h |  6 +++++
 arch/x86/kernel/cpu/intel_rdt.c         | 41 +++++++++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/intel_rdt.h
 create mode 100644 arch/x86/include/asm/intel_rdt_common.h

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 8f82b02..df86874 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -7,9 +7,9 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/intel_rdt_common.h>
 #include "../perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..3aca86d
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_X86_INTEL_RDT_H
+#define _ASM_X86_INTEL_RDT_H
+
+#define IA32_L3_CBM_BASE	0xc90
+
+#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
new file mode 100644
index 0000000..e6e15cf
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_X86_INTEL_RDT_COMMON_H
+#define _ASM_X86_INTEL_RDT_COMMON_H
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+#endif /* _ASM_X86_INTEL_RDT_COMMON_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 7d7aebe..9d55942 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -27,10 +27,51 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 
+#include <asm/intel_rdt_common.h>
+#include <asm/intel-family.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
+ * as they do not have CPUID enumeration support for Cache allocation.
+ * The check for Vendor/Family/Model is not enough to guarantee that
+ * the MSRs won't #GP fault because only the following SKUs support
+ * CAT:
+ *	Intel(R) Xeon(R)  CPU E5-2658  v3  @  2.20GHz
+ *	Intel(R) Xeon(R)  CPU E5-2648L v3  @  1.80GHz
+ *	Intel(R) Xeon(R)  CPU E5-2628L v3  @  2.00GHz
+ *	Intel(R) Xeon(R)  CPU E5-2618L v3  @  2.30GHz
+ *	Intel(R) Xeon(R)  CPU E5-2608L v3  @  2.00GHz
+ *	Intel(R) Xeon(R)  CPU E5-2658A v3  @  2.20GHz
+ *
+ * Probe by trying to write the first of the L3 cach mask registers
+ * and checking that the bits stick. Max CLOSids is always 4 and max cbm length
+ * is always 20 on hsw server parts. The minimum cache bitmask length
+ * allowed for HSW server is always 2 bits. Hardcode all of them.
+ */
+static inline bool cache_alloc_hsw_probe(void)
+{
+	u32 l, h;
+	u32 max_cbm = BIT_MASK(20) - 1;
+
+	if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0))
+		return false;
+	rdmsr(IA32_L3_CBM_BASE, l, h);
+	if (l != max_cbm)
+		return false;
+
+	return true;
+}
+
 static inline bool get_rdt_resources(void)
 {
 	bool ret = false;
 
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    boot_cpu_data.x86 == 6 &&
+	    boot_cpu_data.x86_model == INTEL_FAM6_HASWELL_X)
+		return cache_alloc_hsw_probe();
+
 	if (!boot_cpu_has(X86_FEATURE_RDT_A))
 		return false;
 	if (boot_cpu_has(X86_FEATURE_CAT_L3))
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (6 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 13:45   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 09/18] x86/cqm: Move PQR_ASSOC management code into generic code used by both CQM and CAT Fenghua Yu
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Define struct rdt_resource to hold all the parameterized
values for an RDT resource. Fill in some of those values
from CPUID leaf 0x10 (on Haswell we hard code them).

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 62 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 72 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 3aca86d..8c61d83 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -2,5 +2,67 @@
 #define _ASM_X86_INTEL_RDT_H
 
 #define IA32_L3_CBM_BASE	0xc90
+#define IA32_L2_CBM_BASE	0xd10
 
+/**
+ * struct rdt_resource - attributes of an RDT resource
+ * @enabled:			Is this feature enabled on this machine
+ * @name:			Name to use in "schemata" file
+ * @max_closid:			Maximum number of CLOSIDs supported
+ * @num_closid:			Current number of CLOSIDs available
+ * @max_cbm:			Largest Cache Bit Mask allowed
+ * @min_cbm_bits:		Minimum number of bits to be set in a cache
+ *				bit mask
+ * @domains:			All domains for this resource
+ * @num_domains:		Number of domains active
+ * @msr_base:			Base MSR address for CBMs
+ * @cdp_capable:		Code/Data Prioritization available
+ * @cdp_enabled:		Code/Data Prioritization enabled
+ * @tmp_cbms:			Scratch space when updating schemata
+ * @cache_level:		Which cache level defines scope of this domain
+ */
+struct rdt_resource {
+	bool			enabled;
+	char			*name;
+	int			max_closid;
+	int			num_closid;
+	int			cbm_len;
+	int			min_cbm_bits;
+	u32			max_cbm;
+	struct list_head	domains;
+	int			num_domains;
+	int			msr_base;
+	bool			cdp_capable;
+	bool			cdp_enabled;
+	u32			*tmp_cbms;
+	int			cache_level;
+};
+
+#define for_each_rdt_resource(r)	\
+	for (r = rdt_resources_all; r->name; r++) \
+		if (r->enabled)
+
+#define IA32_L3_CBM_BASE	0xc90
+extern struct rdt_resource rdt_resources_all[];
+
+enum {
+	RDT_RESOURCE_L3,
+	RDT_RESOURCE_L2,
+};
+
+/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
+union cpuid_0x10_1_eax {
+	struct {
+		unsigned int cbm_len:5;
+	} split;
+	unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID=1).EDX */
+union cpuid_0x10_1_edx {
+	struct {
+		unsigned int cos_max:16;
+	} split;
+	unsigned int full;
+};
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 9d55942..87f9650 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -31,6 +31,28 @@
 #include <asm/intel-family.h>
 #include <asm/intel_rdt.h>
 
+#define domain_init(name) LIST_HEAD_INIT(rdt_resources_all[name].domains)
+
+struct rdt_resource rdt_resources_all[] = {
+	{
+		.name		= "L3",
+		.domains	= domain_init(RDT_RESOURCE_L3),
+		.msr_base	= IA32_L3_CBM_BASE,
+		.min_cbm_bits	= 1,
+		.cache_level	= 3
+	},
+	{
+		.name		= "L2",
+		.domains	= domain_init(RDT_RESOURCE_L2),
+		.msr_base	= IA32_L2_CBM_BASE,
+		.min_cbm_bits	= 1,
+		.cache_level	= 2
+	},
+	{
+		/* NULL terminated */
+	}
+};
+
 /*
  * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
  * as they do not have CPUID enumeration support for Cache allocation.
@@ -53,6 +75,7 @@ static inline bool cache_alloc_hsw_probe(void)
 {
 	u32 l, h;
 	u32 max_cbm = BIT_MASK(20) - 1;
+	struct rdt_resource *r  = &rdt_resources_all[RDT_RESOURCE_L3];
 
 	if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0))
 		return false;
@@ -60,11 +83,19 @@ static inline bool cache_alloc_hsw_probe(void)
 	if (l != max_cbm)
 		return false;
 
+	r->max_closid = 4;
+	r->num_closid = r->max_closid;
+	r->cbm_len = 20;
+	r->max_cbm = max_cbm;
+	r->min_cbm_bits = 2;
+	r->enabled = true;
+
 	return true;
 }
 
 static inline bool get_rdt_resources(void)
 {
+	struct rdt_resource *r;
 	bool ret = false;
 
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
@@ -74,20 +105,53 @@ static inline bool get_rdt_resources(void)
 
 	if (!boot_cpu_has(X86_FEATURE_RDT_A))
 		return false;
-	if (boot_cpu_has(X86_FEATURE_CAT_L3))
+	if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
+		union cpuid_0x10_1_eax eax;
+		union cpuid_0x10_1_edx edx;
+		u32 ebx, ecx;
+
+		r = &rdt_resources_all[RDT_RESOURCE_L3];
+		cpuid_count(0x00000010, 1, &eax.full, &ebx, &ecx, &edx.full);
+		r->max_closid = edx.split.cos_max + 1;
+		r->num_closid = r->max_closid;
+		r->cbm_len = eax.split.cbm_len + 1;
+		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
+		if (boot_cpu_has(X86_FEATURE_CDP_L3))
+			r->cdp_capable = true;
+		r->enabled = true;
+
 		ret = true;
+	}
+	if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
+		union cpuid_0x10_1_eax eax;
+		union cpuid_0x10_1_edx edx;
+		u32 ebx, ecx;
+
+		/* CPUID 0x10.2 fields are same format at 0x10.1 */
+		r = &rdt_resources_all[RDT_RESOURCE_L2];
+		cpuid_count(0x00000010, 2, &eax.full, &ebx, &ecx, &edx.full);
+		r->max_closid = edx.split.cos_max + 1;
+		r->num_closid = r->max_closid;
+		r->cbm_len = eax.split.cbm_len + 1;
+		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
+		r->enabled = true;
+
+		ret = true;
+	}
 
 	return ret;
 }
 
 static int __init intel_rdt_late_init(void)
 {
+	struct rdt_resource *r;
+
 	if (!get_rdt_resources())
 		return -ENODEV;
 
-	pr_info("Intel RDT cache allocation detected\n");
-	if (boot_cpu_has(X86_FEATURE_CDP_L3))
-		pr_info("Intel RDT code data prioritization detected\n");
+	for_each_rdt_resource(r)
+		pr_info("Intel RDT %s allocation %s detected\n", r->name,
+			r->cdp_capable ? " (with CDP)" : "");
 
 	return 0;
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 09/18] x86/cqm: Move PQR_ASSOC management code into generic code used by both CQM and CAT
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (7 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology Fenghua Yu
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

A CLOSID goes into effect when written to the PQR_ASSOC MSR. But this MSR
also contains the RMID used for preformance monitoring of cache occupancy
and memory bandwidth. Move the management code for this MSR out of
arch/x86/events/intel/cqm.c and into generic RDT code so we can coordinate
updates to the MSR.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/events/intel/cqm.c             | 21 +--------------------
 arch/x86/include/asm/intel_rdt_common.h | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index df86874..0c45cc8 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -24,32 +24,13 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
 unsigned int mbm_socket_max;
 
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid:		The cached Resource Monitoring ID
- * @closid:		The cached Class Of Service ID
- * @rmid_usecnt:	The usage counter for rmid
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
-	u32			rmid;
-	u32			closid;
-	int			rmid_usecnt;
-};
-
 /*
  * The cached intel_pqr_state is strictly per CPU and can never be
  * updated from a remote CPU. Both functions which modify the state
  * (intel_cqm_event_start and intel_cqm_event_stop) are called with
  * interrupts disabled, which is sufficient for the protection.
  */
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 static struct hrtimer *mbm_timers;
 /**
  * struct sample - mbm event's (local or total) data
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index e6e15cf..b31081b 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -3,4 +3,25 @@
 
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid:		The cached Resource Monitoring ID
+ * @closid:		The cached Class Of Service ID
+ * @rmid_usecnt:	The usage counter for rmid
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+	u32			rmid;
+	u32			closid;
+	int			rmid_usecnt;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+
 #endif /* _ASM_X86_INTEL_RDT_COMMON_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (8 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 09/18] x86/cqm: Move PQR_ASSOC management code into generic code used by both CQM and CAT Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 14:44   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support Fenghua Yu
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Tony Luck <tony.luck@intel.com>

We use the cpu hotplug notifier to catch each cpu in turn and look at
its cache topology w.r.t each of the resource groups. As we discover
new resources, we initialize the bitmask array for each to the default
(full access) value.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  31 ++++++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 128 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 8c61d83..b3df691 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -43,6 +43,35 @@ struct rdt_resource {
 		if (r->enabled)
 
 #define IA32_L3_CBM_BASE	0xc90
+
+/**
+ * struct rdt_domain - group of cpus sharing an RDT resource
+ * @list:	all instances of this resource
+ * @id:		unique id for this instance
+ * @cpu_mask:	which cpus share this resource
+ * @cbm:	array of cache bit masks (indexed by CLOSID)
+ */
+struct rdt_domain {
+	struct list_head	list;
+	int			id;
+	struct cpumask		cpu_mask;
+	u32			*cbm;
+};
+
+/**
+ * struct msr_param - set a range of MSRs from a domain
+ * @res:       The resource to use
+ * @low:       Beginning index from base MSR
+ * @high:      End index
+ */
+struct msr_param {
+	struct rdt_resource	*res;
+	int			low;
+	int			high;
+};
+
+extern struct mutex rdtgroup_mutex;
+
 extern struct rdt_resource rdt_resources_all[];
 
 enum {
@@ -65,4 +94,6 @@ union cpuid_0x10_1_edx {
 	} split;
 	unsigned int full;
 };
+
+void rdt_cbm_update(void *arg);
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 87f9650..0b47ea9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -26,11 +26,16 @@
 
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/cacheinfo.h>
+#include <linux/cpuhotplug.h>
 
 #include <asm/intel_rdt_common.h>
 #include <asm/intel-family.h>
 #include <asm/intel_rdt.h>
 
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
 #define domain_init(name) LIST_HEAD_INIT(rdt_resources_all[name].domains)
 
 struct rdt_resource rdt_resources_all[] = {
@@ -142,13 +147,136 @@ static inline bool get_rdt_resources(void)
 	return ret;
 }
 
+static int get_cache_id(int cpu, int level)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+	int i;
+
+	for (i = 0; i < ci->num_leaves; i++)
+		if (ci->info_list[i].level == level)
+			return ci->info_list[i].id;
+	return -1;
+}
+
+void rdt_cbm_update(void *arg)
+{
+	struct msr_param *m = (struct msr_param *)arg;
+	struct rdt_resource *r = m->res;
+	struct rdt_domain *d;
+	struct list_head *l;
+	int i, cpu = smp_processor_id();
+
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+			goto found;
+	}
+	pr_info_once("cpu %d not found in any domain for resource %s\n",
+		     cpu, r->name);
+
+found:
+	for (i = m->low; i < m->high; i++)
+		wrmsrl(r->msr_base + i, d->cbm[i]);
+}
+
+static void update_domain(int cpu, struct rdt_resource *r, int add)
+{
+	struct list_head *l;
+	struct rdt_domain *d;
+	int i, cache_id;
+
+	cache_id = get_cache_id(cpu, r->cache_level);
+
+	if (cache_id == -1) {
+		pr_info_once("Could't find cache id for cpu %d\n", cpu);
+		return;
+	}
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		if (cache_id == d->id)
+			goto found;
+		if (cache_id < d->id)
+			break;
+	}
+	if (!add) {
+		pr_info_once("removed unknown cpu %d\n", cpu);
+		return;
+	}
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return;
+
+	d->id = cache_id;
+	d->cbm = kmalloc_array(r->max_closid, sizeof(*d->cbm), GFP_KERNEL);
+	if (!d->cbm) {
+		pr_info("Failed to alloc CBM array for cpu %d\n", cpu);
+		kfree(d);
+		return;
+	}
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+	for (i = 0; i < r->max_closid; i++) {
+		d->cbm[i] = r->max_cbm;
+		wrmsrl(r->msr_base + i, d->cbm[i]);
+	}
+	list_add_tail(&d->list, l);
+	r->num_domains++;
+	return;
+
+found:
+	if (add) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+	} else {
+		cpumask_clear_cpu(cpu, &d->cpu_mask);
+		if (cpumask_empty(&d->cpu_mask)) {
+			r->num_domains--;
+			kfree(d->cbm);
+			list_del(&d->list);
+			kfree(d);
+		}
+	}
+}
+
+static int intel_rdt_online_cpu(unsigned int cpu)
+{
+	struct rdt_resource *r;
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	mutex_lock(&rdtgroup_mutex);
+	for_each_rdt_resource(r)
+		update_domain(cpu, r, 1);
+	state->closid = 0;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
+	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
+}
+
+static int intel_rdt_offline_cpu(unsigned int cpu)
+{
+	struct rdt_resource *r;
+
+	mutex_lock(&rdtgroup_mutex);
+	for_each_rdt_resource(r)
+		update_domain(cpu, r, 0);
+	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct rdt_resource *r;
+	int state;
 
 	if (!get_rdt_resources())
 		return -ENODEV;
 
+	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+				"AP_INTEL_RDT_ONLINE",
+				intel_rdt_online_cpu, intel_rdt_offline_cpu);
+	if (state < 0)
+		return state;
+
 	for_each_rdt_resource(r)
 		pr_info("Intel RDT %s allocation %s detected\n", r->name,
 			r->cdp_capable ? " (with CDP)" : "");
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (9 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 19:35   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system Fenghua Yu
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Use kernfs as basis for our user interface filesystem. This patch
supports mount/umount, and one mount parameter "cdp" to enable code/data
prioritization (though all we do at this point is ensure that the system
can support CDP).  The file system is not populated yet in this patch.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         |  22 +++
 arch/x86/kernel/cpu/Makefile             |   2 +-
 arch/x86/kernel/cpu/intel_rdt.c          |   8 +-
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 235 +++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h               |   1 +
 5 files changed, 266 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index b3df691..f7acae2 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -5,6 +5,23 @@
 #define IA32_L2_CBM_BASE	0xd10
 
 /**
+ * struct rdtgroup - store rdtgroup's data in resctrl file system.
+ * @kn:				kernfs node
+ * @rdtgroup_list:		linked list for all rdtgroups
+ * @closid:			closid for this rdtgroup
+ */
+struct rdtgroup {
+	struct kernfs_node	*kn;
+	struct list_head	rdtgroup_list;
+	int			closid;
+};
+
+/* List of all resource groups */
+extern struct list_head rdt_all_groups;
+
+int __init rdtgroup_init(void);
+
+/**
  * struct rdt_resource - attributes of an RDT resource
  * @enabled:			Is this feature enabled on this machine
  * @name:			Name to use in "schemata" file
@@ -42,6 +59,7 @@ struct rdt_resource {
 	for (r = rdt_resources_all; r->name; r++) \
 		if (r->enabled)
 
+#define IA32_L3_QOS_CFG		0xc81
 #define IA32_L3_CBM_BASE	0xc90
 
 /**
@@ -73,6 +91,10 @@ struct msr_param {
 extern struct mutex rdtgroup_mutex;
 
 extern struct rdt_resource rdt_resources_all[];
+extern struct rdtgroup rdtgroup_default;
+extern struct static_key_false rdt_enable_key;
+
+int __init rdtgroup_init(void);
 
 enum {
 	RDT_RESOURCE_L3,
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index cf4bfd0..b4334e8 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o
+obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_rdtgroup.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 0b47ea9..3d4a125 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -266,7 +266,7 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
 static int __init intel_rdt_late_init(void)
 {
 	struct rdt_resource *r;
-	int state;
+	int state, ret;
 
 	if (!get_rdt_resources())
 		return -ENODEV;
@@ -277,6 +277,12 @@ static int __init intel_rdt_late_init(void)
 	if (state < 0)
 		return state;
 
+	ret = rdtgroup_init();
+	if (ret) {
+		cpuhp_remove_state(state);
+		return ret;
+	}
+
 	for_each_rdt_resource(r)
 		pr_info("Intel RDT %s allocation %s detected\n", r->name,
 			r->cdp_capable ? " (with CDP)" : "");
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
new file mode 100644
index 0000000..b2bbd04
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -0,0 +1,235 @@
+/*
+ * User interface for Resource Alloction in Resource Director Technology(RDT)
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Fenghua Yu <fenghua.yu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/kernfs.h>
+#include <linux/slab.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/intel_rdt.h>
+
+DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
+struct kernfs_root *rdt_root;
+struct rdtgroup rdtgroup_default;
+LIST_HEAD(rdt_all_groups);
+
+static void l3_qos_cfg_update(void *arg)
+{
+	struct rdt_resource *r = arg;
+
+	wrmsrl(IA32_L3_QOS_CFG, r->cdp_enabled);
+}
+
+static void set_l3_qos_cfg(struct rdt_resource *r)
+{
+	struct list_head *l;
+	struct rdt_domain *d;
+	struct cpumask cpu_mask;
+	int cpu;
+
+	cpumask_clear(&cpu_mask);
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);
+	}
+	cpu = get_cpu();
+	/* Update QOS_CFG MSR on this cpu if it's in cpu_mask. */
+	if (cpumask_test_cpu(cpu, &cpu_mask))
+		l3_qos_cfg_update(r);
+	/* Update QOS_CFG MSR on all other cpus in cpu_mask. */
+	smp_call_function_many(&cpu_mask, l3_qos_cfg_update, r, 1);
+	put_cpu();
+}
+
+static int parse_rdtgroupfs_options(char *data)
+{
+	char *token, *o = data;
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+
+	r->cdp_enabled = false;
+	while ((token = strsep(&o, ",")) != NULL) {
+		if (!*token)
+			return -EINVAL;
+
+		if (!strcmp(token, "cdp"))
+			if (r->enabled && r->cdp_capable)
+				r->cdp_enabled = true;
+	}
+
+	return 0;
+}
+
+static struct dentry *rdt_mount(struct file_system_type *fs_type,
+				int flags, const char *unused_dev_name,
+				void *data)
+{
+	struct dentry *dentry;
+	int ret;
+	bool new_sb;
+
+	mutex_lock(&rdtgroup_mutex);
+	/*
+	 * resctrl file system can only be mounted once.
+	 */
+	if (static_branch_unlikely(&rdt_enable_key)) {
+		dentry = ERR_PTR(-EBUSY);
+		goto out;
+	}
+
+	ret = parse_rdtgroupfs_options(data);
+	if (ret) {
+		dentry = ERR_PTR(ret);
+		goto out;
+	}
+
+	dentry = kernfs_mount(fs_type, flags, rdt_root,
+			      RDTGROUP_SUPER_MAGIC, &new_sb);
+	if (IS_ERR(dentry))
+		goto out;
+	if (!new_sb) {
+		dentry = ERR_PTR(-EINVAL);
+		goto out;
+	}
+	if (rdt_resources_all[RDT_RESOURCE_L3].cdp_capable)
+		set_l3_qos_cfg(&rdt_resources_all[RDT_RESOURCE_L3]);
+	static_branch_enable(&rdt_enable_key);
+
+out:
+	mutex_unlock(&rdtgroup_mutex);
+
+	return dentry;
+}
+
+static void reset_all_cbms(struct rdt_resource *r)
+{
+	struct list_head *l;
+	struct rdt_domain *d;
+	struct msr_param msr_param;
+	struct cpumask cpu_mask;
+	int i, cpu;
+
+	cpumask_clear(&cpu_mask);
+	msr_param.res = r;
+	msr_param.low = 0;
+	msr_param.high = r->max_closid;
+
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);
+
+		for (i = 0; i < r->max_closid; i++)
+			d->cbm[i] = r->max_cbm;
+	}
+	cpu = get_cpu();
+	/* Update CBM on this cpu if it's in cpu_mask. */
+	if (cpumask_test_cpu(cpu, &cpu_mask))
+		rdt_cbm_update(&msr_param);
+	/* Updte CBM on all other cpus in cpu_mask. */
+	smp_call_function_many(&cpu_mask, rdt_cbm_update, &msr_param, 1);
+	put_cpu();
+}
+
+static void rdt_kill_sb(struct super_block *sb)
+{
+	struct rdt_resource *r;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	/*Put everything back to default values. */
+	for_each_rdt_resource(r)
+		reset_all_cbms(r);
+	r = &rdt_resources_all[RDT_RESOURCE_L3];
+	if (r->cdp_capable) {
+		r->cdp_enabled = 0;
+		set_l3_qos_cfg(r);
+	}
+
+	static_branch_disable(&rdt_enable_key);
+	kernfs_kill_sb(sb);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static struct file_system_type rdt_fs_type = {
+	.name    = "resctrl",
+	.mount   = rdt_mount,
+	.kill_sb = rdt_kill_sb,
+};
+
+static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
+};
+
+static int __init rdtgroup_setup_root(void)
+{
+	rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
+				      KERNFS_ROOT_CREATE_DEACTIVATED,
+				      &rdtgroup_default);
+	if (IS_ERR(rdt_root))
+		return PTR_ERR(rdt_root);
+
+	mutex_lock(&rdtgroup_mutex);
+
+	rdtgroup_default.closid = 0;
+	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
+
+	rdtgroup_default.kn = rdt_root->kn;
+	kernfs_activate(rdtgroup_default.kn);
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
+}
+
+/*
+ * rdtgroup_init - rdtgroup initialization
+ *
+ * Setup resctrl file system including set up root, create mount point,
+ * register rdtgroup filesystem, and initialize files under root directory.
+ *
+ * Return: 0 on success or -errno
+ */
+int __init rdtgroup_init(void)
+{
+	int ret = 0;
+
+	ret = rdtgroup_setup_root();
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_mount_point(fs_kobj, "resctrl");
+	if (ret)
+		goto cleanup_root;
+
+	ret = register_filesystem(&rdt_fs_type);
+	if (ret)
+		goto cleanup_mountpoint;
+
+	return 0;
+
+cleanup_mountpoint:
+	sysfs_remove_mount_point(fs_kobj, "resctrl");
+cleanup_root:
+	kernfs_destroy_root(rdt_root);
+
+	return ret;
+}
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e398bea..27ef03d 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -57,6 +57,7 @@
 #define CGROUP_SUPER_MAGIC	0x27e0eb
 #define CGROUP2_SUPER_MAGIC	0x63677270
 
+#define RDTGROUP_SUPER_MAGIC	0x7655821
 
 #define STACK_END_MAGIC		0x57AC6E9D
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (10 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 19:46   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 13/18] x86/intel_rdt: Add mkdir " Fenghua Yu
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

For the convenience of applications we make the decoded values of some
of the CPUID values available in read-only (0444) files.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         |  24 +++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 175 ++++++++++++++++++++++++++++++-
 2 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index f7acae2..ea8c09b3 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -22,6 +22,30 @@ extern struct list_head rdt_all_groups;
 int __init rdtgroup_init(void);
 
 /**
+ * struct rftype - describe each file in the resctrl file system
+ * @name: file name
+ * @mode: access mode
+ * @kf_ops: operations
+ * @seq_show: show content of the file
+ * @write: write to the file
+ */
+struct rftype {
+	char			*name;
+	umode_t			mode;
+	struct kernfs_ops	*kf_ops;
+
+	int (*seq_show)(struct kernfs_open_file *of,
+			struct seq_file *sf, void *v);
+	/*
+	 * write() is the generic write callback which maps directly to
+	 * kernfs write operation and overrides all other operations.
+	 * Maximum write size is determined by ->max_write_len.
+	 */
+	ssize_t (*write)(struct kernfs_open_file *of,
+			 char *buf, size_t nbytes, loff_t off);
+};
+
+/**
  * struct rdt_resource - attributes of an RDT resource
  * @enabled:			Is this feature enabled on this machine
  * @name:			Name to use in "schemata" file
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index b2bbd04..316fa0c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -23,6 +23,8 @@
 #include <linux/fs.h>
 #include <linux/sysfs.h>
 #include <linux/kernfs.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 
 #include <uapi/linux/magic.h>
@@ -34,6 +36,173 @@ struct kernfs_root *rdt_root;
 struct rdtgroup rdtgroup_default;
 LIST_HEAD(rdt_all_groups);
 
+/* set uid and gid of rdtgroup dirs and files to that of the creator */
+static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+				.ia_uid = current_fsuid(),
+				.ia_gid = current_fsgid(), };
+
+	if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+	    gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+		return 0;
+
+	return kernfs_setattr(kn, &iattr);
+}
+
+static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
+{
+	struct kernfs_node *kn;
+	int ret;
+
+	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
+				  0, rft->kf_ops, rft, NULL, NULL);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret) {
+		kernfs_remove(kn);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int rdtgroup_add_files(struct kernfs_node *kn, struct rftype *rfts)
+{
+	struct rftype *rft;
+	int ret;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	for (rft = rfts; rft->name; rft++) {
+		ret = rdtgroup_add_file(kn, rft);
+		if (ret)
+			goto error;
+	}
+
+	return 0;
+error:
+	pr_warn("%s: failed to add %s, err=%d\n", __func__, rft->name, ret);
+	while (--rft >= rfts)
+		kernfs_remove_by_name(kn, rft->name);
+	return ret;
+}
+
+static int rdtgroup_seqfile_show(struct seq_file *m, void *arg)
+{
+	struct kernfs_open_file *of = m->private;
+	struct rftype *rft = of->kn->priv;
+
+	if (rft->seq_show)
+		return rft->seq_show(of, m, arg);
+	return 0;
+}
+
+static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf,
+				   size_t nbytes, loff_t off)
+{
+	struct rftype *rft = of->kn->priv;
+
+	if (rft->write)
+		return rft->write(of, buf, nbytes, off);
+
+	return -EINVAL;
+}
+
+static struct kernfs_ops rdtgroup_kf_single_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.write			= rdtgroup_file_write,
+	.seq_show		= rdtgroup_seqfile_show,
+};
+
+static int rdt_num_closid_show(struct kernfs_open_file *of,
+			       struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	seq_printf(seq, "%d\n", r->num_closid);
+
+	return 0;
+}
+
+static int rdt_cbm_val_show(struct kernfs_open_file *of,
+			    struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	seq_printf(seq, "%x\n", r->max_cbm);
+
+	return 0;
+}
+
+/* rdtgroup information files for one cache resource. */
+static struct rftype res_info_files[] = {
+	{
+		.name		= "num_closid",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_num_closid_show,
+	},
+	{
+		.name		= "cbm_val",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= rdt_cbm_val_show,
+	},
+	{
+		/* NULL terminated */
+	}
+};
+
+static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn)
+{
+	struct kernfs_node *kn, *kn_subdir;
+	struct rdt_resource *r;
+	int ret;
+
+	/* create the directory */
+	kn = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+	kernfs_get(kn);
+
+	for_each_rdt_resource(r) {
+		kn_subdir = kernfs_create_dir(kn, r->name, kn->mode, r);
+		if (IS_ERR(kn_subdir)) {
+			ret = PTR_ERR(kn_subdir);
+			goto out_destroy;
+		}
+		kernfs_get(kn_subdir);
+		ret = rdtgroup_kn_set_ugid(kn_subdir);
+		if (ret)
+			goto out_destroy;
+		ret = rdtgroup_add_files(kn_subdir, res_info_files);
+		if (ret)
+			goto out_destroy;
+		kernfs_activate(kn_subdir);
+	}
+
+	/*
+	 * This extra ref will be put in kernfs_remove() and guarantees
+	 * that @rdtgrp->kn is always accessible.
+	 */
+	kernfs_get(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
+
+	kernfs_activate(kn);
+
+	return 0;
+
+out_destroy:
+	kernfs_remove(kn);
+	return ret;
+}
+
 static void l3_qos_cfg_update(void *arg)
 {
 	struct rdt_resource *r = arg;
@@ -181,6 +350,8 @@ static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
 
 static int __init rdtgroup_setup_root(void)
 {
+	int ret;
+
 	rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
 				      KERNFS_ROOT_CREATE_DEACTIVATED,
 				      &rdtgroup_default);
@@ -193,7 +364,9 @@ static int __init rdtgroup_setup_root(void)
 	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
 
 	rdtgroup_default.kn = rdt_root->kn;
-	kernfs_activate(rdtgroup_default.kn);
+	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
+	if (!ret)
+		kernfs_activate(rdtgroup_default.kn);
 
 	mutex_unlock(&rdtgroup_mutex);
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (11 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 21:14   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 14/18] x86/intel_rdt: Add cpus file Fenghua Yu
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Resource control groups are represented as directories in the resctrl
file system. The root directory describes the default resources available
to tasks that have not been assigned specific resources. Other directories
can be created at the root level to make new resource groups. It is not
permitted to make directories within other directories.

Hardware uses a CLOSID (Class of service ID) to determine which resource
limits are currently in effect. The exact number available is enumerated
by CPUID leaf 0x10, but on current implementations it is a small number.
We implement a simple bitmask allocator for CLOSIDs.

Each resource control group uses one CLOSID, which limits the total number
of directories that can be created.

Resource groups can be removed using rmdir.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         |   9 ++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 255 +++++++++++++++++++++++++++++++
 2 files changed, 264 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index ea8c09b3..7eb8078 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -9,13 +9,20 @@
  * @kn:				kernfs node
  * @rdtgroup_list:		linked list for all rdtgroups
  * @closid:			closid for this rdtgroup
+ * @flags:			status bits
+ * @waitcount:			how many cpus expect to find this
  */
 struct rdtgroup {
 	struct kernfs_node	*kn;
 	struct list_head	rdtgroup_list;
 	int			closid;
+	int			flags;
+	atomic_t		waitcount;
 };
 
+/* rdtgroup.flags */
+#define	RDT_DELETED		1
+
 /* List of all resource groups */
 extern struct list_head rdt_all_groups;
 
@@ -142,4 +149,6 @@ union cpuid_0x10_1_edx {
 };
 
 void rdt_cbm_update(void *arg);
+struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
+void rdtgroup_kn_unlock(struct kernfs_node *kn);
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 316fa0c..9e0044d 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -26,16 +26,79 @@
 #include <linux/seq_file.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/cpu.h>
 
 #include <uapi/linux/magic.h>
 
 #include <asm/intel_rdt.h>
+#include <asm/intel_rdt_common.h>
 
 DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
 struct kernfs_root *rdt_root;
 struct rdtgroup rdtgroup_default;
 LIST_HEAD(rdt_all_groups);
 
+/*
+ * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
+ * we can keep a bitmap of free CLOSIDs in a single integer.
+ *
+ * Please note: This only supports global CLOSID across multiple
+ * resources and multiple sockets. User can create rdtgroups including root
+ * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
+ * number of caches is big or number of supported resources sharing CLOSID
+ * is growing, it's getting harder to find usable rdtgroups which is limited
+ * by the small number of CLOSIDs.
+ *
+ * In the future, if it's necessary, we can implement more complex CLOSID
+ * allocation per socket/per resource domain and utilize CLOSIDs as many
+ * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
+ * rdtgroups and each rdtgroup has different combination of two L3 CBMs.
+ */
+static int closid_free_map;
+
+static void closid_init(void)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+	int rdt_max_closid;
+
+	/* Enabling L3 CDP halves the number of CLOSIDs */
+	if (r->cdp_enabled)
+		r->num_closid = r->max_closid / 2;
+	else
+		r->num_closid = r->max_closid;
+
+	/* Compute rdt_max_closid across all resources */
+	rdt_max_closid = 0;
+	for_each_rdt_resource(r)
+		rdt_max_closid = max(rdt_max_closid, r->num_closid);
+	if (rdt_max_closid > 32) {
+		pr_warn_once("Only using 32/%d CLOSIDs\n", rdt_max_closid);
+		rdt_max_closid = 32;
+	}
+
+	closid_free_map = BIT_MASK(rdt_max_closid) - 1;
+
+	/* CLOSID 0 is always reserved for the default group */
+	closid_free_map &= ~1;
+}
+
+int closid_alloc(void)
+{
+	int closid = ffs(closid_free_map);
+
+	if (closid == 0)
+		return -ENOSPC;
+	closid--;
+	closid_free_map &= ~(1 << closid);
+
+	return closid;
+}
+
+static void closid_free(int closid)
+{
+	closid_free_map |= 1 << closid;
+}
+
 /* set uid and gid of rdtgroup dirs and files to that of the creator */
 static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
 {
@@ -249,6 +312,54 @@ static int parse_rdtgroupfs_options(char *data)
 	return 0;
 }
 
+/*
+ * We don't allow rdtgroup directories to be created anywhere
+ * except the root directory. Thus when looking for the rdtgroup
+ * structure for a kernfs node we are either looking at a directory,
+ * in which case the rdtgroup structure is pointed at by the "priv"
+ * field, otherwise we have a file, and need only look to the parent
+ * to find the rdtgroup.
+ */
+static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn)
+{
+	if (kernfs_type(kn) == KERNFS_DIR)
+		return kn->priv;
+	else
+		return kn->parent->priv;
+}
+
+struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
+{
+	struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn);
+
+	atomic_inc(&rdtgrp->waitcount);
+	kernfs_break_active_protection(kn);
+
+	mutex_lock(&rdtgroup_mutex);
+
+	/* Was this group deleted while we waited? */
+	if (rdtgrp->flags & RDT_DELETED)
+		return NULL;
+
+	return rdtgrp;
+}
+
+void rdtgroup_kn_unlock(struct kernfs_node *kn)
+{
+	struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn);
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
+	    (rdtgrp->flags & RDT_DELETED)) {
+		kernfs_unbreak_active_protection(kn);
+		kernfs_put(kn);
+		kfree(rdtgrp);
+	} else {
+		kernfs_unbreak_active_protection(kn);
+	}
+}
+
 static struct dentry *rdt_mount(struct file_system_type *fs_type,
 				int flags, const char *unused_dev_name,
 				void *data)
@@ -272,6 +383,8 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 		goto out;
 	}
 
+	closid_init();
+
 	dentry = kernfs_mount(fs_type, flags, rdt_root,
 			      RDTGROUP_SUPER_MAGIC, &new_sb);
 	if (IS_ERR(dentry))
@@ -319,6 +432,39 @@ static void reset_all_cbms(struct rdt_resource *r)
 	put_cpu();
 }
 
+static void rdt_reset_pqr_assoc_closid(void *v)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	state->closid = 0;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+	struct rdtgroup *rdtgrp;
+	struct list_head *l, *next;
+
+	get_cpu();
+	/* Reset PQR_ASSOC MSR on this cpu. */
+	rdt_reset_pqr_assoc_closid(NULL);
+	/* Reset PQR_ASSOC MSR on the rest of cpus. */
+	smp_call_function_many(cpu_online_mask, rdt_reset_pqr_assoc_closid,
+			       NULL, 1);
+	put_cpu();
+	list_for_each_safe(l, next, &rdt_all_groups) {
+		rdtgrp = list_entry(l, struct rdtgroup, rdtgroup_list);
+		if (rdtgrp == &rdtgroup_default)
+			continue;
+		kernfs_remove(rdtgrp->kn);
+		list_del(&rdtgrp->rdtgroup_list);
+		kfree(rdtgrp);
+	}
+}
+
 static void rdt_kill_sb(struct super_block *sb)
 {
 	struct rdt_resource *r;
@@ -334,6 +480,7 @@ static void rdt_kill_sb(struct super_block *sb)
 		set_l3_qos_cfg(r);
 	}
 
+	rmdir_all_sub();
 	static_branch_disable(&rdt_enable_key);
 	kernfs_kill_sb(sb);
 	mutex_unlock(&rdtgroup_mutex);
@@ -345,7 +492,115 @@ static struct file_system_type rdt_fs_type = {
 	.kill_sb = rdt_kill_sb,
 };
 
+static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+			  umode_t mode)
+{
+	struct rdtgroup *parent, *rdtgrp;
+	struct kernfs_node *kn;
+	int ret, closid;
+
+	/* Only allow mkdir in the root directory */
+	if (parent_kn != rdtgroup_default.kn)
+		return -EPERM;
+
+	/* Do not accept '\n' to avoid unparsable situation. */
+	if (strchr(name, '\n'))
+		return -EINVAL;
+
+	parent = rdtgroup_kn_lock_live(parent_kn);
+	if (!parent) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = closid_alloc();
+	if (ret < 0)
+		goto out_unlock;
+	closid = ret;
+
+	/* allocate the rdtgroup. */
+	rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
+	if (!rdtgrp) {
+		ret = -ENOSPC;
+		goto out_closid_free;
+	}
+	rdtgrp->closid = closid;
+	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
+
+	/* kernfs creates the directory for rdtgrp */
+	kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto out_cancel_ref;
+	}
+	rdtgrp->kn = kn;
+
+	/*
+	 * This extra ref will be put in kernfs_remove() and guarantees
+	 * that @rdtgrp->kn is always accessible.
+	 */
+	kernfs_get(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
+
+	kernfs_activate(kn);
+
+	ret = 0;
+	goto out_unlock;
+
+out_destroy:
+	kernfs_remove(rdtgrp->kn);
+out_cancel_ref:
+	kfree(rdtgrp);
+out_closid_free:
+	closid_free(closid);
+out_unlock:
+	rdtgroup_kn_unlock(parent_kn);
+	return ret;
+}
+
+static int rdtgroup_rmdir(struct kernfs_node *kn)
+{
+	struct rdtgroup *rdtgrp;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(kn);
+	if (!rdtgrp) {
+		rdtgroup_kn_unlock(kn);
+		return -ENOENT;
+	}
+
+	/*
+	 * rmdir is for deleting resource groups. Don't
+	 * allow deletion of "info" or any of its subdirectories
+	 */
+	if (!rdtgrp) {
+		mutex_unlock(&rdtgroup_mutex);
+		kernfs_unbreak_active_protection(kn);
+		return -EPERM;
+	}
+
+	rdtgrp->flags = RDT_DELETED;
+	closid_free(rdtgrp->closid);
+	list_del(&rdtgrp->rdtgroup_list);
+
+	/*
+	 * one extra hold on this, will drop when we kfree(rdtgrp)
+	 * in rdtgroup_kn_unlock()
+	 */
+	kernfs_get(kn);
+	kernfs_remove(rdtgrp->kn);
+
+	rdtgroup_kn_unlock(kn);
+
+	return ret;
+}
+
 static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
+	.mkdir	= rdtgroup_mkdir,
+	.rmdir	= rdtgroup_rmdir,
 };
 
 static int __init rdtgroup_setup_root(void)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 14/18] x86/intel_rdt: Add cpus file
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (12 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 13/18] x86/intel_rdt: Add mkdir " Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 21:27   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 15/18] x86/intel_rdt: Add tasks files Fenghua Yu
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Tony Luck <tony.luck@intel.com>

Now we populate each directory with a read/write (mode 0644) file
named "cpus". This is used to over-ride the resources available
to processes in the default resource group when running on specific
CPUs.  Each "cpus" file reads as a cpumask showing which CPUs belong
to this resource group. Initially all online CPUs are assigned to
the default group. They can be added to other groups by writing a
cpumask to the "cpus" file in the directory for the resource group
(which will remove them from the previous group to which they were
assigned). CPU online/offline operations will delete CPUs that go
offline from whatever group they are in and add new CPUs to the
default group.

If there are CPUs assigned to a group when the directory is removed,
they are returned to the default group.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         |   5 ++
 arch/x86/kernel/cpu/intel_rdt.c          |  12 +++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 127 +++++++++++++++++++++++++++++++
 3 files changed, 144 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 7eb8078..6ab31ba 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -9,13 +9,16 @@
  * @kn:				kernfs node
  * @rdtgroup_list:		linked list for all rdtgroups
  * @closid:			closid for this rdtgroup
+ * @cpu_mask:			CPUs assigned to this rdtgroup
  * @flags:			status bits
  * @waitcount:			how many cpus expect to find this
+ *				group when they acquire rdtgroup_mutex
  */
 struct rdtgroup {
 	struct kernfs_node	*kn;
 	struct list_head	rdtgroup_list;
 	int			closid;
+	struct cpumask		cpu_mask;
 	int			flags;
 	atomic_t		waitcount;
 };
@@ -148,6 +151,8 @@ union cpuid_0x10_1_edx {
 	unsigned int full;
 };
 
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+
 void rdt_cbm_update(void *arg);
 struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
 void rdtgroup_kn_unlock(struct kernfs_node *kn);
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 3d4a125..556a426 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -36,6 +36,8 @@
 /* Mutex to protect rdtgroup access. */
 DEFINE_MUTEX(rdtgroup_mutex);
 
+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+
 #define domain_init(name) LIST_HEAD_INIT(rdt_resources_all[name].domains)
 
 struct rdt_resource rdt_resources_all[] = {
@@ -242,8 +244,11 @@ static int intel_rdt_online_cpu(unsigned int cpu)
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 
 	mutex_lock(&rdtgroup_mutex);
+	per_cpu(cpu_closid, cpu) = 0;
 	for_each_rdt_resource(r)
 		update_domain(cpu, r, 1);
+	/* The cpu is set in default rdtgroup after online. */
+	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
 	state->closid = 0;
 	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
 	mutex_unlock(&rdtgroup_mutex);
@@ -254,10 +259,17 @@ static int intel_rdt_online_cpu(unsigned int cpu)
 static int intel_rdt_offline_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
+	struct rdtgroup *rdtgrp;
+	struct list_head *l;
 
 	mutex_lock(&rdtgroup_mutex);
 	for_each_rdt_resource(r)
 		update_domain(cpu, r, 0);
+	list_for_each(l, &rdt_all_groups) {
+		rdtgrp = list_entry(l, struct rdtgroup, rdtgroup_list);
+		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask))
+			break;
+	}
 	mutex_unlock(&rdtgroup_mutex);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 9e0044d..f2d7a3a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -20,6 +20,7 @@
 
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
+#include <linux/cpu.h>
 #include <linux/fs.h>
 #include <linux/sysfs.h>
 #include <linux/kernfs.h>
@@ -180,6 +181,117 @@ static struct kernfs_ops rdtgroup_kf_single_ops = {
 	.seq_show		= rdtgroup_seqfile_show,
 };
 
+static int rdtgroup_cpus_show(struct kernfs_open_file *of,
+			      struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+
+	if (rdtgrp)
+		seq_printf(s, "%*pb\n", cpumask_pr_args(&rdtgrp->cpu_mask));
+	else
+		ret = -ENOENT;
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret;
+}
+
+static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	struct rdtgroup *rdtgrp, *r;
+	cpumask_var_t tmpmask, newmask;
+	int ret, cpu;
+
+	if (!buf)
+		return -EINVAL;
+
+	if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!alloc_cpumask_var(&newmask, GFP_KERNEL)) {
+		free_cpumask_var(tmpmask);
+		return -ENOMEM;
+	}
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	ret = cpumask_parse(buf, newmask);
+	if (ret)
+		goto unlock;
+
+	get_online_cpus();
+	/* check that user didn't specify any offline cpus */
+	cpumask_andnot(tmpmask, newmask, cpu_online_mask);
+	if (cpumask_weight(tmpmask)) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* Are trying to drop some cpus from this group? */
+	cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
+	if (cpumask_weight(tmpmask)) {
+		/* Can't drop from default group */
+		if (rdtgrp == &rdtgroup_default) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* Give any dropped cpus to rdtgroup_default */
+		cpumask_or(&rdtgroup_default.cpu_mask,
+			   &rdtgroup_default.cpu_mask, tmpmask);
+		for_each_cpu(cpu, tmpmask)
+			per_cpu(cpu_closid, cpu) = 0;
+	}
+
+	/*
+	 * If we added cpus, remove them from previous group that owned them
+	 * and update per-cpu rdtgroup pointers to refer to us
+	 */
+	cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
+	if (cpumask_weight(tmpmask)) {
+		struct list_head *l;
+
+		list_for_each(l, &rdt_all_groups) {
+			r = list_entry(l, struct rdtgroup, rdtgroup_list);
+			if (r == rdtgrp)
+				continue;
+			cpumask_andnot(&r->cpu_mask, &r->cpu_mask, tmpmask);
+		}
+		for_each_cpu(cpu, tmpmask)
+			per_cpu(cpu_closid, cpu) = rdtgrp->closid;
+	}
+
+	/* Done pushing/pulling - update this group with new mask */
+	cpumask_copy(&rdtgrp->cpu_mask, newmask);
+
+end:
+	put_online_cpus();
+unlock:
+	rdtgroup_kn_unlock(of->kn);
+	free_cpumask_var(tmpmask);
+	free_cpumask_var(newmask);
+
+	return ret ?: nbytes;
+}
+
+/* Files in each rdtgroup */
+static struct rftype rdtgroup_base_files[] = {
+	{
+		.name		= "cpus",
+		.mode		= 0644,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.write		= rdtgroup_cpus_write,
+		.seq_show	= rdtgroup_cpus_show,
+	},
+	{
+		/* NULL terminated */
+	}
+};
+
 static int rdt_num_closid_show(struct kernfs_open_file *of,
 			       struct seq_file *seq, void *v)
 {
@@ -545,6 +657,10 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	if (ret)
 		goto out_destroy;
 
+	ret = rdtgroup_add_files(kn, rdtgroup_base_files);
+	if (ret)
+		goto out_destroy;
+
 	kernfs_activate(kn);
 
 	ret = 0;
@@ -582,6 +698,10 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 		return -EPERM;
 	}
 
+	/* Give any CPUs back to the default group */
+	cpumask_or(&rdtgroup_default.cpu_mask,
+		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
 	rdtgrp->flags = RDT_DELETED;
 	closid_free(rdtgrp->closid);
 	list_del(&rdtgrp->rdtgroup_list);
@@ -618,11 +738,18 @@ static int __init rdtgroup_setup_root(void)
 	rdtgroup_default.closid = 0;
 	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
 
+	ret = rdtgroup_add_files(rdt_root->kn, rdtgroup_base_files);
+	if (ret) {
+		kernfs_destroy_root(rdt_root);
+		goto out;
+	}
+
 	rdtgroup_default.kn = rdt_root->kn;
 	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
 	if (!ret)
 		kernfs_activate(rdtgroup_default.kn);
 
+out:
 	mutex_unlock(&rdtgroup_mutex);
 
 	return 0;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 15/18] x86/intel_rdt: Add tasks files
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (13 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 14/18] x86/intel_rdt: Add cpus file Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 22:01   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 16/18] x86/intel_rdt: Add schemata file Fenghua Yu
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

The root directory all subdirectories are automatically populated
with a read/write (mode 0644) file named "tasks". When read it will
show all the task IDs assigned to the resource group. Tasks can be
added (one at a time) to a group by writing the task ID to the file.
E.g.

Membership in a resource group is indicated by a new field in the
task_struct "int closid" which holds the CLOSID for each task. The
default resource group uses CLOSID=0 which means that all existing
tasks when the resctrl file system is mounted belong to the default
group.

A resource group cannot be removed while there are tasks assigned
to it.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 173 +++++++++++++++++++++++++++++++
 include/linux/sched.h                    |   3 +
 2 files changed, 176 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f2d7a3a..bdbe2d1 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/task_work.h>
 
 #include <uapi/linux/magic.h>
 
@@ -278,6 +279,152 @@ unlock:
 	return ret ?: nbytes;
 }
 
+struct task_move_callback {
+	struct callback_head work;
+	struct rdtgroup *rdtgrp;
+};
+
+static void move_myself(struct callback_head *head)
+{
+	struct task_move_callback *callback;
+	struct rdtgroup *rdtgrp;
+
+	callback = container_of(head, struct task_move_callback, work);
+	rdtgrp = callback->rdtgrp;
+
+	/* Resource group might have been deleted before process runs */
+	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
+	    (rdtgrp->flags & RDT_DELETED)) {
+		current->closid = 0;
+		kfree(rdtgrp);
+	}
+
+	kfree(callback);
+}
+
+static int __rdtgroup_move_task(struct task_struct *tsk,
+				struct rdtgroup *rdtgrp)
+{
+	struct task_move_callback *callback;
+	int ret;
+
+	callback = kzalloc(sizeof(*callback), GFP_KERNEL);
+	if (!callback)
+		return -ENOMEM;
+	callback->work.func = move_myself;
+	callback->rdtgrp = rdtgrp;
+	atomic_inc(&rdtgrp->waitcount);
+	ret = task_work_add(tsk, &callback->work, true);
+	if (ret) {
+		atomic_dec(&rdtgrp->waitcount);
+		kfree(callback);
+	} else {
+		tsk->closid = rdtgrp->closid;
+	}
+	return ret;
+}
+
+static int rdtgroup_task_write_permission(struct task_struct *task,
+					  struct kernfs_open_file *of)
+{
+	const struct cred *cred = current_cred();
+	const struct cred *tcred = get_task_cred(task);
+	int ret = 0;
+
+	/*
+	 * even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EPERM;
+
+	put_cred(tcred);
+	return ret;
+}
+
+static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
+			      struct kernfs_open_file *of)
+{
+	struct task_struct *tsk;
+	int ret;
+
+	rcu_read_lock();
+	if (pid) {
+		tsk = find_task_by_vpid(pid);
+		if (!tsk) {
+			ret = -ESRCH;
+			goto out_unlock_rcu;
+		}
+	} else {
+		tsk = current;
+	}
+
+	get_task_struct(tsk);
+	rcu_read_unlock();
+
+	ret = rdtgroup_task_write_permission(tsk, of);
+	if (!ret)
+		ret = __rdtgroup_move_task(tsk, rdtgrp);
+
+	put_task_struct(tsk);
+	return ret;
+
+out_unlock_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct rdtgroup *rdtgrp;
+	pid_t pid;
+	int ret = 0;
+
+	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+		return -EINVAL;
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+
+	if (rdtgrp)
+		ret = rdtgroup_move_task(pid, rdtgrp, of);
+	else
+		ret = -ENOENT;
+
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
+static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
+{
+	struct task_struct *p;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (p->closid == r->closid)
+			seq_printf(s, "%d\n", p->pid);
+	}
+	rcu_read_unlock();
+}
+
+static int rdtgroup_tasks_show(struct kernfs_open_file *of,
+			       struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (rdtgrp)
+		show_rdt_tasks(rdtgrp, s);
+	else
+		ret = -ENOENT;
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret;
+}
+
 /* Files in each rdtgroup */
 static struct rftype rdtgroup_base_files[] = {
 	{
@@ -288,6 +435,13 @@ static struct rftype rdtgroup_base_files[] = {
 		.seq_show	= rdtgroup_cpus_show,
 	},
 	{
+		.name		= "tasks",
+		.mode		= 0644,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.write		= rdtgroup_tasks_write,
+		.seq_show	= rdtgroup_tasks_show,
+	},
+	{
 		/* NULL terminated */
 	}
 };
@@ -559,6 +713,13 @@ static void rmdir_all_sub(void)
 {
 	struct rdtgroup *rdtgrp;
 	struct list_head *l, *next;
+	struct task_struct *p;
+
+	/* move all tasks to default resource group */
+	read_lock(&tasklist_lock);
+	for_each_process(p)
+		p->closid = 0;
+	read_unlock(&tasklist_lock);
 
 	get_cpu();
 	/* Reset PQR_ASSOC MSR on this cpu. */
@@ -681,6 +842,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 {
 	struct rdtgroup *rdtgrp;
 	int ret = 0;
+	struct task_struct *p;
 
 	rdtgrp = rdtgroup_kn_lock_live(kn);
 	if (!rdtgrp) {
@@ -698,6 +860,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 		return -EPERM;
 	}
 
+	/* Don't allow if there are processes in this group */
+	read_lock(&tasklist_lock);
+	for_each_process(p) {
+		if (p->closid == rdtgrp->closid) {
+			read_unlock(&tasklist_lock);
+			rdtgroup_kn_unlock(kn);
+			return -EBUSY;
+		}
+	}
+	read_unlock(&tasklist_lock);
+
 	/* Give any CPUs back to the default group */
 	cpumask_or(&rdtgroup_default.cpu_mask,
 		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..8a05c46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1766,6 +1766,9 @@ struct task_struct {
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
 #endif
+#ifdef CONFIG_INTEL_RDT_A
+	int closid;
+#endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
 #ifdef CONFIG_COMPAT
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 16/18] x86/intel_rdt: Add schemata file
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (14 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 15/18] x86/intel_rdt: Add tasks files Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-17 22:35   ` Thomas Gleixner
  2016-10-15  2:12 ` [PATCH v4 17/18] x86/intel_rdt: Add scheduler hook Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 18/18] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu
  17 siblings, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Tony Luck <tony.luck@intel.com>

Last of the per resource group files. Also mode 0644. This one shows
the resources available to the group. Syntax depends on whether the
"cdp" mount option was given. With code/data prioritization disabled
it is simply a list of masks for each cache domain. Initial value
allows access to all of the L3 cache on all domains. E.g. on a 2 socket
Broadwell:
        L3:0=fffff;1=fffff
With CDP enabled, separate masks for data and instructions are provided:
        L3:0=fffff,fffff;1=fffff,fffff

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         |   6 +
 arch/x86/kernel/cpu/Makefile             |   2 +-
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |   7 +
 arch/x86/kernel/cpu/intel_rdt_schemata.c | 266 +++++++++++++++++++++++++++++++
 4 files changed, 280 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 6ab31ba..8435ec6 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -70,6 +70,7 @@ struct rftype {
  * @cdp_capable:		Code/Data Prioritization available
  * @cdp_enabled:		Code/Data Prioritization enabled
  * @tmp_cbms:			Scratch space when updating schemata
+ * @num_cbms:			Number of CBMs in tmp_cbms
  * @cache_level:		Which cache level defines scope of this domain
  */
 struct rdt_resource {
@@ -86,6 +87,7 @@ struct rdt_resource {
 	bool			cdp_capable;
 	bool			cdp_enabled;
 	u32			*tmp_cbms;
+	int			num_cbms;
 	int			cache_level;
 };
 
@@ -156,4 +158,8 @@ DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
 void rdt_cbm_update(void *arg);
 struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
 void rdtgroup_kn_unlock(struct kernfs_node *kn);
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off);
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+			   struct seq_file *s, void *v);
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index b4334e8..c9f8c81 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_rdtgroup.o
+obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index bdbe2d1..6787fd8 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -442,6 +442,13 @@ static struct rftype rdtgroup_base_files[] = {
 		.seq_show	= rdtgroup_tasks_show,
 	},
 	{
+		.name		= "schemata",
+		.mode		= 0644,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.write		= rdtgroup_schemata_write,
+		.seq_show	= rdtgroup_schemata_show,
+	},
+	{
 		/* NULL terminated */
 	}
 };
diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
new file mode 100644
index 0000000..ead4f43
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -0,0 +1,266 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ *    Fenghua Yu <fenghua.yu@intel.com>
+ *    Tony Luck <tony.luck@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernfs.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * Check whether a cache bit mask is valid. The SDM says:
+ *	Please note that all (and only) contiguous '1' combinations
+ *	are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.).
+ * Additionally Haswell requires at least two bits set.
+ */
+static bool cbm_validate(unsigned long var, struct rdt_resource *r)
+{
+	unsigned long first_bit, zero_bit;
+
+	if (var == 0 || var > r->max_cbm)
+		return false;
+
+	first_bit = find_first_bit(&var, r->cbm_len);
+	zero_bit = find_next_zero_bit(&var, r->cbm_len, first_bit);
+
+	if (find_next_bit(&var, r->cbm_len, zero_bit) < r->cbm_len)
+		return false;
+
+	if ((zero_bit - first_bit) < r->min_cbm_bits)
+		return false;
+	return true;
+}
+
+/*
+ * Read one cache bit mask (hex). Check that it is valid for the current
+ * resource type.
+ */
+static int parse_cbm_token(char *tok, struct rdt_resource *r)
+{
+	unsigned long data;
+	int ret;
+
+	ret = kstrtoul(tok, 16, &data);
+	if (ret)
+		return ret;
+	if (!cbm_validate(data, r))
+		return -EINVAL;
+	r->tmp_cbms[r->num_cbms++] = data;
+	return 0;
+}
+
+/*
+ * If code/data prioritization is enabled for this resource we need
+ * two bit masks separated by a ",". Otherwise a single bit mask.
+ */
+static int parse_cbm(char *buf, struct rdt_resource *r)
+{
+	char *cbm1 = buf;
+	int ret;
+
+	if (r->cdp_enabled)
+		cbm1 = strsep(&buf, ",");
+	if (!cbm1 || !buf)
+		return 1;
+	ret = parse_cbm_token(cbm1, r);
+	if (ret)
+		return ret;
+	if (r->cdp_enabled)
+		return parse_cbm_token(buf, r);
+	return 0;
+}
+
+/*
+ * For each domain in this resource we expect to find a series of:
+ *	id=mask[,mask]
+ * separated by ";". The "id" is in decimal, and must appear in the
+ * right order.
+ */
+static int parse_line(char *line, struct rdt_resource *r)
+{
+	struct list_head *l;
+	struct rdt_domain *d;
+	char *dom = NULL, *id;
+	unsigned long dom_id;
+
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		dom = strsep(&line, ";");
+		if (!dom)
+			return -EINVAL;
+		id = strsep(&dom, "=");
+		if (kstrtoul(id, 10, &dom_id) || dom_id != d->id)
+			return -EINVAL;
+		if (parse_cbm(dom, r))
+			return -EINVAL;
+	}
+
+	/* Any garbage at the end of the line? */
+	if (line && line[0])
+		return -EINVAL;
+	return 0;
+}
+
+static void update_domains(struct rdt_resource *r, int closid)
+{
+	int cpu, idx = 0;
+	struct list_head *l;
+	struct rdt_domain *d;
+	struct msr_param msr_param;
+	struct cpumask cpu_mask;
+
+	cpumask_clear(&cpu_mask);
+	msr_param.low = closid << r->cdp_enabled;
+	msr_param.high = msr_param.low + 1 + r->cdp_enabled;
+	msr_param.res = r;
+
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, list);
+		cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);
+		d->cbm[msr_param.low] = r->tmp_cbms[idx++];
+		if (r->cdp_enabled)
+			d->cbm[msr_param.low + 1] = r->tmp_cbms[idx++];
+	}
+	cpu = get_cpu();
+	/* Update CBM on this cpu if it's in cpu_mask. */
+	if (cpumask_test_cpu(cpu, &cpu_mask))
+		rdt_cbm_update(&msr_param);
+	/* Update CBM on other cpus. */
+	smp_call_function_many(&cpu_mask, rdt_cbm_update, &msr_param, 1);
+	put_cpu();
+}
+
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	char *tok, *resname;
+	struct rdtgroup *rdtgrp;
+	struct rdt_resource *r;
+	int closid, ret = 0;
+	u32 *l3_cbms = NULL;
+
+	/* Legal input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+	buf[nbytes - 1] = '\0';
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		rdtgroup_kn_unlock(of->kn);
+		return -ENOENT;
+	}
+
+	closid = rdtgrp->closid;
+
+	/* get scratch space to save all the masks while we validate input */
+	for_each_rdt_resource(r) {
+		r->tmp_cbms = kcalloc(r->num_domains << r->cdp_enabled,
+				      sizeof(*l3_cbms), GFP_KERNEL);
+		if (!r->tmp_cbms) {
+			ret = -ENOMEM;
+			goto fail;
+		}
+		r->num_cbms = 0;
+	}
+
+	while ((tok = strsep(&buf, "\n")) != NULL) {
+		resname = strsep(&tok, ":");
+		if (!tok) {
+			ret = -EINVAL;
+			goto fail;
+		}
+		for_each_rdt_resource(r) {
+			if (!strcmp(resname, r->name) &&
+			    closid < r->num_closid) {
+				ret = parse_line(tok, r);
+				if (ret)
+					goto fail;
+				break;
+			}
+		}
+		if (!r->name) {
+			ret = -EINVAL;
+			goto fail;
+		}
+	}
+
+	/* Did the parser find all the masks we need? */
+	for_each_rdt_resource(r) {
+		if (r->num_cbms != r->num_domains << r->cdp_enabled) {
+			ret = -EINVAL;
+			goto fail;
+		}
+	}
+
+	for_each_rdt_resource(r)
+		update_domains(r, closid);
+
+fail:
+	rdtgroup_kn_unlock(of->kn);
+	for_each_rdt_resource(r) {
+		kfree(r->tmp_cbms);
+		r->tmp_cbms = NULL;
+	}
+	return ret ?: nbytes;
+}
+
+static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
+{
+	struct list_head *l;
+	struct rdt_domain *dom;
+	int idx = closid << r->cdp_enabled;
+	bool sep = false;
+
+	seq_printf(s, "%s:", r->name);
+	list_for_each(l, &r->domains) {
+		dom = list_entry(l, struct rdt_domain, list);
+		if (sep)
+			seq_puts(s, ";");
+		seq_printf(s, "%d=%x", dom->id, dom->cbm[idx]);
+		if (r->cdp_enabled)
+			seq_printf(s, ",%x", dom->cbm[idx + 1]);
+		sep = true;
+	}
+	seq_puts(s, "\n");
+}
+
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+			   struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+	struct rdt_resource *r;
+	int closid, ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (rdtgrp) {
+		closid = rdtgrp->closid;
+		for_each_rdt_resource(r)
+			if (closid < r->num_closid)
+				show_doms(s, r, closid);
+	} else {
+		ret = -ENOENT;
+	}
+	rdtgroup_kn_unlock(of->kn);
+	return ret;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 17/18] x86/intel_rdt: Add scheduler hook
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (15 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 16/18] x86/intel_rdt: Add schemata file Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  2016-10-15  2:12 ` [PATCH v4 18/18] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu
  17 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Hook the x86 scheduler code to update closid based on whether the current
task is assigned to a specific closid or running on a CPU assigned to a
specific closid.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h         | 41 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c          |  1 -
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  3 +++
 arch/x86/kernel/process_32.c             |  4 ++++
 arch/x86/kernel/process_64.c             |  4 ++++
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 8435ec6..863e838 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
 #define IA32_L3_CBM_BASE	0xc90
 #define IA32_L2_CBM_BASE	0xd10
 
+#ifdef CONFIG_INTEL_RDT_A
+#include <asm/intel_rdt_common.h>
+
 /**
  * struct rdtgroup - store rdtgroup's data in resctrl file system.
  * @kn:				kernfs node
@@ -162,4 +165,42 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ *   which supports resource control and we enable by mounting the
+ *   resctrl file system.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ *   when a task with a different CLOSid is scheduled in.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+	if (static_branch_likely(&rdt_enable_key)) {
+		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+		int closid;
+
+		/*
+		 * If this task has a closid assigned, use it.
+		 * Else use the closid assigned to this cpu.
+		 */
+		closid = current->closid;
+		if (closid == 0)
+			closid = this_cpu_read(cpu_closid);
+
+		if (closid != state->closid) {
+			state->closid = closid;
+			wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
+		}
+	}
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
+#endif /* CONFIG_INTEL_RDT_A */
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 556a426..933bcb4 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -29,7 +29,6 @@
 #include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
 
-#include <asm/intel_rdt_common.h>
 #include <asm/intel-family.h>
 #include <asm/intel_rdt.h>
 
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 6787fd8..62883af 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -299,6 +299,9 @@ static void move_myself(struct callback_head *head)
 		kfree(rdtgrp);
 	}
 
+	/* update PQR_ASSOC MSR to make resource group go into effect */
+	intel_rdt_sched_in();
+
 	kfree(callback);
 }
 
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index d86be29..5495527 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -54,6 +54,7 @@
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
+#include <asm/intel_rdt.h>
 
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 asmlinkage void ret_from_kernel_thread(void) __asm__("ret_from_kernel_thread");
@@ -314,5 +315,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	this_cpu_write(current_task, next_p);
 
+	/* Load the Intel cache allocation PQR MSR. */
+	intel_rdt_sched_in();
+
 	return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 63236d8..cdea7e3 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
+#include <asm/intel_rdt.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -472,6 +473,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			loadsegment(ss, __KERNEL_DS);
 	}
 
+	/* Load the Intel cache allocation PQR MSR. */
+	intel_rdt_sched_in();
+
 	return prev_p;
 }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 18/18] MAINTAINERS: Add maintainer for Intel RDT resource allocation
  2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
                   ` (16 preceding siblings ...)
  2016-10-15  2:12 ` [PATCH v4 17/18] x86/intel_rdt: Add scheduler hook Fenghua Yu
@ 2016-10-15  2:12 ` Fenghua Yu
  17 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-15  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

We create five new files for Intel RDT resource allocation:
arch/x86/kernel/cpu/intel_rdt.c
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
arch/x86/kernel/cpu/intel_rdt_schemata.c
arch/x86/include/asm/intel_rdt.h
Documentation/x86/intel_rdt_ui.txt

Fenghua Yu will maintain this code.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f593300..bcb75e9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9845,6 +9845,14 @@ L:	linux-rdma@vger.kernel.org
 S:	Supported
 F:	drivers/infiniband/sw/rdmavt
 
+RDT - RESOURCE ALLOCATION
+M:	Fenghua Yu <fenghua.yu@intel.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	arch/x86/kernel/cpu/intel_rdt*
+F:	arch/x86/include/asm/intel_rdt*
+F:	Documentation/x86/intel_rdt*
+
 READ-COPY UPDATE (RCU)
 M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
 M:	Josh Triplett <josh@joshtriplett.org>
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id
  2016-10-15  2:12 ` [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id Fenghua Yu
@ 2016-10-17 10:31   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 10:31 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> Add an ABI document entry for /sys/devices/system/cpu/cpu*/cache/index*/id.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

The SOB chain here is bogus ....

> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 4987417..b1c3d69 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -272,6 +272,22 @@ Description:	Parameters for the CPU cache attributes
>  				     the modified cache line is written to main
>  				     memory only when it is replaced
>  
> +
> +What:		/sys/devices/system/cpu/cpu*/cache/index*/id
> +Date:		September 2016
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:	Cache id
> +
> +		The id provides a unique name for a specific instance of

s/name/number/

> +		a cache of a particular type. E.g. there may be a level
> +		3 unified cache on each socket in a server and we may
> +		assign them ids 0, 1, 2, ...
> +
> +		Note that id value may not be contiguous. E.g. level 1

Note, that the id values can be non-contiguous.

> +		caches typically exist per core, but there may not be a
> +		power of two cores on a socket, so these caches may be
> +		numbered 0, 1, 2, 3, 4, 5, 8, 9, 10, ...
> +

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 02/18] cacheinfo: Introduce cache id
  2016-10-15  2:12 ` [PATCH v4 02/18] cacheinfo: Introduce " Fenghua Yu
@ 2016-10-17 10:32   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 10:32 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> Cache management software needs a name for each instance of a cache of
> a particular type.

s/name/id/

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86
  2016-10-15  2:12 ` [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
@ 2016-10-17 10:48   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 10:48 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:

> Subject: x86, intel_cacheinfo: Enable cache id in x86

That should be:

  x86/intel_cacheinfo: Enable cache id in cache info

> Cache id is retrieved from APIC ID and CPUID leaf 4 on x86.
> 
> For more details see the section on "Cache ID Extraction Parameters" in
> "Intel 64 Architecture Processor Topology Enumeration" at
> https://software.intel.com/sites/default/files/63/1a/Kuo_CpuTopology_rc1.rh1.final.pdf

That link is going to be stale before this hits Linus tree.

> Also "Intel 64 and IA-32 Architectures Software Developer's Manual" volume 2,
> table 3-18 "information Returned by CPUID Instruction" at

   Table 3-18 FADD/FADDP/FIADD Results ...

The correct one is:

   Table 3-8 Information Returned by CPUID Instruction ...

> http://www.intel.com/sdm

So can we please just say:

  For more details please see the documentation of the CPUID instruction in
  the "Intel 64 and IA-32 Architectures Software Developer's Manual".

and be done with it?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization
  2016-10-15  2:12 ` [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization Fenghua Yu
@ 2016-10-17 10:57   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 10:57 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +config INTEL_RDT_A
> +	bool "Intel Resource Director Technology Allocation support"
> +	default n
> +	depends on X86 && CPU_SUP_INTEL
> +	help
> +	  Select to enable resource allocation which is a sub-feature of
> +	  Intel Resource Director Technology(RDT). More information about
> +	  RDT can be found in the Intel x86 Architecture Software
> +	  Developer Manual June 2016, volume 3, section 17.17.

s/June.*//

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery
  2016-10-15  2:12 ` [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery Fenghua Yu
@ 2016-10-17 11:03   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 11:03 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +static inline bool cache_alloc_hsw_probe(void)
> +{
> +	u32 l, h;
> +	u32 max_cbm = BIT_MASK(20) - 1;

Two options here:

+	u32 l, h, max_cbm = BIT_MASK(20) - 1;

or

+	u32 max_cbm = BIT_MASK(20) - 1;
+	u32 l, h;

I personally prefer #1, but I can accept #2 as well. Both are quick to
parse while the one you chose is stopping the reading flow.

> +
> +	if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0))
> +		return false;
> +	rdmsr(IA32_L3_CBM_BASE, l, h);
> +	if (l != max_cbm)
> +		return false;
> +
> +	return true;

  	return l == max_cbm;

Hmm?

> +}
> +
>  static inline bool get_rdt_resources(void)
>  {
>  	bool ret = false;
>  
> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> +	    boot_cpu_data.x86 == 6 &&
> +	    boot_cpu_data.x86_model == INTEL_FAM6_HASWELL_X)
> +		return cache_alloc_hsw_probe();

Can you please stick that model check into the probe function and do:

    	if (cache_alloc_hsw_probe())
		return true;
> +
>  	if (!boot_cpu_has(X86_FEATURE_RDT_A))
>  		return false;
>  	if (boot_cpu_has(X86_FEATURE_CAT_L3))

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-15  2:12 ` [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID Fenghua Yu
@ 2016-10-17 13:45   ` Thomas Gleixner
  2016-10-17 18:06     ` Fenghua Yu
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 13:45 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +/**
> + * struct rdt_resource - attributes of an RDT resource
> + * @enabled:			Is this feature enabled on this machine
> + * @name:			Name to use in "schemata" file
> + * @max_closid:			Maximum number of CLOSIDs supported
> + * @num_closid:			Current number of CLOSIDs available
> + * @max_cbm:			Largest Cache Bit Mask allowed
> + * @min_cbm_bits:		Minimum number of bits to be set in a cache

That should be 'number of consecutive bits', right?

> + *				bit mask
> + * @domains:			All domains for this resource
> + * @num_domains:		Number of domains active
> + * @msr_base:			Base MSR address for CBMs
> + * @cdp_capable:		Code/Data Prioritization available
> + * @cdp_enabled:		Code/Data Prioritization enabled

I wonder whether this is the proper abstraction level. We might as well do
the following:

rdtresources[] = {
     {
	.name	= "L3",
     },
     {
	.name	= "L3Data",
     },
     {
	.name	= "L3Code",
     },

and enable either L3 or L3Data+L3Code. Not sure if that makes things
simpler, but it's definitely worth a thought or two.

> +#define for_each_rdt_resource(r)	\
> +	for (r = rdt_resources_all; r->name; r++) \
> +		if (r->enabled)

So the resource array must be NULL terminated, right? You might as well use

   r < rdt_resources_all + ARRAY_SIZE(rdt_resources_all)

as the loop condition. So you avoid the NULL termination.

> +
> +#define IA32_L3_CBM_BASE	0xc90
> +extern struct rdt_resource rdt_resources_all[];

Please visually split this. CBM_BASE has nothing to do with the resource
array.

> +#define domain_init(name) LIST_HEAD_INIT(rdt_resources_all[name].domains)

name is really misleading here. Please use id and make this an inline
function.

> +struct rdt_resource rdt_resources_all[] = {

>  static inline bool get_rdt_resources(void)
>  {
> +	struct rdt_resource *r;
>  	bool ret = false;
>  
>  	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> @@ -74,20 +105,53 @@ static inline bool get_rdt_resources(void)
>  
>  	if (!boot_cpu_has(X86_FEATURE_RDT_A))
>  		return false;
> -	if (boot_cpu_has(X86_FEATURE_CAT_L3))
> +	if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
> +		union cpuid_0x10_1_eax eax;
> +		union cpuid_0x10_1_edx edx;
> +		u32 ebx, ecx;
> +
> +		r = &rdt_resources_all[RDT_RESOURCE_L3];
> +		cpuid_count(0x00000010, 1, &eax.full, &ebx, &ecx, &edx.full);
> +		r->max_closid = edx.split.cos_max + 1;
> +		r->num_closid = r->max_closid;
> +		r->cbm_len = eax.split.cbm_len + 1;
> +		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> +		if (boot_cpu_has(X86_FEATURE_CDP_L3))
> +			r->cdp_capable = true;
> +		r->enabled = true;
> +
>  		ret = true;
> +	}
> +	if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
> +		union cpuid_0x10_1_eax eax;
> +		union cpuid_0x10_1_edx edx;
> +		u32 ebx, ecx;
> +
> +		/* CPUID 0x10.2 fields are same format at 0x10.1 */
> +		r = &rdt_resources_all[RDT_RESOURCE_L2];
> +		cpuid_count(0x00000010, 2, &eax.full, &ebx, &ecx, &edx.full);
> +		r->max_closid = edx.split.cos_max + 1;
> +		r->num_closid = r->max_closid;
> +		r->cbm_len = eax.split.cbm_len + 1;
> +		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> +		r->enabled = true;

Copy and paste is a wonderful thing, right?

static void rdt_get_config(int idx, struct rdt_resource *r)
{
	union cpuid_0x10_1_eax eax;
	union cpuid_0x10_1_edx edx;
	u32 ebx, ecx;

	cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx, &edx.full);
	r->max_closid = edx.split.cos_max + 1;
	r->num_closid = r->max_closid;
	r->cbm_len = eax.split.cbm_len + 1;
	r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
	r->enabled = true;
}

and and the call site:

	if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
		rdt_get_config(1, &rdt_resources_all[RDT_RESOURCE_L3]);
		if (boot_cpu_has(X86_FEATURE_CDP_L3))
			r->cdp_capable = true;
		ret = true;
	}

	if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
		rdt_get_config(2, &rdt_resources_all[RDT_RESOURCE_L2]);
		ret = true;
	}

Hmm?

> +	for_each_rdt_resource(r)
> +		pr_info("Intel RDT %s allocation %s detected\n", r->name,
> +			r->cdp_capable ? " (with CDP)" : "");

This want's curly braces around the loop.
  
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology
  2016-10-15  2:12 ` [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology Fenghua Yu
@ 2016-10-17 14:44   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 14:44 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
>  
> +static int get_cache_id(int cpu, int level)
> +{
> +	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
> +	int i;
> +
> +	for (i = 0; i < ci->num_leaves; i++)
> +		if (ci->info_list[i].level == level)
> +			return ci->info_list[i].id;

Multi line statements need curly braces.

> +	return -1;
> +}
> +
> +void rdt_cbm_update(void *arg)
> +{
> +	struct msr_param *m = (struct msr_param *)arg;
> +	struct rdt_resource *r = m->res;
> +	struct rdt_domain *d;
> +	struct list_head *l;
> +	int i, cpu = smp_processor_id();
> +
> +	list_for_each(l, &r->domains) {
> +		d = list_entry(l, struct rdt_domain, list);

We have list_for_each_entry() ....

> +		if (cpumask_test_cpu(cpu, &d->cpu_mask))
> +			goto found;
> +	}
> +	pr_info_once("cpu %d not found in any domain for resource %s\n",
> +		     cpu, r->name);

If the list is empty, then 'd' is not initialized and dereferenced. If the
list is not empty then you use the last domain in the list. Neither one is
the right thing to do....

> +
> +found:
> +	for (i = m->low; i < m->high; i++)
> +		wrmsrl(r->msr_base + i, d->cbm[i]);
> +}
> +
> +static void update_domain(int cpu, struct rdt_resource *r, int add)
> +{
> +	struct list_head *l;
> +	struct rdt_domain *d;
> +	int i, cache_id;

+	struct rdt_domain *d;
+	struct list_head *l;
+	int i, cache_id;

Can you see how this is simpler to read?

> +
> +	cache_id = get_cache_id(cpu, r->cache_level);

> +	if (cache_id == -1) {
> +		pr_info_once("Could't find cache id for cpu %d\n", cpu);
> +		return;
> +	}

Please seperate this by a empty line.

> +	list_for_each(l, &r->domains) {
> +		d = list_entry(l, struct rdt_domain, list);

list_for_each_entry() once more.

> +		if (cache_id == d->id)
> +			goto found;
> +		if (cache_id < d->id)
> +			break;
> +	}
> +	if (!add) {
> +		pr_info_once("removed unknown cpu %d\n", cpu);

_once? Why should this happen at all?

> +		return;
> +	}
> +	d = kzalloc(sizeof(*d), GFP_KERNEL);

Shouldn't this be kzalloc_node() ?

> +	if (!d)
> +		return;
> +
> +	d->id = cache_id;
> +	d->cbm = kmalloc_array(r->max_closid, sizeof(*d->cbm), GFP_KERNEL);
> +	if (!d->cbm) {
> +		pr_info("Failed to alloc CBM array for cpu %d\n", cpu);
> +		kfree(d);
> +		return;
> +	}
> +	cpumask_set_cpu(cpu, &d->cpu_mask);
> +	for (i = 0; i < r->max_closid; i++) {
> +		d->cbm[i] = r->max_cbm;
> +		wrmsrl(r->msr_base + i, d->cbm[i]);
> +	}
> +	list_add_tail(&d->list, l);
> +	r->num_domains++;
> +	return;
> +
> +found:
> +	if (add) {
> +		cpumask_set_cpu(cpu, &d->cpu_mask);

Gah. Reusing this function for add and del is just a silly optimization
which makes the code harder to read.

All you share is the find thing. So you can split that out into a helper:

static struct rdt_domain *rdt_find_domain(struct rdt_resource *r,
					  unsigned int cpu, int id)
{
	if (id < 0)
		return ERR_PTR(id);

	list_for_each_entry(d, &r->domains, list) {
		if (d->id == cache_id)
			return d;
	}
	return NULL;
}

and use it for seperate add/del functions.

static void domain_add_cpu(unsigned int cpu, struct rdt_resource *r)
{
	int id = get_cache_id(cpu, r->cache_level);
	struct rdt_domain *d;

	d = rdt_find_domain(r, cpu, id);
	if (IS_ERR(d)) {
		pr_warn("Print some useful information\n", r->cache_level, cpu, ....);
		return;
	}

	if (d) {
		cpumask_set_cpu(cpu, &d->cpu_mask);
		return;
	}

   	/* Do the allocaction stuff */
}

static void domain_remove_cpu(unsigned int cpu, struct rdt_resource *r)
{
	int id = get_cache_id(cpu, r->cache_level);
	struct rdt_domain *d;

	d = rdt_find_domain(r, cpu, id);
	if (IS_ERR_OR_NULL(d)) {
		pr_warn("Print some useful information\n", r->cache_level, cpu, ...);
		return;
	}

	cpumask_clear_cpu(cpu, &d->cpu_mask);
	....	
}

Hmm?

>  static int __init intel_rdt_late_init(void)
>  {
>  	struct rdt_resource *r;
> +	int state;
>  
>  	if (!get_rdt_resources())
>  		return -ENODEV;
>  
> +	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +				"AP_INTEL_RDT_ONLINE",

Please use: "x86/rdt/cat:online:" or similar. We messed that up in a few
places when adding the names, but we don't want to add more of this.

> +				intel_rdt_online_cpu, intel_rdt_offline_cpu);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 18:06     ` Fenghua Yu
@ 2016-10-17 16:35       ` Luck, Tony
  2016-10-17 16:43         ` Yu, Fenghua
  2016-10-17 16:54         ` Thomas Gleixner
  2016-10-17 16:53       ` Thomas Gleixner
  2016-10-17 17:02       ` Thomas Gleixner
  2 siblings, 2 replies; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 16:35 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

> > I wonder whether this is the proper abstraction level. We might as well do
> > the following:
> > 
> > rdtresources[] = {
> >      {
> > 	.name	= "L3",
> >      },
> >      {
> > 	.name	= "L3Data",
> >      },
> >      {
> > 	.name	= "L3Code",
> >      },
> > 
> > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > simpler, but it's definitely worth a thought or two.
> 
> This way will be better than having cdp_enabled/capable for L3 and not
> for L2.  And this doesn't change current userinterface design either,
> I think.

User interface would change if you did this. The schemata file would
look like this with CDP enabled:

# cat schemata
L3Data:0=fffff;1=fffff;2=fffff;3=fffff
L3Code:0=fffff;1=fffff;2=fffff;3=fffff

but that is easier to read than the current:

# cat schemata
L3:0=fffff,fffff;1=fffff,fffff;2=fffff,fffff;3=fffff,fffff

which gives you no clue on which mask is code and which is data.

We'd also end up with "info/L3Data/" and "info/L3code/"
which would be a little redundant (since the files in each
would contain the same numbers), but perhaps that is worth
it to get the better schemata file.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 16:35       ` Luck, Tony
@ 2016-10-17 16:43         ` Yu, Fenghua
  2016-10-17 20:20           ` Luck, Tony
  2016-10-17 16:54         ` Thomas Gleixner
  1 sibling, 1 reply; 52+ messages in thread
From: Yu, Fenghua @ 2016-10-17 16:43 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

> > > I wonder whether this is the proper abstraction level. We might as
> > > well do the following:
> > >
> > > rdtresources[] = {
> > >      {
> > > 	.name	= "L3",
> > >      },
> > >      {
> > > 	.name	= "L3Data",
> > >      },
> > >      {
> > > 	.name	= "L3Code",
> > >      },
> > >
> > > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > > simpler, but it's definitely worth a thought or two.
> >
> > This way will be better than having cdp_enabled/capable for L3 and not
> > for L2.  And this doesn't change current userinterface design either,
> > I think.
> 
> User interface would change if you did this. The schemata file would look like
> this with CDP enabled:
> 
> # cat schemata
> L3Data:0=fffff;1=fffff;2=fffff;3=fffff
> L3Code:0=fffff;1=fffff;2=fffff;3=fffff
> 
> but that is easier to read than the current:
> 
> # cat schemata
> L3:0=fffff,fffff;1=fffff,fffff;2=fffff,fffff;3=fffff,fffff
> 
> which gives you no clue on which mask is code and which is data.

Right.

Also changing to uniform format <resname>:<id1>=cbm1;<id2>=cbm2;...
is lot easier to parse schemata line in CDP mode.

So I'll change the code and doc to have two new resources: L3Data and L3Code for CDP mode.

> 
> We'd also end up with "info/L3Data/" and "info/L3code/"
> which would be a little redundant (since the files in each would contain the
> same numbers), but perhaps that is worth it to get the better schemata file.
> 
> -Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 18:06     ` Fenghua Yu
  2016-10-17 16:35       ` Luck, Tony
@ 2016-10-17 16:53       ` Thomas Gleixner
  2016-10-17 17:02       ` Thomas Gleixner
  2 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 16:53 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Fenghua Yu wrote:
> On Mon, Oct 17, 2016 at 03:45:32PM +0200, Thomas Gleixner wrote:
> > On Fri, 14 Oct 2016, Fenghua Yu wrote:
> > > +/**
> > > + * struct rdt_resource - attributes of an RDT resource
> > > + * @enabled:			Is this feature enabled on this machine
> > > + * @name:			Name to use in "schemata" file
> > > + * @max_closid:			Maximum number of CLOSIDs supported
> > > + * @num_closid:			Current number of CLOSIDs available
> > > + * @max_cbm:			Largest Cache Bit Mask allowed
> > > + * @min_cbm_bits:		Minimum number of bits to be set in a cache
> > 
> > That should be 'number of consecutive bits', right?
> 
> Change to "Minimum number of consecutive bits to be set in a cache", is
> that ok?

Yes.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 16:35       ` Luck, Tony
  2016-10-17 16:43         ` Yu, Fenghua
@ 2016-10-17 16:54         ` Thomas Gleixner
  1 sibling, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 16:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Luck, Tony wrote:
> > > I wonder whether this is the proper abstraction level. We might as well do
> > > the following:
> > > 
> > > rdtresources[] = {
> > >      {
> > > 	.name	= "L3",
> > >      },
> > >      {
> > > 	.name	= "L3Data",
> > >      },
> > >      {
> > > 	.name	= "L3Code",
> > >      },
> > > 
> > > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > > simpler, but it's definitely worth a thought or two.
> > 
> > This way will be better than having cdp_enabled/capable for L3 and not
> > for L2.  And this doesn't change current userinterface design either,
> > I think.
> 
> User interface would change if you did this. The schemata file would
> look like this with CDP enabled:
> 
> # cat schemata
> L3Data:0=fffff;1=fffff;2=fffff;3=fffff
> L3Code:0=fffff;1=fffff;2=fffff;3=fffff
> 
> but that is easier to read than the current:
> 
> # cat schemata
> L3:0=fffff,fffff;1=fffff,fffff;2=fffff,fffff;3=fffff,fffff
> 
> which gives you no clue on which mask is code and which is data.

Indeed.
 
> We'd also end up with "info/L3Data/" and "info/L3code/"
> which would be a little redundant (since the files in each
> would contain the same numbers), but perhaps that is worth
> it to get the better schemata file.

I think so. Making the user interface more intuitive is always worth it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 18:06     ` Fenghua Yu
  2016-10-17 16:35       ` Luck, Tony
  2016-10-17 16:53       ` Thomas Gleixner
@ 2016-10-17 17:02       ` Thomas Gleixner
  2016-10-17 21:22         ` Fenghua Yu
  2 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 17:02 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Fenghua Yu wrote:
> On Mon, Oct 17, 2016 at 03:45:32PM +0200, Thomas Gleixner wrote:
> > I wonder whether this is the proper abstraction level. We might as well do
> > the following:
> > 
> > rdtresources[] = {
> >      {
> > 	.name	= "L3",
> >      },
> >      {
> > 	.name	= "L3Data",
> >      },
> >      {
> > 	.name	= "L3Code",
> >      },
> > 
> > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > simpler, but it's definitely worth a thought or two.
> 
> This way will be better than having cdp_enabled/capable for L3 and not
> for L2.

So you need to change the struct to have capable and enabled

> > static void rdt_get_config(int idx, struct rdt_resource *r)
> > {
> > 	union cpuid_0x10_1_eax eax;
> > 	union cpuid_0x10_1_edx edx;
> > 	u32 ebx, ecx;
> > 
> > 	cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx, &edx.full);
> > 	r->max_closid = edx.split.cos_max + 1;
> > 	r->num_closid = r->max_closid;
> > 	r->cbm_len = eax.split.cbm_len + 1;
> > 	r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> > 	r->enabled = true;

And set 

    	r->capable = true; 

here instead of r->enabled and set r->enabled at mount time for the
resources depending on the mount flags.

Thanks,

	tglx


	

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 13:45   ` Thomas Gleixner
@ 2016-10-17 18:06     ` Fenghua Yu
  2016-10-17 16:35       ` Luck, Tony
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-17 18:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Stephane Eranian, Borislav Petkov, Dave Hansen,
	Nilay Vaish, Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar,
	Sai Prakhya, Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 03:45:32PM +0200, Thomas Gleixner wrote:
> On Fri, 14 Oct 2016, Fenghua Yu wrote:
> > +/**
> > + * struct rdt_resource - attributes of an RDT resource
> > + * @enabled:			Is this feature enabled on this machine
> > + * @name:			Name to use in "schemata" file
> > + * @max_closid:			Maximum number of CLOSIDs supported
> > + * @num_closid:			Current number of CLOSIDs available
> > + * @max_cbm:			Largest Cache Bit Mask allowed
> > + * @min_cbm_bits:		Minimum number of bits to be set in a cache
> 
> That should be 'number of consecutive bits', right?

Change to "Minimum number of consecutive bits to be set in a cache", is
that ok?

It's 2 on Haswell. It's 1 in other cases i.e. L3 on Broadwell and
Skylake servers etc.

> 
> > + *				bit mask
> > + * @domains:			All domains for this resource
> > + * @num_domains:		Number of domains active
> > + * @msr_base:			Base MSR address for CBMs
> > + * @cdp_capable:		Code/Data Prioritization available
> > + * @cdp_enabled:		Code/Data Prioritization enabled
> 
> I wonder whether this is the proper abstraction level. We might as well do
> the following:
> 
> rdtresources[] = {
>      {
> 	.name	= "L3",
>      },
>      {
> 	.name	= "L3Data",
>      },
>      {
> 	.name	= "L3Code",
>      },
> 
> and enable either L3 or L3Data+L3Code. Not sure if that makes things
> simpler, but it's definitely worth a thought or two.

This way will be better than having cdp_enabled/capable for L3 and not
for L2.  And this doesn't change current userinterface design either,
I think.

> 
> > +#define for_each_rdt_resource(r)	\
> > +	for (r = rdt_resources_all; r->name; r++) \
> > +		if (r->enabled)
> 
> So the resource array must be NULL terminated, right? You might as well use
> 
>    r < rdt_resources_all + ARRAY_SIZE(rdt_resources_all)
> 
> as the loop condition. So you avoid the NULL termination.

Will do.

> 
> > +
> > +#define IA32_L3_CBM_BASE	0xc90
> > +extern struct rdt_resource rdt_resources_all[];
> 
> Please visually split this. CBM_BASE has nothing to do with the resource
> array.
> 
> > +#define domain_init(name) LIST_HEAD_INIT(rdt_resources_all[name].domains)
> 
> name is really misleading here. Please use id and make this an inline
> function.
> 
> > +struct rdt_resource rdt_resources_all[] = {
> 
> >  static inline bool get_rdt_resources(void)
> >  {
> > +	struct rdt_resource *r;
> >  	bool ret = false;
> >  
> >  	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> > @@ -74,20 +105,53 @@ static inline bool get_rdt_resources(void)
> >  
> >  	if (!boot_cpu_has(X86_FEATURE_RDT_A))
> >  		return false;
> > -	if (boot_cpu_has(X86_FEATURE_CAT_L3))
> > +	if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
> > +		union cpuid_0x10_1_eax eax;
> > +		union cpuid_0x10_1_edx edx;
> > +		u32 ebx, ecx;
> > +
> > +		r = &rdt_resources_all[RDT_RESOURCE_L3];
> > +		cpuid_count(0x00000010, 1, &eax.full, &ebx, &ecx, &edx.full);
> > +		r->max_closid = edx.split.cos_max + 1;
> > +		r->num_closid = r->max_closid;
> > +		r->cbm_len = eax.split.cbm_len + 1;
> > +		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> > +		if (boot_cpu_has(X86_FEATURE_CDP_L3))
> > +			r->cdp_capable = true;
> > +		r->enabled = true;
> > +
> >  		ret = true;
> > +	}
> > +	if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
> > +		union cpuid_0x10_1_eax eax;
> > +		union cpuid_0x10_1_edx edx;
> > +		u32 ebx, ecx;
> > +
> > +		/* CPUID 0x10.2 fields are same format at 0x10.1 */
> > +		r = &rdt_resources_all[RDT_RESOURCE_L2];
> > +		cpuid_count(0x00000010, 2, &eax.full, &ebx, &ecx, &edx.full);
> > +		r->max_closid = edx.split.cos_max + 1;
> > +		r->num_closid = r->max_closid;
> > +		r->cbm_len = eax.split.cbm_len + 1;
> > +		r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> > +		r->enabled = true;
> 
> Copy and paste is a wonderful thing, right?
> 
> static void rdt_get_config(int idx, struct rdt_resource *r)
> {
> 	union cpuid_0x10_1_eax eax;
> 	union cpuid_0x10_1_edx edx;
> 	u32 ebx, ecx;
> 
> 	cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx, &edx.full);
> 	r->max_closid = edx.split.cos_max + 1;
> 	r->num_closid = r->max_closid;
> 	r->cbm_len = eax.split.cbm_len + 1;
> 	r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> 	r->enabled = true;
> }
> 
> and and the call site:
> 
> 	if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
> 		rdt_get_config(1, &rdt_resources_all[RDT_RESOURCE_L3]);
> 		if (boot_cpu_has(X86_FEATURE_CDP_L3))
> 			r->cdp_capable = true;
> 		ret = true;
> 	}
> 
> 	if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
> 		rdt_get_config(2, &rdt_resources_all[RDT_RESOURCE_L2]);
> 		ret = true;
> 	}
> 
> Hmm?

Sure.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support
  2016-10-15  2:12 ` [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support Fenghua Yu
@ 2016-10-17 19:35   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 19:35 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> + *
> + * More information about RDT be found in the Intel (R) x86 Architecture
> + * Software Developer Manual.

Yes, that's how it should look like.

> +static void l3_qos_cfg_update(void *arg)
> +{
> +	struct rdt_resource *r = arg;
> +
> +	wrmsrl(IA32_L3_QOS_CFG, r->cdp_enabled);
> +}
> +
> +static void set_l3_qos_cfg(struct rdt_resource *r)
> +{
> +	struct list_head *l;
> +	struct rdt_domain *d;
> +	struct cpumask cpu_mask;

You cannot have cpumasks on stack.

        cpumask_var_t mask;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;


> +	int cpu;
> +
> +	cpumask_clear(&cpu_mask);

That can go away then

> +	list_for_each(l, &r->domains) {

list_for_each_entry() again

> +		d = list_entry(l, struct rdt_domain, list);
> +		cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);

A comment to explain what this does would be helpful.

> +	}
> +	cpu = get_cpu();
> +	/* Update QOS_CFG MSR on this cpu if it's in cpu_mask. */
> +	if (cpumask_test_cpu(cpu, &cpu_mask))
> +		l3_qos_cfg_update(r);
> +	/* Update QOS_CFG MSR on all other cpus in cpu_mask. */
> +	smp_call_function_many(&cpu_mask, l3_qos_cfg_update, r, 1);
> +	put_cpu();
> +}
> +
> +static int parse_rdtgroupfs_options(char *data)
> +{
> +	char *token, *o = data;
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
> +
> +	r->cdp_enabled = false;
> +	while ((token = strsep(&o, ",")) != NULL) {
> +		if (!*token)
> +			return -EINVAL;
> +
> +		if (!strcmp(token, "cdp"))
> +			if (r->enabled && r->cdp_capable)
> +				r->cdp_enabled = true;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct dentry *rdt_mount(struct file_system_type *fs_type,
> +				int flags, const char *unused_dev_name,
> +				void *data)
> +{
> +	struct dentry *dentry;
> +	int ret;
> +	bool new_sb;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +	/*
> +	 * resctrl file system can only be mounted once.
> +	 */
> +	if (static_branch_unlikely(&rdt_enable_key)) {
> +		dentry = ERR_PTR(-EBUSY);
> +		goto out;
> +	}
> +
> +	ret = parse_rdtgroupfs_options(data);
> +	if (ret) {
> +		dentry = ERR_PTR(ret);
> +		goto out;
> +	}
> +
> +	dentry = kernfs_mount(fs_type, flags, rdt_root,
> +			      RDTGROUP_SUPER_MAGIC, &new_sb);

&new_sb is pointless here. It just tells the caller that a new superblock
has been created, So in case of a valid dentry new_sb will always be true,
and if anything failed in kernfs_mount() including the allocation of a new
superblock then new_sb is completely irrelevant as IS_ERR(dentry) will be
true. So you can just hand in NULL because you do not allow multiple
mounts.

> +	if (IS_ERR(dentry))
> +		goto out;
> +	if (!new_sb) {
> +		dentry = ERR_PTR(-EINVAL);
> +		goto out;
> +	}
> +	if (rdt_resources_all[RDT_RESOURCE_L3].cdp_capable)
> +		set_l3_qos_cfg(&rdt_resources_all[RDT_RESOURCE_L3]);
> +	static_branch_enable(&rdt_enable_key);
> +
> +out:
> +	mutex_unlock(&rdtgroup_mutex);
> +
> +	return dentry;
> +}
> +
> +static void reset_all_cbms(struct rdt_resource *r)
> +{
> +	struct list_head *l;
> +	struct rdt_domain *d;
> +	struct msr_param msr_param;
> +	struct cpumask cpu_mask;
> +	int i, cpu;
> +
> +	cpumask_clear(&cpu_mask);
> +	msr_param.res = r;
> +	msr_param.low = 0;
> +	msr_param.high = r->max_closid;
> +
> +	list_for_each(l, &r->domains) {
> +		d = list_entry(l, struct rdt_domain, list);

list_for_each_entry()

> +		cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);
> +
> +		for (i = 0; i < r->max_closid; i++)
> +			d->cbm[i] = r->max_cbm;
> +	}
> +	cpu = get_cpu();
> +	/* Update CBM on this cpu if it's in cpu_mask. */
> +	if (cpumask_test_cpu(cpu, &cpu_mask))
> +		rdt_cbm_update(&msr_param);
> +	/* Updte CBM on all other cpus in cpu_mask. */

Update

> +	smp_call_function_many(&cpu_mask, rdt_cbm_update, &msr_param, 1);
> +	put_cpu();
> +}
> +

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system
  2016-10-15  2:12 ` [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system Fenghua Yu
@ 2016-10-17 19:46   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 19:46 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
>  static int __init rdtgroup_setup_root(void)
>  {
> +	int ret;
> +
>  	rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
>  				      KERNFS_ROOT_CREATE_DEACTIVATED,
>  				      &rdtgroup_default);
> @@ -193,7 +364,9 @@ static int __init rdtgroup_setup_root(void)
>  	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
>  
>  	rdtgroup_default.kn = rdt_root->kn;
> -	kernfs_activate(rdtgroup_default.kn);
> +	ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
> +	if (!ret)
> +		kernfs_activate(rdtgroup_default.kn);
>  
>  	mutex_unlock(&rdtgroup_mutex);

So this is followed by:

   	return 0;

which means that an error in rdtgroup_create_info_dir() is ignored. As a
consequence the mount point is created and the file system is registered
w/o the info directory....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 16:43         ` Yu, Fenghua
@ 2016-10-17 20:20           ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 20:20 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 09:43:41AM -0700, Yu, Fenghua wrote:
> > > > I wonder whether this is the proper abstraction level. We might as
> > > > well do the following:
> > > >
> > > > rdtresources[] = {
> > > >      {
> > > > 	.name	= "L3",
> > > >      },
> > > >      {
> > > > 	.name	= "L3Data",
> > > >      },
> > > >      {
> > > > 	.name	= "L3Code",
> > > >      },
> > > >
> > > > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > > > simpler, but it's definitely worth a thought or two.
> > >
> > > This way will be better than having cdp_enabled/capable for L3 and not
> > > for L2.  And this doesn't change current userinterface design either,
> > > I think.
> > 
> > User interface would change if you did this. The schemata file would look like
> > this with CDP enabled:
> > 
> > # cat schemata
> > L3Data:0=fffff;1=fffff;2=fffff;3=fffff
> > L3Code:0=fffff;1=fffff;2=fffff;3=fffff
> > 
> > but that is easier to read than the current:
> > 
> > # cat schemata
> > L3:0=fffff,fffff;1=fffff,fffff;2=fffff,fffff;3=fffff,fffff
> > 
> > which gives you no clue on which mask is code and which is data.
> 
> Right.
> 
> Also changing to uniform format <resname>:<id1>=cbm1;<id2>=cbm2;...
> is lot easier to parse schemata line in CDP mode.
> 
> So I'll change the code and doc to have two new resources: L3Data and L3Code for CDP mode.

Doc change (fold into part 05):
diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index e56781952f42..b9f634c9a058 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -97,13 +97,18 @@ With CDP disabled the L3 schemata format is:
 
 L3 details (CDP enabled via mount option to resctrl)
 ----------------------------------------------------
-When CDP is enabled, you need to specify separate cache bit masks for
-code and data access. The generic format is:
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this:
 
-	L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
+	L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+	L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
 
-where the d_cbm masks are for data access, and the i_cbm masks for code.
+L2 details
+----------
+L2 cache does not support code and data prioritization, so the
+schemata format is always:
 
+	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
 
 Example 1
 ---------

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-15  2:12 ` [PATCH v4 13/18] x86/intel_rdt: Add mkdir " Fenghua Yu
@ 2016-10-17 21:14   ` Thomas Gleixner
  2016-10-17 21:50     ` Luck, Tony
  2016-10-18  1:18     ` Fenghua Yu
  0 siblings, 2 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 21:14 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +/*
> + * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
> + * we can keep a bitmap of free CLOSIDs in a single integer.
> + *
> + * Please note: This only supports global CLOSID across multiple
> + * resources and multiple sockets. User can create rdtgroups including root
> + * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
> + * number of caches is big or number of supported resources sharing CLOSID
> + * is growing, it's getting harder to find usable rdtgroups which is limited
> + * by the small number of CLOSIDs.
> + *
> + * In the future, if it's necessary, we can implement more complex CLOSID
> + * allocation per socket/per resource domain and utilize CLOSIDs as many
> + * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
> + * rdtgroups and each rdtgroup has different combination of two L3 CBMs.

I'm confused as usual, but a two socket broadwell has exactly two L3 cache
domains and exactly 16 CLOSIDs per cache domain.

If you take CDP into account then the number of CLOSIDs is reduced to 8 per
cache domains.

So we never can have more than nr(CLOSIDs) * nr(L3 cache domains) unique
settings. So for a two socket broadwell its 32 for !CDP and 16 for CDP.

With the proposed user interface the number of unique rdtgroups is simply
the number of CLOSIDs because we handle the cache domains already per
resource, i.e. the meaning of CLOSID can be set independently per cache
domain.

Can you please explain why you think that we can have 16x16 unique
rdtgroups if we just have 16 resp. 8 CLOSIDs available?

> +static int closid_free_map;
> +
> +static void closid_init(void)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
> +	int rdt_max_closid;
> +
> +	/* Enabling L3 CDP halves the number of CLOSIDs */
> +	if (r->cdp_enabled)
> +		r->num_closid = r->max_closid / 2;
> +	else
> +		r->num_closid = r->max_closid;

If you do the L3Data/L3Core thingy then this can go away.

> +	/* Compute rdt_max_closid across all resources */
> +	rdt_max_closid = 0;
> +	for_each_rdt_resource(r)
> +		rdt_max_closid = max(rdt_max_closid, r->num_closid);

Oh no! This needs to be min().

Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
L3 simply because you do not have a seperation of L2 and L3 in
MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
undefined behaviour. See SDM:

 "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
  of the lower half of the COS space will cause undefined performance impact
  to code and data fetches due to MSR space re-indexing into code/data masks
  when CDP is enabled."

> +	if (rdt_max_closid > 32) {
> +		pr_warn_once("Only using 32/%d CLOSIDs\n", rdt_max_closid);

You have a fancy for pr_*_once(). How often is this going to be executed?
Once per mount and we can really print it each time. mount is hardly a fast
path operation.

Also please spell out: "Using only 32 of %d CLOSIDs"
because "Using only 32/64 COSIDs" does not make any sense.

> +		rdt_max_closid = 32;
> +	}

> +static void rdt_reset_pqr_assoc_closid(void *v)
> +{
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> +	state->closid = 0;
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);

So this is called with preemption disabled, but is thread context the only
context in which pqr_state can be modified? If yes, please add a comment
explaining this clearly, otherwise the protection is not sufficient. I'm
too lazy to stare into the monitoring code ...

> +/*
> + * Forcibly remove all of subdirectories under root.
> + */
> +static void rmdir_all_sub(void)
> +{
> +	struct rdtgroup *rdtgrp;
> +	struct list_head *l, *next;
> +
> +	get_cpu();
> +	/* Reset PQR_ASSOC MSR on this cpu. */
> +	rdt_reset_pqr_assoc_closid(NULL);
> +	/* Reset PQR_ASSOC MSR on the rest of cpus. */
> +	smp_call_function_many(cpu_online_mask, rdt_reset_pqr_assoc_closid,
> +			       NULL, 1);

So you have reset the CLOSIDs to 0, but what resets L3/2_QOS_MASK_0 to the
default value (all valid bits set)? Further what clears CDP? 

> +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> +			  umode_t mode)
> +{
    ....
> +	/* allocate the rdtgroup. */
> +	rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
> +	if (!rdtgrp) {
> +		ret = -ENOSPC;
> +		goto out_closid_free;
> +	}
> +	rdtgrp->closid = closid;
> +	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> +
> +	/* kernfs creates the directory for rdtgrp */
> +	kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
> +	if (IS_ERR(kn)) {
> +		ret = PTR_ERR(kn);
> +		goto out_cancel_ref;

Yuck! You are going to free rdtgrp, which is still enqueued in
rdt_all_groups...

> +static int rdtgroup_rmdir(struct kernfs_node *kn)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int ret = 0;
> +
> +	rdtgrp = rdtgroup_kn_lock_live(kn);
> +	if (!rdtgrp) {
> +		rdtgroup_kn_unlock(kn);
> +		return -ENOENT;
> +	}
> +
> +	/*
> +	 * rmdir is for deleting resource groups. Don't
> +	 * allow deletion of "info" or any of its subdirectories
> +	 */
> +	if (!rdtgrp) {

And how are we going to reach this? You checked !rdtgrp already above ....

> +		mutex_unlock(&rdtgroup_mutex);
> +		kernfs_unbreak_active_protection(kn);
> +		return -EPERM;
> +	}
> +
> +	rdtgrp->flags = RDT_DELETED;
> +	closid_free(rdtgrp->closid);
> +	list_del(&rdtgrp->rdtgroup_list);
> +
> +	/*
> +	 * one extra hold on this, will drop when we kfree(rdtgrp)
> +	 * in rdtgroup_kn_unlock()
> +	 */
> +	kernfs_get(kn);

So in rdtgrp_mkdir you have this: 

> +	/*
> +	 * This extra ref will be put in kernfs_remove() and guarantees
> +	 * that @rdtgrp->kn is always accessible.
> +	 */
> +	kernfs_get(kn);
> +

That smells fishy ,..

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID
  2016-10-17 17:02       ` Thomas Gleixner
@ 2016-10-17 21:22         ` Fenghua Yu
  0 siblings, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-17 21:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Stephane Eranian, Borislav Petkov, Dave Hansen,
	Nilay Vaish, Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar,
	Sai Prakhya, Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 07:02:19PM +0200, Thomas Gleixner wrote:
> On Mon, 17 Oct 2016, Fenghua Yu wrote:
> > On Mon, Oct 17, 2016 at 03:45:32PM +0200, Thomas Gleixner wrote:
> > > I wonder whether this is the proper abstraction level. We might as well do
> > > the following:
> > > 
> > > rdtresources[] = {
> > >      {
> > > 	.name	= "L3",
> > >      },
> > >      {
> > > 	.name	= "L3Data",
> > >      },
> > >      {
> > > 	.name	= "L3Code",
> > >      },
> > > 
> > > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > > simpler, but it's definitely worth a thought or two.
> > 
> > This way will be better than having cdp_enabled/capable for L3 and not
> > for L2.
> 
> So you need to change the struct to have capable and enabled
> 
> > > static void rdt_get_config(int idx, struct rdt_resource *r)
> > > {
> > > 	union cpuid_0x10_1_eax eax;
> > > 	union cpuid_0x10_1_edx edx;
> > > 	u32 ebx, ecx;
> > > 
> > > 	cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx, &edx.full);
> > > 	r->max_closid = edx.split.cos_max + 1;
> > > 	r->num_closid = r->max_closid;
> > > 	r->cbm_len = eax.split.cbm_len + 1;
> > > 	r->max_cbm = BIT_MASK(eax.split.cbm_len + 1) - 1;
> > > 	r->enabled = true;
> 
> And set 
> 
>     	r->capable = true; 
> 
> here instead of r->enabled and set r->enabled at mount time for the
> resources depending on the mount flags.

Neat code change!
 
Thanks,

-Fenghua

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/18] x86/intel_rdt: Add cpus file
  2016-10-15  2:12 ` [PATCH v4 14/18] x86/intel_rdt: Add cpus file Fenghua Yu
@ 2016-10-17 21:27   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 21:27 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
>  static int intel_rdt_offline_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
> +	struct rdtgroup *rdtgrp;
> +	struct list_head *l;
>  
>  	mutex_lock(&rdtgroup_mutex);
>  	for_each_rdt_resource(r)
>  		update_domain(cpu, r, 0);
> +	list_for_each(l, &rdt_all_groups) {

list_for_each_entry ...

> +		rdtgrp = list_entry(l, struct rdtgroup, rdtgroup_list);
> +		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask))
> +			break;

> +static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
> +				   char *buf, size_t nbytes, loff_t off)
> +{
     ....
> +	/* Are trying to drop some cpus from this group? */

  	/* Check whether cpus are dropped from this group */

> +	cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
> +	if (cpumask_weight(tmpmask)) {
> +		/* Can't drop from default group */
> +		if (rdtgrp == &rdtgroup_default) {
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		/* Give any dropped cpus to rdtgroup_default */
> +		cpumask_or(&rdtgroup_default.cpu_mask,
> +			   &rdtgroup_default.cpu_mask, tmpmask);
> +		for_each_cpu(cpu, tmpmask)
> +			per_cpu(cpu_closid, cpu) = 0;
> +	}
> +
> +	/*
> +	 * If we added cpus, remove them from previous group that owned them
> +	 * and update per-cpu rdtgroup pointers to refer to us

s/per-cpu rdtgroup pointers to refer to us/the per-cpu closid/

> +	 */
> +	cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
> +	if (cpumask_weight(tmpmask)) {
> +		struct list_head *l;
> +
> +		list_for_each(l, &rdt_all_groups) {
> +			r = list_entry(l, struct rdtgroup, rdtgroup_list);

Once more: list_for_each_entry()

> @@ -582,6 +698,10 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  		return -EPERM;
>  	}
>  
> +	/* Give any CPUs back to the default group */
> +	cpumask_or(&rdtgroup_default.cpu_mask,
> +		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);

What resets the per-cpu closid to 0?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 21:14   ` Thomas Gleixner
@ 2016-10-17 21:50     ` Luck, Tony
  2016-10-17 22:52       ` Thomas Gleixner
  2016-10-18  1:18     ` Fenghua Yu
  1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 21:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> > +	/* Compute rdt_max_closid across all resources */
> > +	rdt_max_closid = 0;
> > +	for_each_rdt_resource(r)
> > +		rdt_max_closid = max(rdt_max_closid, r->num_closid);
> 
> Oh no! This needs to be min().
> 
> Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
> reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
> half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
> L3 simply because you do not have a seperation of L2 and L3 in
> MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
> undefined behaviour. See SDM:
> 
>  "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
>   of the lower half of the COS space will cause undefined performance impact
>   to code and data fetches due to MSR space re-indexing into code/data masks
>   when CDP is enabled."

Bother.  The SDM also has this gem:

17.17.5.1 Cache Allocation Technology Dynamic Configuration
Both the CAT masks and CQM registers are accessible and modifiable at
any time during execution using RDMSR/WRMSR unless otherwise noted. When
writing to these MSRs a #GP(0) will be generated if any of the following
conditions occur:

 * Writing a COS greater than the supported maximum (specified as the
   maximum value of CPUID.(EAX=10H, ECX=ResID):EDX[15:0] for all valid
   ResID values) is written to the IA32_PQR_ASSOC.CLOS field.

With the intent here being that if you have more of one resource than
another, you can use all of the resources in the larger (with the
resource with fewer mask registers defaulting to the maximum value
when PQR_ASSOC.COS is too large [and I can't find the text that talks
about that default behaviour :-( ]

I think this all means that L3/CDP is "special". If CDP is on, we can't
use the top half of the CLOSID space that CPUID.(EAX=10H, ECX=1):EDX[15:0]
told us is present. So we can't exceed the half-way point. But in other
cases like L3 with max=16 and L2 with max=8 we should allow 16 groups,
but 0-7 allow control of L3 and L2, while 8-15 only allow L3 control (the
schemata code enforces this when you get to that part).

Perhaps we can encode this in another field in the rdt_resource structure
that says that some maximums globally override all others, while some
can be legitimately exceeded.

I'll have some words with the h/w architect, and get the SDM fixed in
the next edition.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 15/18] x86/intel_rdt: Add tasks files
  2016-10-15  2:12 ` [PATCH v4 15/18] x86/intel_rdt: Add tasks files Fenghua Yu
@ 2016-10-17 22:01   ` Thomas Gleixner
  2016-10-17 22:17     ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 22:01 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +struct task_move_callback {
> +	struct callback_head work;
> +	struct rdtgroup *rdtgrp;

Please align the struct members as you did everywhere else already.

> +};
> +
> +static void move_myself(struct callback_head *head)
> +{
> +	struct task_move_callback *callback;
> +	struct rdtgroup *rdtgrp;
> +
> +	callback = container_of(head, struct task_move_callback, work);
> +	rdtgrp = callback->rdtgrp;
> +
> +	/* Resource group might have been deleted before process runs */

  	/*
	 * If resource group was deleted before this task work callback
	 * was invoked, then assign the task to root group and free the
	 * resource group,
	 */

> +	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> +	    (rdtgrp->flags & RDT_DELETED)) {
> +		current->closid = 0;
> +		kfree(rdtgrp);
> +	}
> +
> +	kfree(callback);
> +}
> +
> +static int __rdtgroup_move_task(struct task_struct *tsk,
> +				struct rdtgroup *rdtgrp)
> +{
> +	struct task_move_callback *callback;
> +	int ret;
> +
> +	callback = kzalloc(sizeof(*callback), GFP_KERNEL);
> +	if (!callback)
> +		return -ENOMEM;
> +	callback->work.func = move_myself;
> +	callback->rdtgrp = rdtgrp;

Lacks a comment:

  	/*
	 * Take a refcount, so rdtgrp cannot be freed before the
	 * callback has been invoked
	 */

> +	atomic_inc(&rdtgrp->waitcount);
> +	ret = task_work_add(tsk, &callback->work, true);
> +	if (ret) {


Lacks a comment as well:

      		/*
		 * Task is exiting. Drop the refcount and free the callback.
		 * No need to check the refcount as the group cannot be
		 * deleted before the write function unlocks rdtgroup_mutex.
		 */

For you the comment might be obvious, but I had to lookup the world and
some more.

> +		atomic_dec(&rdtgrp->waitcount);
> +		kfree(callback);
> +	} else {
> +		tsk->closid = rdtgrp->closid;
> +	}
> +	return ret;

> +static int rdtgroup_task_write_permission(struct task_struct *task,
> +					  struct kernfs_open_file *of)
> +{
> +	const struct cred *cred = current_cred();
> +	const struct cred *tcred = get_task_cred(task);
> +	int ret = 0;
> +
> +	/*
> +	 * even if we're attaching all tasks in the thread group, we only

Sentences start with an uppercase letter.

> +	 * need to check permissions on one of them.
> +	 */
> +	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
> +	    !uid_eq(cred->euid, tcred->uid) &&
> +	    !uid_eq(cred->euid, tcred->suid))
> +		ret = -EPERM;
> +
> +	put_cred(tcred);
> +	return ret;
> +}
> +
> +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
> +			      struct kernfs_open_file *of)
> +{
> +	struct task_struct *tsk;
> +	int ret;
> +
> +	rcu_read_lock();
> +	if (pid) {
> +		tsk = find_task_by_vpid(pid);
> +		if (!tsk) {
> +			ret = -ESRCH;
> +			goto out_unlock_rcu;

This goto is pointless as this is the only user,

     	     	        rcu_read_unlock()l
			return -ESRCH;

> +		}
> +	} else {
> +		tsk = current;
> +	}

> @@ -559,6 +713,13 @@ static void rmdir_all_sub(void)
>  {
>  	struct rdtgroup *rdtgrp;
>  	struct list_head *l, *next;
> +	struct task_struct *p;
> +
> +	/* move all tasks to default resource group */
> +	read_lock(&tasklist_lock);
> +	for_each_process(p)
> +		p->closid = 0;
> +	read_unlock(&tasklist_lock);

Ok.

> +	/* Don't allow if there are processes in this group */
> +	read_lock(&tasklist_lock);
> +	for_each_process(p) {
> +		if (p->closid == rdtgrp->closid) {
> +			read_unlock(&tasklist_lock);
> +			rdtgroup_kn_unlock(kn);
> +			return -EBUSY;
> +		}
> +	}
> +	read_unlock(&tasklist_lock);

I wonder, whether we should simply give those tasks back to the default
group, same as we do with the cpus.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 15/18] x86/intel_rdt: Add tasks files
  2016-10-17 22:01   ` Thomas Gleixner
@ 2016-10-17 22:17     ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 22:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Tue, Oct 18, 2016 at 12:01:01AM +0200, Thomas Gleixner wrote:
> > +	/* Don't allow if there are processes in this group */
> > +	read_lock(&tasklist_lock);
> > +	for_each_process(p) {
> > +		if (p->closid == rdtgrp->closid) {
> > +			read_unlock(&tasklist_lock);
> > +			rdtgroup_kn_unlock(kn);
> > +			return -EBUSY;
> > +		}
> > +	}
> > +	read_unlock(&tasklist_lock);
> 
> I wonder, whether we should simply give those tasks back to the default
> group, same as we do with the cpus.

Leftover inherited semantics from the cgroup version. It would
simplify the code here (one less error case to handle) if we
did drop this.  I can't come up with a good reason why we'd
want to make the rmdir fail. I can imagine that a sysadmin
dealing with an application that is a fork bomb being happy
about not having to race to move things out so they can do the
rmdir.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 16/18] x86/intel_rdt: Add schemata file
  2016-10-15  2:12 ` [PATCH v4 16/18] x86/intel_rdt: Add schemata file Fenghua Yu
@ 2016-10-17 22:35   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 22:35 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +static void update_domains(struct rdt_resource *r, int closid)
> +{
> +	int cpu, idx = 0;
> +	struct list_head *l;
> +	struct rdt_domain *d;
> +	struct msr_param msr_param;
> +	struct cpumask cpu_mask;

Again. No cpumasks on stack.

> +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	char *tok, *resname;
> +	struct rdtgroup *rdtgrp;
> +	struct rdt_resource *r;
> +	int closid, ret = 0;
> +	u32 *l3_cbms = NULL;
> +
> +	/* Legal input requires a trailing newline */

s/Legal/Valid/ please. There is no law which enforces this.

> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
> +		return -EINVAL;
> +	buf[nbytes - 1] = '\0';
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +	if (!rdtgrp) {
> +		rdtgroup_kn_unlock(of->kn);
> +		return -ENOENT;
> +	}
> +
> +	closid = rdtgrp->closid;
> +
> +	/* get scratch space to save all the masks while we validate input */
> +	for_each_rdt_resource(r) {
> +		r->tmp_cbms = kcalloc(r->num_domains << r->cdp_enabled,
> +				      sizeof(*l3_cbms), GFP_KERNEL);
> +		if (!r->tmp_cbms) {
> +			ret = -ENOMEM;
> +			goto fail;
> +		}
> +		r->num_cbms = 0;

This wants to be r->num_tmp_cbms for clarity.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 21:50     ` Luck, Tony
@ 2016-10-17 22:52       ` Thomas Gleixner
  2016-10-17 23:00         ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 22:52 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Luck, Tony wrote:
> On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> > > +	/* Compute rdt_max_closid across all resources */
> > > +	rdt_max_closid = 0;
> > > +	for_each_rdt_resource(r)
> > > +		rdt_max_closid = max(rdt_max_closid, r->num_closid);
> > 
> > Oh no! This needs to be min().
> > 
> > Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
> > reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
> > half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
> > L3 simply because you do not have a seperation of L2 and L3 in
> > MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
> > undefined behaviour. See SDM:
> > 
> >  "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
> >   of the lower half of the COS space will cause undefined performance impact
> >   to code and data fetches due to MSR space re-indexing into code/data masks
> >   when CDP is enabled."
> 
> Bother.  The SDM also has this gem:
> 
> 17.17.5.1 Cache Allocation Technology Dynamic Configuration
> Both the CAT masks and CQM registers are accessible and modifiable at
> any time during execution using RDMSR/WRMSR unless otherwise noted. When
> writing to these MSRs a #GP(0) will be generated if any of the following
> conditions occur:
> 
>  * Writing a COS greater than the supported maximum (specified as the
>    maximum value of CPUID.(EAX=10H, ECX=ResID):EDX[15:0] for all valid
>    ResID values) is written to the IA32_PQR_ASSOC.CLOS field.
> 
> With the intent here being that if you have more of one resource than
> another, you can use all of the resources in the larger (with the
> resource with fewer mask registers defaulting to the maximum value
> when PQR_ASSOC.COS is too large [and I can't find the text that talks
> about that default behaviour :-( ]
>
> I think this all means that L3/CDP is "special". If CDP is on, we can't
> use the top half of the CLOSID space that CPUID.(EAX=10H, ECX=1):EDX[15:0]
> told us is present. So we can't exceed the half-way point.

That's my understanding.

> But in other cases like L3 with max=16 and L2 with max=8 we should allow
> 16 groups, but 0-7 allow control of L3 and L2, while 8-15 only allow L3
> control (the schemata code enforces this when you get to that part).

Cute. Welcome to configuration hell!

So how are we going to deal with that in the schematas? Assume the L3=16
and L2=8 case(no CDP). So effectively any write of L2 to CLOSID=0 will
affect the setting of L2 in CLOSID=8.

Will the code tell the user that L2 cannot be set for CLOSID >= 8? 

Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
schemata file of CLOSID >= 8?

And of course this becomes very interesting when CLOSID 1 is deleted, then
what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
reused again.

Seems to be a well thought out hardware feature once again.

> Perhaps we can encode this in another field in the rdt_resource structure
> that says that some maximums globally override all others, while some
> can be legitimately exceeded.

That should work.
 
> I'll have some words with the h/w architect, and get the SDM fixed in
> the next edition.

Whatever the outcome will be :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 22:52       ` Thomas Gleixner
@ 2016-10-17 23:00         ` Luck, Tony
  2016-10-17 23:03           ` Thomas Gleixner
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 23:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

> So how are we going to deal with that in the schematas? Assume the L3=16
> and L2=8 case(no CDP). So effectively any write of L2 to CLOSID=0 will
> affect the setting of L2 in CLOSID=8.
>
> Will the code tell the user that L2 cannot be set for CLOSID >= 8? 
>
> Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
> schemata file of CLOSID >= 8?

The default and first 7 directories you make will have lines in schemata for
both L3 and L2.  The next 8 will just have L3 (and won't let you add L2).

> And of course this becomes very interesting when CLOSID 1 is deleted, then
> what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
> reused again.

Deleting CLOSID 1 won't affect CLOSID 9.  When you make a new directory
that gets allocated CLOSID 1, it will have both L3 and L2 lines.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:00         ` Luck, Tony
@ 2016-10-17 23:03           ` Thomas Gleixner
  2016-10-17 23:10             ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 23:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Luck, Tony wrote:
> > So how are we going to deal with that in the schematas? Assume the L3=16
> > and L2=8 case(no CDP). So effectively any write of L2 to CLOSID=0 will
> > affect the setting of L2 in CLOSID=8.
> >
> > Will the code tell the user that L2 cannot be set for CLOSID >= 8? 
> >
> > Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
> > schemata file of CLOSID >= 8?
> 
> The default and first 7 directories you make will have lines in schemata for
> both L3 and L2.  The next 8 will just have L3 (and won't let you add L2).
> 
> > And of course this becomes very interesting when CLOSID 1 is deleted, then
> > what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
> > reused again.
> 
> Deleting CLOSID 1 won't affect CLOSID 9.

How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
setting of CLOSID 1 around and do not reset it to default?

>  When you make a new directory that gets allocated CLOSID 1, it will have
>  both L3 and L2 lines.

Now CLOSID 1 is reused and reset to all 1's per default and even if not
then the operator will set a new L2 value and therefor wreckaging CLOSID 9.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:03           ` Thomas Gleixner
@ 2016-10-17 23:10             ` Luck, Tony
  2016-10-17 23:25               ` Thomas Gleixner
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 23:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

> How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
> setting of CLOSID 1 around and do not reset it to default?

No. When CLOSID 9 arrives at the L2 h/w, it doesn't just take the bits it
likes an discard the high bits to map to L2_CBM[1]. It just turns into 
into the maximum allowed value for an L2 CBM.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-18  1:18     ` Fenghua Yu
@ 2016-10-17 23:20       ` Thomas Gleixner
  2016-10-17 23:37         ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 23:20 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Fenghua Yu wrote:
> part0: L3:0=1;1=1       closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> (closid 15 on cache0 combined with 16 different closids on cache1)
> ...
> part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> 
> To utilize as much combinations as possbile, we may implement a
> more complex allocation than current one.
> 
> Does this make sense?

Thanks for the explanation. I knew that I'm missing something.

But how is that supposed to work? The schemata files have no idea of
closids simply because the closids are assigned automatically. And that
makes the whole thing exponentially complex. You must allow to create ALL
rdt groups (initialy as a copy of the root group) and then when the
schemata file is written you have to look whether the particular CBM value
for a particular domain is already used and assign the same cosid for this
domain. That of course makes the whole L2 business completely diffuse
because you might end up with:

Dom0 = COSID1	  and DOM1 = COSID9

So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
Dom0 you must find a new COSID for Dom0. If there is none, then you must
reject the write and leave the admin puzzled.

There is a reason why I suggested:

 https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos

It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
the above issues. And it would allow you to utilize the 256 groups in an
understandable way.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:10             ` Luck, Tony
@ 2016-10-17 23:25               ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-17 23:25 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Hansen, Dave, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Shankar, Ravi V, Prakhya,
	Sai Praneeth, Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Luck, Tony wrote:

> > How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
> > setting of CLOSID 1 around and do not reset it to default?
> 
> No. When CLOSID 9 arrives at the L2 h/w, it doesn't just take the bits it
> likes an discard the high bits to map to L2_CBM[1]. It just turns into 
> into the maximum allowed value for an L2 CBM.

So all CLOSIDs >= 8 will use all valid CBM bits for L2? That's even more
insane as this breaks any L2 partitioning which is set up by CLOSIDs < 8.

So effectively CLOSIDs >= 8 are useless in the L3=16 and L2=8 case.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:20       ` Thomas Gleixner
@ 2016-10-17 23:37         ` Luck, Tony
  2016-10-18  2:56           ` Fenghua Yu
  2016-10-18 10:44           ` Thomas Gleixner
  0 siblings, 2 replies; 52+ messages in thread
From: Luck, Tony @ 2016-10-17 23:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Oct 2016, Fenghua Yu wrote:
> > part0: L3:0=1;1=1       closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> > (closid 15 on cache0 combined with 16 different closids on cache1)
> > ...
> > part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> > part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> > 
> > To utilize as much combinations as possbile, we may implement a
> > more complex allocation than current one.
> > 
> > Does this make sense?
> 
> Thanks for the explanation. I knew that I'm missing something.
> 
> But how is that supposed to work? The schemata files have no idea of
> closids simply because the closids are assigned automatically. And that
> makes the whole thing exponentially complex. You must allow to create ALL
> rdt groups (initialy as a copy of the root group) and then when the
> schemata file is written you have to look whether the particular CBM value
> for a particular domain is already used and assign the same cosid for this
> domain. That of course makes the whole L2 business completely diffuse
> because you might end up with:
> 
> Dom0 = COSID1	  and DOM1 = COSID9
> 
> So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
> Dom0 you must find a new COSID for Dom0. If there is none, then you must
> reject the write and leave the admin puzzled.
> 
> There is a reason why I suggested:
> 
>  https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos
> 
> It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> the above issues. And it would allow you to utilize the 256 groups in an
> understandable way.

If you head down that path someone with a 4-socket system will try to
make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
a beating. The eight socket system with 16^8 = 4G groups defies any
rationale hope. Best not to think about 16 sockets.

The L2 + L3 configuration space gets unbelievably messy too.

There's a reason why I ripped out the allocation code and went with
a simple global allocator in this version.  If we decide we need something
fancier we can adapt later. Some solutions might be transparent to
applications, others might add a "closid" file into each directory to
give 2nd generation applications hooks to view (and maybe control)
which closid is used by each group.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 21:14   ` Thomas Gleixner
  2016-10-17 21:50     ` Luck, Tony
@ 2016-10-18  1:18     ` Fenghua Yu
  2016-10-17 23:20       ` Thomas Gleixner
  1 sibling, 1 reply; 52+ messages in thread
From: Fenghua Yu @ 2016-10-18  1:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Stephane Eranian, Borislav Petkov, Dave Hansen,
	Nilay Vaish, Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar,
	Sai Prakhya, Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> On Fri, 14 Oct 2016, Fenghua Yu wrote:
> > +/*
> > + * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
> > + * we can keep a bitmap of free CLOSIDs in a single integer.
> > + *
> > + * Please note: This only supports global CLOSID across multiple
> > + * resources and multiple sockets. User can create rdtgroups including root
> > + * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
> > + * number of caches is big or number of supported resources sharing CLOSID
> > + * is growing, it's getting harder to find usable rdtgroups which is limited
> > + * by the small number of CLOSIDs.
> > + *
> > + * In the future, if it's necessary, we can implement more complex CLOSID
> > + * allocation per socket/per resource domain and utilize CLOSIDs as many
> > + * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
> > + * rdtgroups and each rdtgroup has different combination of two L3 CBMs.
> 
> I'm confused as usual, but a two socket broadwell has exactly two L3 cache
> domains and exactly 16 CLOSIDs per cache domain.
> 
> If you take CDP into account then the number of CLOSIDs is reduced to 8 per
> cache domains.
> 
> So we never can have more than nr(CLOSIDs) * nr(L3 cache domains) unique
> settings. So for a two socket broadwell its 32 for !CDP and 16 for CDP.
> 
> With the proposed user interface the number of unique rdtgroups is simply
> the number of CLOSIDs because we handle the cache domains already per
> resource, i.e. the meaning of CLOSID can be set independently per cache
> domain.
> 
> Can you please explain why you think that we can have 16x16 unique
> rdtgroups if we just have 16 resp. 8 CLOSIDs available?

To simplify, we only consider CAT case. CDP has similar similar situation.

For a two socket broadwell, we have schemata format "L3:0=x;1=y"
suppose the two cache ids are 0 and 1, and x is cache 0's cbm and y is
cache 1's cbm. Then kernel allocates one closid and its cbm=x for cache 0
and one closid and its cbm=y for cache 1.

So we can have the following 16x16 different partitions/rdtgroups.
Each partition/rdgroup has its name and has its own unique
closid combinations on two caches. If a task is assigned to any of
partition, the task has its unique combination of closids when running on
cache 0 and when running on cache 1. Belowing cbm values are example values.

name   schemata         closids on cache 0 and 1 allocated by kernel
----   --------         --------------------------------------------
(closid 0 on cache0 combined with 16 different closid on cache1)
part0: L3:0=1;1=1       closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
part1: L3:0=1;1=3       closid0/cbm=1 on cache0 and closid1/cbm=3 on cache1
part2: L3:0=1;1=7       closid0/cbm=1 on cache0 and closid2/cbm=7 on cache1
part3: L3:0=1;1=f       closid0/cbm=1 on cache0 and closid3/cbm=f on cache1
part4: L3:0=1;1=1f      closid0/cbm=1 on cache0 and closid4/cbm=1f on cache1
part5: L3:0=1;1=3f      closid0/cbm=1 on cache0 and closid5/cbm=3f on cache1
part6: L3:0=1;1=7f      closid0/cbm=1 on cache0 and closid6/cbm=7f on cache1
part7: L3:0=1;1=ff      closid0/cbm=1 on cache0 and closid7/cbm=ff on cache1
part8: L3:0=1;1=1ff     closid0/cbm=1 on cache0 and closid8/cbm=1ff on cache1
part9: L3:0=1;1=3ff     closid0/cbm=1 on cache0 and closid9/cbm=3ff on cache1
part10: L3:0=1;1=7ff    closid0/cbm=1 on cache0 and closid10/cbm=7ff on cache1
part11: L3:0=1;1=fff    closid0/cbm=1 on cache0 and closid11/cbm=fff on cache1
part12: L3:0=1;1=1fff   closid0/cbm=1 on cache0 and closid12/cbm=1fff on cache1
part13: L3:0=1;1=3fff   closid0/cbm=1 on cache0 and closid13/cbm=3fff on cache1
part14: L3:0=1;1=7fff   closid0/cbm=1 on cache0 and closid14/cbm=7fff on cache1
part15: L3:0=1;1=ffff   closid0/cbm=1 on cache0 and closid15/cbm=ffff on cache1
(closid 1 on cache0 combined with 16 different closid on cache1)
part16: L3:0=3;1=1      closid1/cbm=3 on cache0 and closid0/cbm=1 on cache1
part17: L3:0=3;1=3      closid1/cbm=3 on cache0 and closid1/cbm=3 on cache1
...
part31: L3:0=3;1=ffff   closid1/cbm=3 on cache0 and closid15/cbm=ffff on cache1
(closid 2 on cache0 combined with 16 different closid on cache1)
part16: L3:0=7;1=1      closid2/cbm=3 on cache0 and closid0/cbm=1 on cache1
part17: L3:0=7;1=3      closid2/cbm=3 on cache0 and closid1/cbm=3 on cache1
...
part31: L3:0=7;1=ffff   closid2/cbm=3 on cache0 and closid15/cbm=ffff on cache1
(closid 3 on cache0 combined with 16 different closids on cache1)
...
(closid 4 on cache0 combined with 16 different closids on cache1)
...
(closid 5 on cache0 combined with 16 different closids on cache1)
...
(closid 6 on cache0 combined with 16 different closids on cache1)
...
(closid 7 on cache0 combined with 16 different closids on cache1)
...
(closid 8 on cache0 combined with 16 different closids on cache1)
...
(closid 9 on cache0 combined with 16 different closids on cache1)
...
(closid 10 on cache0 combined with 16 different closids on cache1)
...
(closid 11 on cache0 combined with 16 different closids on cache1)
...
(closid 12 on cache0 combined with 16 different closids on cache1)
...
(closid 13 on cache0 combined with 16 different closids on cache1)
...
(closid 14 on cache0 combined with 16 different closids on cache1)
...
(closid 15 on cache0 combined with 16 different closids on cache1)
...
part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1

To utilize as much combinations as possbile, we may implement a
more complex allocation than current one.

Does this make sense?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:37         ` Luck, Tony
@ 2016-10-18  2:56           ` Fenghua Yu
  2016-10-18 10:44           ` Thomas Gleixner
  1 sibling, 0 replies; 52+ messages in thread
From: Fenghua Yu @ 2016-10-18  2:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Fenghua Yu, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Stephane Eranian, Borislav Petkov, Dave Hansen,
	Nilay Vaish, Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar,
	Sai Prakhya, Vikas Shivappa, linux-kernel, x86

On Mon, Oct 17, 2016 at 04:37:30PM -0700, Luck, Tony wrote:
> On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Oct 2016, Fenghua Yu wrote:
> > > part0: L3:0=1;1=1       closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> > > (closid 15 on cache0 combined with 16 different closids on cache1)
> > > ...
> > > part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> > > part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> > > 
> > > To utilize as much combinations as possbile, we may implement a
> > > more complex allocation than current one.
> > > 
> > > Does this make sense?
> > 
> > Thanks for the explanation. I knew that I'm missing something.
> > 
> > But how is that supposed to work? The schemata files have no idea of
> > closids simply because the closids are assigned automatically. And that
> > makes the whole thing exponentially complex. You must allow to create ALL
> > rdt groups (initialy as a copy of the root group) and then when the
> > schemata file is written you have to look whether the particular CBM value
> > for a particular domain is already used and assign the same cosid for this
> > domain. That of course makes the whole L2 business completely diffuse
> > because you might end up with:
> > 
> > Dom0 = COSID1	  and DOM1 = COSID9
> > 
> > So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
> > Dom0 you must find a new COSID for Dom0. If there is none, then you must
> > reject the write and leave the admin puzzled.
> > 
> > There is a reason why I suggested:
> > 
> >  https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos
> > 
> > It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> > the above issues. And it would allow you to utilize the 256 groups in an
> > understandable way.
> 
> If you head down that path someone with a 4-socket system will try to
> make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
> a beating. The eight socket system with 16^8 = 4G groups defies any
> rationale hope. Best not to think about 16 sockets.

The number of 16^L3 cache numbers is max partition number limitation 
that a sysadmin can create in theory. Beyond the number, allocation
returns no space. It's kind of like other cases eg many many mkdir in one
directory can fail at one point because mkdir run out of disk space etc.

> 
> The L2 + L3 configuration space gets unbelievably messy too.
> 
> There's a reason why I ripped out the allocation code and went with
> a simple global allocator in this version.  If we decide we need something
> fancier we can adapt later. Some solutions might be transparent to
> applications, others might add a "closid" file into each directory to
> give 2nd generation applications hooks to view (and maybe control)
> which closid is used by each group.

Fully agree with Tony. We understand the complexity of the situation and
just have a simple and working solution for the first version.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system
  2016-10-17 23:37         ` Luck, Tony
  2016-10-18  2:56           ` Fenghua Yu
@ 2016-10-18 10:44           ` Thomas Gleixner
  1 sibling, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2016-10-18 10:44 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Stephane Eranian, Borislav Petkov, Dave Hansen, Nilay Vaish,
	Shaohua Li, David Carrillo-Cisneros, Ravi V Shankar, Sai Prakhya,
	Vikas Shivappa, linux-kernel, x86

On Mon, 17 Oct 2016, Luck, Tony wrote:
> On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> > It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> > the above issues. And it would allow you to utilize the 256 groups in an
> > understandable way.
> 
> If you head down that path someone with a 4-socket system will try to
> make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
> a beating. The eight socket system with 16^8 = 4G groups defies any
> rationale hope. Best not to think about 16 sockets.
> 
> The L2 + L3 configuration space gets unbelievably messy too.
> 
> There's a reason why I ripped out the allocation code and went with
> a simple global allocator in this version.  If we decide we need something
> fancier we can adapt later. Some solutions might be transparent to
> applications, others might add a "closid" file into each directory to
> give 2nd generation applications hooks to view (and maybe control)
> which closid is used by each group.

I'm not saying that we want something fancier. I fully agree with your
decision to make a simple global allocator.

I was just puzzled by the 16*16 comment and wondered what this is
about. Looking at Fenghuas explanation and the examples there is nothing
which really looks like we ever want it. In fact the fancy CLOSID matrix
does not make much sense at all.

So I rather would like to see a comment clearly explaining why the chosen
allocator (grouping) gives us the most straight forward way to utilize the
hardware. It surely restricts the theoretical choices, but it limits them to
the subset which makes technically sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2016-10-18 10:47 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-15  2:12 [PATCH v4 00/18] Intel Cache Allocation Technology Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 01/18] Documentation, ABI: Add a document entry for cache id Fenghua Yu
2016-10-17 10:31   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 02/18] cacheinfo: Introduce " Fenghua Yu
2016-10-17 10:32   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 03/18] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
2016-10-17 10:48   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 04/18] x86/intel_rdt: Feature discovery Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 05/18] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 06/18] x86/intel_rdt: Add CONFIG, Makefile, and basic initialization Fenghua Yu
2016-10-17 10:57   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 07/18] x86/intel_rdt: Add Haswell feature discovery Fenghua Yu
2016-10-17 11:03   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID Fenghua Yu
2016-10-17 13:45   ` Thomas Gleixner
2016-10-17 18:06     ` Fenghua Yu
2016-10-17 16:35       ` Luck, Tony
2016-10-17 16:43         ` Yu, Fenghua
2016-10-17 20:20           ` Luck, Tony
2016-10-17 16:54         ` Thomas Gleixner
2016-10-17 16:53       ` Thomas Gleixner
2016-10-17 17:02       ` Thomas Gleixner
2016-10-17 21:22         ` Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 09/18] x86/cqm: Move PQR_ASSOC management code into generic code used by both CQM and CAT Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 10/18] x86/intel_rdt: Build structures for each resource based on cache topology Fenghua Yu
2016-10-17 14:44   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 11/18] x86/intel_rdt: Add basic resctrl filesystem support Fenghua Yu
2016-10-17 19:35   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 12/18] x86/intel_rdt: Add "info" files to resctrl file system Fenghua Yu
2016-10-17 19:46   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 13/18] x86/intel_rdt: Add mkdir " Fenghua Yu
2016-10-17 21:14   ` Thomas Gleixner
2016-10-17 21:50     ` Luck, Tony
2016-10-17 22:52       ` Thomas Gleixner
2016-10-17 23:00         ` Luck, Tony
2016-10-17 23:03           ` Thomas Gleixner
2016-10-17 23:10             ` Luck, Tony
2016-10-17 23:25               ` Thomas Gleixner
2016-10-18  1:18     ` Fenghua Yu
2016-10-17 23:20       ` Thomas Gleixner
2016-10-17 23:37         ` Luck, Tony
2016-10-18  2:56           ` Fenghua Yu
2016-10-18 10:44           ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 14/18] x86/intel_rdt: Add cpus file Fenghua Yu
2016-10-17 21:27   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 15/18] x86/intel_rdt: Add tasks files Fenghua Yu
2016-10-17 22:01   ` Thomas Gleixner
2016-10-17 22:17     ` Luck, Tony
2016-10-15  2:12 ` [PATCH v4 16/18] x86/intel_rdt: Add schemata file Fenghua Yu
2016-10-17 22:35   ` Thomas Gleixner
2016-10-15  2:12 ` [PATCH v4 17/18] x86/intel_rdt: Add scheduler hook Fenghua Yu
2016-10-15  2:12 ` [PATCH v4 18/18] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.