linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use
  2016-09-08  9:56 ` [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use Fenghua Yu
@ 2016-09-08  8:21   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08  8:21 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> We use ci_cpu_cacheinfo in CAT. Export this function for CAT to reuse.

So ci_cpu_cacheinfo is a function? AFAICT it's a struct.
 
> +#define ci_cacheinfo(cpu)       (&per_cpu(ci_cpu_cacheinfo, cpu))

Why a define and not an inline? &per_cpu should be per_cpu_ptr ....

And a define is not a function either and certainly that whole thing has
nothing to do with an export.

Furthermore $subject talks about some functions. I still have to see one.

It's an art to get a onliner patch screwed up in more than one way so
badly.

No bisquit!

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 08/33] x86/intel_rdt: Add Class of service management
  2016-09-08  9:57 ` [PATCH v2 08/33] x86/intel_rdt: Add Class of service management Fenghua Yu
@ 2016-09-08  8:56   ` Thomas Gleixner
  2016-09-12 16:02   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08  8:56 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> @@ -0,0 +1,12 @@
> +#ifndef _RDT_H_
> +#define _RDT_H_

The standard guard is

    _ASM_X86_RDT_H

> +
> +#ifdef CONFIG_INTEL_RDT

What's the purpose of sticking a struct definition inside an ifdef?

> +
> +struct clos_cbm_table {
> +	unsigned long cbm;
> +	unsigned int clos_refcnt;

Please proper align the struct members for readability sake

	unsigned long	cbm;
	unsigned int	clos_refcnt;

Adding a kerneldoc comment above the struct explaining it would not hurting
either.

> +/*
> + * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
> + */
> +static struct clos_cbm_table *cctable;
> +/*
> + * closid availability bit map.
> + */
> +unsigned long *closmap;
> +static DEFINE_MUTEX(rdtgroup_mutex);
> +
> +static inline void closid_get(u32 closid)
> +{
> +	struct clos_cbm_table *cct = &cctable[closid];
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	cct->clos_refcnt++;

So the whole thing can be written with a single line at the single call
site.

	cc_table[closid].clos_refcnt++;

And that already has the lockdep_assert_held() check. While companies might
pay based on line counts, kernel development asks for readable and sensible
code.

> +}
> +
> +static int closid_alloc(u32 *closid)
> +{
> +	u32 maxid;
> +	u32 id;

Please put variables with the same type into one line. There is no value in
wasting screen estate.

> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	maxid = boot_cpu_data.x86_cache_max_closid;
> +	id = find_first_zero_bit(closmap, maxid);
> +	if (id == maxid)
> +		return -ENOSPC;
> +
> +	set_bit(id, closmap);
> +	closid_get(id);
> +	*closid = id;
> +
> +	return 0;
> +}
> +
> +static inline void closid_free(u32 closid)
> +{
> +	clear_bit(closid, closmap);
> +	cctable[closid].cbm = 0;
> +}
> +
> +static void closid_put(u32 closid)
> +{
> +	struct clos_cbm_table *cct = &cctable[closid];
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +	if (WARN_ON(!cct->clos_refcnt))
> +		return;
> +
> +	if (!--cct->clos_refcnt)
> +		closid_free(closid);
> +}
>  
>  static int __init intel_rdt_late_init(void)
>  {
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
> +	u32 maxid;
> +	int err = 0, size;
>  
>  	if (!cpu_has(c, X86_FEATURE_CAT_L3))
>  		return -ENODEV;
>  
> -	pr_info("Intel cache allocation detected\n");
> +	maxid = c->x86_cache_max_closid;
>  
> -	return 0;
> +	size = maxid * sizeof(struct clos_cbm_table);
> +	cctable = kzalloc(size, GFP_KERNEL);
> +	if (!cctable) {
> +		err = -ENOMEM;
> +		goto out_err;

I told you that before: Using a goto just to return err is pointless and
silly. What the hell is wrong with 

		return -ENOMEM;
???

> +	}
> +
> +	size = BITS_TO_LONGS(maxid) * sizeof(long);
> +	closmap = kzalloc(size, GFP_KERNEL);
> +	if (!closmap) {
> +		kfree(cctable);
> +		err = -ENOMEM;
> +		goto out_err;

Groan.

> +	}
> +
> +	pr_info("Intel cache allocation enabled\n");
> +out_err:

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management
  2016-09-08  9:57 ` [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
@ 2016-09-08  9:40   ` Thomas Gleixner
  2016-09-12 16:10   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08  9:40 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
>  
> +	for_each_online_cpu(i)
> +		rdt_cpumask_update(i);

The only reason why this does not blow up in your face is that at this
point the secondary cpus have been brought up already and user space is not
yet running, so cpu hotplug cannot happen in parallel. Protection by chance
is never a good idea.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT
  2016-09-08  9:57 ` [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
@ 2016-09-08  9:53   ` Thomas Gleixner
  2016-09-15 21:36     ` Yu, Fenghua
  2016-09-13 17:55   ` Nilay Vaish
  1 sibling, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08  9:53 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +extern struct static_key rdt_enable_key;
> +void __intel_rdt_sched_in(void *dummy);
> +
>  struct clos_cbm_table {
>  	unsigned long cbm;
>  	unsigned int clos_refcnt;
>  };
>  
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + * which supports L3 cache allocation.
> + * - When support is present and enabled, does not do any
> + * IA32_PQR_MSR writes until the user starts really using the feature
> + * ie creates a rdtgroup directory and assigns a cache_mask thats
> + * different from the root rdtgroup's cache_mask.
> + * - Caches the per cpu CLOSid values and does the MSR write only
> + * when a task with a different CLOSid is scheduled in. That
> + * means the task belongs to a different rdtgroup.
> + * - Closids are allocated so that different rdtgroup directories
> + * with same cache_mask gets the same CLOSid. This minimizes CLOSids
> + * used and reduces MSR write frequency.
> + */
> +static inline void intel_rdt_sched_in(void)
> +{
> +	/*
> +	 * Call the schedule in code only when RDT is enabled.
> +	 */
> +	if (static_key_false(&rdt_enable_key))

static_branch_[un]likely() is the proper function to use.

> +		__intel_rdt_sched_in(NULL);
> +
> +void __intel_rdt_sched_in(void *dummy)

What's the purpose of this dummy argument?

> +{
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> +	/*
> +	 * Currently closid is always 0. When  user interface is added,
> +	 * closid will come from user interface.
> +	 */
> +	if (state->closid == 0)
> +		return;
> +
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
> +	state->closid = 0;
> +}

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology
@ 2016-09-08  9:56 Fenghua Yu
  2016-09-08  9:56 ` [PATCH v2 01/33] cacheinfo: Introduce cache id Fenghua Yu
                   ` (32 more replies)
  0 siblings, 33 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

L3 cache allocation allows per task control over which areas of the last
level cache are available for allocation. It is the first resource that
can be controlled as part of Intel Resource Director Technology (RDT).
This patch series creates a framework that will make it easy to add
additional resources (like L2 cache).

See Intel Software Developer manual volume 3, chapter 17 for architectural
details. Also Documentation/x86/intel_rdt.txt and
Documentation/x86/intel_rdt_ui.txt (in parts 0001 & 0013 of this patch
series).

A previous implementation used "cgroups" as the user interface. This was
rejected.

The new interface:
1) Aligns better with the h/w capabilities provided
2) Gives finer control (per thread instead of process)
3) Gives control over kernel threads as well as user threads
4) Allows resource allocation policies to be tied to certain cpus across
all contexts (tglx request)

Note 1: that parts 1-12 are largely unchanged from what was posted last
year except for the removal of cgroup pieces and dynamic CAT/CDP switch.

Note 2: This patch set is an infrastructure for future multi-resources
support. Some code are not only for L3. For example, cat_l3_enabled
is checked after initialization and mount time. This doesn't make
sense for L3 only. But it's easy to add L2 support on top of the
current patch set.

Changes:
v2:
- Merge and reorder some patches
- "tasks" has higher priority than "cpus".
- Re-write UI doc.
- Add include/linux/resctrl.h
- Remove rg_list
- A few other changes.

Fenghua Yu (21):
  cacheinfo: Introduce cache id
  Documentation, ABI: Add a document entry for cache id
  x86, intel_cacheinfo: Enable cache id in x86
  drivers/base/cacheinfo.c: Export some cacheinfo functions for others
    to use
  Documentation, x86: Documentation for Intel resource allocation user
    interface
  sched.h: Add rg_list and rdtgroup in task_struct
  magic number for resctrl file system
  x86/intel_rdt.h: Header for inter_rdt.c
  x86/intel_rdt_rdtgroup.h: Header for user interface
  x86/intel_rdt.c: Extend RDT to per cache and per resources
  x86/intel_rdt_rdtgroup.c: User interface for RDT
  x86/intel_rdt_rdtgroup.c: Create info directory
  include/linux/resctrl.h: Define fork and exit functions in a new
    header file
  Task fork and exit for rdtgroup
  x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands
  x86/intel_rdt_rdtgroup.c: Read and write cpus
  x86/intel_rdt_rdtgroup.c: Tasks iterator and write
  x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl
    interface
  Documentation/kernel-parameters: Add kernel parameter "resctrl" for
    CAT
  MAINTAINERS: Add maintainer for Intel RDT resource allocation
  x86/Makefile: Build intel_rdt_rdtgroup.c

Vikas Shivappa (12):
  x86/intel_rdt: Cache Allocation documentation
  x86/intel_rdt: Add support for Cache Allocation detection
  x86/intel_rdt: Add Class of service management
  x86/intel_rdt: Add L3 cache capacity bitmask management
  x86/intel_rdt: Implement scheduling support for Intel RDT
  x86/intel_rdt: Hot cpu support for Cache Allocation
  x86/intel_rdt: Intel haswell Cache Allocation enumeration
  Define CONFIG_INTEL_RDT
  x86/intel_rdt: Intel Code Data Prioritization detection
  x86/intel_rdt: Adds support to enable Code Data Prioritization
  x86/intel_rdt: Class of service and capacity bitmask management for
    CDP
  x86/intel_rdt: Hot cpu update for code data prioritization

 Documentation/ABI/testing/sysfs-devices-system-cpu |   17 +
 Documentation/kernel-parameters.txt                |   13 +
 Documentation/x86/intel_rdt.txt                    |  109 ++
 Documentation/x86/intel_rdt_ui.txt                 |  164 +++
 MAINTAINERS                                        |    9 +
 arch/x86/Kconfig                                   |   13 +
 arch/x86/events/intel/cqm.c                        |   24 +-
 arch/x86/include/asm/cpufeature.h                  |   10 +-
 arch/x86/include/asm/cpufeatures.h                 |    9 +-
 arch/x86/include/asm/disabled-features.h           |    4 +-
 arch/x86/include/asm/intel_rdt.h                   |  129 ++
 arch/x86/include/asm/intel_rdt_rdtgroup.h          |  164 +++
 arch/x86/include/asm/pqr_common.h                  |   27 +
 arch/x86/include/asm/processor.h                   |    3 +
 arch/x86/include/asm/required-features.h           |    4 +-
 arch/x86/kernel/cpu/Makefile                       |    2 +
 arch/x86/kernel/cpu/common.c                       |   19 +
 arch/x86/kernel/cpu/intel_cacheinfo.c              |   20 +
 arch/x86/kernel/cpu/intel_rdt.c                    |  802 ++++++++++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c           | 1362 ++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_schemata.c           |  674 ++++++++++
 arch/x86/kernel/process_64.c                       |    6 +
 drivers/base/cacheinfo.c                           |    7 +-
 include/linux/cacheinfo.h                          |    5 +
 include/linux/resctrl.h                            |   12 +
 include/linux/sched.h                              |    3 +
 include/uapi/linux/magic.h                         |    2 +
 kernel/exit.c                                      |    2 +
 kernel/fork.c                                      |    2 +
 29 files changed, 3589 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/x86/intel_rdt.txt
 create mode 100644 Documentation/x86/intel_rdt_ui.txt
 create mode 100644 arch/x86/include/asm/intel_rdt.h
 create mode 100644 arch/x86/include/asm/intel_rdt_rdtgroup.h
 create mode 100644 arch/x86/include/asm/pqr_common.h
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c
 create mode 100644 include/linux/resctrl.h

-- 
2.5.0

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v2 01/33] cacheinfo: Introduce cache id
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
@ 2016-09-08  9:56 ` Fenghua Yu
  2016-09-09 15:01   ` Nilay Vaish
  2016-09-08  9:56 ` [PATCH v2 02/33] Documentation, ABI: Add a document entry for " Fenghua Yu
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Each cache is described by cacheinfo and is unique in the same index
across the platform. But there is no id for a cache. We introduce cache
ID to identify a cache.

Intel Cache Allocation Technology (CAT) allows some control on the
allocation policy within each cache that it controls. We need a unique
cache ID for each cache level to allow the user to specify which
controls are applied to which cache. Cache id is a concise way to specify
a cache.

Cache id is first enabled on x86. It can be enabled on other platforms
as well. The cache id is not necessary contiguous.

Add an "id" entry to /sys/devices/system/cpu/cpu*/cache/index*/

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.com>
---
 drivers/base/cacheinfo.c  | 5 +++++
 include/linux/cacheinfo.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index e9fd32e..2a21c15 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -233,6 +233,7 @@ static ssize_t file_name##_show(struct device *dev,		\
 	return sprintf(buf, "%u\n", this_leaf->object);		\
 }
 
+show_one(id, id);
 show_one(level, level);
 show_one(coherency_line_size, coherency_line_size);
 show_one(number_of_sets, number_of_sets);
@@ -314,6 +315,7 @@ static ssize_t write_policy_show(struct device *dev,
 	return n;
 }
 
+static DEVICE_ATTR_RO(id);
 static DEVICE_ATTR_RO(level);
 static DEVICE_ATTR_RO(type);
 static DEVICE_ATTR_RO(coherency_line_size);
@@ -327,6 +329,7 @@ static DEVICE_ATTR_RO(shared_cpu_list);
 static DEVICE_ATTR_RO(physical_line_partition);
 
 static struct attribute *cache_default_attrs[] = {
+	&dev_attr_id.attr,
 	&dev_attr_type.attr,
 	&dev_attr_level.attr,
 	&dev_attr_shared_cpu_map.attr,
@@ -350,6 +353,8 @@ cache_default_attrs_is_visible(struct kobject *kobj,
 	const struct cpumask *mask = &this_leaf->shared_cpu_map;
 	umode_t mode = attr->mode;
 
+	if ((attr == &dev_attr_id.attr) && this_leaf->attributes & CACHE_ID)
+		return mode;
 	if ((attr == &dev_attr_type.attr) && this_leaf->type)
 		return mode;
 	if ((attr == &dev_attr_level.attr) && this_leaf->level)
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 2189935..cf6984d 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -18,6 +18,7 @@ enum cache_type {
 
 /**
  * struct cacheinfo - represent a cache leaf node
+ * @id: This cache's id. ID is unique in the same index on the platform.
  * @type: type of the cache - data, inst or unified
  * @level: represents the hierarchy in the multi-level cache
  * @coherency_line_size: size of each cache line usually representing
@@ -44,6 +45,7 @@ enum cache_type {
  * keeping, the remaining members form the core properties of the cache
  */
 struct cacheinfo {
+	unsigned int id;
 	enum cache_type type;
 	unsigned int level;
 	unsigned int coherency_line_size;
@@ -61,6 +63,7 @@ struct cacheinfo {
 #define CACHE_WRITE_ALLOCATE	BIT(3)
 #define CACHE_ALLOCATE_POLICY_MASK	\
 	(CACHE_READ_ALLOCATE | CACHE_WRITE_ALLOCATE)
+#define CACHE_ID		BIT(4)
 
 	struct device_node *of_node;
 	bool disable_sysfs;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 02/33] Documentation, ABI: Add a document entry for cache id
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
  2016-09-08  9:56 ` [PATCH v2 01/33] cacheinfo: Introduce cache id Fenghua Yu
@ 2016-09-08  9:56 ` Fenghua Yu
  2016-09-08 19:33   ` Thomas Gleixner
  2016-09-08  9:56 ` [PATCH v2 03/33] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Add an ABI document entry for /sys/devices/system/cpu/cpu*/cache/index*/id.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 4987417..d5d99dc 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -272,6 +272,23 @@ Description:	Parameters for the CPU cache attributes
 				     the modified cache line is written to main
 				     memory only when it is replaced
 
+
+What:		/sys/devices/system/cpu/cpu*/cache/index*/id
+Date:		July 2016
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	Cache id
+
+		The id identifies a hardware cache of the system within a given
+		cache index in a set of cache indices. The "index" name is
+		simply a nomenclature from CPUID's leaf 4 which enumerates all
+		caches on the system by referring to each one as a cache index.
+		The (cache index, cache id) pair is unique for the whole
+		system.
+
+		Currently id is implemented on x86. On other platforms, id is
+		not enabled yet.
+
+
 What:		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats
 		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat
 		/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub_turbo_stat
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 03/33] x86, intel_cacheinfo: Enable cache id in x86
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
  2016-09-08  9:56 ` [PATCH v2 01/33] cacheinfo: Introduce cache id Fenghua Yu
  2016-09-08  9:56 ` [PATCH v2 02/33] Documentation, ABI: Add a document entry for " Fenghua Yu
@ 2016-09-08  9:56 ` Fenghua Yu
  2016-09-08  9:56 ` [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use Fenghua Yu
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Enable cache id in x86. Cache id comes from APIC ID and CPUID4.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.com>
---
 arch/x86/kernel/cpu/intel_cacheinfo.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index de6626c..8dc5720 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -153,6 +153,7 @@ struct _cpuid4_info_regs {
 	union _cpuid4_leaf_eax eax;
 	union _cpuid4_leaf_ebx ebx;
 	union _cpuid4_leaf_ecx ecx;
+	unsigned int id;
 	unsigned long size;
 	struct amd_northbridge *nb;
 };
@@ -894,6 +895,8 @@ static void __cache_cpumap_setup(unsigned int cpu, int index,
 static void ci_leaf_init(struct cacheinfo *this_leaf,
 			 struct _cpuid4_info_regs *base)
 {
+	this_leaf->id = base->id;
+	this_leaf->attributes = CACHE_ID;
 	this_leaf->level = base->eax.split.level;
 	this_leaf->type = cache_type_map[base->eax.split.type];
 	this_leaf->coherency_line_size =
@@ -920,6 +923,22 @@ static int __init_cache_level(unsigned int cpu)
 	return 0;
 }
 
+/*
+ * The max shared threads number comes from CPUID.4:EAX[25-14] with input
+ * ECX as cache index. Then right shift apicid by the number's order to get
+ * cache id for this cache node.
+ */
+static void get_cache_id(int cpu, struct _cpuid4_info_regs *id4_regs)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	unsigned long num_threads_sharing;
+	int index_msb;
+
+	num_threads_sharing = 1 + id4_regs->eax.split.num_threads_sharing;
+	index_msb = get_count_order(num_threads_sharing);
+	id4_regs->id = c->apicid >> index_msb;
+}
+
 static int __populate_cache_leaves(unsigned int cpu)
 {
 	unsigned int idx, ret;
@@ -931,6 +950,7 @@ static int __populate_cache_leaves(unsigned int cpu)
 		ret = cpuid4_cache_lookup_regs(idx, &id4_regs);
 		if (ret)
 			return ret;
+		get_cache_id(cpu, &id4_regs);
 		ci_leaf_init(this_leaf++, &id4_regs);
 		__cache_cpumap_setup(cpu, idx, &id4_regs);
 	}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (2 preceding siblings ...)
  2016-09-08  9:56 ` [PATCH v2 03/33] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
@ 2016-09-08  9:56 ` Fenghua Yu
  2016-09-08  8:21   ` Thomas Gleixner
  2016-09-08  9:56 ` [PATCH v2 05/33] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
                   ` (28 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

We use ci_cpu_cacheinfo in CAT. Export this function for CAT to reuse.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 drivers/base/cacheinfo.c  | 2 +-
 include/linux/cacheinfo.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 2a21c15..f6e269a 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -29,7 +29,7 @@
 #include <linux/sysfs.h>
 
 /* pointer to per cpu cacheinfo */
-static DEFINE_PER_CPU(struct cpu_cacheinfo, ci_cpu_cacheinfo);
+DEFINE_PER_CPU(struct cpu_cacheinfo, ci_cpu_cacheinfo);
 #define ci_cacheinfo(cpu)	(&per_cpu(ci_cpu_cacheinfo, cpu))
 #define cache_leaves(cpu)	(ci_cacheinfo(cpu)->num_leaves)
 #define per_cpu_cacheinfo(cpu)	(ci_cacheinfo(cpu)->info_list)
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index cf6984d..fa5e829 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -94,6 +94,8 @@ int func(unsigned int cpu)					\
 	return ret;						\
 }
 
+#define ci_cacheinfo(cpu)       (&per_cpu(ci_cpu_cacheinfo, cpu))
+
 struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu);
 int init_cache_level(unsigned int cpu);
 int populate_cache_leaves(unsigned int cpu);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 05/33] x86/intel_rdt: Cache Allocation documentation
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (3 preceding siblings ...)
  2016-09-08  9:56 ` [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use Fenghua Yu
@ 2016-09-08  9:56 ` Fenghua Yu
  2016-09-08  9:57 ` [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:56 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Adds a description of Cache allocation technology, overview of kernel
framework implementation. The framework has APIs to manage class of
service, capacity bitmask(CBM), scheduling support and other
architecture specific implementation. The APIs are used to build the
resctrl interface in later patches.

Cache allocation is a sub-feature of Resource Director Technology (RDT)
or Platform Shared resource control which provides support to control
Platform shared resources like L3 cache.

Cache Allocation Technology provides a way for the Software (OS/VMM) to
restrict cache allocation to a defined 'subset' of cache which may be
overlapping with other 'subsets'. This feature is used when allocating a
line in cache ie when pulling new data into the cache. The tasks are
grouped into CLOS (class of service). OS uses MSR writes to indicate the
CLOSid of the thread when scheduling in and to indicate the cache
capacity associated with the CLOSid. Currently cache allocation is
supported for L3 cache.

More information can be found in the Intel SDM June 2015, Volume 3,
section 17.16.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/intel_rdt.txt | 109 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt.txt

diff --git a/Documentation/x86/intel_rdt.txt b/Documentation/x86/intel_rdt.txt
new file mode 100644
index 0000000..05ec819
--- /dev/null
+++ b/Documentation/x86/intel_rdt.txt
@@ -0,0 +1,109 @@
+        Intel RDT
+        ---------
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@linux.intel.com
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+  1.1 What is RDT and Cache allocation ?
+  1.2 Why is Cache allocation needed ?
+  1.3 Cache allocation implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+
+1. Cache Allocation Technology
+===================================
+
+1.1 What is RDT and Cache allocation
+------------------------------------
+
+Cache allocation is a sub-feature of Resource Director Technology (RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache. Currently L3 Cache is
+the only resource that is supported in RDT. More information can be
+found in the Intel SDM June 2015, Volume 3, section 17.16.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM) to
+restrict cache allocation to a defined 'subset' of cache which may be
+overlapping with other 'subsets'. This feature is used when allocating a
+line in cache ie when pulling new data into the cache. The programming
+of the h/w is done via programming MSRs.
+
+The different cache subsets are identified by CLOS identifier (class of
+service) and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+In todays new processors the number of cores is continuously increasing
+especially in large scale usage models where VMs are used like
+webservers and datacenters. The number of cores increase the number of
+threads or workloads that can simultaneously be run. When
+multi-threaded-applications, VMs, workloads run concurrently they
+compete for shared resources including L3 cache.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer server systems
+with large L3 cache, in the cloud and container context. Examples may be
+large servers running instances of webservers or database servers. In
+such complex systems, these subsets can be used for more careful placing
+of the available cache resources by a centralized root accessible
+interface.
+
+A specific use case may be to solve the noisy neighbour issue when a app
+which is constantly copying data like streaming app is using large
+amount of cache which could have otherwise been used by a high priority
+computing application. Using the cache allocation feature, the streaming
+application can be confined to use a smaller cache and the high priority
+application be awarded a larger amount of cache space.
+
+1.3 Cache allocation implementation Overview
+--------------------------------------------
+
+Kernel has a new field in the task_struct called 'closid' which
+represents the Class of service ID of the task.
+
+There is a 1:1 CLOSid <-> CBM (capacity bit mask) mapping. A CLOS (Class
+of service) is represented by a CLOSid. Each closid would have one CBM
+and would just represent one cache 'subset'.  The tasks would get to
+fill the L3 cache represented by the capacity bit mask or CBM.
+
+The APIs to manage the closid and CBM can be used to develop user
+interfaces.
+
+1.4 Assignment of CBM, CLOS
+---------------------------
+
+The framework provides APIs to manage the closid and CBM which can be
+used to develop user/kernel mode interfaces.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During context switch kernel implements this by writing the CLOSid of
+the task to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when
+there is a change in the CLOSid for the CPU in order to minimize the
+latency incurred during context switch.
+
+The following considerations are done for the PQR MSR write so that it
+has minimal impact on scheduling hot path:
+ - This path doesn't exist on any non-intel platforms.
+ - On Intel platforms, this would not exist by default unless INTEL_RDT
+ is enabled.
+ - remains a no-op when INTEL_RDT is enabled and intel hardware does
+ not support the feature.
+ - When feature is available, does not do MSR write till the user
+ starts using the feature *and* assigns a new cache capacity mask.
+ - per cpu PQR values are cached and the MSR write is only done when
+ there is a task with different PQR is scheduled on the CPU. Typically
+ if the task groups are bound to be scheduled on a set of CPUs, the
+ number of MSR writes is greatly reduced.
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (4 preceding siblings ...)
  2016-09-08  9:56 ` [PATCH v2 05/33] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 11:22   ` Borislav Petkov
  2016-09-08 22:01   ` Shaohua Li
  2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
                   ` (26 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

The documentation describes user interface of how to allocate resource
in Intel RDT.

Please note that the documentation covers generic user interface. Current
patch set code only implemente CAT L3. CAT L2 code will be sent later.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt_ui.txt

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
new file mode 100644
index 0000000..27de386
--- /dev/null
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -0,0 +1,164 @@
+User Interface for Resource Allocation in Intel Resource Director Technology
+
+Copyright (C) 2016 Intel Corporation
+
+Fenghua Yu <fenghua.yu@intel.com>
+Tony Luck <tony.luck@intel.com>
+
+This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+
+To use the feature mount the file system:
+
+ # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
+
+mount options are:
+
+"cdp": Enable code/data prioritization in L3 cache allocations.
+
+"verbose": Output more info in the "info" file under info directory
+	and in dmesg. This is mainly for debug.
+
+
+Resource groups
+---------------
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory. Other groups may be
+created as desired by the system administrator using the "mkdir(1)"
+command, and removed using "rmdir(1)".
+
+There are three files associated with each group:
+
+"tasks": A list of tasks that belongs to this group. Tasks can be
+	added to a group by writing the task ID to the "tasks" file
+	(which will automatically remove them from the previous
+	group to which they belonged). New tasks created by fork(2)
+	and clone(2) are added to the same group as their parent.
+	If a pid is not in any sub partition, it is in root partition
+	(i.e. default partition).
+
+"cpus": A bitmask of logical CPUs assigned to this group. Writing
+	a new mask can add/remove CPUs from this group. Added CPUs
+	are removed from their previous group. Removed ones are
+	given to the default (root) group.
+
+"schemata": A list of all the resources available to this group.
+	Each resource has its own line and format - see below for
+	details.
+
+When a task is running the following rules define which resources
+are available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+CPU that is assigned to some specific group, then the schemata for
+the CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical cpus sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps).  To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The number of bits in the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/max_cbm_len". X86 hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
+of the capacity of the cache. You could partition the cache into four
+equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+
+
+L3 details (code and data prioritization disabled)
+--------------------------------------------------
+With CDP disabled the L3 schemata format is:
+
+	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 details (CDP enabled via mount option to resctrl)
+----------------------------------------------------
+When CDP is enabled, you need to specify separate cache bit masks for
+code and data access. The generic format is:
+
+	L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
+
+where the d_cbm masks are for data access, and the i_cbm masks for code.
+
+
+Example 1
+---------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Example 2
+---------
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 cannot be used by ordinary tasks:
+
+# echo "L3:0=3ff;1=fffff" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+
+# mkdir p0
+# echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+
+# echo 1234 > p0/tasks
+# taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache):
+
+# mkdir p1
+# echo "L3:0=7c00;1=fffff" > p1/schemata
+# echo 5678 > p1/tasks
+# taskset -cp 2 5678
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (5 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 11:50   ` Borislav Petkov
                     ` (2 more replies)
  2016-09-08  9:57 ` [PATCH v2 08/33] x86/intel_rdt: Add Class of service management Fenghua Yu
                   ` (25 subsequent siblings)
  32 siblings, 3 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This patch includes CPUID enumeration routines for Cache allocation and
new values to track resources to the cpuinfo_x86 structure.

Cache allocation provides a way for the Software (OS/VMM) to restrict
cache allocation to a defined 'subset' of cache which may be overlapping
with other 'subsets'. This feature is used when allocating a line in
cache ie when pulling new data into the cache. The programming of the
hardware is done via programming MSRs (model specific registers).

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/cpufeature.h        |  8 +++++--
 arch/x86/include/asm/cpufeatures.h       |  6 +++++-
 arch/x86/include/asm/disabled-features.h |  3 ++-
 arch/x86/include/asm/processor.h         |  3 +++
 arch/x86/include/asm/required-features.h |  3 ++-
 arch/x86/kernel/cpu/common.c             | 19 ++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c          | 37 ++++++++++++++++++++++++++++++++
 7 files changed, 74 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 1d2b69f..9985b4cf 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -28,6 +28,8 @@ enum cpuid_leafs
 	CPUID_8000_000A_EDX,
 	CPUID_7_ECX,
 	CPUID_8000_0007_EBX,
+	CPUID_10_0_EBX,
+	CPUID_10_1_ECX,
 };
 
 #ifdef CONFIG_X86_FEATURE_NAMES
@@ -78,8 +80,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 15, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 17, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 18, feature_bit) ||	\
 	   REQUIRED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define DISABLED_MASK_BIT_SET(feature_bit)				\
 	 ( CHECK_BIT_IN_MASK_WORD(DISABLED_MASK,  0, feature_bit) ||	\
@@ -100,8 +103,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 15, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 17, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 18, feature_bit) ||	\
 	   DISABLED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 92a8308..62d979b9 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -12,7 +12,7 @@
 /*
  * Defines x86 CPU feature bits
  */
-#define NCAPINTS	18	/* N 32-bit words worth of info */
+#define NCAPINTS	19	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -220,6 +220,7 @@
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
 #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT		( 9*32+15) /* Resource Director Technology */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_AVX512DQ	( 9*32+17) /* AVX-512 DQ (Double/Quad granular) Instructions */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
@@ -286,6 +287,9 @@
 #define X86_FEATURE_SUCCOR	(17*32+1) /* Uncorrectable error containment and recovery */
 #define X86_FEATURE_SMCA	(17*32+3) /* Scalable MCA */
 
+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 18 */
+#define X86_FEATURE_CAT_L3      (18*32+ 1) /* Cache Allocation L3 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad..8b45e08 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -57,6 +57,7 @@
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define DISABLED_MASK18	0
+#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 63def95..e940b2d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -119,6 +119,9 @@ struct cpuinfo_x86 {
 	int			x86_cache_occ_scale;	/* scale to bytes */
 	int			x86_power;
 	unsigned long		loops_per_jiffy;
+	/* Cache Allocation values: */
+	u16			x86_l3_max_cbm_len;
+	u16			x86_l3_max_closid;
 	/* cpuid returned max cores value: */
 	u16			 x86_max_cores;
 	u16			apicid;
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c..6847d85 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -100,6 +100,7 @@
 #define REQUIRED_MASK15	0
 #define REQUIRED_MASK16	0
 #define REQUIRED_MASK17	0
-#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define REQUIRED_MASK18	0
+#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 809eda0..997d1d5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -711,6 +711,25 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		}
 	}
 
+	/* Additional Intel-defined flags: level 0x00000010 */
+	if (c->cpuid_level >= 0x00000010) {
+		u32 eax, ebx, ecx, edx;
+
+		cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[CPUID_10_0_EBX] = ebx;
+
+		if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_l3_max_closid = edx + 1;
+			c->x86_l3_max_cbm_len = eax + 1;
+			c->x86_capability[CPUID_10_1_ECX] = ecx;
+		} else {
+			c->x86_l3_max_closid = -1;
+			c->x86_l3_max_cbm_len = -1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	eax = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = eax;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..fcd0642
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,37 @@
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2015-05-25 Written by
+ *    Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2015, volume 3, section 17.15.
+ */
+#include <linux/slab.h>
+#include <linux/err.h>
+
+static int __init intel_rdt_late_init(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	if (!cpu_has(c, X86_FEATURE_CAT_L3))
+		return -ENODEV;
+
+	pr_info("Intel cache allocation detected\n");
+
+	return 0;
+}
+
+late_initcall(intel_rdt_late_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 08/33] x86/intel_rdt: Add Class of service management
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (6 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08  8:56   ` Thomas Gleixner
  2016-09-12 16:02   ` Nilay Vaish
  2016-09-08  9:57 ` [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
                   ` (24 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Adds some data-structures and APIs to support Class of service
management(closid). There is a new clos_cbm table which keeps a 1:1
mapping between closid and capacity bit mask (cbm)
and a count of usage of closid. Each task would be associated with a
Closid at a time and this patch adds a new field closid to task_struct
to keep track of the same.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 12 ++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 81 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 91 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..68bab26
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,12 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_INTEL_RDT
+
+struct clos_cbm_table {
+	unsigned long cbm;
+	unsigned int clos_refcnt;
+};
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index fcd0642..b25940a 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -21,17 +21,94 @@
  */
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
+ */
+static struct clos_cbm_table *cctable;
+/*
+ * closid availability bit map.
+ */
+unsigned long *closmap;
+static DEFINE_MUTEX(rdtgroup_mutex);
+
+static inline void closid_get(u32 closid)
+{
+	struct clos_cbm_table *cct = &cctable[closid];
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	cct->clos_refcnt++;
+}
+
+static int closid_alloc(u32 *closid)
+{
+	u32 maxid;
+	u32 id;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	maxid = boot_cpu_data.x86_cache_max_closid;
+	id = find_first_zero_bit(closmap, maxid);
+	if (id == maxid)
+		return -ENOSPC;
+
+	set_bit(id, closmap);
+	closid_get(id);
+	*closid = id;
+
+	return 0;
+}
+
+static inline void closid_free(u32 closid)
+{
+	clear_bit(closid, closmap);
+	cctable[closid].cbm = 0;
+}
+
+static void closid_put(u32 closid)
+{
+	struct clos_cbm_table *cct = &cctable[closid];
+
+	lockdep_assert_held(&rdtgroup_mutex);
+	if (WARN_ON(!cct->clos_refcnt))
+		return;
+
+	if (!--cct->clos_refcnt)
+		closid_free(closid);
+}
 
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
+	u32 maxid;
+	int err = 0, size;
 
 	if (!cpu_has(c, X86_FEATURE_CAT_L3))
 		return -ENODEV;
 
-	pr_info("Intel cache allocation detected\n");
+	maxid = c->x86_cache_max_closid;
 
-	return 0;
+	size = maxid * sizeof(struct clos_cbm_table);
+	cctable = kzalloc(size, GFP_KERNEL);
+	if (!cctable) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	size = BITS_TO_LONGS(maxid) * sizeof(long);
+	closmap = kzalloc(size, GFP_KERNEL);
+	if (!closmap) {
+		kfree(cctable);
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	pr_info("Intel cache allocation enabled\n");
+out_err:
+
+	return err;
 }
 
 late_initcall(intel_rdt_late_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (7 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 08/33] x86/intel_rdt: Add Class of service management Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08  9:40   ` Thomas Gleixner
  2016-09-12 16:10   ` Nilay Vaish
  2016-09-08  9:57 ` [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
                   ` (23 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This patch adds different APIs to manage the L3 cache capacity bitmask.
The capacity bit mask(CBM) needs to have only contiguous bits set. The
current implementation has a global CBM for each class of service id.
There are APIs added to update the CBM via MSR write to IA32_L3_MASK_n
on all packages. Other APIs are to read and write entries to the
clos_cbm_table.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  4 ++++
 arch/x86/kernel/cpu/intel_rdt.c  | 48 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 68bab26..68c9a79 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,6 +3,10 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#define MAX_CBM_LENGTH			32
+#define IA32_L3_CBM_BASE		0xc90
+#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+
 struct clos_cbm_table {
 	unsigned long cbm;
 	unsigned int clos_refcnt;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index b25940a..9cf3a7d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -31,8 +31,22 @@ static struct clos_cbm_table *cctable;
  * closid availability bit map.
  */
 unsigned long *closmap;
+/*
+ * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
+ */
+static cpumask_t rdt_cpumask;
+/*
+ * Temporary cpumask used during hot cpu notificaiton handling. The usage
+ * is serialized by hot cpu locks.
+ */
+static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdtgroup_mutex);
 
+struct rdt_remote_data {
+	int msr;
+	u64 val;
+};
+
 static inline void closid_get(u32 closid)
 {
 	struct clos_cbm_table *cct = &cctable[closid];
@@ -79,11 +93,41 @@ static void closid_put(u32 closid)
 		closid_free(closid);
 }
 
+static void msr_cpu_update(void *arg)
+{
+	struct rdt_remote_data *info = arg;
+
+	wrmsrl(info->msr, info->val);
+}
+
+/*
+ * msr_update_all() - Update the msr for all packages.
+ */
+static inline void msr_update_all(int msr, u64 val)
+{
+	struct rdt_remote_data info;
+
+	info.msr = msr;
+	info.val = val;
+	on_each_cpu_mask(&rdt_cpumask, msr_cpu_update, &info, 1);
+}
+
+static inline bool rdt_cpumask_update(int cpu)
+{
+	cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu));
+	if (cpumask_empty(&tmp_cpumask)) {
+		cpumask_set_cpu(cpu, &rdt_cpumask);
+		return true;
+	}
+
+	return false;
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
 	u32 maxid;
-	int err = 0, size;
+	int err = 0, size, i;
 
 	if (!cpu_has(c, X86_FEATURE_CAT_L3))
 		return -ENODEV;
@@ -105,6 +149,8 @@ static int __init intel_rdt_late_init(void)
 		goto out_err;
 	}
 
+	for_each_online_cpu(i)
+		rdt_cpumask_update(i);
 	pr_info("Intel cache allocation enabled\n");
 out_err:
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (8 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08  9:53   ` Thomas Gleixner
  2016-09-13 17:55   ` Nilay Vaish
  2016-09-08  9:57 ` [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
                   ` (22 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For
Cache Allocation, MSR write would let the task fill in the cache
'subset' represented by the task's capacity bit mask.

The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
CLOSid. During context switch kernel implements this by writing the
CLOSid of the task belongs to the CPU's IA32_PQR_ASSOC MSR.

This patch also implements a common software cache for IA32_PQR_MSR
(RMID 0:9, CLOSId 32:63) to be used by both Cache monitoring (CMT) and
Cache allocation. CMT updates the RMID where as cache_alloc updates the
CLOSid in the software cache. During scheduling when the new RMID/CLOSid
value is different from the cached values, IA32_PQR_MSR is updated.
Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250
 cycles) this software cache is necessary to avoid reading the MSR to
compare the current CLOSid value.

The following considerations are done for the PQR MSR write so that it
minimally impacts scheduler hot path:
 - This path does not exist on any non-intel platforms.
 - On Intel platforms, this would not exist by default unless INTEL_RDT
 is enabled.
 - remains a no-op when INTEL_RDT is enabled and intel SKU does not
 support the feature.
 - When feature is available and enabled, never does MSR write till the
 user manually starts using one of the capacity bit masks.
 - MSR write is only done when there is a task with different Closid is
 scheduled on the CPU. Typically if the task groups are bound to be
 scheduled on a set of CPUs, the number of MSR writes is greatly
 reduced.
 - A per CPU cache of CLOSids is maintained to do the check so that we
 don't have to do a rdmsr which actually costs a lot of cycles.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/events/intel/cqm.c       | 24 ++----------------------
 arch/x86/include/asm/intel_rdt.h  | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/pqr_common.h | 27 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c   | 18 ++++++++++++++++++
 arch/x86/kernel/process_64.c      |  6 ++++++
 5 files changed, 89 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/include/asm/pqr_common.h

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 783c49d..0ed56ae 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -7,9 +7,9 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/pqr_common.h>
 #include "../perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
@@ -24,32 +24,13 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
 unsigned int mbm_socket_max;
 
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid:		The cached Resource Monitoring ID
- * @closid:		The cached Class Of Service ID
- * @rmid_usecnt:	The usage counter for rmid
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
-	u32			rmid;
-	u32			closid;
-	int			rmid_usecnt;
-};
-
 /*
  * The cached intel_pqr_state is strictly per CPU and can never be
  * updated from a remote CPU. Both functions which modify the state
  * (intel_cqm_event_start and intel_cqm_event_stop) are called with
  * interrupts disabled, which is sufficient for the protection.
  */
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 static struct hrtimer *mbm_timers;
 /**
  * struct sample - mbm event's (local or total) data
@@ -1583,7 +1564,6 @@ static int intel_cqm_cpu_starting(unsigned int cpu)
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	state->rmid = 0;
-	state->closid = 0;
 	state->rmid_usecnt = 0;
 
 	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 68c9a79..8512174 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,14 +3,50 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#include <linux/jump_label.h>
+
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
 
+extern struct static_key rdt_enable_key;
+void __intel_rdt_sched_in(void *dummy);
+
 struct clos_cbm_table {
 	unsigned long cbm;
 	unsigned int clos_refcnt;
 };
 
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports L3 cache allocation.
+ * - When support is present and enabled, does not do any
+ * IA32_PQR_MSR writes until the user starts really using the feature
+ * ie creates a rdtgroup directory and assigns a cache_mask thats
+ * different from the root rdtgroup's cache_mask.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in. That
+ * means the task belongs to a different rdtgroup.
+ * - Closids are allocated so that different rdtgroup directories
+ * with same cache_mask gets the same CLOSid. This minimizes CLOSids
+ * used and reduces MSR write frequency.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+	/*
+	 * Call the schedule in code only when RDT is enabled.
+	 */
+	if (static_key_false(&rdt_enable_key))
+		__intel_rdt_sched_in(NULL);
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
 #endif
 #endif
diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
new file mode 100644
index 0000000..11e985c
--- /dev/null
+++ b/arch/x86/include/asm/pqr_common.h
@@ -0,0 +1,27 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid:		The cached Resource Monitoring ID
+ * @closid:		The cached Class Of Service ID
+ * @rmid_usecnt:	The usage counter for rmid
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+	u32			rmid;
+	u32			closid;
+	int			rmid_usecnt;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 9cf3a7d..9f30492 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -21,6 +21,8 @@
  */
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/sched.h>
+#include <asm/pqr_common.h>
 #include <asm/intel_rdt.h>
 
 /*
@@ -41,12 +43,28 @@ static cpumask_t rdt_cpumask;
  */
 static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdtgroup_mutex);
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
 
 struct rdt_remote_data {
 	int msr;
 	u64 val;
 };
 
+void __intel_rdt_sched_in(void *dummy)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	/*
+	 * Currently closid is always 0. When  user interface is added,
+	 * closid will come from user interface.
+	 */
+	if (state->closid == 0)
+		return;
+
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
+	state->closid = 0;
+}
+
 static inline void closid_get(u32 closid)
 {
 	struct clos_cbm_table *cct = &cctable[closid];
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 63236d8..1c98f80 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -48,6 +48,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/intel_rdt.h>
 #include <asm/xen/hypervisor.h>
 
 asmlinkage extern void ret_from_fork(void);
@@ -472,6 +473,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			loadsegment(ss, __KERNEL_DS);
 	}
 
+	/*
+	 * Load the Intel cache allocation PQR MSR.
+	 */
+	intel_rdt_sched_in();
+
 	return prev_p;
 }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (9 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:03   ` Thomas Gleixner
  2016-09-13 18:18   ` Nilay Vaish
  2016-09-08  9:57 ` [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
                   ` (21 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This patch adds hot plug cpu support for Intel Cache allocation. Support
includes updating the cache bitmask MSRs IA32_L3_QOS_n when a new CPU
package comes online or goes offline. The IA32_L3_QOS_n MSRs are one per
Class of service on each CPU package. The new package's MSRs are
synchronized with the values of existing MSRs. Also the software cache
for IA32_PQR_ASSOC MSRs are reset during hot cpu notifications.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 85 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 9f30492..4537658 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -21,6 +21,7 @@
  */
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/cpu.h>
 #include <linux/sched.h>
 #include <asm/pqr_common.h>
 #include <asm/intel_rdt.h>
@@ -130,6 +131,9 @@ static inline void msr_update_all(int msr, u64 val)
 	on_each_cpu_mask(&rdt_cpumask, msr_cpu_update, &info, 1);
 }
 
+/*
+ * Set only one cpu in cpumask in all cpus that share the same cache.
+ */
 static inline bool rdt_cpumask_update(int cpu)
 {
 	cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu));
@@ -141,6 +145,80 @@ static inline bool rdt_cpumask_update(int cpu)
 	return false;
 }
 
+/*
+ * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
+ * which are one per CLOSid on the current package.
+ */
+static void cbm_update_msrs(void *dummy)
+{
+	int maxid = boot_cpu_data.x86_cache_max_closid;
+	struct rdt_remote_data info;
+	unsigned int i;
+
+	for (i = 0; i < maxid; i++) {
+		if (cctable[i].clos_refcnt) {
+			info.msr = CBM_FROM_INDEX(i);
+			info.val = cctable[i].cbm;
+			msr_cpu_update(&info);
+		}
+	}
+}
+
+static int intel_rdt_online_cpu(unsigned int cpu)
+{
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
+
+	state->closid = 0;
+	mutex_lock(&rdtgroup_mutex);
+	/* The cpu is set in root rdtgroup after online. */
+	cpumask_set_cpu(cpu, &root_rdtgrp->cpu_mask);
+	per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
+	/*
+	 * If the cpu is first time found and set in its siblings that
+	 * share the same cache, update the CBM MSRs for the cache.
+	 */
+	if (rdt_cpumask_update(cpu))
+		smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static int clear_rdtgroup_cpumask(unsigned int cpu)
+{
+	struct list_head *l;
+	struct rdtgroup *r;
+
+	list_for_each(l, &rdtgroup_lists) {
+		r = list_entry(l, struct rdtgroup, rdtgroup_list);
+		if (cpumask_test_cpu(cpu, &r->cpu_mask)) {
+			cpumask_clear_cpu(cpu, &r->cpu_mask);
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static int intel_rdt_offline_cpu(unsigned int cpu)
+{
+	int i;
+
+	mutex_lock(&rdtgroup_mutex);
+	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
+		mutex_unlock(&rdtgroup_mutex);
+		return;
+	}
+
+	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
+	cpumask_clear_cpu(cpu, &tmp_cpumask);
+	i = cpumask_any(&tmp_cpumask);
+
+	if (i < nr_cpu_ids)
+		cpumask_set_cpu(i, &rdt_cpumask);
+
+	clear_rdtgroup_cpumask(cpu);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -169,6 +247,13 @@ static int __init intel_rdt_late_init(void)
 
 	for_each_online_cpu(i)
 		rdt_cpumask_update(i);
+
+	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+				"AP_INTEL_RDT_ONLINE",
+				intel_rdt_online_cpu, intel_rdt_offline_cpu);
+	if (err < 0)
+		goto out_err;
+
 	pr_info("Intel cache allocation enabled\n");
 out_err:
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (10 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:08   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 13/33] Define CONFIG_INTEL_RDT Fenghua Yu
                   ` (20 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This patch is specific to Intel haswell (hsw) server SKUs. Cache
Allocation on hsw server needs to be enumerated separately as HSW does
not have support for CPUID enumeration for Cache Allocation. This patch
does a probe by writing a CLOSid (Class of service id) into high 32 bits
of IA32_PQR_MSR and see if the bits stick. The probe is only done after
confirming that the CPU is HSW server. Other hardcoded values are:

 - L3 cache bit mask must be at least two bits.
 - Maximum CLOSids supported is always 4.
 - Maximum bits support in cache bit mask is always 20.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 43 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 4537658..fb5a9a9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -35,6 +35,10 @@ static struct clos_cbm_table *cctable;
  */
 unsigned long *closmap;
 /*
+ * Minimum bits required in Cache bitmask.
+ */
+unsigned int min_bitmask_len = 1;
+/*
  * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
  */
 static cpumask_t rdt_cpumask;
@@ -51,6 +55,42 @@ struct rdt_remote_data {
 	u64 val;
 };
 
+/*
+ * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
+ * as it does not have CPUID enumeration support for Cache allocation.
+ *
+ * Probes by writing to the high 32 bits(CLOSid) of the IA32_PQR_MSR and
+ * testing if the bits stick. Max CLOSids is always 4 and max cbm length
+ * is always 20 on hsw server parts. The minimum cache bitmask length
+ * allowed for HSW server is always 2 bits. Hardcode all of them.
+ */
+static inline bool cache_alloc_hsw_probe(void)
+{
+	u32 l, h_old, h_new, h_tmp;
+
+	if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+		return false;
+
+	/*
+	 * Default value is always 0 if feature is present.
+	 */
+	h_tmp = h_old ^ 0x1U;
+	if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+	    rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+		return false;
+
+	if (h_tmp != h_new)
+		return false;
+
+	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+	boot_cpu_data.x86_cache_max_closid = 4;
+	boot_cpu_data.x86_cache_max_cbm_len = 20;
+	min_bitmask_len = 2;
+
+	return true;
+}
+
 void __intel_rdt_sched_in(void *dummy)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
@@ -225,9 +265,6 @@ static int __init intel_rdt_late_init(void)
 	u32 maxid;
 	int err = 0, size, i;
 
-	if (!cpu_has(c, X86_FEATURE_CAT_L3))
-		return -ENODEV;
-
 	maxid = c->x86_cache_max_closid;
 
 	size = maxid * sizeof(struct clos_cbm_table);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 13/33] Define CONFIG_INTEL_RDT
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (11 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:14   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 14/33] x86/intel_rdt: Intel Code Data Prioritization detection Fenghua Yu
                   ` (19 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

CONFIG_INTEL_RDT is defined. The option provides support for resource
allocation which is a sub-feature of Intel Resource Director Technology
(RDT).

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a1f0ce..6782127 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -406,6 +406,19 @@ config GOLDFISH
        def_bool y
        depends on X86_GOLDFISH
 
+config INTEL_RDT
+	bool "Intel Resource Director Technology support"
+	default n
+	depends on X86_64 && CPU_SUP_INTEL
+	help
+	  This option provides support for resource allocation which is a
+	  sub-feature of Intel Resource Director Technology(RDT).
+	  Current implementation supports L3 cache allocation.
+	  Using this feature a user can specify the amount of L3 cache space
+	  into which an application can fill.
+
+	  Say N if unsure.
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 14/33] x86/intel_rdt: Intel Code Data Prioritization detection
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (12 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 13/33] Define CONFIG_INTEL_RDT Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08  9:57 ` [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization Fenghua Yu
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This patch adds enumeration support for Code Data Prioritization(CDP)
feature found in future Intel Xeon processors. It includes CPUID
enumeration routines for CDP.

CDP is an extension to Cache Allocation and lets threads allocate subset
of L3 cache for code and data separately. The allocation is represented
by the code or data cache capacity bit mask(cbm) MSRs
IA32_L3_QOS_MASK_n. Each Class of service would be associated with one
dcache_cbm and one icache_cbm MSR and hence the number of available
CLOSids is halved with CDP. The association for a CLOSid 'n' is shown
below :

data_cbm_address (n) = base + (n <<1)
code_cbm_address (n) = base + (n <<1) +1.
During scheduling the kernel writes the CLOSid
of the thread to IA32_PQR_ASSOC_MSR.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/cpufeature.h        | 6 ++++--
 arch/x86/include/asm/cpufeatures.h       | 5 ++++-
 arch/x86/include/asm/disabled-features.h | 3 ++-
 arch/x86/include/asm/required-features.h | 3 ++-
 arch/x86/kernel/cpu/intel_rdt.c          | 2 ++
 5 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 9985b4cf..7a54f87 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -81,8 +81,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 17, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 18, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 19, feature_bit) ||	\
 	   REQUIRED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 20))
 
 #define DISABLED_MASK_BIT_SET(feature_bit)				\
 	 ( CHECK_BIT_IN_MASK_WORD(DISABLED_MASK,  0, feature_bit) ||	\
@@ -104,8 +105,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 17, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 18, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 19, feature_bit) ||	\
 	   DISABLED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 20))
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 62d979b9..03f5d39 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -12,7 +12,7 @@
 /*
  * Defines x86 CPU feature bits
  */
-#define NCAPINTS	19	/* N 32-bit words worth of info */
+#define NCAPINTS	20	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -290,6 +290,9 @@
 /* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 18 */
 #define X86_FEATURE_CAT_L3      (18*32+ 1) /* Cache Allocation L3 */
 
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x00000010:1 (ecx), word 19 */
+#define X86_FEATURE_CDP_L3	(19*32+ 2) /* Code Data Prioritization L3 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8b45e08..49d3c9f 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -58,6 +58,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
-#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
+#define DISABLED_MASK19	0
+#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index 6847d85..fa57000 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -101,6 +101,7 @@
 #define REQUIRED_MASK16	0
 #define REQUIRED_MASK17	0
 #define REQUIRED_MASK18	0
-#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
+#define REQUIRED_MASK19	0
+#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index fb5a9a9..a4937d4 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -292,6 +292,8 @@ static int __init intel_rdt_late_init(void)
 		goto out_err;
 
 	pr_info("Intel cache allocation enabled\n");
+	if (cpu_has(c, X86_FEATURE_CDP_L3))
+		pr_info("Intel code data prioritization detected\n");
 out_err:
 
 	return err;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (13 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 14/33] x86/intel_rdt: Intel Code Data Prioritization detection Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:18   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP Fenghua Yu
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

On Intel SKUs that support Code Data Prioritization(CDP), intel_rdt
operates in 2 modes - legacy cache allocation mode/default or CDP mode.

When CDP is enabled, the number of available CLOSids is halved. Hence the
enabling is done when less than half the number of CLOSids available are
used. When CDP is enabled each CLOSid maps to a
data cache mask and an instruction cache mask. The enabling itself is done
by writing to the IA32_PQOS_CFG MSR and can dynamically be enabled or
disabled.

CDP is disabled when for each (dcache_cbm,icache_cbm) pair, the
dcache_cbm = icache_cbm.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  7 ++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 46 +++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 8512174..4e05c6e 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -8,6 +8,7 @@
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+#define MSR_IA32_PQOS_CFG		0xc81
 
 extern struct static_key rdt_enable_key;
 void __intel_rdt_sched_in(void *dummy);
@@ -17,6 +18,12 @@ struct clos_cbm_table {
 	unsigned int clos_refcnt;
 };
 
+struct clos_config {
+	unsigned long *closmap;
+	u32 max_closid;
+	u32 closids_used;
+};
+
 /*
  * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
  *
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index a4937d4..e0f23b6 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -31,10 +31,6 @@
  */
 static struct clos_cbm_table *cctable;
 /*
- * closid availability bit map.
- */
-unsigned long *closmap;
-/*
  * Minimum bits required in Cache bitmask.
  */
 unsigned int min_bitmask_len = 1;
@@ -49,6 +45,11 @@ static cpumask_t rdt_cpumask;
 static cpumask_t tmp_cpumask;
 static DEFINE_MUTEX(rdtgroup_mutex);
 struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+struct clos_config cconfig;
+bool cdp_enabled;
+
+#define __DCBM_TABLE_INDEX(x)	(x << 1)
+#define __ICBM_TABLE_INDEX(x)	((x << 1) + 1)
 
 struct rdt_remote_data {
 	int msr;
@@ -122,22 +123,28 @@ static int closid_alloc(u32 *closid)
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
-	maxid = boot_cpu_data.x86_cache_max_closid;
-	id = find_first_zero_bit(closmap, maxid);
+	maxid = cconfig.max_closid;
+	id = find_first_zero_bit(cconfig.closmap, maxid);
 	if (id == maxid)
 		return -ENOSPC;
 
-	set_bit(id, closmap);
+	set_bit(id, cconfig.closmap);
 	closid_get(id);
 	*closid = id;
+	cconfig.closids_used++;
 
 	return 0;
 }
 
 static inline void closid_free(u32 closid)
 {
-	clear_bit(closid, closmap);
+	clear_bit(closid, cconfig.closmap);
 	cctable[closid].cbm = 0;
+
+	if (WARN_ON(!cconfig.closids_used))
+		return;
+
+	cconfig.closids_used--;
 }
 
 static void closid_put(u32 closid)
@@ -171,6 +178,21 @@ static inline void msr_update_all(int msr, u64 val)
 	on_each_cpu_mask(&rdt_cpumask, msr_cpu_update, &info, 1);
 }
 
+static bool code_data_mask_equal(void)
+{
+	int i, dindex, iindex;
+
+	for (i = 0; i < cconfig.max_closid; i++) {
+		dindex = __DCBM_TABLE_INDEX(i);
+		iindex = __ICBM_TABLE_INDEX(i);
+		if (cctable[dindex].clos_refcnt &&
+		     (cctable[dindex].cbm != cctable[iindex].cbm))
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * Set only one cpu in cpumask in all cpus that share the same cache.
  */
@@ -191,7 +213,7 @@ static inline bool rdt_cpumask_update(int cpu)
  */
 static void cbm_update_msrs(void *dummy)
 {
-	int maxid = boot_cpu_data.x86_cache_max_closid;
+	int maxid = cconfig.max_closid;
 	struct rdt_remote_data info;
 	unsigned int i;
 
@@ -199,7 +221,7 @@ static void cbm_update_msrs(void *dummy)
 		if (cctable[i].clos_refcnt) {
 			info.msr = CBM_FROM_INDEX(i);
 			info.val = cctable[i].cbm;
-			msr_cpu_update(&info);
+			msr_cpu_update((void *) &info);
 		}
 	}
 }
@@ -275,8 +297,8 @@ static int __init intel_rdt_late_init(void)
 	}
 
 	size = BITS_TO_LONGS(maxid) * sizeof(long);
-	closmap = kzalloc(size, GFP_KERNEL);
-	if (!closmap) {
+	cconfig.closmap = kzalloc(size, GFP_KERNEL);
+	if (!cconfig.closmap) {
 		kfree(cctable);
 		err = -ENOMEM;
 		goto out_err;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (14 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:29   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization Fenghua Yu
                   ` (16 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Add support to manage CLOSid(CLass Of Service id) and capacity
bitmask(cbm) for code data prioritization(CDP).

Closid management includes changes to allocating, freeing closid and
closid_get and closid_put and changes to closid availability map during
CDP set up. CDP has a separate cbm for code and data.

Each closid is mapped to a (dcache_cbm, icache_cbm) pair when cdp mode
is enabled.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index e0f23b6..9cee3fe 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -27,7 +27,13 @@
 #include <asm/intel_rdt.h>
 
 /*
- * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
+ * During cache alloc mode cctable maintains 1:1 mapping between
+ * CLOSid and cache bitmask.
+ *
+ * During CDP mode, the cctable maintains a 1:2 mapping between the closid
+ * and (dcache_cbm, icache_cbm) pair.
+ * index of a dcache_cbm for CLOSid 'n' = n << 1.
+ * index of a icache_cbm for CLOSid 'n' = n << 1 + 1
  */
 static struct clos_cbm_table *cctable;
 /*
@@ -50,6 +56,13 @@ bool cdp_enabled;
 
 #define __DCBM_TABLE_INDEX(x)	(x << 1)
 #define __ICBM_TABLE_INDEX(x)	((x << 1) + 1)
+#define __DCBM_MSR_INDEX(x)			\
+	CBM_FROM_INDEX(__DCBM_TABLE_INDEX(x))
+#define __ICBM_MSR_INDEX(x)			\
+	CBM_FROM_INDEX(__ICBM_TABLE_INDEX(x))
+
+#define DCBM_TABLE_INDEX(x)	(x << cdp_enabled)
+#define ICBM_TABLE_INDEX(x)	((x << cdp_enabled) + cdp_enabled)
 
 struct rdt_remote_data {
 	int msr;
@@ -107,9 +120,12 @@ void __intel_rdt_sched_in(void *dummy)
 	state->closid = 0;
 }
 
+/*
+ * When cdp mode is enabled, refcnt is maintained in the dcache_cbm entry.
+ */
 static inline void closid_get(u32 closid)
 {
-	struct clos_cbm_table *cct = &cctable[closid];
+	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -139,7 +155,7 @@ static int closid_alloc(u32 *closid)
 static inline void closid_free(u32 closid)
 {
 	clear_bit(closid, cconfig.closmap);
-	cctable[closid].cbm = 0;
+	cctable[DCBM_TABLE_INDEX(closid)].cbm = 0;
 
 	if (WARN_ON(!cconfig.closids_used))
 		return;
@@ -149,7 +165,7 @@ static inline void closid_free(u32 closid)
 
 static void closid_put(u32 closid)
 {
-	struct clos_cbm_table *cct = &cctable[closid];
+	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
 
 	lockdep_assert_held(&rdtgroup_mutex);
 	if (WARN_ON(!cct->clos_refcnt))
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (15 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:34   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct Fenghua Yu
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Updates hot cpu notification handling for code data prioritization(cdp).
The capacity bitmask(cbm) is global for both data and instruction and we
need to update the new online package with all the cbms by writing to
the IA32_L3_QOS_n MSRs.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 9cee3fe..1bcff29 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -223,6 +223,26 @@ static inline bool rdt_cpumask_update(int cpu)
 	return false;
 }
 
+static void cbm_update_msr(u32 index)
+{
+	struct rdt_remote_data info;
+	int dindex;
+
+	dindex = DCBM_TABLE_INDEX(index);
+	if (cctable[dindex].clos_refcnt) {
+
+		info.msr = CBM_FROM_INDEX(dindex);
+		info.val = cctable[dindex].cbm;
+		msr_cpu_update((void *) &info);
+
+		if (cdp_enabled) {
+			info.msr = __ICBM_MSR_INDEX(index);
+			info.val = cctable[dindex + 1].cbm;
+			msr_cpu_update((void *) &info);
+		}
+	}
+}
+
 /*
  * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
  * which are one per CLOSid on the current package.
@@ -230,15 +250,10 @@ static inline bool rdt_cpumask_update(int cpu)
 static void cbm_update_msrs(void *dummy)
 {
 	int maxid = cconfig.max_closid;
-	struct rdt_remote_data info;
 	unsigned int i;
 
 	for (i = 0; i < maxid; i++) {
-		if (cctable[i].clos_refcnt) {
-			info.msr = CBM_FROM_INDEX(i);
-			info.val = cctable[i].cbm;
-			msr_cpu_update((void *) &info);
-		}
+		cbm_update_msr(i);
 	}
 }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (16 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:36   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 19/33] magic number for resctrl file system Fenghua Yu
                   ` (14 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

rg_list is linked list to connect to other tasks in a rdtgroup.

The point of rdtgroup allows the task to access its own rdtgroup directly.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/sched.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..4b1dce0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1766,6 +1766,9 @@ struct task_struct {
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
 #endif
+#ifdef CONFIG_INTEL_RDT
+	struct rdtgroup *rdtgroup;
+#endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
 #ifdef CONFIG_COMPAT
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 19/33] magic number for resctrl file system
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (17 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 10:41   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c Fenghua Yu
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 include/uapi/linux/magic.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e398bea..33f1d64 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -57,6 +57,8 @@
 #define CGROUP_SUPER_MAGIC	0x27e0eb
 #define CGROUP2_SUPER_MAGIC	0x63677270
 
+#define RDTGROUP_SUPER_MAGIC	0x7655821
+
 
 #define STACK_END_MAGIC		0x57AC6E9D
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (18 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 19/33] magic number for resctrl file system Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 12:36   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface Fenghua Yu
                   ` (12 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

The header mainly provides functions to call from the user interface
file intel_rdt_rdtgroup.c.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 77 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 72 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 4e05c6e..a4f794b 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -3,27 +3,94 @@
 
 #ifdef CONFIG_INTEL_RDT
 
+#include <linux/seq_file.h>
 #include <linux/jump_label.h>
 
-#define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
-#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-#define MSR_IA32_PQOS_CFG		0xc81
+#define L3_CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+
+#define MSR_IA32_L3_QOS_CFG		0xc81
+
+enum resource_type {
+	RESOURCE_L3  = 0,
+	RESOURCE_NUM = 1,
+};
+
+#define MAX_CACHE_LEAVES        4
+#define MAX_CACHE_DOMAINS       64
+
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_l3_domain);
+DECLARE_PER_CPU_READ_MOSTLY(struct rdtgroup *, cpu_rdtgroup);
 
 extern struct static_key rdt_enable_key;
 void __intel_rdt_sched_in(void *dummy);
 
+extern bool cdp_enabled;
+
+struct rdt_opts {
+	bool cdp_enabled;
+	bool verbose;
+	bool simulate_cat_l3;
+};
+
+struct cache_domain {
+	cpumask_t shared_cpu_map[MAX_CACHE_DOMAINS];
+	unsigned int max_cache_domains_num;
+	unsigned int level;
+	unsigned int shared_cache_id[MAX_CACHE_DOMAINS];
+};
+
+extern struct rdt_opts rdt_opts;
+
 struct clos_cbm_table {
 	unsigned long cbm;
 	unsigned int clos_refcnt;
 };
 
 struct clos_config {
-	unsigned long *closmap;
+	unsigned long **closmap;
 	u32 max_closid;
-	u32 closids_used;
 };
 
+struct shared_domain {
+	struct cpumask cpumask;
+	int l3_domain;
+};
+
+#define for_each_cache_domain(domain, start_domain, max_domain)	\
+	for (domain = start_domain; domain < max_domain; domain++)
+
+extern struct clos_config cconfig;
+extern struct shared_domain *shared_domain;
+extern int shared_domain_num;
+
+extern struct rdtgroup *root_rdtgrp;
+
+extern struct clos_cbm_table **l3_cctable;
+
+extern unsigned int min_bitmask_len;
+extern void msr_cpu_update(void *arg);
+extern inline void closid_get(u32 closid, int domain);
+extern void closid_put(u32 closid, int domain);
+extern void closid_free(u32 closid, int domain, int level);
+extern int closid_alloc(u32 *closid, int domain);
+extern bool cat_l3_enabled;
+extern unsigned int get_domain_num(int level);
+extern struct shared_domain *shared_domain;
+extern int shared_domain_num;
+extern inline int get_dcbm_table_index(int x);
+extern inline int get_icbm_table_index(int x);
+
+extern int get_cache_leaf(int level, int cpu);
+
+extern void cbm_update_l3_msr(void *pindex);
+extern int level_to_leaf(int level);
+
+extern void init_msrs(bool cdpenabled);
+extern bool cat_enabled(int level);
+extern u64 max_cbm(int level);
+extern u32 max_cbm_len(int level);
+
 /*
  * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
  *
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (19 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 12:44   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources Fenghua Yu
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

This is header file for user interface file intel_rdt_rdtgroup.c.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt_rdtgroup.h | 150 ++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)
 create mode 100644 arch/x86/include/asm/intel_rdt_rdtgroup.h

diff --git a/arch/x86/include/asm/intel_rdt_rdtgroup.h b/arch/x86/include/asm/intel_rdt_rdtgroup.h
new file mode 100644
index 0000000..3703964
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt_rdtgroup.h
@@ -0,0 +1,150 @@
+#ifndef _RDT_PGROUP_H
+#define _RDT_PGROUP_H
+#define MAX_RDTGROUP_TYPE_NAMELEN	32
+#define MAX_RDTGROUP_ROOT_NAMELEN	64
+#define MAX_RFTYPE_NAME			64
+
+#include <linux/kernfs.h>
+#include <asm/intel_rdt.h>
+
+extern void rdtgroup_exit(struct task_struct *tsk);
+
+/* cftype->flags */
+enum {
+	RFTYPE_WORLD_WRITABLE = (1 << 4),/* (DON'T USE FOR NEW FILES) S_IWUGO */
+
+	/* internal flags, do not use outside rdtgroup core proper */
+	__RFTYPE_ONLY_ON_DFL  = (1 << 16),/* only on default hierarchy */
+	__RFTYPE_NOT_ON_DFL   = (1 << 17),/* not on default hierarchy */
+};
+
+#define CACHE_LEVEL3		3
+
+struct cache_resource {
+	u64 *cbm;
+	u64 *cbm2;
+	int *closid;
+	int *refcnt;
+};
+
+struct rdt_resource {
+	bool valid;
+	int closid[MAX_CACHE_DOMAINS];
+	/* Add more resources here. */
+};
+
+struct rdtgroup {
+	struct kernfs_node *kn;		/* rdtgroup kernfs entry */
+
+	struct rdtgroup_root *root;
+
+	struct list_head rdtgroup_list;
+
+	atomic_t refcount;
+	struct cpumask cpu_mask;
+	char schema[1024];
+
+	struct rdt_resource resource;
+
+	/* ids of the ancestors at each level including self */
+	int ancestor_ids[];
+};
+
+struct rftype {
+	/*
+	 * By convention, the name should begin with the name of the
+	 * subsystem, followed by a period.  Zero length string indicates
+	 * end of cftype array.
+	 */
+	char name[MAX_CFTYPE_NAME];
+	unsigned long private;
+
+	/*
+	 * The maximum length of string, excluding trailing nul, that can
+	 * be passed to write.  If < PAGE_SIZE-1, PAGE_SIZE-1 is assumed.
+	 */
+	size_t max_write_len;
+
+	/* CFTYPE_* flags */
+	unsigned int flags;
+
+	/*
+	 * Fields used for internal bookkeeping.  Initialized automatically
+	 * during registration.
+	 */
+	struct kernfs_ops *kf_ops;
+
+	/*
+	 * read_u64() is a shortcut for the common case of returning a
+	 * single integer. Use it in place of read()
+	 */
+	u64 (*read_u64)(struct rftype *rft);
+	/*
+	 * read_s64() is a signed version of read_u64()
+	 */
+	s64 (*read_s64)(struct rftype *rft);
+
+	/* generic seq_file read interface */
+	int (*seq_show)(struct seq_file *sf, void *v);
+
+	/* optional ops, implement all or none */
+	void *(*seq_start)(struct seq_file *sf, loff_t *ppos);
+	void *(*seq_next)(struct seq_file *sf, void *v, loff_t *ppos);
+	void (*seq_stop)(struct seq_file *sf, void *v);
+
+	/*
+	 * write_u64() is a shortcut for the common case of accepting
+	 * a single integer (as parsed by simple_strtoull) from
+	 * userspace. Use in place of write(); return 0 or error.
+	 */
+	int (*write_u64)(struct rftype *rft, u64 val);
+	/*
+	 * write_s64() is a signed version of write_u64()
+	 */
+	int (*write_s64)(struct rftype *rft, s64 val);
+
+	/*
+	 * write() is the generic write callback which maps directly to
+	 * kernfs write operation and overrides all other operations.
+	 * Maximum write size is determined by ->max_write_len.  Use
+	 * of_css/cft() to access the associated css and cft.
+	 */
+	ssize_t (*write)(struct kernfs_open_file *of,
+			 char *buf, size_t nbytes, loff_t off);
+};
+
+struct rdtgroup_root {
+	struct kernfs_root *kf_root;
+
+	/* Unique id for this hierarchy. */
+	int hierarchy_id;
+
+	/* The root rdtgroup.  Root is destroyed on its release. */
+	struct rdtgroup rdtgrp;
+
+	/* Number of rdtgroups in the hierarchy */
+	atomic_t nr_rdtgrps;
+
+	/* Hierarchy-specific flags */
+	unsigned int flags;
+
+	/* IDs for rdtgroups in this hierarchy */
+	struct idr rdtgroup_idr;
+
+	/* The name for this hierarchy - may be empty */
+	char name[MAX_RDTGROUP_ROOT_NAMELEN];
+};
+
+/* get rftype from of */
+static inline struct rftype *of_rft(struct kernfs_open_file *of)
+{
+	return of->kn->priv;
+}
+
+/* get rftype from seq_file */
+static inline struct rftype *seq_rft(struct seq_file *seq)
+{
+	return of_rft(seq->private);
+}
+
+#endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (20 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 14:57   ` Thomas Gleixner
  2016-09-13 22:54   ` Dave Hansen
  2016-09-08  9:57 ` [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT Fenghua Yu
                   ` (10 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

QoS mask MSRs array is per cache. We need to allocate CLOSID per cache
instead global CLOSID.

A few different resources can share same QoS mask MSRs array. For
example, one L2 cache can share QoS MSRs with its next level
L3 cache. A domain number represents the L2 cache, the L3 cache, the L2
cache's shared cpumask, and the L3 cache's shared cpumask.

cctable is extended to be index by domain number so that each cache
has its own control table.

shared_domain is introduced to cover multiple resources sharing
CLOSID.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h |   1 +
 arch/x86/kernel/cpu/intel_rdt.c  | 647 +++++++++++++++++++++++++++++++++------
 2 files changed, 547 insertions(+), 101 deletions(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index a4f794b..85beecc 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -74,6 +74,7 @@ extern inline void closid_get(u32 closid, int domain);
 extern void closid_put(u32 closid, int domain);
 extern void closid_free(u32 closid, int domain, int level);
 extern int closid_alloc(u32 *closid, int domain);
+extern struct mutex rdtgroup_mutex;
 extern bool cat_l3_enabled;
 extern unsigned int get_domain_num(int level);
 extern struct shared_domain *shared_domain;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 1bcff29..f7c728b 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -17,14 +17,17 @@
  * more details.
  *
  * More information about RDT be found in the Intel (R) x86 Architecture
- * Software Developer Manual June 2015, volume 3, section 17.15.
+ * Software Developer Manual.
  */
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/cacheinfo.h>
 #include <asm/pqr_common.h>
 #include <asm/intel_rdt.h>
+#include <asm/intel_rdt_rdtgroup.h>
 
 /*
  * During cache alloc mode cctable maintains 1:1 mapping between
@@ -35,40 +38,62 @@
  * index of a dcache_cbm for CLOSid 'n' = n << 1.
  * index of a icache_cbm for CLOSid 'n' = n << 1 + 1
  */
-static struct clos_cbm_table *cctable;
+struct clos_cbm_table **l3_cctable;
+
 /*
  * Minimum bits required in Cache bitmask.
  */
 unsigned int min_bitmask_len = 1;
+
 /*
  * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
  */
-static cpumask_t rdt_cpumask;
-/*
- * Temporary cpumask used during hot cpu notificaiton handling. The usage
- * is serialized by hot cpu locks.
- */
-static cpumask_t tmp_cpumask;
-static DEFINE_MUTEX(rdtgroup_mutex);
+cpumask_t rdt_l3_cpumask;
+
+bool cat_l3_enabled;
+
 struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
 struct clos_config cconfig;
 bool cdp_enabled;
+static bool disable_cat_l3 __initdata;
+struct shared_domain *shared_domain;
+int shared_domain_num;
+
+struct rdt_opts rdt_opts = {
+	.cdp_enabled = false,
+	.verbose = false,
+	.simulate_cat_l3 = false,
+};
+
+#define __DCBM_TABLE_INDEX(x) (x << 1)
+#define __ICBM_TABLE_INDEX(x) ((x << 1) + 1)
+#define __ICBM_MSR_INDEX(x)                    \
+	L3_CBM_FROM_INDEX(__ICBM_TABLE_INDEX(x))
+
+#define DCBM_TABLE_INDEX(x)    (x << cdp_enabled)
+#define ICBM_TABLE_INDEX(x)    ((x << cdp_enabled) + cdp_enabled)
 
-#define __DCBM_TABLE_INDEX(x)	(x << 1)
-#define __ICBM_TABLE_INDEX(x)	((x << 1) + 1)
-#define __DCBM_MSR_INDEX(x)			\
-	CBM_FROM_INDEX(__DCBM_TABLE_INDEX(x))
-#define __ICBM_MSR_INDEX(x)			\
-	CBM_FROM_INDEX(__ICBM_TABLE_INDEX(x))
+DEFINE_MUTEX(rdtgroup_mutex);
 
-#define DCBM_TABLE_INDEX(x)	(x << cdp_enabled)
-#define ICBM_TABLE_INDEX(x)	((x << cdp_enabled) + cdp_enabled)
+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_l3_domain) = -1;
+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_shared_domain) = -1;
+DEFINE_PER_CPU_READ_MOSTLY(struct rdtgroup *, cpu_rdtgroup) = 0;
 
 struct rdt_remote_data {
 	int msr;
 	u64 val;
 };
 
+inline int get_dcbm_table_index(int x)
+{
+	return DCBM_TABLE_INDEX(x);
+}
+
+inline int get_icbm_table_index(int x)
+{
+	return ICBM_TABLE_INDEX(x);
+}
+
 /*
  * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
  * as it does not have CPUID enumeration support for Cache allocation.
@@ -98,41 +123,141 @@ static inline bool cache_alloc_hsw_probe(void)
 
 	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
 
-	boot_cpu_data.x86_cache_max_closid = 4;
-	boot_cpu_data.x86_cache_max_cbm_len = 20;
+	boot_cpu_data.x86_l3_max_closid = 4;
+	boot_cpu_data.x86_l3_max_cbm_len = 20;
 	min_bitmask_len = 2;
 
 	return true;
 }
 
+u32 max_cbm_len(int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return boot_cpu_data.x86_l3_max_cbm_len;
+	default:
+		break;
+	}
+
+	return (u32)~0;
+}
+
+u64 max_cbm(int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return (1ULL << boot_cpu_data.x86_l3_max_cbm_len) - 1;
+	default:
+		break;
+	}
+
+	return (u64)~0;
+}
+
+static u32 hw_max_closid(int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return  boot_cpu_data.x86_l3_max_closid;
+	default:
+		break;
+	}
+
+	WARN(1, "invalid level\n");
+	return 0;
+}
+
+static int cbm_from_index(u32 i, int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return L3_CBM_FROM_INDEX(i);
+	default:
+		break;
+	}
+
+	WARN(1, "invalid level\n");
+	return 0;
+}
+
+bool cat_enabled(int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return cat_l3_enabled;
+	default:
+		break;
+	}
+
+	return false;
+}
+
+static inline bool cat_l3_supported(struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_CAT_L3))
+		return true;
+
+	/*
+	 * Probe for Haswell server CPUs.
+	 */
+	if (c->x86 == 0x6 && c->x86_model == 0x3f)
+		return cache_alloc_hsw_probe();
+
+	return false;
+}
+
 void __intel_rdt_sched_in(void *dummy)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	struct rdtgroup *rdtgrp;
+	int closid;
+	int domain;
 
 	/*
-	 * Currently closid is always 0. When  user interface is added,
-	 * closid will come from user interface.
+	 * If this task is assigned to an rdtgroup, use it.
+	 * Else use the group assigned to this cpu.
 	 */
-	if (state->closid == 0)
+	rdtgrp = current->rdtgroup;
+	if (!rdtgrp)
+		rdtgrp = this_cpu_read(cpu_rdtgroup);
+
+	domain = this_cpu_read(cpu_shared_domain);
+	closid = rdtgrp->resource.closid[domain];
+
+	if (closid == state->closid)
 		return;
 
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
-	state->closid = 0;
+	state->closid = closid;
+	/* Don't really write PQR register in simulation mode. */
+	if (unlikely(rdt_opts.simulate_cat_l3))
+		return;
+
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
 }
 
 /*
  * When cdp mode is enabled, refcnt is maintained in the dcache_cbm entry.
  */
-static inline void closid_get(u32 closid)
+inline void closid_get(u32 closid, int domain)
 {
-	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
-
 	lockdep_assert_held(&rdtgroup_mutex);
 
-	cct->clos_refcnt++;
+	if (cat_l3_enabled) {
+		int l3_domain;
+		int dindex;
+
+		l3_domain = shared_domain[domain].l3_domain;
+		dindex = DCBM_TABLE_INDEX(closid);
+		l3_cctable[l3_domain][dindex].clos_refcnt++;
+		if (cdp_enabled) {
+			int iindex = ICBM_TABLE_INDEX(closid);
+
+			l3_cctable[l3_domain][iindex].clos_refcnt++;
+		}
+	}
 }
 
-static int closid_alloc(u32 *closid)
+int closid_alloc(u32 *closid, int domain)
 {
 	u32 maxid;
 	u32 id;
@@ -140,105 +265,215 @@ static int closid_alloc(u32 *closid)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	maxid = cconfig.max_closid;
-	id = find_first_zero_bit(cconfig.closmap, maxid);
+	id = find_first_zero_bit((unsigned long *)cconfig.closmap[domain],
+				 maxid);
+
 	if (id == maxid)
 		return -ENOSPC;
 
-	set_bit(id, cconfig.closmap);
-	closid_get(id);
+	set_bit(id, (unsigned long *)cconfig.closmap[domain]);
+	closid_get(id, domain);
 	*closid = id;
-	cconfig.closids_used++;
 
 	return 0;
 }
 
-static inline void closid_free(u32 closid)
+unsigned int get_domain_num(int level)
 {
-	clear_bit(closid, cconfig.closmap);
-	cctable[DCBM_TABLE_INDEX(closid)].cbm = 0;
-
-	if (WARN_ON(!cconfig.closids_used))
-		return;
+	if (level == CACHE_LEVEL3)
+		return cpumask_weight(&rdt_l3_cpumask);
+	else
+		return -EINVAL;
+}
 
-	cconfig.closids_used--;
+int level_to_leaf(int level)
+{
+	switch (level) {
+	case CACHE_LEVEL3:
+		return 3;
+	default:
+		return -EINVAL;
+	}
 }
 
-static void closid_put(u32 closid)
+void closid_free(u32 closid, int domain, int level)
 {
-	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
+	struct clos_cbm_table **cctable;
+	int leaf;
+	struct cpumask *mask;
+	int cpu;
+
+	if (level == CACHE_LEVEL3)
+		cctable = l3_cctable;
+
+	clear_bit(closid, (unsigned long *)cconfig.closmap[domain]);
+
+	if (level == CACHE_LEVEL3) {
+		cctable[domain][closid].cbm = max_cbm(level);
+		leaf = level_to_leaf(level);
+		mask = &cache_domains[leaf].shared_cpu_map[domain];
+		cpu = cpumask_first(mask);
+		smp_call_function_single(cpu, cbm_update_l3_msr, &closid, 1);
+	}
+}
 
+static void _closid_put(u32 closid, struct clos_cbm_table *cct,
+			int domain, int level)
+{
 	lockdep_assert_held(&rdtgroup_mutex);
 	if (WARN_ON(!cct->clos_refcnt))
 		return;
 
 	if (!--cct->clos_refcnt)
-		closid_free(closid);
+		closid_free(closid, domain, level);
 }
 
-static void msr_cpu_update(void *arg)
+void closid_put(u32 closid, int domain)
+{
+	struct clos_cbm_table *cct;
+
+	if (cat_l3_enabled) {
+		int l3_domain = shared_domain[domain].l3_domain;
+
+		cct = &l3_cctable[l3_domain][DCBM_TABLE_INDEX(closid)];
+		_closid_put(closid, cct, l3_domain, CACHE_LEVEL3);
+		if (cdp_enabled) {
+			cct = &l3_cctable[l3_domain][ICBM_TABLE_INDEX(closid)];
+			_closid_put(closid, cct, l3_domain, CACHE_LEVEL3);
+		}
+	}
+}
+
+void msr_cpu_update(void *arg)
 {
 	struct rdt_remote_data *info = arg;
 
+	if (unlikely(rdt_opts.verbose))
+		pr_info("Write %lx to msr %x on cpu%d\n",
+			(unsigned long)info->val, info->msr,
+			smp_processor_id());
+
+	if (unlikely(rdt_opts.simulate_cat_l3))
+		return;
+
 	wrmsrl(info->msr, info->val);
 }
 
+static struct cpumask *rdt_cache_cpumask(int level)
+{
+	return &rdt_l3_cpumask;
+}
+
 /*
  * msr_update_all() - Update the msr for all packages.
  */
-static inline void msr_update_all(int msr, u64 val)
+static inline void msr_update_all(int msr, u64 val, int level)
 {
 	struct rdt_remote_data info;
 
 	info.msr = msr;
 	info.val = val;
-	on_each_cpu_mask(&rdt_cpumask, msr_cpu_update, &info, 1);
+	on_each_cpu_mask(rdt_cache_cpumask(level), msr_cpu_update, &info, 1);
 }
 
-static bool code_data_mask_equal(void)
+static void init_qos_msrs(int level)
 {
-	int i, dindex, iindex;
+	if (cat_enabled(level)) {
+		u32 maxcbm;
+		u32 i;
 
-	for (i = 0; i < cconfig.max_closid; i++) {
-		dindex = __DCBM_TABLE_INDEX(i);
-		iindex = __ICBM_TABLE_INDEX(i);
-		if (cctable[dindex].clos_refcnt &&
-		     (cctable[dindex].cbm != cctable[iindex].cbm))
-			return false;
+		maxcbm = max_cbm(level);
+		for (i = 0; i < hw_max_closid(level); i++)
+			msr_update_all(cbm_from_index(i, level), maxcbm, level);
 	}
+}
 
-	return true;
+/*
+ * Initialize QOS_MASK_n registers to all 1's.
+ *
+ * Initialize L3_QOS_CFG register to enable or disable CDP.
+ */
+void init_msrs(bool cdpenabled)
+{
+	if (cat_enabled(CACHE_LEVEL3)) {
+		init_qos_msrs(CACHE_LEVEL3);
+		msr_update_all(MSR_IA32_L3_QOS_CFG, cdpenabled, CACHE_LEVEL3);
+	}
+
+}
+
+int get_cache_leaf(int level, int cpu)
+{
+	int index;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *this_leaf;
+	int num_leaves = this_cpu_ci->num_leaves;
+
+	for (index = 0; index < num_leaves; index++) {
+		this_leaf = this_cpu_ci->info_list + index;
+		if (this_leaf->level == level)
+			return index;
+	}
+
+	return -EINVAL;
+}
+
+static struct cpumask *get_shared_cpu_map(int cpu, int level)
+{
+	int index;
+	struct cacheinfo *leaf;
+	struct cpu_cacheinfo *cpu_ci = get_cpu_cacheinfo(cpu);
+
+	index = get_cache_leaf(level, cpu);
+	if (index < 0)
+		return 0;
+
+	leaf = cpu_ci->info_list + index;
+
+	return &leaf->shared_cpu_map;
 }
 
 /*
  * Set only one cpu in cpumask in all cpus that share the same cache.
  */
-static inline bool rdt_cpumask_update(int cpu)
+inline bool rdt_cpumask_update(struct cpumask *cpumask, int cpu, int level)
 {
-	cpumask_and(&tmp_cpumask, &rdt_cpumask, topology_core_cpumask(cpu));
+	struct cpumask *shared_cpu_map;
+	cpumask_t tmp_cpumask;
+
+	shared_cpu_map = get_shared_cpu_map(cpu, level);
+	if (!shared_cpu_map)
+		return false;
+
+	cpumask_and(&tmp_cpumask, cpumask, shared_cpu_map);
 	if (cpumask_empty(&tmp_cpumask)) {
-		cpumask_set_cpu(cpu, &rdt_cpumask);
+		cpumask_set_cpu(cpu, cpumask);
 		return true;
 	}
 
 	return false;
 }
 
-static void cbm_update_msr(u32 index)
+void cbm_update_l3_msr(void *pindex)
 {
 	struct rdt_remote_data info;
+	int index;
 	int dindex;
+	int l3_domain;
+	struct clos_cbm_table *pl3_cctable;
 
+	index = *(int *)pindex;
 	dindex = DCBM_TABLE_INDEX(index);
-	if (cctable[dindex].clos_refcnt) {
-
-		info.msr = CBM_FROM_INDEX(dindex);
-		info.val = cctable[dindex].cbm;
-		msr_cpu_update((void *) &info);
-
+	l3_domain =  per_cpu(cpu_l3_domain, smp_processor_id());
+	pl3_cctable = &l3_cctable[l3_domain][dindex];
+	if (pl3_cctable->clos_refcnt) {
+		info.msr = L3_CBM_FROM_INDEX(dindex);
+		info.val = pl3_cctable->cbm;
+		msr_cpu_update(&info);
 		if (cdp_enabled) {
 			info.msr = __ICBM_MSR_INDEX(index);
-			info.val = cctable[dindex + 1].cbm;
-			msr_cpu_update((void *) &info);
+			info.val = l3_cctable[l3_domain][dindex+1].cbm;
+			msr_cpu_update(&info);
 		}
 	}
 }
@@ -252,8 +487,9 @@ static void cbm_update_msrs(void *dummy)
 	int maxid = cconfig.max_closid;
 	unsigned int i;
 
-	for (i = 0; i < maxid; i++) {
-		cbm_update_msr(i);
+	if (cat_l3_enabled) {
+		for (i = 0; i < maxid; i++)
+			cbm_update_l3_msr(&i);
 	}
 }
 
@@ -270,9 +506,11 @@ static int intel_rdt_online_cpu(unsigned int cpu)
 	 * If the cpu is first time found and set in its siblings that
 	 * share the same cache, update the CBM MSRs for the cache.
 	 */
-	if (rdt_cpumask_update(cpu))
+	if (rdt_cpumask_update(&rdt_l3_cpumask, cpu, CACHE_LEVEL3))
 		smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
 	mutex_unlock(&rdtgroup_mutex);
+
+	return 0;
 }
 
 static int clear_rdtgroup_cpumask(unsigned int cpu)
@@ -293,63 +531,270 @@ static int clear_rdtgroup_cpumask(unsigned int cpu)
 
 static int intel_rdt_offline_cpu(unsigned int cpu)
 {
-	int i;
+	struct cpumask *shared_cpu_map;
+	int new_cpu;
+	int l3_domain;
+	int level;
+	int leaf;
 
 	mutex_lock(&rdtgroup_mutex);
-	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
-		mutex_unlock(&rdtgroup_mutex);
-		return;
-	}
 
-	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
-	cpumask_clear_cpu(cpu, &tmp_cpumask);
-	i = cpumask_any(&tmp_cpumask);
+	level = CACHE_LEVEL3;
+
+	l3_domain = per_cpu(cpu_l3_domain, cpu);
+	leaf = level_to_leaf(level);
+	shared_cpu_map = &cache_domains[leaf].shared_cpu_map[l3_domain];
 
-	if (i < nr_cpu_ids)
-		cpumask_set_cpu(i, &rdt_cpumask);
+	cpumask_clear_cpu(cpu, &rdt_l3_cpumask);
+	cpumask_clear_cpu(cpu, shared_cpu_map);
+	if (cpumask_empty(shared_cpu_map))
+		goto out;
+
+	new_cpu = cpumask_first(shared_cpu_map);
+	rdt_cpumask_update(&rdt_l3_cpumask, new_cpu, level);
 
 	clear_rdtgroup_cpumask(cpu);
+out:
 	mutex_unlock(&rdtgroup_mutex);
+	return 0;
+}
+
+/*
+ * Initialize per-cpu cpu_l3_domain.
+ *
+ * cpu_l3_domain numbers are consequtive integer starting from 0.
+ * Sets up 1:1 mapping of cpu id and cpu_l3_domain.
+ */
+static int __init cpu_cache_domain_init(int level)
+{
+	int i, j;
+	int max_cpu_cache_domain = 0;
+	int index;
+	struct cacheinfo *leaf;
+	int *domain;
+	struct cpu_cacheinfo *cpu_ci;
+
+	for_each_online_cpu(i) {
+		domain = &per_cpu(cpu_l3_domain, i);
+		if (*domain == -1) {
+			index = get_cache_leaf(level, i);
+			if (index < 0)
+				return -EINVAL;
+
+			cpu_ci = get_cpu_cacheinfo(i);
+			leaf = cpu_ci->info_list + index;
+			if (cpumask_empty(&leaf->shared_cpu_map)) {
+				WARN(1, "no shared cpu for L2\n");
+				return -EINVAL;
+			}
+
+			for_each_cpu(j, &leaf->shared_cpu_map) {
+				domain = &per_cpu(cpu_l3_domain, j);
+				*domain = max_cpu_cache_domain;
+			}
+			max_cpu_cache_domain++;
+		}
+	}
+
+	return 0;
+}
+
+static int __init rdt_setup(char *str)
+{
+	char *tok;
+
+	while ((tok = strsep(&str, ",")) != NULL) {
+		if (!*tok)
+			return -EINVAL;
+
+		if (strcmp(tok, "simulate_cat_l3") == 0) {
+			pr_info("Simulate CAT L3\n");
+			rdt_opts.simulate_cat_l3 = true;
+		} else if (strcmp(tok, "disable_cat_l3") == 0) {
+			pr_info("CAT L3 is disabled\n");
+			disable_cat_l3 = true;
+		} else {
+			pr_info("Invalid rdt option\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+__setup("resctrl=", rdt_setup);
+
+static inline bool resource_alloc_enabled(void)
+{
+	return cat_l3_enabled;
+}
+
+static int shared_domain_init(void)
+{
+	int l3_domain_num = get_domain_num(CACHE_LEVEL3);
+	int size;
+	int domain;
+	struct cpumask *cpumask;
+	struct cpumask *shared_cpu_map;
+	int cpu;
+
+	if (cat_l3_enabled) {
+		shared_domain_num = l3_domain_num;
+		cpumask = &rdt_l3_cpumask;
+	} else
+		return -EINVAL;
+
+	size = shared_domain_num * sizeof(struct shared_domain);
+	shared_domain = kzalloc(size, GFP_KERNEL);
+	if (!shared_domain)
+		return -EINVAL;
+
+	domain = 0;
+	for_each_cpu(cpu, cpumask) {
+		if (cat_l3_enabled)
+			shared_domain[domain].l3_domain =
+					per_cpu(cpu_l3_domain, cpu);
+		else
+			shared_domain[domain].l3_domain = -1;
+
+		shared_cpu_map = get_shared_cpu_map(cpu, CACHE_LEVEL3);
+
+		cpumask_copy(&shared_domain[domain].cpumask, shared_cpu_map);
+
+		domain++;
+	}
+	for_each_online_cpu(cpu) {
+		if (cat_l3_enabled)
+			per_cpu(cpu_shared_domain, cpu) =
+					per_cpu(cpu_l3_domain, cpu);
+	}
+
+	return 0;
+}
+
+static int cconfig_init(int maxid)
+{
+	int num;
+	int domain;
+	unsigned long *closmap_block;
+	int maxid_size;
+
+	maxid_size = BITS_TO_LONGS(maxid);
+	num = maxid_size * shared_domain_num;
+	cconfig.closmap = kcalloc(maxid, sizeof(unsigned long *), GFP_KERNEL);
+	if (!cconfig.closmap)
+		goto out_free;
+
+	closmap_block = kcalloc(num, sizeof(unsigned long), GFP_KERNEL);
+	if (!closmap_block)
+		goto out_free;
+
+	for (domain = 0; domain < shared_domain_num; domain++)
+		cconfig.closmap[domain] = (unsigned long *)closmap_block +
+					  domain * maxid_size;
+
+	cconfig.max_closid = maxid;
+
+	return 0;
+out_free:
+	kfree(cconfig.closmap);
+	kfree(closmap_block);
+	return -ENOMEM;
+}
+
+static int __init cat_cache_init(int level, int maxid,
+				 struct clos_cbm_table ***cctable)
+{
+	int domain_num;
+	int domain;
+	int size;
+	int ret = 0;
+	struct clos_cbm_table *p;
+
+	domain_num = get_domain_num(level);
+	size = domain_num * sizeof(struct clos_cbm_table *);
+	*cctable = kzalloc(size, GFP_KERNEL);
+	if (!*cctable) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	size = maxid * domain_num * sizeof(struct clos_cbm_table);
+	p = kzalloc(size, GFP_KERNEL);
+	if (!p) {
+		kfree(*cctable);
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (domain = 0; domain < domain_num; domain++)
+		(*cctable)[domain] = p + domain * maxid;
+
+	ret = cpu_cache_domain_init(level);
+	if (ret) {
+		kfree(*cctable);
+		kfree(p);
+	}
+out:
+	return ret;
 }
 
 static int __init intel_rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
 	u32 maxid;
-	int err = 0, size, i;
-
-	maxid = c->x86_cache_max_closid;
-
-	size = maxid * sizeof(struct clos_cbm_table);
-	cctable = kzalloc(size, GFP_KERNEL);
-	if (!cctable) {
-		err = -ENOMEM;
-		goto out_err;
+	int i;
+	int ret;
+
+	if (unlikely(disable_cat_l3))
+		cat_l3_enabled = false;
+	else if (cat_l3_supported(c))
+		cat_l3_enabled = true;
+	else if (rdt_opts.simulate_cat_l3 &&
+		 get_cache_leaf(CACHE_LEVEL3, 0) >= 0)
+		cat_l3_enabled = true;
+	else
+		cat_l3_enabled = false;
+
+	if (!resource_alloc_enabled())
+		return -ENODEV;
+
+	if (rdt_opts.simulate_cat_l3) {
+		boot_cpu_data.x86_l3_max_closid = 16;
+		boot_cpu_data.x86_l3_max_cbm_len = 20;
+	}
+	for_each_online_cpu(i) {
+		rdt_cpumask_update(&rdt_l3_cpumask, i, CACHE_LEVEL3);
 	}
 
-	size = BITS_TO_LONGS(maxid) * sizeof(long);
-	cconfig.closmap = kzalloc(size, GFP_KERNEL);
-	if (!cconfig.closmap) {
-		kfree(cctable);
-		err = -ENOMEM;
-		goto out_err;
+	maxid = 0;
+	if (cat_l3_enabled) {
+		maxid = boot_cpu_data.x86_l3_max_closid;
+		ret = cat_cache_init(CACHE_LEVEL3, maxid, &l3_cctable);
+		if (ret)
+			cat_l3_enabled = false;
 	}
 
-	for_each_online_cpu(i)
-		rdt_cpumask_update(i);
+	if (!cat_l3_enabled)
+		return -ENOSPC;
+
+	ret = shared_domain_init();
+	if (ret)
+		return -ENODEV;
+
+	ret = cconfig_init(maxid);
+	if (ret)
+		return ret;
 
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
 				"AP_INTEL_RDT_ONLINE",
 				intel_rdt_online_cpu, intel_rdt_offline_cpu);
-	if (err < 0)
-		goto out_err;
+	if (ret < 0)
+		return ret;
 
 	pr_info("Intel cache allocation enabled\n");
 	if (cpu_has(c, X86_FEATURE_CDP_L3))
 		pr_info("Intel code data prioritization detected\n");
-out_err:
 
-	return err;
+	return 0;
 }
 
 late_initcall(intel_rdt_late_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (21 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 14:59   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory Fenghua Yu
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

We introduce a new resctrl file system mounted under /sys/fs/resctrl.
User uses this file system to control resource allocation.

Hiearchy of the file system is as follows:
/sys/fs/resctrl/info/info
		    /<resource0>/<resource0 specific info files>
		    /<resource1>/<resource1 specific info files>
			....
		/tasks
		/cpus
		/schemata
		/sub-dir1
		/sub-dir2
		....

User can specify which task uses which schemata for resource allocation.

More details can be found in Documentation/x86/intel_rdt_ui.txt

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 85beecc..aaed4b4 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -40,6 +40,8 @@ struct cache_domain {
 	unsigned int shared_cache_id[MAX_CACHE_DOMAINS];
 };
 
+extern struct cache_domain cache_domains[MAX_CACHE_LEAVES];
+
 extern struct rdt_opts rdt_opts;
 
 struct clos_cbm_table {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (22 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 16:04   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file Fenghua Yu
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

During boot time, the "info" directory is set up under resctrl root.
it contains one "info" file and one resource specific directory
if the resource is enabled.

If L3 is enabled, "l3" sub-directory is created under the "info"
directory. There are three l3 specific info files under it:
max_closid, max_cbm_len, and domain_to_cache_id.

The "info" directory is exposed to user after resctrl is mounted.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt_rdtgroup.h |   7 +
 arch/x86/kernel/cpu/intel_rdt.c           |   2 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  | 958 ++++++++++++++++++++++++++++++
 3 files changed, 967 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c

diff --git a/arch/x86/include/asm/intel_rdt_rdtgroup.h b/arch/x86/include/asm/intel_rdt_rdtgroup.h
index 3703964..92208a2 100644
--- a/arch/x86/include/asm/intel_rdt_rdtgroup.h
+++ b/arch/x86/include/asm/intel_rdt_rdtgroup.h
@@ -7,8 +7,15 @@
 #include <linux/kernfs.h>
 #include <asm/intel_rdt.h>
 
+/* Defined in intel_rdt_rdtgroup.c.*/
+extern int __init rdtgroup_init(void);
 extern void rdtgroup_exit(struct task_struct *tsk);
 
+/* Defined in intel_rdt.c. */
+extern struct list_head rdtgroup_lists;
+extern struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
+extern void rdtgroup_kn_unlock(struct kernfs_node *kn);
+
 /* cftype->flags */
 enum {
 	RFTYPE_WORLD_WRITABLE = (1 << 4),/* (DON'T USE FOR NEW FILES) S_IWUGO */
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index f7c728b..6c3df9e 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -790,6 +790,8 @@ static int __init intel_rdt_late_init(void)
 	if (ret < 0)
 		return ret;
 
+	rdtgroup_init();
+
 	pr_info("Intel cache allocation enabled\n");
 	if (cpu_has(c, X86_FEATURE_CDP_L3))
 		pr_info("Intel code data prioritization detected\n");
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
new file mode 100644
index 0000000..7842194
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -0,0 +1,958 @@
+/*
+ * Resource Director Technology(RDT)
+ * - User interface for Resource Alloction in RDT.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * 2016 Written by
+ *    Fenghua Yu <fenghua.yu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT be found in the Intel (R) x86 Architecture
+ * Software Developer Manual.
+ */
+#include <linux/cred.h>
+#include <linux/ctype.h>
+#include <linux/errno.h>
+#include <linux/init_task.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/magic.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/proc_fs.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+#include <linux/pid_namespace.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+#include <linux/cacheinfo.h>
+#include <asm/intel_rdt_rdtgroup.h>
+#include <asm/intel_rdt.h>
+
+#define RDTGROUP_FILE_NAME_LEN	(MAX_RDTGROUP_TYPE_NAMELEN +	\
+				 MAX_RFTYPE_NAME + 2)
+
+static int rdt_info_show(struct seq_file *seq, void *v);
+static int rdt_max_closid_show(struct seq_file *seq, void *v);
+static int rdt_max_cbm_len_show(struct seq_file *seq, void *v);
+static int domain_to_cache_id_show(struct seq_file *seq, void *v);
+
+/* rdtgroup core interface files */
+static struct rftype rdtgroup_root_base_files[] = {
+	{
+		.name = "tasks",
+		.seq_show = rdtgroup_tasks_show,
+		.write = rdtgroup_tasks_write,
+	},
+	{
+		.name = "cpus",
+		.write = rdtgroup_cpus_write,
+		.seq_show = rdtgroup_cpus_show,
+	},
+	{
+		.name = "schemata",
+		.write = rdtgroup_schemata_write,
+		.seq_show = rdtgroup_schemata_show,
+	},
+};
+
+static struct rftype info_files[] = {
+	{
+		.name = "info",
+		.seq_show = rdt_info_show,
+	},
+};
+
+/* rdtgroup information files for one cache resource. */
+static struct rftype res_info_files[] = {
+	{
+		.name = "max_closid",
+		.seq_show = rdt_max_closid_show,
+	},
+	{
+		.name = "max_cbm_len",
+		.seq_show = rdt_max_cbm_len_show,
+	},
+	{
+		.name = "domain_to_cache_id",
+		.seq_show = domain_to_cache_id_show,
+	},
+};
+
+static struct rftype rdtgroup_partition_base_files[] = {
+	{
+		.name = "tasks",
+		.seq_show = rdtgroup_tasks_show,
+		.write = rdtgroup_tasks_write,
+	},
+	{
+		.name = "cpus",
+		.write = rdtgroup_cpus_write,
+		.seq_show = rdtgroup_cpus_show,
+	},
+	{
+		.name = "schemata",
+		.write = rdtgroup_schemata_write,
+		.seq_show = rdtgroup_schemata_show,
+	},
+};
+
+struct rdtgroup *root_rdtgrp;
+static struct rftype rdtgroup_partition_base_files[];
+struct cache_domain cache_domains[MAX_CACHE_LEAVES];
+/* The default hierarchy. */
+struct rdtgroup_root rdtgrp_dfl_root;
+static struct list_head rdtgroups;
+
+/*
+ * kernfs_root - find out the kernfs_root a kernfs_node belongs to
+ * @kn: kernfs_node of interest
+ *
+ * Return the kernfs_root @kn belongs to.
+ */
+static inline struct kernfs_root *get_kernfs_root(struct kernfs_node *kn)
+{
+	if (kn->parent)
+		kn = kn->parent;
+	return kn->dir.root;
+}
+
+/*
+ * rdtgroup_file_mode - deduce file mode of a control file
+ * @cft: the control file in question
+ *
+ * S_IRUGO for read, S_IWUSR for write.
+ */
+static umode_t rdtgroup_file_mode(const struct rftype *rft)
+{
+	umode_t mode = 0;
+
+	if (rft->read_u64 || rft->read_s64 || rft->seq_show)
+		mode |= S_IRUGO;
+
+	if (rft->write_u64 || rft->write_s64 || rft->write)
+		mode |= S_IWUSR;
+
+	return mode;
+}
+
+/* set uid and gid of rdtgroup dirs and files to that of the creator */
+static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+			       .ia_uid = current_fsuid(),
+			       .ia_gid = current_fsgid(), };
+
+	if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+	    gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+		return 0;
+
+	return kernfs_setattr(kn, &iattr);
+}
+
+static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
+{
+	char name[RDTGROUP_FILE_NAME_LEN];
+	struct kernfs_node *kn;
+	struct lock_class_key *key = NULL;
+	int ret;
+
+	strncpy(name, rft->name, RDTGROUP_FILE_NAME_LEN);
+	kn = __kernfs_create_file(parent_kn, name, rdtgroup_file_mode(rft),
+				  0, rft->kf_ops, rft, NULL, key);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret) {
+		kernfs_remove(kn);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void rdtgroup_rm_file(struct kernfs_node *kn, const struct rftype *rft)
+{
+	char name[RDTGROUP_FILE_NAME_LEN];
+
+	strncpy(name, rft->name, RDTGROUP_FILE_NAME_LEN);
+	kernfs_remove_by_name(kn, name);
+}
+
+static void rdtgroup_rm_files(struct kernfs_node *kn, struct rftype *rft,
+			      const struct rftype *end)
+{
+	for (; rft != end; rft++)
+		rdtgroup_rm_file(kn, rft);
+}
+
+static int rdtgroup_add_files(struct kernfs_node *kn, struct rftype *rfts,
+			      const struct rftype *end)
+{
+	struct rftype *rft;
+	int ret;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	for (rft = rfts; rft != end; rft++) {
+		ret = rdtgroup_add_file(kn, rft);
+		if (ret) {
+			pr_warn("%s: failed to add %s, err=%d\n",
+				__func__, rft->name, ret);
+			rdtgroup_rm_files(kn, rft, end);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Get resource type from name in kernfs_node. This can be extended to
+ * multi-resources (e.g. L2). Right now simply return RESOURCE_L3 because
+ * we only have L3 support.
+ */
+static enum resource_type get_kn_res_type(struct kernfs_node *kn)
+{
+	return RESOURCE_L3;
+}
+
+static int rdt_max_closid_show(struct seq_file *seq, void *v)
+{
+	struct kernfs_open_file *of = seq->private;
+
+	switch (get_kn_res_type(of->kn)) {
+	case RESOURCE_L3:
+		seq_printf(seq, "%d\n",
+			boot_cpu_data.x86_l3_max_closid);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int rdt_max_cbm_len_show(struct seq_file *seq, void *v)
+{
+	struct kernfs_open_file *of = seq->private;
+
+	switch (get_kn_res_type(of->kn)) {
+	case RESOURCE_L3:
+		seq_printf(seq, "%d\n",
+			boot_cpu_data.x86_l3_max_cbm_len);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int get_shared_domain(int domain, int level)
+{
+	int sd;
+
+	for_each_cache_domain(sd, 0, shared_domain_num) {
+		if (cat_l3_enabled && level == CACHE_LEVEL3) {
+			if (shared_domain[sd].l3_domain == domain)
+				return sd;
+		}
+	}
+
+	return -1;
+}
+
+static void rdt_info_show_cat(struct seq_file *seq, int level)
+{
+	int domain;
+	int domain_num = get_domain_num(level);
+	int closid;
+	u64 cbm;
+	struct clos_cbm_table **cctable;
+	int maxid;
+	int shared_domain;
+	int cnt;
+
+	if (level == CACHE_LEVEL3)
+		cctable = l3_cctable;
+	else
+		return;
+
+	maxid = cconfig.max_closid;
+	for (domain = 0; domain < domain_num; domain++) {
+		seq_printf(seq, "domain %d:\n", domain);
+		shared_domain = get_shared_domain(domain, level);
+		for (closid = 0; closid < maxid; closid++) {
+			int dindex, iindex;
+
+			if (test_bit(closid,
+			(unsigned long *)cconfig.closmap[shared_domain])) {
+				dindex = get_dcbm_table_index(closid);
+				cbm = cctable[domain][dindex].cbm;
+				cnt = cctable[domain][dindex].clos_refcnt;
+				seq_printf(seq, "cbm[%d]=%lx, refcnt=%d\n",
+					 dindex, (unsigned long)cbm, cnt);
+				if (cdp_enabled) {
+					iindex = get_icbm_table_index(closid);
+					cbm = cctable[domain][iindex].cbm;
+					cnt =
+					   cctable[domain][iindex].clos_refcnt;
+					seq_printf(seq,
+						   "cbm[%d]=%lx, refcnt=%d\n",
+						   iindex, (unsigned long)cbm,
+						   cnt);
+				}
+			} else {
+				cbm = max_cbm(level);
+				cnt = 0;
+				dindex = get_dcbm_table_index(closid);
+				seq_printf(seq, "cbm[%d]=%lx, refcnt=%d\n",
+					 dindex, (unsigned long)cbm, cnt);
+				if (cdp_enabled) {
+					iindex = get_icbm_table_index(closid);
+					seq_printf(seq,
+						 "cbm[%d]=%lx, refcnt=%d\n",
+						 iindex, (unsigned long)cbm,
+						 cnt);
+				}
+			}
+		}
+	}
+}
+
+static void show_shared_domain(struct seq_file *seq)
+{
+	int domain;
+
+	seq_puts(seq, "Shared domains:\n");
+
+	for_each_cache_domain(domain, 0, shared_domain_num) {
+		struct shared_domain *sd;
+
+		sd = &shared_domain[domain];
+		seq_printf(seq, "domain[%d]:", domain);
+		if (cat_enabled(CACHE_LEVEL3))
+			seq_printf(seq, "l3_domain=%d ", sd->l3_domain);
+		seq_printf(seq, "cpumask=%*pb\n",
+			   cpumask_pr_args(&sd->cpumask));
+	}
+}
+
+static int rdt_info_show(struct seq_file *seq, void *v)
+{
+	show_shared_domain(seq);
+
+	if (cat_l3_enabled) {
+		if (rdt_opts.verbose)
+			rdt_info_show_cat(seq, CACHE_LEVEL3);
+	}
+
+	seq_puts(seq, "\n");
+
+	return 0;
+}
+
+static int res_type_to_level(enum resource_type res_type, int *level)
+{
+	int ret = 0;
+
+	switch (res_type) {
+	case RESOURCE_L3:
+		*level = CACHE_LEVEL3;
+		break;
+	case RESOURCE_NUM:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int domain_to_cache_id_show(struct seq_file *seq, void *v)
+{
+	struct kernfs_open_file *of = seq->private;
+	enum resource_type res_type;
+	int domain;
+	int leaf;
+	int level = 0;
+	int ret;
+
+	res_type = (enum resource_type)of->kn->parent->priv;
+
+	ret = res_type_to_level(res_type, &level);
+	if (ret)
+		return 0;
+
+	leaf =	get_cache_leaf(level, 0);
+
+	for (domain = 0; domain < get_domain_num(level); domain++) {
+		unsigned int cid;
+
+		cid = cache_domains[leaf].shared_cache_id[domain];
+		seq_printf(seq, "%d:%d\n", domain, cid);
+	}
+
+	return 0;
+}
+
+static int rdtgroup_procs_write_permission(struct task_struct *task,
+					   struct kernfs_open_file *of)
+{
+	const struct cred *cred = current_cred();
+	const struct cred *tcred = get_task_cred(task);
+	int ret = 0;
+
+	/*
+	 * even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EPERM;
+
+	put_cred(tcred);
+	return ret;
+}
+
+static int info_populate_dir(struct kernfs_node *kn)
+{
+	struct rftype *rfts;
+
+	rfts = info_files;
+	return rdtgroup_add_files(kn, rfts, rfts + ARRAY_SIZE(info_files));
+}
+
+static int res_info_populate_dir(struct kernfs_node *kn)
+{
+	struct rftype *rfts;
+
+	rfts = res_info_files;
+	return rdtgroup_add_files(kn, rfts, rfts + ARRAY_SIZE(res_info_files));
+}
+
+static int rdtgroup_populate_dir(struct kernfs_node *kn)
+{
+	struct rftype *rfts;
+
+	rfts = rdtgroup_root_base_files;
+	return rdtgroup_add_files(kn, rfts,
+				  rfts + ARRAY_SIZE(rdtgroup_root_base_files));
+}
+
+static int rdtgroup_partition_populate_dir(struct kernfs_node *kn)
+{
+	struct rftype *rfts;
+
+	rfts = rdtgroup_partition_base_files;
+	return rdtgroup_add_files(kn, rfts,
+			rfts + ARRAY_SIZE(rdtgroup_partition_base_files));
+}
+
+LIST_HEAD(rdtgroup_lists);
+static void init_rdtgroup_root(struct rdtgroup_root *root)
+{
+	struct rdtgroup *rdtgrp = &root->rdtgrp;
+
+	INIT_LIST_HEAD(&rdtgrp->rdtgroup_list);
+	list_add_tail(&rdtgrp->rdtgroup_list, &rdtgroup_lists);
+	atomic_set(&root->nr_rdtgrps, 1);
+	rdtgrp->root = root;
+}
+
+static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops;
+struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
+{
+	struct rdtgroup *rdtgrp;
+
+	if (kernfs_type(kn) == KERNFS_DIR)
+		rdtgrp = kn->priv;
+	else
+		rdtgrp = kn->parent->priv;
+
+	kernfs_break_active_protection(kn);
+
+	mutex_lock(&rdtgroup_mutex);
+	/* Unlock if rdtgrp is dead. */
+	if (!rdtgrp)
+		rdtgroup_kn_unlock(kn);
+
+	return rdtgrp;
+}
+
+void rdtgroup_kn_unlock(struct kernfs_node *kn)
+{
+	mutex_unlock(&rdtgroup_mutex);
+
+	kernfs_unbreak_active_protection(kn);
+}
+
+static char *res_info_dir_name(enum resource_type res_type, char *name)
+{
+	switch (res_type) {
+	case RESOURCE_L3:
+		strncpy(name, "l3", RDTGROUP_FILE_NAME_LEN);
+		break;
+	default:
+		break;
+	}
+
+	return name;
+}
+
+static int create_res_info(enum resource_type res_type,
+			   struct kernfs_node *parent_kn)
+{
+	struct kernfs_node *kn;
+	char name[RDTGROUP_FILE_NAME_LEN];
+	int ret;
+
+	res_info_dir_name(res_type, name);
+	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, NULL);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto out;
+	}
+
+	/*
+	 * This extra ref will be put in kernfs_remove() and guarantees
+	 * that @rdtgrp->kn is always accessible.
+	 */
+	kernfs_get(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
+
+	ret = res_info_populate_dir(kn);
+	if (ret)
+		goto out_destroy;
+
+	kernfs_activate(kn);
+
+	ret = 0;
+	goto out;
+
+out_destroy:
+	kernfs_remove(kn);
+out:
+	return ret;
+
+}
+
+static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn,
+				    const char *name)
+{
+	struct kernfs_node *kn;
+	int ret;
+
+	if (parent_kn != root_rdtgrp->kn)
+		return -EPERM;
+
+	/* create the directory */
+	kn = kernfs_create_dir(parent_kn, "info", parent_kn->mode, root_rdtgrp);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto out;
+	}
+
+	ret = info_populate_dir(kn);
+	if (ret)
+		goto out_destroy;
+
+	if (cat_enabled(CACHE_LEVEL3))
+		create_res_info(RESOURCE_L3, kn);
+
+	/*
+	 * This extra ref will be put in kernfs_remove() and guarantees
+	 * that @rdtgrp->kn is always accessible.
+	 */
+	kernfs_get(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
+
+	kernfs_activate(kn);
+
+	ret = 0;
+	goto out;
+
+out_destroy:
+	kernfs_remove(kn);
+out:
+	return ret;
+}
+
+static int rdtgroup_setup_root(struct rdtgroup_root *root,
+			       unsigned long ss_mask)
+{
+	int ret;
+
+	root_rdtgrp = &root->rdtgrp;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	root->kf_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
+					   KERNFS_ROOT_CREATE_DEACTIVATED,
+					   root_rdtgrp);
+	if (IS_ERR(root->kf_root)) {
+		ret = PTR_ERR(root->kf_root);
+		goto out;
+	}
+	root_rdtgrp->kn = root->kf_root->kn;
+
+	ret = rdtgroup_populate_dir(root->kf_root->kn);
+	if (ret)
+		goto destroy_root;
+
+	rdtgroup_create_info_dir(root->kf_root->kn, "info_dir");
+
+	/*
+	 * Link the root rdtgroup in this hierarchy into all the css_set
+	 * objects.
+	 */
+	WARN_ON(atomic_read(&root->nr_rdtgrps) != 1);
+
+	kernfs_activate(root_rdtgrp->kn);
+	ret = 0;
+	goto out;
+
+destroy_root:
+	kernfs_destroy_root(root->kf_root);
+	root->kf_root = NULL;
+out:
+	return ret;
+}
+
+static int get_shared_cache_id(int cpu, int level)
+{
+	struct cpuinfo_x86 *c;
+	int index_msb;
+	struct cpu_cacheinfo *this_cpu_ci;
+	struct cacheinfo *this_leaf;
+
+	this_cpu_ci = get_cpu_cacheinfo(cpu);
+
+	this_leaf = this_cpu_ci->info_list + level_to_leaf(level);
+	return this_leaf->id;
+	return c->apicid >> index_msb;
+}
+
+static void init_cache_domain(int cpu, int leaf)
+{
+	struct cpu_cacheinfo *this_cpu_ci;
+	struct cpumask *mask;
+	unsigned int level;
+	struct cacheinfo *this_leaf;
+	int domain;
+
+	this_cpu_ci = get_cpu_cacheinfo(cpu);
+	this_leaf = this_cpu_ci->info_list + leaf;
+	cache_domains[leaf].level = this_leaf->level;
+	mask = &this_leaf->shared_cpu_map;
+	for (domain = 0; domain < MAX_CACHE_DOMAINS; domain++) {
+		if (cpumask_test_cpu(cpu,
+			&cache_domains[leaf].shared_cpu_map[domain]))
+			return;
+	}
+	if (domain == MAX_CACHE_DOMAINS) {
+		domain = cache_domains[leaf].max_cache_domains_num++;
+
+		cache_domains[leaf].shared_cpu_map[domain] = *mask;
+
+		level = cache_domains[leaf].level;
+		cache_domains[leaf].shared_cache_id[domain] =
+			get_shared_cache_id(cpu, level);
+	}
+}
+
+static __init void init_cache_domains(void)
+{
+	int cpu;
+	int leaf;
+
+	for (leaf = 0; leaf < get_cpu_cacheinfo(0)->num_leaves; leaf++) {
+		for_each_online_cpu(cpu)
+			init_cache_domain(cpu, leaf);
+	}
+}
+
+void rdtgroup_exit(struct task_struct *tsk)
+{
+
+	if (!list_empty(&tsk->rg_list)) {
+		struct rdtgroup *rdtgrp = tsk->rdtgroup;
+
+		list_del_init(&tsk->rg_list);
+		tsk->rdtgroup = NULL;
+		atomic_dec(&rdtgrp->refcount);
+	}
+}
+
+static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
+	__releases(&rdtgroup_mutex) __acquires(&rdtgroup_mutex)
+{
+	int shared_domain;
+	int closid;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* free closid occupied by this rdtgroup. */
+	for_each_cache_domain(shared_domain, 0, shared_domain_num) {
+		closid = rdtgrp->resource.closid[shared_domain];
+		closid_put(closid, shared_domain);
+	}
+
+	list_del_init(&rdtgrp->rdtgroup_list);
+
+	/*
+	 * Remove @rdtgrp directory along with the base files.  @rdtgrp has an
+	 * extra ref on its kn.
+	 */
+	kernfs_remove(rdtgrp->kn);
+}
+
+static int
+rdtgroup_move_task_all(struct rdtgroup *src_rdtgrp, struct rdtgroup *dst_rdtgrp)
+{
+	struct list_head *tasks;
+
+	tasks = &src_rdtgrp->pset.tasks;
+	while (!list_empty(tasks)) {
+		struct task_struct *tsk;
+		struct list_head *pos;
+		pid_t pid;
+		int ret;
+
+		pos = tasks->next;
+		tsk = list_entry(pos, struct task_struct, rg_list);
+		pid = tsk->pid;
+		ret = rdtgroup_move_task(pid, dst_rdtgrp, false, NULL);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Forcibly remove all of subdirectories under root.
+ */
+static void rmdir_all_sub(void)
+{
+	struct rdtgroup *rdtgrp;
+	int cpu;
+	struct list_head *l;
+	struct task_struct *p;
+
+	/* Move all tasks from sub rdtgroups to default */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (p->rdtgroup)
+			p->rdtgroup = NULL;
+	}
+	rcu_read_unlock();
+
+	while (!list_is_last(&root_rdtgrp->rdtgroup_list, &rdtgroup_lists)) {
+		l = rdtgroup_lists.next;
+		if (l == &root_rdtgrp->rdtgroup_list)
+			l = l->next;
+
+		rdtgrp = list_entry(l, struct rdtgroup, rdtgroup_list);
+		if (rdtgrp == root_rdtgrp)
+			continue;
+
+		for_each_cpu(cpu, &rdtgrp->cpu_mask)
+			per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
+
+		rdtgroup_destroy_locked(rdtgrp);
+	}
+}
+
+static int parse_rdtgroupfs_options(char *data)
+{
+	char *token, *o = data;
+	int nr_opts = 0;
+
+	while ((token = strsep(&o, ",")) != NULL) {
+		nr_opts++;
+
+		if (!*token)
+			return -EINVAL;
+		if (!strcmp(token, "cdp")) {
+			/* Enable CDP */
+			rdt_opts.cdp_enabled = true;
+			continue;
+		}
+		if (!strcmp(token, "verbose")) {
+			rdt_opts.verbose = true;
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static void release_root_closid(void)
+{
+	int domain;
+	int closid;
+
+	if (!root_rdtgrp->resource.valid)
+		return;
+
+	for_each_cache_domain(domain, 0, shared_domain_num) {
+		/* Put closid in root rdtgrp's domain if valid. */
+		closid = root_rdtgrp->resource.closid[domain];
+		closid_put(closid, domain);
+	}
+}
+
+static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	struct rftype *rft = of->kn->priv;
+
+	if (rft->write)
+		return rft->write(of, buf, nbytes, off);
+
+	return -EINVAL;
+}
+
+static void *rdtgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
+{
+	return seq_rft(seq)->seq_start(seq, ppos);
+}
+
+static void *rdtgroup_seqfile_next(struct seq_file *seq, void *v, loff_t *ppos)
+{
+	return seq_rft(seq)->seq_next(seq, v, ppos);
+}
+
+static void rdtgroup_seqfile_stop(struct seq_file *seq, void *v)
+{
+	seq_rft(seq)->seq_stop(seq, v);
+}
+
+static int rdtgroup_seqfile_show(struct seq_file *m, void *arg)
+{
+	struct rftype *rft = seq_rft(m);
+
+	if (rft->seq_show)
+		return rft->seq_show(m, arg);
+	return 0;
+}
+
+static struct kernfs_ops rdtgroup_kf_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.write			= rdtgroup_file_write,
+	.seq_start		= rdtgroup_seqfile_start,
+	.seq_next		= rdtgroup_seqfile_next,
+	.seq_stop		= rdtgroup_seqfile_stop,
+	.seq_show		= rdtgroup_seqfile_show,
+};
+
+static struct kernfs_ops rdtgroup_kf_single_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.write			= rdtgroup_file_write,
+	.seq_show		= rdtgroup_seqfile_show,
+};
+
+static void rdtgroup_exit_rftypes(struct rftype *rfts)
+{
+	struct rftype *rft;
+
+	for (rft = rfts; rft->name[0] != '\0'; rft++) {
+		/* free copy for custom atomic_write_len, see init_cftypes() */
+		if (rft->max_write_len && rft->max_write_len != PAGE_SIZE)
+			kfree(rft->kf_ops);
+		rft->kf_ops = NULL;
+
+		/* revert flags set by rdtgroup core while adding @cfts */
+		rft->flags &= ~(__RFTYPE_ONLY_ON_DFL | __RFTYPE_NOT_ON_DFL);
+	}
+}
+
+static int rdtgroup_init_rftypes(struct rftype *rfts)
+{
+	struct rftype *rft;
+
+	for (rft = rfts; rft->name[0] != '\0'; rft++) {
+		struct kernfs_ops *kf_ops;
+
+		if (rft->seq_start)
+			kf_ops = &rdtgroup_kf_ops;
+		else
+			kf_ops = &rdtgroup_kf_single_ops;
+
+		/*
+		 * Ugh... if @cft wants a custom max_write_len, we need to
+		 * make a copy of kf_ops to set its atomic_write_len.
+		 */
+		if (rft->max_write_len && rft->max_write_len != PAGE_SIZE) {
+			kf_ops = kmemdup(kf_ops, sizeof(*kf_ops), GFP_KERNEL);
+			if (!kf_ops) {
+				rdtgroup_exit_rftypes(rfts);
+				return -ENOMEM;
+			}
+			kf_ops->atomic_write_len = rft->max_write_len;
+		}
+
+		rft->kf_ops = kf_ops;
+	}
+
+	return 0;
+}
+
+/*
+ * rdtgroup_init - rdtgroup initialization
+ *
+ * Register rdtgroup filesystem, and initialize any subsystems that didn't
+ * request early init.
+ */
+int __init rdtgroup_init(void)
+{
+	int cpu;
+
+	WARN_ON(rdtgroup_init_rftypes(rdtgroup_root_base_files));
+
+	WARN_ON(rdtgroup_init_rftypes(res_info_files));
+	WARN_ON(rdtgroup_init_rftypes(info_files));
+
+	WARN_ON(rdtgroup_init_rftypes(rdtgroup_partition_base_files));
+	mutex_lock(&rdtgroup_mutex);
+
+	init_rdtgroup_root(&rdtgrp_dfl_root);
+	WARN_ON(rdtgroup_setup_root(&rdtgrp_dfl_root, 0));
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	WARN_ON(sysfs_create_mount_point(fs_kobj, "resctrl"));
+	WARN_ON(register_filesystem(&rdt_fs_type));
+	init_cache_domains();
+
+	INIT_LIST_HEAD(&rdtgroups);
+
+	for_each_online_cpu(cpu)
+		per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
+
+	return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (23 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 16:07   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 26/33] Task fork and exit for rdtgroup Fenghua Yu
                   ` (7 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

A new header file is created in include/linux/resctrl.h. It contains
defintions of rdtgroup_fork and rdtgroup_exit for x86. The functions
are empty for other architectures.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/resctrl.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 include/linux/resctrl.h

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
new file mode 100644
index 0000000..68dabc4
--- /dev/null
+++ b/include/linux/resctrl.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_RESCTRL_H
+#define _LINUX_RESCTRL_H
+
+#ifdef CONFIG_INTEL_RDT
+extern void rdtgroup_fork(struct task_struct *child);
+extern void rdtgroup_exit(struct task_struct *tsk);
+#else
+static inline void rdtgroup_fork(struct task_struct *child) {}
+static inline void rdtgroup_exit(struct task_struct *tsk) {}
+#endif /* CONFIG_X86 */
+
+#endif /* _LINUX_RESCTRL_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 26/33] Task fork and exit for rdtgroup
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (24 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 19:41   ` Thomas Gleixner
  2016-09-13 23:13   ` Dave Hansen
  2016-09-08  9:57 ` [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands Fenghua Yu
                   ` (6 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

When a task is forked, it inherites its parent rdtgroup. The task
can be moved to other rdtgroup during its run time.

When the task exits, it's deleted from it's current rdtgroup's task
list.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 22 ++++++++++++++++++++++
 kernel/exit.c                            |  2 ++
 kernel/fork.c                            |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 7842194..acea62c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -956,3 +956,25 @@ int __init rdtgroup_init(void)
 
 	return 0;
 }
+
+void rdtgroup_fork(struct task_struct *child)
+{
+	struct rdtgroup *rdtgrp;
+
+	INIT_LIST_HEAD(&child->rg_list);
+	if (!rdtgroup_mounted)
+		return;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	rdtgrp = current->rdtgroup;
+	if (!rdtgrp)
+		goto out;
+
+	list_add_tail(&child->rg_list, &rdtgrp->pset.tasks);
+	child->rdtgroup = rdtgrp;
+	atomic_inc(&rdtgrp->refcount);
+
+out:
+	mutex_unlock(&rdtgroup_mutex);
+}
diff --git a/kernel/exit.c b/kernel/exit.c
index 091a78b..270ede6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -54,6 +54,7 @@
 #include <linux/writeback.h>
 #include <linux/shm.h>
 #include <linux/kcov.h>
+#include <linux/resctrl.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -837,6 +838,7 @@ void do_exit(long code)
 	perf_event_exit_task(tsk);
 
 	cgroup_exit(tsk);
+	rdtgroup_exit(tsk);
 
 	/*
 	 * FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/fork.c b/kernel/fork.c
index beb3172..79bfc99 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/resctrl.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1426,6 +1427,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->io_context = NULL;
 	p->audit_context = NULL;
 	cgroup_fork(p);
+	rdtgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
 	if (IS_ERR(p->mempolicy)) {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (25 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 26/33] Task fork and exit for rdtgroup Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 20:09   ` Thomas Gleixner
  2016-09-08 22:04   ` Shaohua Li
  2016-09-08  9:57 ` [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus Fenghua Yu
                   ` (5 subsequent siblings)
  32 siblings, 2 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Four basic file system commands are implement for resctrl.
mount, umount, mkdir, and rmdir.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt_rdtgroup.h |   1 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c  | 211 ++++++++++++++++++++++++++++++
 2 files changed, 212 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt_rdtgroup.h b/arch/x86/include/asm/intel_rdt_rdtgroup.h
index 92208a2..43a3b83 100644
--- a/arch/x86/include/asm/intel_rdt_rdtgroup.h
+++ b/arch/x86/include/asm/intel_rdt_rdtgroup.h
@@ -10,6 +10,7 @@
 /* Defined in intel_rdt_rdtgroup.c.*/
 extern int __init rdtgroup_init(void);
 extern void rdtgroup_exit(struct task_struct *tsk);
+extern bool rdtgroup_mounted;
 
 /* Defined in intel_rdt.c. */
 extern struct list_head rdtgroup_lists;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index acea62c..71231ba 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -51,6 +51,13 @@ static int rdt_info_show(struct seq_file *seq, void *v);
 static int rdt_max_closid_show(struct seq_file *seq, void *v);
 static int rdt_max_cbm_len_show(struct seq_file *seq, void *v);
 static int domain_to_cache_id_show(struct seq_file *seq, void *v);
+static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+			umode_t mode);
+static int rdtgroup_rmdir(struct kernfs_node *kn);
+static struct dentry *rdt_mount(struct file_system_type *fs_type,
+			 int flags, const char *unused_dev_name,
+			 void *data);
+static void rdt_kill_sb(struct super_block *sb);
 
 /* rdtgroup core interface files */
 static struct rftype rdtgroup_root_base_files[] = {
@@ -112,12 +119,24 @@ static struct rftype rdtgroup_partition_base_files[] = {
 	},
 };
 
+static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
+	.mkdir          = rdtgroup_mkdir,
+	.rmdir          = rdtgroup_rmdir,
+};
+
+static struct file_system_type rdt_fs_type = {
+	.name = "resctrl",
+	.mount = rdt_mount,
+	.kill_sb = rdt_kill_sb,
+};
+
 struct rdtgroup *root_rdtgrp;
 static struct rftype rdtgroup_partition_base_files[];
 struct cache_domain cache_domains[MAX_CACHE_LEAVES];
 /* The default hierarchy. */
 struct rdtgroup_root rdtgrp_dfl_root;
 static struct list_head rdtgroups;
+bool rdtgroup_mounted;
 
 /*
  * kernfs_root - find out the kernfs_root a kernfs_node belongs to
@@ -730,6 +749,110 @@ static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
 	kernfs_remove(rdtgrp->kn);
 }
 
+static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+			umode_t mode)
+{
+	struct rdtgroup *parent, *rdtgrp;
+	struct rdtgroup_root *root;
+	struct kernfs_node *kn;
+	int ret;
+
+	if (parent_kn != root_rdtgrp->kn)
+		return -EPERM;
+
+	/* Do not accept '\n' to avoid unparsable situation.
+	 */
+	if (strchr(name, '\n'))
+		return -EINVAL;
+
+	parent = rdtgroup_kn_lock_live(parent_kn);
+	if (!parent)
+		return -ENODEV;
+	root = parent->root;
+
+	/* allocate the rdtgroup. */
+	rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
+	if (!rdtgrp) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	INIT_LIST_HEAD(&rdtgrp->pset.tasks);
+
+	cpumask_clear(&rdtgrp->cpu_mask);
+
+	rdtgrp->root = root;
+
+	/* create the directory */
+	kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto out_cancel_ref;
+	}
+	rdtgrp->kn = kn;
+
+	/*
+	 * This extra ref will be put in kernfs_remove() and guarantees
+	 * that @rdtgrp->kn is always accessible.
+	 */
+	kernfs_get(kn);
+
+	atomic_inc(&root->nr_rdtgrps);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
+
+	ret = rdtgroup_partition_populate_dir(kn);
+	if (ret)
+		goto out_destroy;
+
+	kernfs_activate(kn);
+
+	list_add_tail(&rdtgrp->rdtgroup_list, &rdtgroup_lists);
+	/* Generate default schema for rdtgrp. */
+	ret = get_default_resources(rdtgrp);
+	if (ret)
+		goto out_destroy;
+
+	ret = 0;
+	goto out_unlock;
+
+out_cancel_ref:
+	kfree(rdtgrp);
+out_unlock:
+	rdtgroup_kn_unlock(parent_kn);
+	return ret;
+
+out_destroy:
+	rdtgroup_destroy_locked(rdtgrp);
+	goto out_unlock;
+}
+
+static int rdtgroup_rmdir(struct kernfs_node *kn)
+{
+	struct rdtgroup *rdtgrp;
+	int cpu;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(kn);
+	if (!rdtgrp)
+		return -ENODEV;
+
+	if (!list_empty(&rdtgrp->pset.tasks)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for_each_cpu(cpu, &rdtgrp->cpu_mask)
+		per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
+
+	rdtgroup_destroy_locked(rdtgrp);
+
+out:
+	rdtgroup_kn_unlock(kn);
+	return ret;
+}
 static int
 rdtgroup_move_task_all(struct rdtgroup *src_rdtgrp, struct rdtgroup *dst_rdtgrp)
 {
@@ -978,3 +1101,91 @@ void rdtgroup_fork(struct task_struct *child)
 out:
 	mutex_unlock(&rdtgroup_mutex);
 }
+
+static struct dentry *rdt_mount(struct file_system_type *fs_type,
+			 int flags, const char *unused_dev_name,
+			 void *data)
+{
+	struct super_block *pinned_sb = NULL;
+	struct rdtgroup_root *root;
+	struct dentry *dentry;
+	int ret;
+	bool new_sb;
+
+	/*
+	 * The first time anyone tries to mount a rdtgroup, enable the list
+	 * linking tasks and fix up all existing tasks.
+	 */
+	if (rdtgroup_mounted)
+		return ERR_PTR(-EBUSY);
+
+	rdt_opts.cdp_enabled = false;
+	rdt_opts.verbose = false;
+	cdp_enabled = false;
+
+	ret = parse_rdtgroupfs_options(data);
+	if (ret)
+		goto out_mount;
+
+	if (rdt_opts.cdp_enabled) {
+		cdp_enabled = true;
+		cconfig.max_closid >>= cdp_enabled;
+		pr_info("CDP is enabled\n");
+	}
+
+	init_msrs(cdp_enabled);
+
+	root = &rdtgrp_dfl_root;
+
+	ret = get_default_resources(&root->rdtgrp);
+	if (ret)
+		return ERR_PTR(-ENOSPC);
+
+out_mount:
+	dentry = kernfs_mount(fs_type, flags, root->kf_root,
+			      RDTGROUP_SUPER_MAGIC,
+			      &new_sb);
+	if (IS_ERR(dentry) || !new_sb)
+		goto out_unlock;
+
+	/*
+	 * If @pinned_sb, we're reusing an existing root and holding an
+	 * extra ref on its sb.  Mount is complete.  Put the extra ref.
+	 */
+	if (pinned_sb) {
+		WARN_ON(new_sb);
+		deactivate_super(pinned_sb);
+	}
+
+	INIT_LIST_HEAD(&root->rdtgrp.pset.tasks);
+
+	cpumask_copy(&root->rdtgrp.cpu_mask, cpu_online_mask);
+	static_key_slow_inc(&rdt_enable_key);
+	rdtgroup_mounted = true;
+
+	return dentry;
+
+out_unlock:
+	return ERR_PTR(ret);
+}
+
+static void rdt_kill_sb(struct super_block *sb)
+{
+	mutex_lock(&rdtgroup_mutex);
+
+	rmdir_all_sub();
+
+	static_key_slow_dec(&rdt_enable_key);
+
+	release_root_closid();
+	root_rdtgrp->resource.valid = false;
+
+	/* Restore max_closid to original value. */
+	cconfig.max_closid <<= cdp_enabled;
+
+	kernfs_kill_sb(sb);
+	INIT_LIST_HEAD(&root_rdtgrp->pset.tasks);
+	rdtgroup_mounted = false;
+
+	mutex_unlock(&rdtgroup_mutex);
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (26 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 20:25   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write Fenghua Yu
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Normally each task is associated with one rdtgroup and we use the schema
for that rdtgroup whenever the task is running. The user can designate
some cpus to always use the same schema, regardless of which task is
running. To do that the user write a cpumask bit string to the "cpus"
file.

A cpu can only be listed in one rdtgroup. If the user specifies a cpu
that is currently assigned to a different rdtgroup, it is removed
from that rdtgroup.

See Documentation/x86/intel_rdt_ui.txt

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 92 ++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 71231ba..6c3161a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -59,6 +59,10 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 			 void *data);
 static void rdt_kill_sb(struct super_block *sb);
 
+static int rdtgroup_cpus_show(struct seq_file *s, void *v);
+static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
+			char *buf, size_t nbytes, loff_t off);
+
 /* rdtgroup core interface files */
 static struct rftype rdtgroup_root_base_files[] = {
 	{
@@ -1189,3 +1193,91 @@ static void rdt_kill_sb(struct super_block *sb)
 
 	mutex_unlock(&rdtgroup_mutex);
 }
+
+static int rdtgroup_cpus_show(struct seq_file *s, void *v)
+{
+	struct kernfs_open_file *of = s->private;
+	struct rdtgroup *rdtgrp;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp)
+		return -ENODEV;
+
+	seq_printf(s, "%*pb\n", cpumask_pr_args(&rdtgrp->cpu_mask));
+	rdtgroup_kn_unlock(of->kn);
+
+	return 0;
+}
+
+static int cpus_validate(struct cpumask *cpumask, struct rdtgroup *rdtgrp)
+{
+	int old_cpumask_bit, new_cpumask_bit;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		old_cpumask_bit = cpumask_test_cpu(cpu, &rdtgrp->cpu_mask);
+		new_cpumask_bit = cpumask_test_cpu(cpu, cpumask);
+		/* Cannot clear a "cpus" bit in a rdtgroup. */
+		if (old_cpumask_bit == 1 && new_cpumask_bit == 0)
+			return -EINVAL;
+	}
+
+	/* If a cpu is not online, cannot set it. */
+	for_each_cpu(cpu, cpumask) {
+		if (!cpu_online(cpu))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
+			char *buf, size_t nbytes, loff_t off)
+{
+	struct rdtgroup *rdtgrp;
+	unsigned long bitmap[BITS_TO_LONGS(NR_CPUS)];
+	struct cpumask *cpumask;
+	int cpu;
+	struct list_head *l;
+	struct rdtgroup *r;
+	int ret = 0;
+
+	if (!buf)
+		return -EINVAL;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp)
+		return -ENODEV;
+
+	if (list_empty(&rdtgroup_lists)) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	ret = __bitmap_parse(buf, strlen(buf), 0, bitmap, nr_cpu_ids);
+	if (ret)
+		goto end;
+
+	cpumask = to_cpumask(bitmap);
+	ret = cpus_validate(cpumask, rdtgrp);
+	if (ret)
+		goto end;
+
+	list_for_each(l, &rdtgroup_lists) {
+		r = list_entry(l, struct rdtgroup, rdtgroup_list);
+		if (r == rdtgrp)
+			continue;
+
+		for_each_cpu_and(cpu, &r->cpu_mask, cpumask)
+			cpumask_clear_cpu(cpu, &r->cpu_mask);
+	}
+
+	cpumask_copy(&rdtgrp->cpu_mask, cpumask);
+	for_each_cpu(cpu, cpumask)
+		per_cpu(cpu_rdtgroup, cpu) = rdtgrp;
+
+end:
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (27 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 20:50   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface Fenghua Yu
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

"tasks" file in rdtgroup contains task pids. User can move a task pid
to one directory. A task can only stay in one directory at the same
time.

Each rdtgroup contains a rg_list. When a pid is written to this
rdtgroup's tasks, the task's rg_list is added in the rdtgroup's
linked list and deleted from its previous rdtgroup's linked list.

When user reads the "tasks" file, all pids are shown in the order
from small to large.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 161 +++++++++++++++++++++++--------
 1 file changed, 120 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 6c3161a..16f7195 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -63,6 +63,9 @@ static int rdtgroup_cpus_show(struct seq_file *s, void *v);
 static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
 			char *buf, size_t nbytes, loff_t off);
 
+static int rdtgroup_tasks_show(struct seq_file *s, void *v);
+static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off);
 /* rdtgroup core interface files */
 static struct rftype rdtgroup_root_base_files[] = {
 	{
@@ -720,14 +723,41 @@ static __init void init_cache_domains(void)
 
 void rdtgroup_exit(struct task_struct *tsk)
 {
+	if (tsk->rdtgroup) {
+		atomic_dec(&tsk->rdtgroup->refcount);
+		tsk->rdtgroup = NULL;
+	}
+}
 
-	if (!list_empty(&tsk->rg_list)) {
-		struct rdtgroup *rdtgrp = tsk->rdtgroup;
+static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
+{
+	struct task_struct *p;
+	struct rdtgroup *this = r;
 
-		list_del_init(&tsk->rg_list);
-		tsk->rdtgroup = NULL;
-		atomic_dec(&rdtgrp->refcount);
+
+	if (r == root_rdtgrp)
+		return;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (p->rdtgroup == this)
+			seq_printf(s, "%d\n", p->pid);
 	}
+	rcu_read_unlock();
+}
+
+static int rdtgroup_tasks_show(struct seq_file *s, void *v)
+{
+	struct kernfs_open_file *of = s->private;
+	struct rdtgroup *rdtgrp;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (rdtgrp == NULL)
+		return -ENODEV;
+
+	show_rdt_tasks(rdtgrp, s);
+	rdtgroup_kn_unlock(of->kn);
+	return 0;
 }
 
 static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
@@ -781,8 +811,6 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 		goto out_unlock;
 	}
 
-	INIT_LIST_HEAD(&rdtgrp->pset.tasks);
-
 	cpumask_clear(&rdtgrp->cpu_mask);
 
 	rdtgrp->root = root;
@@ -843,7 +871,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 	if (!rdtgrp)
 		return -ENODEV;
 
-	if (!list_empty(&rdtgrp->pset.tasks)) {
+	if (atomic_read(&rdtgrp->refcount)) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -857,28 +885,6 @@ out:
 	rdtgroup_kn_unlock(kn);
 	return ret;
 }
-static int
-rdtgroup_move_task_all(struct rdtgroup *src_rdtgrp, struct rdtgroup *dst_rdtgrp)
-{
-	struct list_head *tasks;
-
-	tasks = &src_rdtgrp->pset.tasks;
-	while (!list_empty(tasks)) {
-		struct task_struct *tsk;
-		struct list_head *pos;
-		pid_t pid;
-		int ret;
-
-		pos = tasks->next;
-		tsk = list_entry(pos, struct task_struct, rg_list);
-		pid = tsk->pid;
-		ret = rdtgroup_move_task(pid, dst_rdtgrp, false, NULL);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
 
 /*
  * Forcibly remove all of subdirectories under root.
@@ -1088,22 +1094,16 @@ void rdtgroup_fork(struct task_struct *child)
 {
 	struct rdtgroup *rdtgrp;
 
-	INIT_LIST_HEAD(&child->rg_list);
+	child->rdtgroup = NULL;
 	if (!rdtgroup_mounted)
 		return;
 
-	mutex_lock(&rdtgroup_mutex);
-
 	rdtgrp = current->rdtgroup;
 	if (!rdtgrp)
-		goto out;
+		return;
 
-	list_add_tail(&child->rg_list, &rdtgrp->pset.tasks);
 	child->rdtgroup = rdtgrp;
 	atomic_inc(&rdtgrp->refcount);
-
-out:
-	mutex_unlock(&rdtgroup_mutex);
 }
 
 static struct dentry *rdt_mount(struct file_system_type *fs_type,
@@ -1161,8 +1161,6 @@ out_mount:
 		deactivate_super(pinned_sb);
 	}
 
-	INIT_LIST_HEAD(&root->rdtgrp.pset.tasks);
-
 	cpumask_copy(&root->rdtgrp.cpu_mask, cpu_online_mask);
 	static_key_slow_inc(&rdt_enable_key);
 	rdtgroup_mounted = true;
@@ -1188,7 +1186,6 @@ static void rdt_kill_sb(struct super_block *sb)
 	cconfig.max_closid <<= cdp_enabled;
 
 	kernfs_kill_sb(sb);
-	INIT_LIST_HEAD(&root_rdtgrp->pset.tasks);
 	rdtgroup_mounted = false;
 
 	mutex_unlock(&rdtgroup_mutex);
@@ -1281,3 +1278,85 @@ end:
 
 	return ret ?: nbytes;
 }
+
+static int _rdtgroup_move_task(struct task_struct *tsk, struct rdtgroup *rdtgrp)
+{
+	if (tsk->rdtgroup)
+		atomic_dec(&tsk->rdtgroup->refcount);
+
+	if (rdtgrp == root_rdtgrp)
+		tsk->rdtgroup = NULL;
+	else
+		tsk->rdtgroup = rdtgrp;
+
+	atomic_inc(&rdtgrp->refcount);
+
+	return 0;
+}
+
+static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
+			      bool threadgroup, struct kernfs_open_file *of)
+{
+	struct task_struct *tsk;
+	int ret;
+
+	rcu_read_lock();
+	if (pid) {
+		tsk = find_task_by_vpid(pid);
+		if (!tsk) {
+			ret = -ESRCH;
+			goto out_unlock_rcu;
+		}
+	} else {
+		tsk = current;
+	}
+
+	if (threadgroup)
+		tsk = tsk->group_leader;
+
+	get_task_struct(tsk);
+	rcu_read_unlock();
+
+	ret = rdtgroup_procs_write_permission(tsk, of);
+	if (!ret)
+		_rdtgroup_move_task(tsk, rdtgrp);
+
+	put_task_struct(tsk);
+	goto out_unlock_threadgroup;
+
+out_unlock_rcu:
+	rcu_read_unlock();
+out_unlock_threadgroup:
+	return ret;
+}
+
+ssize_t _rdtgroup_procs_write(struct rdtgroup *rdtgrp,
+			   struct kernfs_open_file *of, char *buf,
+			   size_t nbytes, loff_t off, bool threadgroup)
+{
+	pid_t pid;
+	int ret;
+
+	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+		return -EINVAL;
+
+	ret = rdtgroup_move_task(pid, rdtgrp, threadgroup, of);
+
+	return ret ?: nbytes;
+}
+
+static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct rdtgroup *rdtgrp;
+	int ret;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp)
+		return -ENODEV;
+
+	ret = _rdtgroup_procs_write(rdtgrp, of, buf, nbytes, off, false);
+
+	rdtgroup_kn_unlock(of->kn);
+	return ret;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (28 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 22:20   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT Fenghua Yu
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

There is one "schemata" file in each rdtgroup directory. User can input
schemata in the file to control how to allocate resources.

The input schemata first needs to pass validation. If there is no syntax
issue, kernel digests the input schemata and find CLOSID for each
domain for each resource.

A shared domain covers a few different resource domains which share
the same CLOSID. Kernel will find a CLOSID in each shared domain. If
an existing CLOSID and its CBMs match input schemata, the CLOSID is
shared by this rdtgroup. Otherwise, kernel tries to alloc a new
CLOSID for this rdtgroup. If a new CLOSID is available, update QoS MASK
MSRs. If no more CLOSID is available, kernel report ENODEV to user.

A shared domain is in preparation for multiple resources (like L2)
that will be added very soon.

User can read the schemata saved in the file.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/intel_rdt_rdtgroup.h |   6 +
 arch/x86/kernel/cpu/intel_rdt_schemata.c  | 674 ++++++++++++++++++++++++++++++
 2 files changed, 680 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_schemata.c

diff --git a/arch/x86/include/asm/intel_rdt_rdtgroup.h b/arch/x86/include/asm/intel_rdt_rdtgroup.h
index 43a3b83..782513e 100644
--- a/arch/x86/include/asm/intel_rdt_rdtgroup.h
+++ b/arch/x86/include/asm/intel_rdt_rdtgroup.h
@@ -17,6 +17,12 @@ extern struct list_head rdtgroup_lists;
 extern struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
 extern void rdtgroup_kn_unlock(struct kernfs_node *kn);
 
+/* Defiend in intel_rdt_schemata.c. */
+extern int get_default_resources(struct rdtgroup *rdtgrp);
+extern ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off);
+extern int rdtgroup_schemata_show(struct seq_file *s, void *v);
+
 /* cftype->flags */
 enum {
 	RFTYPE_WORLD_WRITABLE = (1 << 4),/* (DON'T USE FOR NEW FILES) S_IWUGO */
diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
new file mode 100644
index 0000000..4e624f0
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -0,0 +1,674 @@
+#include <linux/slab.h>
+#include <asm/intel_rdt_rdtgroup.h>
+
+struct resources {
+	struct cache_resource *l3;
+};
+
+static int get_res_type(char **res, enum resource_type *res_type)
+{
+	char *tok;
+
+	tok = strsep(res, ":");
+	if (tok == NULL)
+		return -EINVAL;
+
+	if (!strcmp(tok, "L3")) {
+		*res_type = RESOURCE_L3;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static int divide_resources(char *buf, char *resources[RESOURCE_NUM])
+{
+	char *tok;
+	unsigned int resource_num = 0;
+	int ret = 0;
+	char *res;
+	char *res_block;
+	size_t size;
+	enum resource_type res_type;
+
+	size = strlen(buf) + 1;
+	res = kzalloc(size, GFP_KERNEL);
+	if (!res) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	while ((tok = strsep(&buf, "\n")) != NULL) {
+		if (strlen(tok) == 0)
+			break;
+		if (resource_num++ >= 1) {
+			pr_info("More than one line of resource input!\n");
+			ret = -EINVAL;
+			goto out;
+		}
+		strcpy(res, tok);
+	}
+
+	res_block = res;
+	ret = get_res_type(&res_block, &res_type);
+	if (ret) {
+		pr_info("Unknown resource type!");
+		goto out;
+	}
+
+	if (res_block == NULL) {
+		pr_info("Invalid resource value!");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (res_type == RESOURCE_L3 && cat_enabled(CACHE_LEVEL3)) {
+		strcpy(resources[RESOURCE_L3], res_block);
+	} else {
+		pr_info("Invalid resource type!");
+		goto out;
+	}
+
+	ret = 0;
+
+out:
+	kfree(res);
+	return ret;
+}
+
+static bool cbm_validate(unsigned long var, int level)
+{
+	u32 maxcbmlen = max_cbm_len(level);
+	unsigned long first_bit, zero_bit;
+
+	if (bitmap_weight(&var, maxcbmlen) < min_bitmask_len)
+		return false;
+
+	if (var & ~max_cbm(level))
+		return false;
+
+	first_bit = find_first_bit(&var, maxcbmlen);
+	zero_bit = find_next_zero_bit(&var, maxcbmlen, first_bit);
+
+	if (find_next_bit(&var, maxcbmlen, zero_bit) < maxcbmlen)
+		return false;
+
+	return true;
+}
+
+static int get_input_cbm(char *tok, struct cache_resource *l,
+			 int input_domain_num, int level)
+{
+	int ret;
+
+	if (!cdp_enabled) {
+		if (tok == NULL)
+			return -EINVAL;
+
+		ret = kstrtoul(tok, 16,
+			       (unsigned long *)&l->cbm[input_domain_num]);
+		if (ret)
+			return ret;
+
+		if (!cbm_validate(l->cbm[input_domain_num], level))
+			return -EINVAL;
+	} else  {
+		char *input_cbm1_str;
+
+		input_cbm1_str = strsep(&tok, ",");
+		if (input_cbm1_str == NULL || tok == NULL)
+			return -EINVAL;
+
+		ret = kstrtoul(input_cbm1_str, 16,
+			       (unsigned long *)&l->cbm[input_domain_num]);
+		if (ret)
+			return ret;
+
+		if (!cbm_validate(l->cbm[input_domain_num], level))
+			return -EINVAL;
+
+		ret = kstrtoul(tok, 16,
+			       (unsigned long *)&l->cbm2[input_domain_num]);
+		if (ret)
+			return ret;
+
+		if (!cbm_validate(l->cbm2[input_domain_num], level))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int get_cache_schema(char *buf, struct cache_resource *l, int level,
+			 struct rdtgroup *rdtgrp)
+{
+	char *tok, *tok_cache_id;
+	int ret;
+	int domain_num;
+	int input_domain_num;
+	int len;
+	unsigned int input_cache_id;
+	unsigned int cid;
+	unsigned int leaf;
+
+	if (!cat_enabled(level) && strcmp(buf, ";")) {
+		pr_info("Disabled resource should have empty schema\n");
+		return -EINVAL;
+	}
+
+	len = strlen(buf);
+	/*
+	 * Translate cache id based cbm from one line string with format
+	 * "<cache prefix>:<cache id0>=xxxx;<cache id1>=xxxx;..." for
+	 * disabled cdp.
+	 * Or
+	 * "<cache prefix>:<cache id0>=xxxxx,xxxxx;<cache id1>=xxxxx,xxxxx;..."
+	 * for enabled cdp.
+	 */
+	input_domain_num = 0;
+	while ((tok = strsep(&buf, ";")) != NULL) {
+		tok_cache_id = strsep(&tok, "=");
+		if (tok_cache_id == NULL)
+			goto cache_id_err;
+
+		ret = kstrtouint(tok_cache_id, 16, &input_cache_id);
+		if (ret)
+			goto cache_id_err;
+
+		leaf = level_to_leaf(level);
+		cid = cache_domains[leaf].shared_cache_id[input_domain_num];
+		if (input_cache_id != cid)
+			goto cache_id_err;
+
+		ret = get_input_cbm(tok, l, input_domain_num, level);
+		if (ret)
+			goto cbm_err;
+
+		input_domain_num++;
+		if (input_domain_num > get_domain_num(level)) {
+			pr_info("domain number is more than max %d\n",
+				MAX_CACHE_DOMAINS);
+			return -EINVAL;
+		}
+	}
+
+	domain_num = get_domain_num(level);
+	if (domain_num != input_domain_num) {
+		pr_info("%s input domain number %d doesn't match domain number %d\n",
+			"l3",
+			input_domain_num, domain_num);
+
+		return -EINVAL;
+	}
+
+	return 0;
+
+cache_id_err:
+	pr_info("Invalid cache id in field %d for L%1d\n", input_domain_num,
+		level);
+	return -EINVAL;
+
+cbm_err:
+	pr_info("Invalid cbm in field %d for cache L%d\n",
+		input_domain_num, level);
+	return -EINVAL;
+}
+
+static bool cbm_found(struct cache_resource *l, struct rdtgroup *r,
+		      int domain, int level)
+{
+	int closid;
+	int l3_domain;
+	u64 cctable_cbm;
+	u64 cbm;
+	int dindex;
+
+	closid = r->resource.closid[domain];
+
+	if (level == CACHE_LEVEL3) {
+		l3_domain = shared_domain[domain].l3_domain;
+		cbm = l->cbm[l3_domain];
+		dindex = get_dcbm_table_index(closid);
+		cctable_cbm = l3_cctable[l3_domain][dindex].cbm;
+		if (cdp_enabled) {
+			u64 icbm;
+			u64 cctable_icbm;
+			int iindex;
+
+			icbm = l->cbm2[l3_domain];
+			iindex = get_icbm_table_index(closid);
+			cctable_icbm = l3_cctable[l3_domain][iindex].cbm;
+
+			return cbm == cctable_cbm && icbm == cctable_icbm;
+		}
+
+		return cbm == cctable_cbm;
+	}
+
+	return false;
+}
+
+enum {
+	CURRENT_CLOSID,
+	REUSED_OWN_CLOSID,
+	REUSED_OTHER_CLOSID,
+	NEW_CLOSID,
+};
+
+/*
+ * Check if the reference counts are all ones in rdtgrp's domain.
+ */
+static bool one_refcnt(struct rdtgroup *rdtgrp, int domain)
+{
+	int refcnt;
+	int closid;
+
+	closid = rdtgrp->resource.closid[domain];
+	if (cat_l3_enabled) {
+		int l3_domain;
+		int dindex;
+
+		l3_domain = shared_domain[domain].l3_domain;
+		dindex = get_dcbm_table_index(closid);
+		refcnt = l3_cctable[l3_domain][dindex].clos_refcnt;
+		if (refcnt != 1)
+			return false;
+
+		if (cdp_enabled) {
+			int iindex;
+
+			iindex = get_icbm_table_index(closid);
+			refcnt = l3_cctable[l3_domain][iindex].clos_refcnt;
+
+			if (refcnt != 1)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Go through all shared domains. Check if there is an existing closid
+ * in all rdtgroups that matches l3 cbms in the shared
+ * domain. If find one, reuse the closid. Otherwise, allocate a new one.
+ */
+static int get_rdtgroup_resources(struct resources *resources_set,
+				  struct rdtgroup *rdtgrp)
+{
+	struct cache_resource *l3;
+	bool l3_cbm_found;
+	struct list_head *l;
+	struct rdtgroup *r;
+	u64 cbm;
+	int rdt_closid[MAX_CACHE_DOMAINS];
+	int rdt_closid_type[MAX_CACHE_DOMAINS];
+	int domain;
+	int closid;
+	int ret;
+
+	l3 = resources_set->l3;
+	memcpy(rdt_closid, rdtgrp->resource.closid,
+	       shared_domain_num * sizeof(int));
+	for (domain = 0; domain < shared_domain_num; domain++) {
+		if (rdtgrp->resource.valid) {
+			/*
+			 * If current rdtgrp is the only user of cbms in
+			 * this domain, will replace the cbms with the input
+			 * cbms and reuse its own closid.
+			 */
+			if (one_refcnt(rdtgrp, domain)) {
+				closid = rdtgrp->resource.closid[domain];
+				rdt_closid[domain] = closid;
+				rdt_closid_type[domain] = REUSED_OWN_CLOSID;
+				continue;
+			}
+
+			l3_cbm_found = true;
+
+			if (cat_l3_enabled)
+				l3_cbm_found = cbm_found(l3, rdtgrp, domain,
+							 CACHE_LEVEL3);
+
+			/*
+			 * If the cbms in this shared domain are already
+			 * existing in current rdtgrp, record the closid
+			 * and its type.
+			 */
+			if (l3_cbm_found) {
+				closid = rdtgrp->resource.closid[domain];
+				rdt_closid[domain] = closid;
+				rdt_closid_type[domain] = CURRENT_CLOSID;
+				continue;
+			}
+		}
+
+		/*
+		 * If the cbms are not found in this rdtgrp, search other
+		 * rdtgroups and see if there are matched cbms.
+		 */
+		l3_cbm_found = cat_l3_enabled ? false : true;
+		list_for_each(l, &rdtgroup_lists) {
+			r = list_entry(l, struct rdtgroup, rdtgroup_list);
+			if (r == rdtgrp || !r->resource.valid)
+				continue;
+
+			if (cat_l3_enabled)
+				l3_cbm_found = cbm_found(l3, r, domain,
+							 CACHE_LEVEL3);
+
+			if (l3_cbm_found) {
+				/* Get the closid that matches l3 cbms.*/
+				closid = r->resource.closid[domain];
+				rdt_closid[domain] = closid;
+				rdt_closid_type[domain] = REUSED_OTHER_CLOSID;
+				break;
+			}
+		}
+
+		if (!l3_cbm_found) {
+			/*
+			 * If no existing closid is found, allocate
+			 * a new one.
+			 */
+			ret = closid_alloc(&closid, domain);
+			if (ret)
+				goto err;
+			rdt_closid[domain] = closid;
+			rdt_closid_type[domain] = NEW_CLOSID;
+		}
+	}
+
+	/*
+	 * Now all closid are ready in rdt_closid. Update rdtgrp's closid.
+	 */
+	for_each_cache_domain(domain, 0, shared_domain_num) {
+		/*
+		 * Nothing is changed if the same closid and same cbms were
+		 * found in this rdtgrp's domain.
+		 */
+		if (rdt_closid_type[domain] == CURRENT_CLOSID)
+			continue;
+
+		/*
+		 * Put rdtgroup closid. No need to put the closid if we
+		 * just change cbms and keep the closid (REUSED_OWN_CLOSID).
+		 */
+		if (rdtgrp->resource.valid &&
+		    rdt_closid_type[domain] != REUSED_OWN_CLOSID) {
+			/* Put old closid in this rdtgrp's domain if valid. */
+			closid = rdtgrp->resource.closid[domain];
+			closid_put(closid, domain);
+		}
+
+		/*
+		 * Replace the closid in this rdtgrp's domain with saved
+		 * closid that was newly allocted (NEW_CLOSID), or found in
+		 * another rdtgroup's domains (REUSED_CLOSID), or found in
+		 * this rdtgrp (REUSED_OWN_CLOSID).
+		 */
+		closid = rdt_closid[domain];
+		rdtgrp->resource.closid[domain] = closid;
+
+		/*
+		 * Get the reused other rdtgroup's closid. No need to get the
+		 * closid newly allocated (NEW_CLOSID) because it's been
+		 * already got in closid_alloc(). And no need to get the closid
+		 * for resued own closid (REUSED_OWN_CLOSID).
+		 */
+		if (rdt_closid_type[domain] == REUSED_OTHER_CLOSID)
+			closid_get(closid, domain);
+
+		/*
+		 * If the closid comes from a newly allocated closid
+		 * (NEW_CLOSID), or found in this rdtgrp (REUSED_OWN_CLOSID),
+		 * cbms for this closid will be updated in MSRs.
+		 */
+		if (rdt_closid_type[domain] == NEW_CLOSID ||
+		    rdt_closid_type[domain] == REUSED_OWN_CLOSID) {
+			/*
+			 * Update cbm in cctable with the newly allocated
+			 * closid.
+			 */
+			if (cat_l3_enabled) {
+				int cpu;
+				struct cpumask *mask;
+				int dindex;
+				int l3_domain = shared_domain[domain].l3_domain;
+				int leaf = level_to_leaf(CACHE_LEVEL3);
+
+				cbm = l3->cbm[l3_domain];
+				dindex = get_dcbm_table_index(closid);
+				l3_cctable[l3_domain][dindex].cbm = cbm;
+				if (cdp_enabled) {
+					int iindex;
+
+					cbm = l3->cbm2[l3_domain];
+					iindex = get_icbm_table_index(closid);
+					l3_cctable[l3_domain][iindex].cbm = cbm;
+				}
+
+				mask =
+				&cache_domains[leaf].shared_cpu_map[l3_domain];
+
+				cpu = cpumask_first(mask);
+				smp_call_function_single(cpu, cbm_update_l3_msr,
+							 &closid, 1);
+			}
+		}
+	}
+
+	rdtgrp->resource.valid = true;
+
+	return 0;
+err:
+	/* Free previously allocated closid. */
+	for_each_cache_domain(domain, 0, shared_domain_num) {
+		if (rdt_closid_type[domain] != NEW_CLOSID)
+			continue;
+
+		closid_put(rdt_closid[domain], domain);
+
+	}
+
+	return ret;
+}
+
+static void init_cache_resource(struct cache_resource *l)
+{
+	l->cbm = NULL;
+	l->cbm2 = NULL;
+	l->closid = NULL;
+	l->refcnt = NULL;
+}
+
+static void free_cache_resource(struct cache_resource *l)
+{
+	kfree(l->cbm);
+	kfree(l->cbm2);
+	kfree(l->closid);
+	kfree(l->refcnt);
+}
+
+static int alloc_cache_resource(struct cache_resource *l, int level)
+{
+	int domain_num = get_domain_num(level);
+
+	l->cbm = kcalloc(domain_num, sizeof(*l->cbm), GFP_KERNEL);
+	l->cbm2 = kcalloc(domain_num, sizeof(*l->cbm2), GFP_KERNEL);
+	l->closid = kcalloc(domain_num, sizeof(*l->closid), GFP_KERNEL);
+	l->refcnt = kcalloc(domain_num, sizeof(*l->refcnt), GFP_KERNEL);
+	if (l->cbm && l->cbm2 && l->closid && l->refcnt)
+		return 0;
+
+	return -ENOMEM;
+}
+
+/*
+ * This function digests schemata given in text buf. If the schemata are in
+ * right format and there is enough closid, input the schemata in rdtgrp
+ * and update resource cctables.
+ *
+ * Inputs:
+ *	buf: string buffer containing schemata
+ *	rdtgrp: current rdtgroup holding schemata.
+ *
+ * Return:
+ *	0 on success or error code.
+ */
+static int get_resources(char *buf, struct rdtgroup *rdtgrp)
+{
+	char *resources[RESOURCE_NUM];
+	struct cache_resource l3;
+	struct resources resources_set;
+	int ret;
+	char *resources_block;
+	int i;
+	int size = strlen(buf) + 1;
+
+	resources_block = kcalloc(RESOURCE_NUM, size, GFP_KERNEL);
+	if (!resources_block)
+		return -ENOMEM;
+
+	for (i = 0; i < RESOURCE_NUM; i++)
+		resources[i] = (char *)(resources_block + i * size);
+
+	ret = divide_resources(buf, resources);
+	if (ret) {
+		kfree(resources_block);
+		return -EINVAL;
+	}
+
+	init_cache_resource(&l3);
+
+	if (cat_l3_enabled) {
+		ret = alloc_cache_resource(&l3, CACHE_LEVEL3);
+		if (ret)
+			goto out;
+
+		ret = get_cache_schema(resources[RESOURCE_L3], &l3,
+				       CACHE_LEVEL3, rdtgrp);
+		if (ret)
+			goto out;
+
+		resources_set.l3 = &l3;
+	} else
+		resources_set.l3 = NULL;
+
+	ret = get_rdtgroup_resources(&resources_set, rdtgrp);
+
+out:
+	kfree(resources_block);
+	free_cache_resource(&l3);
+
+	return ret;
+}
+
+static void gen_cache_prefix(char *buf, int level)
+{
+	sprintf(buf, "L%1d:", level == CACHE_LEVEL3 ? 3 : 2);
+}
+
+static int get_cache_id(int domain, int level)
+{
+	return cache_domains[level_to_leaf(level)].shared_cache_id[domain];
+}
+
+static void gen_cache_buf(char *buf, int level)
+{
+	int domain;
+	char buf1[32];
+	int domain_num;
+	u64 val;
+
+	gen_cache_prefix(buf, level);
+
+	domain_num = get_domain_num(level);
+
+	val = max_cbm(level);
+
+	for (domain = 0; domain < domain_num; domain++) {
+		sprintf(buf1, "%d=%lx", get_cache_id(domain, level),
+			(unsigned long)val);
+		strcat(buf, buf1);
+		if (cdp_enabled) {
+			sprintf(buf1, ",%lx", (unsigned long)val);
+			strcat(buf, buf1);
+		}
+		if (domain < domain_num - 1)
+			strcat(buf, ";");
+		else
+			strcat(buf, "\n");
+	}
+}
+
+/*
+ * Set up default schemata in a rdtgroup. All schemata in all resources are
+ * default values (all 1's) for all domains.
+ *
+ * Input: rdtgroup.
+ * Return: 0: successful
+ *	   non-0: error code
+ */
+int get_default_resources(struct rdtgroup *rdtgrp)
+{
+	char schema[1024];
+	int ret = 0;
+
+	if (cat_enabled(CACHE_LEVEL3)) {
+		gen_cache_buf(schema, CACHE_LEVEL3);
+
+		if (strlen(schema)) {
+			ret = get_resources(schema, rdtgrp);
+			if (ret)
+				return ret;
+		}
+		gen_cache_buf(rdtgrp->schema, CACHE_LEVEL3);
+	}
+
+	return ret;
+}
+
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+			char *buf, size_t nbytes, loff_t off)
+{
+	int ret = 0;
+	struct rdtgroup *rdtgrp;
+	char *schema;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp)
+		return -ENODEV;
+
+	schema = kzalloc(sizeof(char) * strlen(buf) + 1, GFP_KERNEL);
+	if (!schema) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	memcpy(schema, buf, strlen(buf) + 1);
+
+	ret = get_resources(buf, rdtgrp);
+	if (ret)
+		goto out;
+
+	memcpy(rdtgrp->schema, schema, strlen(schema) + 1);
+
+out:
+	kfree(schema);
+
+out_unlock:
+	rdtgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
+int rdtgroup_schemata_show(struct seq_file *s, void *v)
+{
+	struct kernfs_open_file *of = s->private;
+	struct rdtgroup *rdtgrp;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	seq_printf(s, "%s", rdtgrp->schema);
+	rdtgroup_kn_unlock(of->kn);
+	return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (29 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 22:25   ` Thomas Gleixner
  2016-09-08  9:57 ` [PATCH v2 32/33] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu
  2016-09-08  9:57 ` [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c Fenghua Yu
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Add kernel parameter "resctrl" for CAT L3:

resctrl=disable_cat_l3: disable CAT L3
resctrl=simulate_cat_l3: simulate CAT L3

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 Documentation/kernel-parameters.txt | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a4f4d69..1240a4f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3692,6 +3692,19 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			Memory area to be used by remote processor image,
 			managed by CMA.
 
+	resctrl=	[X86] Resource control
+			Format:
+			[disable_cat_l3][,simulate_cat_l3]
+
+			disable_cat_l3  - Disable CAT L3. By default, CAT L3
+					  is enabled.
+
+			simulate_cat_l3 - Simulate CAT L3 on a machine that
+					  doesn't have the feature. In the
+					  simulation, max closid is 16 and
+					  max cbm lenghth is 20, and host
+					  machine's cache hierarchy is used.
+
 	rw		[KNL] Mount root device read-write on boot
 
 	S		[KNL] Run init in single mode
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 32/33] MAINTAINERS: Add maintainer for Intel RDT resource allocation
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (30 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08  9:57 ` [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c Fenghua Yu
  32 siblings, 0 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

We create six new files for Intel RDT resource allocation:
arch/x86/kernel/cpu/intel_rdt.c
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
arch/x86/include/asm/intel_rdt.h
arch/x86/include/asm/intel_rdt_rdtgroup.h
Documentation/x86/intel_rdt.txt
Documentation/x86/intel_rdt_ui.txt

Add maintainer in MAINTAINERS to maintain the files.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 MAINTAINERS | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index db814a8..de9e47a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9833,6 +9833,15 @@ L:	linux-rdma@vger.kernel.org
 S:	Supported
 F:	drivers/infiniband/sw/rdmavt
 
+RDT - RESOURCE ALLOCATION
+M:	Fenghua Yu <fenghua.yu@intel.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	arch/x86/kernel/cpu/intel_rdt*
+F:	arch/x86/include/asm/intel_rdt*
+F:	Documentation/x86/intel_rdt*
+F:	include/linux/resctrl.h
+
 READ-COPY UPDATE (RCU)
 M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
 M:	Josh Triplett <josh@joshtriplett.org>
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c
  2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
                   ` (31 preceding siblings ...)
  2016-09-08  9:57 ` [PATCH v2 32/33] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu
@ 2016-09-08  9:57 ` Fenghua Yu
  2016-09-08 23:59   ` Thomas Gleixner
  32 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-08  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86, Fenghua Yu

From: Fenghua Yu <fenghua.yu@intel.com>

Build the user interface file intel_rdt_rdtgroup.c.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..4d7e0e5 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,6 +34,8 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
+obj-$(CONFIG_INTEL_RDT)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
+
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
 obj-$(CONFIG_MICROCODE)			+= microcode/
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation
  2016-09-08  9:57 ` [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
@ 2016-09-08 10:03   ` Thomas Gleixner
  2016-09-13 18:18   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:03 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +/*
> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> + * which are one per CLOSid on the current package.
> + */
> +static void cbm_update_msrs(void *dummy)
> +{
> +	int maxid = boot_cpu_data.x86_cache_max_closid;
> +	struct rdt_remote_data info;
> +	unsigned int i;
> +
> +	for (i = 0; i < maxid; i++) {
> +		if (cctable[i].clos_refcnt) {
> +			info.msr = CBM_FROM_INDEX(i);
> +			info.val = cctable[i].cbm;
> +			msr_cpu_update(&info);
> +		}
> +	}
> +}
> +
> +static int intel_rdt_online_cpu(unsigned int cpu)
> +{
> +	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> +
> +	state->closid = 0;
> +	mutex_lock(&rdtgroup_mutex);
> +	/* The cpu is set in root rdtgroup after online. */
> +	cpumask_set_cpu(cpu, &root_rdtgrp->cpu_mask);
> +	per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
> +	/*
> +	 * If the cpu is first time found and set in its siblings that

-ENOPARSE

> +	 * share the same cache, update the CBM MSRs for the cache.
> +	 */
> +	if (rdt_cpumask_update(cpu))
> +		smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);

This smp_call_function() is a pointless exercise. online callbacks are
guaranteed to run on @cpu.

> +	mutex_unlock(&rdtgroup_mutex);
> +}
> +
> +static int clear_rdtgroup_cpumask(unsigned int cpu)
> +{
> +	struct list_head *l;
> +	struct rdtgroup *r;
> +
> +	list_for_each(l, &rdtgroup_lists) {
> +		r = list_entry(l, struct rdtgroup, rdtgroup_list);
> +		if (cpumask_test_cpu(cpu, &r->cpu_mask)) {
> +			cpumask_clear_cpu(cpu, &r->cpu_mask);
> +			return 0;
> +		}
> +	}
> +
> +	return -EINVAL;

What's the point of the return value if it gets ignored anyway.

> +}
> +
> +static int intel_rdt_offline_cpu(unsigned int cpu)
> +{
> +	int i;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> +		mutex_unlock(&rdtgroup_mutex);
> +		return;
> +	}
> +
> +	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
> +	cpumask_clear_cpu(cpu, &tmp_cpumask);
> +	i = cpumask_any(&tmp_cpumask);
> +
> +	if (i < nr_cpu_ids)
> +		cpumask_set_cpu(i, &rdt_cpumask);
> +
> +	clear_rdtgroup_cpumask(cpu);
> +	mutex_unlock(&rdtgroup_mutex);
> +}
> +
>  static int __init intel_rdt_late_init(void)
>  {
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
> @@ -169,6 +247,13 @@ static int __init intel_rdt_late_init(void)
>  
>  	for_each_online_cpu(i)
>  		rdt_cpumask_update(i);
> +
> +	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +				"AP_INTEL_RDT_ONLINE",
> +				intel_rdt_online_cpu, intel_rdt_offline_cpu);

Why are you using nocalls() here? cpuhp_setup_state() will invoke
intel_rdt_online_cpu() on every online cpu.

And you just call rdt_cpumask_update() for each cpu. What is doing the rest
of the cpu initialization (cpu_rdtgroup, root_rtgroup->cpu_mask) ????

> +	if (err < 0)
> +		goto out_err;

Oh well.....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration
  2016-09-08  9:57 ` [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
@ 2016-09-08 10:08   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:08 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
>  /*
> + * Minimum bits required in Cache bitmask.
> + */
> +unsigned int min_bitmask_len = 1;

Global variable w/o an corresponding declaration in a header file?

> +/*
>   * Mask of CPUs for writing CBM values. We only need one CPU per-socket.
>   */
>  static cpumask_t rdt_cpumask;
> @@ -51,6 +55,42 @@ struct rdt_remote_data {
>  	u64 val;
>  };
>  
> +/*
> + * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
> + * as it does not have CPUID enumeration support for Cache allocation.
> + *
> + * Probes by writing to the high 32 bits(CLOSid) of the IA32_PQR_MSR and
> + * testing if the bits stick. Max CLOSids is always 4 and max cbm length
> + * is always 20 on hsw server parts. The minimum cache bitmask length
> + * allowed for HSW server is always 2 bits. Hardcode all of them.
> + */
> +static inline bool cache_alloc_hsw_probe(void)
> +{
> +	u32 l, h_old, h_new, h_tmp;
> +
> +	if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
> +		return false;
> +
> +	/*
> +	 * Default value is always 0 if feature is present.
> +	 */
> +	h_tmp = h_old ^ 0x1U;
> +	if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
> +	    rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
> +		return false;
> +
> +	if (h_tmp != h_new)
> +		return false;
> +
> +	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
> +
> +	boot_cpu_data.x86_cache_max_closid = 4;
> +	boot_cpu_data.x86_cache_max_cbm_len = 20;
> +	min_bitmask_len = 2;

So min_bitmask_len gets updated here, but it's not used anywhere. Neither
that cache_alloc_hsw_probe() function is used ....

> +	return true;
> +}
> +
>  void __intel_rdt_sched_in(void *dummy)
>  {
>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> @@ -225,9 +265,6 @@ static int __init intel_rdt_late_init(void)
>  	u32 maxid;
>  	int err = 0, size, i;
>  
> -	if (!cpu_has(c, X86_FEATURE_CAT_L3))
> -		return -ENODEV;

And we now initialize the thing unconditionally no matter whether the
feature is available or not. Interesting.

The changelog does tell a different story than the patch ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 13/33] Define CONFIG_INTEL_RDT
  2016-09-08  9:57 ` [PATCH v2 13/33] Define CONFIG_INTEL_RDT Fenghua Yu
@ 2016-09-08 10:14   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:14 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> From: Vikas Shivappa <vikas.shivappa@linux.intel.com>

This changelog is still utter crap.

> CONFIG_INTEL_RDT is defined.

We know already from $subject that CONFIG_INTEL_RDT is introduced.

>                              The option provides support for resource
> allocation which is a sub-feature of Intel Resource Director Technology
> (RDT).

The config switch says:
 
> +config INTEL_RDT
> +	bool "Intel Resource Director Technology support"

So what? Is this now resource allocation or resource director or does the
actual stuff introduce something completely different?

> +	default n
> +	depends on X86_64 && CPU_SUP_INTEL

Why is this 64bit only? The changelog and/or the help text should tell
that.

> +	help
> +	  This option provides support for resource allocation which is a
> +	  sub-feature of Intel Resource Director Technology(RDT).

Sure and you repeat the changelog nonsense here again.

> +	  Current implementation supports L3 cache allocation.
> +	  Using this feature a user can specify the amount of L3 cache space
> +	  into which an application can fill.

Sigh

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization
  2016-09-08  9:57 ` [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization Fenghua Yu
@ 2016-09-08 10:18   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:18 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
>  
> +struct clos_config {
> +	unsigned long *closmap;
> +	u32 max_closid;
> +	u32 closids_used;
> +};

Another badly formatted and undocumented structure

> +struct clos_config cconfig;
> +bool cdp_enabled;

Once more global variables without a declaration in a header and no user
outside of this file.

> +#define __DCBM_TABLE_INDEX(x)	(x << 1)
> +#define __ICBM_TABLE_INDEX(x)	((x << 1) + 1)
>  
>  struct rdt_remote_data {
>  	int msr;
> @@ -122,22 +123,28 @@ static int closid_alloc(u32 *closid)
>  
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
> -	maxid = boot_cpu_data.x86_cache_max_closid;
> -	id = find_first_zero_bit(closmap, maxid);
> +	maxid = cconfig.max_closid;

Cute. You can remove all that code because maxid is always 0.

>  /*
>   * Set only one cpu in cpumask in all cpus that share the same cache.
>   */
> @@ -191,7 +213,7 @@ static inline bool rdt_cpumask_update(int cpu)
>   */
>  static void cbm_update_msrs(void *dummy)
>  {
> -	int maxid = boot_cpu_data.x86_cache_max_closid;
> +	int maxid = cconfig.max_closid;

Ditto

>  	size = BITS_TO_LONGS(maxid) * sizeof(long);
> -	closmap = kzalloc(size, GFP_KERNEL);
> -	if (!closmap) {
> +	cconfig.closmap = kzalloc(size, GFP_KERNEL);
> +	if (!cconfig.closmap) {

Simply because it's never initialized.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP
  2016-09-08  9:57 ` [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP Fenghua Yu
@ 2016-09-08 10:29   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:29 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> Add support to manage CLOSid(CLass Of Service id) and capacity
> bitmask(cbm) for code data prioritization(CDP).

I manage to understand that.

> Closid management includes changes to allocating, freeing closid and
> closid_get and closid_put and changes to closid availability map during
> CDP set up. 

But this is just a random sequence of word, function names and a reference
to the availability map which is not touched at all in this patch.

> CDP has a separate cbm for code and data.

> +/*
> + * When cdp mode is enabled, refcnt is maintained in the dcache_cbm entry.

Sorry. I really cannot figure out what that means.

> + */
>  static inline void closid_get(u32 closid)
>  {
> -	struct clos_cbm_table *cct = &cctable[closid];
> +	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
>  
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
> @@ -139,7 +155,7 @@ static int closid_alloc(u32 *closid)
>  static inline void closid_free(u32 closid)
>  {
>  	clear_bit(closid, cconfig.closmap);
> -	cctable[closid].cbm = 0;
> +	cctable[DCBM_TABLE_INDEX(closid)].cbm = 0;
>  
>  	if (WARN_ON(!cconfig.closids_used))
>  		return;
> @@ -149,7 +165,7 @@ static inline void closid_free(u32 closid)
>  
>  static void closid_put(u32 closid)
>  {
> -	struct clos_cbm_table *cct = &cctable[closid];
> +	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];

So if CDP is disabled we look at table[closid] and if it's enabled we look
at table[closid << 1]. What is managing the interleaved entries in the
table?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization
  2016-09-08  9:57 ` [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization Fenghua Yu
@ 2016-09-08 10:34   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:34 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> Updates hot cpu notification handling for code data prioritization(cdp).

Some more useless information

> The capacity bitmask(cbm) is global for both data and instruction and we
> need to update the new online package with all the cbms by writing to
> the IA32_L3_QOS_n MSRs.

If I wouldn't know the details of the hardware then this explanation would
make me run away screaming ....

> +static void cbm_update_msr(u32 index)
> +{
> +	struct rdt_remote_data info;
> +	int dindex;
> +
> +	dindex = DCBM_TABLE_INDEX(index);
> +	if (cctable[dindex].clos_refcnt) {
> +
> +		info.msr = CBM_FROM_INDEX(dindex);
> +		info.val = cctable[dindex].cbm;
> +		msr_cpu_update((void *) &info);
> +
> +		if (cdp_enabled) {
> +			info.msr = __ICBM_MSR_INDEX(index);
> +			info.val = cctable[dindex + 1].cbm;
> +			msr_cpu_update((void *) &info);
> +		}
> +	}

As usual there is a complete lack of comments here.

> +}
> +
>  /*
>   * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
>   * which are one per CLOSid on the current package.
> @@ -230,15 +250,10 @@ static inline bool rdt_cpumask_update(int cpu)
>  static void cbm_update_msrs(void *dummy)
>  {
>  	int maxid = cconfig.max_closid;
> -	struct rdt_remote_data info;
>  	unsigned int i;
>  
>  	for (i = 0; i < maxid; i++) {
> -		if (cctable[i].clos_refcnt) {
> -			info.msr = CBM_FROM_INDEX(i);
> -			info.val = cctable[i].cbm;
> -			msr_cpu_update((void *) &info);
> -		}
> +		cbm_update_msr(i);
>  	}

The curly braces can go as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct
  2016-09-08  9:57 ` [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct Fenghua Yu
@ 2016-09-08 10:36   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:36 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> rg_list is linked list to connect to other tasks in a rdtgroup.

There is no rg_list in this patch ....
 
> The point of rdtgroup allows the task to access its own rdtgroup directly.

The point? 

Thanks,

	tglx

> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/sched.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 62c68e5..4b1dce0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1766,6 +1766,9 @@ struct task_struct {
>  	/* cg_list protected by css_set_lock and tsk->alloc_lock */
>  	struct list_head cg_list;
>  #endif
> +#ifdef CONFIG_INTEL_RDT
> +	struct rdtgroup *rdtgroup;
> +#endif
>  #ifdef CONFIG_FUTEX
>  	struct robust_list_head __user *robust_list;
>  #ifdef CONFIG_COMPAT
> -- 
> 2.5.0
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 19/33] magic number for resctrl file system
  2016-09-08  9:57 ` [PATCH v2 19/33] magic number for resctrl file system Fenghua Yu
@ 2016-09-08 10:41   ` Thomas Gleixner
  2016-09-08 10:47     ` Borislav Petkov
  0 siblings, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 10:41 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

$subject lacks a subsystem prefix ...

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 19/33] magic number for resctrl file system
  2016-09-08 10:41   ` Thomas Gleixner
@ 2016-09-08 10:47     ` Borislav Petkov
  0 siblings, 0 replies; 96+ messages in thread
From: Borislav Petkov @ 2016-09-08 10:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 12:41:27PM +0200, Thomas Gleixner wrote:
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
> 
> $subject lacks a subsystem prefix ...

 ... and a commit message.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-08  9:57 ` [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
@ 2016-09-08 11:22   ` Borislav Petkov
  2016-09-08 22:01   ` Shaohua Li
  1 sibling, 0 replies; 96+ messages in thread
From: Borislav Petkov @ 2016-09-08 11:22 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> The documentation describes user interface of how to allocate resource
> in Intel RDT.
> 
> Please note that the documentation covers generic user interface. Current
> patch set code only implemente CAT L3. CAT L2 code will be sent later.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 164 insertions(+)
>  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> 
> diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> new file mode 100644
> index 0000000..27de386
> --- /dev/null
> +++ b/Documentation/x86/intel_rdt_ui.txt

Why isn't this part of Documentation/x86/intel_rdt.txt and needs to be a
separate file?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
@ 2016-09-08 11:50   ` Borislav Petkov
  2016-09-08 16:53     ` Yu, Fenghua
  2016-09-08 13:17   ` Thomas Gleixner
  2016-09-13 22:40   ` Dave Hansen
  2 siblings, 1 reply; 96+ messages in thread
From: Borislav Petkov @ 2016-09-08 11:50 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 02:57:01AM -0700, Fenghua Yu wrote:
> From: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> 
> This patch includes CPUID enumeration routines for Cache allocation and
> new values to track resources to the cpuinfo_x86 structure.
> 
> Cache allocation provides a way for the Software (OS/VMM) to restrict
> cache allocation to a defined 'subset' of cache which may be overlapping
> with other 'subsets'. This feature is used when allocating a line in
> cache ie when pulling new data into the cache. The programming of the
> hardware is done via programming MSRs (model specific registers).
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---

...

> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 92a8308..62d979b9 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -12,7 +12,7 @@
>  /*
>   * Defines x86 CPU feature bits
>   */
> -#define NCAPINTS	18	/* N 32-bit words worth of info */
> +#define NCAPINTS	19	/* N 32-bit words worth of info */
>  #define NBUGINTS	1	/* N 32-bit bug flags */
>  
>  /*
> @@ -220,6 +220,7 @@
>  #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
>  #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
>  #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
> +#define X86_FEATURE_RDT		( 9*32+15) /* Resource Director Technology */
>  #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
>  #define X86_FEATURE_AVX512DQ	( 9*32+17) /* AVX-512 DQ (Double/Quad granular) Instructions */
>  #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
> @@ -286,6 +287,9 @@
>  #define X86_FEATURE_SUCCOR	(17*32+1) /* Uncorrectable error containment and recovery */
>  #define X86_FEATURE_SMCA	(17*32+3) /* Scalable MCA */
>  
> +/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 18 */

Seems like this leaf is dedicated to CAT and has only 2 feature bits
defined in the SDM. Please use init_scattered_cpuid_features() instead
of adding a whole CAP word.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c
  2016-09-08  9:57 ` [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c Fenghua Yu
@ 2016-09-08 12:36   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 12:36 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

Subject: x86/intel_rdt.h: Header for inter_rdt.c

inter_rdt? I know about Inter Mailand ....

> The header mainly provides functions to call from the user interface
> file intel_rdt_rdtgroup.c.

What the heck? We do not introduce function prototypes and whatever crap
without an implementation. We add the stuff when we add a function or
implement something which needs a define/struct whatever.
 
> +enum resource_type {
> +	RESOURCE_L3  = 0,
> +	RESOURCE_NUM = 1,

Why does this need an explicit enum initialization?

> +};
> +
> +#define MAX_CACHE_LEAVES        4
> +#define MAX_CACHE_DOMAINS       64
> +
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_l3_domain);
> +DECLARE_PER_CPU_READ_MOSTLY(struct rdtgroup *, cpu_rdtgroup);
>  
>  extern struct static_key rdt_enable_key;
>  void __intel_rdt_sched_in(void *dummy);
>  
> +extern bool cdp_enabled;
> +
> +struct rdt_opts {
> +	bool cdp_enabled;
> +	bool verbose;
> +	bool simulate_cat_l3;
> +};
> +
> +struct cache_domain {
> +	cpumask_t shared_cpu_map[MAX_CACHE_DOMAINS];
> +	unsigned int max_cache_domains_num;
> +	unsigned int level;
> +	unsigned int shared_cache_id[MAX_CACHE_DOMAINS];
> +};
> +
> +extern struct rdt_opts rdt_opts;
> +
>  struct clos_cbm_table {
>  	unsigned long cbm;
>  	unsigned int clos_refcnt;
>  };
>  
>  struct clos_config {
> -	unsigned long *closmap;
> +	unsigned long **closmap;
>  	u32 max_closid;
> -	u32 closids_used;
>  };
>  
> +struct shared_domain {
> +	struct cpumask cpumask;
> +	int l3_domain;
> +};
> +
> +#define for_each_cache_domain(domain, start_domain, max_domain)	\
> +	for (domain = start_domain; domain < max_domain; domain++)
> +
> +extern struct clos_config cconfig;
> +extern struct shared_domain *shared_domain;
> +extern int shared_domain_num;
> +
> +extern struct rdtgroup *root_rdtgrp;
> +
> +extern struct clos_cbm_table **l3_cctable;
> +
> +extern unsigned int min_bitmask_len;
> +extern void msr_cpu_update(void *arg);
> +extern inline void closid_get(u32 closid, int domain);

extern inline?

> +extern void closid_put(u32 closid, int domain);

That's declared static in the source, but sure you do not notice because
intel_rdc.c is not hooked up to the Makefile yet.....

> +extern void closid_free(u32 closid, int domain, int level);

is declared static inline ....

> +extern int closid_alloc(u32 *closid, int domain);

amd more of this crap to follow.

I explicitely asked you last time to do:

>> Which is not making the review any simpler. In order to understand the
>> modifications I have to go back and page in the original stuff from last
>> year once again. So I have to read the original patch first to
>> understand the modifications and then get the overall picture of the new
>> stuff. Please fold stuff back to the proper places so I can start
>> reviewing this thing under the new design idea instead of twisting my
>> brain around two designs.
 
And you replied:

> Ok. I will do that.

Actually you did the reverse. You introduced more crap in the original
patches. See 12/32 vs. the previous version 
http://marc.info/?l=linux-kernel&m=146836100821478

What's the value of mechanically split patches which cannot even compile on
their own? Nothing at all except creating the hell for reviewers.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface
  2016-09-08  9:57 ` [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface Fenghua Yu
@ 2016-09-08 12:44   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 12:44 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> This is header file for user interface file intel_rdt_rdtgroup.c.

Really useful information complementary to $subject - NOT!

And again you introduce stuff without an implementation. How the hell is
that useful? It's just annoying having to lookup this patch when reviewing
the one which adds the actual code ....
 
> +#define MAX_RDTGROUP_TYPE_NAMELEN	32
> +#define MAX_RDTGROUP_ROOT_NAMELEN	64
> +#define MAX_RFTYPE_NAME			64
> +
> +#include <linux/kernfs.h>
> +#include <asm/intel_rdt.h>
> +
> +extern void rdtgroup_exit(struct task_struct *tsk);
> +
> +/* cftype->flags */
> +enum {
> +	RFTYPE_WORLD_WRITABLE = (1 << 4),/* (DON'T USE FOR NEW FILES) S_IWUGO */

Huch? What's the point of this?

> +
> +	/* internal flags, do not use outside rdtgroup core proper */
> +	__RFTYPE_ONLY_ON_DFL  = (1 << 16),/* only on default hierarchy */
> +	__RFTYPE_NOT_ON_DFL   = (1 << 17),/* not on default hierarchy */
> +};
> +
> +#define CACHE_LEVEL3		3
> +
> +struct cache_resource {
> +	u64 *cbm;
> +	u64 *cbm2;
> +	int *closid;
> +	int *refcnt;
> +};

Some more undocumented structs.

> +
> +struct rdt_resource {
> +	bool valid;
> +	int closid[MAX_CACHE_DOMAINS];
> +	/* Add more resources here. */
> +};
> +
> +struct rdtgroup {
> +	struct kernfs_node *kn;		/* rdtgroup kernfs entry */

I told you before not to use tail comments and of course you comment the
obvious and not anything else. We have kerneldoc for this.

> +struct rftype {
> +	/*
> +	 * By convention, the name should begin with the name of the
> +	 * subsystem, followed by a period.  Zero length string indicates
> +	 * end of cftype array.
> +	 */

See above.

> +	char name[MAX_CFTYPE_NAME];
> +	unsigned long private;

And please align the struct members proper. Reading this is a PITA.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
  2016-09-08 11:50   ` Borislav Petkov
@ 2016-09-08 13:17   ` Thomas Gleixner
  2016-09-08 13:59     ` Yu, Fenghua
  2016-09-13 22:40   ` Dave Hansen
  2 siblings, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 13:17 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +			c->x86_l3_max_closid = edx + 1;
> +			c->x86_l3_max_cbm_len = eax + 1;

According to the SDM:

EAX     Bits  4:0:  Length of the capacity bit mask for the corresponding ResID.
        Bits 31:05: Reserved

EDX	Bits 15:0:  Highest COS number supported for this ResID.
	Bits 31:16: Reserved

So why are we assuming that bits 31-5 of EAX and 16-31 of EDX are going to
be zero forever and if not that they are just extending the existing bits?
If that's the case then we don't need to mask out the upper bits, but the
code wants a proper comment about this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08 13:17   ` Thomas Gleixner
@ 2016-09-08 13:59     ` Yu, Fenghua
  0 siblings, 0 replies; 96+ messages in thread
From: Yu, Fenghua @ 2016-09-08 13:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Anvin, H Peter, Ingo Molnar, Luck, Tony, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> On Thu, 8 Sep 2016, Fenghua Yu wrote:
> > +			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx,
> &edx);
> > +			c->x86_l3_max_closid = edx + 1;
> > +			c->x86_l3_max_cbm_len = eax + 1;
> 
> According to the SDM:
> 
> EAX     Bits  4:0:  Length of the capacity bit mask for the corresponding ResID.
>         Bits 31:05: Reserved
> 
> EDX	Bits 15:0:  Highest COS number supported for this ResID.
> 	Bits 31:16: Reserved
> 
> So why are we assuming that bits 31-5 of EAX and 16-31 of EDX are going to
> be zero forever and if not that they are just extending the existing bits?
> If that's the case then we don't need to mask out the upper bits, but the
> code wants a proper comment about this.

You are right. We cannot assume the upper bits are always zero. I fixed the
issue by masking out the upper bits.

Thanks.

-Fenghua 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources
  2016-09-08  9:57 ` [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources Fenghua Yu
@ 2016-09-08 14:57   ` Thomas Gleixner
  2016-09-13 22:54   ` Dave Hansen
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 14:57 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> +#define __DCBM_TABLE_INDEX(x) (x << 1)
> +#define __ICBM_TABLE_INDEX(x) ((x << 1) + 1)

This macro mess is completely undocumented.

> +inline int get_dcbm_table_index(int x)

static inline ??? 

> +{
> +	return DCBM_TABLE_INDEX(x);
> +}
> +
> +inline int get_icbm_table_index(int x)
> +{
> +	return ICBM_TABLE_INDEX(x);
> +}

Why are you introducing these when they are not used at all?

>  /*
>   * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
>   * as it does not have CPUID enumeration support for Cache allocation.
> @@ -98,41 +123,141 @@ static inline bool cache_alloc_hsw_probe(void)
>  
>  	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
>  
> -	boot_cpu_data.x86_cache_max_closid = 4;
> -	boot_cpu_data.x86_cache_max_cbm_len = 20;
> +	boot_cpu_data.x86_l3_max_closid = 4;
> +	boot_cpu_data.x86_l3_max_cbm_len = 20;

So if you actually change the name of the cpudata struct member, then this
would make sense to be split out into a seperate patch. But hell, the patch
order of this stuff is an unholy mess anyway.

>  	min_bitmask_len = 2;
>  
>  	return true;
>  }
>  
> +u32 max_cbm_len(int level)
> +{
> +	switch (level) {
> +	case CACHE_LEVEL3:
> +		return boot_cpu_data.x86_l3_max_cbm_len;
> +	default:
> +		break;
> +	}
> +
> +	return (u32)~0;
> +}
> +u64 max_cbm(int level)

More functions without users and of course without a proper prefix for
public consumption. Documentation is not available either.

> +{
> +	switch (level) {
> +	case CACHE_LEVEL3:
> +		return (1ULL << boot_cpu_data.x86_l3_max_cbm_len) - 1;

So the above max_cmb_len() returns:

	cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
	c->x86_l3_max_cbm_len = eax + 1;

i.e the content of leaf 10:1 EAX plus 1.

According to the SDM:

    EAX 4:0: Length of the capacity bit mask for the corresponding ResID.

So first of all. Why is there a "+ 1" ? Again, that's not documented in the
cpuid initialization code at all.

So now you take that magic number and return:

       (2 ^ magic) - 1

Cute. To be honest I'm too lazy to read through the SDM and figure it out
myself. This kind of stuff belongs into the code as comments. I seriously
doubt that the function names match the actual meaning.

> +	default:
> +		break;
> +	}
> +
> +	return (u64)~0;

What kind of return code is this ?

> +static inline bool cat_l3_supported(struct cpuinfo_x86 *c)
> +{
> +	if (cpu_has(c, X86_FEATURE_CAT_L3))
> +		return true;
> +
> +	/*
> +	 * Probe for Haswell server CPUs.
> +	 */
> +	if (c->x86 == 0x6 && c->x86_model == 0x3f)
> +		return cache_alloc_hsw_probe();

Ah now we have an actual user for that haswell probe thingy ....

> +	return false;
> +}
> +
>  void __intel_rdt_sched_in(void *dummy)

I still have not found a reasonable explanation for this dummy argument.

>  {
>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	struct rdtgroup *rdtgrp;
> +	int closid;
> +	int domain;

Sigh.
  
>  	/*
> -	 * Currently closid is always 0. When  user interface is added,
> -	 * closid will come from user interface.
> +	 * If this task is assigned to an rdtgroup, use it.
> +	 * Else use the group assigned to this cpu.
>  	 */
> -	if (state->closid == 0)
> +	rdtgrp = current->rdtgroup;
> +	if (!rdtgrp)
> +		rdtgrp = this_cpu_read(cpu_rdtgroup);

This makes actually sense! Thanks for listening!

> +
> +	domain = this_cpu_read(cpu_shared_domain);
> +	closid = rdtgrp->resource.closid[domain];
> +
> +	if (closid == state->closid)
>  		return;
>  
> -	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
> -	state->closid = 0;
> +	state->closid = closid;
> +	/* Don't really write PQR register in simulation mode. */
> +	if (unlikely(rdt_opts.simulate_cat_l3))
> +		return;
> +
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
>  }
>  
>  /*
>   * When cdp mode is enabled, refcnt is maintained in the dcache_cbm entry.
>   */
> -static inline void closid_get(u32 closid)
> +inline void closid_get(u32 closid, int domain)

s/static inline/inline/ What the heck?

Can you please explain what this is doing?

>  {
> -	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
> -
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
> -	cct->clos_refcnt++;
> +	if (cat_l3_enabled) {
> +		int l3_domain;
> +		int dindex;

  		int l3_domain, dindex;

> +		l3_domain = shared_domain[domain].l3_domain;
> +		dindex = DCBM_TABLE_INDEX(closid);
> +		l3_cctable[l3_domain][dindex].clos_refcnt++;
> +		if (cdp_enabled) {
> +			int iindex = ICBM_TABLE_INDEX(closid);

And if you call that variable 'index' instead of 'dindex' then you don't
need an extra one 'iindex'.

> +
> +			l3_cctable[l3_domain][iindex].clos_refcnt++;

So now you have a seperate refcount for the Icache part, but the comment
above the function still says otherwise.

> +		}
> +	}
>  }
>  
> -static int closid_alloc(u32 *closid)
> +int closid_alloc(u32 *closid, int domain)
>  {
>  	u32 maxid;
>  	u32 id;
> @@ -140,105 +265,215 @@ static int closid_alloc(u32 *closid)
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
>  	maxid = cconfig.max_closid;
> -	id = find_first_zero_bit(cconfig.closmap, maxid);
> +	id = find_first_zero_bit((unsigned long *)cconfig.closmap[domain],

Why do you need a typecast here? Get your damned structs straight.

> +				 maxid);
> +
>  	if (id == maxid)
>  		return -ENOSPC;
>  
> -	set_bit(id, cconfig.closmap);
> -	closid_get(id);
> +	set_bit(id, (unsigned long *)cconfig.closmap[domain]);
> +	closid_get(id, domain);
>  	*closid = id;
> -	cconfig.closids_used++;
>  
>  	return 0;
>  }
>  
> -static inline void closid_free(u32 closid)
> +unsigned int get_domain_num(int level)
>  {
> -	clear_bit(closid, cconfig.closmap);
> -	cctable[DCBM_TABLE_INDEX(closid)].cbm = 0;
> -
> -	if (WARN_ON(!cconfig.closids_used))
> -		return;
> +	if (level == CACHE_LEVEL3)
> +		return cpumask_weight(&rdt_l3_cpumask);

get_domain_num(level) suggests to me that it returns the domain number
corresponding to the level, but it actually returns the number of bits set
in the rdt_l3_cpumask. Very intuitive - NOT!

> +	else
> +		return -EINVAL;

Proper return value for a function returning 'unsigned int' ....

> +}
>  
> -	cconfig.closids_used--;
> +int level_to_leaf(int level)
> +{
> +	switch (level) {
> +	case CACHE_LEVEL3:
> +		return 3;
> +	default:
> +		return -EINVAL;
> +	}
>  }
>  
> -static void closid_put(u32 closid)
> +void closid_free(u32 closid, int domain, int level)
>  {
> -	struct clos_cbm_table *cct = &cctable[DCBM_TABLE_INDEX(closid)];
> +	struct clos_cbm_table **cctable;
> +	int leaf;
> +	struct cpumask *mask;
> +	int cpu;
> +
> +	if (level == CACHE_LEVEL3)
> +		cctable = l3_cctable;

Oh well. Why needs this assignment to happen here?

> +
> +	clear_bit(closid, (unsigned long *)cconfig.closmap[domain]);
> +
> +	if (level == CACHE_LEVEL3) {

And not here where it actually makes sense?

> +		cctable[domain][closid].cbm = max_cbm(level);
> +		leaf = level_to_leaf(level);
> +		mask = &cache_domains[leaf].shared_cpu_map[domain];
> +		cpu = cpumask_first(mask);
> +		smp_call_function_single(cpu, cbm_update_l3_msr, &closid, 1);

A comment explaining that this must be done on one of the cpus in @domain
would be too helpful.

> +	}
> +}
>  
> +static void _closid_put(u32 closid, struct clos_cbm_table *cct,

Please use two underscores so it's obvious.

> +			int domain, int level)
> +{
>  	lockdep_assert_held(&rdtgroup_mutex);
>  	if (WARN_ON(!cct->clos_refcnt))
>  		return;
>  
>  	if (!--cct->clos_refcnt)
> -		closid_free(closid);
> +		closid_free(closid, domain, level);
>  }
>  
> -static void msr_cpu_update(void *arg)
> +void closid_put(u32 closid, int domain)
> +{
> +	struct clos_cbm_table *cct;
> +
> +	if (cat_l3_enabled) {
> +		int l3_domain = shared_domain[domain].l3_domain;
> +
> +		cct = &l3_cctable[l3_domain][DCBM_TABLE_INDEX(closid)];
> +		_closid_put(closid, cct, l3_domain, CACHE_LEVEL3);
> +		if (cdp_enabled) {
> +			cct = &l3_cctable[l3_domain][ICBM_TABLE_INDEX(closid)];
> +			_closid_put(closid, cct, l3_domain, CACHE_LEVEL3);
> +		}
> +	}
> +}
> +
> +void msr_cpu_update(void *arg)
>  {
>  	struct rdt_remote_data *info = arg;
>  
> +	if (unlikely(rdt_opts.verbose))
> +		pr_info("Write %lx to msr %x on cpu%d\n",
> +			(unsigned long)info->val, info->msr,
> +			smp_processor_id());

That's what DYNAMIC_DEBUG is for.

> +
> +	if (unlikely(rdt_opts.simulate_cat_l3))
> +		return;

Do we really need all this simulation stuff?

> +
>  	wrmsrl(info->msr, info->val);
>  }
>  
> +static struct cpumask *rdt_cache_cpumask(int level)
> +{
> +	return &rdt_l3_cpumask;

That's a really useful helper .... Your choice of checking 'level' five
times in a row in various helpers versus returning l3 unconditionally is at
least interesting.

> +int get_cache_leaf(int level, int cpu)
> +{
> +	int index;
> +	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);

this_cpu_ci is a complete misnomer as it suggests that it's actually the
cacheinfo for 'this cpu', i.e. the cpu on which the code is executing.

> +	struct cacheinfo *this_leaf;
> +	int num_leaves = this_cpu_ci->num_leaves;
> +
> +	for (index = 0; index < num_leaves; index++) {
> +		this_leaf = this_cpu_ci->info_list + index;
> +		if (this_leaf->level == level)
> +			return index;

The function is misnomed as well. It does not return the cache leaf, it
returns the leaf index .....

> +	}

Why do you have a cacheinfo related function in this RDT code? 

> +
> +	return -EINVAL;
> +}
> +
> +static struct cpumask *get_shared_cpu_map(int cpu, int level)
> +{
> +	int index;
> +	struct cacheinfo *leaf;
> +	struct cpu_cacheinfo *cpu_ci = get_cpu_cacheinfo(cpu);
> +
> +	index = get_cache_leaf(level, cpu);
> +	if (index < 0)
> +		return 0;
> +
> +	leaf = cpu_ci->info_list + index;

While here you actually get the leaf.

> +
> +	return &leaf->shared_cpu_map;
>  }
  
>  static int clear_rdtgroup_cpumask(unsigned int cpu)
> @@ -293,63 +531,270 @@ static int clear_rdtgroup_cpumask(unsigned int cpu)
>  
>  static int intel_rdt_offline_cpu(unsigned int cpu)
>  {
> -	int i;
> +	struct cpumask *shared_cpu_map;
> +	int new_cpu;
> +	int l3_domain;
> +	int level;
> +	int leaf;

Sigh. 1/3 of the line space is wasted for single variable declarations.
 
>  	mutex_lock(&rdtgroup_mutex);
> -	if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> -		mutex_unlock(&rdtgroup_mutex);
> -		return;
> -	}
>  
> -	cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
> -	cpumask_clear_cpu(cpu, &tmp_cpumask);
> -	i = cpumask_any(&tmp_cpumask);
> +	level = CACHE_LEVEL3;

I have a hard time to understand the value of that 'level' variable, but
well that's the least of my worries with that code.

> +
> +	l3_domain = per_cpu(cpu_l3_domain, cpu);
> +	leaf = level_to_leaf(level);
> +	shared_cpu_map = &cache_domains[leaf].shared_cpu_map[l3_domain];
>  
> -	if (i < nr_cpu_ids)
> -		cpumask_set_cpu(i, &rdt_cpumask);
> +	cpumask_clear_cpu(cpu, &rdt_l3_cpumask);
> +	cpumask_clear_cpu(cpu, shared_cpu_map);
> +	if (cpumask_empty(shared_cpu_map))
> +		goto out;

So what clears @cpu in the rdtgroup cpumask?

> +
> +	new_cpu = cpumask_first(shared_cpu_map);
> +	rdt_cpumask_update(&rdt_l3_cpumask, new_cpu, level);
>  
>  	clear_rdtgroup_cpumask(cpu);

Can you please use a consistent prefix and naming scheme?

    rdt_cpumask_update()
    clear_rdtgroup_cpumask()

WTF?

> +out:
>  	mutex_unlock(&rdtgroup_mutex);
> +	return 0;
> +}
> +
> +/*
> + * Initialize per-cpu cpu_l3_domain.
> + *
> + * cpu_l3_domain numbers are consequtive integer starting from 0.
> + * Sets up 1:1 mapping of cpu id and cpu_l3_domain.
> + */
> +static int __init cpu_cache_domain_init(int level)
> +{
> +	int i, j;
> +	int max_cpu_cache_domain = 0;
> +	int index;
> +	struct cacheinfo *leaf;
> +	int *domain;
> +	struct cpu_cacheinfo *cpu_ci;

Eyes hurt.

> +
> +	for_each_online_cpu(i) {
> +		domain = &per_cpu(cpu_l3_domain, i);

per_cpu_ptr() exists for a reason.

> +		if (*domain == -1) {
> +			index = get_cache_leaf(level, i);
> +			if (index < 0)
> +				return -EINVAL;
> +
> +			cpu_ci = get_cpu_cacheinfo(i);
> +			leaf = cpu_ci->info_list + index;
> +			if (cpumask_empty(&leaf->shared_cpu_map)) {
> +				WARN(1, "no shared cpu for L2\n");

So L2 is always the right thing for every value of @level? And what the
heck is the value of that WARN? Nothing because the callchain is already
known. What's worse is that you don't tell about @level, @index, @i (which
should be named @cpu).

> +				return -EINVAL;
> +			}
> +
> +			for_each_cpu(j, &leaf->shared_cpu_map) {

So again independent of @level you fiddle with cpu_l3_domain. Interesting.

> +				domain = &per_cpu(cpu_l3_domain, j);
> +				*domain = max_cpu_cache_domain;
> +			}
> +			max_cpu_cache_domain++;

what's the actual meaning of max_cpu_cache_domain?

> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int __init rdt_setup(char *str)
> +{
> +	char *tok;
> +
> +	while ((tok = strsep(&str, ",")) != NULL) {
> +		if (!*tok)
> +			return -EINVAL;
> +
> +		if (strcmp(tok, "simulate_cat_l3") == 0) {
> +			pr_info("Simulate CAT L3\n");
> +			rdt_opts.simulate_cat_l3 = true;

So this goes into rdt_opts

> +		} else if (strcmp(tok, "disable_cat_l3") == 0) {
> +			pr_info("CAT L3 is disabled\n");
> +			disable_cat_l3 = true;

While this is a distinct control variable. Very consistent.

> +		} else {
> +			pr_info("Invalid rdt option\n");

Very helpful w/o printing the actual option ....

> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +__setup("resctrl=", rdt_setup);
> +
> +static inline bool resource_alloc_enabled(void)
> +{
> +	return cat_l3_enabled;
> +}

Oh well.

> +
> +static int shared_domain_init(void)
> +{
> +	int l3_domain_num = get_domain_num(CACHE_LEVEL3);
> +	int size;
> +	int domain;
> +	struct cpumask *cpumask;
> +	struct cpumask *shared_cpu_map;
> +	int cpu;

More random variable declarations.

> +	if (cat_l3_enabled) {
> +		shared_domain_num = l3_domain_num;
> +		cpumask = &rdt_l3_cpumask;
> +	} else
> +		return -EINVAL;

Missing curly braces.

> +
> +	size = shared_domain_num * sizeof(struct shared_domain);
> +	shared_domain = kzalloc(size, GFP_KERNEL);
> +	if (!shared_domain)
> +		return -EINVAL;
> +
> +	domain = 0;
> +	for_each_cpu(cpu, cpumask) {
> +		if (cat_l3_enabled)
> +			shared_domain[domain].l3_domain =
> +					per_cpu(cpu_l3_domain, cpu);
> +		else
> +			shared_domain[domain].l3_domain = -1;
> +
> +		shared_cpu_map = get_shared_cpu_map(cpu, CACHE_LEVEL3);
> +
> +		cpumask_copy(&shared_domain[domain].cpumask, shared_cpu_map);

What's the point of updating the cpumask when the thing is disabled? If
there is a reason then this should be documented in a comment.

> +		domain++;
> +	}
> +	for_each_online_cpu(cpu) {
> +		if (cat_l3_enabled)
> +			per_cpu(cpu_shared_domain, cpu) =
> +					per_cpu(cpu_l3_domain, cpu);

More missing curly braces. And using an intermediate variable would remove
this hard to read line breaks.

> +	}
> +
> +	return 0;
> +}
> +
> +static int cconfig_init(int maxid)
> +{
> +	int num;
> +	int domain;
> +	unsigned long *closmap_block;
> +	int maxid_size;
> +
> +	maxid_size = BITS_TO_LONGS(maxid);
> +	num = maxid_size * shared_domain_num;
> +	cconfig.closmap = kcalloc(maxid, sizeof(unsigned long *), GFP_KERNEL);

Really intuitive. You calc num right before allocating the closmap pointers
and then you use it in the next alloc.

> +	if (!cconfig.closmap)
> +		goto out_free;
> +
> +	closmap_block = kcalloc(num, sizeof(unsigned long), GFP_KERNEL);
> +	if (!closmap_block)
> +		goto out_free;
> +
> +	for (domain = 0; domain < shared_domain_num; domain++)
> +		cconfig.closmap[domain] = (unsigned long *)closmap_block +

More random type casting.

> +					  domain * maxid_size;

Why don't you allocate that whole mess in one go?

	unsigned int ptrsize, mapsize, size, d;
    	void *p;

	ptrsize = maxid * sizeof(unsigned long *);
    	mapsize = BITS_TO_LONGS(maxid) * sizeof(unsigned long);
    	size = ptrsize + num_shared_domains * mapsize;

	p = kzalloc(size, GFP_KERNEL);
	if (!p)
		return -ENOMEM;

	cconfig.closmap = p;	
	cconfig.max_closid = maxid;

	p += ptrsize;
	for (d = 0; d < num_shared_domains; d++, p += mapsize)
	       cconfig.closmap[d] = p;
	return 0;

Would be too simple. Once more.

> +
> +	cconfig.max_closid = maxid;
> +
> +	return 0;
> +out_free:
> +	kfree(cconfig.closmap);
> +	kfree(closmap_block);
> +	return -ENOMEM;
> +}
> +
> +static int __init cat_cache_init(int level, int maxid,
> +				 struct clos_cbm_table ***cctable)
> +{
> +	int domain_num;
> +	int domain;
> +	int size;
> +	int ret = 0;
> +	struct clos_cbm_table *p;
> +
> +	domain_num = get_domain_num(level);
> +	size = domain_num * sizeof(struct clos_cbm_table *);
> +	*cctable = kzalloc(size, GFP_KERNEL);
> +	if (!*cctable) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	size = maxid * domain_num * sizeof(struct clos_cbm_table);
> +	p = kzalloc(size, GFP_KERNEL);
> +	if (!p) {
> +		kfree(*cctable);
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (domain = 0; domain < domain_num; domain++)
> +		(*cctable)[domain] = p + domain * maxid;

Same crap.

> +
> +	ret = cpu_cache_domain_init(level);
> +	if (ret) {
> +		kfree(*cctable);
> +		kfree(p);
> +	}
> +out:
> +	return ret;
>  }
>  
>  static int __init intel_rdt_late_init(void)
>  {
>  	struct cpuinfo_x86 *c = &boot_cpu_data;
>  	u32 maxid;
> -	int err = 0, size, i;
> -
> -	maxid = c->x86_cache_max_closid;
> -
> -	size = maxid * sizeof(struct clos_cbm_table);
> -	cctable = kzalloc(size, GFP_KERNEL);
> -	if (!cctable) {
> -		err = -ENOMEM;
> -		goto out_err;
> +	int i;
> +	int ret;
> +
> +	if (unlikely(disable_cat_l3))

This inlikely is completely pointless. This is not a hotpath function. It
just makes the code harder to read.

> +		cat_l3_enabled = false;
> +	else if (cat_l3_supported(c))
> +		cat_l3_enabled = true;
> +	else if (rdt_opts.simulate_cat_l3 &&
> +		 get_cache_leaf(CACHE_LEVEL3, 0) >= 0)
> +		cat_l3_enabled = true;
> +	else
> +		cat_l3_enabled = false;

Please move that into resource_alloc_enabled() and make it readable.

> +	if (!resource_alloc_enabled())
> +		return -ENODEV;
> +
> +	if (rdt_opts.simulate_cat_l3) {
> +		boot_cpu_data.x86_l3_max_closid = 16;
> +		boot_cpu_data.x86_l3_max_cbm_len = 20;
> +	}
> +	for_each_online_cpu(i) {
> +		rdt_cpumask_update(&rdt_l3_cpumask, i, CACHE_LEVEL3);
>  	}
>  
> -	size = BITS_TO_LONGS(maxid) * sizeof(long);
> -	cconfig.closmap = kzalloc(size, GFP_KERNEL);
> -	if (!cconfig.closmap) {
> -		kfree(cctable);
> -		err = -ENOMEM;
> -		goto out_err;
> +	maxid = 0;
> +	if (cat_l3_enabled) {
> +		maxid = boot_cpu_data.x86_l3_max_closid;
> +		ret = cat_cache_init(CACHE_LEVEL3, maxid, &l3_cctable);
> +		if (ret)
> +			cat_l3_enabled = false;
>  	}
>  
> -	for_each_online_cpu(i)
> -		rdt_cpumask_update(i);
> +	if (!cat_l3_enabled)
> +		return -ENOSPC;

Huch? How do you get here when cat_l3_enabled is false?

> +
> +	ret = shared_domain_init();
> +	if (ret)
> +		return -ENODEV;

  Leaks closmaps

> +
> +	ret = cconfig_init(maxid);
> +	if (ret)
> +		return ret;

Leaks more stuff.

>  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>  				"AP_INTEL_RDT_ONLINE",
>  				intel_rdt_online_cpu, intel_rdt_offline_cpu);

I still have not figured out how all that init scheme is working so that
you can use nocalls() for the hotplug registration. 

> -	if (err < 0)
> -		goto out_err;
> +	if (ret < 0)
> +		return ret;

And this as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT
  2016-09-08  9:57 ` [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT Fenghua Yu
@ 2016-09-08 14:59   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 14:59 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> We introduce a new resctrl file system mounted under /sys/fs/resctrl.
> User uses this file system to control resource allocation.
> 
> Hiearchy of the file system is as follows:
> /sys/fs/resctrl/info/info
> 		    /<resource0>/<resource0 specific info files>
> 		    /<resource1>/<resource1 specific info files>
> 			....
> 		/tasks
> 		/cpus
> 		/schemata
> 		/sub-dir1
> 		/sub-dir2
> 		....
> 
> User can specify which task uses which schemata for resource allocation.
> 
> More details can be found in Documentation/x86/intel_rdt_ui.txt
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/intel_rdt.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index 85beecc..aaed4b4 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -40,6 +40,8 @@ struct cache_domain {
>  	unsigned int shared_cache_id[MAX_CACHE_DOMAINS];
>  };
>  
> +extern struct cache_domain cache_domains[MAX_CACHE_LEAVES];
> +

This patch split is so annoying it's not even funny anymore.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory
  2016-09-08  9:57 ` [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory Fenghua Yu
@ 2016-09-08 16:04   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 16:04 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +/*
> + * kernfs_root - find out the kernfs_root a kernfs_node belongs to
> + * @kn: kernfs_node of interest
> + *
> + * Return the kernfs_root @kn belongs to.
> + */
> +static inline struct kernfs_root *get_kernfs_root(struct kernfs_node *kn)
> +{
> +	if (kn->parent)
> +		kn = kn->parent;

So this is guaranteed to have a single nesting?

> +	return kn->dir.root;
> +}
> +
> +/*
> + * rdtgroup_file_mode - deduce file mode of a control file
> + * @cft: the control file in question
> + *
> + * S_IRUGO for read, S_IWUSR for write.
> + */
> +static umode_t rdtgroup_file_mode(const struct rftype *rft)
> +{
> +	umode_t mode = 0;
> +
> +	if (rft->read_u64 || rft->read_s64 || rft->seq_show)
> +		mode |= S_IRUGO;
> +
> +	if (rft->write_u64 || rft->write_s64 || rft->write)
> +		mode |= S_IWUSR;

Why don't you store the mode in rtftype instead of evaluating it at
runtime?

Aside of that [read|write]_[s|u]64 are nowhere used in this whole patch
series, but take plenty of storage and line space for nothing.

> +static int rdtgroup_add_files(struct kernfs_node *kn, struct rftype *rfts,
> +			      const struct rftype *end)
> +{
> +	struct rftype *rft;
> +	int ret;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	for (rft = rfts; rft != end; rft++) {
> +		ret = rdtgroup_add_file(kn, rft);
> +		if (ret) {
> +			pr_warn("%s: failed to add %s, err=%d\n",
> +				__func__, rft->name, ret);
> +			rdtgroup_rm_files(kn, rft, end);

So we remove the file which failed to be added along with those which we
have not been added yet.

		    rdtgroup_rm_files(kn, rfts, rft);

Might be more correct, but I might be wrong as usual.

> +/*
> + * Get resource type from name in kernfs_node. This can be extended to
> + * multi-resources (e.g. L2). Right now simply return RESOURCE_L3 because
> + * we only have L3 support.

That's crap. If you know that you have seperate types then spend the time
to implement the storage instead of documenting your lazy/sloppyness.

> + */
> +static enum resource_type get_kn_res_type(struct kernfs_node *kn)
> +{
> +	return RESOURCE_L3;
> +}
> +
> +static int rdt_max_closid_show(struct seq_file *seq, void *v)
> +{
> +	struct kernfs_open_file *of = seq->private;
> +
> +	switch (get_kn_res_type(of->kn)) {
> +	case RESOURCE_L3:
> +		seq_printf(seq, "%d\n",
> +			boot_cpu_data.x86_l3_max_closid);

x86_l3_max_closid is u16 ..... %u ?

And that line break is required because the line is

		seq_printf(seq, "%d\n",	boot_cpu_data.x86_l3_max_closid);

exactly 73 characters long ....

> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +static int rdt_max_cbm_len_show(struct seq_file *seq, void *v)
> +{
> +	struct kernfs_open_file *of = seq->private;
> +
> +	switch (get_kn_res_type(of->kn)) {
> +	case RESOURCE_L3:
> +		seq_printf(seq, "%d\n",
> +			boot_cpu_data.x86_l3_max_cbm_len);

Ditto

> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}

> +static void rdt_info_show_cat(struct seq_file *seq, int level)
> +{
> +	int domain;
> +	int domain_num = get_domain_num(level);
> +	int closid;
> +	u64 cbm;
> +	struct clos_cbm_table **cctable;
> +	int maxid;
> +	int shared_domain;
> +	int cnt;

Soon you occupy half of the screen.

> +	if (level == CACHE_LEVEL3)
> +		cctable = l3_cctable;
> +	else
> +		return;
> +
> +	maxid = cconfig.max_closid;
> +	for (domain = 0; domain < domain_num; domain++) {
> +		seq_printf(seq, "domain %d:\n", domain);
> +		shared_domain = get_shared_domain(domain, level);
> +		for (closid = 0; closid < maxid; closid++) {
> +			int dindex, iindex;
> +
> +			if (test_bit(closid,
> +			(unsigned long *)cconfig.closmap[shared_domain])) {
> +				dindex = get_dcbm_table_index(closid);
> +				cbm = cctable[domain][dindex].cbm;
> +				cnt = cctable[domain][dindex].clos_refcnt;
> +				seq_printf(seq, "cbm[%d]=%lx, refcnt=%d\n",
> +					 dindex, (unsigned long)cbm, cnt);
> +				if (cdp_enabled) {
> +					iindex = get_icbm_table_index(closid);
> +					cbm = cctable[domain][iindex].cbm;
> +					cnt =
> +					   cctable[domain][iindex].clos_refcnt;
> +					seq_printf(seq,
> +						   "cbm[%d]=%lx, refcnt=%d\n",
> +						   iindex, (unsigned long)cbm,
> +						   cnt);
> +				}
> +			} else {
> +				cbm = max_cbm(level);
> +				cnt = 0;
> +				dindex = get_dcbm_table_index(closid);
> +				seq_printf(seq, "cbm[%d]=%lx, refcnt=%d\n",
> +					 dindex, (unsigned long)cbm, cnt);
> +				if (cdp_enabled) {
> +					iindex = get_icbm_table_index(closid);
> +					seq_printf(seq,
> +						 "cbm[%d]=%lx, refcnt=%d\n",
> +						 iindex, (unsigned long)cbm,
> +						 cnt);
> +				}

This is completely unreadable. Split it into static functions ....

> +			}
> +		}
> +	}
> +}
> +
> +static void show_shared_domain(struct seq_file *seq)
> +{
> +	int domain;
> +
> +	seq_puts(seq, "Shared domains:\n");
> +
> +	for_each_cache_domain(domain, 0, shared_domain_num) {
> +		struct shared_domain *sd;
> +
> +		sd = &shared_domain[domain];
> +		seq_printf(seq, "domain[%d]:", domain);
> +		if (cat_enabled(CACHE_LEVEL3))
> +			seq_printf(seq, "l3_domain=%d ", sd->l3_domain);
> +		seq_printf(seq, "cpumask=%*pb\n",
> +			   cpumask_pr_args(&sd->cpumask));

What's the value of printing a cpu mask for something which is not enabled?

> +	}
> +}
> +
> +static int rdt_info_show(struct seq_file *seq, void *v)
> +{
> +	show_shared_domain(seq);
> +
> +	if (cat_l3_enabled) {
> +		if (rdt_opts.verbose)

Concatenate the conditionals into a single line please.

> +			rdt_info_show_cat(seq, CACHE_LEVEL3);
> +	}
> +
> +	seq_puts(seq, "\n");
> +
> +	return 0;
> +}
> +
> +static int res_type_to_level(enum resource_type res_type, int *level)
> +{
> +	int ret = 0;
> +
> +	switch (res_type) {
> +	case RESOURCE_L3:
> +		*level = CACHE_LEVEL3;
> +		break;
> +	case RESOURCE_NUM:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;

Groan. What's wrong with

static int res_type_to_level(type)
{
	switch (type) {
	case RESOURCE_L3: return CACHE_LEVEL3;
	case RESOURCE_NUM: return -EINVAL;
	}
}

and at the callsite you do:


> +}
> +
> +static int domain_to_cache_id_show(struct seq_file *seq, void *v)
> +{
> +	struct kernfs_open_file *of = seq->private;
> +	enum resource_type res_type;
> +	int domain;
> +	int leaf;
> +	int level = 0;
> +	int ret;
> +
> +	res_type = (enum resource_type)of->kn->parent->priv;
> +
> +	ret = res_type_to_level(res_type, &level);
> +	if (ret)
> +		return 0;

  	level = res_type_to_level(res_type);
	if (level < 0)
	   	return 0;

That gets rid of the initialization of level as well and becomes readable
source code. Hmm?

> +
> +	leaf =	get_cache_leaf(level, 0);

  	leafidx = cache_get_leaf_index(...);

I trip over this over and over and I can't get used to this misnomer.

> +
> +	for (domain = 0; domain < get_domain_num(level); domain++) {
> +		unsigned int cid;
> +
> +		cid = cache_domains[leaf].shared_cache_id[domain];
> +		seq_printf(seq, "%d:%d\n", domain, cid);

Proper print qualifiers are overrated....

> +static int info_populate_dir(struct kernfs_node *kn)
> +{
> +	struct rftype *rfts;
> +
> +	rfts = info_files;

	struct rftype *rfts = info_files;

> +	return rdtgroup_add_files(kn, rfts, rfts + ARRAY_SIZE(info_files));
> +}

> +static int rdtgroup_partition_populate_dir(struct kernfs_node *kn)

Has no user.

> +LIST_HEAD(rdtgroup_lists);

I told you before that globals or module static variable don't get defined
in the middle of the code and not being sticked to a function definition
w/o a space.

> +static void init_rdtgroup_root(struct rdtgroup_root *root)
> +{
> +	struct rdtgroup *rdtgrp = &root->rdtgrp;
> +
> +	INIT_LIST_HEAD(&rdtgrp->rdtgroup_list);
> +	list_add_tail(&rdtgrp->rdtgroup_list, &rdtgroup_lists);
> +	atomic_set(&root->nr_rdtgrps, 1);
> +	rdtgrp->root = root;

Yuck.

	grp = root->grp;
	init(grp);
	root->nr_grps = 1;
	grp->root = root;

Confused.

> +}
> +
> +static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops;
> +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
> +{
> +	struct rdtgroup *rdtgrp;
> +
> +	if (kernfs_type(kn) == KERNFS_DIR)
> +		rdtgrp = kn->priv;
> +	else
> +		rdtgrp = kn->parent->priv;

So this again assumes that there is a single level of directories....

> +	kernfs_break_active_protection(kn);
> +
> +	mutex_lock(&rdtgroup_mutex);
> +	/* Unlock if rdtgrp is dead. */
> +	if (!rdtgrp)
> +		rdtgroup_kn_unlock(kn);
> +
> +	return rdtgrp;
> +}
> +
> +void rdtgroup_kn_unlock(struct kernfs_node *kn)
> +{
> +	mutex_unlock(&rdtgroup_mutex);
> +
> +	kernfs_unbreak_active_protection(kn);
> +}
> +
> +static char *res_info_dir_name(enum resource_type res_type, char *name)
> +{
> +	switch (res_type) {
> +	case RESOURCE_L3:
> +		strncpy(name, "l3", RDTGROUP_FILE_NAME_LEN);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return name;

What's the purpose of this return value if its ignored at the call site?

> +}
> +
> +static int create_res_info(enum resource_type res_type,
> +			   struct kernfs_node *parent_kn)
> +{
> +	struct kernfs_node *kn;
> +	char name[RDTGROUP_FILE_NAME_LEN];
> +	int ret;
> +
> +	res_info_dir_name(res_type, name);

So name contains random crap if res_type is not handled in res_info_dir_name().

> +	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, NULL);
> +	if (IS_ERR(kn)) {
> +		ret = PTR_ERR(kn);
> +		goto out;
> +	}
> +
> +	/*
> +	 * This extra ref will be put in kernfs_remove() and guarantees
> +	 * that @rdtgrp->kn is always accessible.
> +	 */
> +	kernfs_get(kn);
> +
> +	ret = rdtgroup_kn_set_ugid(kn);
> +	if (ret)
> +		goto out_destroy;
> +
> +	ret = res_info_populate_dir(kn);
> +	if (ret)
> +		goto out_destroy;
> +
> +	kernfs_activate(kn);
> +
> +	ret = 0;
> +	goto out;

Hell no.

> +
> +out_destroy:
> +	kernfs_remove(kn);
> +out:
> +	return ret;
> +
> +}
> +
> +static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn,
> +				    const char *name)
> +{
> +	struct kernfs_node *kn;
> +	int ret;
> +
> +	if (parent_kn != root_rdtgrp->kn)
> +		return -EPERM;
> +
> +	/* create the directory */
> +	kn = kernfs_create_dir(parent_kn, "info", parent_kn->mode, root_rdtgrp);
> +	if (IS_ERR(kn)) {
> +		ret = PTR_ERR(kn);
> +		goto out;
> +	}
> +
> +	ret = info_populate_dir(kn);
> +	if (ret)
> +		goto out_destroy;
> +
> +	if (cat_enabled(CACHE_LEVEL3))
> +		create_res_info(RESOURCE_L3, kn);
> +
> +	/*
> +	 * This extra ref will be put in kernfs_remove() and guarantees
> +	 * that @rdtgrp->kn is always accessible.
> +	 */
> +	kernfs_get(kn);
> +
> +	ret = rdtgroup_kn_set_ugid(kn);
> +	if (ret)
> +		goto out_destroy;
> +
> +	kernfs_activate(kn);
> +
> +	ret = 0;
> +	goto out;

Copy and paste .... sucks.

> +out_destroy:
> +	kernfs_remove(kn);
> +out:
> +	return ret;
> +}
> +
> +static int rdtgroup_setup_root(struct rdtgroup_root *root,
> +			       unsigned long ss_mask)
> +{
> +	int ret;
> +
> +	root_rdtgrp = &root->rdtgrp;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	root->kf_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
> +					   KERNFS_ROOT_CREATE_DEACTIVATED,
> +					   root_rdtgrp);
> +	if (IS_ERR(root->kf_root)) {
> +		ret = PTR_ERR(root->kf_root);
> +		goto out;
> +	}
> +	root_rdtgrp->kn = root->kf_root->kn;
> +
> +	ret = rdtgroup_populate_dir(root->kf_root->kn);
> +	if (ret)
> +		goto destroy_root;
> +
> +	rdtgroup_create_info_dir(root->kf_root->kn, "info_dir");
> +
> +	/*
> +	 * Link the root rdtgroup in this hierarchy into all the css_set

css_set objects ? Again: Copy and paste sucks, when done without brain
involvement.

> +	 * objects.
> +	 */
> +	WARN_ON(atomic_read(&root->nr_rdtgrps) != 1);
> +
> +	kernfs_activate(root_rdtgrp->kn);
> +	ret = 0;
> +	goto out;
> +
> +destroy_root:
> +	kernfs_destroy_root(root->kf_root);
> +	root->kf_root = NULL;
> +out:
> +	return ret;
> +}

> +static int get_shared_cache_id(int cpu, int level)
> +{
> +	struct cpuinfo_x86 *c;
> +	int index_msb;
> +	struct cpu_cacheinfo *this_cpu_ci;
> +	struct cacheinfo *this_leaf;
> +
> +	this_cpu_ci = get_cpu_cacheinfo(cpu);

Once more. this_cpu_ci is actively misleading.

> +
> +	this_leaf = this_cpu_ci->info_list + level_to_leaf(level);
> +	return this_leaf->id;
> +	return c->apicid >> index_msb;
> +}

> +static void init_cache_domain(int cpu, int leaf)
> +{
> +	struct cpu_cacheinfo *this_cpu_ci;
> +	struct cpumask *mask;
> +	unsigned int level;
> +	struct cacheinfo *this_leaf;
> +	int domain;
> +
> +	this_cpu_ci = get_cpu_cacheinfo(cpu);
> +	this_leaf = this_cpu_ci->info_list + leaf;
> +	cache_domains[leaf].level = this_leaf->level;
> +	mask = &this_leaf->shared_cpu_map;
> +	for (domain = 0; domain < MAX_CACHE_DOMAINS; domain++) {
> +		if (cpumask_test_cpu(cpu,
> +			&cache_domains[leaf].shared_cpu_map[domain]))
> +			return;
> +	}
> +	if (domain == MAX_CACHE_DOMAINS) {
> +		domain = cache_domains[leaf].max_cache_domains_num++;
> +
> +		cache_domains[leaf].shared_cpu_map[domain] = *mask;
> +
> +		level = cache_domains[leaf].level;
> +		cache_domains[leaf].shared_cache_id[domain] =
> +			get_shared_cache_id(cpu, level);

I've seen similar code in the other file. Why do we need two incarnations
of that? Can't we have a shared cache domain information storage where all
info is kept for both the control and the user space interface?

> +	}
> +}
> +
> +static __init void init_cache_domains(void)
> +{
> +	int cpu;
> +	int leaf;
> +
> +	for (leaf = 0; leaf < get_cpu_cacheinfo(0)->num_leaves; leaf++) {
> +		for_each_online_cpu(cpu)
> +			init_cache_domain(cpu, leaf);

What updates this stuff on hotplug?

> +	}
> +}
> +
> +void rdtgroup_exit(struct task_struct *tsk)
> +{
> +
> +	if (!list_empty(&tsk->rg_list)) {

I told you last time that rg_list is a misnomer ....

> +		struct rdtgroup *rdtgrp = tsk->rdtgroup;
> +
> +		list_del_init(&tsk->rg_list);
> +		tsk->rdtgroup = NULL;
> +		atomic_dec(&rdtgrp->refcount);

And there is still no sign of documentation on how that list is used and
protected.

> +	}
> +}
> +
> +static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
> +	__releases(&rdtgroup_mutex) __acquires(&rdtgroup_mutex)

Where?

> +{
> +	int shared_domain;
> +	int closid;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	/* free closid occupied by this rdtgroup. */
> +	for_each_cache_domain(shared_domain, 0, shared_domain_num) {
> +		closid = rdtgrp->resource.closid[shared_domain];
> +		closid_put(closid, shared_domain);
> +	}
> +
> +	list_del_init(&rdtgrp->rdtgroup_list);
> +
> +	/*
> +	 * Remove @rdtgrp directory along with the base files.  @rdtgrp has an
> +	 * extra ref on its kn.
> +	 */
> +	kernfs_remove(rdtgrp->kn);
> +}
> +
> +static int
> +rdtgroup_move_task_all(struct rdtgroup *src_rdtgrp, struct rdtgroup *dst_rdtgrp)
> +{
> +	struct list_head *tasks;
> +
> +	tasks = &src_rdtgrp->pset.tasks;
> +	while (!list_empty(tasks)) {

  list_for_each_entry_safe() ???

> +		struct task_struct *tsk;
> +		struct list_head *pos;
> +		pid_t pid;
> +		int ret;
> +
> +		pos = tasks->next;
> +		tsk = list_entry(pos, struct task_struct, rg_list);
> +		pid = tsk->pid;
> +		ret = rdtgroup_move_task(pid, dst_rdtgrp, false, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Forcibly remove all of subdirectories under root.
> + */
> +static void rmdir_all_sub(void)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int cpu;
> +	struct list_head *l;
> +	struct task_struct *p;
> +
> +	/* Move all tasks from sub rdtgroups to default */
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (p->rdtgroup)
> +			p->rdtgroup = NULL;
> +	}
> +	rcu_read_unlock();

And how is that protected against concurrent forks?

> +
> +	while (!list_is_last(&root_rdtgrp->rdtgroup_list, &rdtgroup_lists)) {
> +		l = rdtgroup_lists.next;
> +		if (l == &root_rdtgrp->rdtgroup_list)
> +			l = l->next;
> +
> +		rdtgrp = list_entry(l, struct rdtgroup, rdtgroup_list);
> +		if (rdtgrp == root_rdtgrp)
> +			continue;
> +
> +		for_each_cpu(cpu, &rdtgrp->cpu_mask)
> +			per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
> +
> +		rdtgroup_destroy_locked(rdtgrp);
> +	}
> +}

> +static void *rdtgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
> +{
> +	return seq_rft(seq)->seq_start(seq, ppos);
> +}
> +
> +static void *rdtgroup_seqfile_next(struct seq_file *seq, void *v, loff_t *ppos)
> +{
> +	return seq_rft(seq)->seq_next(seq, v, ppos);
> +}
> +
> +static void rdtgroup_seqfile_stop(struct seq_file *seq, void *v)
> +{
> +	seq_rft(seq)->seq_stop(seq, v);
> +}
> +
> +static int rdtgroup_seqfile_show(struct seq_file *m, void *arg)
> +{
> +	struct rftype *rft = seq_rft(m);
> +
> +	if (rft->seq_show)
> +		return rft->seq_show(m, arg);
> +	return 0;
> +}
> +
> +static struct kernfs_ops rdtgroup_kf_ops = {
> +	.atomic_write_len	= PAGE_SIZE,
> +	.write			= rdtgroup_file_write,
> +	.seq_start		= rdtgroup_seqfile_start,
> +	.seq_next		= rdtgroup_seqfile_next,
> +	.seq_stop		= rdtgroup_seqfile_stop,
> +	.seq_show		= rdtgroup_seqfile_show,
> +};

And once more nothing uses this at all. So why is it there?

> +static struct kernfs_ops rdtgroup_kf_single_ops = {
> +	.atomic_write_len	= PAGE_SIZE,
> +	.write			= rdtgroup_file_write,
> +	.seq_show		= rdtgroup_seqfile_show,
> +};
> +
> +static void rdtgroup_exit_rftypes(struct rftype *rfts)
> +{
> +	struct rftype *rft;
> +
> +	for (rft = rfts; rft->name[0] != '\0'; rft++) {
> +		/* free copy for custom atomic_write_len, see init_cftypes() */
> +		if (rft->max_write_len && rft->max_write_len != PAGE_SIZE)
> +			kfree(rft->kf_ops);
> +		rft->kf_ops = NULL;
> +
> +		/* revert flags set by rdtgroup core while adding @cfts */
> +		rft->flags &= ~(__RFTYPE_ONLY_ON_DFL | __RFTYPE_NOT_ON_DFL);
> +	}
> +}
> +
> +static int rdtgroup_init_rftypes(struct rftype *rfts)
> +{
> +	struct rftype *rft;
> +
> +	for (rft = rfts; rft->name[0] != '\0'; rft++) {
> +		struct kernfs_ops *kf_ops;
> +
> +		if (rft->seq_start)
> +			kf_ops = &rdtgroup_kf_ops;
> +		else
> +			kf_ops = &rdtgroup_kf_single_ops;

Ditto.

> +
> +		/*
> +		 * Ugh... if @cft wants a custom max_write_len, we need to
> +		 * make a copy of kf_ops to set its atomic_write_len.
> +		 */
> +		if (rft->max_write_len && rft->max_write_len != PAGE_SIZE) {
> +			kf_ops = kmemdup(kf_ops, sizeof(*kf_ops), GFP_KERNEL);
> +			if (!kf_ops) {
> +				rdtgroup_exit_rftypes(rfts);
> +				return -ENOMEM;
> +			}
> +			kf_ops->atomic_write_len = rft->max_write_len;

No user either. Copy and paste once more ?

> +		}
> +
> +		rft->kf_ops = kf_ops;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * rdtgroup_init - rdtgroup initialization
> + *
> + * Register rdtgroup filesystem, and initialize any subsystems that didn't
> + * request early init.
> + */
> +int __init rdtgroup_init(void)
> +{
> +	int cpu;
> +
> +	WARN_ON(rdtgroup_init_rftypes(rdtgroup_root_base_files));
> +
> +	WARN_ON(rdtgroup_init_rftypes(res_info_files));
> +	WARN_ON(rdtgroup_init_rftypes(info_files));
> +
> +	WARN_ON(rdtgroup_init_rftypes(rdtgroup_partition_base_files));
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	init_rdtgroup_root(&rdtgrp_dfl_root);
> +	WARN_ON(rdtgroup_setup_root(&rdtgrp_dfl_root, 0));
> +
> +	mutex_unlock(&rdtgroup_mutex);
> +
> +	WARN_ON(sysfs_create_mount_point(fs_kobj, "resctrl"));
> +	WARN_ON(register_filesystem(&rdt_fs_type));
> +	init_cache_domains();
> +
> +	INIT_LIST_HEAD(&rdtgroups);
> +
> +	for_each_online_cpu(cpu)
> +		per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;

Another more for each cpu loop. Where is the hotplug update happening?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file
  2016-09-08  9:57 ` [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file Fenghua Yu
@ 2016-09-08 16:07   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 16:07 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> A new header file is created in include/linux/resctrl.h. It contains

No comment. You should be able to guess what I'm trying not to say.

> defintions of rdtgroup_fork and rdtgroup_exit for x86. The functions
> are empty for other architectures.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  include/linux/resctrl.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>  create mode 100644 include/linux/resctrl.h
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> new file mode 100644
> index 0000000..68dabc4
> --- /dev/null
> +++ b/include/linux/resctrl.h
> @@ -0,0 +1,12 @@
> +#ifndef _LINUX_RESCTRL_H
> +#define _LINUX_RESCTRL_H
> +
> +#ifdef CONFIG_INTEL_RDT
> +extern void rdtgroup_fork(struct task_struct *child);
> +extern void rdtgroup_exit(struct task_struct *tsk);
> +#else
> +static inline void rdtgroup_fork(struct task_struct *child) {}
> +static inline void rdtgroup_exit(struct task_struct *tsk) {}
> +#endif /* CONFIG_X86 */
> +
> +#endif /* _LINUX_RESCTRL_H */
> -- 
> 2.5.0
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08 11:50   ` Borislav Petkov
@ 2016-09-08 16:53     ` Yu, Fenghua
  2016-09-08 17:17       ` Borislav Petkov
  0 siblings, 1 reply; 96+ messages in thread
From: Yu, Fenghua @ 2016-09-08 16:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Luck, Tony,
	Peter Zijlstra, Tejun Heo, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> From: Borislav Petkov [mailto:bp@suse.de]
> On Thu, Sep 08, 2016 at 02:57:01AM -0700, Fenghua Yu wrote:
> > From: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> >
> > This patch includes CPUID enumeration routines for Cache allocation
> > and new values to track resources to the cpuinfo_x86 structure.
> >
> > Cache allocation provides a way for the Software (OS/VMM) to restrict
> > cache allocation to a defined 'subset' of cache which may be
> > overlapping with other 'subsets'. This feature is used when allocating
> > a line in cache ie when pulling new data into the cache. The
> > programming of the hardware is done via programming MSRs (model
> specific registers).
> >
> > Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > ---
> 
> ...
> 
> > diff --git a/arch/x86/include/asm/cpufeatures.h
> > b/arch/x86/include/asm/cpufeatures.h
> > index 92a8308..62d979b9 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -12,7 +12,7 @@
> >  /*
> >   * Defines x86 CPU feature bits
> >   */
> > -#define NCAPINTS	18	/* N 32-bit words worth of info */
> > +#define NCAPINTS	19	/* N 32-bit words worth of info */
> >  #define NBUGINTS	1	/* N 32-bit bug flags */
> >
> >  /*
> > @@ -220,6 +220,7 @@
> >  #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional
> Memory */
> >  #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring
> */
> >  #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection
> Extension */
> > +#define X86_FEATURE_RDT		( 9*32+15) /* Resource Director
> Technology */
> >  #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
> >  #define X86_FEATURE_AVX512DQ	( 9*32+17) /* AVX-512 DQ
> (Double/Quad granular) Instructions */
> >  #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction
> */
> > @@ -286,6 +287,9 @@
> >  #define X86_FEATURE_SUCCOR	(17*32+1) /* Uncorrectable error
> containment and recovery */
> >  #define X86_FEATURE_SMCA	(17*32+3) /* Scalable MCA */
> >
> > +/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word
> > +18 */
> 
> Seems like this leaf is dedicated to CAT and has only 2 feature bits defined in
> the SDM. Please use init_scattered_cpuid_features() instead of adding a
> whole CAP word.

Actually this leaf will be extended to have more bits for more resources allocation.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08 16:53     ` Yu, Fenghua
@ 2016-09-08 17:17       ` Borislav Petkov
  0 siblings, 0 replies; 96+ messages in thread
From: Borislav Petkov @ 2016-09-08 17:17 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Luck, Tony,
	Peter Zijlstra, Tejun Heo, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

On Thu, Sep 08, 2016 at 04:53:52PM +0000, Yu, Fenghua wrote:
> Actually this leaf will be extended to have more bits for more
> resources allocation.

You can move it to a separate leaf then.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 02/33] Documentation, ABI: Add a document entry for cache id
  2016-09-08  9:56 ` [PATCH v2 02/33] Documentation, ABI: Add a document entry for " Fenghua Yu
@ 2016-09-08 19:33   ` Thomas Gleixner
  2016-09-09 15:11     ` Nilay Vaish
  0 siblings, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 19:33 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +What:		/sys/devices/system/cpu/cpu*/cache/index*/id
> +Date:		July 2016
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:	Cache id
> +
> +		The id identifies a hardware cache of the system within a given
> +		cache index in a set of cache indices. The "index" name is
> +		simply a nomenclature from CPUID's leaf 4 which enumerates all
> +		caches on the system by referring to each one as a cache index.
> +		The (cache index, cache id) pair is unique for the whole
> +		system.
> +
> +		Currently id is implemented on x86. On other platforms, id is
> +		not enabled yet.

And it never will be available on anything else than x86 because there is
no other architecture providing CPUID leaf 4 ....

If you want this to be generic then get rid of the x86isms in the
explanation and describe the x86 specific part seperately.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 26/33] Task fork and exit for rdtgroup
  2016-09-08  9:57 ` [PATCH v2 26/33] Task fork and exit for rdtgroup Fenghua Yu
@ 2016-09-08 19:41   ` Thomas Gleixner
  2016-09-13 23:13   ` Dave Hansen
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 19:41 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
>  
>  	cgroup_exit(tsk);
> +	rdtgroup_exit(tsk);

So this actually does:

> +void rdtgroup_exit(struct task_struct *tsk)
> +{
> +
> +       if (!list_empty(&tsk->rg_list)) {
> +               struct rdtgroup *rdtgrp = tsk->rdtgroup;
> +
> +               list_del_init(&tsk->rg_list);
> +               tsk->rdtgroup = NULL;
> +               atomic_dec(&rdtgrp->refcount);
> +       }
> +}

with complete lack of locking .....

Brilliant stuff that.

	  tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands
  2016-09-08  9:57 ` [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands Fenghua Yu
@ 2016-09-08 20:09   ` Thomas Gleixner
  2016-09-08 22:04   ` Shaohua Li
  1 sibling, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 20:09 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
> +static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
> +	.mkdir          = rdtgroup_mkdir,
> +	.rmdir          = rdtgroup_rmdir,
> +};
> +
> +static struct file_system_type rdt_fs_type = {
> +	.name = "resctrl",
> +	.mount = rdt_mount,
> +	.kill_sb = rdt_kill_sb,
> +};

So the above struct is nicely aligned and readable. While this one is
not. Sigh,

>  struct rdtgroup *root_rdtgrp;
>  static struct rftype rdtgroup_partition_base_files[];
>  struct cache_domain cache_domains[MAX_CACHE_LEAVES];
>  /* The default hierarchy. */
>  struct rdtgroup_root rdtgrp_dfl_root;
>  static struct list_head rdtgroups;
> +bool rdtgroup_mounted;

Your choice of global/static visible variables is driven by a random
generator or what?
  
>  /*
>   * kernfs_root - find out the kernfs_root a kernfs_node belongs to
> @@ -730,6 +749,110 @@ static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
>  	kernfs_remove(rdtgrp->kn);
>  }
>  
> +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> +			umode_t mode)
> +{
> +	struct rdtgroup *parent, *rdtgrp;
> +	struct rdtgroup_root *root;
> +	struct kernfs_node *kn;
> +	int ret;
> +
> +	if (parent_kn != root_rdtgrp->kn)
> +		return -EPERM;
> +
> +	/* Do not accept '\n' to avoid unparsable situation.
> +	 */

Where did copy you this comment style from? Its' horrible and here is a
lengthy explanation why: https://lkml.org/lkml/2016/7/8/625

> +static struct dentry *rdt_mount(struct file_system_type *fs_type,
> +			 int flags, const char *unused_dev_name,
> +			 void *data)
> +{
> +	struct super_block *pinned_sb = NULL;
> +	struct rdtgroup_root *root;
> +	struct dentry *dentry;
> +	int ret;
> +	bool new_sb;
> +
> +	/*
> +	 * The first time anyone tries to mount a rdtgroup, enable the list
> +	 * linking tasks and fix up all existing tasks.

What are 'list linking tasks'? What is fixed up here?

> +	 */
> +	if (rdtgroup_mounted)
> +		return ERR_PTR(-EBUSY);

How is this serialized against concurrent mounts? Oh well....

> +	rdt_opts.cdp_enabled = false;
> +	rdt_opts.verbose = false;
> +	cdp_enabled = false;
> +
> +	ret = parse_rdtgroupfs_options(data);
> +	if (ret)
> +		goto out_mount;
> +
> +	if (rdt_opts.cdp_enabled) {
> +		cdp_enabled = true;
> +		cconfig.max_closid >>= cdp_enabled;
> +		pr_info("CDP is enabled\n");
> +	}
> +
> +	init_msrs(cdp_enabled);
> +
> +	root = &rdtgrp_dfl_root;
> +
> +	ret = get_default_resources(&root->rdtgrp);
> +	if (ret)
> +		return ERR_PTR(-ENOSPC);
> +
> +out_mount:
> +	dentry = kernfs_mount(fs_type, flags, root->kf_root,
> +			      RDTGROUP_SUPER_MAGIC,
> +			      &new_sb);
> +	if (IS_ERR(dentry) || !new_sb)
> +		goto out_unlock;

And ret , which is returned @out_unlock is 0. So instead of returning the
error code encoded in dentry you return a NULL pointer.

> +	/*
> +	 * If @pinned_sb, we're reusing an existing root and holding an
> +	 * extra ref on its sb.  Mount is complete.  Put the extra ref.
> +	 */
> +	if (pinned_sb) {

And how exactly becomes pinned_sb != NULL?

> +		WARN_ON(new_sb);
> +		deactivate_super(pinned_sb);
> +	}

Not at all.

> +	INIT_LIST_HEAD(&root->rdtgrp.pset.tasks);
> +
> +	cpumask_copy(&root->rdtgrp.cpu_mask, cpu_online_mask);
> +	static_key_slow_inc(&rdt_enable_key);
> +	rdtgroup_mounted = true;
> +
> +	return dentry;
> +
> +out_unlock:
> +	return ERR_PTR(ret);

So a return magically unlocks stuff. Or does this happen in ERR_PTR()?

This jump label is not only pointless it's also named badly.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus
  2016-09-08  9:57 ` [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus Fenghua Yu
@ 2016-09-08 20:25   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 20:25 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> Normally each task is associated with one rdtgroup and we use the schema
> for that rdtgroup whenever the task is running. The user can designate
> some cpus to always use the same schema, regardless of which task is
> running. To do that the user write a cpumask bit string to the "cpus"
> file.

Is that just a left over of the previous series or am I completely confused
by now?

> +static int cpus_validate(struct cpumask *cpumask, struct rdtgroup *rdtgrp)
> +{
> +	int old_cpumask_bit, new_cpumask_bit;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		old_cpumask_bit = cpumask_test_cpu(cpu, &rdtgrp->cpu_mask);
> +		new_cpumask_bit = cpumask_test_cpu(cpu, cpumask);
> +		/* Cannot clear a "cpus" bit in a rdtgroup. */
> +		if (old_cpumask_bit == 1 && new_cpumask_bit == 0)
> +			return -EINVAL;
> +	}
> +
> +	/* If a cpu is not online, cannot set it. */
> +	for_each_cpu(cpu, cpumask) {
> +		if (!cpu_online(cpu))
> +			return -EINVAL;
> +	}

cpumask_intersects() exists for a reason. And how is this protected against
cpu hotplug?

> +	list_for_each(l, &rdtgroup_lists) {
> +		r = list_entry(l, struct rdtgroup, rdtgroup_list);
> +		if (r == rdtgrp)
> +			continue;
> +
> +		for_each_cpu_and(cpu, &r->cpu_mask, cpumask)
> +			cpumask_clear_cpu(cpu, &r->cpu_mask);

This code clearly predates the invention of cpumask_andnot()

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write
  2016-09-08  9:57 ` [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write Fenghua Yu
@ 2016-09-08 20:50   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 20:50 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:
>  void rdtgroup_exit(struct task_struct *tsk)
>  {
> +	if (tsk->rdtgroup) {
> +		atomic_dec(&tsk->rdtgroup->refcount);
> +		tsk->rdtgroup = NULL;

The changelog still talks about this list stuff while this patch seems to
remove it and solely rely on the tsk->rdtgroup pointer, which is a sensible
thing to do.

Writing sensible changelogs would make the life of reviewers too easy,
right?

> +	}
> +}
>  
> -	if (!list_empty(&tsk->rg_list)) {
> -		struct rdtgroup *rdtgrp = tsk->rdtgroup;
> +static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
> +{
> +	struct task_struct *p;
> +	struct rdtgroup *this = r;
>  
> -		list_del_init(&tsk->rg_list);
> -		tsk->rdtgroup = NULL;
> -		atomic_dec(&rdtgrp->refcount);
> +
> +	if (r == root_rdtgrp)
> +		return;

Why has the root_rdtgroup a task file in the first place?

>  static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
> @@ -781,8 +811,6 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>  		goto out_unlock;
>  	}
>  
> -	INIT_LIST_HEAD(&rdtgrp->pset.tasks);
> -
>  	cpumask_clear(&rdtgrp->cpu_mask);
>  
>  	rdtgrp->root = root;
> @@ -843,7 +871,7 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
>  	if (!rdtgrp)
>  		return -ENODEV;
>  
> -	if (!list_empty(&rdtgrp->pset.tasks)) {
> +	if (atomic_read(&rdtgrp->refcount)) {

So you rely on rdtgrp->refcount completely now.

>  /*
>   * Forcibly remove all of subdirectories under root.
> @@ -1088,22 +1094,16 @@ void rdtgroup_fork(struct task_struct *child)
>  {
>  	struct rdtgroup *rdtgrp;
>  
> -	INIT_LIST_HEAD(&child->rg_list);
> +	child->rdtgroup = NULL;
>  	if (!rdtgroup_mounted)
>  		return;
>  
> -	mutex_lock(&rdtgroup_mutex);
> -
>  	rdtgrp = current->rdtgroup;
>  	if (!rdtgrp)
> -		goto out;
> +		return;
>  
> -	list_add_tail(&child->rg_list, &rdtgrp->pset.tasks);
>  	child->rdtgroup = rdtgrp;
>  	atomic_inc(&rdtgrp->refcount);

This lacks any form of documentation WHY this is correct and works. I asked
you last time to document the locking and serialization rules ....

>  	cpumask_copy(&root->rdtgrp.cpu_mask, cpu_online_mask);

Can't remember if I told you already, but this is racy against hotplug.

> +static int _rdtgroup_move_task(struct task_struct *tsk, struct rdtgroup *rdtgrp)
> +{
> +	if (tsk->rdtgroup)
> +		atomic_dec(&tsk->rdtgroup->refcount);
> +
> +	if (rdtgrp == root_rdtgrp)
> +		tsk->rdtgroup = NULL;
> +	else
> +		tsk->rdtgroup = rdtgrp;
> +
> +	atomic_inc(&rdtgrp->refcount);
> +
> +	return 0;
> +}
> +
> +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
> +			      bool threadgroup, struct kernfs_open_file *of)
> +{
> +	struct task_struct *tsk;
> +	int ret;
> +
> +	rcu_read_lock();
> +	if (pid) {
> +		tsk = find_task_by_vpid(pid);
> +		if (!tsk) {
> +			ret = -ESRCH;
> +			goto out_unlock_rcu;
> +		}
> +	} else {
> +		tsk = current;
> +	}
> +
> +	if (threadgroup)
> +		tsk = tsk->group_leader;
> +
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +
> +	ret = rdtgroup_procs_write_permission(tsk, of);
> +	if (!ret)
> +		_rdtgroup_move_task(tsk, rdtgrp);
> +
> +	put_task_struct(tsk);
> +	goto out_unlock_threadgroup;
> +
> +out_unlock_rcu:
> +	rcu_read_unlock();
> +out_unlock_threadgroup:
> +	return ret;
> +}
> +
> +ssize_t _rdtgroup_procs_write(struct rdtgroup *rdtgrp,
> +			   struct kernfs_open_file *of, char *buf,
> +			   size_t nbytes, loff_t off, bool threadgroup)

global visible? And what is the underscore for?

> +{
> +	pid_t pid;
> +	int ret;
> +
> +	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
> +		return -EINVAL;

Why do you evaluate the buffer inside the lock held region?

This function split is completely bogus and artificial.

> +
> +	ret = rdtgroup_move_task(pid, rdtgrp, threadgroup, of);
> +
> +	return ret ?: nbytes;
> +}
> +
> +static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
> +				  char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int ret;
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +	if (!rdtgrp)
> +		return -ENODEV;
> +
> +	ret = _rdtgroup_procs_write(rdtgrp, of, buf, nbytes, off, false);
> +
> +	rdtgroup_kn_unlock(of->kn);
> +	return ret;
> +}

More sigh.

     tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-08  9:57 ` [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
  2016-09-08 11:22   ` Borislav Petkov
@ 2016-09-08 22:01   ` Shaohua Li
  2016-09-09  1:17     ` Fenghua Yu
  1 sibling, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-08 22:01 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> The documentation describes user interface of how to allocate resource
> in Intel RDT.
> 
> Please note that the documentation covers generic user interface. Current
> patch set code only implemente CAT L3. CAT L2 code will be sent later.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 164 insertions(+)
>  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> 
> diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> new file mode 100644
> index 0000000..27de386
> --- /dev/null
> +++ b/Documentation/x86/intel_rdt_ui.txt
> @@ -0,0 +1,164 @@
> +User Interface for Resource Allocation in Intel Resource Director Technology
> +
> +Copyright (C) 2016 Intel Corporation
> +
> +Fenghua Yu <fenghua.yu@intel.com>
> +Tony Luck <tony.luck@intel.com>
> +
> +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
> +X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
> +
> +To use the feature mount the file system:
> +
> + # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
> +
> +mount options are:
> +
> +"cdp": Enable code/data prioritization in L3 cache allocations.
> +
> +"verbose": Output more info in the "info" file under info directory
> +	and in dmesg. This is mainly for debug.
> +
> +
> +Resource groups
> +---------------
> +Resource groups are represented as directories in the resctrl file
> +system. The default group is the root directory. Other groups may be
> +created as desired by the system administrator using the "mkdir(1)"
> +command, and removed using "rmdir(1)".
> +
> +There are three files associated with each group:
> +
> +"tasks": A list of tasks that belongs to this group. Tasks can be
> +	added to a group by writing the task ID to the "tasks" file
> +	(which will automatically remove them from the previous
> +	group to which they belonged). New tasks created by fork(2)
> +	and clone(2) are added to the same group as their parent.
> +	If a pid is not in any sub partition, it is in root partition
> +	(i.e. default partition).
Hi Fenghua,

Will you add a 'procs' interface to allow move a process into a group? Using
the 'tasks' interface to move process is inconvenient and has race conditions
(eg, some new threads could be escaped).

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands
  2016-09-08  9:57 ` [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands Fenghua Yu
  2016-09-08 20:09   ` Thomas Gleixner
@ 2016-09-08 22:04   ` Shaohua Li
  2016-09-09  1:23     ` Fenghua Yu
  1 sibling, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-08 22:04 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 02:57:21AM -0700, Fenghua Yu wrote:
>  /*
>   * kernfs_root - find out the kernfs_root a kernfs_node belongs to
> @@ -730,6 +749,110 @@ static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
>  	kernfs_remove(rdtgrp->kn);
>  }
>  
> +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> +			umode_t mode)
> +{
> +	struct rdtgroup *parent, *rdtgrp;
> +	struct rdtgroup_root *root;
> +	struct kernfs_node *kn;
> +	int ret;
> +
> +	if (parent_kn != root_rdtgrp->kn)
> +		return -EPERM;
> +

So we can't create nested groups. Is this limitation temporary? I don't see
this is mentioned in the interface document.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface
  2016-09-08  9:57 ` [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface Fenghua Yu
@ 2016-09-08 22:20   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 22:20 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> +struct resources {

Darn. The first look made me parse this as a redefinition of 'struct
resource' ... Can't you find an even more generic name for this?

> +	struct cache_resource *l3;
> +};
> +
> +static int get_res_type(char **res, enum resource_type *res_type)
> +{
> +	char *tok;
> +
> +	tok = strsep(res, ":");
> +	if (tok == NULL)

We still write: if (!tok) as anywhere else.

> +static int divide_resources(char *buf, char *resources[RESOURCE_NUM])
> +{
> +	char *tok;
> +	unsigned int resource_num = 0;
> +	int ret = 0;
> +	char *res;
> +	char *res_block;
> +	size_t size;
> +	enum resource_type res_type;

Sigh.

> +
> +	size = strlen(buf) + 1;
> +	res = kzalloc(size, GFP_KERNEL);
> +	if (!res) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +
> +	while ((tok = strsep(&buf, "\n")) != NULL) {
> +		if (strlen(tok) == 0)
> +			break;
> +		if (resource_num++ >= 1) {

How gets that ever greater 1?

> +			pr_info("More than one line of resource input!\n");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		strcpy(res, tok);
> +	}
> +
> +	res_block = res;
> +	ret = get_res_type(&res_block, &res_type);
> +	if (ret) {
> +		pr_info("Unknown resource type!");
> +		goto out;
> +	}
> +
> +	if (res_block == NULL) {
> +		pr_info("Invalid resource value!");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (res_type == RESOURCE_L3 && cat_enabled(CACHE_LEVEL3)) {
> +		strcpy(resources[RESOURCE_L3], res_block);
> +	} else {
> +		pr_info("Invalid resource type!");
> +		goto out;
> +	}
> +
> +	ret = 0;


You surely found the most convoluted solution for this. Whats wrong with:

	data = get_res_type(res, &type);
	if (IS_ERR(data)) {
		ret = PTR_ERR(data);
		goto out;
	} 

	ret = 0;
	switch (type) {
	case RESOURCE_L3:
		if (cat_enabled(CACHE_LEVEL3))
			strcpy(resources[RESOURCE_L3], data);
		break;
	default:
		ret = -EINVAL;
	}	

That's too simple to understand and too extensible for future resource
types, right?

> +out:
> +	kfree(res);
> +	return ret;
> +}

> +static int get_input_cbm(char *tok, struct cache_resource *l,
> +			 int input_domain_num, int level)
> +{
> +	int ret;
> +
> +	if (!cdp_enabled) {
> +		if (tok == NULL)
> +			return -EINVAL;
> +
> +		ret = kstrtoul(tok, 16,
> +			       (unsigned long *)&l->cbm[input_domain_num]);

What is this type cast for? Can't you just parse the data into a local
unsigned long and then store it after validation?

> +		if (ret)
> +			return ret;
> +
> +		if (!cbm_validate(l->cbm[input_domain_num], level))
> +			return -EINVAL;
> +	} else  {
> +		char *input_cbm1_str;
> +
> +		input_cbm1_str = strsep(&tok, ",");
> +		if (input_cbm1_str == NULL || tok == NULL)
> +			return -EINVAL;
> +
> +		ret = kstrtoul(input_cbm1_str, 16,
> +			       (unsigned long *)&l->cbm[input_domain_num]);
> +		if (ret)
> +			return ret;
> +
> +		if (!cbm_validate(l->cbm[input_domain_num], level))
> +			return -EINVAL;
> +
> +		ret = kstrtoul(tok, 16,
> +			       (unsigned long *)&l->cbm2[input_domain_num]);
> +		if (ret)
> +			return ret;
> +
> +		if (!cbm_validate(l->cbm2[input_domain_num], level))
> +			return -EINVAL;

So you have 3 copies of the same sequence now. At other places you split
out the most tiniest stuff into a gazillion of helper functions ...

Just create a parser helper and call it for any of those types. So the
whole thing boils down to:

static int parse_cbm_token(char *tok, u64 *cbm, int level)
{
	unsigned long data;
	int ret;

	ret = kstrtoul(tok, 16, &data);
	if (ret)
		return ret;
	if (!cbm_validate(data, level))
	   	return -EINVAL;
	*cbm = data;
	return 0;
}

static int parse_cbm(char *buf, struct cache_resource *cr, int domain,
		     int level)
{
	char *cbm1 = buf;
	int ret;

	if (cdp_enabled)
		cbm1 = strsep(&buf, ',');

	if (!cbm1 || !buf)
	   	return -EINVAL;

	ret = parse_cbm_token(cbm1, &cr->cbm[domain]);
	if (ret)
		return ret;

	if (cdp_enmabled)
		return parse_cbm_token(buf, &cr->cbm2[domain]);
	return 0;
}

Copy and paste is simpler than thinking, but the result is uglier and
harder to read.

> +static int get_cache_schema(char *buf, struct cache_resource *l, int level,
> +			 struct rdtgroup *rdtgrp)
> +{
> +	char *tok, *tok_cache_id;
> +	int ret;
> +	int domain_num;
> +	int input_domain_num;
> +	int len;
> +	unsigned int input_cache_id;
> +	unsigned int cid;
> +	unsigned int leaf;
> +
> +	if (!cat_enabled(level) && strcmp(buf, ";")) {
> +		pr_info("Disabled resource should have empty schema\n");

So an empty schema is a string which is != ";". Very interesting.

This enabled check here wants more thoughts. If a resource is disabled,
then the input line should be simply ignored. Otherwise you need to rewrite
scripts, config files just because you disabled a particular resource.

> +		return -EINVAL;
> +	}
> +}
> +
> +enum {
> +	CURRENT_CLOSID,
> +	REUSED_OWN_CLOSID,
> +	REUSED_OTHER_CLOSID,
> +	NEW_CLOSID,
> +};

Another random enum in the middle of the code and at a place where it is
completely disjunct from its usage.

> +
> +/*
> + * Check if the reference counts are all ones in rdtgrp's domain.
> + */
> +static bool one_refcnt(struct rdtgroup *rdtgrp, int domain)

A really self explaining function name - NOT!

> +/*
> + * Go through all shared domains. Check if there is an existing closid
> + * in all rdtgroups that matches l3 cbms in the shared
> + * domain. If find one, reuse the closid. Otherwise, allocate a new one.
> + */
> +static int get_rdtgroup_resources(struct resources *resources_set,
> +				  struct rdtgroup *rdtgrp)
> +{
> +	struct cache_resource *l3;
> +	bool l3_cbm_found;
> +	struct list_head *l;
> +	struct rdtgroup *r;
> +	u64 cbm;
> +	int rdt_closid[MAX_CACHE_DOMAINS];
> +	int rdt_closid_type[MAX_CACHE_DOMAINS];

Have you ever checked what the stack foot print of this whole callchain is?
One of the callers has already a char array[1024] on the stack.....

> +	int domain;
> +	int closid;
> +	int ret;
> +
> +	l3 = resources_set->l3;
> +	memcpy(rdt_closid, rdtgrp->resource.closid,
> +	       shared_domain_num * sizeof(int));

Can you please seperate stuff with new lines ocassionally to make it
readable?

> +	for (domain = 0; domain < shared_domain_num; domain++) {
> +		if (rdtgrp->resource.valid) {
> +			/*
> +			 * If current rdtgrp is the only user of cbms in
> +			 * this domain, will replace the cbms with the input
> +			 * cbms and reuse its own closid.
> +			 */
> +			if (one_refcnt(rdtgrp, domain)) {
> +				closid = rdtgrp->resource.closid[domain];
> +				rdt_closid[domain] = closid;
> +				rdt_closid_type[domain] = REUSED_OWN_CLOSID;
> +				continue;
> +			}
> +
> +			l3_cbm_found = true;
> +
> +			if (cat_l3_enabled)
> +				l3_cbm_found = cbm_found(l3, rdtgrp, domain,
> +							 CACHE_LEVEL3);
> +
> +			/*
> +			 * If the cbms in this shared domain are already
> +			 * existing in current rdtgrp, record the closid
> +			 * and its type.
> +			 */
> +			if (l3_cbm_found) {
> +				closid = rdtgrp->resource.closid[domain];
> +				rdt_closid[domain] = closid;
> +				rdt_closid_type[domain] = CURRENT_CLOSID;
> +				continue;
> +			}

This is unreadable once more.

     		   	if (find_cbm(l3, rdtgrp, domain, CACHE_LEVEL3) {
				closid = rdtgrp->resource.closid[domain];
				rdt_closid[domain] = closid;
				rdt_closid_type[domain] = CURRENT_CLOSID;
				continue;
			}

That requires that find_cbm() - which is a way more intuitive name than
cbm_found() - returns false when cat_l3_enabled is false. Which is trivial
and obvious ...
			
> +		}
> +
> +		/*
> +		 * If the cbms are not found in this rdtgrp, search other
> +		 * rdtgroups and see if there are matched cbms.
> +		 */
> +		l3_cbm_found = cat_l3_enabled ? false : true;

What the heck?

		l3_cbm_found = !cat_l3_enabled;

Is too simple obviously.

Aside of that silly conditional: If cat_l3_enables is false then
l3_cbm_found is true.

> +		list_for_each(l, &rdtgroup_lists) {
> +			r = list_entry(l, struct rdtgroup, rdtgroup_list);
> +			if (r == rdtgrp || !r->resource.valid)
> +				continue;
> +
> +			if (cat_l3_enabled)
> +				l3_cbm_found = cbm_found(l3, r, domain,
> +							 CACHE_LEVEL3);

And because this path is never taken when cat_l3_enabled is false.
 
> +
> +			if (l3_cbm_found) {

We happily get the closid for something which is not enabled at all. What
is the logic here? I can't find any in this convoluted mess.

> +				/* Get the closid that matches l3 cbms.*/
> +				closid = r->resource.closid[domain];
> +				rdt_closid[domain] = closid;
> +				rdt_closid_type[domain] = REUSED_OTHER_CLOSID;
> +				break;
> +			}
> +		}
> +		if (!l3_cbm_found) {
> +			/*
> +			 * If no existing closid is found, allocate
> +			 * a new one.
> +			 */
> +			ret = closid_alloc(&closid, domain);
> +			if (ret)
> +				goto err;
> +			rdt_closid[domain] = closid;
> +			rdt_closid_type[domain] = NEW_CLOSID;
> +		}
> +	}

I really don't want to imagine how this might look like when you add L2
support and if you have code doing this please hide it in the poison
cabinet forever,

> +	/*
> +	 * Now all closid are ready in rdt_closid. Update rdtgrp's closid.
> +	 */
> +	for_each_cache_domain(domain, 0, shared_domain_num) {
> +		/*
> +		 * Nothing is changed if the same closid and same cbms were
> +		 * found in this rdtgrp's domain.
> +		 */
> +		if (rdt_closid_type[domain] == CURRENT_CLOSID)
> +			continue;
> +
> +		/*
> +		 * Put rdtgroup closid. No need to put the closid if we
> +		 * just change cbms and keep the closid (REUSED_OWN_CLOSID).
> +		 */
> +		if (rdtgrp->resource.valid &&
> +		    rdt_closid_type[domain] != REUSED_OWN_CLOSID) {
> +			/* Put old closid in this rdtgrp's domain if valid. */
> +			closid = rdtgrp->resource.closid[domain];
> +			closid_put(closid, domain);
> +		}
> +
> +		/*
> +		 * Replace the closid in this rdtgrp's domain with saved
> +		 * closid that was newly allocted (NEW_CLOSID), or found in
> +		 * another rdtgroup's domains (REUSED_CLOSID), or found in
> +		 * this rdtgrp (REUSED_OWN_CLOSID).
> +		 */
> +		closid = rdt_closid[domain];
> +		rdtgrp->resource.closid[domain] = closid;
> +
> +		/*
> +		 * Get the reused other rdtgroup's closid. No need to get the
> +		 * closid newly allocated (NEW_CLOSID) because it's been
> +		 * already got in closid_alloc(). And no need to get the closid
> +		 * for resued own closid (REUSED_OWN_CLOSID).
> +		 */
> +		if (rdt_closid_type[domain] == REUSED_OTHER_CLOSID)
> +			closid_get(closid, domain);
> +
> +		/*
> +		 * If the closid comes from a newly allocated closid
> +		 * (NEW_CLOSID), or found in this rdtgrp (REUSED_OWN_CLOSID),
> +		 * cbms for this closid will be updated in MSRs.
> +		 */
> +		if (rdt_closid_type[domain] == NEW_CLOSID ||
> +		    rdt_closid_type[domain] == REUSED_OWN_CLOSID) {
> +			/*
> +			 * Update cbm in cctable with the newly allocated
> +			 * closid.
> +			 */
> +			if (cat_l3_enabled) {
> +				int cpu;
> +				struct cpumask *mask;
> +				int dindex;
> +				int l3_domain = shared_domain[domain].l3_domain;
> +				int leaf = level_to_leaf(CACHE_LEVEL3);
> +
> +				cbm = l3->cbm[l3_domain];
> +				dindex = get_dcbm_table_index(closid);
> +				l3_cctable[l3_domain][dindex].cbm = cbm;
> +				if (cdp_enabled) {
> +					int iindex;
> +
> +					cbm = l3->cbm2[l3_domain];
> +					iindex = get_icbm_table_index(closid);
> +					l3_cctable[l3_domain][iindex].cbm = cbm;
> +				}
> +
> +				mask =
> +				&cache_domains[leaf].shared_cpu_map[l3_domain];
> +
> +				cpu = cpumask_first(mask);
> +				smp_call_function_single(cpu, cbm_update_l3_msr,
> +							 &closid, 1);

Again, why don't ypu split that out into a seperate function instead of
having the forth indentation level and random line breaks?

> +static void init_cache_resource(struct cache_resource *l)
> +{
> +	l->cbm = NULL;
> +	l->cbm2 = NULL;
> +	l->closid = NULL;
> +	l->refcnt = NULL;

memset ?

> +}
> +
> +static void free_cache_resource(struct cache_resource *l)
> +{
> +	kfree(l->cbm);
> +	kfree(l->cbm2);
> +	kfree(l->closid);
> +	kfree(l->refcnt);
> +}
> +
> +static int alloc_cache_resource(struct cache_resource *l, int level)
> +{
> +	int domain_num = get_domain_num(level);
> +
> +	l->cbm = kcalloc(domain_num, sizeof(*l->cbm), GFP_KERNEL);
> +	l->cbm2 = kcalloc(domain_num, sizeof(*l->cbm2), GFP_KERNEL);
> +	l->closid = kcalloc(domain_num, sizeof(*l->closid), GFP_KERNEL);
> +	l->refcnt = kcalloc(domain_num, sizeof(*l->refcnt), GFP_KERNEL);
> +	if (l->cbm && l->cbm2 && l->closid && l->refcnt)
> +		return 0;
> +
> +	return -ENOMEM;
> +}
> +
> +/*
> + * This function digests schemata given in text buf. If the schemata are in
> + * right format and there is enough closid, input the schemata in rdtgrp
> + * and update resource cctables.
> + *
> + * Inputs:
> + *	buf: string buffer containing schemata
> + *	rdtgrp: current rdtgroup holding schemata.
> + *
> + * Return:
> + *	0 on success or error code.
> + */
> +static int get_resources(char *buf, struct rdtgroup *rdtgrp)
> +{
> +	char *resources[RESOURCE_NUM];
> +	struct cache_resource l3;
> +	struct resources resources_set;
> +	int ret;
> +	char *resources_block;
> +	int i;
> +	int size = strlen(buf) + 1;
> +
> +	resources_block = kcalloc(RESOURCE_NUM, size, GFP_KERNEL);
> +	if (!resources_block)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < RESOURCE_NUM; i++)
> +		resources[i] = (char *)(resources_block + i * size);

This is a recurring scheme in your code. Allocating a runtime sized array
and initializing pointers.

Darn, instead of open coding this in several places can't you just make a
single function which does exactly that?

> +	ret = divide_resources(buf, resources);
> +	if (ret) {
> +		kfree(resources_block);
> +		return -EINVAL;
> +	}
> +
> +	init_cache_resource(&l3);
> +
> +	if (cat_l3_enabled) {
> +		ret = alloc_cache_resource(&l3, CACHE_LEVEL3);
> +		if (ret)
> +			goto out;
> +
> +		ret = get_cache_schema(resources[RESOURCE_L3], &l3,
> +				       CACHE_LEVEL3, rdtgrp);
> +		if (ret)
> +			goto out;
> +
> +		resources_set.l3 = &l3;
> +	} else
> +		resources_set.l3 = NULL;



> +
> +	ret = get_rdtgroup_resources(&resources_set, rdtgrp);
> +
> +out:
> +	kfree(resources_block);
> +	free_cache_resource(&l3);
> +
> +	return ret;
> +}
> +
> +static void gen_cache_prefix(char *buf, int level)
> +{
> +	sprintf(buf, "L%1d:", level == CACHE_LEVEL3 ? 3 : 2);
> +}
> +
> +static int get_cache_id(int domain, int level)
> +{
> +	return cache_domains[level_to_leaf(level)].shared_cache_id[domain];
> +}
> +
> +static void gen_cache_buf(char *buf, int level)
> +{
> +	int domain;
> +	char buf1[32];
> +	int domain_num;
> +	u64 val;
> +
> +	gen_cache_prefix(buf, level);
> +
> +	domain_num = get_domain_num(level);
> +
> +	val = max_cbm(level);
> +
> +	for (domain = 0; domain < domain_num; domain++) {
> +		sprintf(buf1, "%d=%lx", get_cache_id(domain, level),
> +			(unsigned long)val);
> +		strcat(buf, buf1);

WTF?

	char *p = buf;

	p += sprintf(p, "....", ...);
	p += sprintf(p, "....", ...);
	p += sprintf(p, "....", ...);

Solves the same problem as this local buffer on the stack plus strcat().

> +/*
> + * Set up default schemata in a rdtgroup. All schemata in all resources are
> + * default values (all 1's) for all domains.
> + *
> + * Input: rdtgroup.
> + * Return: 0: successful
> + *	   non-0: error code
> + */
> +int get_default_resources(struct rdtgroup *rdtgrp)
> +{
> +	char schema[1024];

And that number is pulled out of thin air or what?

> +	int ret = 0;
> +
> +	if (cat_enabled(CACHE_LEVEL3)) {
> +		gen_cache_buf(schema, CACHE_LEVEL3);
> +
> +		if (strlen(schema)) {
> +			ret = get_resources(schema, rdtgrp);
> +			if (ret)
> +				return ret;
> +		}
> +		gen_cache_buf(rdtgrp->schema, CACHE_LEVEL3);
> +	}
> +
> +	return ret;
> +}
> +
> +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
> +			char *buf, size_t nbytes, loff_t off)
> +{
> +	int ret = 0;
> +	struct rdtgroup *rdtgrp;
> +	char *schema;
> +
> +	rdtgrp = rdtgroup_kn_lock_live(of->kn);
> +	if (!rdtgrp)
> +		return -ENODEV;
> +
> +	schema = kzalloc(sizeof(char) * strlen(buf) + 1, GFP_KERNEL);
> +	if (!schema) {
> +		ret = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	memcpy(schema, buf, strlen(buf) + 1);

Open coding kstrdup() is indeed useful. and reevaluating strlen(buf) three
times in the same function is even more useful.

> +
> +	ret = get_resources(buf, rdtgrp);
> +	if (ret)
> +		goto out;
> +
> +	memcpy(rdtgrp->schema, schema, strlen(schema) + 1);

IIRC then the kernel has even strcpy() and strncpy for that matter.

Btw, what makes sure that strlen(schema) is < 1023 ?????

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT
  2016-09-08  9:57 ` [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT Fenghua Yu
@ 2016-09-08 22:25   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 22:25 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> From: Fenghua Yu <fenghua.yu@intel.com>
> 
> Add kernel parameter "resctrl" for CAT L3:

We add the fricking documentation for kernel parameters in the patch which introduces them and not an some random other place.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09  1:17     ` Fenghua Yu
@ 2016-09-08 22:45       ` Shaohua Li
  2016-09-09  7:22         ` Fenghua Yu
  0 siblings, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-08 22:45 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 06:17:47PM -0700, Fenghua Yu wrote:
> On Thu, Sep 08, 2016 at 03:01:20PM -0700, Shaohua Li wrote:
> > On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> > > From: Fenghua Yu <fenghua.yu@intel.com>
> > > 
> > > The documentation describes user interface of how to allocate resource
> > > in Intel RDT.
> > > 
> > > Please note that the documentation covers generic user interface. Current
> > > patch set code only implemente CAT L3. CAT L2 code will be sent later.
> > > 
> > > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > > ---
> > >  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 164 insertions(+)
> > >  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> > > 
> > > diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> > > new file mode 100644
> > > index 0000000..27de386
> > > --- /dev/null
> > > +++ b/Documentation/x86/intel_rdt_ui.txt
> > > @@ -0,0 +1,164 @@
> > > +User Interface for Resource Allocation in Intel Resource Director Technology
> > > +
> > > +Copyright (C) 2016 Intel Corporation
> > > +
> > > +Fenghua Yu <fenghua.yu@intel.com>
> > > +Tony Luck <tony.luck@intel.com>
> > > +
> > > +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
> > > +X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
> > > +
> > > +To use the feature mount the file system:
> > > +
> > > + # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
> > > +
> > > +mount options are:
> > > +
> > > +"cdp": Enable code/data prioritization in L3 cache allocations.
> > > +
> > > +"verbose": Output more info in the "info" file under info directory
> > > +	and in dmesg. This is mainly for debug.
> > > +
> > > +
> > > +Resource groups
> > > +---------------
> > > +Resource groups are represented as directories in the resctrl file
> > > +system. The default group is the root directory. Other groups may be
> > > +created as desired by the system administrator using the "mkdir(1)"
> > > +command, and removed using "rmdir(1)".
> > > +
> > > +There are three files associated with each group:
> > > +
> > > +"tasks": A list of tasks that belongs to this group. Tasks can be
> > > +	added to a group by writing the task ID to the "tasks" file
> > > +	(which will automatically remove them from the previous
> > > +	group to which they belonged). New tasks created by fork(2)
> > > +	and clone(2) are added to the same group as their parent.
> > > +	If a pid is not in any sub partition, it is in root partition
> > > +	(i.e. default partition).
> > Hi Fenghua,
> > 
> > Will you add a 'procs' interface to allow move a process into a group? Using
> > the 'tasks' interface to move process is inconvenient and has race conditions
> > (eg, some new threads could be escaped).
> 
> We don't plan to add a 'procs' interface for rdtgroup. We only use resctrl
> interface to allocate resources.
> 
> Why the "tasks" is inconvenient? If sysadmin wants to allocte a portion of L3
> for a pid, the operation in resctl is to write the pid to a "tasks". While
> in 'procs', the operation is to write a partition to a pid. If considering
> convenience, they are same, right?
> 
> A thread uses either default partition (in root dir) or a sub partition (in
> sub-directory). Sysadmin can control that. Kernel handles race condition.
> Any issue with that?

I don't mean writing the 'tasks' file is inconvenient. So to move a process to
a group, we do:
1. get all thread pid of the process
2. write every pid to 'tasks'

this is inconvenient. And if a new thread is created between 1 and 2, we don't
put the thread to the group. Am I missing anything?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c
  2016-09-08  9:57 ` [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c Fenghua Yu
@ 2016-09-08 23:59   ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-08 23:59 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: H. Peter Anvin, Ingo Molnar, Tony Luck, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, 8 Sep 2016, Fenghua Yu wrote:

> Build the user interface file intel_rdt_rdtgroup.c.

> +obj-$(CONFIG_INTEL_RDT)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o

This is garbage. Complete and utter garbage.

First of all this changelog is wrong because this adds not only the user
interface file it finally adds all files.

So up to this point nothing in this series was ever compiled.

And as of patch 13/33 the build is broken when CONFIG_RDT is enabled.

The fact that this whole thing is only ever compiled when the last patch is
applied explains completely why the split up of the patches is a total
nightmare.

Why on earth do you think that 

    Documentation/SubmittingPatches::Section 1::3

does not apply for this work? I can't see any reason why this patch series
would be exempt.

So here is what I want to see:

- Take the combined resulting patch and split it into the following
  sections:

   1) Introduce the Kconfig option

   2) Add intel_rdt.c/h and add it to the Makefile

   3) Add all general code which is unrelated to the user space
      interface/group stuff to intel_rdt.c

      (Split that up if and only if it makes sense)

   4) Add the schedule stuff

   5) Add intel_rdtgroup.c and add it to the Makefile

   6) Add the general filesystem code to it

   7) Add the fork/exit stuff

   8) Add the schema support file

   You might add a few more steps where you think it makes sense, e.g. split
   out kernel parameters or whatever is sensible on its own.

   Don't try to tell me that you need to preserve the original stuff from
   Vikas. It's not useful, it's just annoying and there is no point in
   keeping it. If anyone is interested in this horror he can look it up in
   the LKML archives, but there is no reason to stick that mess into the
   kernel history. You still can mention Vikas in the changelog tags as the
   initial author, but just keeping this horror - which has been even more
   horrified by you - is not an option at all.

- Go through these new patches and make sure that _all_ my review comments
  are addressed. I will make sure they are ....

- Add documentation to the patches where you add the functionality. The
  design documentation is ok as an upfront separate patch.

- Add documentation for locking, protection and lifetime rules.

- Make sure it builds/boots at every step with and without CONFIG_RDT.

But I'm certainly not going through that excercise once more and I'm going to
stop looking at the next series immediately when I notice that there is
still major wreckage left.

I also expect that Reviewed-by tags on the next set of patches are
seriously worth it, which I cannot say for this series sadly.

All I have left to say is:

     yell_WTF(nr_wtf_moments);

I leave the value of the function argument to your imagination.

Please bring it close to zero in the next iteration. It's not rocket
science and if you have questions, I'm (still) happy to help.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-08 22:01   ` Shaohua Li
@ 2016-09-09  1:17     ` Fenghua Yu
  2016-09-08 22:45       ` Shaohua Li
  0 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-09  1:17 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 03:01:20PM -0700, Shaohua Li wrote:
> On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> > From: Fenghua Yu <fenghua.yu@intel.com>
> > 
> > The documentation describes user interface of how to allocate resource
> > in Intel RDT.
> > 
> > Please note that the documentation covers generic user interface. Current
> > patch set code only implemente CAT L3. CAT L2 code will be sent later.
> > 
> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 164 insertions(+)
> >  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> > 
> > diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> > new file mode 100644
> > index 0000000..27de386
> > --- /dev/null
> > +++ b/Documentation/x86/intel_rdt_ui.txt
> > @@ -0,0 +1,164 @@
> > +User Interface for Resource Allocation in Intel Resource Director Technology
> > +
> > +Copyright (C) 2016 Intel Corporation
> > +
> > +Fenghua Yu <fenghua.yu@intel.com>
> > +Tony Luck <tony.luck@intel.com>
> > +
> > +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
> > +X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
> > +
> > +To use the feature mount the file system:
> > +
> > + # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
> > +
> > +mount options are:
> > +
> > +"cdp": Enable code/data prioritization in L3 cache allocations.
> > +
> > +"verbose": Output more info in the "info" file under info directory
> > +	and in dmesg. This is mainly for debug.
> > +
> > +
> > +Resource groups
> > +---------------
> > +Resource groups are represented as directories in the resctrl file
> > +system. The default group is the root directory. Other groups may be
> > +created as desired by the system administrator using the "mkdir(1)"
> > +command, and removed using "rmdir(1)".
> > +
> > +There are three files associated with each group:
> > +
> > +"tasks": A list of tasks that belongs to this group. Tasks can be
> > +	added to a group by writing the task ID to the "tasks" file
> > +	(which will automatically remove them from the previous
> > +	group to which they belonged). New tasks created by fork(2)
> > +	and clone(2) are added to the same group as their parent.
> > +	If a pid is not in any sub partition, it is in root partition
> > +	(i.e. default partition).
> Hi Fenghua,
> 
> Will you add a 'procs' interface to allow move a process into a group? Using
> the 'tasks' interface to move process is inconvenient and has race conditions
> (eg, some new threads could be escaped).

We don't plan to add a 'procs' interface for rdtgroup. We only use resctrl
interface to allocate resources.

Why the "tasks" is inconvenient? If sysadmin wants to allocte a portion of L3
for a pid, the operation in resctl is to write the pid to a "tasks". While
in 'procs', the operation is to write a partition to a pid. If considering
convenience, they are same, right?

A thread uses either default partition (in root dir) or a sub partition (in
sub-directory). Sysadmin can control that. Kernel handles race condition.
Any issue with that?

Thanks.

-Fenghua
The same 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands
  2016-09-08 22:04   ` Shaohua Li
@ 2016-09-09  1:23     ` Fenghua Yu
  0 siblings, 0 replies; 96+ messages in thread
From: Fenghua Yu @ 2016-09-09  1:23 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 03:04:12PM -0700, Shaohua Li wrote:
> On Thu, Sep 08, 2016 at 02:57:21AM -0700, Fenghua Yu wrote:
> >  /*
> >   * kernfs_root - find out the kernfs_root a kernfs_node belongs to
> > @@ -730,6 +749,110 @@ static void rdtgroup_destroy_locked(struct rdtgroup *rdtgrp)
> >  	kernfs_remove(rdtgrp->kn);
> >  }
> >  
> > +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> > +			umode_t mode)
> > +{
> > +	struct rdtgroup *parent, *rdtgrp;
> > +	struct rdtgroup_root *root;
> > +	struct kernfs_node *kn;
> > +	int ret;
> > +
> > +	if (parent_kn != root_rdtgrp->kn)
> > +		return -EPERM;
> > +
> 
> So we can't create nested groups. Is this limitation temporary? I don't see
> this is mentioned in the interface document.

Yes, we cannot create nested groups, i.e. up to one level of sub-directory
in the resctrl file system. This is different from cgroup.

I can add this info in the intel_rdt_ui.txt.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-08 22:45       ` Shaohua Li
@ 2016-09-09  7:22         ` Fenghua Yu
  2016-09-09 17:50           ` Shaohua Li
  0 siblings, 1 reply; 96+ messages in thread
From: Fenghua Yu @ 2016-09-09  7:22 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Thu, Sep 08, 2016 at 03:45:14PM -0700, Shaohua Li wrote:
> On Thu, Sep 08, 2016 at 06:17:47PM -0700, Fenghua Yu wrote:
> > On Thu, Sep 08, 2016 at 03:01:20PM -0700, Shaohua Li wrote:
> > > On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> > > > From: Fenghua Yu <fenghua.yu@intel.com>
> > > > 
> > > > The documentation describes user interface of how to allocate resource
> > > > in Intel RDT.
> > > > 
> > > > Please note that the documentation covers generic user interface. Current
> > > > patch set code only implemente CAT L3. CAT L2 code will be sent later.
> > > > 
> > > > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > > > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > > > ---
> > > >  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 164 insertions(+)
> > > >  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> > > > 
> > > > diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> > > > new file mode 100644
> > > > index 0000000..27de386
> > > > --- /dev/null
> > > > +++ b/Documentation/x86/intel_rdt_ui.txt
> > > > @@ -0,0 +1,164 @@
> > > > +User Interface for Resource Allocation in Intel Resource Director Technology
> > > > +
> > > > +Copyright (C) 2016 Intel Corporation
> > > > +
> > > > +Fenghua Yu <fenghua.yu@intel.com>
> > > > +Tony Luck <tony.luck@intel.com>
> > > > +
> > > > +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
> > > > +X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
> > > > +
> > > > +To use the feature mount the file system:
> > > > +
> > > > + # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
> > > > +
> > > > +mount options are:
> > > > +
> > > > +"cdp": Enable code/data prioritization in L3 cache allocations.
> > > > +
> > > > +"verbose": Output more info in the "info" file under info directory
> > > > +	and in dmesg. This is mainly for debug.
> > > > +
> > > > +
> > > > +Resource groups
> > > > +---------------
> > > > +Resource groups are represented as directories in the resctrl file
> > > > +system. The default group is the root directory. Other groups may be
> > > > +created as desired by the system administrator using the "mkdir(1)"
> > > > +command, and removed using "rmdir(1)".
> > > > +
> > > > +There are three files associated with each group:
> > > > +
> > > > +"tasks": A list of tasks that belongs to this group. Tasks can be
> > > > +	added to a group by writing the task ID to the "tasks" file
> > > > +	(which will automatically remove them from the previous
> > > > +	group to which they belonged). New tasks created by fork(2)
> > > > +	and clone(2) are added to the same group as their parent.
> > > > +	If a pid is not in any sub partition, it is in root partition
> > > > +	(i.e. default partition).
> > > Hi Fenghua,
> > > 
> > > Will you add a 'procs' interface to allow move a process into a group? Using
> > > the 'tasks' interface to move process is inconvenient and has race conditions
> > > (eg, some new threads could be escaped).
> > 
> > We don't plan to add a 'procs' interface for rdtgroup. We only use resctrl
> > interface to allocate resources.
> > 
> > Why the "tasks" is inconvenient? If sysadmin wants to allocte a portion of L3
> > for a pid, the operation in resctl is to write the pid to a "tasks". While
> > in 'procs', the operation is to write a partition to a pid. If considering
> > convenience, they are same, right?
> > 
> > A thread uses either default partition (in root dir) or a sub partition (in
> > sub-directory). Sysadmin can control that. Kernel handles race condition.
> > Any issue with that?
> 
> I don't mean writing the 'tasks' file is inconvenient. So to move a process to
> a group, we do:
> 1. get all thread pid of the process
> 2. write every pid to 'tasks'
> 
> this is inconvenient. And if a new thread is created between 1 and 2, we don't
> put the thread to the group. Am I missing anything?

As said in this doc, "New tasks created by fork(2) and clone(2) are added
to the same group as their parent.". So the new thread created b/w 1 and 2
will automatically go to the "tasks" as the process. Later sysadming can
still move any pid to any group.

Is this convenient?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 01/33] cacheinfo: Introduce cache id
  2016-09-08  9:56 ` [PATCH v2 01/33] cacheinfo: Introduce cache id Fenghua Yu
@ 2016-09-09 15:01   ` Nilay Vaish
  0 siblings, 0 replies; 96+ messages in thread
From: Nilay Vaish @ 2016-09-09 15:01 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 04:56, Fenghua Yu <fenghua.yu@intel.com> wrote:
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index 2189935..cf6984d 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -18,6 +18,7 @@ enum cache_type {
>
>  /**
>   * struct cacheinfo - represent a cache leaf node
> + * @id: This cache's id. ID is unique in the same index on the platform.

Can you explain what 'index on the platform' means?  I have absolutely
no idea what that is.

--
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 02/33] Documentation, ABI: Add a document entry for cache id
  2016-09-08 19:33   ` Thomas Gleixner
@ 2016-09-09 15:11     ` Nilay Vaish
  0 siblings, 0 replies; 96+ messages in thread
From: Nilay Vaish @ 2016-09-09 15:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 14:33, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
>> +What:                /sys/devices/system/cpu/cpu*/cache/index*/id
>> +Date:                July 2016
>> +Contact:     Linux kernel mailing list <linux-kernel@vger.kernel.org>
>> +Description: Cache id
>> +
>> +             The id identifies a hardware cache of the system within a given
>> +             cache index in a set of cache indices. The "index" name is
>> +             simply a nomenclature from CPUID's leaf 4 which enumerates all
>> +             caches on the system by referring to each one as a cache index.
>> +             The (cache index, cache id) pair is unique for the whole
>> +             system.
>> +
>> +             Currently id is implemented on x86. On other platforms, id is
>> +             not enabled yet.
>
> And it never will be available on anything else than x86 because there is
> no other architecture providing CPUID leaf 4 ....
>
> If you want this to be generic then get rid of the x86isms in the
> explanation and describe the x86 specific part seperately.
>

I second tglx here.  Also, I think this patch should be renumbered to
be patch 01/33.  The current 01/33 patch should be 02/33.  That way we
get to read about the definition of cache id first and its use comes
later.

--
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09  7:22         ` Fenghua Yu
@ 2016-09-09 17:50           ` Shaohua Li
  2016-09-09 18:03             ` Luck, Tony
  0 siblings, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-09 17:50 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Ravi V Shankar,
	Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Fri, Sep 09, 2016 at 12:22:46AM -0700, Fenghua Yu wrote:
> On Thu, Sep 08, 2016 at 03:45:14PM -0700, Shaohua Li wrote:
> > On Thu, Sep 08, 2016 at 06:17:47PM -0700, Fenghua Yu wrote:
> > > On Thu, Sep 08, 2016 at 03:01:20PM -0700, Shaohua Li wrote:
> > > > On Thu, Sep 08, 2016 at 02:57:00AM -0700, Fenghua Yu wrote:
> > > > > From: Fenghua Yu <fenghua.yu@intel.com>
> > > > > 
> > > > > The documentation describes user interface of how to allocate resource
> > > > > in Intel RDT.
> > > > > 
> > > > > Please note that the documentation covers generic user interface. Current
> > > > > patch set code only implemente CAT L3. CAT L2 code will be sent later.
> > > > > 
> > > > > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > > > > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > > > > ---
> > > > >  Documentation/x86/intel_rdt_ui.txt | 164 +++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 164 insertions(+)
> > > > >  create mode 100644 Documentation/x86/intel_rdt_ui.txt
> > > > > 
> > > > > diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> > > > > new file mode 100644
> > > > > index 0000000..27de386
> > > > > --- /dev/null
> > > > > +++ b/Documentation/x86/intel_rdt_ui.txt
> > > > > @@ -0,0 +1,164 @@
> > > > > +User Interface for Resource Allocation in Intel Resource Director Technology
> > > > > +
> > > > > +Copyright (C) 2016 Intel Corporation
> > > > > +
> > > > > +Fenghua Yu <fenghua.yu@intel.com>
> > > > > +Tony Luck <tony.luck@intel.com>
> > > > > +
> > > > > +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
> > > > > +X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
> > > > > +
> > > > > +To use the feature mount the file system:
> > > > > +
> > > > > + # mount -t resctrl resctrl [-o cdp,verbose] /sys/fs/resctrl
> > > > > +
> > > > > +mount options are:
> > > > > +
> > > > > +"cdp": Enable code/data prioritization in L3 cache allocations.
> > > > > +
> > > > > +"verbose": Output more info in the "info" file under info directory
> > > > > +	and in dmesg. This is mainly for debug.
> > > > > +
> > > > > +
> > > > > +Resource groups
> > > > > +---------------
> > > > > +Resource groups are represented as directories in the resctrl file
> > > > > +system. The default group is the root directory. Other groups may be
> > > > > +created as desired by the system administrator using the "mkdir(1)"
> > > > > +command, and removed using "rmdir(1)".
> > > > > +
> > > > > +There are three files associated with each group:
> > > > > +
> > > > > +"tasks": A list of tasks that belongs to this group. Tasks can be
> > > > > +	added to a group by writing the task ID to the "tasks" file
> > > > > +	(which will automatically remove them from the previous
> > > > > +	group to which they belonged). New tasks created by fork(2)
> > > > > +	and clone(2) are added to the same group as their parent.
> > > > > +	If a pid is not in any sub partition, it is in root partition
> > > > > +	(i.e. default partition).
> > > > Hi Fenghua,
> > > > 
> > > > Will you add a 'procs' interface to allow move a process into a group? Using
> > > > the 'tasks' interface to move process is inconvenient and has race conditions
> > > > (eg, some new threads could be escaped).
> > > 
> > > We don't plan to add a 'procs' interface for rdtgroup. We only use resctrl
> > > interface to allocate resources.
> > > 
> > > Why the "tasks" is inconvenient? If sysadmin wants to allocte a portion of L3
> > > for a pid, the operation in resctl is to write the pid to a "tasks". While
> > > in 'procs', the operation is to write a partition to a pid. If considering
> > > convenience, they are same, right?
> > > 
> > > A thread uses either default partition (in root dir) or a sub partition (in
> > > sub-directory). Sysadmin can control that. Kernel handles race condition.
> > > Any issue with that?
> > 
> > I don't mean writing the 'tasks' file is inconvenient. So to move a process to
> > a group, we do:
> > 1. get all thread pid of the process
> > 2. write every pid to 'tasks'
> > 
> > this is inconvenient. And if a new thread is created between 1 and 2, we don't
> > put the thread to the group. Am I missing anything?
> 
> As said in this doc, "New tasks created by fork(2) and clone(2) are added
> to the same group as their parent.". So the new thread created b/w 1 and 2
> will automatically go to the "tasks" as the process. Later sysadming can
> still move any pid to any group.

So if we want to move a process from group1 to group2, we do:
1. find all threads pid of the process
2. write each thread pid to group2's 'tasks'

I don't think this is convenient, but it's ok. Now if we create a new thread
between 1 and 2, the new thread is in group1. The new thread pid isn't in the
pid list we found in 1, so after 2, the new thread still is in group 1. Truely
sysadmin can repeat the step 1 & 2 and move the new thread to group 2, but
there is always chance the process creates new thread between 1 and 2, and the
new thread remains in group 1. There is no guarantee we can safely move a
process from one group to another.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09 17:50           ` Shaohua Li
@ 2016-09-09 18:03             ` Luck, Tony
  2016-09-09 21:43               ` Shaohua Li
  0 siblings, 1 reply; 96+ messages in thread
From: Luck, Tony @ 2016-09-09 18:03 UTC (permalink / raw)
  To: Shaohua Li, Yu, Fenghua
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shankar, Ravi V, Vikas Shivappa,
	Prakhya, Sai Praneeth, linux-kernel, x86

> I don't think this is convenient, but it's ok. Now if we create a new thread
> between 1 and 2, the new thread is in group1. The new thread pid isn't in the
> pid list we found in 1, so after 2, the new thread still is in group 1. Truely
> sysadmin can repeat the step 1 & 2 and move the new thread to group 2, but
> there is always chance the process creates new thread between 1 and 2, and the
> new thread remains in group 1. There is no guarantee we can safely move a
> process from one group to another.

In general this is true.  But don't most threaded applications have a single thread that
is the one that spawns new threads? Typically the first thread.  Once that is moved,
any new threads will inherit the new group. So there won't be a neverending mopping
operation trying to catch up.

Even this seems outside the expected usage model for CAT where we expect the
system admin to partition the cache between resource groups at boot time and
then assign jobs (or containers, or VMs) to resource groups when they are created.

-Tony

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09 18:03             ` Luck, Tony
@ 2016-09-09 21:43               ` Shaohua Li
  2016-09-09 21:59                 ` Luck, Tony
  0 siblings, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-09 21:43 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Thomas Gleixner, Anvin, H Peter, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

On Fri, Sep 09, 2016 at 06:03:12PM +0000, Luck, Tony wrote:
> > I don't think this is convenient, but it's ok. Now if we create a new thread
> > between 1 and 2, the new thread is in group1. The new thread pid isn't in the
> > pid list we found in 1, so after 2, the new thread still is in group 1. Truely
> > sysadmin can repeat the step 1 & 2 and move the new thread to group 2, but
> > there is always chance the process creates new thread between 1 and 2, and the
> > new thread remains in group 1. There is no guarantee we can safely move a
> > process from one group to another.
> 
> In general this is true.  But don't most threaded applications have a single thread that
> is the one that spawns new threads? Typically the first thread.  Once that is moved,
> any new threads will inherit the new group. So there won't be a neverending mopping
> operation trying to catch up.
> 
> Even this seems outside the expected usage model for CAT where we expect the
> system admin to partition the cache between resource groups at boot time and
> then assign jobs (or containers, or VMs) to resource groups when they are created.

Hmm, I don't know how applications are going to use the interface. Nobody knows
it right now. But we do have some candicate workloads which want to configure
the cache partition at runtime, so it's not just a boot time stuff. I'm
wondering why we have such limitation. The framework is there, it's quite easy
to implement process move in kernel but fairly hard to get it right in
userspace.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09 21:43               ` Shaohua Li
@ 2016-09-09 21:59                 ` Luck, Tony
  2016-09-10  0:36                   ` Yu, Fenghua
  0 siblings, 1 reply; 96+ messages in thread
From: Luck, Tony @ 2016-09-09 21:59 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Yu, Fenghua, Thomas Gleixner, Anvin, H Peter, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> Hmm, I don't know how applications are going to use the interface. Nobody knows
> it right now. But we do have some candicate workloads which want to configure
> the cache partition at runtime, so it's not just a boot time stuff. I'm
> wondering why we have such limitation. The framework is there, it's quite easy
> to implement process move in kernel but fairly hard to get it right in
> userspace.

You are correct - if there is a need for this, it would be better done in the kernel.

I'm just not sure how to explain both a "procs" and "tasks" interface file in a way
that won't confuse people.

We have:

# echo {task-id} > tasks
  .... adds a single task to this resource group
# cat tasks
  ... shows all the tasks in this resource group

and you want:

# echo {process-id} > procs
   ... adds all threads in {process-id} to this resource group
# cat procs
  ... shows all processes (like "cat tasks" above, but only shows main thread in a multi-threads process)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-09 21:59                 ` Luck, Tony
@ 2016-09-10  0:36                   ` Yu, Fenghua
  2016-09-12  4:15                     ` Shaohua Li
  0 siblings, 1 reply; 96+ messages in thread
From: Yu, Fenghua @ 2016-09-10  0:36 UTC (permalink / raw)
  To: Luck, Tony, Shaohua Li
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shankar, Ravi V, Vikas Shivappa,
	Prakhya, Sai Praneeth, linux-kernel, x86

> > Hmm, I don't know how applications are going to use the interface.
> > Nobody knows it right now. But we do have some candicate workloads
> > which want to configure the cache partition at runtime, so it's not
> > just a boot time stuff. I'm wondering why we have such limitation. The
> > framework is there, it's quite easy to implement process move in
> > kernel but fairly hard to get it right in userspace.
> 
> You are correct - if there is a need for this, it would be better done in the
> kernel.
> 
> I'm just not sure how to explain both a "procs" and "tasks" interface file in a
> way that won't confuse people.
> 
> We have:
> 
> # echo {task-id} > tasks
>   .... adds a single task to this resource group # cat tasks
>   ... shows all the tasks in this resource group
> 
> and you want:
> 
> # echo {process-id} > procs
>    ... adds all threads in {process-id} to this resource group # cat procs
>   ... shows all processes (like "cat tasks" above, but only shows main thread in
> a multi-threads process)

The advantage of "tasks" is user can allocate each thread into its own partition.
The advantage of "procs" is convenience for user to just allocate thread group
lead pid and rest of the thread group members go with the lead.

If no "procs" is really inconvenience, we may support "procs" in future.

One way to implement this is we can extend the current interface to accept
a resctrl file system mount parameter to switch b/w "procs" and "tasks" during
mount time. So the file sytem has either "procs" or "tasks" during run time. I don't think it's right to have both of them at the same time in the file system.

Is this the right way to go?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-10  0:36                   ` Yu, Fenghua
@ 2016-09-12  4:15                     ` Shaohua Li
  2016-09-13 13:27                       ` Thomas Gleixner
  0 siblings, 1 reply; 96+ messages in thread
From: Shaohua Li @ 2016-09-12  4:15 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Luck, Tony, Thomas Gleixner, Anvin, H Peter, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

On Sat, Sep 10, 2016 at 12:36:57AM +0000, Yu, Fenghua wrote:
> > > Hmm, I don't know how applications are going to use the interface.
> > > Nobody knows it right now. But we do have some candicate workloads
> > > which want to configure the cache partition at runtime, so it's not
> > > just a boot time stuff. I'm wondering why we have such limitation. The
> > > framework is there, it's quite easy to implement process move in
> > > kernel but fairly hard to get it right in userspace.
> > 
> > You are correct - if there is a need for this, it would be better done in the
> > kernel.
> > 
> > I'm just not sure how to explain both a "procs" and "tasks" interface file in a
> > way that won't confuse people.
> > 
> > We have:
> > 
> > # echo {task-id} > tasks
> >   .... adds a single task to this resource group # cat tasks
> >   ... shows all the tasks in this resource group
> > 
> > and you want:
> > 
> > # echo {process-id} > procs
> >    ... adds all threads in {process-id} to this resource group # cat procs
> >   ... shows all processes (like "cat tasks" above, but only shows main thread in
> > a multi-threads process)
> 
> The advantage of "tasks" is user can allocate each thread into its own partition.
> The advantage of "procs" is convenience for user to just allocate thread group
> lead pid and rest of the thread group members go with the lead.
> 
> If no "procs" is really inconvenience, we may support "procs" in future.
> 
> One way to implement this is we can extend the current interface to accept
> a resctrl file system mount parameter to switch b/w "procs" and "tasks" during
> mount time. So the file sytem has either "procs" or "tasks" during run time. I don't think it's right to have both of them at the same time in the file system.

A mount option doesn't make sense, which just creates more trouble. What's
wrong to have both of 'procs' and 'tasks' at the same time, like cgroup? I
think it's more natural to support both. As for the content of 'procs' and
'tasks', we could follow how cgroup handle them.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 08/33] x86/intel_rdt: Add Class of service management
  2016-09-08  9:57 ` [PATCH v2 08/33] x86/intel_rdt: Add Class of service management Fenghua Yu
  2016-09-08  8:56   ` Thomas Gleixner
@ 2016-09-12 16:02   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Nilay Vaish @ 2016-09-12 16:02 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 04:57, Fenghua Yu <fenghua.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index fcd0642..b25940a 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -21,17 +21,94 @@
>   */
>  #include <linux/slab.h>
>  #include <linux/err.h>
> +#include <asm/intel_rdt.h>
> +
> +/*
> + * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
> + */
> +static struct clos_cbm_table *cctable;
> +/*
> + * closid availability bit map.
> + */
> +unsigned long *closmap;
> +static DEFINE_MUTEX(rdtgroup_mutex);
> +
> +static inline void closid_get(u32 closid)
> +{
> +       struct clos_cbm_table *cct = &cctable[closid];
> +
> +       lockdep_assert_held(&rdtgroup_mutex);
> +
> +       cct->clos_refcnt++;
> +}
> +
> +static int closid_alloc(u32 *closid)
> +{
> +       u32 maxid;
> +       u32 id;
> +
> +       lockdep_assert_held(&rdtgroup_mutex);
> +
> +       maxid = boot_cpu_data.x86_cache_max_closid;
> +       id = find_first_zero_bit(closmap, maxid);
> +       if (id == maxid)
> +               return -ENOSPC;
> +
> +       set_bit(id, closmap);
> +       closid_get(id);
> +       *closid = id;
> +
> +       return 0;
> +}
> +
> +static inline void closid_free(u32 closid)
> +{
> +       clear_bit(closid, closmap);
> +       cctable[closid].cbm = 0;
> +}
> +
> +static void closid_put(u32 closid)
> +{
> +       struct clos_cbm_table *cct = &cctable[closid];
> +
> +       lockdep_assert_held(&rdtgroup_mutex);
> +       if (WARN_ON(!cct->clos_refcnt))
> +               return;
> +
> +       if (!--cct->clos_refcnt)
> +               closid_free(closid);
> +}
>

I would suggest a small change in the function names here.  I think
the names closid_free and closid_put should be swapped. As I
understand, under the current naming scheme, the opposite of
closid_alloc() is closid_put() and the opposite of closid_get() is
closid_free().  This is not the normal way these names are paired.
Typically, alloc and free go as a pair and get and put go as a pair.

--
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management
  2016-09-08  9:57 ` [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
  2016-09-08  9:40   ` Thomas Gleixner
@ 2016-09-12 16:10   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Nilay Vaish @ 2016-09-12 16:10 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 04:57, Fenghua Yu <fenghua.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index b25940a..9cf3a7d 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -31,8 +31,22 @@ static struct clos_cbm_table *cctable;
>   * closid availability bit map.
>   */
>  unsigned long *closmap;
> +/*
> + * Mask of CPUs for writing CBM values. We only need one CPU per-socket.

Does the second line make sense here?

> + */
> +static cpumask_t rdt_cpumask;
> +/*
> + * Temporary cpumask used during hot cpu notificaiton handling. The usage
> + * is serialized by hot cpu locks.

s/notificaiton/notification


--
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface
  2016-09-12  4:15                     ` Shaohua Li
@ 2016-09-13 13:27                       ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2016-09-13 13:27 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Yu, Fenghua, Luck, Tony, Anvin, H Peter, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

On Sun, 11 Sep 2016, Shaohua Li wrote:
> On Sat, Sep 10, 2016 at 12:36:57AM +0000, Yu, Fenghua wrote:
> > One way to implement this is we can extend the current interface to accept
> > a resctrl file system mount parameter to switch b/w "procs" and "tasks" during
> > mount time. So the file sytem has either "procs" or "tasks" during run time. I don't think it's right to have both of them at the same time in the file system.
> 
> A mount option doesn't make sense, which just creates more trouble. What's
> wrong to have both of 'procs' and 'tasks' at the same time, like cgroup? I
> think it's more natural to support both. As for the content of 'procs' and
> 'tasks', we could follow how cgroup handle them.

Right. There is nothing wrong with having both, but the very first step is
to get the basic infrastructure merged. Adding 'procs' is a straight
forward add on which can be implemented on top of the primary patch
set. There is no design change required to support it later.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT
  2016-09-08  9:57 ` [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
  2016-09-08  9:53   ` Thomas Gleixner
@ 2016-09-13 17:55   ` Nilay Vaish
  1 sibling, 0 replies; 96+ messages in thread
From: Nilay Vaish @ 2016-09-13 17:55 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 04:57, Fenghua Yu <fenghua.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 9cf3a7d..9f30492 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -21,6 +21,8 @@
>   */
>  #include <linux/slab.h>
>  #include <linux/err.h>
> +#include <linux/sched.h>
> +#include <asm/pqr_common.h>
>  #include <asm/intel_rdt.h>
>
>  /*
> @@ -41,12 +43,28 @@ static cpumask_t rdt_cpumask;
>   */
>  static cpumask_t tmp_cpumask;
>  static DEFINE_MUTEX(rdtgroup_mutex);
> +struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;

I had pointed this for the previous version of the patch as well.  You
should use DEFINE_STATIC_KEY_FALSE(rdt_enable_key), instead of what
you have here.

--
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation
  2016-09-08  9:57 ` [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
  2016-09-08 10:03   ` Thomas Gleixner
@ 2016-09-13 18:18   ` Nilay Vaish
  2016-09-13 19:10     ` Luck, Tony
  1 sibling, 1 reply; 96+ messages in thread
From: Nilay Vaish @ 2016-09-13 18:18 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Tony Luck,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 8 September 2016 at 04:57, Fenghua Yu <fenghua.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 9f30492..4537658 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -141,6 +145,80 @@ static inline bool rdt_cpumask_update(int cpu)
>         return false;
>  }
>
> +/*
> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> + * which are one per CLOSid on the current package.
> + */
> +static void cbm_update_msrs(void *dummy)
> +{
> +       int maxid = boot_cpu_data.x86_cache_max_closid;
> +       struct rdt_remote_data info;
> +       unsigned int i;
> +
> +       for (i = 0; i < maxid; i++) {
> +               if (cctable[i].clos_refcnt) {
> +                       info.msr = CBM_FROM_INDEX(i);
> +                       info.val = cctable[i].cbm;
> +                       msr_cpu_update(&info);
> +               }
> +       }
> +}
> +
> +static int intel_rdt_online_cpu(unsigned int cpu)
> +{
> +       struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> +
> +       state->closid = 0;
> +       mutex_lock(&rdtgroup_mutex);
> +       /* The cpu is set in root rdtgroup after online. */
> +       cpumask_set_cpu(cpu, &root_rdtgrp->cpu_mask);
> +       per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
> +       /*
> +        * If the cpu is first time found and set in its siblings that
> +        * share the same cache, update the CBM MSRs for the cache.
> +        */

I am finding it slightly hard to parse the comment above.  Does the
following sound better:  If the cpu is the first one found and set
amongst its siblings that ...

> +       if (rdt_cpumask_update(cpu))
> +               smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
> +       mutex_unlock(&rdtgroup_mutex);
> +}
> +
> +static int clear_rdtgroup_cpumask(unsigned int cpu)
> +{
> +       struct list_head *l;
> +       struct rdtgroup *r;
> +
> +       list_for_each(l, &rdtgroup_lists) {
> +               r = list_entry(l, struct rdtgroup, rdtgroup_list);
> +               if (cpumask_test_cpu(cpu, &r->cpu_mask)) {
> +                       cpumask_clear_cpu(cpu, &r->cpu_mask);
> +                       return 0;
> +               }
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +static int intel_rdt_offline_cpu(unsigned int cpu)
> +{
> +       int i;
> +
> +       mutex_lock(&rdtgroup_mutex);
> +       if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> +               mutex_unlock(&rdtgroup_mutex);
> +               return;
> +       }
> +
> +       cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), cpu_online_mask);
> +       cpumask_clear_cpu(cpu, &tmp_cpumask);
> +       i = cpumask_any(&tmp_cpumask);
> +
> +       if (i < nr_cpu_ids)
> +               cpumask_set_cpu(i, &rdt_cpumask);
> +
> +       clear_rdtgroup_cpumask(cpu);
> +       mutex_unlock(&rdtgroup_mutex);
> +}
> +

Just for my info, why do we need not update MSRs when a cpu goes offline?



Thanks
Nilay

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation
  2016-09-13 18:18   ` Nilay Vaish
@ 2016-09-13 19:10     ` Luck, Tony
  0 siblings, 0 replies; 96+ messages in thread
From: Luck, Tony @ 2016-09-13 19:10 UTC (permalink / raw)
  To: Nilay Vaish, Yu, Fenghua
  Cc: Thomas Gleixner, Anvin, H Peter, Ingo Molnar, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> Just for my info, why do we need not update MSRs when a cpu goes offline?

The CBM (cache bitmask) MSRs are shared by all the cpus that use that same cache. So
they mustn't be updated when some of the CPUs go offline, because the remaining ones
are still using them. I suppose you could do something if the last CPU using a cache goes
offline ... but I don't see that it would achieve anything. Offline CPUs aren't executing any
instructions, so don't do any cache allocations.

-Tony

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
  2016-09-08 11:50   ` Borislav Petkov
  2016-09-08 13:17   ` Thomas Gleixner
@ 2016-09-13 22:40   ` Dave Hansen
  2016-09-13 22:52     ` Luck, Tony
  2 siblings, 1 reply; 96+ messages in thread
From: Dave Hansen @ 2016-09-13 22:40 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Shaohua Li, Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86

On 09/08/2016 02:57 AM, Fenghua Yu wrote:
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -57,6 +57,7 @@
>  #define DISABLED_MASK15	0
>  #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
>  #define DISABLED_MASK17	0
> -#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
> +#define DISABLED_MASK18	0
> +#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)

Are you sure you don't want to add RDT to disabled-features.h?  You have
a config option for it, so it seems like you should also be able to
optimize some of these checks away when the config option is off.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-13 22:40   ` Dave Hansen
@ 2016-09-13 22:52     ` Luck, Tony
  2016-09-13 23:00       ` Dave Hansen
  0 siblings, 1 reply; 96+ messages in thread
From: Luck, Tony @ 2016-09-13 22:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Tue, Sep 13, 2016 at 03:40:18PM -0700, Dave Hansen wrote:
> Are you sure you don't want to add RDT to disabled-features.h?  You have
> a config option for it, so it seems like you should also be able to
> optimize some of these checks away when the config option is off.

Makefile looks like this:

obj-$(CONFIG_INTEL_RDT)        += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o

which seems to skip compiling all our code when the CONFIG
option is off.

Our hooks to generic code look like:

+#ifdef CONFIG_INTEL_RDT
+extern void rdtgroup_fork(struct task_struct *child);
+extern void rdtgroup_exit(struct task_struct *tsk);
+#else
+static inline void rdtgroup_fork(struct task_struct *child) {}
+static inline void rdtgroup_exit(struct task_struct *tsk) {}
+#endif /* CONFIG_INTEL_RDT */


Does this disabled-features.h thing do something more?

-Tony

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources
  2016-09-08  9:57 ` [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources Fenghua Yu
  2016-09-08 14:57   ` Thomas Gleixner
@ 2016-09-13 22:54   ` Dave Hansen
  1 sibling, 0 replies; 96+ messages in thread
From: Dave Hansen @ 2016-09-13 22:54 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Shaohua Li, Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86

On 09/08/2016 02:57 AM, Fenghua Yu wrote:
> +static int __init rdt_setup(char *str)
> +{
> +	char *tok;
> +
> +	while ((tok = strsep(&str, ",")) != NULL) {
> +		if (!*tok)
> +			return -EINVAL;
> +
> +		if (strcmp(tok, "simulate_cat_l3") == 0) {
> +			pr_info("Simulate CAT L3\n");
> +			rdt_opts.simulate_cat_l3 = true;
> +		} else if (strcmp(tok, "disable_cat_l3") == 0) {
> +			pr_info("CAT L3 is disabled\n");
> +			disable_cat_l3 = true;
> +		} else {
> +			pr_info("Invalid rdt option\n");
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +__setup("resctrl=", rdt_setup);

So, this allows you to specify both simulation and disabling at the same
time, and in the same option?  That seems a bit screwy, plus it requires
some parsing which is quite prone to being broken.  How about just
having two setup options:

__setup("resctrl=simulate_cat_l3", rdt_setup...);
__setup("resctrl=disable_cat_l3", rdt_setup...);

And allow folks to specify "resctrl" more than once instead of requiring
the comma-separated arguments?  Then you don't have to do any parsing at
all and your __setup() handlers become one-liners.

Is "resctrl" really the best name for this sucker?  Wouldn't
"intel-rdt=" or something be nicer?

Also, a lot of __setup() functions actually clear cpuid bits.  Should
this be clearing X86_FEATURE_CAT_L3 instead of keeping a boolean around
that effectively overrides it?

> +static inline bool cat_l3_supported(struct cpuinfo_x86 *c)
> +{
> +	if (cpu_has(c, X86_FEATURE_CAT_L3))
> +		return true;
> +
> +	/*
> +	 * Probe for Haswell server CPUs.
> +	 */
> +	if (c->x86 == 0x6 && c->x86_model == 0x3f)
> +		return cache_alloc_hsw_probe();
> +
> +	return false;
> +}

#include <asm/intel-family.h> and s/0x3f/INTEL_FAM6_HASWELL_X/, please.
 Then your comment even go away.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection
  2016-09-13 22:52     ` Luck, Tony
@ 2016-09-13 23:00       ` Dave Hansen
  0 siblings, 0 replies; 96+ messages in thread
From: Dave Hansen @ 2016-09-13 23:00 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 09/13/2016 03:52 PM, Luck, Tony wrote:
> On Tue, Sep 13, 2016 at 03:40:18PM -0700, Dave Hansen wrote:
>> Are you sure you don't want to add RDT to disabled-features.h?  You have
>> a config option for it, so it seems like you should also be able to
>> optimize some of these checks away when the config option is off.
> 
> Makefile looks like this:
> 
> obj-$(CONFIG_INTEL_RDT)        += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
> 
> which seems to skip compiling all our code when the CONFIG
> option is off.
> 
> Our hooks to generic code look like:
> 
> +#ifdef CONFIG_INTEL_RDT
> +extern void rdtgroup_fork(struct task_struct *child);
> +extern void rdtgroup_exit(struct task_struct *tsk);
> +#else
> +static inline void rdtgroup_fork(struct task_struct *child) {}
> +static inline void rdtgroup_exit(struct task_struct *tsk) {}
> +#endif /* CONFIG_INTEL_RDT */
> 
> Does this disabled-features.h thing do something more?

If you have cpuid checks in common code, disabled-features.h can compile
them out if the config options are turned off.  For instance:

	if (cpu_feature_enabled(X86_FEATURE_PKU))
		foo();

is equivalent to:

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
	if (boot_cpu_has(X86_FEATURE_PKU))
		foo();
#endif

But, if all the cpu_has(c, X86_FEATURE_CAT_L3) checks are confined to
files only compiled under CONFIG_INTEL_RDT then it won't do you much
good.  But, it's pretty simple to add things, and would help you out if
checks spread beyond intel_rdt*.c.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 26/33] Task fork and exit for rdtgroup
  2016-09-08  9:57 ` [PATCH v2 26/33] Task fork and exit for rdtgroup Fenghua Yu
  2016-09-08 19:41   ` Thomas Gleixner
@ 2016-09-13 23:13   ` Dave Hansen
  2016-09-13 23:35     ` Luck, Tony
  1 sibling, 1 reply; 96+ messages in thread
From: Dave Hansen @ 2016-09-13 23:13 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Tony Luck, Peter Zijlstra, Tejun Heo, Borislav Petkov,
	Stephane Eranian, Marcelo Tosatti, David Carrillo-Cisneros,
	Shaohua Li, Ravi V Shankar, Vikas Shivappa, Sai Prakhya
  Cc: linux-kernel, x86

On 09/08/2016 02:57 AM, Fenghua Yu wrote:
> +void rdtgroup_fork(struct task_struct *child)
> +{
> +	struct rdtgroup *rdtgrp;
> +
> +	INIT_LIST_HEAD(&child->rg_list);
> +	if (!rdtgroup_mounted)
> +		return;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +
> +	rdtgrp = current->rdtgroup;
> +	if (!rdtgrp)
> +		goto out;
> +
> +	list_add_tail(&child->rg_list, &rdtgrp->pset.tasks);
> +	child->rdtgroup = rdtgrp;
> +	atomic_inc(&rdtgrp->refcount);
> +
> +out:
> +	mutex_unlock(&rdtgroup_mutex);
> +}
...
> diff --git a/kernel/fork.c b/kernel/fork.c
> index beb3172..79bfc99 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -76,6 +76,7 @@
>  #include <linux/compiler.h>
>  #include <linux/sysctl.h>
>  #include <linux/kcov.h>
> +#include <linux/resctrl.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -1426,6 +1427,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	p->io_context = NULL;
>  	p->audit_context = NULL;
>  	cgroup_fork(p);
> +	rdtgroup_fork(p);
>  #ifdef CONFIG_NUMA
>  	p->mempolicy = mpol_dup(p->mempolicy);
>  	if (IS_ERR(p->mempolicy)) {

Yikes, is this a new global lock and possible atomic_inc() on a shared
variable in the fork() path?  Has there been any performance or
scalability testing done on this code?

That mutex could be a disaster for fork() once the filesystem is
mounted.  Even if it goes away, if you have a large number of processes
in an rdtgroup and they are forking a lot, you're bound to see the
rdtgrp->refcount get bounced around a lot.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 26/33] Task fork and exit for rdtgroup
  2016-09-13 23:13   ` Dave Hansen
@ 2016-09-13 23:35     ` Luck, Tony
  2016-09-14 14:28       ` Dave Hansen
  0 siblings, 1 reply; 96+ messages in thread
From: Luck, Tony @ 2016-09-13 23:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On Tue, Sep 13, 2016 at 04:13:04PM -0700, Dave Hansen wrote:
> Yikes, is this a new global lock and possible atomic_inc() on a shared
> variable in the fork() path?  Has there been any performance or
> scalability testing done on this code?
> 
> That mutex could be a disaster for fork() once the filesystem is
> mounted.  Even if it goes away, if you have a large number of processes
> in an rdtgroup and they are forking a lot, you're bound to see the
> rdtgrp->refcount get bounced around a lot.

The mutex is (almost certainly) going away.  The atomic_inc()
is likely staying (but only applies to tasks that are in 
resource groups other than the default one.  But on a system
where we partition the cache between containers/VMs, that may
essentially be all processes.

We only really use the refcount to decide whether the group
can be removed ... since that is the rare operation, perhaps
we could put all the work there and have it count them with:

	n = 0;
	rcu_read_lock();
	for_each_process(p)
		if (p->rdtgroup == this_rdtgroup)
			n++;
	rcu_read_unlock();
	if (n != 0)
		return -EBUSY;

then we might get the fork hook down to just:

void rdtgroup_fork(struct task_struct *child)
{
	child->rdtgroup = current->rdtgroup;
}

which looks a lot less scary :-)

-Tony

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 26/33] Task fork and exit for rdtgroup
  2016-09-13 23:35     ` Luck, Tony
@ 2016-09-14 14:28       ` Dave Hansen
  0 siblings, 0 replies; 96+ messages in thread
From: Dave Hansen @ 2016-09-14 14:28 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Borislav Petkov, Stephane Eranian,
	Marcelo Tosatti, David Carrillo-Cisneros, Shaohua Li,
	Ravi V Shankar, Vikas Shivappa, Sai Prakhya, linux-kernel, x86

On 09/13/2016 04:35 PM, Luck, Tony wrote:
> On Tue, Sep 13, 2016 at 04:13:04PM -0700, Dave Hansen wrote:
>> Yikes, is this a new global lock and possible atomic_inc() on a shared
>> variable in the fork() path?  Has there been any performance or
>> scalability testing done on this code?
>>
>> That mutex could be a disaster for fork() once the filesystem is
>> mounted.  Even if it goes away, if you have a large number of processes
>> in an rdtgroup and they are forking a lot, you're bound to see the
>> rdtgrp->refcount get bounced around a lot.
> 
> The mutex is (almost certainly) going away.

Oh, cool.  That's good to know.

> The atomic_inc()
> is likely staying (but only applies to tasks that are in 
> resource groups other than the default one.  But on a system
> where we partition the cache between containers/VMs, that may
> essentially be all processes.

Yeah, that's what worries me.  We had/have quite a few regressions from
when something runs inside vs. outside of certain cgroups.  We
definitely don't want to be adding more of those.

> We only really use the refcount to decide whether the group
> can be removed ... since that is the rare operation, perhaps
> we could put all the work there and have it count them with:
> 
> 	n = 0;
> 	rcu_read_lock();
> 	for_each_process(p)
> 		if (p->rdtgroup == this_rdtgroup)
> 			n++;
> 	rcu_read_unlock();
> 	if (n != 0)
> 		return -EBUSY;

Yeah, that seems sane.  I'm sure it can be optimized even more than
that, but that at least gets it out of the fast path.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT
  2016-09-08  9:53   ` Thomas Gleixner
@ 2016-09-15 21:36     ` Yu, Fenghua
  2016-09-15 21:40       ` Luck, Tony
  0 siblings, 1 reply; 96+ messages in thread
From: Yu, Fenghua @ 2016-09-15 21:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Anvin, H Peter, Ingo Molnar, Luck, Tony, Peter Zijlstra,
	Tejun Heo, Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Thursday, September 08, 2016 2:54 AM
> 
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
> > +extern struct static_key rdt_enable_key; void
> > +__intel_rdt_sched_in(void *dummy);
> > +
> >  struct clos_cbm_table {
> >  	unsigned long cbm;
> >  	unsigned int clos_refcnt;
> >  };
> >
> > +/*
> > + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> > + *
> > + * Following considerations are made so that this has minimal impact
> > + * on scheduler hot path:
> > + * - This will stay as no-op unless we are running on an Intel SKU
> > + * which supports L3 cache allocation.
> > + * - When support is present and enabled, does not do any
> > + * IA32_PQR_MSR writes until the user starts really using the feature
> > + * ie creates a rdtgroup directory and assigns a cache_mask thats
> > + * different from the root rdtgroup's cache_mask.
> > + * - Caches the per cpu CLOSid values and does the MSR write only
> > + * when a task with a different CLOSid is scheduled in. That
> > + * means the task belongs to a different rdtgroup.
> > + * - Closids are allocated so that different rdtgroup directories
> > + * with same cache_mask gets the same CLOSid. This minimizes CLOSids
> > + * used and reduces MSR write frequency.
> > + */
> > +static inline void intel_rdt_sched_in(void) {
> > +	/*
> > +	 * Call the schedule in code only when RDT is enabled.
> > +	 */
> > +	if (static_key_false(&rdt_enable_key))
> 
> static_branch_[un]likely() is the proper function to use.

How to do this? Should I change the line to

+ if (static_branch_unlikely(&rdt_enable_key))
?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT
  2016-09-15 21:36     ` Yu, Fenghua
@ 2016-09-15 21:40       ` Luck, Tony
  0 siblings, 0 replies; 96+ messages in thread
From: Luck, Tony @ 2016-09-15 21:40 UTC (permalink / raw)
  To: Yu, Fenghua, Thomas Gleixner
  Cc: Anvin, H Peter, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Borislav Petkov, Stephane Eranian, Marcelo Tosatti,
	David Carrillo-Cisneros, Shaohua Li, Shankar, Ravi V,
	Vikas Shivappa, Prakhya, Sai Praneeth, linux-kernel, x86

> How to do this? Should I change the line to
>
> + if (static_branch_unlikely(&rdt_enable_key))?

See Documentation/static-keys.txt, there are some examples.

also "git grep static_branch_unlikely" to see existing users

-Tony

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2016-09-15 21:40 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-08  9:56 [PATCH v2 00/33] Enable Intel Resource Allocation in Resource Director Technology Fenghua Yu
2016-09-08  9:56 ` [PATCH v2 01/33] cacheinfo: Introduce cache id Fenghua Yu
2016-09-09 15:01   ` Nilay Vaish
2016-09-08  9:56 ` [PATCH v2 02/33] Documentation, ABI: Add a document entry for " Fenghua Yu
2016-09-08 19:33   ` Thomas Gleixner
2016-09-09 15:11     ` Nilay Vaish
2016-09-08  9:56 ` [PATCH v2 03/33] x86, intel_cacheinfo: Enable cache id in x86 Fenghua Yu
2016-09-08  9:56 ` [PATCH v2 04/33] drivers/base/cacheinfo.c: Export some cacheinfo functions for others to use Fenghua Yu
2016-09-08  8:21   ` Thomas Gleixner
2016-09-08  9:56 ` [PATCH v2 05/33] x86/intel_rdt: Cache Allocation documentation Fenghua Yu
2016-09-08  9:57 ` [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface Fenghua Yu
2016-09-08 11:22   ` Borislav Petkov
2016-09-08 22:01   ` Shaohua Li
2016-09-09  1:17     ` Fenghua Yu
2016-09-08 22:45       ` Shaohua Li
2016-09-09  7:22         ` Fenghua Yu
2016-09-09 17:50           ` Shaohua Li
2016-09-09 18:03             ` Luck, Tony
2016-09-09 21:43               ` Shaohua Li
2016-09-09 21:59                 ` Luck, Tony
2016-09-10  0:36                   ` Yu, Fenghua
2016-09-12  4:15                     ` Shaohua Li
2016-09-13 13:27                       ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection Fenghua Yu
2016-09-08 11:50   ` Borislav Petkov
2016-09-08 16:53     ` Yu, Fenghua
2016-09-08 17:17       ` Borislav Petkov
2016-09-08 13:17   ` Thomas Gleixner
2016-09-08 13:59     ` Yu, Fenghua
2016-09-13 22:40   ` Dave Hansen
2016-09-13 22:52     ` Luck, Tony
2016-09-13 23:00       ` Dave Hansen
2016-09-08  9:57 ` [PATCH v2 08/33] x86/intel_rdt: Add Class of service management Fenghua Yu
2016-09-08  8:56   ` Thomas Gleixner
2016-09-12 16:02   ` Nilay Vaish
2016-09-08  9:57 ` [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management Fenghua Yu
2016-09-08  9:40   ` Thomas Gleixner
2016-09-12 16:10   ` Nilay Vaish
2016-09-08  9:57 ` [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT Fenghua Yu
2016-09-08  9:53   ` Thomas Gleixner
2016-09-15 21:36     ` Yu, Fenghua
2016-09-15 21:40       ` Luck, Tony
2016-09-13 17:55   ` Nilay Vaish
2016-09-08  9:57 ` [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation Fenghua Yu
2016-09-08 10:03   ` Thomas Gleixner
2016-09-13 18:18   ` Nilay Vaish
2016-09-13 19:10     ` Luck, Tony
2016-09-08  9:57 ` [PATCH v2 12/33] x86/intel_rdt: Intel haswell Cache Allocation enumeration Fenghua Yu
2016-09-08 10:08   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 13/33] Define CONFIG_INTEL_RDT Fenghua Yu
2016-09-08 10:14   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 14/33] x86/intel_rdt: Intel Code Data Prioritization detection Fenghua Yu
2016-09-08  9:57 ` [PATCH v2 15/33] x86/intel_rdt: Adds support to enable Code Data Prioritization Fenghua Yu
2016-09-08 10:18   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 16/33] x86/intel_rdt: Class of service and capacity bitmask management for CDP Fenghua Yu
2016-09-08 10:29   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 17/33] x86/intel_rdt: Hot cpu update for code data prioritization Fenghua Yu
2016-09-08 10:34   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 18/33] sched.h: Add rg_list and rdtgroup in task_struct Fenghua Yu
2016-09-08 10:36   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 19/33] magic number for resctrl file system Fenghua Yu
2016-09-08 10:41   ` Thomas Gleixner
2016-09-08 10:47     ` Borislav Petkov
2016-09-08  9:57 ` [PATCH v2 20/33] x86/intel_rdt.h: Header for inter_rdt.c Fenghua Yu
2016-09-08 12:36   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 21/33] x86/intel_rdt_rdtgroup.h: Header for user interface Fenghua Yu
2016-09-08 12:44   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 22/33] x86/intel_rdt.c: Extend RDT to per cache and per resources Fenghua Yu
2016-09-08 14:57   ` Thomas Gleixner
2016-09-13 22:54   ` Dave Hansen
2016-09-08  9:57 ` [PATCH v2 23/33] x86/intel_rdt_rdtgroup.c: User interface for RDT Fenghua Yu
2016-09-08 14:59   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 24/33] x86/intel_rdt_rdtgroup.c: Create info directory Fenghua Yu
2016-09-08 16:04   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 25/33] include/linux/resctrl.h: Define fork and exit functions in a new header file Fenghua Yu
2016-09-08 16:07   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 26/33] Task fork and exit for rdtgroup Fenghua Yu
2016-09-08 19:41   ` Thomas Gleixner
2016-09-13 23:13   ` Dave Hansen
2016-09-13 23:35     ` Luck, Tony
2016-09-14 14:28       ` Dave Hansen
2016-09-08  9:57 ` [PATCH v2 27/33] x86/intel_rdt_rdtgroup.c: Implement resctrl file system commands Fenghua Yu
2016-09-08 20:09   ` Thomas Gleixner
2016-09-08 22:04   ` Shaohua Li
2016-09-09  1:23     ` Fenghua Yu
2016-09-08  9:57 ` [PATCH v2 28/33] x86/intel_rdt_rdtgroup.c: Read and write cpus Fenghua Yu
2016-09-08 20:25   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 29/33] x86/intel_rdt_rdtgroup.c: Tasks iterator and write Fenghua Yu
2016-09-08 20:50   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 30/33] x86/intel_rdt_rdtgroup.c: Process schemata input from resctrl interface Fenghua Yu
2016-09-08 22:20   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 31/33] Documentation/kernel-parameters: Add kernel parameter "resctrl" for CAT Fenghua Yu
2016-09-08 22:25   ` Thomas Gleixner
2016-09-08  9:57 ` [PATCH v2 32/33] MAINTAINERS: Add maintainer for Intel RDT resource allocation Fenghua Yu
2016-09-08  9:57 ` [PATCH v2 33/33] x86/Makefile: Build intel_rdt_rdtgroup.c Fenghua Yu
2016-09-08 23:59   ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).